Lecture Notes in Computer Science 4404

Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
Simeon J. Simoff, Michael H. Böhlen, Arturas Mazeika (Eds.)
Visual Data Mining
Theory, Techniques and Tools for Visual Analytics
Volume Editors

Simeon J. Simoff
University of Western Sydney
School of Computing and Mathematics
NSW 1797, Australia
E-mail: [email protected]

Michael H. Böhlen, Arturas Mazeika
Free University of Bozen-Bolzano
Faculty of Computer Science
Dominikanerplatz 3, 39100 Bozen-Bolzano, Italy
E-mail: {boehlen,arturas}@inf.unibz.it
Library of Congress Control Number: 2008931578
CR Subject Classification (1998): H.2.8, I.3, H.5
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN: 0302-9743
ISBN-10: 3-540-71079-5 Springer Berlin Heidelberg New York
ISBN-13: 978-3-540-71079-0 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media (springer.com)

© Springer-Verlag Berlin Heidelberg 2008
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12280612 06/3180 543210
Foreword

Visual Data Mining—Opening the Black Box
Knowledge discovery holds the promise of insight into large, otherwise opaque datasets. The nature of what makes a rule interesting to a user has been discussed widely¹, but most agree that it is a subjective quality based on the practical usefulness of the information. Being subjective, the user needs to provide feedback to the system and, as is the case for all systems, the sooner the feedback is given, the quicker it can influence the behavior of the system.

There have been some impressive research activities over the past few years, but the question to be asked is why visual data mining is only now being investigated commercially. Certainly, there have been arguments for visual data mining for a number of years. Ankerst and others² argued in 2002 that current (autonomous and opaque) analysis techniques are inefficient, as they fail to directly embed the user in dataset exploration, and that a better solution involves the user and algorithm being more tightly coupled. Grinstein stated that the "current state of the art data mining tools are automated, but the perfect data mining tool is interactive and highly participatory," while Han has suggested that the "data selection and viewing of mining results should be fully interactive, the mining process should be more interactive than the current state of the art and embedded applications should be fairly automated"². A good survey of techniques up to 2003 was published by de Oliveira and Levkowitz³. However, the deployment of visual data mining (VDM) techniques in commercial products remains low. There are, perhaps, four reasons for this.

First, VDM, as a strong sub-discipline of data mining, only really started around 2001. Certainly there was important research before then, but as an identifiable subcommunity of data mining, the area coalesced around 2001. Second, while things move fast in IT, VDM represents a shift in thinking away from a discipline that itself has yet to settle down commercially. Third, to fully contribute to VDM a researcher/systems developer must be proficient in both data mining and visualization. Since both of these are still developing themselves, the pool from which to find competent VDM researchers and developers is small. Finally, if the embedding is to be done properly, the overarching architecture of the knowledge discovery process must be changed. The iterative paradigm of mine and visualize must be replaced with the data mining equivalent of direct manipulation⁴. Embedding the user within the discovery process, for example by enabling the user to change the mining constraints, results in a finer-grained framework, as the interaction between user and system now occurs during analysis instead of between analysis runs. This overcomes the computer's inability to incorporate evolving knowledge regarding the problem domain and user objectives, not only facilitating the production of a higher quality model, but also reducing analysis time, for two reasons. First, the guidance reduces the search space at an earlier stage by discarding areas that are not of interest. Second, it reduces the number of iterations required. It also, through the Hawthorne Effect, improves the user's confidence in, and ownership of, the results that are produced⁵.

While so-called guided data mining methods have been produced for a number of data mining areas, including clustering⁶, association mining⁴,⁷, and classification⁸, there is an architectural aspect to guided data mining, and to VDM in general, that has not been adequately explored and which represents an area for future work. Another area of future work for the VDM community is quantification. Although the benefits that VDM can provide are clear to us, due to its subjective nature the benefits of this synergy are not easily quantified and thus may not be as obvious to others. VDM methods can be more time-consuming to develop, and thus for VDM to be accepted more widely we must find methods of showing that VDM demonstrates either (or both of) a time improvement or a quality improvement over non-visual methods.

This book has been long awaited. The VDM community has come a long way in a short time. Due to its ability to merge the cognitive ability and contextual awareness of humans with the increasing computational power of data mining systems, VDM is undoubtedly not just a future trend but destined to be one of the main themes of data mining for many years to come.

April 2008
John F. Roddick

¹ Geng, L., Hamilton, H.J.: Interestingness measures for data mining: A survey. ACM Computing Surveys 38 (2006)
² Ankerst, M.: The perfect data mining tool: Automated or interactive? In: Panel at ACM SIGKDD 2002, Edmonton, Canada. ACM, New York (2002)
³ de Oliveira, M.C.F., Levkowitz, H.: From visual data exploration to visual data mining: a survey. IEEE Transactions on Visualization and Computer Graphics 9, 378–394 (2003)
⁴ Ceglar, A., Roddick, J.F.: GAM - a guidance enabled association mining environment. International Journal of Business Intelligence and Data Mining 2, 3–28 (2007)
⁵ Ceglar, A.: Guided Association Mining through Dynamic Constraint Refinement. PhD thesis, Flinders University (2005)
⁶ Anderson, D., Anderson, E., Lesh, N., Marks, J., Perlin, K., Ratajczak, D., Ryall, K.: Human guided simple search: combining information visualization and heuristic search. In: Workshop on New Paradigms in Information Visualization and Manipulation, in conjunction with the 8th ACM International Conference on Information and Knowledge Management, Kansas City, MO, pp. 21–25. ACM Press, New York (2000)
⁷ Ng, R., Lakshmanan, L., Han, J., Pang, A.: Exploratory mining and pruning optimizations of constrained association rules. In: 17th ACM SIGACT-SIGMOD-SIGART Symposium on the Principles of Database Systems, Seattle, WA, pp. 13–24. ACM Press, New York (1998)
⁸ Ankerst, M., Ester, M., Kriegel, H.P.: Towards an effective cooperation of the user and the computer for classification. In: 6th International Conference on Knowledge Discovery and Data Mining (KDD 2000), Boston, MA, pp. 179–188 (2000)
Preface
John W. Tukey, who made unparalleled contributions to statistics and to science in general during his long career at Bell Labs and Princeton University, emphasized that seeing may be believing or disbelieving, but above all, data analysis involves visual, as well as statistical, understanding. Certainly one of the oldest visual explanations in mathematics is the visual proof of the Pythagorean theorem. The proof, impressive in its brevity and elegance, stresses the power of an interactive visual representation in facilitating our analytical thought processes. Thus, visual reasoning approaches to extracting and comprehending the information encoded in data sets became the focus of what is called visual data mining. The field emerged from the integration of concepts from numerous fields, including computer graphics, visualization metaphors and methods, information and scientific data visualization, visual perception, cognitive psychology, diagrammatic reasoning, 3D virtual reality systems, multimedia and design computing, data mining and online analytical processing, very large databases, and even collaborative virtual environments.

The importance of the field had already been recognized at the beginning of the decade. This was reflected in the series of visual data mining workshops conducted at the major international conferences devoted to data mining. Later, the conferences and periodicals in information visualization paid substantial attention to some developments in the field. Commercial tools, and the work in several advanced laboratories and research groups across the globe, provided working environments for experimenting not only with different methods and techniques for facilitating the human visual system in the examination, discovery and understanding of patterns among massive volumes of multi-dimensional and multi-source data, but also for testing techniques that provide robust and statistically valid visual patterns. It was not until a panel of more than two dozen internationally renowned individuals was assembled, in order to address the shortcomings and drawbacks of the current state of visual information processing, that the need for a systematic and methodological development of visual analytics was placed among the top priorities on the research and development agenda in 2005.

This book aims at addressing this need. Through a collection of 21 chapters selected from more than 46 submissions, it offers a systematic presentation of the state of the art in the field, presenting it in the context of visual analytics. Because visual analysis differs fundamentally from purely automated analysis, it is an extremely significant topic for contemporary data mining and data analysis.

The editors would like to thank all the authors for their contribution to the volume and their patience in addressing reviewers' and editorial feedback. Without their contribution and support the creation of this volume would have been impossible. The editors would like to thank the reviewers for their thorough reviews and detailed comments.
Special thanks go to John Roddick, who, on short notice, kindly accepted the invitation to write the Foreword to the book.
April 2008
Simeon J. Simoff
Michael Böhlen
Arturas Mazeika
Table of Contents

Visual Data Mining: An Introduction and Overview
Simeon J. Simoff, Michael H. Böhlen, and Arturas Mazeika

Part 1 – Theory and Methodologies

The 3DVDM Approach: A Case Study with Clickstream Data
Michael H. Böhlen, Linas Bukauskas, Arturas Mazeika, and Peer Mylov

Form-Semantics-Function – A Framework for Designing Visual Data Representations for Visual Data Mining
Simeon J. Simoff

A Methodology for Exploring Association Models
Alipio Jorge, João Poças, and Paulo J. Azevedo

Visual Exploration of Frequent Itemsets and Association Rules
Li Yang

Visual Analytics: Scope and Challenges
Daniel A. Keim, Florian Mansmann, Jörn Schneidewind, Jim Thomas, and Hartmut Ziegler

Part 2 – Techniques

Using Nested Surfaces for Visual Detection of Structures in Databases
Arturas Mazeika, Michael H. Böhlen, and Peer Mylov

Visual Mining of Association Rules
Dario Bruzzese and Cristina Davino

Interactive Decision Tree Construction for Interval and Taxonomical Data
François Poulet and Thanh-Nghi Do

Visual Methods for Examining SVM Classifiers
Doina Caragea, Dianne Cook, Hadley Wickham, and Vasant Honavar

Text Visualization for Visual Text Analytics
John Risch, Anne Kao, Stephen R. Poteet, and Y.-J. Jason Wu

Visual Discovery of Network Patterns of Interaction between Attributes
Simeon J. Simoff and John Galloway

Mining Patterns for Visual Interpretation in a Multiple-Views Environment
José F. Rodrigues Jr., Agma J.M. Traina, and Caetano Traina Jr.

Using 2D Hierarchical Heavy Hitters to Investigate Binary Relationships
Daniel Trivellato, Arturas Mazeika, and Michael H. Böhlen

Complementing Visual Data Mining with the Sound Dimension: Sonification of Time Dependent Data
Monique Noirhomme-Fraiture, Olivier Schöller, Christophe Demoulin, and Simeon J. Simoff

Context Visualization for Visual Data Mining
Mao Lin Huang and Quang Vinh Nguyen

Assisting Human Cognition in Visual Data Mining
Simeon J. Simoff, Michael H. Böhlen, and Arturas Mazeika

Part 3 – Tools and Applications

Immersive Visual Data Mining: The 3DVDM Approach
Henrik R. Nagel, Erik Granum, Søren Bovbjerg, and Michael Vittrup

DataJewel: Integrating Visualization with Temporal Data Mining
Mihael Ankerst, Anne Kao, Rodney Tjoelker, and Changzhou Wang

A Visual Data Mining Environment
Stephen Kimani, Tiziana Catarci, and Giuseppe Santucci

Integrative Visual Data Mining of Biomedical Data: Investigating Cases in Chronic Fatigue Syndrome and Acute Lymphoblastic Leukaemia
Paul J. Kennedy, Simeon J. Simoff, Daniel R. Catchpoole, David B. Skillicorn, Franco Ubaudi, and Ahmad Al-Oqaily

Towards Effective Visual Data Mining with Cooperative Approaches
François Poulet

Author Index
Visual Data Mining: An Introduction and Overview

Simeon J. Simoff¹,², Michael H. Böhlen³, and Arturas Mazeika³

¹ School of Computing and Mathematics, College of Health and Science, University of Western Sydney, NSW 1797, Australia, [email protected]
² Faculty of Information Technology, University of Technology, Sydney, PO Box 123, Broadway NSW 2007, Australia
³ Faculty of Computer Science, Free University of Bolzano-Bozen, Italy, {arturas, boehlen}@inf.unibz.it
1 Introduction

In our everyday life we interact with various information media, which present us with facts and opinions, supported with some evidence, based, usually, on condensed information extracted from data. It is common to communicate such condensed information in a visual form: a static or animated, preferably interactive, visualisation. For example, when we watch familiar weather programs on TV, landscapes with cloud, rain and sun icons and numbers next to them quickly allow us to build a picture of the predicted weather pattern in a region. Playing sequences of such visualisations easily communicates the dynamics of the weather pattern, based on the large amount of data collected by many thousands of climate sensors and monitors scattered across the globe and on weather satellites. These pictures are fine when one watches the weather on Friday to plan what to do on Sunday; after all, if the patterns are wrong there are always alternative ways of enjoying a holiday. Professional decision making is a rather different scenario. It requires weather forecasts at a high level of granularity and precision, and in real time. Such requirements translate into requirements for high-volume data collection, processing, mining, modelling, and communicating the models quickly to the decision makers. Further, the requirements translate into high-performance computing with integrated, efficient, interactive visualisation. From a practical point of view, if a weather pattern cannot be depicted fast enough, then it has no value. Recognising the power of the human visual perception system and pattern recognition skills adds another twist to the requirements: data manipulations need to be completed at least an order of magnitude faster than the real-time change in the data in order to combine them with a variety of highly interactive visualisations, allowing easy remapping of data attributes to the features of the visual metaphor used to present the data. In these few steps through the weather domain, we have specified some requirements towards a visual data mining system.
2 The Term

As a term, visual data mining has been around for nearly a decade. There is some variety in what different research groups understand under this label. "The goal of
visual data mining is to help a user to get a feeling for the data, to detect interesting knowledge, and to gain a deep visual understanding of the data set" [1]. Niggemann [2] looked at visual data mining as visual presentation of the data close to the mental model. As humans understand information by forming a mental model that captures only a gist of the information, a data visualisation close to the mental model can reveal hidden information encoded in the data [2]. Though difficult to measure, such closeness is important, taking into account that visualisation algorithms map data sets that usually lack inherent 2D and 3D semantics onto a physical display (for example, a 2D screen space or a 3D virtual reality platform). Ankerst [3], in addition to the role of the visual data representation, explored the relation between visualisation and the data mining and knowledge discovery (KDD) process, and defined visual data mining as "a step in the KDD process that utilizes visualisation as a communication channel between the computer and the user to produce novel and interpretable patterns." Ankerst [3] also explored three different approaches to visual data mining, two of which are connected with the visualisation of final or intermediate results, while the third involves interactive manipulation of the visual representation of the data, rather than of the results of the algorithm. The three definitions recognise that visual data mining relies heavily on the human visual processing channel and utilises human cognition. The three definitions also emphasise, respectively, the key importance of the following three aspects of visual data mining: (i) tasks; (ii) visual representation (visualisation); and (iii) the process. Our working definition looks at visual data mining as the process of interaction and analytical reasoning with one or more visual representations of abstract data that leads to the visual discovery of robust patterns in these data that form the information and knowledge utilised in informed decision making. The abstract data can be the original data set and/or some output of data mining algorithm(s).
3 The Process

Fig. 1 illustrates a visual data mining process that corresponds to our definition. The visual processing pipeline is central to the process. Each step of this pipeline involves interaction with the human analyst, indicated by the bi-directional links that connect each step in the process and the human analyst. These links indicate that all the iterative loops in the process close via the human analyst. In some cases, data mining algorithms can be used to assist the process. Data mining algorithm(s) can be applied to the data: (a) before any visualisation has been considered, and (b) after some visual interaction with the data. In the first case, any of the output (intermediate and final) of the data mining algorithm can be included in the visual processing pipeline, either on its own or together with a visualisation of the original data. For example, Fig. 2 illustrates the case where the output of a data mining algorithm, in this case an association rule mining algorithm, is visualised, visually processed, and the result is then fed into the visual processing pipeline (in this example we have adapted the interactive mosaic plots visualisation technique for association rules [4]). In another iteration, the analyst can take into account the visualisation of the output of the data mining algorithm when interacting with the raw data visualisation [5] or explore the association rule set using another visual representation [6].
Central to the "Analytical reasoning" step in Fig. 1 is the sense-making process [7]. The process is not a straightforward progression but has several iterative steps that are not shown in Fig. 1.

[Fig. 1. Visual data mining as a human-centred interactive analytical and discovery process. The visual processing pipeline runs from Data through Data preparation, Selection of visual representation (drawing on a Collection of Visualisation Techniques), Mapping data to visual representation, Interacting with visualisation, and Analytical reasoning, to Information, Knowledge; data mining algorithms can feed into the pipeline at several points.]
As the visual data mining process relies on visualisation and the interaction with it, the success of the process depends on the breadth of the collection of visualisation techniques (see Fig. 1), on the consistency of the design of these visualisations, on the ability to interactively remap the data attributes to visualisation attributes, on the set of functions for interacting with the visualisation, and on the capabilities that these functions offer in support of the reasoning process. In Fig. 1 the "Collection of Visualisation Techniques" consists of graphical representations for data sets coupled with user interface techniques to operate with each representation in search of patterns. This coupling has been recognised since the early days of the field (for instance, see the work done at Bell Laboratories/Lucent Technologies [8] for an example of two graphical representations for mining telecommunications data, one for calling communities and the other for showing individual phone calls, and the corresponding interfaces that were successfully applied to telecommunications fraud detection through visual data mining). Keim [9] emphasised further the links between information visualisation and the interactivity of the visual representation in terms of visual data mining, introducing a classification relating the two areas, based on the data type to be visualised, the visualisation technique, and the interaction and distortion technique. Though interaction has been recognised, there is a strong demand for the development of interactive visualisations, which are fundamentally different from static visualisations. Designed with the foundations of perceptual and cognitive theory in mind and focused on supporting the processes and methods in visual data mining, these visual data representations are expected to be relevant to the visual data mining tasks and effective in terms of achieving the data mining goals.
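To make the idea of interactive remapping concrete, the following is a minimal sketch, not taken from any of the systems discussed here: the data columns and mapping keys are invented for illustration. It treats the attribute-to-visual-property mapping as a swappable object, so that remapping amounts to re-rendering with a new mapping.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical clickstream session summaries; all column names are invented.
data = pd.DataFrame({
    "pages":     [3, 12, 7, 25, 4],
    "seconds":   [40, 310, 95, 620, 52],
    "purchases": [0, 2, 0, 3, 1],
})

def render(mapping):
    """Draw one scatter plot for a given attribute-to-visual-property mapping."""
    plt.figure()
    plt.scatter(data[mapping["x"]], data[mapping["y"]],
                s=30 + 60 * data[mapping["size"]])  # marker size encodes a third attribute
    plt.xlabel(mapping["x"])
    plt.ylabel(mapping["y"])
    plt.show()

# Remapping is then a single interaction: swap the dictionary and re-render.
render({"x": "pages",   "y": "seconds",   "size": "purchases"})
render({"x": "seconds", "y": "purchases", "size": "pages"})
```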
[Fig. 2. Interaction with the visual representation of the output of an association rule mining algorithm as part of the visual reasoning process (the visualisation technique has been adapted from [4]). The figure shows mosaic-plot visualisations of two rules over items such as "heineken", "soda" and "cracker", together with a function for combining visualisations of rules and redefining the rules, on top of the association rules mining algorithm and data preparation steps.]
The separation of the visual analysis and visual results presentation tasks in visual data mining is reflected in recent visual data mining tools, for instance, Miner3D (http://www.miner3d.com/), NetMap (http://www.netmap.com.au/), and NetMiner (http://www.visualanalytics.com/), which include two types of interfaces: (a) a visual data mining interface for manipulating visualisations in order to facilitate human visual pattern recognition skills; and (b) a visual presentation interface for generating visual (static and animated) reports and presentations of the results, to facilitate human communication and comprehension of discovered information and knowledge. As the ultimate goal is a visual data mining result that is robust and not accidental, visualisation techniques may be coupled with additional statistical techniques that lead to data visualisations that more accurately represent the underlying properties of the data. For example, the VizRank approach [10] ranks and selects two-dimensional projections of class-labelled data that are expected to be the better ones for visual analysis. These projections (out of the numerous possible ones in multi-dimensional large data sets) are then recommended for visual data mining.
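The sketch below captures the spirit of this idea rather than the published method: VizRank scores projections with a k-nearest-neighbour based measure of class separation, and the particular choices here (scikit-learn, cross-validated k-NN accuracy as the score) are a simplification of our own.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def rank_projections(X, y, k=5):
    """Rank all 2D axis-parallel projections of X by how well a k-NN
    classifier separates the classes within that projection."""
    scored = []
    for i, j in combinations(range(X.shape[1]), 2):
        score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                X[:, [i, j]], y, cv=5).mean()
        scored.append((score, (i, j)))
    return sorted(scored, reverse=True)  # most class-separating views first

X, y = load_iris(return_X_y=True)
for score, dims in rank_projections(X, y)[:3]:
    print(f"dimensions {dims}: estimated class separation {score:.2f}")
```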
4 The Book

Visual data mining has not been presented extensively in a single volume of works. Westphal and Blaxton [11] present an extensive overview of various data and
information visualisation tools and their visual displays in the context of visual data mining. The comprehensive survey by Oliveira and Levkowitz [12] of various visualisation techniques, and of what can actually be done with them beyond just navigating through the abstract representation of data, makes a convincing case for the need for tighter coupling of data and information visualisation with data mining strategies and analytical reasoning. During 2001-2003 there was a series of workshops on visual data mining [9, 13-15] with proceedings focused on the state of the art in the area and the research in progress. Soukup and Davidson's monograph on visual data mining [16] takes a business perspective and practice-based approach to the process and application of visual data mining, with a limited coverage of visual data representations and interactive techniques. This book aims at filling the gap between the initiative started with the workshops and the business perspectives [11, 16]. The collection of chapters from leading researchers in the field presents the state-of-the-art developments in visual data mining and places them in the context of visual analytics. The volume is solely devoted to the research and developments in the field.

The book is divided into three main parts: Part 1 – Theory and Methodologies, Part 2 – Techniques, and Part 3 – Tools and Applications. Part 1 includes five chapters that present different aspects of theoretical underpinnings, frameworks and methodologies for visual data mining.

Chapter "The 3DVDM Approach: A Case Study with Clickstream Data" introduces an approach to visual data mining whose development started as part of an interdisciplinary project at Aalborg University, Denmark, in the late 90s. The 3DVDM project explored the development and potential of immersive visual data mining technology in 3D virtual reality environments. This chapter explores the methodology of the 3DVDM approach and presents one of the techniques in its collection: state-of-the-art interaction and analysis techniques for exploring 3D data visualisations at different levels of granularity, using density surfaces. The state of the art is demonstrated with the analysis of clickstream data, a challenging task for visual data mining, taking into account the amount and categorical nature of the data set.

Chapter "Form-Semantics-Function – A Framework for Designing Visual Data Representations for Visual Data Mining" addresses the issue of designing consistent visual data representations, as they are a key component in the visual data mining process [17, 18]. [16] aligned visual representations with some visual data mining tasks. It is acknowledged that, in the current state of the development, it is a challenging activity to find out the methods, techniques and corresponding tools that support visual mining of a particular type of information. The chapter presents an approach for the comparison of visualisation techniques across different designs.

Association rules are one of the oldest models generated by data mining algorithms, a model that has been applied in many domains. A common problem of association rule mining algorithms is that they generate a large number of frequent itemsets and, from them, association rules, which remain difficult to comprehend due to their sheer quantity. The book includes several chapters providing methodologies, techniques and tools for the visual analysis of sets of association rules.
In chapter "A Methodology for Exploring Association Models" Alipio Jorge, João Poças and Paulo Azevedo present a methodology for the visual representation of association rule models and interaction with the model. As association rule mining
algorithms can produce a large number of rules, this approach enables visual exploration, and to some extent mining, of large sets of association rules. The approach includes strategies for the separation of the rules into subsets and means for manipulating and exploring these subsets. The chapter provides practical solutions supporting the visual analysis of association rules.

In chapter "Visual Exploration of Frequent Itemsets and Association Rules" Li Yang introduces a visual representation of frequent itemsets and association rules based on an adaptation of the popular parallel coordinates technique for visualising multidimensional data [19]. The analyst has control over what is visualised and what is not, and by varying the border can explore patterns among the frequent itemsets and association rules.

In chapter "Visual Analytics: Scope and Challenges" Daniel Keim, Florian Mansmann, Jörn Schneidewind, Jim Thomas and Hartmut Ziegler present an overview of the context in which visual data mining is presented in this book: the field of visual analytics. Visual data mining enables the visual discovery of regularities in data sets, or of regularities in the output of data mining algorithms, as discussed above. Visual decision making produces decisions that rely heavily on visual data mining, but useful regularities are only a part of the entire decision-making process (see chapters 1-3 in [20] for a detailed analysis of the relations between information visualisation, visual data mining and visual decision making). With respect to visual analytics, visual data mining offers powerful techniques for the extraction of visual patterns. Visual analytics adds on top of this a variety of analytical techniques for making sense of, and discoveries from, the visual output of visual data mining systems. Visual data mining systems that support visual analytics combine a collection of information visualisation metaphors and techniques with aspects of data mining, statistics and predictive modelling.

Part 2 of the book groups chapters that present different techniques for visual data mining. Some of them are focused on revealing visual structures in the data, others on the output of data mining algorithms (association rules). The latter techniques can be viewed as "visual extensions" of the data mining algorithms. Another interesting trend in this part is the tight coupling of interactive visualisations with conventional data mining algorithms, which leads to advanced visual data mining and analytics solutions.

Chapter "Using Nested Surfaces for Visual Detection of Structures in Databases" presents a technique for facilitating the detection of less obvious or weakly expressed visual structures in the data by equalising the presence of the more and less pronounced structures in a data set. The chapter presents the technical details of the algorithms. Though the practical realisation of the work is part of the 3DVDM environment, the algorithms are valid for any visual data mining system utilising 3D virtual reality environments.

In chapter "Visual Mining of Association Rules" Dario Bruzzese and Cristina Davino present a framework for utilising various visualisation techniques to represent association rules and then to provide visual analysis of these rules. The chapter promotes a smorgasbord approach that employs a variety of visual techniques to unveil association structures and visually identify, within these structures, the rules that are relevant and meaningful in the context of the analysis.
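As a concrete, much-simplified illustration of the kind of itemset and rule visualisations discussed in the Li Yang and Bruzzese/Davino chapters above (the itemsets, supports, and visual encoding below are invented; the published techniques are considerably richer):

```python
import matplotlib.pyplot as plt

# Hypothetical frequent itemsets with their supports.
itemsets = [
    (["beer", "chips"], 0.18),
    (["beer", "chips", "salsa"], 0.07),
    (["bread", "butter"], 0.22),
    (["bread", "butter", "jam"], 0.09),
]

items = sorted({item for itemset, _ in itemsets for item in itemset})
ypos = {item: k for k, item in enumerate(items)}   # item -> axis position

for itemset, support in itemsets:
    xs = range(1, len(itemset) + 1)                # one vertical axis per item slot
    ys = [ypos[item] for item in sorted(itemset)]
    plt.plot(xs, ys, marker="o",
             linewidth=1 + 20 * support)           # support -> line thickness

plt.xticks([1, 2, 3], ["1st item", "2nd item", "3rd item"])
plt.yticks(range(len(items)), items)
plt.title("Frequent itemsets as polylines over parallel coordinates")
plt.show()
```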
In chapter "Interactive Decision Tree Construction for Interval and Taxonomical Data" François Poulet and Thanh-Nghi Do provide a framework for extending conventional decision tree-building algorithms into visual interactive decision tree classifier builders. The authors demonstrate it on modifications of two decision tree inducing algorithms. The approach offers the benefit of combining the statistical underpinning mechanisms that guide the classifier induction with interactive visual intervention and manipulation capabilities, which allow deep visual exploration of the data structures and the relating of those structures to specific statistical characteristics of the patterns.

Support vector machines (SVMs) are another popular classifier. The winners of Task 2 in the KDD Cup 2002 (http://www.biostat.wisc.edu/~craven/kddcup/winners.html), Adam Kowalczyk and Bhavani Raskutti, applied a proprietary SVM algorithm. In chapter "Visual Methods for Examining SVM Classifiers" Doina Caragea, Dianne Cook, Hadley Wickham and Vasant Honavar look at coupling data visualisation with an SVM algorithm in order to obtain some knowledge about the data that can be used to select data attributes and parameters of the SVM algorithm. The visual technique complements the conventional cross-validation method. The last chapter of the book, François Poulet's "Towards Effective Visual Data Mining with Cooperative Approaches", also considers the interaction between the SVM technique and visualisations, though as part of a broader philosophy of integrated visual data mining tools.

In chapter "Text Visualization for Visual Text Analytics" John Risch, Anne Kao, Stephen Poteet and Jason Wu explore visual text analytics techniques and tools. Text analytics couples semantic mapping techniques with visualisation techniques to enable interactive analysis of semantic structures enriched with other information encoded in the text data. The strength of the techniques is in enabling visual analysis of complex multidimensional relationship patterns within the text collection. The techniques discussed in the chapter enable human sense making, comprehension and analytical reasoning about the contents of large and complexly related text collections, without necessarily reading them.

Chapter "Visual Discovery of Network Patterns of Interaction between Attributes" presents a methodology, a collection of techniques and a corresponding visual data mining tool that enable the visual discovery of network patterns in data, in particular patterns of interaction between attributes, which are usually assumed to be independent in the paradigm of conventional data mining methods. Techniques that identify network structures within the data are receiving increasing attention, as they attempt to uncover linkages between the entities and their properties described by the data set. The chapter presents a human-centred visual data mining methodology and illustrates the details with two case studies: fraud detection in the insurance industry and Internet traffic analysis.

In chapter "Mining Patterns for Visual Interpretation in a Multiple-Views Environment" José Rodrigues Jr., Agma Traina and Caetano Traina Jr. address the issue of semantically consistent integration of multiple visual representations and their interactive techniques. The chapter presents a framework which integrates three techniques and their workspaces according to the analytical decisions made by the analyst. The analyst can identify visual patterns when analysing in parallel multiple subsets of the
data and cross-link these patterns in order to facilitate the discovery of patterns in the whole data set.

Chapter "Using 2D Hierarchical Heavy Hitters to Investigate Binary Relationships" presents an original technique for the identification of hierarchical heavy hitters (HHH). The concept of hierarchical heavy hitters is very useful in identifying dominant or unusual traffic patterns. In terms of Internet traffic, a heavy hitter is an entity which accounts for at least a specified proportion of the total activity on the network, measured in terms of number of packets, bytes, connections, etc. A "heavy hitter" can be an individual flow or connection. "Hierarchical" accounts for the possible aggregation of multiple flows/connections that share some common property, but which themselves may not necessarily be heavy hitters (similar to hierarchical frequent itemsets). The chapter presents a visual data mining technique which addresses the space complexity of HHHs in multidimensional data streams [21]. The tool utilises the 3DVDM engine discussed in detail in Part 3.

Chapter "Complementing Visual Data Mining with the Sound Dimension: Sonification of Time Dependent Data" explores the potential integration of visual and audio data representations. It presents the results of an experimental study of different sonification strategies for 2D and 3D time series. As sound is a function of time, sonification intuitively seems to be a suitable extension for mining time series and other sequence data. Following a human-computer interaction approach, the study looks at the potential differentiation in the support that different sonification approaches provide to analysts in addition to the visual graphs. There has been an increasing interest in experiments with data sonification techniques. Though there are no systematic recommendations for mappings between data and sound, in general these mappings can be split into three types: mapping data values, or functions of them, to (i) an audio frequency; (ii) musical notes and combinations of them; and (iii) specific sound patterns (like a drum beat), where changes in the data are reflected in changes in the parameters of those sound patterns (a small sketch of the first two mapping types follows the next paragraph). The parameter selection process is complicated by constraints on the quality of the generated sound. For example, if the analyst decides to map data onto musical notes and use the tempo parameter to rescale too detailed a time series, then, when selecting the range of the tempo and the instrument to render the sonified data, the analyst has to be careful to avoid introducing distortions in the sound. Similar to the sensitivity of visual data representations to the cognitive style of the analyst, sonification may be sensitive to individual abilities to hear, which is one of the hypotheses of the study in this chapter.

In chapter "Context Visualisation for Visual Data Mining" Mao Lin Huang and Quang Vinh Nguyen focus on the methodology and visualisation techniques which support the presence of history and the context of the visual exploration in the visual display. Context and history visualisation plays an important role in visual data mining, especially in the visual exploration of large and complex data sets. Incorporating context and history information in the visual display can assist the discovery and sense-making processes, allowing the analyst to grasp the context in which a pattern occurs and its possible periodical reoccurrence in a similar or changed context.
Context and history facilitate our short- and long-term memory in the visual reasoning process. The importance of providing both context and some history has been re-emphasised in the visual data mining framework presented in [22].
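Picking up the sonification strategies above, here is a minimal sketch of mapping types (i) and (ii): data values mapped linearly onto audio frequencies, optionally snapped to the nearest equal-tempered semitone. This is a simplified illustration of the general idea, not the chapter's implementation, and the value ranges are invented.

```python
import math

def to_frequencies(values, f_min=220.0, f_max=880.0, quantize=False):
    """Map data values to audio frequencies (mapping (i)); with quantize=True,
    snap each frequency to the nearest semitone (a crude form of mapping (ii))."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0                       # avoid division by zero
    freqs = [f_min + (v - lo) / span * (f_max - f_min) for v in values]
    if quantize:
        # Round each frequency to the nearest MIDI note number, then back to Hz.
        notes = [round(69 + 12 * math.log2(f / 440.0)) for f in freqs]
        freqs = [440.0 * 2 ** ((n - 69) / 12) for n in notes]
    return freqs

series = [3.1, 3.4, 2.9, 5.8, 6.0, 4.2]          # a toy time series
print(to_frequencies(series, quantize=True))     # one frequency per data point
```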
Chapter "Assisting Human Cognition in Visual Data Mining" brings into the reader's focus the issue of visual bias, introduced by each visualisation technique, which may result in biased and inaccurate analytical outcomes. Visual bias is a perceptual bias of the visual attention, introduced by the visualisation metaphor and leading to selective rendering of the underlying data structures. As interaction with visual representations of the data is subjective, the results of this interaction are prone to misinterpretation, especially in the case of analysts with lesser experience and understanding of the applicability and limitations of different visual data mining methods. Similar to the chapter "Visual Discovery of Network Patterns of Interaction between Attributes", this chapter takes the viewpoint of visual data mining as a "reflection-in-action" process. Subjective bias within this framework can then be addressed in two ways. One is to enable human visual pattern recognition through data transformations that possess particular properties, known in advance, and facilitate the visual separation of the data points. The other approach the authors propose for addressing subjective bias is corrective feedback, validating visual findings by employing another method to refine and validate the outcomes of the analysis. The two methods, labelled "guided cognition" and "validated cognition", respectively, are presented through examples from case studies.

Part 3 – Tools and Applications includes chapters whose focus is on specific visual data mining systems. This is a selective, rather than comprehensive, inclusion of systems or combinations of tools that enable visual data mining methodologies. Fairly comprehensive design criteria for visual data mining systems are presented in [22].

In chapter "Immersive Visual Data Mining: The 3DVDM Approach", Henrik Nagel, Erik Granum, Søren Bovbjerg and Michael Vittrup present an immersive visual data mining system which operates on different types of displays, spanning from a common desktop screen to virtual reality environments such as a six-sided CAVE, Panorama and PowerWall. The 3DVDM system was developed as part of the interdisciplinary research project at Aalborg University, Denmark, mentioned earlier in this chapter, with partners from the Department of Computer Science, the Department of Mathematical Sciences, the Faculty of Humanities, and the Institute of Electronic Systems. The system has an extensible framework for interactive visual data mining. The chapter presents the basic techniques implemented in the core of the system. The pioneering work and experiments in visual data mining in virtual reality demonstrated the potential of immersive visual data mining, i.e., the immersion of the analyst within the visual representation of the data. This is a substantially different situation compared to viewing data on conventional screens, where analysts observe the different visual representations and projections of the data set from outside the visualisation. In the 3DVDM system, the analyst can walk in and out of the visual presence of the data, as well as take a dive into a specific part of the data set.
The value of the research and development of the 3DVDM system lies in the numerous research topics that it has opened, spanning from issues in building representative lower-dimension projections of multidimensional data that preserve the properties of the data, through to issues in human-data interaction and visual data mining methodologies in virtual reality environments. In chapter "DataJewel: Integrating Visualization with Temporal Data Mining", Mihael Ankerst, Anne Kao, Rodney Tjoelker and Changzhou Wang present the
architecture and the algorithms behind a visual data mining tool for mining temporal data. The novelty of the architecture is the tight integration of the visual representation, the underlying algorithmic component, and the data structure that supports mining of large temporal databases. The design of the system enables the integration of new temporal data mining algorithms. DataJewel, as the tool is aptly named, offers an integrating interface which uses colour to coordinate the activities of both the algorithms and the analyst. The chapter reports on experiments in analysing large datasets incorporating data from airplane maintenance and discusses the applicability of the approach to homeland security, market basket analysis and web mining.

In chapter "A Visual Data Mining Environment" Stephen Kimani, Tiziana Catarci, and Giuseppe Santucci present the visual data mining system VidaMine. The philosophy behind the design of VidaMine is similar to the philosophy underpinning the design of DataJewel: (i) an architecture open to the inclusion of new algorithms, and (ii) a design which aims at supporting the analyst throughout the entire discovery process. The chapter presents the details of the architecture and looks at different data mining tasks and the way the environment supports the corresponding scenarios. The chapter includes a usability study of a prototype from the point of view of the analyst, a step that should become a standard for the development of systems for the visual data mining process and, in general, of any system that supports a human-centred data mining process.

Chapter "Integrative Visual Data Mining of Biomedical Data: Investigating Cases in Chronic Fatigue Syndrome and Acute Lymphoblastic Leukaemia" presents the application of visual data mining as part of an overall cycle for enabling knowledge support of clinical reasoning. The authors argue that, similar to the holistic approach of Eastern medicine, knowledge discovery in the biomedical domain requires methods that address an integrated data set of the many sources that can potentially contribute to an accurate picture of the individual disease. The presented methods and algorithms, and the integration of the individual algorithms and methods, known informally as the "galaxy approach" (as opposed to the reductionist approach), address the complexity of biomedical knowledge and, hence, of the methods used to mine biomedical data. Data mining techniques are not only part of the pattern discovery process, but also contribute to relating models to existing biomedical knowledge bases and to creating a visual representation of discovered relations in order to formulate hypotheses to put to biomedical researchers. The practical application of the methodology and the specific data mining techniques is demonstrated in identifying the biological mechanisms of two different types of diseases, Chronic Fatigue Syndrome and Acute Lymphoblastic Leukaemia, which share the structure of the collected data.

In the last chapter of the book, "Towards Effective Visual Data Mining with Cooperative Approaches", François Poulet addresses the issues of tightly coupling the visualisation and analytical processes to form an integrated data mining tool that builds on the strengths of both camps.
The author demonstrates his approach by coupling two techniques, some aspects of which have been discussed earlier in the book: the interactive decision tree algorithm CIAD (see chapter "Interactive Decision Tree Construction for Interval and Taxonomical Data") and the combination of visualisation and SVMs (see chapter "Visual Methods for Examining SVM Classifiers"). In this chapter these techniques are linked together: on the one hand, SVM optimises the interactive split
in the tree node; on the other hand, interactive visualisation is coupled with SVM to tune SVM parameters and provide an explanation of the results.
5 Conclusions and Acknowledgements

The chapters of this book draw together the state of the art in the theory, methods, algorithms, practical implementations and applications of what constitutes the field of visual data mining. Through the collection of these chapters the book presents a selected slice of the theory, methods, techniques and technological development in visual data mining, a human-centric data mining approach which has an increasing importance in the context of visual analytics [17]. There are numerous works on data mining, information visualisation, visual computing, and visual reasoning and analysis, but there has been a gap in placing related works in the context of visual data mining. Human visual pattern recognition skills based on the detection of changes in the shape, colour and motion of objects have been recognised, but rarely positioned in depth in the context of visual data mining ([20] is an exception). The purpose of this book is to fill these gaps. The unique feature of visual data mining is that insight emerges during the interaction with data displays. Through such interaction, classes of patterns are revealed both by the visual structures formed in the displays and by the position of these structures within the displays. As an interdisciplinary field, visual data mining is at the stage of forming its own niche, going beyond the unique amalgamation of the disciplines involved.
References

1. Beilken, C., Spenke, M.: Visual interactive data mining with InfoZoom - the Medical Data Set. In: Proceedings 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 1999, Prague, Czech Republic (1999)
2. Niggemann, O.: Visual Data Mining of Graph-Based Data. Department of Mathematics and Computer Science, University of Paderborn, Germany (2001)
3. Ankerst, M.: Visual Data Mining. Ph.D. thesis, Dissertation.de: Faculty of Mathematics and Computer Science, University of Munich (2000)
4. Hofmann, H., Siebes, A., Wilhelm, A.F.X.: Visualizing association rules with interactive mosaic plots. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, Boston (2000)
5. Zhao, K., et al.: Opportunity Map: A visualization framework for fast identification of actionable knowledge. In: Proceedings of the ACM Fourteenth Conference on Information and Knowledge Management (CIKM 2005), Bremen, Germany (2005)
6. Blanchard, J., Guillet, F., Briand, H.: Interactive visual exploration of association rules with rule-focusing methodology. Knowledge and Information Systems 13(1), 43–75 (2007)
7. Pirolli, P., Card, S.: Sensemaking processes of intelligence analysts and possible leverage points as identified through cognitive task analysis. In: Proceedings of the 2005 International Conference on Intelligence Analysis, McLean, Virginia (2005)
8. Cox, K.C., Eick, S.G., Wills, G.J.: Visual Data Mining: Recognizing Telephone Calling Fraud. Data Mining and Knowledge Discovery 1, 225–231 (1997)
9. Keim, D.A.: Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics 7(1), 100–107 (2002)
10. Leban, G., et al.: VizRank: Data visualization guided by machine learning. Data Mining and Knowledge Discovery 13(2), 119–136 (2006)
11. Westphal, C., Blaxton, T.: Data Mining Solutions: Methods and Tools for Solving Real-World Problems. John Wiley & Sons, Inc., New York (1998)
12. Oliveira, M.C.F.d., Levkowitz, H.: From Visual Data Exploration to Visual Data Mining: A Survey. IEEE Transactions on Visualization and Computer Graphics 9(3), 378–394 (2003)
13. Simoff, S.J., Noirhomme-Fraiture, M., Böhlen, M.H. (eds.): Proceedings of the International Workshop on Visual Data Mining VDM@PKDD 2001, Freiburg, Germany (2001)
14. Simoff, S.J., Noirhomme-Fraiture, M., Böhlen, M.H. (eds.): Proceedings of the International Workshop on Visual Data Mining VDM@ECML/PKDD 2002, Helsinki, Finland (2002)
15. Simoff, S.J., et al. (eds.): Proceedings of the 3rd International Workshop on Visual Data Mining VDM@ICDM 2003, Melbourne, Florida, USA (2003)
16. Soukup, T., Davidson, I.: Visual Data Mining: Techniques and Tools for Data Visualization and Mining. John Wiley & Sons, Inc., Chichester (2002)
17. Thomas, J.J., Cook, K.A.: Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE CS Press, Los Alamitos (2005)
18. Keim, D.A., et al.: Challenges in visual data analysis. In: Proceedings of the International Conference on Information Visualization (IV 2006). IEEE, Los Alamitos (2006)
19. Inselberg, A.: The plane with parallel coordinates. The Visual Computer 1, 69–91 (1985)
20. Kovalerchuk, B., Schwing, J. (eds.): Visual and Spatial Analysis: Advances in Data Mining, Reasoning, and Problem Solving. Springer, Dordrecht (2004)
21. Hershberger, J., et al.: Space complexity of hierarchical heavy hitters in multi-dimensional data streams. In: Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (2005)
22. Schulz, H.-J., Nocke, T., Schumann, H.: A framework for visual data mining of structures. In: Twenty-Ninth Australasian Computer Science Conference (ACSC 2006), Conferences in Research and Practice in Information Technology, vol. 48, Hobart, Tasmania, Australia. CRPIT (2006)
The 3DVDM Approach: A Case Study with Clickstream Data

Michael H. Böhlen¹, Linas Bukauskas², Arturas Mazeika¹, and Peer Mylov³

¹ Faculty of Computer Science, Free University of Bozen-Bolzano, Dominikanerplatz 3, 39100 Bozen, Italy
² Faculty of Mathematics and Informatics, Vilnius University, Naugarduko 24, 03225 Vilnius, Lithuania
³ Institute of Communication, Aalborg University, Niels Jernes Vej 12, 9220 Aalborg Øst, Denmark
Abstract. Clickstreams are among the most popular data sources because Web servers automatically record each action and the Web log entries promise to add up to a comprehensive description of behaviors of users. Clickstreams, however, are large and raise a number of unique challenges with respect to visual data mining. At the technical level the huge amount of data requires scalable solutions and limits the presentation to summary and model data. Equally challenging is the interpretation of the data at the conceptual level. Many analysis tools are able to produce different types of statistical charts. However, the step from statistical charts to comprehensive information about customer behavior is still largely unresolved. We propose a density surface based analysis of 3D data that uses state-of-the-art interaction techniques to explore the data at various granularities.
1 Introduction Visual data mining is a young and emerging discipline that combines knowledge and techniques form a number of areas. The ultimate goal of visual data mining is to devise visualizations of large amounts of data that facilitate the interpretation of the data. Thus, visual data mining tools should be expected to provide informative but not necessarily nice visualizations. Similarly, visual data mining tools will usually enrich or replace the raw data with model information to support the interpretation. These goals, although widely acknowledged, often get sidelined and are dominated by the development of new visual data mining tools. Clearly, data analysis tools are important but it is at least as important to design principles and techniques that are independent of any specific tool. We present an interdisciplinary approach towards visual data mining that combines mature and independent expertise from multiple areas: database systems, statistical analysis, perceptual psychology, and scientific visualization. Clearly, each of these areas is an important player in the visual data mining process. However, in isolation each area also lacks some of the required expertise. To illustrate this, we briefly recall three very basic properties that any realistic visual data mining system must fulfill. Typically each property is inherent to one area but poorly handled in the other areas. Relevant data. At no point in time is it feasible to load or visualize a significant part, let alone all, of the available data. The data retrieval must be limited to the relevant S.J. Simoff et al. (Eds.): Visual Data Mining, LNCS 4404, pp. 13–29, 2008. c Springer-Verlag Berlin Heidelberg 2008
The relevant part of the data might have to depend on the state of the data mining process, e.g., be defined as the part of the data that is analyzed at a specific point in time.

Model information. It is necessary to visualize model information rather than raw data. Raw data suffers from a number of drawbacks, and a direct visualization is not only slow but makes the quality depend on the properties of the data. A proper model abstracts from individual observations and facilitates the interpretation.

Visual properties. It is very easy to overload visualizations and make them difficult to interpret. The effective use of visual properties (size, color, shape, orientation, etc.) is a challenge [16]. Often, a small number of carefully selected visual properties turns out to be more informative than visual properties galore.

This paper describes a general-purpose interactive visual data mining system that lives up to these properties. Specifically, we explain incremental observer relative data extraction to retrieve a controlled superset of the data that an observer can see. We motivate the use of density information as an accurate model of the data. We visualize density surfaces and suggest a small number of interaction techniques to explore the data. The key techniques are animation, conditional analyses, equalization, and windowing. We feel that these techniques should be part of any state-of-the-art visual data mining system.

Animation. Animation is essential to explore the data at different levels of granularity and to get an overview and feel for the data. Without animations it is very easy to draw wrong conclusions.

Windowing. For analysis purposes it is often useful to have a general windowing functionality that allows the analyst to selectively collapse and expand individual dimensions and that can be used to change the ordering of (categorical) attribute values.

Equalization. The data distribution is often highly skewed. This means that visualizations are dominated by very pronounced patterns. Often these patterns are well known (and thus less interesting), and they make it difficult to explore other parts of the data if no equalization is provided.

Conditional Analyses. Data sets combine different and often independent information. It must be possible to easily perform and relate analyses on different parts of the data (i.e., conditional data analyses).

In Section 4 we will revisit these techniques in detail and discuss how they support the interpretation of the data. Throughout the paper we use clickstream data to illustrate these techniques and show how they contribute to a robust interpretation of the data. Clickstream data is a useful yardstick because it is abundant and because the amount of data requires scalable solutions. We experienced that a number of analysis tools exhibit significant problems when run on clickstream data, and the performance quickly becomes prohibitive for interactive use.

There are a number of tools that offer advanced visual data mining functionality: GGobi [20], MineSet [2], Clementine [11], QUEST [1], and Enterprise Miner [3]. Typically, these are comprehensive systems with a broad range of functionalities. Many of them support most of the standard algorithms known from data mining, machine learning, and statistics, such as association rules, clustering (hierarchical, density-based, Kohonen, k-means), classification, decision trees, regression, and principal components.
None of them supports data mining based on density surfaces for 3D data and advanced interaction techniques for the data analyst. Many of the systems also evolved from more specialized systems targeted at a specific area.

Data access structures are used to identify the relevant data and are often based on the B- and R-tree [9]. R-tree based structures use minimum bounding rectangles to hierarchically group objects, and they support fast lookups of objects that overlap a query point/window. Another family of access structures is based only on space partitioning, as used in the kd-tree [17]. Common to all these structures is a spatial grouping. In our case the visible objects are not necessarily located in the same area. Each object has its own visibility range, and therefore the visible objects may be located anywhere in the universe. In addition to the lack of a spatial grouping of visible objects, the above-mentioned access structures also do not support the incremental extraction of visible objects, which we use to generate the stream of visible objects.

For the model construction we estimate the probability density function (PDF). Probability density functions and kernel estimation have been used for several decades in statistics [18,19,6,5,23]. The main focus has been the precision of the estimation, while we use it to interactively compute density surfaces. We use the density surfaces to partition the space. Space partitioning has been investigated in connection with clustering [10,8,22,24,4,12] and visualization. The visualization is usually limited to drawing a simple shape (dot, icon, glyph, etc.) for each point in a cluster [4,12] or drawing (a combination of) ellipsoids around each cluster [21,7]. The techniques are usually not prepared to handle data that varies significantly in size, contains noise, or includes multiple structures that are not equally pronounced [15]. Moreover, a different density level often requires a re-computation of the clusters, which is unsuitable for interactive scenarios, let alone animation.

Structure of the paper: Section 2 discusses clickstreams. We discuss the format of Web logs, show how to model clickstreams in a data warehouse, and briefly point to current techniques for analyzing clickstreams. Section 3 discusses selected aspects of the 3DVDM System, namely the use of incremental observer relative data extraction to generate a stream of visible observations, and the model computation from this stream. Section 4 uses the clickstream data to illustrate data mining based on density surfaces.
2 Clickstreams

One of the most exciting data sources is the Web log (or clickstream). A clickstream contains a record for every page request from every visitor to a specific site. Thus, a clickstream records every gesture each visitor makes, and these gestures have the potential to add up to comprehensive descriptions of the behavior of users. We expect that clickstreams will identify successful and unsuccessful sessions on our sites, determine happy and frustrated visitors, and reveal which parts of our Web sites are (in)effective at attracting and retaining visitors.

2.1 Web Server Log Files

We start out by looking at a single standardized entry in a log file of a web server:
ask.cs.auc.dk [13/Aug/2002:11:49:24 +0200] "GET /general/reservation/cgi-bin/res.cgi HTTP/1.0" 200 4161 "http://www.cs.auc.dk/general_info/" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

Such an entry contains the following information:
1. The IP address of the visitor. Possibly a cookie ID is included as well.
2. The date and time (GMT) of the page request.
3. The precise HTTP request. The type of request is usually Get or Submit.
4. The returned HTTP status code. For example, code 200 for a successful request, code 404 for a non-existing URL, code 408 for a timeout, etc.
5. The number of bytes that have been transferred to the requesting site.
6. The most recent referring URL (this information is extracted from the referrer field of the HTTP request header).
7. The requesting software; usually a browser like Internet Explorer or Netscape, but it can also be a robot of a search engine.

Web logs do not only contain entries for pages that were explicitly requested. It is common for a page to include links to, e.g., pictures. The browser usually downloads these parts of a document as well, and each such download is recorded in the Web log. Consider the two example entries below. The first entry reports the download of the root page. The second entry is the result of downloading an image that the root page refers to.

0x50c406b3.abnxx3.adsl-dhcp.tele.dk [13/Aug/2002:11:49:27 +0200] "GET / HTTP/1.1" 200 3464 "-" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"

0x50c406b3.abnxx3.adsl-dhcp.tele.dk [13/Aug/2002:11:49:27 +0200] "GET /images/educate.jpg HTTP/1.1" 200 9262 "http://www.cs.auc.dk/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)"
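For illustration, the seven fields above can be recovered with a regular expression, and the session attributes discussed in the next subsection (session ID, number of pages, total session time) can be inferred by sorting on host and time. The sketch below is ours, not part of the original system; the 30-minute session timeout is a common heuristic, not a property of the log format.

Code Sketch (Python): parsing and sessionizing Web log entries

import re
from datetime import datetime, timedelta

# Host, [time], "request", status, bytes, "referrer", "agent".
LOG_RE = re.compile(r'(?P<host>\S+) \[(?P<time>[^\]]+)\] '
                    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
                    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

def parse_entry(line):
    # Split one log line into the seven fields listed above.
    m = LOG_RE.match(line)
    if m is None:
        return None
    d = m.groupdict()
    d["time"] = datetime.strptime(d["time"], "%d/%b/%Y:%H:%M:%S %z")
    return d

def sessionize(entries, timeout=timedelta(minutes=30)):
    # Sort on host and time; a gap longer than `timeout` starts a new
    # session (the 30-minute timeout is a heuristic, not a standard).
    entries = sorted(entries, key=lambda e: (e["host"], e["time"]))
    sessions, current = [], []
    for e in entries:
        if current and (e["host"] != current[-1]["host"]
                        or e["time"] - current[-1]["time"] > timeout):
            sessions.append(current)
            current = []
        current.append(e)
    if current:
        sessions.append(current)
    return sessions

line = ('ask.cs.auc.dk [13/Aug/2002:11:49:24 +0200] '
        '"GET /general/reservation/cgi-bin/res.cgi HTTP/1.0" 200 4161 '
        '"http://www.cs.auc.dk/general_info/" "Mozilla/4.0 (compatible)"')
print(parse_entry(line)["host"])   # -> ask.cs.auc.dk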
2.2 Modeling Clickstreams in a Data Warehouse

We are a long way from inferring user behavior just by analyzing the entries in a web log. We have to clean the raw log data and organize it in a way that supports the business perspective. A well-known approach is to use ETL (extract, transform, load) techniques to pre-process and load the data into a data warehouse that supports business analyses. Figure 1 shows an example data warehouse schema.

Clickstream (fact table): DateKey, TimeKey, VisitorKey, DocumentKey, ReferrerKey, SessionKey, SessionID, TotNrOfPages, TotSessTime
Date: day_of_week, day_in_month, day_in_year, month, quarter, holiday_flag, weekday_flag, year, event
Time: hour, minute, second, month
Visitor: URL, Name, Email, Timezone, Address, Phone, Browser
Document: Level1, Level2, Level3, Level4, Level5, cgi_flag, embedded, size
Referrer: URL, Site, Type
Session: Type, Priority, Secure_flag

Fig. 1. Star Schema of a Clickstream Data Warehouse
The schema in Figure 1 is a classical data warehouse star schema [13]. The Clickstream fact table contains an entry for every record in the Web log and is several GB large. Each entry has a number of key attributes that point to descriptive entries in the dimension tables. The measure attributes carry information about individual sessions (ID, total number of pages accessed during a session, total session time). Note that these attributes are not available in the log file. They have to be inferred during the data preprocessing phase. A common approach is to sort the log on IP and time, and then use a sequential scan to identify sequences of clicks that form a session (cf. the log-parsing and sessionization sketch in Section 2.1). Below we briefly describe those aspects of the dimensions that have unique properties related to the analysis of clickstreams [14].

Date Dimension. The date dimension is small and has a few thousand records at most (365 days per year). The business requirements determine the attributes that should be included. For example, we should record for each day whether it is a holiday or not if some of our analyses will treat holidays separately.

Time Dimension. All times must be expressed relative to a single standard time zone, such as GMT, that does not vary with daylight savings time. The time dimension has 86,400 records, one for each second of a given day. A separate time dimension allows analyses across days and makes it easy to constrain the time independently of the day.

Visitor Dimension. It is possible to distinguish three types of visitors. Anonymous visitors are identified by their IP addresses. In many cases, the IP address only identifies a port on the visitor's Internet service provider. Often, these ports are dynamically reassigned, which makes it difficult to track visitors within a session, let alone across sessions. A cookie visitor is one who has agreed to store a cookie. This cookie is a reliable identifier for a visitor machine. With a cookie, we can be pretty sure that a given machine is responsible for a session, and we can determine when the machine visits us again. An authenticated visitor is a human-identified visitor who has not only accepted our cookie but has also, sometime in the past, revealed their name and other information.
The visitor dimension is potentially huge, but its size can often be reduced significantly. For example, anonymous visitors can be grouped according to domain (and sub-domain). For cookie and authenticated visitors it is likely that we want to build up individual profiles, though.

Document Dimension. The document dimension requires a lot of work to make the clickstream source useful. It is necessary to describe a document by more than its location in the file system. In some cases, the path name to the file is moderately descriptive (e.g., a .jpg extension identifies a picture), but this is certainly not always the case. Ideally, any file on a Web server should be associated with a number of categorical attributes that describe and classify the document.

Session Dimension. The session dimension is a high-level diagnosis of sessions. Possible types are student, prospective student, researcher, surfer, happy surfer, unhappy surfer, crawler, etc. The session information is not explicitly stored in the Web log file. Basically, it has to be reverse engineered from the log file and added during data preprocessing.

2.3 Analyzing Clickstreams

There are a number of standard statistical tools that are used to analyze Web logs. These tools summarize the data and display the summaries. Two typical charts are shown in Figure 2.
(a) Histogram (by Hour)
(b) Pie Chart (by Country)
Fig. 2. Statistical Analysis of a Web Log
Typically, the summaries are computed for one dimension, as illustrated by the charts in Figure 2. 2D (or even 3D) summaries are rarer, although some of the tools offer (2D) crosstabbing. As a result the information is often spread over a large number of tables and graphs, and it is difficult to analyze a clickstream comprehensively. We propose to combine much of this information into a single graph and to use state-of-the-art interaction techniques to analyze the data from different perspectives and at different granularities.
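For reference, the one-dimensional summaries behind charts like those in Figure 2 amount to simple group-by aggregations; a minimal sketch with pandas (the column names and sample rows are our assumptions):

Code Sketch (Python): one-dimensional Web log summaries

import pandas as pd

# Assumed columns: 'time' (request timestamp) and 'host' (visitor domain).
log = pd.DataFrame({
    "time": pd.to_datetime(["2002-08-13 11:49:24", "2002-08-13 20:03:10"]),
    "host": ["ask.cs.auc.dk", "host42.tele.dk"],
})
by_hour = log.groupby(log["time"].dt.hour).size()                 # histogram by hour
by_tld = log.groupby(log["host"].str.split(".").str[-1]).size()   # pie chart by country
print(by_hour)
print(by_tld)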
3 The 3DVDM System

3.1 Overall Architecture

As mentioned in the introduction, visual data mining is fundamentally interdisciplinary. Any visual data mining system should make this a prime issue for the basic design of the system. An obvious approach is to go for a modular system with independent components, as illustrated in Figure 3.
[Figure 3: three modules connected by streams. iORDE (Incremental Observer Relative Data Extraction) retrieves the visible data from the DB and emits a stream of visible data; KES computes density surfaces along the navigation path; VIS provides interaction and visualization, exchanging density surfaces and view directions with KES.]

Fig. 3. 3DVDM System Architecture
The modules are implemented as separate processes that communicate through streams. Each module has an event handler that reacts to specific events. These events can be triggered by any of the other modules. Other than this, no cooperation between or knowledge of the modules is required, which is an asset for a truly interdisciplinary system. The 3DVDM System is being used in environments (Panorama, Cave) where the data analyst has the possibility to explore the data from the inside and outside. Below we describe techniques to retrieve and stream the data during such explorations and show how to process the stream and compute model information.

3.2 Streaming Visible Data

Visible objects are identified by the distance at which the object is seen. We use the visibility range to define the visibility of an object. In the simplest case it is proportional to the size of an object: VR(oi) = oi[s] · c. Here, VR(oi) is the visibility range of object oi, oi[s] is the size of the object, and c is a constant. Thus, object oi will be visible in the circular or hyper-spherical area of size VR(oi) around the center of oi. We write V_P0 to denote the set of objects visible from the point P0. Objects that become visible when an observer moves from position Pl to position Pl+1 are denoted by Δ+_{Pl,Pl+1}. With this, the stream of visible data from position P0 to Pk is defined as the sequence S(P0, Pk) = V_P0, Δ+_{P0,P1}, ..., Δ+_{Pk−2,Pk−1}. Here, k is the number of observer positions and the number of stream slices. The definition says that we stream all visible data for the initial observer position. Any subsequent slices are produced as increments.
Fig. 4. The Universe of Data
(a) Visible Objects
(b) Δ+_{P0,P1}
(c) Δ+_{P1,P2}
Fig. 5. Stream of Visible Data
Figure 4 shows a part of the clickstream data. Specifically, the clicks from four domains are shown: .it, .de, .uk, and .es. We assume equally sized visibility ranges for all clicks (alternatively, the VR could be chosen proportional to, e.g., the size of the downloaded document) and let the observer move along the clicks from .it. The observer starts at the time position 5 and the VR is 4. Initially, 1728 visible objects are received. Figures 5(b) and 5(c) show the incremental slices: Δ+_{P0,P1} contains 503 and Δ+_{P1,P2} contains 363 newly visible objects.

To provide efficient streaming of visible objects we use the VR-tree, an extension of the R-tree, together with the Volatile Access Structure (VAST). The structures enable a fast and scalable computation of Δ+ when the observer is moving. VAST is called volatile because it is initialized and created when the observer starts moving. It is destroyed if the observer no longer needs streaming support. Figure 6 shows an example of VAST that mirrors a subset of the VR-tree. Figure 6(a) shows the hierarchical structure of the VR-tree. Figure 6(b) shows only the visible objects. These are all objects with a VR that includes the observer position. VAST mirrors the visible part of the VR-tree as illustrated in Figure 6(c). In contrast to the VR-tree, which uses absolute Cartesian coordinates, VAST uses an observer-relative representation with distances and angles. This representation is more compact and supports a moving observer.

(a) VR-Tree
(b) Visible Objects
(c) VAST

Fig. 6. The VR-Tree migration
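The semantics of the visibility range and of the Δ+ slices can be captured in a few lines. The sketch below uses a brute-force scan over all objects instead of the VR-tree and VAST, so it illustrates the definitions only, not the scalable implementation:

Code Sketch (Python): visibility ranges and the stream of visible data

import numpy as np

def visible(points, sizes, pos, c=4.0):
    # Objects whose visibility range VR(o) = size(o) * c reaches `pos`.
    dist = np.linalg.norm(points - pos, axis=1)
    return set(np.nonzero(dist <= sizes * c)[0])

def stream(points, sizes, path, c=4.0):
    # Yield V_P0 for the initial position, then one Delta+ slice per move.
    seen = visible(points, sizes, path[0], c)
    yield seen                                  # V_P0
    for pos in path[1:]:
        now = visible(points, sizes, pos, c)
        yield now - seen                        # newly visible objects only
        seen |= now

rng = np.random.default_rng(0)
pts = rng.random((10_000, 3))
szs = np.full(10_000, 0.01)
path = [np.array([0.1, 0.5, 0.5]), np.array([0.2, 0.5, 0.5])]
v0, d1 = stream(pts, szs, path)
print(len(v0), len(d1))   # initial visible set, first incremental slice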
3.3 Model Computation

The visible data is streamed to the model computation module where it is processed in slices to estimate the probability density function (PDF). The estimation algorithm starts with a coarse uniform grid data structure and processes the first slice of the data.
The estimation process can be in one of two states: the estimation state or the skipping state. In the estimation state the algorithm refines the PDF by adding new kernels and refining the grid structure in regions where the PDF is non-linear. The process continues until the desired precision is reached. When the estimation is precise enough it enters the skipping state. In the skipping state the algorithm skips the slices until it finds new information that is not yet reflected in the PDF. If such a slice is found, the processing is resumed. The individual steps of the stream-based processing are shown in Code Fragment 1.

Code Fragment 1. Estimation of the Probability Density Function

Algorithm: estimatePDF
Input:  vDataset: sequence of data points
        ε: accepted precision
        InitG: initial number of grids
Output: APDF tree a
Body:
1. Initialize a
2. skipState = FALSE
3. FOR EACH slice si DO
   3.1 IF !skipState THEN
       3.1.1 Process slice si
       3.1.2 Split the space according to the precision of the estimation
       3.1.3 IF precisionIsOK(a) THEN skipState = TRUE END IF
   3.2 ELSIF newStream(si) THEN
       3.2.1 skipState = FALSE
   END IF
   END FOR
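A minimal runnable analogue of this estimate/skip state machine is sketched below. A fixed histogram grid stands in for the adaptive APDF tree and kernel refinement, and the precision test is a simple max-difference criterion; both are our simplifications, not the 3DVDM algorithm itself:

Code Sketch (Python): two-state streaming density estimation

import numpy as np

def estimate_pdf(slices, bins=16, eps=1e-3):
    # Two-state streaming estimation; a fixed histogram grid stands in
    # for the adaptive APDF tree of Code Fragment 1.
    hist = np.zeros((bins,) * 3)
    n = 0
    skipping = False
    for s in slices:                      # s: (k, 3) array in the unit cube
        new, _ = np.histogramdd(s, bins=bins, range=[(0, 1)] * 3)
        if skipping:
            # 3.2: resume only if the slice is not yet reflected in the PDF
            if np.abs(new / max(len(s), 1) - hist / max(n, 1)).max() <= eps:
                continue
            skipping = False
        before = hist / max(n, 1)
        hist, n = hist + new, n + len(s)  # 3.1.1: process the slice
        # 3.1.3: enter the skipping state once the estimate is stable
        if np.abs(hist / n - before).max() <= eps:
            skipping = True
    return hist / max(n, 1)               # normalized cell frequencies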
The density surface module can be in one of two states: active (in process mode) or inactive (waiting for a wake-up event). The module is active if (i) a change of the input parameters has triggered the recalculation of the PDFs and/or density surfaces, or (ii) the animation of density surfaces is switched on. Code Fragment 2 sketches the event handling. The first block handles new data. A new (or changed) data set triggers the re-computation of the model. The block deletes previous model information if requested (sub-block 1.1), splits the dataset into conditional datasets, and calculates the PDFs for the individual datasets (sub-block 1.3).
The second block handles animation events. It calculates the next density surface level (sub-blocks 2.1–2.2) and computes the corresponding density surfaces (sub-blocks 2.3–2.4). The third block handles events triggered by a change of the density level. This requires a recalculation of the density surfaces at the requested density level. The fourth block visualizes the computed density surfaces and, optionally, the corresponding data samples.

Code Fragment 2. Dispatcher of the DS module

1. IF new(vDataset) THEN
   1.1 IF reset THEN vPDF = vSurfaces = vPlacement = ∅ END IF
   1.2 IF bConditional THEN
           Split vDataset according to the categorical attribute into D1, ..., Dk
       ELSE
           D1 = vDataset
       END IF
   1.3 FOR EACH dataset Di DO
           vPDF = vPDF ∪ estimatePDF(Di, EstErr, InitG)
           vPlacement = vPlacement ∪ cPlacement
       END FOR
   END IF
2. IF bAnimate THEN
   2.1 animateDSLevel += 1.0 / animateDSFrames
   2.2 IF animateDSLevel >= 1.0 THEN animateDSLevel = 1.0 / animateDSFrames END IF
   2.3 vSurface = ∅
   2.4 FOR EACH vPDFi DO
           vSurface = vSurface ∪ calculateDS(vPDFi, animateDSLevel, iDSGridSize, bEqualized)
       END FOR
   2.5 Request for calculation
   END IF
3. IF !bAnimate AND changed(fDSLevel) THEN
   3.1 FOR EACH vPDFi DO
           vSurface = vSurface ∪ calculateDS(vPDFi, fDSLevel, iDSGridSize, bEqualized)
       END FOR
   END IF
4. visualizeObs()
4 Data Analyses and Interpretation

This section discusses and illustrates the four key interaction techniques: animation, conditional analyses, equalization, and windowing. These are general techniques that support the interpretation of the data and are useful in other settings as well. Figure 7 illustrates the challenges the data mining process has to deal with. Neither the visualization of the entire data set (Figure 7(b)) nor the visualization of two domains only (Figure 7(c)) provides detailed insights. We discuss how the above techniques help to investigate and interpret this data. Below we use I for the interpretation of a visualization. We use I to relate different visualization techniques. Specifically, we formulate theorems that relate the information content of different density surface visualizations.
[Figure 7: (a) Warehouse, drawn over a Time hierarchy (2001, 2002; quarters; months) and a Domain hierarchy (.dk: Telia, TDC, AUC, AAU; .de; .com; .fr; ...); (b) All data; (c) .dk and .com]
Fig. 7. Direct Visualization of a Clickstream Data Warehouse
4.1 Animation

An animation of density surfaces is a (cyclic) visualization of a sequence of density surfaces with a varying density level. For example, if the number of frames is chosen to be four, then the density surfaces at levels α1 = 1/5, α2 = 2/5, α3 = 3/5, and α4 = 4/5 will be visualized in sequence. Since the printed version cannot show the animation, we present the animation of density surfaces as a sequence of snapshots at different density levels. Figure 8 shows an animated solid sphere (a 3D normal distribution)1.
(a) α = 1/5
(b) α = 2/5
(c) α = 3/5
(d) α = 4/5
Fig. 8. Density Surfaces at Different Density Levels α
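Such an animation can be prototyped with off-the-shelf tools: estimate the PDF on a grid and extract one isosurface per frame. The sketch below uses SciPy's kernel estimator and scikit-image's marching cubes, which are our substitutes for the 3DVDM implementation:

Code Sketch (Python): animated density surface levels

import numpy as np
from scipy.stats import gaussian_kde
from skimage.measure import marching_cubes

rng = np.random.default_rng(1)
data = rng.normal(0.5, 0.1, size=(3, 2000))      # a 3D normal cloud ("sphere")

kde = gaussian_kde(data)                          # kernel density estimate
g = np.linspace(0.0, 1.0, 30)
xx, yy, zz = np.meshgrid(g, g, g, indexing="ij")
pdf = kde(np.vstack([xx.ravel(), yy.ravel(), zz.ravel()])).reshape(xx.shape)

frames = 4
for i in range(1, frames + 1):
    alpha = i / (frames + 1)                      # levels 1/5, 2/5, 3/5, 4/5
    verts, faces, _, _ = marching_cubes(pdf, level=alpha * pdf.max())
    print(f"alpha = {alpha:.2f}: surface with {len(verts)} vertices")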
Theorem 1. Animated density surfaces are strictly more informative than any individual density surface: ∀k: I[∪i DS(D, αi)] ⊃ I[DS(D, αk)].

Figure 8, which exhibits a very simple and regular data distribution, illustrates Theorem 1 quite nicely. It is necessary to look at the animated density surfaces to confirm the normal distribution. None of the four snapshots would allow us to infer the same interpretation as the four snapshots together. Note that for any given density level αk it is possible to construct a dataset D′ such that I[DS(D, αk)] = I[DS(D′, αk)] and ∀i ≠ k: I[DS(D, αi)] ≠ I[DS(D′, αi)]. Figure 9 shows animated density surfaces for a subset of the clickstream data set. The figure shows the clicks from the following domains: Italy (30%), Germany (28%), the UK (22%), and Spain (20%).
Because of rescaling the sphere is deformed and resembles an ellipsoid.
(a) α = 0.2
(b) α = 0.3
(c) α = 0.5
(d) α = 0.9
Fig. 9. .it, .de, .uk, .es
The four countries are very different in culture and working style. The density surfaces nicely reflect this. For example, in Spain people have a long break (siesta) in the middle of the day. The siesta divides the density surface for Spain into two similarly sized surfaces (cf. S1 and S2, Figure 9(c)). Italy produces most of the clicks during the second part of the day (cf. surface I, Figure 9(c)). In contrast, most of the UK clicks are received in the first part of the day (cf. surface U1, Figure 9(c)). The German domain is bound by the working hours and peaks in the afternoon. Figure 9(d) shows the peaks for the individual domains.

4.2 Conditional Density Surfaces

Conditional density surfaces are the result of splitting a data set into several subsets and computing a model (i.e., density surfaces) for each of them. A common approach is to split a data set according to the values of a categorical attribute (e.g., sex, country, etc.). An example of conditional density surfaces is shown in Figure 10. The data set contains two normal distributions. A binary attribute W was added to assign the observations to the clusters. All observations that fall into cluster A have W = 1, while observations that fall into cluster B have W = 2. Figure 10(c) shows an unconditional density surface: a single density surface encloses both structures in space. Figure 10(b) shows conditional density surfaces: the conditional analysis separates the data points and yields two independent (but possibly intersecting) density surfaces.
(a) The Dataset
(b) Conditional
(c) Unconditional
Fig. 10. Difference between Conditional and Unconditional Analysis
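Computing conditional density surfaces then reduces to splitting the data on the categorical attribute W and estimating one density per subset; a sketch (again with a stock kernel estimator standing in for the streaming one, and with invented sample data):

Code Sketch (Python): conditional density estimates

import numpy as np
from scipy.stats import gaussian_kde

def conditional_pdfs(points, w):
    # One density estimate per value of the categorical attribute W;
    # `points` is (n, 3), `w` the length-n attribute vector.
    return {v: gaussian_kde(points[w == v].T) for v in np.unique(w)}

rng = np.random.default_rng(2)
a = rng.normal(0.3, 0.05, size=(500, 3))   # cluster A -> W = 1
b = rng.normal(0.7, 0.05, size=(500, 3))   # cluster B -> W = 2
pts = np.vstack([a, b])
w = np.repeat([1, 2], 500)
pdfs = conditional_pdfs(pts, w)   # two independent surfaces instead of one hull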
Theorem 2. Let D1, ..., Dk be independent data sets with identical schemas, and let D = ∪_{i=1..k} π[Di, i](Di) be the union of these data sets with schema Di ∪ {W}. If a conditional data set contains independent structures then an appropriate conditional analysis is more informative than an unconditional analysis: I[∪i DS(σ[W = i](D), α)] ⊃ I[DS(D, α)].

Basically, a conditional analysis will yield the same result as an (unconditional) analysis of the individual data sets. If the data sets are independent this is the optimal result. In practice, conditional analyses can also be useful if independence is not achieved completely. For example, it can be appropriate to treat the clicks from different countries as being independent even if this is not completely accurate. Often it is best to try out both types of analysis, which means that switching between conditional and unconditional analyses should be supported by the interaction tools available to the data analyst.

4.3 Equalization of Structures

Many real-world data sets contain very skewed data. For example, it is likely that a web server will handle a huge amount of local traffic. This does not necessarily mean that non-local domains are less important for the data analysis. In order to support the exploration of not equally pronounced structures it is necessary to equalize the structures before they are displayed. The data set in Figure 11 contains two structures: a spiral (80% of all observations) and a sphere (20% of all observations). In Figure 11(a) we chose the number of observations that yields the best visualization of the spiral. The result is a scatter plot that does not show the sphere. In Figure 11(b) we increase the number of observations until the sphere can be identified. This yields a scatter plot that makes it difficult to identify the spiral. Clearly, it is much easier to identify the structures if equalized density surfaces are used (cf. Figure 11(c)).
(a) n = 3 000
(b) n = 100 000
(c) Equalized DS
Fig. 11. Scatterplots versus Equalized Density Surfaces
Since the densities of the structures are very different, an overall best density level does not exist, as illustrated in Figure 12(a). A low density level yields an appropriate sphere but only shows a very rough contour of the spiral. A high density level shows a nice spiral but does not show the sphere at all. Basically, we experience the same problems as with the scatterplot visualization.
(a) Non-Equalized Density Surfaces
(b) Lazy Equalization (density levels 1/10·max1 and 1/10·max2 for Structure 1 and Structure 2)
Fig. 12. Equalization of Structures
The density information makes it fairly straightforward to equalize the two structures. The idea is illustrated in Figure 12(b). The structures are equalized by adaptively (i.e., based on the density of a structure) calculating the density level for each structure (e.g., 10% of the maximum density of each structure). Equalization requires a separation of structures. This comes for free if we use conditional density surfaces. Equalization and conditional density surfaces therefore complement each other nicely. With unconditional analyses the separation has to be implemented as a separate process. Note that, in contrast to density surfaces, scatterplot visualizations are not easy to equalize. In principle, one has to determine the optimal number of observations that shall be visualized for each structure. This number depends on many factors and cannot be determined easily.

Theorem 3. Any data set D can be completely dominated: ∀D ∃D′: I[DS(D′, α)] = I[DS(D ∪ D′, α)].

Figure 12(a) illustrates how the spiral dominates the sphere. Here the domination is not complete because there is still a density level at which the sphere can be identified. If the density of the spiral were increased further, the sphere would eventually be treated as noise.

Theorem 4. Equalization preserves the interpretation of individual structures: ∀D, D′: I[DS_eq(D ∪ D′, α)] = I[DS(D, α)] ∪ I[DS(D′, α)].

The preservation is illustrated in Figure 11(c). Here the two structures are shown as if each of them were analyzed individually.

4.4 Windowing

The window functionality is used if we want to investigate a subset of the data in more detail. Technically, the window function filters out the observations that fall outside the selected window and re-scales the data to the unit cube. Conceptually, the window functionality enables an investigation at the micro level (a "smart" zoom into the dataset). In contrast to standard zooming techniques, windowing allows the analyst to zoom in on a single dimension. For example, it is possible to restrict the analysis to a few domains but preserve the complete time dimension. Note that windowing must trigger a recalculation of the model.
This makes it possible that a single surface at the macro level splits into several surfaces at the micro level, as one would expect. Figure 13 is a direct visualization of the clickstream and illustrates the problem. Because the coding preserves the natural ordering of the data, all frequent domains (.dk, .com, etc.) are clustered at the beginning of the axis. This yields a very skewed plot that can only be used for a simple overall load analysis: the work-load is high during the very early morning hours and the local working hours (surfaces A and B in Figure 13(c)), and the work-load peaks around 8 PM (surface C in Figure 13(d)).
(a) Scatterplot
(b) α = 0.2
(c) α = 0.5
(d) α = 0.9
Fig. 13. The Sunsite Dataset (Natural Ordering)
What we want is that the cluster starts to "unfold" as we move closer, i.e., we expect the domains to separate. Standard zooming techniques do not provide this functionality directly. Typically, when moving closer we not only narrow the IP domain but also lose the overview of the time domain. Moreover, zooming does not trigger a re-computation of the model, which means that the surfaces will simply be stretched. Figure 14 illustrates the idea of the windowing functionality. A simple windowing tool is shown in Figure 14(b). The menu allows the analyst to select the domains that shall be visualized (i.e., the window). It supports dimension hierarchies and does not require that the selected domains are contiguous. The effect of zooming in is illustrated in Figure 14(c). The original cluster of points has unfolded and it is now possible to analyze selected domains/regions.
(a) The Scatter Plot
(b) α = 0.2
(c) α = 0.5
Fig. 14. The Clickstream Data from www.sunsite.dk
Another useful feature of windowing is re-ordering. If the data analyst wants to change the ordering of the domains this can be very easily done using the tool shown in Figure 14(b).
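The core of the windowing operation, filtering on per-dimension ranges and re-scaling the selected window to the unit cube, is compact; the sketch below is ours and only indicates the model re-computation that must follow:

Code Sketch (Python): the windowing operation

import numpy as np

def window(points, lo, hi):
    # Keep observations inside [lo, hi] per dimension and re-scale the
    # selected window to the unit cube. The model (PDF and density
    # surfaces) must then be recomputed on the returned data.
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    mask = np.all((points >= lo) & (points <= hi), axis=1)
    return (points[mask] - lo) / (hi - lo)

# Zoom in on a single dimension: narrow axis 0 (e.g., the domain axis),
# keep axis 1 (e.g., the time axis) complete -- the axis roles are assumed.
pts = np.random.default_rng(3).random((1000, 2))
zoomed = window(pts, lo=[0.0, 0.0], hi=[0.2, 1.0])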
5 Summary and Future Work

We investigated the potential of density surfaces of 3D data to analyze clickstream data. Density surfaces accurately summarize the data and yield a good model. Density surfaces are simple visual structures that are easy to perceive and support the interpretation of the data. We showed that animation, conditional analyses, equalization, and windowing are crucial interaction techniques. They make it possible to explore the data at different granularity levels, which leads to a robust interpretation. In the future it would be interesting to classify density surfaces and come up with an alphabet. The analysis of high-dimensional data is another interesting issue. One could, for example, investigate projection techniques before the density surfaces are computed, or the (moderate) use of additional visual properties.
References

1. Agrawal, R., Mehta, M., Shafer, J.C., Srikant, R., Arning, A., Bollinger, T.: The Quest data mining system. In: Proceedings of ACM SIGKDD 1996, pp. 244–249. AAAI Press, Menlo Park (1996)
2. Brunk, C., Kelly, J., Kohavi, R.: MineSet: an integrated system for data mining. In: Proceedings of SIGKDD 1997, pp. 135–138. AAAI Press, Menlo Park (1997)
3. Cerrito, P.B.: Introduction to Data Mining Using SAS Enterprise Miner. SAS Publishing (2006)
4. Davidson, I., Ward, M.: A particle visualization framework for clustering and anomaly detection. In: Proceedings of the Workshop on Visual Data Mining, in conjunction with SIGKDD (2001)
5. van den Eijkel, G.C., van der Lubbe, J.C.A., Backer, E.: A modulated Parzen-windows approach for probability density estimation. In: IDA (1997)
6. Farmen, M., Marron, J.S.: An assessment of finite sample performance of adaptive methods in density estimation. Computational Statistics and Data Analysis (1998)
7. Gross, M.H., Sprenger, T.C., Finger, J.: Visualizing information on a sphere. In: Visualization (1997)
8. Guha, S., Rastogi, R., Shim, K.: CURE: an efficient clustering algorithm for large databases. In: Proceedings of SIGMOD, pp. 73–84 (1998)
9. Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: Proceedings of SIGMOD, pp. 47–57. ACM Press, New York (1984)
10. Hinneburg, A., Keim, D.A.: Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. The VLDB Journal, 506–517 (1999)
11. SPSS Inc.: Data mining system: Clementine 12.0 (2008)
12. Keahey, T.A.: Visualization of high-dimensional clusters using nonlinear magnification. In: Proceedings of SPIE Visual Data Exploration and Analysis (1999)
13. Kimball, R.: The Data Warehouse Toolkit. John Wiley & Sons, Inc., Chichester (1996)
14. Kimball, R., Merz, R.: The Data Webhouse Toolkit: Building the Web-Enabled Data Warehouse. Wiley Computer Publishing, Chichester (2000)
15. Mazeika, A., Böhlen, M., Mylov, P.: Density surfaces for immersive explorative data analyses. In: Proceedings of the Workshop on Visual Data Mining, in conjunction with SIGKDD (2001)
16. Nagel, H.R., Granum, E., Musaeus, P.: Methods for visual mining of data in virtual reality. In: Proceedings of the International Workshop on Visual Data Mining, in conjunction with ECML/PKDD 2001 (2001)
17. Robinson, J.T.: The K-D-B-tree: a search structure for large multidimensional dynamic indexes. In: Lien, Y.E. (ed.) Proceedings of SIGMOD, pp. 10–18. ACM Press, New York (1981)
18. Scott, D.W.: Multivariate Density Estimation. Wiley & Sons, New York (1992)
19. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman & Hall, London (1986)
20. Swayne, D.F., Lang, D.T., Buja, A., Cook, D.: GGobi: evolving from XGobi into an extensible framework for interactive data visualization. Comput. Stat. Data Anal. 43(4), 423–444 (2003)
21. Sprenger, T.C., Brunella, R., Gross, M.H.: H-BLOB: a hierarchical visual clustering method using implicit surfaces. In: Visualization (2000)
22. Wang, W., Yang, J., Muntz, R.R.: STING: a statistical information grid approach to spatial data mining. The VLDB Journal, 186–195 (1997)
23. Wand, M.P., Jones, M.C.: Kernel Smoothing. Chapman & Hall, London (1995)
24. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: Proceedings of SIGMOD, pp. 103–114 (1996)
Form-Semantics-Function – A Framework for Designing Visual Data Representations for Visual Data Mining

Simeon J. Simoff

School of Computing and Mathematics, College of Health and Science, University of Western Sydney, NSW 1797, Australia
[email protected]
Abstract. Visual data mining, as an art and science of teasing meaningful insights out of large quantities of data that are incomprehensible in any other way, requires consistent visual data representations (information visualisation models). The frequently used expression "the art of information visualisation" appropriately describes the situation. Though substantial work has been done in the area of information visualisation, it is still a challenging activity to identify the methods, techniques and corresponding tools that support visual data mining of a particular type of information. The comparison of visualisation techniques across different designs is not a trivial problem either. This chapter presents an attempt at a consistent approach to the formal development, evaluation and comparison of visualisation methods. The application of the approach is illustrated with examples of visualisation models for data from the area of team collaboration in virtual environments and from the results of text analysis.
1 Introduction

In visual data mining, large and normally incomprehensible amounts of data are reduced and compactly represented visually through the use of visualisation techniques based on a particular metaphor or a combination of several metaphors. For instance, digital terrain models, based on the landscape metaphor and various geographical frameworks [1, 2], and CAD-based architectural models of cities, based on the metaphor of urban design, take these metaphors into information visualisation. Their popularity has demonstrated that multi-dimensional visualisation can provide a superior means for exploring large data sets, communicating model results to others and sharing the model [3]. Techniques based on animation and various multimedia support [4] appear frequently on the research and development radar [5]. Dr Hans Rosling1, a professor in global health at Sweden's Karolinska Institute, has markedly demonstrated the power of animation and visual computing [6, 7] for visually extracting knowledge out of publicly available data, often drawn from United Nations sources.
http://www.ted.com/
The design of visualisation models for visual data mining is, in a broad sense, the definition of the rules for the conversion of data into graphics. Generally, the visualisation of large volumes of abstract data, known as 'information visualisation', is closely related, but sometimes contrasted, to scientific visualisation, which is concerned with the visualisation of (numerical) data used to form concrete representations [7]. The frequently used expression "the art of visualisation" appropriately describes the state of research in that field. Currently, it is a challenging activity for information designers to find out which methods, techniques and corresponding tools are available to visualise a particular type of information. The comparison of visualisation techniques across different designs is not a trivial problem [8]. Partially, the current situation is explained by the lack of generic criteria for assessing the value of visualisation models. This is a research challenge, since most people develop their own criteria for what makes a good visual representation. The design of visualisation schemata is dominated by individual points of view, which has resulted in a considerable variety of ad hoc techniques [5]. A recent example is the visualisation of association rules proposed in [9]. On the other hand, an integral part of information visualisation is the evaluation of how well humans understand visualised information as part of their cognitive tasks and intellectual activities, the efficiency of information compression, and the level of cognitive overload. [10] has investigated some aspects of developing visualisation schemata from a cognitive point of view. With the increasing attention towards the development of interactive visual data mining methods, a more systematic approach towards the design of visualisation techniques is getting on the "to do" list of data mining research. The success of a visual data mining method depends on the development of an adequate computational model of the selected metaphor. This is especially important in the context of communicating and sharing discovered information, and in the context of the emerging methods of computer-mediated collaborative data mining (CMCDM). This new visual data mining methodology is based on the assumption that individuals usually may respond with different interpretations of the same information visualisation [11]. A central issue is the communicative role of abstract information visualisation components in collaborative environments for visual data mining. In fact, "miners" can become part of the visualisation. For example, in virtual worlds this can happen through their avatars2 [12]. In a virtual world, a collaborative perspective is inevitable, thus a shared understanding of the underlying mapping between the semantic space and the visualisation scheme [13] becomes a necessary condition in the development of these environments. The results of CMCDM are heavily influenced by the adequate formalisation of the metaphors that are used to construct the virtual environment, i.e. the visualisation schemata can influence the behavior of collaborators, the way they interact with each other, the way that they reflect on the changes and manipulations of visualisations, and, consequently, their creativity in the discovery process. Currently, it is a challenging task for designers of visual data mining environments to find the strategies, methods and corresponding tools to visualise a particular type of information.
3D representations of people and autonomous software agents in virtual worlds. Avatar is an ancient Sanskrit term meaning 'a god's embodiment on the Earth'.
Mapping characteristics of data into a visual representation in virtual worlds is one promising way to make the discovery of relations encoded in this data possible. The model of a semantically organised place for visual data exploration can be useful for the development of computer support for visual information querying and retrieval [14, 15] in collaborative information filtering. The development of a representational and computational model of the selected metaphor(s) for data visualisation will assist the design of virtual environments dedicated to visual data exploration. The formal approach presented in this chapter is based on the concept of semantic visualisation, defined as a visualisation method which establishes and preserves the semantic link between form and function in the context of the visualisation metaphor. Establishing a connection between form and functionality is not a trivial part of the visualisation design process. In a similar way, selecting the appropriate form for representing data graphically, whether the data consists of numbers or text, is not a straightforward procedure, as numbers and text descriptions do not have a natural visual representation. On the other hand, how data are represented visually has a powerful effect on how the structure and hidden semantics in the data are perceived and understood. An example of a virtual world which attempts to visualise an abstract semantic space is shown in Fig. 1.
[Figure 1: (a) a cylinder spike marks one of the 20 documents that have the highest document-query relevance score; (b) papers from different years are colored in different colors]

Fig. 1. An example of visualising abstract semantic structure encoded in publication data utilising virtual worlds metaphor and technology (adapted from [16])
Though the work appeared a decade ago, the way the visual representation model has been designed is a typical example of the problems that can be encountered. The visualisation of the semantic space of the domain of human-computer interaction is automatically constructed from a collection of papers from three consecutive ACM CHI conference proceedings [5]. The idea of utilising virtual worlds comes from research in cognitive psychology, claiming that we develop cognitive (mental) maps in order to navigate within the environments where we operate. Hence the overall "publication" landscape is designed according to the theory of cognitive maps (see [17] in support of the theory of cognitive maps). In such a world, there is a variety of possibilities for data exploration. For example, topic areas of papers are represented by coloured spheres. If a cluster of spheres includes every colour but one, this suggests that the particular topic area, represented by the missing coloured sphere, has not been addressed by the papers during that year. However, without background knowledge of the semantics of the coloured spheres, the selected information visualisation scheme does not provide cues for understanding and interpreting the environment landscape. It is not clear how the metaphor of a "landscape" has been formalised and
represented, and what the elements of the landscape are. Associatively, this visualisation is closer to the visualisation of molecular structures. Semantic visualisation is considered in the context of two derivatives of visualisation: visibilisation and visistraction [18]. Visibilisation is visualisation focusing on presentation and interpretation which complies with a rigorous mapping from physical reality. By contrast, visistraction is the visualisation of abstract concepts and phenomena which do not have a direct physical interpretation or analogy. Visibilisation has the potential to bring key insights by emphasising aspects that were unseen before. The dynamic visualisation of the heat transfer during the design of the heat-dissipating tile cover of the underside of the space shuttle is an early example of the application of visibilisation [19]. Visistraction can give a graphic depiction of intuition regarding objects and relationships. The 4D simulation of data flow is an example of visistraction which provides insights impossible without it. In a case-based reasoning system, visistraction techniques can be used to trace the change of relationships between different concepts with the addition of new cases. Both kinds of semantic visualisation play an important role in visual data mining. However, semantic visualisation remains a hand-crafted methodology, where each case is considered separately. This chapter presents an attempt to build a consistent approach to semantic visualisation based on a cognitive model of metaphors, metaphor formalisation and evaluation. We illustrate the application of this approach with examples from the visistraction of communication and collaboration data. Further, the chapter presents the Form-Semantics-Function framework for the construction and evaluation of visualisation techniques, an example of the application of the framework towards the construction of a visualisation technique for identifying patterns of team collaboration, and an example of the application of the framework for the evaluation and comparison of two visualisation models. The chapter concludes with the issues of building visualisation models that support collaborative visual data mining.
2 Form-Semantics-Function: A Formal Approach Towards Constructing and Evaluating Visualisation Techniques

The Form-Semantics-Function (FSF) approach includes the following steps: metaphor analysis, metaphor formalisation, and metaphor evaluation. Through the use of metaphor, people express the concepts in one domain in terms of another domain [20, 21]. The closest analogy is VIRGILIO [22], where the authors proposed a formal approach for constructing metaphors for visual information retrieval. The FSF framework develops further the formal approach towards constructing and evaluating visualisation techniques, approaching the metaphor in an innovative way.

2.1 Metaphor Analysis

During metaphor analysis, the content of the metaphor is established. In the use of metaphor in cognitive linguistics, the terms source and target3 refer to the conceptual spaces connected by the metaphor.
In the research literature the target is variously referred to as the primary system or the topic, and the source is often called the secondary system or the vehicle.
The target is the conceptual space that is being described, and the source is the space that is being used to describe the target. In this mapping, the structure of the source domain is projected onto the target domain in a way that is consistent with the inherent target domain structure [21, 23]. In the context of semantic visualisation, the consistent use of metaphor is expected to bring an understanding of a relatively abstract and unstructured domain in terms of more concrete and structured visual elements through the visualisation schemata. An extension of the source-target mapping, proposed by [24], includes the notions of generic space and blend space. Generic space contains the skeletal structure that applies to both source and target spaces. The blend space often includes structure not projected to it from either space, namely emergent structure of its own. The ideas and inspirations developed in the blend space can lead to a modification of the initial input spaces and change the knowledge about those spaces, i.e. change and evolve the metaphor. The process is called conceptual blending; it is the essence of the development of semantic visualisation techniques. In the presented approach, the form-semantics-function categorisation of the objects being visualised is combined with the model of [24]. The form of an object can express the semantics of that object, that is, the form can communicate implicit meaning understood through our experiences with that form. From the form in the source space we can connect to a function in the target space via the semantics of the form. The resultant model is shown in Fig. 2.
[Figure 2: the source space (FORM) and the target space (FUNCTION) are linked through the generic space (COMMON SEMANTICS) and the blended space (NEW SEMANTICS)]
Fig. 2. A model for metaphor analysis for constructing semantic visualisation schemata
The term "visual form" refers to the geometry, colour, texture, brightness, contrast and other visual attributes that characterise and influence the visual perception of an object. Thus, the source space is the space of 2D and 3D shapes and the attributes of their visual representation. "Functions" (the generalisations of patterns, discovered in data) are described using concepts from the subject domain. Therefore the target space includes such concepts associated with the domain functions. This space is constructed from the domain vocabulary. The actual transfer of semantics has two components - the common semantics, which is carried by notions that are valid in both domains and what is referred as new semantics - the blend, which establishes the unique characteristics revealed by the correspondence between the form metaphor and functional characteristics of that form. The schema illustrates how metaphorical inferences produce parallel knowledge structures.
2.2 Metaphor Formalisation

The common perception of the word "formalisation" is connected with the derivation of formulas and equations that describe a phenomenon in analytical form. In this case, formalisation is used to describe a series of steps that ensure the correctness of the development of the representation of the metaphor. Metaphor formalisation in the design of semantic visualisation schemes includes the following basic steps:
• Identification of the source and target spaces of the metaphor - the class of forms and the class of features or functions that these forms will represent;
• Conceptual decomposition of the source and target spaces, which produces the set of concepts that describe both sides of the metaphor mapping. As a rule, metaphorical mappings do not occur isolated from one another. They are sometimes organized in hierarchical structures, in which 'lower' mappings in the hierarchy inherit the structures of the 'higher' mappings. In other words, this means that visualisation schemes which use metaphor are expected to preserve the hierarchical structures of the data that they display. In visistraction, these are the geometric characteristics of the forms from the source space, other form attributes like colours, line thickness, shading, etc., and the set of functions and features in the target space associated with these attributes and variations;
• Identifying the dimensions of the metaphor along which the metaphor operates. These dimensions constitute the common semantics. In visistraction these can be, for instance, key properties of the form, like symmetry and balance with respect to the center of gravity, that transfer semantics to the corresponding functional elements in the target domain;
• Establishing semantic links, relations and transformations between the concepts in both spaces, creating a resemblance between the forms in the source domain and the functions in the target domain.

2.3 Metaphor Evaluation

In spite of the large number of papers describing the use of the metaphor in the design of computer interfaces and virtual environments, there is a lack of formal evaluation methods. In the FSF framework, metaphor evaluation is tailored following the model of [25], illustrated in Fig. 3.
Fig. 3. Model for evaluating metaphor mapping (based on [25])
The "V" and "F" are labels for visualisation and function features, respectively. The "VF" label with indices denotes numbers of features, namely: • V+ F+ - function features that are mapped to the visualisation schema; • V− F+ - function features that are not supported by the visualisation schema; • V+ F− - features in the visualisation schema, not mapped to the functional features. The ratio
V− F+ provides an estimate of the quality of the metaphor used for the V + F+
visualisation - the smaller the better. The elements of the Form-Semantics-Function approach are illustrated in the following examples. The first example illustrates metaphor analysis and formalisation for the creation of visualisation form and mapping it to the functional features. In the second example two different forms for visualising the same set of functional features are considered.
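The bookkeeping behind this ratio is simple enough to make explicit; the feature sets in the sketch below are invented for illustration:

Code Sketch (Python): the metaphor quality ratio

def metaphor_quality(function_features, visual_mapping):
    # |V-F+| / |V+F+|: unsupported function features over mapped ones;
    # the smaller the ratio, the better the metaphor.
    mapped = function_features & visual_mapping        # V+F+
    unsupported = function_features - visual_mapping   # V-F+
    return len(unsupported) / len(mapped)

f = {"participation", "topic", "depth", "intensity"}   # invented features
v = {"participation", "topic", "depth"}                # what the form encodes
print(metaphor_quality(f, v))                          # 1/3, about 0.33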
3 Constructing Visualisation Schema for Visual Data Mining for Identifying Patterns in Team Collaboration

Asynchronous communication is an intrinsic part of computer-mediated teamwork. Among the various models and tools supporting this communication mode [13], perhaps the most popular in teamwork are bulletin (discussion) boards. These boards support multi-thread discussion, automatically archiving the communication content. One way to identify patterns in team collaboration is via content analysis of team communications. However, it is difficult to automate such analysis; therefore, especially in large-scale projects, monitoring and analysis of collaboration data can become a cumbersome task. In the research on virtual design studios [13, 26], two extremes in team collaboration have been identified (labeled "Problem comprehension" and "Problem division"), shown in Fig. 4. In the "Problem comprehension" collaborative mode, the resultant project output (a product, a solution to a problem, etc.) is the product of a continued attempt to construct and maintain a shared conception and understanding of the problem. In other words, each of the participants develops their own view over the whole problem, and the shared conception is established during the collaborative process via intensive information exchange. In the "Problem division" mode, the problem is divided among the participants in a way where each person is responsible for a particular portion of the investigation of the problem. Thus, it does not necessarily require the creation of a single shared conception and understanding of the problem.
Problem division
Problem comprehension
Fig. 4. Two extremes in team collaboration
Form-Semantics-Function – A Framework for Designing Visual Data Representations
37
The two modes of collaboration are two extreme cases. In general, the real case depends on the complexity of the problem. A key assumption in mining and analysis of collaboration data is that these two extreme styles should somehow be reflected in the communication of the teams. Thus, different patterns in team communication on the bulletin board will reflect different collaboration modes. Fig. 5 shows a fragment of a team bulletin board.

Fig. 5. Bulletin board fragment with task-related messages, presented as indention graph
The messages on the board are grouped in threads. [27, 28] propose a threefold split of the thread structure of e-mail messages in discussion archives in order to explore the interactive threads. It includes (i) reference-depth: how many references were found in a sequence before this message; (ii) reference-width: how many references were found which referred to this message; and (iii) reference-height: how many references were found in a sequence after this message. In addition to the threefold split, [29] included the time variable explicitly. Fig. 6 shows the formal representation of the bulletin board fragment in Fig. 5.
Fig. 6. Formal representation of the thread structure in the fragment presented in Fig. 5
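The threefold split can be computed from the reply-to relation alone. A minimal Python sketch (the helper name is ours; the message identifiers follow the Mij labels of Fig. 6):

def thread_measures(parent):
    # parent maps a message id to the id it replies to (None for thread roots)
    children = {}
    for msg, par in parent.items():
        children.setdefault(par, []).append(msg)

    def depth(msg):   # reference-depth: replies in sequence before this message
        return 0 if parent[msg] is None else 1 + depth(parent[msg])

    def height(msg):  # reference-height: longest reply chain after this message
        kids = children.get(msg, [])
        return 0 if not kids else 1 + max(height(k) for k in kids)

    # reference-width is the number of direct references to the message
    return {m: (depth(m), len(children.get(m, [])), height(m)) for m in parent}

# the three-message thread M13 <- M23 <- M33 of Fig. 6
print(thread_measures({"M13": None, "M23": "M13", "M33": "M23"}))
# {'M13': (0, 1, 2), 'M23': (1, 1, 1), 'M33': (2, 0, 0)}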
3.1 Metaphor Analysis

Fig. 7 shows the Form-Semantics-Function mapping at the metaphor formalisation stage as a particular case of the [24] model applied to the visualisation of communication utterances data. The source space in this case is the space of 2D geometric shapes, rectangles in particular. The target space includes the concepts associated with the functions that are found in the analysis of a collaborative design session.
Fig. 7. The Form-Semantics-Function mapping for the source space of nested rectangles and the target space of bulletin board discussion threads
3.2 Metaphor Formalisation

Below is a brief illustration of the metaphor formalisation in this example.

• Identification of the source and target spaces of the metaphor - rectangles are the forms that will be used, and the class of features or functions that these forms will represent are the messages on a bulletin board;

• Conceptual decomposition of the source and target spaces leads to the notion of nested rectangles, whose centers of gravity coincide, with possible variation of the thickness of their contour lines and the background color. Each rectangle corresponds to a message within a thread. A rectangle that corresponds to a message at level (n + 1) is placed within the rectangle that corresponds to a message at level n. Messages at the same level are indicated by a one-step increase of the thickness of the contour line of the corresponding rectangle. Thus, a group of nested rectangles can represent several threads in a bulletin board discussion;

• Identifying the dimensions of the metaphor - visual balance and the "depth" or "perspective" of the nested rectangles are the dimensions of the metaphor, transferring via the visual appearance the semantics of different communication styles;

• Establishing semantic links, relations and transformations - this is connected with the identification of typical form configurations that correspond to typical patterns of collaboration. For example, Fig. 8 illustrates two different fragments A and B (each of one thread). Fig. 9 illustrates the visualisation of these fragments according to the developed visualisation schema.
Fig. 8. Bulletin board fragment with task-related messages, presented as indention graph

Fig. 9. Visualisation of fragments A and B in Fig. 8
Fig. 10. Communication patterns corresponding to different collaboration styles: (a) patterns typical for "problem division" collaboration; (b) patterns typical for "problem comprehension" collaboration
The visualisation schema has been used extensively in communication analysis. Fig. 10 illustrates communication patterns corresponding to different collaboration styles. An additional content analysis of communication confirmed the correct identification of collaboration patterns.
4 Evaluation and Comparison of Two Visualisation Schemata

We illustrate the idea by evaluating examples of semantic visualisation of textual data and objects in virtual environments. The role of visistraction in concept relationship analysis is to assist the discovery of the relationships between concepts, as reflected in the available text data. The analysis uses word frequencies, their co-occurrence and other statistics, and cluster analysis procedures. We investigate two visual metaphors, "Euclidean space" and "Tree", which provide a mapping from the numerical statistics and cluster analysis data into the target space of concepts and relations between them. The visualisation features for both metaphors and the function features of the target space are shown in Table 1. Examples of the two visualisation metaphors operating over the same text data set are shown in Fig. 11 and Fig. 12, respectively.
Table 1. Visualisation and function features

Visualisation features of the Euclidean space metaphor: point; alphanumeric single-word point labels; axes; plane; color; line segment.

Visualisation features of the tree metaphor: nodes; alphanumeric multi-word node labels; signs "+" and "-"; branches; numeric labels for branches.

Function features: simple/complex concept; subject key word; hierarchical relationship; context link; link strength; synonymy; hyponymy.
Fig. 11. Visualisation of the 10 most frequent words in the description of the lateral load resisting system in one of the wide-span building cases(4)

(4) This visualisation is used in TerraVision, which is part of the CATPAC system for text analysis by Provalis Research Co.
Fig. 12. Visualisation of part of a semantic net(5) for the term "load resisting system"
The source text is the collection of building design cases, which is part of the SAM (Structure and Materials) case library and case-based reasoning system [30].

4.1 Metaphor Evaluation

The first scheme(6) maps the source domain of Euclidean space (coordinates of points in 2D/3D space) to the target domain of word statistics. The blending semantics is that the degree to which the terms are related to each other can be perceived visually from the distance between the corresponding data points: the closer the points, the tighter the relationship between the words. The second scheme(7) maps the source domain of the topology of linked nodes to the same target domain of word statistics. This mapping generates one of the possible visualisations of semantic networks. This visualisation includes nodes with single- and multiple-word labels, numeric values of each link between terms and the weight of the term among the other terms in the tree. The results of the comparison between the two metaphors are presented in Table 2 and Table 3.

Table 2. Visualisation support for function features in the Euclidean space and tree metaphors

Function feature            Euclidean space metaphor   Tree metaphor
Simple/complex concept      −                          +
Subject key word            +                          +
Hierarchical relationship   −                          +
Context link                −                          +
Link strength               −                          +
Synonymy                    −                          −
Hyponymy                    −                          −

The Euclidean space metaphor performs poorly for the visualisation of concept relationships. What is the meaning of the closeness of two points? It is difficult to make a steady judgement about what the relation is and whether we deal with simple (one-word) or complex (more-than-one-word) terms. The distance to the surface, proportional to the frequency of the words, can convey the message that a word is a key word. However, there is no feature in the visualisation which shows context links between words, the strength of these links or other relations between words.

Table 3. Comparison of the Euclidean space and tree metaphors

             Euclidean space metaphor   Tree metaphor
V+F+         1                          5
V−F+         6                          2
V−F+/V+F+    6                          0.4

(5) Semantic networks in our study are visualised in TextAnalyst by Megaputer Intelligence, Inc.
(6) The schema is used in the CATPAC qualitative analysis package by Terra Research Inc.
(7) The schema is used in TextAnalyst by Megaputer Intelligence (see www.megaputer.com).
5 Conclusion and Future Directions

The Form-Semantics-Function framework presented in this work is an attempt to develop a formal approach towards the use of metaphors in constructing consistent visualisation schemes. In its current development, the evaluation part of the framework does not include the analysis of the cognitive overload from the point of view of information design. Some initial work in that direction has been started in [31]. Currently the research on the FSF framework is being further developed in the context of supporting collaborative visual data mining in virtual worlds. The different perceptions of a visualisation model in such environments may increase the gap between individuals as they interact with it in a data exploration session. However, individual differences may lead to a potential variety of discovered patterns and insights in the visualised information across participants. Consequently, current research within the FSF framework is focused on exploring:

• whether people attach special meanings to abstract visualisation objects;
• the design criteria for visualisation objects engaged in visual data exploration, so that people can effectively construct and communicate knowledge in visual data mining environments;
• the necessary cues that should be supported in semantically organised virtual environments;
• how individual differences in visual perspectives can be channelled to stimulate the creation of "out of the box" innovative perspectives.
References 1. Hetzler, B., Harris, W.M., Havre, S., Whitney, P.: Visualising the full spectrum of document relationships, in Structures and Relations in Knowledge Organisation. In: Proceedings of the Fifth International Society for Knowledge Organization (ISKO) Conference, Lille, France (1998) 2. Hetzler, B., Whitney, P., Martucci, L., Thomas, J.: Multi-faceted insight through interoperable visual information analysis paradigms. In: Proceedings of the 1998 IEEE Symposium on Information Visualization. IEEE Computer Society, Washington, DC (1998) 3. Brown, I.M.: A 3D user interface for visualisation of Web-based data-sets. In: Proceedings of the 6th ACM International Symposium on Advances in Geographic Information Systems. ACM, Washington, D.C (1998) 4. Noirhomme-Fraiture, M.: Multimedia support for complex multidimensional data mining. In: Proceedings of the First International Workshop on Multimedia Data Mining (MDM/KDD 2000), in conjunction with Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining KDD 2000. ACM Press, Boston (2000) 5. Chen, C.: Information Visualization: Beyond the Horizon. Springer, London (2004) 6. Gross, M.: Visual Computing: The Integration of Computer Graphics. Springer, Heidelberg (1994) 7. Nielson, G.M., Hagen, H., Muller, H.: Scientific Visualization: Overviews, Methodologies, and Techniques. IEEE Computer Society, Los Alamitos (1997) 8. Chen, C., Yu, Y.: Empirical studies of information visualization: A meta-analysis. International Journal of Human-Computer Studies 53(5), 851–866 (2000) 9. Hofmann, H., Siebes, A.P.J.M., Wilhelm, A.F.X.: Visualizing association rules with interactive mosaic plots. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining KDD 2000. ACM, Boston (2000) 10. Crapo, A.W., Waisel, L.B., Wallace, W.A., Willemain, T.R.: Visualization and the process of modeling: A cognitive-theoretic approach. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining KDD 2000. ACM, New York (2000) 11. Snowdon, D.N., Greenhalgh, C.M., Benford, S.D.: What You See is Not What I See: Subjectivity in virtual environments. In: Proceedings Framework for Immersive Virtual Environments (FIVE 1995). QMW University of London, UK (1995) 12. Damer, B.: Avatars. Peachpit Press, an imprint of Addison Wesley Longman (1998) 13. Maher, M.L., Simoff, S.J., Cicognani, A.: Understanding virtual design studios. Springer, London (2000) 14. Del Bimbo, A.: Visual Information Retrieval. Morgan Kaufmann Publishers, San Francisco (1999) 15. Gong, Y.: Intelligent Image Databases: Towards Advanced Image Retrieval. Kluwer Academic Publishers, Boston (1998) 16. Börner, K., Chen, C., Boyack, K.: Visualizing knowledge domains. Annual Review of Information Science &Technology, 179–355 (2003) 17. Kumaran, D., Maguire, E.A.: The human hippocampus: Cognitive maps or relational memory? The Journal of Neuroscience 25(31), 7254–7259 (2005) 18. Choras, D.N., Steinmann, H.: Virtual reality: Practical applications in business and industry. Prentice-Hall, Upper Saddle River (1995) 19. Gore, R.: When the space shuttle finally flies. National Geographic 159, 317–347 (1981) 20. Lakoff, G., Johnson, M.: Metaphors We Live By. University of Chicago Press, Chicago (1980)
21. Lakoff, G.: The contemporary theory of metaphor. In: Ortony, A. (ed.) Metaphor and Thought, pp. 202–251. Cambridge University Press, Cambridge (1993)
22. L'Abbate, M., Hemmje, M.: VIRGILIO - The metaphor definition tool. Technical Report rep-ipsi-1998-15, European Research Consortium for Informatics and Mathematics at FHG (2001)
23. Turner, M.: Design for a theory of meaning. In: Overton, W., Palermo, D. (eds.) The Nature and Ontogenesis of Meaning, pp. 91–107. Lawrence Erlbaum Associates, Mahwah (1994)
24. Turner, M., Fauconnier, G.: Conceptual integration and formal expression. Journal of Metaphor and Symbolic Activity 10(3), 183–204 (1995)
25. Anderson, B., Smyth, M., Knott, R.P., Bergan, M., Bergan, J., Alty, J.L.: Minimising conceptual baggage: Making choices about metaphor. In: Cocton, G., Draper, S., Weir, G. (eds.) People and Computers IX, pp. 179–194. Cambridge University Press, Cambridge (1994)
26. Maher, M.L., Simoff, S.J., Cicognani, A.: Potentials and limitations of virtual design studios. Interactive Construction On-Line 1 (1997)
27. Berthold, M.R., Sudweeks, F., Newton, S., Coyne, R.: Clustering on the Net: Applying an autoassociative neural network to computer-mediated discussions. Journal of Computer Mediated Communication 2(4) (1997)
28. Berthold, M.R., Sudweeks, F., Newton, S., Coyne, R.: It makes sense: Using an autoassociative neural network to explore typicality in computer mediated discussions. In: Sudweeks, F., McLaughlin, M., Rafaeli, S. (eds.) Network and Netplay: Virtual Groups on the Internet, pp. 191–220. AAAI/MIT Press, Menlo Park, CA (1998)
29. Sudweeks, F., Simoff, S.J.: Complementary explorative data analysis: The reconciliation of quantitative and qualitative principles. In: Jones, S. (ed.) Doing Internet Research, pp. 29–55. Sage Publications, Thousand Oaks (1999)
30. Simoff, S.J., Maher, M.L.: Knowledge discovery in hypermedia case libraries - A methodological framework. In: Proceedings of the Fourth Australian Knowledge Acquisition Workshop AKAW 1999, in conjunction with the 12th Australian Joint Conference on Artificial Intelligence, AI 1999, Sydney, Australia (1999)
31. Chen, C.: An information-theoretic view of visual analytics. IEEE Computer Graphics and Applications 28(1), 18–23 (2008)
A Methodology for Exploring Association Models*

Alipio Jorge(1), João Poças(2), and Paulo J. Azevedo(3)

(1) LIACC/FEP, Universidade do Porto, Portugal, [email protected]
(2) Instituto Nacional de Estatística, Portugal, [email protected]
(3) Departamento de Informática, Universidade do Minho, Portugal, [email protected]
Abstract. Visualization in data mining is typically related to data exploration. In this chapter we present a methodology for the post processing and visualization of association rule models. One aim is to provide the user with a tool that enables the exploration of a large set of association rules. The method is inspired by the hypertext metaphor. The initial set of rules is dynamically divided into small comprehensible sets or pages, according to the interest of the user. From each set, the user can move to other sets by choosing one appropriate operator. The available operators transform sets of rules into sets of rules, allowing focusing on interesting regions of the rule space. Each set of rules can then also be seen with different graphical representations. The tool is web-based and dynamically generates SVG pages to represent graphics. Association rules are given in PMML format.
1 Introduction

Visualization techniques are mainly popular in data mining and data analysis for data exploration. Such techniques try to solve problems such as the dimensionality curse [13], help the data analyst easily detect trends or clusters in the data, and even favour the early detection of bugs in the data collection and data preparation phases. However, not only the visualization of data can be relevant in data mining. Two other important fields for visual data mining are the graphical representation of data mining models, and the visualization of the data mining process in a visual programming style [6]. The visualization of models in data mining potentially increases their comprehensibility and allows the post processing of those models. In this chapter, we describe a tool and methodology for the exploration/post processing of large sets of association rules. Small sets of rules are shown to the user according to preferences the user states implicitly. Numeric properties of the rules in each rule subset are also graphically represented.
* This work is supported by the European Union grant IST-1999-11.495 Sol-Eu-Net and the POSI/2001/Class Project sponsored by Fundação Ciência e Tecnologia, FEDER e Programa de Financiamento Plurianual de Unidades de I & D.
This environment also takes advantage of PMML (Predictive Model Markup Language), proposed as a standard by the Data Mining Group [6]. This means that any data mining engine producing association rules in PMML can be coupled with the tool being proposed. Moreover, this tool can easily be used simultaneously with other post processing tools that read PMML for the same problem. Association Rule (AR) discovery [1] is often used in data mining applications like market basket analysis, marketing, retail, study of census data, and design of shop layout, among others [e.g., 4, 6, 7, 10]. This type of knowledge discovery is particularly adequate when the data mining task has no single concrete objective to fulfil (such as how to discriminate good clients from bad ones), contrary to what happens in classification or regression. Instead, the use of AR allows the decision maker/knowledge seeker to have many different views on the data. There may be a set of general goals (like "what characterizes a good client?", "which important groups of clients do I have?", "which products do which clients typically buy?"). Moreover, the decision maker may even find relevant patterns that do not correspond to any question formulated beforehand. This style of data mining is sometimes called "fishing" (for knowledge), or undirected data mining [3]. Due to the data characterization objectives, association rule discovery algorithms produce a complete set of rules above user-provided thresholds (typically minimal support and minimal confidence, defined in Section 2). This implies that the output is a very large set of rules, which can easily reach the thousands, overwhelming the user. To make things worse, the typical association rule algorithm outputs the list of rules as a long text (even in the case of commercial tools like SPSS Clementine), and lacks post processing facilities for inspecting the set of produced rules. In this chapter we propose a method and tool for the browsing and visualization of association rules. The tool reads sets of rules represented in the proposed standard PMML [6]. The complete set of rules can then be browsed by applying operators based on the generality relation between itemsets. The set of rules resulting from each operation can be viewed as a list or can be graphically summarized. This chapter is organized as follows: we introduce the basic notions related to association rule discovery and the association rule space. We then describe PEAR (Post-processing Environment for Association Rules). We describe the set of operators and show one example of the use of PEAR, and then proceed to related work and the conclusion.
2 Association Rules

An association rule A→B represents a relationship between the sets of items A and B. Each item I is an atom representing a particular object. The relation is characterized by two measures: the support and the confidence of the rule. The support of a rule R within a dataset D, where D itself is a collection of sets of items (or itemsets), is the number of transactions in D that contain all the elements in A∪B. The confidence of the rule is the proportion of transactions that contain A∪B with respect to the transactions containing A. Each rule represents a pattern captured in the data. The support is the commonness of that pattern. The confidence measures its predictive ability.
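Both measures reduce to subset counting over the transaction database. A minimal Python sketch with toy transactions (the data are illustrative only):

def support(itemset, transactions):
    # absolute support: number of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    # proportion of transactions containing A that also contain B
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(support({"a", "b"}, D))       # 2
print(confidence({"a"}, {"b"}, D))  # 0.666...: 2 of the 3 transactions with "a"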
The most common algorithm for discovering AR from a dataset D is APRIORI [1]. This algorithm produces all the association rules that can be found from a dataset D above given values of support and confidence, usually referred to as minsup and minconf. APRIORI has many variants with more appealing computational properties, such as PARTITION [18] or DIC [3], but these should produce exactly the same set of rules as determined by the problem definition and the data. In this work we used Caren (Classification and Association Rules ENgine) [2], a Java-based implementation of APRIORI. This association rule engine optionally outputs derived models in PMML format, besides Prolog, ASCII, and CSV.

2.1 The Association Rule Space

The space of itemsets I can be structured in a lattice with the ⊆ relation between sets. The empty itemset ∅ is at the bottom of the lattice and the set of all items at the top. The ⊆ relation also corresponds to the generality relation between itemsets. To structure the set of rules, we need a number of lattices, each corresponding to one particular itemset that appears as the antecedent, or to one itemset that occurs as a consequent. For example, the rule {a,b}→{c,d} belongs to two lattices: that of the rules with antecedent {a,b}, structured by the generality relation over the consequent, and the lattice of rules with {c,d} as a consequent, structured by the generality relation over the antecedents of the rules (Figure 1).
Fig. 1. The two lattices of rule {a,b} → {c,d}
We can view this collection of lattices as a grid, where each rule belongs to one intersection of two lattices. The idea behind the rule browsing approach we present is that the user can visit one of these lattices (or part of it) at a time, and take one particular intersection to move into another lattice (set of rules).
3 PEAR: A Web-Based AR Browser

To help the user browse a large set of rules and ultimately find the subset of interesting rules, we developed PEAR (Post-processing Environment for Association Rules) [10].
Fig. 2. PEAR screen showing some rules. On the top we have the “Support>=”, “Confidence>=” and user defined metric (“F(sup,conf)”) parameter boxes. The “Navigation Operators” box is used to pick one operator from a pre-defined menu. The operator is then applied to the rule selected by clicking the respective circle just before the Id. When the “get Rules!” button is pressed, the resulting rules appear, and the process may be iterated.
PEAR implements the set of operators described below, which transform one set of rules into another, and provides a number of visualization techniques. PEAR's server runs under an http server; a client runs in a web browser. Although not currently implemented, multiple clients could potentially run concurrently. PEAR operates by loading a PMML representation of the rule set. This initial set is displayed as a web page (Figure 2). From this page the user can go to other pages containing ordered lists of rules with support and confidence. All the pages are dynamically generated during the interaction of the user with the tool. To move from page (set of rules) to page, the user applies restrictions and operators. The restrictions can be imposed on the minimum support, the minimum confidence, or on functions of the support and confidence of the itemsets in the rule. Operators can be selected from a
Fig. 3. PEAR plotting support × confidence points for a subset of rules. The rule is identified when the mouse flies over the respective x-y point. On the chart above, the selected point is for the rule with Id 3.
list. If it is a {Rule}→{Sets of Rules} operator, the input rule must also be selected. For each page, the user can also select a graphical visualization that summarizes the set of rules on the page. Currently, the available visualizations are a confidence × support x-y plot (Figure 3) and confidence/support histograms (Figure 4). The produced charts are interactive and indicate the rule that corresponds to the point under the mouse.
4 Chunking Large Sets of Rules

Our methodology is based on the philosophy of web browsing: page by page, following hyperlinks. The ultimate aim of the user is to find interesting rules in the large rule set as easily as possible. For that, the set R of derived rules must be divided into small subsets that are presented as pages and can be perceived by the user. In this sense, small means a rule set that can be seen in one screen (at most 20 to 50 rules). Each page then presents some hyperlinks to other pages (other small sets of rules) and visual representations. The first problem to be solved is how to divide a set of rules into pages. Our current approach is to start with a set of rules that presents some diversity, and then use operators that link one rule in the current subset to another subset of rules. The currently proposed operators allow focusing on the neighbourhood of the selected rule. Other operators may have other effects.
Fig. 4. PEAR showing a multi-bar histogram. Each bar represents the confidence (lighter color) and the support (super-imposed darker color) of one rule. Again, the rule can be identified by flying over the respective bar with the mouse.
The second problem is how interesting rules can be found easily. Since the user is searching for interesting rules, each page should include indications of the interest of the rules included. Besides the usual local metrics for each rule, such as confidence and support, global metrics can be provided. This way, when the user follows a hyperlink from one set of rules to another, the evolution of such metrics can be monitored. Below we make some proposals of global metrics for association rule sets. The third problem is how to ensure that any rule can be found from the initial set of rules. If the graph of rules defined by the available operators is connected, then this is satisfied for any set of initial rules. Otherwise, each connected subgraph must have one rule in the initial set.

4.1 Global Metrics for Sets of Rules

To compare the interest of different sets of rules, we need to numerically characterize each set of rules as an individual entity. This naturally suggests the need for global metrics (measuring the value of a set of rules), in addition to local metrics (characterizing individual rules). Each global metric provides a partial ordering on the family of rule sets. The value of a set of rules can be measured in terms of diversity and strength. The diversity of a set of rules may be measured, for instance, by the number of items involved in the rules (item coverage). Example coverage is also a relevant diversity measure. The strength of a set of rules is related to the actionability of the rules. One simple example is the accuracy that a set of association rules obtains on the set of known
examples, when used as a classification model. Other global measures of strength may be obtained by combining local properties of the rules in the set. One example of such a measure is the weighted χ2 introduced in [13].

4.2 The Index Page

To start browsing, the user needs an index page. This should include a subset of the rules that summarizes the whole set; in other words, a set of rules with high diversity. In terms of web browsing, it should be a small set of rules that allows getting to any page in a limited number of clicks. For example, a candidate for such a set could be the smallest rule for each consequent. Each of these rules would represent the lattice on the antecedents of the rules with the same consequent. Since the lattices intersect, we can change to a focus on the antecedent of any rule by applying an appropriate operator. Similarly, we could start with the set of smallest rules for each antecedent. Alternatively, instead of the size, we could consider the support, confidence, or another measure. All these possibilities must be studied and some of them implemented in our system, which currently shows, as the initial page, the set of all rules. Another possibility for defining the starting set of rules is to divide the whole set of rules into clusters with rules involving similar items. Then, a representative rule from each cluster is chosen. The set of representative rules is the starting page. The number of rules can be chosen according to the available screen space. In [10], hierarchical clustering is used to adequately divide a set of association rules. The representative rules are those closest to the centroids of each group of rules.
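Both the item-coverage diversity measure and the smallest-rule-per-consequent index page sketched above are easy to state in code. A minimal Python sketch (rules as pairs of frozensets; the function names are ours, not PEAR's API):

def item_coverage(rules, all_items):
    # diversity: fraction of the known items occurring in some rule
    covered = set()
    for antecedent, consequent in rules:
        covered |= antecedent | consequent
    return len(covered) / len(all_items)

def index_page(rules):
    # one candidate index page: the shortest rule for each distinct consequent
    best = {}
    for antecedent, consequent in rules:
        if consequent not in best or len(antecedent) < len(best[consequent]):
            best[consequent] = antecedent
    return [(a, c) for c, a in best.items()]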
5 Operators for Sets of Association Rules

The association rule browser helps the user to navigate through the space of rules by viewing one set of rules at a time. Each set of rules corresponds to one page. From one given page the user moves to the following one by applying a selected operator to all or some of the rules viewed on the current page. In this section we define the set of operators to apply to sets of association rules. The operators we describe here transform one single rule R∈{Rules} into a set of rules RS∈{Sets of Rules} and correspond to the currently implemented ones. Other operators may transform one set of rules into another. In the following we describe the operators of the former class.

Antecedent generalization
AntG(A→B) = {A'→B | A' ⊆ A}
This operator produces rules similar to the given one but with a syntactically simpler antecedent. This allows the identification of relevant or irrelevant items in the current rule. In terms of the antecedent lattice, it gives all the rules below the current one with the same consequent.
Antecedent least general generalization
AntLGG(A→B) = {A'→B | A' is obtained by deleting one atom in A}
This operator is a stricter version of AntG. It gives only the rules on the level of the antecedent lattice immediately below the current rule.

Consequent generalization
ConsG(A→B) = {A→B' | B' ⊆ B}

Consequent least general generalization
ConsLGG(A→B) = {A→B' | B' is obtained by deleting one atom in B}
Similar to AntG and AntLGG respectively, but the simplification is done on the consequent instead of on the antecedent.

Antecedent specialization
AntS(A→B) = {A'→B | A'⊇A}
This produces rules with lower support but finer detail than the current one.

Antecedent least specific specialization
AntLSS(A→B) = {A'→B | A' is obtained by adding one (any) atom to A}
As AntS, but only for the immediate level above on the antecedent lattice.

Consequent specialization
ConsS(A→B) = {A→B' | B'⊇B}

Consequent least specific specialization
ConsLSS(A→B) = {A→B' | B' is obtained by adding one (any) atom to B}
Similar to AntS and AntLSS, but on the consequent.

Focus on antecedent
FAnt(A→B) = {A→C | C is any}
Gives all the rules with the same antecedent. FAnt(R) = ConsG(R) ∪ ConsS(R).
Focus on consequent
FCons(A→B) = {C→B | C is any}
Gives all the rules with the same consequent. FCons(R) = AntG(R) ∪ AntS(R).
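Representing each rule as a pair of frozensets makes these operators one-line filters over the mined rule set. A minimal Python sketch of three of them (the remaining operators follow the same pattern; the function names are ours):

def ant_g(rule, rules):
    # AntG: same consequent, more general (subset) antecedent
    A, B = rule
    return {(a, b) for (a, b) in rules if b == B and a <= A}

def ant_lgg(rule, rules):
    # AntLGG: antecedent generalizations obtained by deleting exactly one atom
    A, B = rule
    return {(a, b) for (a, b) in rules if b == B and a < A and len(a) == len(A) - 1}

def f_ant(rule, rules):
    # FAnt: every rule sharing the antecedent of the given rule
    A, _ = rule
    return {(a, b) for (a, b) in rules if a == A}

R = {(frozenset("ab"), frozenset("cd")), (frozenset("a"), frozenset("cd")),
     (frozenset(), frozenset("cd")), (frozenset("ab"), frozenset("c"))}
print(ant_lgg((frozenset("ab"), frozenset("cd")), R))  # only {a} -> {c,d}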
6 Example of the Application of the Proposed Methodology

We now describe how the proposed method can be applied to the analysis of downloads from the site of the Portuguese National Institute of Statistics (INE). This site (www.ine.pt/infoline) serves as an electronic store, where the products are tables in digital format with statistics about Portugal. From the web access logs of the site's http server we produced a set of association rules relating the main thematic categories of the downloaded tables. This is a relatively small set of rules (211), involving 9 items, that serves as an illustrative example. The aim of INE is to improve the usability of the site by discovering which items are typically combined by the same user. The results obtained can be used in the restructuring of the site or in the inclusion of recommendation links on some pages. A similar study could be carried out for lower levels of the category taxonomy. The rules in Figure 5 show the contents of one index page, with one rule for each consequent (from the 9 items, only 7 appear). The user then finds the rule on "Territory_and_Environment" relevant for structuring the categories on the site. By applying the ConsG operator, she can drill down the lattice around that rule, obtaining all the rules with a generalized antecedent. [...]

Visual Mining of Association Rules

[...]

CR_1 > CR_2 \quad (5)

Under the null hypothesis the test statistic

T_{NIU} = \frac{CR_1 - CR_2}{\sqrt{C^*\,(1 - C^*)\left(\frac{1}{n_{x,y}} + \frac{1}{n_{x,\neg y}}\right)}} \quad (6)
Fig. 11. The parallel coordinates plot of rules with consequence equal to Toothed
approximates a standard normal distribution, given that n_{x,y} and n_{x,\neg y} are sufficiently large. The term C^* refers to the estimate of the conjoint proportion:

C^* = \frac{n_{x,y,z} + n_{x,\neg y,z}}{n_{x,y} + n_{x,\neg y}} \quad (7)
From equation (7) it follows that C^* measures the confidence of the rule R^* = x → z. When we deal with one-to-one rules, the test statistic T_{NIU} given in equation (6) is equal to the Difference of Confidence (DOC) test statistic proposed in [14], where the confidence of a rule is compared with the confidence of the rule obtained considering the same consequence but the negation of the whole antecedent set of items. The test can be used to prune those rules where at least one antecedent item has a NIU not significantly greater than 0, because the interaction among all the antecedent items is not relevant and a lower-order rule must be retained. Figure 12 shows a parallel plot of the 60 rules that survived the test with a significance level of 0.05. The set of rules is characterised by high confidence values and by a strong interaction among the shared items, with NIU values often equal to 1.
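The pruning test can be reproduced in a few lines of code. A minimal Python sketch (the counts follow the notation of Eqs. (6)-(7); the function names are ours, and the normal upper-tail p-value is computed via erf):

from math import sqrt, erf

def t_niu(n_xy_z, n_xy, n_xnoty_z, n_xnoty):
    # z-statistic of Eq. (6); CR1 = conf(x & y -> z), CR2 = conf(x & not-y -> z)
    cr1 = n_xy_z / n_xy
    cr2 = n_xnoty_z / n_xnoty
    c_star = (n_xy_z + n_xnoty_z) / (n_xy + n_xnoty)  # Eq. (7)
    return (cr1 - cr2) / sqrt(c_star * (1 - c_star) * (1 / n_xy + 1 / n_xnoty))

def keep_item(z, alpha=0.05):
    # one-sided test: keep the antecedent item only if its NIU is significantly > 0
    p = 1 - 0.5 * (1 + erf(z / sqrt(2)))  # upper-tail p-value of N(0,1)
    return p < alpha

print(keep_item(t_niu(90, 100, 60, 100)))  # True: adding y clearly raises the confidence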
4.8 Factorial Planes
As a matter of fact, the number of extracted rules, and even the number of rules after pruning, is huge, which makes manual inspection difficult. A factorial method can be used to face this problem, because it makes it possible to synthesize the information stored in the rules and to visualize the association structure on 2-dimensional graphs.
Fig. 12. The parallel coordinate plot of rules with consequence equal to Toothed after pruning
The rules being synthesized are stored in a data matrix where the number of rows n is equal to the number of rules and the number of columns (p = pif + pthen) corresponds to the total number of different items, both in the antecedent part (pif) and in the consequent part (pthen) of the n rules. Each rule is coded by a binary array assuming value 1 if the corresponding column item is present in the rule and value 0 otherwise. The well-known confidence and support measures are also considered as added columns. The final data matrix thus has n × (pif + pthen + 2) dimensions, and it can be analysed through Multiple Correspondence Analysis (MCA) ([3], [9]), which allows the relationships among the observed variables, the similarities and differences among the rules, and the interactions between them to be represented. MCA reduces the number of original variables by finding linear combinations of them, the so-called factors, that minimize the loss of information due to the dimensionality reduction. Different roles are assigned to the columns of the data matrix: the antecedent items are called active variables and they intervene directly in the analysis, defining the factors; the consequent items and the support and confidence values are called supplementary variables, because they depend on the former and are projected later onto the defined factorial planes. Referring to the zoo data set, the rules that survived a pruning process [6] number 1447, and they involve 16 different items(4), both in the antecedent part (pif) and in the consequence (pthen). The set of rules should thus be represented in a 16-dimensional space and the set of items in a 1447-dimensional space. In order to reduce the number of original variables through the factors, it is necessary to evaluate the loss of information or, equivalently, the variability explained by the retained factors. Following the Benzécri approach [3] for the evaluation of the explained variability in the case of MCA, table 3 shows the explained variability and the cumulative variability. The first two factors share more than 80% of the total inertia, and they correspond to the highest change in level in the percentage of variability.
(4) Venomous, Domestic and >4 Legs are the items removed by the pruning procedure.
Table 3. Total inertia decomposition

Factor   % of variability   Cumulative %
1        44                 44
2        40                 84
3        12                 96
4        4                  100
Once the MCA is performed it is possible to represent the rules and the items in reduced-dimension subspaces: the factorial planes that explain at least a user-defined threshold of the total variability (in the zoo example, the first factorial plane), a user-defined factorial plane, or the factorial plane best defined by a user-chosen item. Different views on the set of rules can be obtained by exploiting the results of the MCA.

1. Items Visualization. A graphical representation of the antecedent and the consequent items is provided by the factorial plane, where the item points have a dimension proportional to their supports and the confidence and the support are represented by oriented segments linking the origin of the axes to their projection on the plane. In Figure 13 the active and the supplementary items are visualized together with the confidence and support arrows. Privileged regions characterized by strong rules can be identified in the case of high coordinates of the confidence and the support, because their coordinates represent the correlation coefficients with the axes. The proximity between two antecedent items shows the presence of a set of rules sharing them, while the proximity between two consequent items is related to a common causal structure. Finally, the closeness between antecedent items and consequent items highlights the presence of a set of rules with a common dependence structure.

2. Rules Visualization. Another view on the mined knowledge is provided by the rules representation on the factorial plane. Graphical tools and interactive features can help in the interpretation of the graph: the rules are represented by points with a dimension proportional to their confidence, the proximity among two or more rules shows the presence of a common structure of antecedent items associated to different consequences, and a selected subset of rules can be inspected in a tabular format. For example, table 4 lists the subset of the rules selected in figure 14. It is worth noticing that this subset of rules is very close on the plane because the rules have similar antecedent structures sharing at least one item; some rules even overlap because they have exactly the same antecedent structure.
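The n × (pif + pthen + 2) coding matrix described above can be assembled directly from the rule list. A minimal NumPy sketch (hypothetical rules; any MCA routine can then be applied to the antecedent columns, with the remaining columns treated as supplementary):

import numpy as np

def rule_matrix(rules, if_items, then_items):
    # one row per rule: antecedent item columns, then consequent item columns,
    # then support and confidence as the two added columns of Sect. 4.8
    cols = [("if", i) for i in if_items] + [("then", i) for i in then_items]
    X = np.zeros((len(rules), len(cols) + 2))
    for r, (ant, cons, sup, conf) in enumerate(rules):
        for c, (side, item) in enumerate(cols):
            X[r, c] = float(item in (ant if side == "if" else cons))
        X[r, -2], X[r, -1] = sup, conf
    return X

rules = [({"hair", "milk"}, {"toothed"}, 0.29, 0.97),
         ({"milk"}, {"backbone"}, 0.40, 1.00)]
print(rule_matrix(rules, ["hair", "milk"], ["toothed", "backbone"]))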
Fig. 13. The items representation
Fig. 14. The rules representation
It is possible to imagine transforming the set of overlapping rules into a higher-order macro-rule, obtained by linking the common behaviour described by the antecedent items to the logical disjunction of the different consequent items.
Table 4. Description of a subset of overlapping rules

Rule   Antecedent                            Consequence   Conf.   Sup.
899    Hair & Milk & Breathes & Catsize      Toothed       0.97    0.29
900    Hair & Milk & Breathes & Catsize      Backbone      1.00    0.30
901    Hair & Milk & Breathes & Catsize      4 legs        0.83    0.25
892    Hair & Milk & Breathes & 4 legs       Toothed       0.97    0.30
893    Hair & Milk & Breathes & 4 legs       Backbone      1.00    0.31
894    Hair & Milk & Breathes & 4 legs       Tail          0.90    0.28
1257   Milk & Breathes & 4 legs & Catsize    Hair          1.00    0.25
1258   Milk & Breathes & 4 legs & Catsize    Toothed       0.96    0.24
1259   Milk & Breathes & 4 legs & Catsize    Backbone      1.00    0.25
1260   Milk & Breathes & 4 legs & Catsize    Tail          0.92    0.23
1024   Hair & Breathes & 4 legs & Catsize    Milk          1.00    0.25
1025   Hair & Breathes & 4 legs & Catsize    Toothed       0.96    0.24
1026   Hair & Breathes & 4 legs & Catsize    Backbone      1.00    0.25
1027   Hair & Breathes & 4 legs & Catsize    Tail          0.92    0.23
Fig. 15. The Conjoint representation
3. Conjoint Visualization. The factorial plane features also allow the items and the rules to be visualized simultaneously. In the conjoint representation, aside from a scale factor, each rule is surrounded by the antecedent items it holds and, vice versa, each item is surrounded by the rules sharing it. By linking two or more active items it is possible to highlight all the rules that contain at least one of the selected items in the antecedent. For example, in figure 15
two groups of rules have been closed inside the polygons joining the items they share in the antecedent.
5 Concluding Remarks

Association rules visualization is emerging as a crucial step in a data mining process, needed in order to profitably use the extracted knowledge. In this paper the main approaches used to face this problem have been discussed. It emerges that, to date, a compromise has to be made between the quantity of information (in terms of the number of rules) that can be visualized and the depth of insight that can be reached. This suggests that there is no single winning visualization; rather, their strength lies in the possibility of exploiting the synergic power deriving from their conjoint use. Moreover, a stronger interaction between the visualization tools and the data mining process, each incorporating the other, is advisable.

Acknowledgements. The paper was financially supported by the University of Macerata grant (2004) Metodi Statistici Multivariati per l'Analisi e la Valutazione delle Performance in Campo Economico e Aziendale.
References 1. Advanced Visual Systems (AVS), OpenViz. http://www.avs.com/software/ 2. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items in Large Databases. In: Proceedings of the 1993 ACM SIGMOD Conference, Washington DC, USA, pp. 207–216 (May 1993) 3. Benzécri, J.-P.: L'Analyse des Données. Dunod, Paris (1973) 4. Bruzzese, D., Buono, P.: Combining Visual Techniques for Association Rules Exploration. In: Proceedings of the International Conference Advanced Visual Interfaces, Gallipoli, Italy, May 25-28 (2004) 5. Bruzzese, D., Davino, C., Vistocco, D.: Parallel Coordinates for Interactive Exploration of Association Rules. In: Proceedings of the 10th International Conference on Human - Computer Interaction, Creta, Greece, June 22-27. Lawrence Erlbaum, Mahwah (2003) 6. Bruzzese, D., Davino, C.: Significant Knowledge Extraction from Association Rules. In: Electronic Proceedings of the International Conference Knowledge Extraction and Modeling Workshop, Anacapri, Italy, September 4-6 (2006) 7. Clementine, Suite from SPSS, http://www.spss.com/Clementine/ 8. Glymour, C., Madigan, D., Pregibon, D., Smyth, P.: Statistical Inference and Data Mining. Communications of the ACM (1996) 9. Greenacre, M.: Correspondence Analysis in Practice. Academic Press, London (1993) 10. Hartigan, J., Kleiner, B.: Mosaics for contingency tables. In: Proceedings of the 13th Symposium on the Interface, pp. 268–273 (1981) 11. Hofmann, H.: Exploring categorical data: interactive mosaic plots. Metrika 51(1), 11–26 (2000)
12. Hofmann, H., Siebes, A., Wilhelm, A.: Visualizing Association Rules with Interactive Mosaic Plots. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 227–235 (2000) 13. Hofmann, H., Wilhelm, A.: Validation of Association Rules by Interactive Mosaic Plots. In: Bethlehem, J.G., van der Heijden, P.G.M. (eds.) Compstat 2000 Proceedings in Computational Statistics, pp. 499–504. Physica-Verlag, Heidelberg (2000) 14. Hofmann, H., Wilhelm, A.: Visual Comparison of Association Rules. Computational Statistics 16, 399–416 (2001) 15. IBM Intelligent Miner for Data, http://www.software.ibm.com/data/intelli-mine 16. Inselberg, A.: N-dimensional Graphics, part I - Lines and Hyperplanes, in IBM LASC Tech. Rep. G320-2711, 140 pages. IBM LA Scientific Center (1981) 17. Inselberg, A.: Visual Data Mining with Parallel Coordinates. Computational Statistics 13(1), 47–64 (1998) 18. Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., Verkamo, A.I.: Finding interesting rules from large sets of discovered association rules. In: Proceedings of the Third International Conference on Information and Knowledge Management CIKM 1994, pp. 401–407 (1994) 19. Kopanakis, I., Theodoulidis, B.: Visual Data Mining & Modeling Techniques. In: 4th International Conference on Knowledge Discovery and Data Mining (2001) 20. Liu, B., Hsu, W., Ma, Y.: Pruning and Summarizing the Discovered Associations. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 1999), San Diego, CA, USA, August 15-18 (1999) 21. Liu, B., Ma, Y., Lee, R.: Analyzing the Interestingness of Association Rules from Temporal Dimensions. In: International Conference on Data Mining, CA (2001) 22. Machine learning, http://www1.ics.uci.edu/∼ mlearn/MLSummary.html 23. Megiddo, N., Srikant, R.: Discovering Predictive Association Rules. In: Knowledge Discovery and Data Mining (KDD 1998), pp. 274–278 (1998) 24. Miner3D, Miner3D Excel, http://www.miner3d.com/m3Dxl/ 25. Ong, K.-H., Ong, K.-L., Ng, W.-K., Lim, E.-P.: CrystalClear: active Visualization of Association Rules. In: Proc. of the Int. workshop on Active Mining, Japan (2002) 26. Purple Insight Mineset, http://www.purpleinsight.com 27. Sas Enterprise Miner, http://www.sas.com/technologies/analytics/datamining/miner 28. Shah, D., Lakshmanan, L.V.S., Ramamritham, K., Sudarshan, S.: Interestingness and Pruning of Mined Patterns. In: Workshop Notes of the 1999 ACM SIGMOD Research Issues in Data Mining and Knowledge Discovery (1999) 29. Statistica, http://www.statsoft.com 30. Toivonen, H., Klemettinen, M., Ronkainen, P., Hatonen, K., Mannila, H.: Pruning and grouping of discovered association rules. In: Workshop Notes of the ECML-95 Workshop on Statistics, Machine Learning, and Knowledge Discovery in Databases, Heraklion, Greece, April 1995, pp. 47–52 (1995) 31. Unwin, A., Hofmann, H., Bernt, K.: The TwoKey Plot for Multiple Association Rules Control. In: Siebes, A., De Raedt, L. (eds.) PKDD 2001. LNCS (LNAI), vol. 2168. Springer, Heidelberg (2001) 32. VisualMine, http://www.visualmine.com/ 33. Webb, G.I.: Preliminary Investigations into Statistically Valid Exploratory Rule Discovery. In: Proceedings of the Australasian Data Mining Workshop, Sydney (2003)
34. Weber, I.: On Pruning Strategies for Discovery of Generalized and Quantitative Association Rules. In: Proceedings of Knowledge Discovery and Data Mining Workshop, Singapore (1998) 35. Wong, P.C., Whitney, P., Thomas, J.: Visualizing Association Rules for Text Mining. In: Wills, G., Keim, D. (eds.) Proceedings of IEEE Information Visualization 1999. IEEE CS Press, Los Alamitos (1999) 36. XGvis: A System for Multidimensional Scaling and Graph Layout in any Dimension, http://www.research.att.com/areas/stat/xgobi/ 37. Yang, L.: Visualizing Frequent Itemsets, Association Rules and Sequential Patterns in Parallel Coordinates. In: Kumar, V., Gavrilova, M.L., Tan, C.J.K., L’Ecuyer, P. (eds.) ICCSA 2003. LNCS, vol. 2667, pp. 21–30. Springer, Heidelberg (2003)
Interactive Decision Tree Construction for Interval and Taxonomical Data

François Poulet(1) and Thanh-Nghi Do(2)

(1) IRISA-Texmex, Université de Rennes I, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France, [email protected]
(2) Equipe InSitu, INRIA Futurs, LRI, Bat. 490, Université Paris Sud, 91405 Orsay Cedex, France, [email protected]
Abstract. The visual data-mining strategy lies in tightly coupling the visualizations and analytical processes into one data-mining tool that takes advantage of the assets of multiple sources. This paper presents two graphical interactive decision tree construction algorithms able to deal either with (usual) continuous data or with interval and taxonomical data. They are extensions of two existing algorithms: CIAD [17] and PBC [3]. Both the CIAD and PBC algorithms can be used in an interactive or cooperative mode (with an automatic algorithm to find the best split of the current tree node). We have modified the corresponding help mechanisms to allow them to deal with interval-valued attributes. Some of the results obtained on interval-valued and taxonomical data sets are presented, together with the methods we have used to create these data sets.
1 Introduction

Knowledge Discovery in Databases (or KDD) can be defined [10] as the non-trivial process of identifying patterns in the data that are valid, novel, potentially useful and understandable. In most existing data mining tools, visualization is only used during two particular steps of the data mining process: in the first step to view the original data, and in the last step to view the final results. Between these two steps, an automatic algorithm is used to perform the data mining task (for example decision trees like CART [8] or C4.5 [19]). The user has only to tune some parameters before running the algorithm and then waits for its results. Some new methods have recently appeared [22], [15], [24], trying to involve the user more significantly in the data mining process and using visualization more intensively [9], [20]; this new kind of approach is called visual data mining. In this paper we present some methods we have developed which integrate automatic algorithms, interactive algorithms and visualization methods. These methods are two interactive classification algorithms. The classification algorithms use both human
pattern recognition facilities and computer calculus power to perform an efficient user-centered classification. This paper is organized as follows. In section 2 we briefly describe some existing interactive decision tree algorithms and then focus on the two algorithms we will use for interval-valued data and taxonomical data. The first one is an interactive decision tree algorithm called CIAD (Interactive Decision Tree Construction) using support vector machines (SVM), and the second is PBC (Perception Based Classifier). In section 3 we present the interval-valued data: how they can be sorted, what graphical representation can be used, and how we perform the graphical classification of these data with our decision tree algorithms. Section 4 presents the same information as section 3 but concerning the taxonomical data. Then we present some of the results we have obtained in section 5, before the conclusion and future work.
2 Interactive Decision Tree Construction

Some new user-centered manual (i.e. interactive, non-automatic) algorithms inducing decision trees have appeared recently: Perception Based Classification (PBC) [4], Decision Tree Visualization (DTViz) [12], [21] or CIAD [16]. All of them try to involve the user more intensively in the data-mining process. They are intended to be used by a domain expert and not the usual statistician or data-analysis expert. This new kind of approach has the following advantages:
- the quality of the results is improved by the use of human pattern recognition capabilities,
- using the domain knowledge during the whole process (and not only for the interpretation of the results) allows a guided search for patterns,
- the confidence in the results is improved; the KDD process is not just a "black box" giving more or less comprehensible results.
The technical parts of these algorithms are somewhat different: PBC and DTViz build a univariate decision tree by choosing split points on numeric attributes in an interactive visualization. They use a bar visualization of the data: within a bar, the attribute values are sorted and mapped to pixels in a line-by-line fashion according to their order. Each attribute is visualized in an independent bar (cf. Fig. 1). The first step is to sort the pairs (attri, class) according to the attribute values, and then to map them to lines colored according to the class values. When the number of items in the data set is too large, each pair (attri, class) is represented with a pixel instead of a line. Once all the bars have been created, the interactive algorithm can start. The classification algorithm performs univariate splits and allows binary splits as well as n-ary splits. Only PBC and CIAD provide the user with an automatic algorithm to help him choose the best split in a given tree node. The other algorithms can only be run in a 100% manual interactive way. CIAD is a bivariate decision tree using line drawing in a set of two-dimensional matrices (like scatter plot matrices [9]). The first step of the algorithm is the creation of a set of (n-1)²/2 two-dimensional matrices (n being the number of attributes). These
Interactive Decision Tree Construction for Interval and Taxonomical Data
125
matrices are the two-dimensional projections of all possible pairs of attributes; the color of each point corresponds to its class value. This is a very effective way to graphically discover relationships between two quantitative attributes. One particular matrix can be selected and displayed in a larger size in the bottom right of the view (as shown in Figure 2 using the Segment data set from the UCI repository [6], made of 19 continuous attributes, 7 classes and 2310 instances). Then the user can start the interactive decision tree construction by drawing a line in the selected matrix, thus performing a binary univariate or bivariate split in the current node of the tree. The strategy used to find the best split is the following: we try to find a split giving the largest pure partition; the splitting line (parallel to an axis or oblique) is interactively drawn on the screen with the mouse. The pure partition is then removed from all the projections. If a single split is not enough to get a pure partition, each half-space created by the first split is treated alternately in a recursive way (the alternate half-space is hidden during the current one's treatment).
[Fig. 1 data: original pairs Attr.1 = (1, 5, 2, 9, 3, 6), Attr.2 = (5, 7, 1, 3, 2, 9), Class = (A, A, B, B, A, B); sorted by Attr.1: values 1, 2, 3, 5, 6, 9 with classes A, B, A, A, B, B; sorted by Attr.2: values 1, 2, 3, 5, 7, 9 with classes B, A, B, A, A, B; legend: Class A, Class B]
Fig. 1. Creation of the visualization bars with PBC
At each step of the classification, some additional information can be provided to the user, like the size of the resulting nodes, the quality of the split (purity of the resulting partition) or the overall purity. Some other interactions are available to help the user: it is possible to hide, show or highlight one class, one element or a group of elements. A help mechanism is also provided to the user. It can be used to optimize the location of a drawn line (the line becomes the best separating line), or to automatically find the best separating line for the current tree node or for the whole tree construction. These helps are based on a support vector machine algorithm, modified to find the best separating line (in two dimensions) instead of the best separating hyperplane (in n−1 dimensions for an n-dimensional dataset).
Fig. 2. The Segment data set displayed with CIAD
3 Interval Data
Decision trees usually deal with qualitative or quantitative values. Here we are interested in interval-valued data. This kind of data is often used in polls (for example for income or age). We only consider the particular case of finite intervals.
3.1 Ordering Interval Data
To be able to use this new kind of data with PBC, we need to define an order on these data. There are mainly three different orders we can use [14]: according to the minimum values, the maximum values or the mean values. Let us consider two interval data: I1 = [l1, r1] (mean m1) and I2 = [l2, r2] (mean m2). If the data are sorted according to the minimum values, then: if l1 = l2, then I1 < I2 ⇔ r1 < r2; if l1 ≠ l2, then I1 < I2 ⇔ l1 < l2.
If the data are sorted according to the maximum values, then: if r1 = r2, then I1 < I2 ⇔ l1 < l2; if r1 ≠ r2, then I1 < I2 ⇔ r1 < r2. And finally, if the data are sorted according to the mean values, then I1 < I2 ⇔ m1 < m2.
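Read as sort keys, these three orders take only a few lines. The following Python sketch is illustrative only; the tuple keys encode the tie-breaking rules stated above, and the pair representation of intervals is our own choice:

def min_key(interval):
    l, r = interval          # intervals are (l, r) pairs with l <= r
    return (l, r)            # primary: minimum value; ties broken by maximum

def max_key(interval):
    l, r = interval
    return (r, l)            # primary: maximum value; ties broken by minimum

def mean_key(interval):
    l, r = interval
    return (l + r) / 2.0     # order by interval means

intervals = [(1, 5), (2, 9), (3, 6)]
by_min = sorted(intervals, key=min_key)
by_max = sorted(intervals, key=max_key)
by_mean = sorted(intervals, key=mean_key)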
[Fig. 13 panels, frequency ratios of the two mapped variables: (a) FX = 0, FY > 0; (b) FX > 0, FY = 0; (c) FX/FY = 1; (d) FX/FY = 1/5; (e) FX/FY = 1/4; (f) FX/FY = 1/3; (g) FX/FY = 2/5; (h) FX/FY = 1/2; (i) FX/FY = 3/5; (j) FX/FY = 2/3; (k) FX/FY = 3/4; (l) FX/FY = 4/5]
Fig. 13. Mapping categorical variables to the frequency of vibrating glyphs
patterns, and can thus not be said to be a defence for dynamic glyph attributes, but it does allow a more detailed study of vibrational modes. In contrast to the previous visualisation methods, this method can only use categorical variables; alternatively, one can round off continuous variables to a few discrete steps. When mapping two variables to the frequency attributes in a 2D visualisation, the shape of the movement patterns depends upon the ratio between the two frequencies. This means that the ratio between frequencies is clearly displayed, but that the actual values cannot readily be deduced. In Figure 13, some of the basic movement patterns in two dimensions are shown. These curves are well known to mathematicians and are described, e.g., in [22], [23] and [24]. They are usually called Lissajous curves after the French mathematician Jules-Antoine Lissajous (1822-1880), who discovered them in 1857 while studying wave patterns. However, it is also said that the American astronomer and mathematician Nathaniel Bowditch (1773-1838) had already discovered the curves in 1815, which is why they are sometimes also called Bowditch curves. Plots that make use of these kinds of curves can therefore, perhaps, be called “Lissajous Plots”.
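Such movement patterns can be sampled directly from the two frequencies. The following Python sketch is our own illustration; the phase offset and sampling resolution are assumed values, not taken from the chapter:

import math

def lissajous(fx, fy, phase=math.pi / 2, steps=1000):
    # Sample one period of the glyph's 2D movement pattern.
    points = []
    for i in range(steps):
        t = i / steps
        x = math.sin(2 * math.pi * fx * t + phase)
        y = math.sin(2 * math.pi * fy * t)
        points.append((x, y))
    return points

# FX : FY = 1 : 2 corresponds to the ratio of panel (h) in Fig. 13.
curve = lissajous(fx=1, fy=2)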
8 Auditory Exploration of Static Worlds
In this section, three cases will be presented where the sound tools of 3DVDM are used. The first case investigates the usage of soundscapes in a situation where sound acts as a support for color, which represents Forest Cover Type. The second investigates the same dataset, but with sound used to represent Wilderness Area. In the third test we attempt to map Vertical Hydrology Distance to sound. For all cases the axes indicate Elevation, Horizontal Roadways Distance and Horizontal Hydrology Distance. This combination does not receive a particularly high score when we calculate the partial correlation coefficient. Still, interesting shapes appear as “tongues” stretching towards the higher ends of two of the axes, Horizontal Hydrology Distance in particular. Figure 14 shows a 3D scatter plot of the dataset used for these examples.
Fig. 14. 3D scatter plot from different angles
The two pictures show the same scatter plot from different angles. As it is difficult to present a soundscape in time in a document of this type, we will instead describe the soundscapes that occur. During the tests the distance threshold is adjusted to investigate how this function affects the soundscape and the listener's perception of the content. We also try different rendering methods: rendering the whole set, and rendering a sweep along the three axes. When applicable, we bring out the virtual torch to test it on areas of high interest.
8.1 Sound Supporting Color
The main objective of this test is to investigate how the use of dynamic 3D soundscapes works as support for a visual parameter, in this case color. We will also investigate whether it is possible to navigate the soundscape, so that we can locate areas of high concentration. Finally, we will investigate the possibility of locating interesting areas that are not visible to the eye. The database is sampled every 10 ms, and 16 entries are randomly picked and mapped to sound samples of spoken numbers 1 to 7, representing the seven Cover Types. Initially all thresholds are set to maximum values, so that all observations are potential sources. The listener is placed in the middle of the coordinate system and starts navigating the soundscape from there; adjusting the distance threshold is allowed.
– The immediate overall impression is a soundscape consisting of the numbers 5 and 7, which are the two dominant cover types: Lodgepole Pine and Spruce Fir. It is difficult to hear other types. The distance threshold is lowered to 10 (graphical) units, which allows the listener to investigate close-range areas further by navigating through the data.
– This does not reveal anything straight away, but closing in on the area around the middle of the Elevation axis increases the number of Aspen (Type 1) to a noticeable level. It reveals what can be seen from the color: that the number of Aspen is small compared to Lodgepole Pine and Spruce Fir at that Elevation point (and they are close to roads in general). Next, the Elevation axis is rendered in time. Data sampling is done using a sliding window (Figure 15) and the distance threshold is reset to maximum. The objective is then to navigate around in the soundscape and listen for interesting things.
– This gives the impression of a gradually changing Cover Type as the Elevation increases. After several passes it also reveals that there still are types that we cannot see in the scatter plot where Lodgepole Pine and Spruce Fir are visually dominant (primarily Aspen).
(For reference: the coordinate system is 100 × 100 × 100 graphical units.)
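The sampling procedure used throughout these tests (every 10 ms, 16 randomly picked observations become sound sources if they lie within the distance threshold) can be sketched roughly as follows. All names, the play_at call and the data structures are hypothetical stand-ins; this is not 3DVDM's actual API:

import random
import time

def render_soundscape(observations, sound_for_category,
                      listener_pos, distance_threshold, distance):
    while True:
        # 16 entries are randomly picked from the database...
        batch = random.sample(observations, 16)
        for obs in batch:
            # ...and only those within the distance threshold become
            # audible sources (e.g. a spoken Cover Type number).
            if distance(obs.position, listener_pos) <= distance_threshold:
                sound_for_category[obs.category].play_at(obs.position)
        time.sleep(0.010)  # the database is sampled every 10 ms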
Fig. 15. Sampling along the Elevation axis. Black cubes mark the current samples.
Rendering the other axes does not reveal anything that cannot be seen from the colors. Using the torch to investigate statistical values does not really make sense in this case, where sound acts as support for color.
8.2 Sound “On Its Own” – Categorical Variables
The main objective of this test is to investigate how the use of dynamic 3D soundscapes works for data mining when there is no visual reference and the data variable used for rendering the soundscape consists of a few (four) categorical values. We will see if it is possible to get an idea of how the selected variable is distributed. We will also investigate whether it is possible to navigate through the soundscape, so that we can locate areas of interest. The database is sampled every 10 ms, and 16 entries are randomly picked and mapped to sound samples of spoken numbers 1 to 4, representing the four Wilderness Areas. Color still shows Cover Type. Initially all thresholds are set to maximum values, so that all observations are potential sources. The listener is placed in the middle of the coordinate system and starts navigating the soundscape from there; adjusting the distance threshold is allowed. Doing a bit of “cheating” by changing color to represent Wilderness Area shows the layout in Figure 16.
– The initial experience when sampling the whole set randomly is a preponderance of areas 2 and 4. This is expected from the layout of the database. We choose to render the three axes one by one while navigating the dataset, to get a picture of the Wilderness Area distribution.
– It becomes clear that the large tongue stretching out along the Horizontal Roadways Distance axis is area 4. From about half the maximum distance, area 2 gradually increases. The slim tongues that reach out along the Horizontal Hydrology Distance axis are primarily area 2, with some representation of area 4.
Fig. 16. A “sneak peek” at the Wilderness Area distribution using color
Data from area 3 seems to be located at high Elevation with low Roadways Distance, and area 1 is located at low Elevation and low Hydrology and Roadways Distances. We then try to identify which kinds of Forest Cover occur in the different Wilderness Areas. By looking at the colors, one can see that Cottonwood/Willow, Douglas-fir and Ponderosa Pine are grouped in one corner of the scatter plot. The distance threshold is set to 10 and the area is investigated further (by navigating into it).
– This reveals that all these Cover Types are in area 1 until the Elevation reaches a certain level.
– This area also has a few Lodgepole Pines. It now seems feasible to bring out the torch and point it where area 1 seems to stop (Figure 17), to see how these Cover Types are distributed across the Wilderness Areas.
– Closer investigation with the torch reveals that area 1 has a few observations around Elevation 2600, where it seems to stop. Areas 2 and 4 take over and have a few Douglas-fir and Ponderosa Pine, but no Cottonwood/Willow.
– Moving a bit upwards along the Elevation axis with the torch aimed at Aspen confirms that this type exists only in area 4.
8.3 Sound “On Its Own” – Continuous Variables
The main objective of this test is to investigate how the use of dynamic 3D soundscapes works for data mining when there is no visual reference and the data variable selected for rendering the soundscape consists of many different observations.
Fig. 17. Investigating area 1, maximum elevation area
We will see if it is possible to get an idea of how the selected variable is distributed. We will also investigate whether it is possible to navigate through the soundscape, so that we can locate areas of interest. The database is sampled every 10 ms, and 16 entries are randomly picked and mapped to synthesized waveforms spaced on an Ionian scale over two octaves. The values of the chosen variable (Vertical Hydrology Distance) are normalized and mapped to this scale so that low values become low pitches and vice versa. Color still shows Cover Type. Initially all thresholds are set to maximum values, so that all observations are potential sources. The listener is placed in the middle of the coordinate system and starts navigating the soundscape from there; adjusting the distance threshold is allowed.
– The immediate overall impression is a soundscape with many different values, but with a high concentration of mid-range values and very few in the extreme high and low regions. The distance threshold is again lowered to 10 units, allowing closer inspection of local areas by navigation.
– There is a noticeable change in the soundscape along the Horizontal Hydrology Distance axis. At low distance the values seem concentrated on a value in the middle of the lower octave (i.e. around 1/4th of the maximum Vertical Distance, i.e. around 0 meters, since this data variable has both negative and positive values). There do not seem to be any very low or high values.
– Moving in the direction of high Horizontal Hydrology Distance creates a more distributed soundscape with many notes of different pitch. Sounds seem concentrated around middle values with a larger spread than initially, but there are definitely some very low and high values in this area. It seems feasible to render along the three axes; Horizontal Hydrology Distance in particular should be interesting.
Fig. 18. Mapping Vertical Hydrology Distance to color
– This confirms what was indicated when rendering the whole scatter plot: there is a strong representation of a level at about 1/4th, as mentioned above.
– This spreads out as Horizontal Hydrology Distance increases.
– When Horizontal Hydrology Distance is 0, Vertical Hydrology Distance is also 0, provided that the pitch value we hear at about 1/4th of the total tonal range corresponds to the value 0. (This correlation between the two distances would be expected, e.g., for trees close to lakes and rivers.)
Mapping Vertical Hydrology Distance to color (Figure 18) gives a clearer view of how the distribution of this value changes as Horizontal Hydrology Distance increases. The red tongue was not detected during the test; it was only audible as a few high-pitched sounds which were difficult to locate.
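The continuous-to-pitch mapping used in this test can be sketched as follows. This is a minimal Python illustration; the base frequency, equal-tempered tuning and exact quantization are our assumptions, and we read the chapter's scale as the Ionian (major) scale:

IONIAN_STEPS = [0, 2, 4, 5, 7, 9, 11]  # semitone offsets within one octave
TWO_OCTAVES = [s + 12 * o for o in range(2) for s in IONIAN_STEPS] + [24]

def value_to_frequency(value, vmin, vmax, base_hz=261.63):
    # Normalize the data value to [0, 1], low values giving low pitch...
    norm = (value - vmin) / (vmax - vmin)
    # ...then quantize onto one of the 15 scale degrees spanning two octaves.
    idx = round(norm * (len(TWO_OCTAVES) - 1))
    semitones = TWO_OCTAVES[idx]
    return base_hz * 2 ** (semitones / 12)  # equal temperament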
9 Discussion
9.1 Visual Data Exploration
We aimed at demonstrating the potential benefit of immersive VR for visual data mining, using a new version of our 3DVDM system and a new version of the VR++ framework upon which it is based. The system is capable of providing real-time user response and navigation, as well as showing dynamic visualizations of large amounts of data. Although we find the real-time performance of the system fully adequate, we have not included any tests to support this claim. Instead, we have put emphasis on illustrative support of our hypothesis that the system's visualization tools provide a complementary and valuable benefit for the detection of, e.g., non-linear relationships and substructures in data, which are most likely to escape traditional methods of data analysis.
An investigation of a data set starts with simple uni- and bi-variate analyses to become familiar with the data and possibly reduce the number of variables to a reasonable number. In the case study presented we had 10 quantitative variables, and they were all forwarded to an evaluation of which combinations of three could be most interesting for defining the three spatial axes in the “Extended Scatter Plots”, where more detailed visual inspection could take place. A “Scatter Plot Tour” systematically presented the 120 unique “triplets” of the 10 variables, with the dependent variable (Forest Cover Type) mapped to the color of the data points (visualized as small objects). The criterion for “interesting” is entirely based on the user's subjective evaluations when observing data in 3D space. This is potentially a weakness, but the point is to complement the algorithmic methods with the power of human perception. Several interesting triplets were observed, and a few were illustrated and selected for further investigation. One triplet was used to demonstrate that the objects may encode multiple visual properties, representing more statistical variables simultaneously, while another demonstrated the use of flexible and controllable color scales. Eventually the Macro Dynamic visualization was demonstrated and revealed highly surprising substructures in the data. The full benefit of the 3DVDM tools assumes 3D VR visualization systems like a CAVE or a Panorama, where the user can navigate around and pursue intriguing views. Such benefit can only be experienced “in situ”, and for this chapter we are left with the means of mono illustrations as presented on a monitor. However, we hope that the major points of the approach and its potential do come through via the many illustrations provided and their organization as a successively progressing use of the visualization tools. Several peculiar data structures were detected through the visual inspection, and they could possibly warrant a more detailed statistical analysis to explain the phenomena as valuable information or otherwise. The visualizations presented are all selected on the basis of their perceptual particularities, without any claims whatsoever about practical and/or statistical significance.
9.2 Auditory Data Exploration
The tools and methods presented in this chapter are only a few of the many possibilities of this system. There are virtually unlimited ways to construct a soundscape using different sounds and settings. In general, sound seems to support a visual parameter like color quite well. It does not reveal much new information, but it can be useful when some data structure occludes other interesting observations. If certain clusters are out of visual range, a sound cluster will appear in that direction, provided that there are sufficient data in the cluster. Mapping statistical values to a few different sounds works well when the data take a few categorical values. If the data values are continuous, it is more difficult to suggest a soundscape with the same informational value, though using the synthesized sounds may still yield some information about relative values.
Sampling the whole 3D scatter plot randomly, and especially along the three axes in time, gives a soundscape that is similar to the one we would get by choosing color rather than sound. The real strength of this method appears when comparing color and sound information to investigate the correlation between the two. The torch is useful for finding the direct statistical value of a given observation, as long as the statistical values are few and preferably categorical. An important parameter to consider is the distance threshold, which enables the listener to concentrate on the local area, because it also seems to eliminate most of the potential background noise created by distant objects. However, it likewise eliminates the ability to navigate towards distant areas on the basis of the auditory cues from them. This was especially true for the last test, where the synthesizer was used; in the second test it was possible to work with a higher distance threshold, probably because of the few different sounds. When the user becomes familiar with the current properties of a 3D soundscape, it may become possible to navigate supported by sound cues, given that there are clusters that provide sufficient positional cues. In cases where there are no apparent clusters or other patterns in the soundscape, it may just confuse the user; in such cases it is probably better not to use the sound tools. In any case, a soundscape does hold enough information to give a strong indication of the distribution of a given statistical value, and this information is significant enough to trigger a closer investigation in most cases.
10 Conclusions
Concerning the potential benefit of immersive VR for VDM, our hypothesis was that a complementary and valuable benefit is achievable concerning the detection of, e.g., non-linear relationships in data, which are most likely to escape traditional methods of data analysis. We have not presented tests that explicitly verify the benefit of navigation and real-time user response in an immersive VR system, but we have illustrated the usefulness of the 3DVDM framework designed for VR through a series of examples. The VDM tools do help in discovering remarkable non-linear data relations and substructures in the dataset used, which would have been very difficult or impossible to detect using more traditional methods of analysis. In particular, the Macro Dynamic Visualization revealed unexpected substructures. Commenting on the actual practical and statistical significance of the discovered data structures is beyond the scope of the chapter. No statistical or conclusive analysis is aimed at with the system; the output is a specification of phenomena that may warrant follow-up with proper statistical analysis. Concerning the use of sound for data exploration, this project intended to create software tools that allow us to use sound to assist in performing visual data mining in VR. This chapter presented two basic sound tools developed for this purpose, and our aim was to present and test these tools in various ways.
Tests have shown that it is possible to use sound for data mining in VR, either as a support for visual parameters or as a stand-alone method, and especially the use of different exchangeable sample banks to represent statistical values proves to be a useful way of locating data values in VR. The success will depend on which type of data we wish to investigate, since keeping a simple soundscape is crucial for the precise perception of values. Still, it is possible to encode some kind of information about levels even for more complicated data. Future tests should try to investigate the threshold of complexity for soundscapes that are still useful for data mining (i.e. that still convey some kind of numerical value). This is important in order to avoid listener fatigue and information overload, which were common problems during this work.
Acknowledgments
We gratefully acknowledge the support of the 3DVDM project by the Danish Research Councils, grant no. 9900103. We also thank Jock A. Blackard and Colorado State University for making the Forest Cover database available.
References
1. Asimov, D.: The grand tour: a tool for viewing multidimensional data. SIAM J. Sci. Stat. Comput. 6(1), 128–143 (1985)
2. Inselberg, A., Dimsdale, B.: Parallel coordinates: a tool for visualizing multidimensional geometry. In: VIS 1990: Proceedings of the 1st Conference on Visualization 1990, pp. 361–378. IEEE Computer Society Press, Los Alamitos (1990)
3. Nelson, L., Cook, D., Cruz-Neira, C.: XGobi vs the C2: Results of an experiment comparing data visualization in a 3-D immersive virtual reality environment with a 2-D workstation display. Computational Statistics: Special Issue on Interactive Graphical Data Analysis 14, 39–51 (1999)
4. Nagel, H.R., Granum, E., Musaeus, P.: Methods for visual mining of data in virtual reality. In: Proceedings of the International Workshop on Visual Data Mining, in conjunction with ECML/PKDD 2001 (2nd European Conference on Machine Learning and 5th European Conference on Principles and Practice of Knowledge Discovery in Databases), Freiburg, Germany, September 2001, pp. 13–28 (2001)
5. Nagel, H.R.: Exploratory Visual Data Mining in Spatio-Temporal Virtual Reality. PhD dissertation, Faculty of Engineering and Science, Aalborg University, Denmark (2005)
6. Granum, E., Musaeus, P.: Constructing virtual environments for visual explorers. In: Qvortrup, L. (ed.) Virtual Space: The Spatiality of Virtual Inhabited 3D Worlds. Springer, Heidelberg (2002)
7. Symanzik, J., Cook, D., Kohlmeyer, B.D., Lechner, U., Cruz-Neira, C.: Dynamic statistical graphics in the C2 virtual environment. In: Second World Conference of the International Association for Statistical Computing, Pasadena, California, USA, February 1997, vol. 29, pp. 35–40 (1997)
8. Wegman, E.J., Symanzik, J.: Immersive projection technology for visual data mining. Journal of Computational and Graphical Statistics 11(1), 163–188 (2002)
9. Carr, D.B., Nicholson, W.L.: Evaluation of graphical techniques for data in dimensions 3 to 5: Scatterplot matrix, glyph, and stereo examples. In: Proceedings of the Section on Statistical Computing, American Statistical Association, Alexandria, VA, pp. 229–235 (1985)
10. Pickett, R.M., Grinstein, G.: Iconographic displays for visualizing multidimensional data. In: Proceedings of the IEEE Conference on Systems, Man and Cybernetics, Beijing and Shenyang, People's Republic of China, pp. 514–519 (1988)
11. Ribarsky, W., Ayers, E., Eble, J., Mukherjea, S.: Glyphmaker: Creating customized visualizations of complex data, pp. 57–64. IEEE Computer, Los Alamitos (July 1994)
12. Chernoff, H.: The use of faces to represent points in k-dimensional space graphically. Journal of the American Statistical Association 68(342), 361–368 (1973)
13. Healey, C.G., Enns, J.T.: Large datasets at a glance: Combining textures and colors in scientific visualization. IEEE Transactions on Visualization and Computer Graphics 5(2), 145–167 (1999)
14. Ebert, D.S., Rohrer, R.M., Shaw, C.D., Panda, P., Kukla, J.M., Roberts, D.A.: Procedural shape generation for multi-dimensional data visualization. In: Gröller, E., Löffelmann, H., Ribarsky, W. (eds.) Data Visualization 1999, pp. 3–12. Springer, Wien (1999)
15. Brown, M.L., Newsome, S.L., Glinert, E.P.: An experiment into the use of auditory cues to reduce visual workload. In: CHI 1989 Proceedings, New York, USA (1989)
16. Mutanen, J.: Perception of sound source distance. Nokia Research Center (2003)
17. Blauert, J.: Spatial Hearing: The Psychophysics of Human Sound Localization. The MIT Press, Cambridge (1997)
18. Nagel, H.R., Granum, E.: VR++ and its application for interactive and dynamic visualization of data in virtual reality. In: Proceedings of the Eleventh Danish Conference on Pattern Recognition and Image Analysis, Copenhagen, Denmark (August 2002)
19. Nagel, H.R., Granum, E.: A software system for temporal data visualization in virtual reality. In: Proceedings of the Workshop on Data Visualization for Large Data Sets and Data Mining, Department of Computer Oriented Statistics and Data Analysis, University of Augsburg, Augsburg, Germany (October 2002)
20. Zahorik, P.: Auditory display of sound source distance. In: Proceedings of the 2002 International Conference on Auditory Displays, Kyoto, Japan (July 2002)
21. Blackard, J.A.: Comparison of Neural Networks and Discriminant Analysis in Predicting Forest Cover Types. PhD dissertation, Department of Forest Sciences, Colorado State University, Fort Collins, Colorado (1998)
22. Lawrence, J.D.: A Catalog of Special Plane Curves. Dover, New York (1972)
23. Cundy, H., Rollett, A.: Mathematical Models, 3rd edn. Tarquin Publications, Stradbroke (1989)
24. Gray, A.: Modern Differential Geometry of Curves and Surfaces with Mathematica, 2nd edn. CRC Press, Boca Raton (1997)
DataJewel: Integrating Visualization with Temporal Data Mining
Mihael Ankerst¹, Anne Kao², Rodney Tjoelker², and Changzhou Wang²
¹ Baaderstrasse 49, 80469 München, Germany
[email protected]
² The Boeing Company, P.O. Box 3707 MC 7L-70, Seattle, WA 98124-2207
{anne.kao,rod.tjoelker,changzhou.wang}@boeing.com
Abstract. In this chapter we describe DataJewel, a new temporal data mining architecture. DataJewel tightly integrates a visualization component, an algorithmic component and a database component. We introduce a new visualization technique called CalendarView as an implementation of the visualization component, and we introduce a data structure that supports temporal mining of large databases. In our architecture, algorithms can be tightly integrated with the visualization component and most existing temporal data mining algorithms can be leveraged by embedding them into DataJewel. This integration is achieved by an interface that is used by both the user and the algorithms to assign colors to events. The user interactively assigns colors to incorporate domain knowledge or to formulate hypotheses. The algorithm assigns colors based on discovered patterns. The same visualization technique is used for displaying both data and patterns to make it more intuitive for the user to identify useful patterns while exploring data interactively or while using algorithms to search for patterns. Our experiments in analyzing several large datasets from the airplane maintenance domain demonstrate the usefulness of our approach and we discuss its applicability to domains like homeland security, market basket analysis and web mining.
1 Introduction
In recent years, there has been much interest in the data mining technical community in mining of temporal data. Temporal datasets have a dedicated attribute storing a time stamp for each record. This time stamp usually refers to the time when an event occurs or when a data record has been measured and collected. Examples of temporal datasets include stock market data, manufacturing or production data, maintenance data, event data, web mining and point-of-sale records. Typically, in different domains different kinds of temporal patterns are of interest; an overview is provided in [2]. To address this need, we designed our architecture for DataJewel to provide access to many temporal data mining algorithms and an easy way to add new ones. DataJewel provides an extensible framework for mining of temporal data with tightly integrated visualization, algorithm, and database components.
When dealing with temporal databases, the need to integrate additional data sources is an important challenge and another substantial aspect motivating our architectural design. In large enterprises, databases evolve as a consequence of an organizational need. They are designed to serve a specific (e.g. operational) purpose. Often databases from different organizations can be linked together to serve a new purpose, for example, to provide a platform for data mining. It is often desirable or necessary to add new data sources. However, the task of linking databases together is far from trivial; the field of information integration deals with the challenging and laborious problems of maintaining data integrity, mapping schemas, and resolving duplication. Often, there is no common attribute at all except the timestamp. By linking database tables together to explore the union of the attributes with respect to time, a powerful new view of the data is obtained. For example, an enterprise can link together helpdesk data concerning computer problems with a completely independent table from the procurement department and a labor database; the detected patterns might reveal insights into the causes of computer problems and might inform a new purchasing strategy. In this chapter, we describe contributions toward visual data mining in the temporal domain. We show that a system designed to tightly integrate components from various disciplines, i.e. visualization, data mining algorithms and database systems, can substantially improve functionality compared to systems with loosely coupled components. The rest of the chapter is organized as follows: In Section 2, we summarize related work. Section 3 describes a user-centric data mining process and the DataJewel architecture. Section 4 presents the visual component of our architecture and describes in detail our new visualization technique called CalendarView. Section 5 outlines how temporal data mining algorithms can be tightly integrated into DataJewel. Section 6 reports how to handle large datasets. In Section 7, we describe several experiments with large datasets. We draw conclusions in Section 8 and mention several future directions.
2 Related Work
Our main contribution to the area of temporal data mining is to tightly integrate a visual component, an algorithmic component and a database component. To our knowledge no such architecture has been proposed so far. Most of the work in temporal data mining deals with either just visualization techniques, an algorithmic approach, or an approach to scale up to large datasets. We review these areas in this section. Many approaches for visualizing time-dependent data have been proposed. Typically, visualization techniques represent temporal data either as a sequence along an axis or as animations where data at different times is represented in different frames. A recent approach which treats data as a sequence is ThemeRiver [6]. It employs the metaphor of a current and maps histograms of document keywords to the height of a wave at a particular time. Mackinlay et al. [11] use a spiral for calendar visualization; however, calendar days are merely used as reference points. Hierarchical pixel bar charts [8] are not aimed at visualizing temporal data, but they can be used as an alternative pixel representation within a day.
Several algorithms for mining temporal datasets have been proposed [2]. Contributions have been made in the areas of modeling temporal sequences, defining suitable similarity measures for sequences and determining what kinds of mining operations can be performed. We show in Section 5 that many existing algorithms can be leveraged by our architecture. Tightly integrated architectures have been proposed, but are only partially comparable to our approach. In [1], the authors describe an approach called cooperative classification, where the visualization and the algorithmic component are tightly integrated. This approach, however, was specifically designed for decision tree classification and does not elaborate on scalability issues. Similarly, HD-Eye [8] and n23Tool [15] integrate visualization with algorithms but are applicable only to clustering methods. Van Wijk et al. [14] represent clusters of time series data that contain a pattern spanning one day and relate them to days with similar patterns. In contrast to our approach, theirs does not represent the data for each day, nor does it cover scalability issues. Tightly integrating algorithms with databases, or incorporating scalability considerations into data mining algorithms, has been recognized and studied more extensively; a comprehensive survey is presented in [10]. Proposed ways to achieve scalability fall into one of three categories: design of a fast algorithm (e.g. by restricting the model space or by parallelization), partitioning of the data (instance/feature selection methods) and relational representations (e.g. integration of data mining functionality in database systems). Recent approaches include the computation of sufficient statistics, similar to what RainForest [4] does for decision trees. Sarawagi et al. [12] describe an in-depth analysis of different levels of integration of an association mining algorithm into database systems. CONTROL [7] aims at a database-centric interactive analysis of large datasets, focusing on online query processing. All these approaches, however, are not directly applicable to temporal data.
3 User-Centric Data Mining
One design goal of our user-centric architecture is intuitive use by domain experts, as opposed to usability only by data mining experts. It leverages users' domain knowledge, and also accommodates a user's individual interests and needs. As a result, users can steer the exploration of temporal data, invoke algorithms to automatically discover patterns, incorporate their domain knowledge, hypothesize on the fly and use their perception to detect patterns of interest. In Figure 1, we outline the mining process within DataJewel. First, the user selects data tables and attributes for analysis. Then the data is loaded and visualized. The user has the option of invoking an algorithm and visualizing the resulting discovered patterns using the current settings. Alternatively, users can interact with the visualization to incorporate their domain knowledge or to discover patterns based on their own visual perception. Optionally, the user selects a date range to narrow in on a subset of interest and visualizes it with the same or another visualization technique. Other visualization techniques might be picked to view the data in a different way, or because they are more suitable due to the reduced amount of data after the selection. After the user has iterated this loop several times, the user might be interested in “drilling down” to the raw data to see all attributes. The system then accesses the corresponding tables, and
[Fig. 1 flowchart: user selects data source/attributes → data is loaded → data is visualized → user invokes an algorithm, interacts with the visualization, selects a visualization technique, or selects a subset of events → raw data is displayed]
Fig. 1. The mining process within DataJewel
the data is retrieved and presented. Note that this framework facilitates extensions by allowing the incorporation of new algorithms and visualizations. DataJewel's architecture enables users to interactively explore their data using both visual and algorithmic techniques, and it allows deeper exploration of subsets as well as the incorporation of additional data to discover relationships and patterns. In this section, we introduce some terminology and state the assumptions of our architecture. Let us assume the data sources consist of a set of tables. Each table contains r records, with each record consisting of d attributes a1, …, ad. At least one attribute contains a timestamp for each record. We refer to the timestamp attribute as the event date; all categorical attributes that could be incorporated in the analysis are event attributes, and the attribute values of these event attributes are events. Event types are the unique set of different events in an event attribute. Note that continuous attributes can be treated as event attributes after discretization. While extensions can be made, in this chapter we focus on event attributes (categorical attributes only) for which the following assumptions hold:
a) The number of event attributes is low (< 10).
b) The number of event types in any one event attribute is moderate (< 200).
c) The smallest time unit of interest in the event dates is one day.
Assumption a) restricts the number of event attributes used during the analysis. As opposed to the high-dimensional feature vectors on which some mining tasks are performed, event attributes usually have a clear meaning. In some cases, high-dimensional feature vectors can be converted into categorical events using pattern recognition or classification. Often, for a particular analysis, the analyst will select a small number of event attributes which are associated with each other in the particular domain of interest. Using domain knowledge, the remaining attributes are omitted
from this particular analysis because they would just add noise. Assumption b) limits the number of unique event types of an event attribute to a moderate size. In cases where an event attribute has a large number of unique event types, a user can reduce the total number of unique event types by grouping them through the definition of a concept hierarchy, as sketched below. With assumption c) we focus on the most common time unit of interest in many business domains. Note that days are just the smallest unit of interest; the discovery of weekly or monthly patterns is also supported. Obviously, for applications such as intrusion detection systems, our proposed unit of time would have to be refined to reflect finer-grained time units.
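Such a grouping can be as simple as a lookup table mapping raw event types to higher-level concepts. The following Python sketch is hypothetical; the event type names are made up for illustration:

concept_hierarchy = {
    "index.html": "main site",
    "dep1/contacts.htm": "dep1/",
    "dep1/staff.htm": "dep1/",
    "dep2/index.htm": "dep2/",
}

def generalize(event_type, hierarchy, default="other pages"):
    # Map a raw event type to its higher-level concept, reducing
    # the number of unique event types (assumption b).
    return hierarchy.get(event_type, default)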
[Fig. 2 diagram: three layers, from top to bottom Visualization, Algorithms and Data sources; the data flows upward during the mining process, while DataJewel's user interface flow runs in the opposite direction]
Fig. 2. Data flow versus design of DataJewel
In Figure 2, a simplified view of the DataJewel architecture is depicted, consisting of three layers. Although the data flows from the data sources to the visualization layer, we have designed our user interface to flow in the opposite direction to better support a user-centric process. We use visualization to drive the analysis process, allowing users to explore, choose algorithms, and view data. Corresponding to each of these layers, we will describe the visualization, the algorithmic and the database components. For simplicity, we only present one instance each of the visualization and the algorithmic component; new visualization techniques and new algorithms can easily be added.
4 The Visualization Component
The visualization component contains visualization techniques suitable for representing temporal data. In this section, we present a new visualization technique, CalendarView, which represents temporal data on a daily basis while also allowing the representation of weeks and months.
4.1 CalendarView
Our architecture is designed for use by domain experts, not just data mining experts. Thus the visualization component is intended to be intuitive and versatile. CalendarView, our new visualization technique, is motivated by representations people are already very familiar with. First, the representation of event dates is designed using the visual metaphor of a calendar. Second, the frequency of events is represented along the event-date timeline; this representation builds on the familiarity of humans with histograms. In simpler linear representations, time is greatly simplified by modeling it as a sequence of dates. In contrast, we have selected the calendar metaphor because it reflects the rich temporal structure more effectively than typical simplified representations. From a calendar, a human preattentively extracts the notion of weekends, weekly repetitions, seasons, days with a special meaning in his domain, etc. Whereas the calendar metaphor is used to represent the event dates on a daily basis, an extended version of histograms reflects the distribution of events for a single day. To enable the user to compare different event attributes with each other, each event attribute is represented by a separate calendar. In the final visualization all calendars are drawn one above the other.

Table 1. Example of a temporal dataset

Event Date   Event Attribute: Page hit   Event Attribute: Browser   Event Attribute: …
1/1/2002     Index.html                  MS IE                      …
1/1/2002     Dep1/contacts.htm           Netscape                   …
…            …                           …                          …
Table 1 depicts an example of a temporal dataset. Each event has an associated event date, so we can count the frequency of an event type occurring on a single day. Repeating this for each event type of an event attribute, we can display the distribution of event types for this event attribute as a histogram. We initially assign a different color to each event type of one event attribute. The default color map is the PBC color map [1], which was developed to map distinct values to distinct colors. As illustrated in Figure 3, for each day the frequency distribution of the events is represented within the corresponding day in the calendar. Note that the color mapping in this chapter is not the original color assignment; it has been changed to optimize for grayscale printing. Instead of using colored histograms where frequency is depicted by the height of the bins, the events are represented pixel by pixel, to account for more categories than are usually depicted by a histogram. In particular, each day is filled with pixels in the following way:
[Fig. 3 illustration: the distribution of events e1, e2, e3, e4 on January 1st, 2002, drawn as colored pixels inside the corresponding day square of a weekly calendar layout]
Fig. 3. Illustration of CalendarView
Each day is represented by a constant-size square of n by n pixels. If the total number of events of this event attribute on the corresponding day is less than or equal to n², we can use one pixel per event. The pixel arrangement starts in the lower left corner of the day square: it goes up (n−1) times, goes one pixel to the right, then goes (n−1) pixels down, one pixel to the right, and so forth. If the total number of events exceeds n², we scale the number of pixels assigned to each event type accordingly. Following the illustration in Figure 3, let us assume we have four event types: e1, e2, e3 and e4. The frequency of the occurrence of e1 on a particular day is denoted by f(e1, date), the frequency of occurrence of e2 by f(e2, date), etc. Across all dates in the dataset, the date with the maximum number of event occurrences is denoted by date_max. On a particular date, an event e is represented by pix(e, date) pixels as shown in formula (1). We draw the first pix(e1, date) pixels with the color assigned to e1. Following the described pixel arrangement, the next pix(e2, date) pixels are drawn in the color of event e2, and so forth.
pix(e, date) = ⌈ ( f(e, date) / Σ_e f(e, date_max) ) · n² ⌉    (1)
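A compact sketch of this filling procedure follows (Python, with our own function and variable names; the pixel counts follow formula (1) and the traversal follows the up-and-down column order described above):

import math

def pixels_for_events(day_counts, max_total, n=10):
    # day_counts: list of (event_type, frequency) for one day, in drawing
    # order; max_total: total number of events on the busiest day.
    colored = []
    for event, freq in day_counts:
        k = math.ceil(freq / max_total * n * n)  # pix(e, date), formula (1)
        colored.extend([event] * k)
    return colored[: n * n]  # rounding up may slightly overfill the square

def square_positions(n=10):
    # Column-wise traversal, alternating up and down,
    # starting in the lower left corner of the day square.
    for col in range(n):
        rows = range(n) if col % 2 == 0 else range(n - 1, -1, -1)
        for row in rows:  # row 0 is the bottom of the square
            yield (col, row)

Zipping square_positions(n) with the returned list gives the position and color (event type) of every pixel in the day square.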
The order of the events can greatly affect the perception of their distribution. By reordering the events for each day, preattentive processing can be improved. The reason becomes clear with the following example. Let us assume the daily distribution of ten events over one year is very similar, and even within each day the number of events does not differ largely. Let us further assume there is exactly one day where the least frequent event suddenly happens more often; on that day it is the second most frequent event. Then, in addition to this event being represented with more pixels than on other days, the reordering yields a better perception of this distribution change. Thus, the reordering improves the perception of distribution changes (cf. Figure 4, where the 4th day is reordered). Note that the daily reordering is done in real time, since
Fig. 4. Enhanced perception of changes by reordering events by frequency (left: view without reordering; right: view with reordering)
computation time is negligible due to our assumptions in Section 3. The choice of the size n² of the day square is a tradeoff between representing each event by one pixel and the size of the (virtual) screen. (Note: in this chapter, n = 10 unless otherwise specified.)
4.2 Interaction with CalendarView
A key feature of the design of DataJewel is the ability to interact with and guide the analysis and exploration of data through the visual interface. In this section, we describe the main interaction capabilities of CalendarView.
- Selection. As described in Section 3, one essential feature of the visualization component is the selection of a subset of dates. The user is able to interactively select a set of consecutive days by clicking on the visual display. The subset corresponding to the selected event dates can again be visualized, following the iterative process outlined in Section 3.
- Ascending/descending order. The decision whether the events should be ordered ascending or descending by frequency only matters when there are fewer pixels in a day square than there are events. If the frequency distribution on a particular day is highly skewed, some events might not be represented at all, because the drawing algorithm, using formula (1), might already have filled up the complete day square. In most cases the user is either interested in detecting outlier events which happen very rarely, or in the overall distribution of the “main” events. Therefore we enable the user to switch between ascending and descending order in real time. If ascending order is selected, the drawing of the pixels starts with the rarest events and thus uncovers them, at the possible expense of cutting off the most frequent event type at the end. If descending order is selected, the most frequent events are drawn first.
- Interactive color assignment. Initially, colors are assigned to events based on the PBC color map; missing values can thus be treated as a distinct event and are assigned a specific color (background by default). A dialog window enables the user to interactively assign colors to event types. With manual color assignment, users can incorporate their domain knowledge, formulate and test a hypothesis on the fly, or steer the exploration in a
meaningful way. The notion of color assignment is implemented as follows: if the user changes several event types to have the same color, this indicates a conceptual generalization of the events. As a result, all events which are assigned the same color are portrayed as the same event type when the visualization is redrawn. Thus, within each day, the events with the same color are grouped together before the drawing algorithm is invoked. For example, using the web mining dataset from Table 1, let us assume we are recording web page hits on a particular day. These events can be generalized by the user by assigning color c1 to all visited pages of the main website, color c2 to the dep1/ subdirectory, c3 to the dep2/ subdirectory, and color c4 to all other web pages. The user interface (depicted in Figure 5) also enables the user to sort by event type name or by frequency. Each event type has a separate color assignment. In Figure 6, two event attributes from our example dataset are visualized from January 1st, 2002 to February 18th, 2002. For the “page hits” event attribute, the user has assigned colors to four different groups of web pages as described above. We see page hits on January 1st but no more until Saturday, January 12th; this may lead to the hypothesis that the web server was down for 11 days. The event attribute “browser” also has its first event only on Sunday, January 20th; this may lead to the hypothesis that the web server did not record the browser type until that day. We also see that only two different browser types have been recognized; one browser type has been used more frequently throughout the whole time period.
- Zooming. The user can zoom in or out on the visualization to more closely inspect particular dates or to see a longer stretch of dates.
- Detail on demand. Detailed information on the event corresponding to the pixel under the current mouse pointer position is displayed when the user hovers over a particular point of interest in the visual display.
Fig. 5. Interactive assignment of colors
Fig. 6. CalendarView with web mining dataset
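The color-generalization rule described above, where events sharing a user-assigned color are merged before drawing, can be sketched as follows (hypothetical Python; the color names mirror the example in the text):

from collections import Counter

def group_by_color(day_events, color_of):
    # day_events: event types occurring on one day;
    # color_of: the user-chosen color per event type (cf. Fig. 5).
    grouped = Counter(color_of[e] for e in day_events)
    # Each color now acts as a single generalized event type,
    # returned in descending frequency for drawing.
    return grouped.most_common()

color_of = {"index.html": "c1", "about.html": "c1",
            "dep1/contacts.htm": "c2", "dep2/index.htm": "c3"}
print(group_by_color(["index.html", "about.html", "dep1/contacts.htm"],
                     color_of))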
5 The Temporal Mining Component
Building the visualization component, we introduced a visualization technique called CalendarView, which maps different events to distinct colors. When there are just a few event types, the visualization itself is a very powerful analytic tool, since human preattentive perception is very efficient at finding a variety of patterns. If the number of different event types is larger, the usefulness of the default color assignment decreases, because the color variations are no longer perceived as distinct. Nevertheless, the visualization technique might still reveal patterns, since changes in the event distribution might still be perceived. With the interface for interactive color assignment, we introduced the concept of reassigning colors to address the challenge of handling a larger number of event types. However, if the focus is not on a certain known event type, manually changing a random sequence of colors can quickly become tedious. In addition, some kinds of patterns of interest may be difficult to find unless a very specific color scheme is chosen out of very many possibilities. This motivated us to include a tight integration of temporal mining algorithms into the visualization system. We include algorithms that discover patterns, determine the events involved in those patterns and use this information to automatically select colors that reveal those patterns to the user. This automatic color selection can be invoked at any time during the exploration to compute a reasonable default color assignment, and the user can modify the color assignments incrementally. In summary, two aspects of our architecture contribute to the intuitive cooperative exploration of the data by the user and the algorithms. First, CalendarView visualizes not just the data but also the patterns. Second, the same color assignment interface is used by both the user and the algorithms. We now focus on how the following three classes of algorithms use color assignment:
• Discover an interesting pattern in one single event type of one event attribute
• Discover an interesting pattern including multiple event types of one event attribute
• Discover an interesting pattern across multiple event attributes involving one event type per event attribute (an extension is that the user selects one event type and lets the algorithm detect events of other event attributes which show some relation to the selected event, e.g. similarity, correlation, etc.)
Discover a pattern in one event type of one event attribute. Many existing algorithms find patterns in one single event type based on some measure of interest [2]. These measures can range from basic statistical methods such as “highest variance” to more computationally expensive ones such as “most interesting trend”. No matter how an algorithm computes the single-event pattern of interest, our approach encapsulates it and changes the color assignments accordingly: all colors but one are changed to one light color, whereas the event type for which the pattern was found is assigned a unique dark color. Thus the user's attention is focused on the distribution of this single event type in relation to the overall frequency of all events. We have included the following implementation of such an algorithm, called LongestStreak, which is based upon the idea of stabilized p charts from the statistical field of control charting [13]:
1. For each event type e, compute a sequence of relative frequencies as follows: for each day, compute the percentage of occurrences of event type e among all events occurring on that day.
2. Compute the weighted mean and standard deviation of each sequence, considering just the days that are event dates.
3. Label each day where the relative frequency of event type e is significantly below or above its mean as a significant day with respect to event type e.
4. Return the event type with the longest streak of consecutive significant days. Break ties by returning the first one found.
Alternatively, we could modify step 4 to return the event type with the most significant days. After the visualization is updated based on the discovered event type, the user can see the discovered pattern and continue the exploration process.
Discover a pattern that includes multiple event types of one event attribute. Again, many algorithms have been proposed which compute this class of patterns [2], e.g. the discovery of similar events. The algorithm returns a set of events which together represent a pattern. Our architecture changes the color assignment such that each event that is part of the pattern is assigned a distinct color, and all other events are assigned one color. Our implemented instance of this class of algorithms, called MatchingEvents, extends the LongestStreak algorithm described above:
1. For each event type, compute the significant days and record a bit sequence having a ‘1’ for each significant day and a ‘0’ otherwise.
2. Take the event type returned by LongestStreak as the baseline event type.
3. Compare the bit sequence of the LongestStreak event type with all others to find the closest match. The match is defined as the number of common significant days, which is calculated using a bitwise AND operation. The event whose bit sequence has the highest match count is the correlated event.
4. Return the LongestStreak event type and the correlated event type.
Discover a pattern across event attributes. The two previous algorithms look for patterns in one single event attribute. In contrast, this class of algorithms looks for patterns relating event attributes to each other, instead of analyzing them separately. Many proposed algorithms fall into this class, e.g. finding similar event patterns across different event attributes. The resulting pattern is visualized by updating the color assignments of each event attribute accordingly. We implemented an instance of this class very similar to MatchingEvents, but instead of comparing the LongestStreak of the first event attribute to other events of the same attribute, it is compared to all event types of the other event attributes. The algorithm returns the LongestStreak of the first event attribute and, for each other event attribute, the event type correlated with it. In the experimental section, we refer to this algorithm as MatchingEvents2.
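A sketch of LongestStreak along the four steps above follows (Python; we simplify the stabilized-p-chart significance test of step 3 to a plain two-standard-deviation band and use an unweighted mean in step 2, so this approximates rather than reproduces the chapter's method):

import statistics

def longest_streak(daily_counts):
    # daily_counts: {date: {event_type: frequency}} for the event dates.
    dates = sorted(daily_counts)
    totals = {d: sum(daily_counts[d].values()) for d in dates}
    event_types = sorted({e for d in dates for e in daily_counts[d]})
    best_event, best_len = None, -1
    for e in event_types:
        # Step 1: relative frequency of e on each event date.
        rel = [daily_counts[d].get(e, 0) / totals[d] for d in dates]
        # Step 2: mean and standard deviation of the sequence.
        mu, sigma = statistics.mean(rel), statistics.pstdev(rel)
        # Step 3: flag significant days (simplified significance test).
        significant = [abs(p - mu) > 2 * sigma for p in rel]
        # Step 4: longest run of consecutive significant days.
        run = longest = 0
        for s in significant:
            run = run + 1 if s else 0
            longest = max(longest, run)
        if longest > best_len:  # strict '>' keeps the first one found
            best_event, best_len = e, longest
    return best_event

MatchingEvents can reuse the significant-day sequences computed here, comparing the baseline's sequence against all others with a bitwise AND.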
6 The Database Component

In this chapter, we assume the datasets reside in tables from one or more relational databases. The integration of a database component provides access to the data, a mechanism to scale up to large datasets, and the capability to access the raw data of all attributes associated with the patterns found. The critical requirement of the database component is to scale up to large databases, with respect to both visualization and memory. The first aspect, namely how to visualize large datasets, is addressed by visualizing the relative frequency of events on a single day, as described in Section 4. In this section, we describe how large datasets are processed.

The fundamental idea is to compute an aggregated version of the dataset such that it fits in main memory. The aggregated dataset contains sufficient statistics, similar to e.g. [4] for decision trees, and we show the upper bound of the main memory requirements based on our assumptions stated in Section 3.

Consider our example dataset from Table 1. Assume the full dataset consists of millions of rows, since each occurrence of an event is typically stored as one record. If we use the aggregation capabilities of the database, the number of records that are loaded can be significantly reduced. Instead of storing each occurrence of an event, we count within each day the number of occurrences of each event type. For example, the sufficient statistics for event attribute “page hits” can be computed by submitting the following SQL query:

SELECT Event_date, page_hits, count(*) as Frequency
FROM example_table
GROUP BY Event_date, page_hits
ORDER BY Event_date, page_hits;

The resulting table is sketched in Table 2. The amount of compression achieved by aggregation depends on the number of distinct event dates, the number of distinct events, and how distinct events are distributed across the dates.

Table 2. Sufficient statistics for event attribute “page_hits”
Event Date    Event Type (page hits)    Frequency
1/1/2002      Index.html                1934
1/1/2002      Dep1/contacts.html        36
…             …                         …
The memory requirement for our initial dataset is proportional to the number of entries in a relational table. For one event attribute, the event dates and events of this attribute have the memory requirements mem_init, with

mem_init ∝ number of days × average number of events per day

In contrast, the memory requirement mem_new for the computed sufficient statistics table (Table 2) is

mem_new ∝ number of days × average number of distinct events per day
The difference in memory usage is the ratio between the average number of events per day and the average number of distinct events per day. This ratio will vary with the domain and the event attribute. For example, we analyzed aircraft maintenance data for an airline with the following statistics:

Average number of events per day: 402
Average number of distinct events per day: 32

The ratio in this example is 12.5:1. Whereas the number of records of the initial dataset grows linearly with every new event, our new table typically just increments a counter. This is most useful in domains where the number of events per day is very high, such as web page accesses, items in market baskets across departments, phone calls, etc.

Given our assumptions from Section 3, the worst-case memory requirement mem_worst for the sufficient statistics table of one event attribute can be computed for, e.g., 15 years:

mem_worst ∝ 15 years × 365 days × 200 distinct events = 1,095,000 entries

In this case, every event happens at least once every day during a period of fifteen years. We can store each event with one byte (plus a small lookup table) and the days and frequencies as integers with 4 bytes each. The sufficient statistics table would then require 1,095,000 × (1 + 4 + 4) bytes, i.e., about 9.8 megabytes. Together with our assumption that the number of event attributes is low, we conclude that the sufficient statistics tables fit in main memory for many domains.

To summarize, the database component is integrated in two ways. First, the relevant event attributes of the original tables are compressed by computing the summary statistics offline. Second, database access is provided in a straightforward way: since the user selects subsets of the initial time period during the exploration process, the user can decide to retrieve the records with all attributes corresponding to the selected time period. A range query over the time period then returns the raw data of interest.

In our experiments, the computed summary statistics always fit in main memory and the computation of the proposed algorithms is efficient. We believe both hold for most datasets which fulfill our assumptions in Section 3. However, if more attributes are involved in an algorithmic run, or the integrated algorithms are more complex, then a tighter integration with the database component might be necessary. For example, algorithms might be decomposed and leveraged by SQL extensions, or user-defined functions could be used. If the algorithmic run is pushed down to the database, the user can continue to explore the data and receive a notification after the computation is finished.
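The worst-case arithmetic above is easy to check mechanically. The following back-of-the-envelope sketch simply codifies the numbers from the text (a one-byte event code and four-byte integers for day and frequency); it is not part of DataJewel:

def sufficient_statistics_bytes(years=15, distinct_events=200):
    entries = years * 365 * distinct_events   # 1,095,000 table entries
    return entries * (1 + 4 + 4)              # event + date + frequency

print(sufficient_statistics_bytes())          # 9,855,000 bytes, the "about
                                              # 9.8 megabytes" quoted above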
7 Experiments

In our experiments, we investigate several real-world datasets from the airplane maintenance domain. We believe the scenario we describe in this section applies similarly to many other domains such as homeland security, web mining, market basket analysis, or intrusion detection. In our experiments, the datasets are tables from a database containing maintenance events of different airlines for different airplane models (an airplane model is, e.g., a 747 or 767).
Maintenance events range from negligible ones, like cleaning coffee stains on a seat, to major ones, like repairs on a landing gear. Each record has information about the date a maintenance action occurred, the airport where it was recorded, who discovered it, the text description of the issue, the maintenance action taken, the system and subsystems affected by the issue, etc. We will focus our analysis on the affected systems, which will be our event attribute. A system is a set of related parts that work together to perform a function, such as communication, engine, flight control, doors, etc.

Table 3. Datasets
Dataset   Dates                Nr. events   Nr. records   Nr. records (suff stat)
A         3/6/89-12/31/02      37           350,772       87,030
B         5/12/90-12/31/02     39           1,165,881     117,441
C         1/30/89-12/31/02     41           1,505,582     133,116
D         3/6/89-12/31/02      28           350,772       78,802
E         11/12/89-12/31/02    41           2,051,269     162,918
F         1/12/89-12/31/02     182          2,051,269     574,071
G         12/27/89-12/31/02    40           17,499        11,547
Table 4. Runtime (in seconds) of algorithms

Dataset   LongestStreak   MatchingEvents   MatchingEvents2
A         0.27            0.31             0.53 (with B)
B         0.31            0.30             0.62 (with C)
C         0.35            0.36             0.54 (with D)
D         0.28            0.26             0.63 (with E)
E         0.37            0.36             0.90 (with F)
F         0.71            0.68             0.87 (with G)
G         0.23            0.22             0.47 (with A)
Metadata about the various datasets we explored is depicted in Table 3. The recorded maintenance datasets span time periods between twelve and fourteen years. Table 4 shows the runtime of our implemented algorithms on the datasets. For the algorithm MatchingEvents2, we also indicate in brackets which other dataset has been included as the second event attribute. We ran all experiments on a PC with a Pentium III/800 MHz processor and 1 GB main memory. For each dataset, we achieve an acceptable runtime. These runtimes show that it is feasible to run the algorithms as part of an interactive analysis system.

7.1 Mining Airplane Maintenance Datasets

In this section, we describe a typical analysis scenario to illustrate how our approach can be used. We start our investigation by selecting a dataset from one airline and one airplane model. The chosen event attribute which we analyze over time is the airplane system. Since there are many different systems, we select the algorithm LongestStreak to identify an interesting system. In this case, it found a pattern in “engine-fuel” events. DataJewel updates the color assignment to highlight the pattern it found.
The top row of Figure 7 shows a small range of the resulting visualization. Especially during the last five days of July 2000, we perceive many more engine-fuel system events than usual.

Next, we add several datasets to compare this finding with patterns for different airlines. For each airline and the same model, we manually change the color assignment of the airplane systems. We color every airplane system, except engine-fuel, with a light color and assign a dark color to all engine-fuel related events. When we compare these airlines (two more airlines are shown in Figure 7), we see that the other airlines do not show any pattern during the same period. Even though just a small time range is shown, the results are the same for all event dates. Based on this result, we might decide to further investigate the first airline.

Next, we add to the first dataset another dataset which aggregates the maintenance events of individual airplane IDs for the same airline and airplane model over time. The event attribute of the newly added dataset is the airplane ID, and we would like to find a correlation between the events we identified concerning engine fuel and the maintenance events of individual airplanes. We run the algorithm MatchingEvents2, which singles out one airplane. This airplane is shown in Figure 8, and we see that many maintenance events for this single airplane occurred on December 3rd, 1997. Note that for brevity we have omitted a screenshot of the corresponding time range of Figure 7.

Finally, we select a dataset of maintenance events for just this airplane. The event attribute is again the airplane system. We run the algorithm MatchingEvents to see if two events frequently co-occur. A part of the resulting visualization is shown in Figure 9. The two correlated events returned are fuel and communications, indicated by the black and light gray colors. For example, on Monday, November 18, both events co-occur. With this knowledge we drill down to the raw data to further investigate the findings.
Fig. 7. Maintenance events in the same subsystem for three different airlines
Fig. 8. CalendarView focusing on maintenance events for one airplane
Fig. 9. CalendarView focusing on maintenance events in two subsystems for one airplane
7.2 The DataJewel System

We have implemented the DataJewel system based on the architecture described in this chapter. It can quickly be adapted to new domains since it is designed to be extensible with new visualization techniques and new algorithms. In Figure 10, two useful features are shown.

First, the raw data from the underlying dataset can be accessed, displayed, saved, or printed. Whereas the temporal analysis may be based on just a few event attributes from possibly several different tables, the user may be interested in other attributes of the records corresponding to the pattern found. As the data has been distilled and narrowed down during the exploration, the current range of event dates represents the data subset of interest. Thus just one range query is submitted against the database(s) to retrieve the attributes of interest.

Second, an optional navigation tree on the right side of the user interface depicts the exploration process. The simplified temporal mining process presented in Section 3 focuses on the iterative process of reducing the dataset. However, at some point the user may wish to return to a previous stage, either because something of interest has been found or because nothing of interest appeared in that particular subset. The navigation tree on the right side shows a node for each subset of data explored so far and allows the user to return to a node or annotate a node.

The algorithmic component can be used in three ways. It can be used to determine the default color mapping, it can be invoked at any time during the exploration
Fig. 10. Screenshot of DataJewel
process, or it can run as a background process in parallel to the user’s exploration and notify the user upon discovery of patterns. In addition to updating the color assignment after patterns have been found, the event dates not covered by the patterns can be grayed out. Alternatively, the patterns can be displayed in textual form.

7.3 Discussion

We believe the DataJewel architecture is also well suited to address areas like homeland security, market basket analysis, or intrusion detection. Homeland security tasks, such as identifying suspicious behavior, can be supported by our architecture in several powerful ways. For example, different event attributes can be associated with each other even though their events take place on different dates, in different months, or possibly in different years. In an intrusion detection application, data may be aggregated hourly instead of daily; an additional visualization technique would therefore need to be added to the visualization component. The DataJewel architecture allows new components to be added to address such particular needs.

In the context of market basket analysis, many algorithms have already been proposed and successfully used to find patterns. An example pattern may look something like: if a customer buys bread and sugar, then she is likely to buy beer as well. These algorithms look for items that are frequently bought together; however, they do not make use of the timestamp that is associated with each transaction. Analyzing market basket databases including the timestamp can reveal a new set of patterns, such as: customers are likely to buy cereal and fruits at the beginning of the week and alcohol and candies at the end of the week. Note that our approach would be suitable for these datasets even though the dimensionality of market basket databases is typically very high (hundreds or thousands of items). Each item is usually modeled as one attribute, and a record corresponds to all items purchased by one customer. Instead, in our approach we map all items to different events of one attribute and store the frequency of the corresponding items bought per day. If the number of items is very large, a concept hierarchy could be used to generalize to fewer items, as outlined in Section 3.
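As a hypothetical illustration of such timestamp-aware basket analysis, item frequencies could be aggregated per weekday as follows; the transactions are invented and none of this is DataJewel code:

from collections import Counter
from datetime import date

# (purchase date, item) pairs; invented sample data.
purchases = [(date(2002, 1, 7), "cereal"), (date(2002, 1, 7), "fruits"),
             (date(2002, 1, 11), "alcohol"), (date(2002, 1, 12), "candies"),
             (date(2002, 1, 14), "cereal")]

# Frequency of each item per weekday (0 = Monday ... 6 = Sunday).
by_weekday = Counter((d.weekday(), item) for d, item in purchases)
print(by_weekday)  # e.g. (0, 'cereal'): 2, (4, 'alcohol'): 1, ...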
8 Conclusions

Visualization, mining algorithms, and databases are the main contributing areas in the field of Knowledge Discovery in Databases (KDD). Most research concentrates on just one of these areas. Our work is based on an integrated approach that we believe can significantly improve the discovery of useful and understandable temporal patterns. Our novel user-centric architecture for temporal data mining tightly integrates visualization, algorithmic, and database components.

In this chapter, we introduced a new visualization technique called CalendarView for representing temporal data. It uses the same visualization scheme for viewing the data and for viewing the patterns discovered by mining algorithms. In addition, we designed an interface for assigning colors to categories, which is used by both the user and the algorithms. Our system allows users to take advantage of both their own knowledge and analysis capabilities and the automated analyses of mining algorithms. On the one hand, the user can steer the exploration or incorporate domain knowledge; on the other hand, the algorithms can suggest meaningful color mappings based on the patterns they discover. We also described an efficient process for precomputing sufficient statistics from the initial datasets so that our approach scales up to very large databases.

In future work, we plan to apply DataJewel to different domain areas, using the extensible architecture to add new visualization and algorithmic components. We will also investigate how our approach can be extended to fit different data types like text or multimedia data.
References

1. Ankerst, M., Ester, M., Kriegel, H.-P.: Towards an effective cooperation of the computer and the user for classification. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2000, Boston, MA, pp. 179–188 (2000)
2. Antunes, C.M., Oliveira, A.L.: Temporal data mining: An overview. In: Proceedings of the Workshop on Temporal Data Mining, 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2001, San Francisco, August 26. ACM Press (2001)
3. Daassi, C., Dumas, M., Fauvet, M.-C., Nigay, L., Scholl, P.-C.: Visual exploration of temporal object databases. In: Proceedings of the 16th French Conference on Databases, BDA 2000, Blois, France, October 24-27, pp. 159–178 (2000)
4. Gehrke, J., Ramakrishnan, R., Ganti, V.: RainForest – A framework for fast decision tree construction of large datasets. Data Mining and Knowledge Discovery 4, 122–162 (2000)
5. Grinstein, G., Ankerst, M., Keim, D.A.: Visual data mining: Background, techniques and drug discovery applications. Tutorial at the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2002, Edmonton, Canada (2002)
6. Havre, S., Hetzler, E., Whitney, P., Nowell, L.: ThemeRiver: Visualizing thematic changes in large document collections. IEEE Transactions on Visualization and Computer Graphics 8(1) (January-March 2002)
7. Hellerstein, J.M., Avnur, R., Raman, V.: Informix under CONTROL: Online query processing. Data Mining and Knowledge Discovery 12, 281–314 (2000)
8. Hinneburg, A., Keim, D.A., Wawryniuk, M.: HD-Eye: Visual mining of high-dimensional data. IEEE Computer Graphics and Applications 19(5) (1999)
9. Keim, D.A., Hao, M.C., Dayal, U.: Hierarchical pixel bar charts. IEEE Transactions on Visualization and Computer Graphics 8(3), 255–269 (2002)
10. Kolluri, V., Provost, F.: A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery 2, 131–169 (1999)
11. Mackinlay, J.D., Robertson, G.G., DeLine, R.: Developing calendar visualizers for the information visualizer. In: Proceedings of the Seventh Annual ACM Symposium on User Interface Software and Technology, UIST 1994, Marina del Rey, California, November 2-4, pp. 109–118 (1994)
12. Sarawagi, S., Thomas, S., Agrawal, R.: Integrating mining with relational database systems: Alternatives and implications. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Seattle, June 2-4, pp. 343–354. ACM Press (1998)
13. Trueblood, R.P., Lovett Jr., J.N.: Data Mining and Statistical Analysis Using SQL. Apress (2001)
14. Van Wijk, J.J., Van Selow, E.R.: Cluster and calendar based visualization of time series data. In: Wills, G., Keim, D. (eds.) Proceedings of the IEEE Symposium on Information Visualization, InfoVis 1999, pp. 4–9. IEEE Computer Society, Los Alamitos (1999)
15. Yang, L.: Interactive exploration of very large relational datasets through 3D dynamic projections. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2000, Boston, MA, pp. 236–243 (2000)
A Visual Data Mining Environment

Stephen Kimani, Tiziana Catarci, and Giuseppe Santucci

Università di Roma “La Sapienza”, Dipartimento di Informatica e Sistemistica, Via Ariosto 25, 00185 Roma, Italy
{kimani, catarci, santucci}@dis.uniroma1.it
Abstract. It cannot be overstated that the knowledge discovery process still presents formidable challenges. One of the main issues in knowledge discovery is the need for an overall framework that can support the entire discovery process. It is worth noting the role and place of visualization in such a framework. Visualization enables the user to exploit his/her outstanding visual and mental capabilities, thereby gaining insight into and an understanding of the data. The foregoing points to the pivotal role that visualization can play in supporting the user throughout the entire discovery process. The work reported in this chapter is part of a project aiming at developing an open data mining system with a visual interaction environment that supports the user in the entire process of mining knowledge.
1 Introduction
It should be acknowledged that a lot of research work has been and is being done with respect to knowledge discovery. However, much of this work concentrates on the development and optimization of data mining algorithms, using techniques from other fields such as artificial intelligence, statistics, and high-performance computing, with little consideration, if any, of the other knowledge discovery phases. Consequently, the corresponding tools/systems are normally difficult to integrate into the entire knowledge discovery process.

There is a major need to develop an overall framework that can support the entire knowledge discovery process [Fayyad et al.2002]. The framework should accommodate and integrate all the phases seamlessly. Since the components the framework will be expected to support are not all known a priori, the framework should be extensible to new components. On the same note, the development of the framework and of the components should be separated.

One of the interesting issues related to the ongoing discussion is the role of visualization in the knowledge discovery process. Visualization is a very effective means of enabling the user to use his/her outstanding perceptual capabilities to recognize and understand data. Traditionally, visualization has been placed at the beginning and at the end of the knowledge discovery process.
This work is supported by the MIUR project D2I: Integration, Warehousing, and Mining of Heterogeneous Sources (http://www.dis.uniroma1.it/~lembo/D2I).
Instead, visualization has its place in all the phases of the knowledge discovery process. This puts visualization, and therefore the user, at the center of the entire knowledge discovery process, which is a major step toward developing user-centered data mining systems.

This chapter describes VidaMine (an acronym for VIsual DAta MINing Environment), a visual data mining system with a visual environment aiming at supporting the user throughout the entire data mining process. VidaMine is built on a framework that is open to the inclusion of new architectural components and the modification of existing components. The system’s visual interface is designed with the goal of supporting the user in the entire data mining process. Moreover, the interaction environment is consistent, uniform, and flexible. We have carried out various usability tests on the visual interface, and the design of the current prototype is based on the feedback received from those experiments. Moreover, we have defined an abstract syntax and a formal semantics for the interface.

The rest of this chapter is organized as follows: Section 2 focuses on related research work. Section 3 describes the architecture of the proposed visual data mining system. A detailed description of the visual data mining environment is presented in Section 4. Section 5 highlights efforts aimed at defining a mapping between the abstract data mining engine and the visual interface. Work on usability studies is presented in Section 6. Future work and conclusions are presented in Section 7.
2 Related Work
In this section, we discuss some data mining systems proposed in the literature that offer a reasonably large and diverse set of data mining and visualization functionalities.

Clementine [SPSS] is a product by Integral Solutions Ltd (ISL); SPSS purchased ISL on December 31, 1998. The product supports quite a number of mining techniques, including clustering, association rules, sequential patterns, factor analysis, and neural networks. Its visual interface reveals much about a data mining task by illustrating the flow of control and data, so the user is better positioned to understand and follow the mining process. Users construct a map of their data mining project/model, called a “stream”, by selecting icons, called “nodes”, that represent steps in the data mining process. However, users need to learn and think in terms of “streams” and “nodes”. Moreover, the product does not fare very well in terms of scalability, i.e., Clementine does not scale up very well when dealing with massive amounts of data. It should be pointed out that Clementine does allow users to adjust/refine their “streams” and rerun the system on the refined model. It would be interesting for the company to have evaluation studies carried out and reported for the express purpose of assessing the usability of Clementine. Fig. 1 shows the visual interface.
Fig. 1. The Visual Interface of Clementine
Enterprise Miner [SAS] is a product by the SAS Institute. The product provides diverse data mining algorithms such as decision trees, neural networks, regression, radial basis functions, and clustering, and offers extensive parameter options for the algorithms. Enterprise Miner has a reasonably interesting visual interface. However, the product is hard to use, especially when compared with other products such as Clementine. Enterprise Miner has powerful support for data transformation, and its visualization tools are useful for multidimensional analysis. The effort does not report any studies explicitly designed for testing the usability of the product.

NicheWorks is a tool that was originally designed for exploring large graphs [Wills1997, Wills1999]. Among other applications, the tool has been used to develop visualizations for detecting international calling fraud [Cox et al.1997]. The detection is realized basically by using a visualization of the calling activities that allows the system user to quickly notice unusual calling patterns. The calling communities are represented using a directed graph, in which the nodes represent the subscribers and the links represent the calls. In particular, countries are mapped to unfilled circles, subscribers are represented by filled circles, and the size and color of a filled circle reflect the total number of calls made by the subscriber. The width and color of a link represent the overall level of communication between the two ends. The tool also enables the user to drill down on suspected patterns. It should be pointed out that, in the real sense of the word, NicheWorks is not a data mining system; it may be regarded as a visualization or exploratory tool.
Therefore, the tool cannot fully accommodate the entire mining process. Nonetheless, the tool is a classic example of the role of visual data mining in visualizing raw data.

DBMiner is an on-line analytical processing (OLAP) data mining system developed by the Data Mining Research Group of the Intelligent Database Systems Research Laboratory at Simon Fraser University, British Columbia, Canada [Han et al.1996]. The system is owned by DBMiner Technology Inc [DBMiner]. It supports association rules, meta-patterns, classification, and clustering. DBMiner provides a browser to visualize the OLAP process. Association rules are visualized using bar charts and a three-dimensional ball graph view. With regard to visualizing decision trees, the product offers a three-dimensional graph view and an optional grid view. Clustering results are visualized using a two-dimensional graph view or a grid view. The user interface is fairly simple and standard. However, users who are not acquainted with data mining are likely to find the mining environment somewhat intimidating. It should be acknowledged that DBMiner does interface with MS-OLAP Server and also uses MS-Excel as a browser for visualizing the OLAP process. Nonetheless, DBMiner provides no explicit support for data/results export and import. Moreover, the effort does not report any evaluation of the system.

KnowledgeSTUDIO is a product by ANGOSS Software Corporation [Angoss]. The product supports decision trees, clustering, and neural networks. Decision trees can be constructed automatically or interactively. The product relies heavily on standard business applications (e.g. MS Office) for visualization functionalities. Due to its interactive and exploratory environment, data mining models can be produced with relative ease; consequently, KnowledgeSTUDIO has a short learning curve. The product presently does not offer an explicit means for exporting data mining results (or for exchanging data mining models in general), although it should be observed that its support for PMML could facilitate this. There is no record of any usability studies carried out on the product. The visual interface can be seen in Fig. 2.

Viscovery SOMine [Eudaptics] is a data mining system developed by Eudaptics. Among other data mining methods, the system supports clustering, prediction, regression, and association. The system puts complex data into some order based on its similarity and then shows a map from which the features of the data may be recognized and understood. On the whole, the system has a reasonably good user interface, especially for self-organizing maps. The product is intended to target professional users from varied fields such as finance, marketing, industry, and scientific research. However, the effort does not report any evaluation of how successful the product is in reaching those types of users.

Another related effort is VisMine. Hao et al. [Hao et al.1999a, Hao et al.1999b] point out that new techniques for mining knowledge from large data warehouses often exhibit the following problems: display problems (cluttered and disjoint displays), limited access, and lack of expandability.
Fig. 2. The Visual Interface of KnowledgeSTUDIO
It was in an attempt to address these issues that the authors proposed the visual data mining infrastructure named VisMine, whose main features are: hiding non-primary structures and relationships unless the user focuses on them; supporting the simultaneous presentation of multiple visual presentations (“slice and dice”); and providing architectural plug-in support to enable the exploitation of other visualization toolkits. The infrastructure is more of a visual exploration environment than a core data mining system.

TempleMVV [Mihalisin and Timlin1995] may be traced back to the MVV [Mihalisin et al.1991] effort. The latter uses bar charts (histogram within histogram within histogram) and slide bars (with horizontal scales) to locate clusters in multidimensional space, and allows the display of multiple views of a given dataset. TempleMVV is a tool that has been proposed for fast visual data mining: it achieves a performance that is independent of the size of the dataset by utilizing discrete recursive computing to the maximum degree in a preprocessing step. It provides hierarchical visualizations for any mix of categorical and continuous attributes. Its visualization paradigm is based on nested attributes, with four attributes being represented at the same time. However, the user has to bring along a true understanding of multiple attributes to make use of this sophisticated tool.

Besides the specific issues raised for each of the foregoing systems, it should also be pointed out that each of the systems exhibits at least one of the following limitations:
– Non-extensible framework: The system is based on a framework that supports only some specific phases of the data mining process. Consequently, it is extremely difficult, if not impossible, to incorporate new components.
– Non-homogeneous environment: The system makes use of a non-uniform mining environment. The user is presented with “totally” different interfaces across implementations of different data mining techniques.

We propose, describe, and demonstrate VidaMine, which is based on an open framework. VidaMine, which has been subjected to various usability tests, offers a consistent, uniform, and flexible visual interaction environment across the entire process of mining. Moreover, a mapping between the abstract data mining components and the corresponding visual aspects has been defined.
3 System Architecture and Implementation
The proposed framework enables VidaMine to exhibit the following main features:

– The system presents the user with a heterogeneous set of tasks in the most homogeneous and integrated way;
– The system is open;
– The system has a modular structure with well-defined change/extension points;
– The system provides the end user with maximum flexibility during her/his data mining tasks.

At present, the system supports, but is not limited to: metaqueries, association rules, and clustering.
3.1 System Architecture
The proposed architecture comprises two primary layers: the user layer and the data mining engine layer, as seen in Fig. 3.

1. The Parametric User Interface/User Layer enables the user to interact with the other system components. It invokes the relevant system feature or functionality on behalf of the user. Ideally, this layer/component empowers the user to process data (and knowledge), and also to drive, guide, and control the entire discovery process. The user component is organized around a GUI container which hosts specific GUI extension cartridges and the visualization component. The extension cartridges contain the knowledge to access their respective underlying data mining components/modules in the Data Mining Engine Layer. In effect, the GUI container registers the specific data mining technique GUI extension, loading the respective menu items and other commands specific to the data mining technique. For instance, the GUI extension for clustering has the knowledge of how to access the clustering engine and also of how to interact with the user in the acquisition of clustering input and in the visualization of clustering output.
Fig. 3. System Architecture (diagram: the Parametric User Interface, comprising a GUI container with metaquery, association rule, and clustering user interaction components, a visualization component, and plug-in support, on top of the Abstract Data Mining Engine, comprising a command manager, metaquery, association rule, and clustering engines, a data mining verifier, a data mining discoverer, and the global dataset)
The GUI container provides various services to the data mining system. These services fall into two categories: infrastructural services and end user services. The infrastructural services that are supported include:

– Registration of extensions which implement specific interaction contracts.
– Runtime loading onto the interaction environment of the features that are relevant to the active GUI extension (e.g. commands and options).
– Advertising new GUI extensions.
– Routing user commands to the active GUI extension.

As for the end user services, the GUI container provides:

– The user with a uniform, consistent, and flexible user interface.
– Services whose use spans the entire data mining interaction environment (such as start, stop, save, and load services).

There are various functionalities that a data mining GUI extension supports. The GUI extension carries out the specific commands that are loaded and made available to the user (loaded on the visual interface) by the GUI container. The extension also implements specific input and output modalities for the underlying data mining technique or algorithm(s). On the whole, modalities specific to the data mining technique may be added, or may substitute some or part of the general pre-existing ones.
2. The Abstract Data Mining Engine Layer is completely decoupled from the User Layer. However, the structure of the Data Mining engine parallels the structure of the GUI. The Data Mining Engine Layer is structured using an abstract reference model based on the following concepts:

– A global dataset, which contains the data to be mined and all the information necessary to apply and execute a data mining technique.
– A command manager, which forms the interface between the Data Mining engine and the User Layer. On the one hand, it interprets user commands originating from the user interface; on the other hand, it manages any access to the internal structure of the engine. The command manager therefore serves as a two-way link between the engine and the GUI (the GUI container and the specific GUI extension). Some of the operations performed through the command manager include: defining the initial set of target data, applying some data mining algorithm, and storing and verifying hypotheses.
– An abstract data mining algorithm, which must be specialized to implement some specific mining algorithm, e.g. a clustering-based algorithm.

The general behavior of the engine is abstract in that it must be instantiated/specialized to specific “engines”, one for every data mining technique. It should also be pointed out that some results that are realized by one specific “engine” can be used by another “engine”. Consequently, the result of some data mining task can be used as input to another task. The instantiated “engines” are made available to the general engine framework dynamically. As a consequence, they are also made available to the user through the respective GUI environment. It is worth pointing out that there are some services that are available to every specific “engine”. Such services include: data management, configuration saving, intersystem communication, data access, and database connection management.

As already mentioned, the architecture supports the incorporation of new components and the modification of pre-existing ones. Specific extension points are defined right where such component additions or modifications occur. It is envisioned that new components will be incorporated as plug-ins [Fayyad et al.2002].
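To illustrate the container/extension contract described above, the following sketch renders it in Python. Every name here is hypothetical, and the actual User Layer is written in Java (see Section 3.2); the sketch merely mirrors the registration, runtime loading, and command-routing services listed above.

from abc import ABC, abstractmethod

class GUIExtension(ABC):
    # Hypothetical contract a technique-specific GUI cartridge implements.

    @abstractmethod
    def menu_items(self):
        """Commands and options to load when this extension becomes active."""

    @abstractmethod
    def handle_command(self, command, args):
        """React to a user command routed by the container."""

class GUIContainer:
    def __init__(self):
        self._extensions = {}
        self._active = None

    def register(self, name, extension):
        # Infrastructural service: register an extension implementing the contract.
        self._extensions[name] = extension

    def activate(self, name):
        # Runtime loading of the features relevant to the newly active extension.
        self._active = self._extensions[name]
        return self._active.menu_items()

    def route(self, command, *args):
        # Route user commands to the active GUI extension.
        return self._active.handle_command(command, args)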
3.2 System Implementation
In the current prototype implementation, the User Layer, which makes it possible for the user to interact with the other system components, is developed in Java, and the visual interface prototype runs on Windows platforms. One of the notable features of the User Layer is the exploitation of the DARE [Catarci et al.1999, Catarci and Santucci2001, Santucci and Catarci2002] environment for the visualization of both data and mining rules. In this context, such a system/tool plays the role of a visualization server: through a suitable interface, the User Layer instructs it about the details of the visualization (e.g. the tuples to be visualized, the association with the visual attributes, etc.), and the server produces the correct data visualization, allowing the end user to further explore the visualized data, either to better understand the result s/he got or to restart the mining activity on a data subset.
A mapping that serves as a two-way link for exchanging information between the visual interface and the data mining algorithms has been implemented using XML DTDs. As stated in Section 3.1, the general behavior of the Data Mining engine is abstract in that it must be instantiated/specialized to specific “engines”. In practice, the specific data mining “engines” correspond to specific data mining algorithms. In VidaMine, the clustering algorithms have been developed using Delphi; the algorithms for metaqueries and association rules have been implemented in Java.
4 The User Interface
The visual interface has been designed to be uniform and consistent. As mentioned in Section 3, the interface offers visual interaction environments across different mining techniques and tasks. The visual environments have been developed based on a similar design. Figure 4, which is for the Metaquery Environment, shows the overall design of the visual interface; each visual environment adopts this design. The visual parts, too, are designed in a uniform manner across the different visual environments. The uniformity aspect may be seen in the figures in this chapter that show the visual interface, and also in the usability studies, described in Section 6, that have been carried out on VidaMine.
Fig. 4. The Overall Design of the Visual Interface
In order to ensure that the states of the views are consistent, the views are appropriately coupled. When the user interacts with a particular view and the view updates itself, the other views related to it are updated as well, to reflect the new state of interaction or presentation. The visual interface also provides an environment through which various tasks can be integrated in a fluid manner. For instance, through the visual interface, it is possible to use the results of one data mining task as input to another data mining task. This support for exporting results is described further in subsection 4.4.

Toward describing the specific system features, we consider a communications company that provides various services such as Web-access services, telephone services, etc. The company has a main office and a number of service centers. The main office principally deals with strategic and administrative issues; the company offers its services through the service centers. The company plans to introduce some special offers. The marketing strategist is expected to recommend the type of service to be featured and the customers to be targeted. Assume that the marketing strategist decides to mine some information using VidaMine; in this case, we may view him as the user of the system. The marketing strategist might want to identify regions that had relatively good general service sales in the last month. Such information might be used further to propose some specific service and customers that might be worth considering in the offer. The recommendation could also include another service that normally does best with the proposed service.
4.1 Identifying Regions with Good Sales: Using the Clustering Environment
Understanding how different regions have been doing can be valuable in making marketing decisions. This task can be accomplished through the clustering environment. The user starts by specifying a target dataset. The specification relies on two intuitive interaction spaces, namely the “Specification Space” and the “Target Space”. These may be seen in all the figures illustrating the visual interface (e.g. in the top-left part of Fig. 5). The “Specification Space” provides the mechanisms, tools, and resources necessary for visually building the set of task-relevant data. The “Target Space” holds or hosts the relations that are part of the task-relevant dataset; it may be envisaged as a container for the constructed target dataset. The two spaces are backed with drag-and-drop mechanisms and tools. Moreover, the two spaces are complementary in the manner in which they support the user, and the user operates by moving between the two components as appropriate. Since the interface supports drag-and-drop, the user may intuitively move elements (such as relations and relation items) from one component to the other.
In this task, the marketing strategist is mainly interested in customers and services (i.e. based on the relations Customer and Service, or on the relation CustServ). The company already has geographical information pertaining to customer addresses, for instance their loci with respect to the main office. The user may construct a relation in which the attributes of interest are CustID, CustX, CustY, CustAmt, and ServID.

VidaMine provides an interaction environment with various input widgets through which the user can specify the parameters characterizing a clustering task. The user can specify:

1. A measure of homogeneity, separation, or density. Alternatively, the user may specify a fixed number of clusters. There are radio-buttons to enable the user to make the selection between a measure and a fixed number of clusters, and there is a radio-button for each of the three measures, enabling the user to express interest by specifying the corresponding measure. Each of the measures is presented on an ordinal scale that runs from 0 to 100, with a slider control for each scale. The user can set a measure by dragging the slider to some value/position of interest on the scale. It should be pointed out that the Similarity slider on the interface corresponds to the homogeneity measure, whereas the Dissimilarity slider corresponds to the separation measure. With regard to supporting the specification of a fixed number of clusters, the interface provides spin-boxes (and alternatively an edit-box). With either of the two main clustering options, the user may also specifically define objectwise, clusterwise, and partitionwise quantifiers in terms of minimum, maximum, sum, and average operators. The quantifiers have to do with homogeneity or separation. As for density, the respective combo-box enables the user to specify the kernel function, whereas setting the respective slider corresponds to smoothing the clusters. As seen in Fig. 5, the user has specified a separation measure and the corresponding quantifiers.
2. Attributes that will actively/directly participate in cluster analysis. The interface supports a “checking” mechanism for specifying such attributes. A relation in the “Target Space” possesses features such as a “handle” and a “check-box” for each attribute. “Checking” a “check-box” means that the corresponding attribute has been chosen to directly participate in cluster analysis. The user may also undo the operation (“uncheck” attributes). These operations are realized through standard “point-and-click” functionalities. In Fig. 5, the attributes CustID, CustX, and CustY have been selected (“checked”).
3. Attributes for supplementary purposes. In particular, the user may specify an attribute to be used for labeling cases in the output.

When through with the parameter specification, the strategist may instruct the system to partition the target dataset by clicking the “torch” icon. The system performs the clustering and displays the results. For displaying the output of a clustering task, VidaMine uses the visualization component. The system supports two main visualization mechanisms: Clusters + Details (“Overview + Detail”) and Dedicated View.
Fig. 5. The Clustering Environment
Clusters + Details (“Overview + Detail”)

This visualization displays clusters on a scatter plot, and also presents details that correspond to a selected cluster or cluster object. The former display corresponds to an “Overview” window, whereas the latter corresponds to a “Detail” window. Cases are mapped to points on the scatter plot, with each point taking some x, y (and, if appropriate, z) values. A currently selected cluster, cluster object, or outlier is drawn with an outline around it. The points may be encoded to reflect other aspects (e.g. by using color and size). The top-right part of Fig. 5 shows the visualization, in which the values on the x-axis correspond to CustX, those on the y-axis correspond to CustY, and the z-axis values correspond to CustID. The “Detail” window is an exposition of a selected cluster or point.

Dedicated View

Homogeneity, separation, and density measures are useful in many ways (e.g. for interpretability and evaluation purposes). In the clustering environment, the Dedicated View displays the measures of homogeneity, separation, and density for each and every cluster or outlier. The separation measure is mapped to the y-axis. A circle encodes a cluster or an outlier. The circles are arranged along the x-axis. The density of the cluster or outlier is bound to the diameter of the circle. The grayscale level of a circle represents the homogeneity measure of the represented cluster or outlier.
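The Dedicated View encoding can be mimicked in a few lines of matplotlib. This is purely an illustration of the mapping just described; the cluster measures are invented and the code is not part of VidaMine:

import matplotlib.pyplot as plt

# (separation, density, homogeneity) per cluster, each scaled to [0, 1];
# sample values invented for illustration.
clusters = [(0.8, 0.6, 0.9), (0.5, 0.9, 0.4), (0.3, 0.2, 0.7)]

fig, ax = plt.subplots()
for i, (sep, dens, hom) in enumerate(clusters):
    ax.scatter(i, sep,
               s=2000 * dens,                  # circle size bound to density
               color=str(round(1 - hom, 2)),   # darker gray = more homogeneous
               edgecolors="black")
ax.set_xlabel("cluster")
ax.set_ylabel("separation")
plt.show()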
From the Clusters + Details and Dedicated View visualizations of Fig. 5, the regions that are close to the main office show a lot of sales. With regard to the anticipated offer, an interesting marketing strategy could therefore put a lot of emphasis on people and service centers that are close to the main office.

The marketing strategist might want to gain more knowledge from those interesting regions, for instance to identify some specific service and customers within those particular regions. This task would entail establishing data relationships, and the analysis can be done through the metaquery environment. However, it is important to observe that the task is based on some particular subset of the data, which is not equivalent to the currently defined set of target data. In other words, the user intends to use some output from one task (clustering) as input to another task (metaquerying). The interface enables the user to select points or clusters of interest through the Standard Tools toolbox. The marketing strategist might then turn to the Export Facility, which enables him to specify whether he wants to just save the specified output or to save it and switch to another task with that output as the input to the new task. In the latter case, the system switches to the new environment with the output appearing in the “Specification Space” as a resource relation.
4.2 Establishing Data Relationships: Using the Metaquery Environment
The marketing strategist would need to analyze the relationships that exist among services, customers, and centers. The analysis would help him determine the service to feature and the potential customers. The metaquery environment can be helpful in carrying out such an analysis. Such relationships can be mined by exploiting the relations CustCent (customers vs service centers), ServCent (services vs service centers), and ClustOut1. It should be observed that the latter relation, ClustOut1, was “imported” from the clustering task and relates customers to services. The intended effect is to have the metaquery analysis restricted to only the tuples contained in the data that was “imported” from the previous task (tuples in the relation ClustOut1).

Consider that the user is specifically interested in the following attributes: CustCent.CustID, CustCent.CentID, ClustOut1.CustID, ClustOut1.ServID, ServCent.ServID, and ServCent.CentID. The marketing strategist therefore needs to specify the three relations with the foregoing attribute constraints toward constructing the target dataset. In Fig. 6, the marketing strategist has already constructed each of the three relations with the respective attributes and dropped each into the “Target Space”.

In the environment, the user may define links manually (Manual Links) or have the system do that automatically (Automatic Links); a link defines a connection/relationship between attributes that is aimed at generating a consequent pattern. Assume that the marketing strategist chooses the latter option.
Fig. 6. The Metaquery Environment
The system links the attributes as follows:

– CustCent.CustID with ClustOut1.CustID
– CustCent.CentID with ServCent.CentID
– ServCent.ServID with ClustOut1.ServID

Letting X be a representation for CustID, Y for CentID, and Z for ServID, and allowing reordering of attributes, the system generates the following transitive “combinations” (which are actually metaqueries):

1. CustCent(X, Y), ClustOut1(X, Z), ServCent(Z, Y)
2. ClustOut1(X, Z), CustCent(X, Y), ServCent(Y, Z)
3. ServCent(Z, Y), ClustOut1(Z, X), CustCent(X, Y)

The system puts the patterns in a pool, as seen in Fig. 6. The strategist may also specify confidence and support values by using sliders or text-boxes. He may then instruct the system to search for specific rules from the target dataset that correspond to the metapatterns in the pool and that satisfy the specified parameters, by clicking the “torch” icon.

Through the visualization component, VidaMine provides various visualizations of the search results. For any rule, the aspects that are of principal interest include: measures of interestingness, the relationship between the head and body, and details about the items participating in the rule.
The system provides two main visualizations for the search results: Rules + Tuples (Overview + Detail) and Dedicated View.

Rules + Tuples (Overview + Detail)

This visualization displays all the rules from the search operation, and also presents the tuples that correspond to some selected rule(s). The rules are displayed using a scatter plot. The interface invokes a system-driven mechanism which chooses an appropriate presentation style for the tuples. The scatter plot may be envisaged as the “Overview” window and the tuples display as the “Detail” window. On the scatter plot, a rule is mapped to a point, the confidence of a rule to grayscale, the support of a rule to the y-axis, and the number of items in a rule to the x-axis.

Consider that the metaquery search based on the ongoing example returns the following results:

1. CustCent(CustID, CentID) ← ClustOut1(CustID, ServID), ServCent(ServID, CentID)
   Support = 33.33% and Confidence = 100%
2. ClustOut1(CustID, ServID) ← CustCent(CustID, CentID), ServCent(CentID, ServID)
   Support = 44.44% and Confidence = 75%
3. ServCent(ServID, CentID) ← ClustOut1(ServID, CustID), CustCent(CustID, CentID)
   Support = 55.56% and Confidence = 60%

In Fig. 6, there is a Rules + Tuples visualization of the foregoing results. In the visualization, the marketing strategist has selected the rule with the highest confidence for exposition. The tuples window depicts some interesting trends: the service represented by black circles had the highest demand. The marketing strategist may interact with the display, for instance by using the exposition tools or by pointing at the circle(s), thereby getting to know that the interesting service is WWW-Access. The display also depicts that virtually all the customers who requested the WWW-Access service are young. Consequently, the marketing strategist may wish to consider the WWW-Access service for the anticipated offer. Moreover, he has fairly substantial information regarding the customers to target: those living close to the main office and who are of a young age.

Dedicated View

Like the foregoing visualization, the Dedicated View enables the user to visualize all the rules from the search operation. However, in this case, the rules are visualized in a more elaborate manner. The Dedicated View displays the confidence and support values of each rule, the relationship between the head and the body of each rule, and the individual items/components that make up each rule. The visualization uses a simple 2D floor with a perspective view. The floor has rows and columns. A rule is represented by a column on the floor; the rule is made up of the components which have entries in that column. The rows represent the items (such as attributes). Associated with each column/rule is a “bucket”.
Fig. 7. Basket-based Construction of Association Rules
The gray value of the contents of the “bucket” represents the confidence value of the rule, and the level of the contents of the “bucket” is bound to the support value of the rule. The handle of the “bucket” can be used to select the corresponding rule. Rule items that form the antecedent are each represented using a “key” icon, whereas those that form the consequent are each represented using a “padlock” icon. The visualization can be seen in the bottom-right part of Fig. 6.
4.3 Market-Basket Analysis: Using the Association Rule Environment
The marketing strategist might also intend to find another service that is frequently requested every time the WWW-Access service is requested. Such knowledge would be instrumental in making marketing decisions, for instance in designing advertisements that capture the two services. The analysis can be realized by switching to the association rule environment. It is worth mentioning that if the user were interested in switching to the new environment with the previous output as the new input, he could use the Export Facility described at the end of Section 4.1.

One of the distinct features of the association rule environment is the provision of “market baskets”. It is interesting to note that the interface allows the marketing strategist to formulate the quest without having to understand the transaction details. For this task, it is enough for him to just have the Service relation and constrain it to the attributes ServID and ServName.
and ServName. The resultant relation is seen in the target data space of Fig. 7. To specify the structure of an association rule of interest, the user drags tuples from the target dataset and drops them into either the first ("IF") basket or the second ("THEN") basket. Recalling that the marketing strategist is interested in WWW-Access, he would therefore drag and drop the tuple Service = "WWW-Access" into the first "basket". He may leave the second "basket" empty as a generic service entry that the system will later instantiate with various relevant service entries. The user then empties the baskets into the pool by clicking the icon marked with tilted baskets. As seen in Fig. 7, the effect of having left the second basket empty is apparent in the pool in that the "THEN" part of the rule is generic (some variable "X"). The user may specify confidence and support measures. He may then instruct the system to search for (and display) the specific rules from the target dataset that correspond to the association rule structure(s) in the pool and that satisfy the specified parameters; a sketch of this search is given below. The association rules that are returned by the system are visualized using the same mechanisms that are used for visualizing metarules. As seen in Fig. 7, the marketing strategist has selected the association rule with the highest confidence in the scatter plot. Tuples corresponding to the rule are displayed in the Tuples display. The strategist may then determine the best service to associate with WWW-Access (e.g., Printing).
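The following is a rough sketch of the basket-based search just described; the transactions are invented for illustration. The "IF" basket fixes the antecedent, the empty "THEN" basket becomes a variable instantiated with every other service, and rules are kept only if they satisfy the user-specified confidence and support.

```python
# Illustrative basket-constrained rule search (not VidaMine's actual engine).
transactions = [  # hypothetical service-request transactions
    {"WWW-Access", "Printing"},
    {"WWW-Access", "Printing", "E-Mail"},
    {"WWW-Access", "E-Mail"},
    {"Printing"},
]

def conf_supp(antecedent, consequent, T):
    both = sum(1 for t in T if antecedent | {consequent} <= t)
    ante = sum(1 for t in T if antecedent <= t)
    return (both / ante if ante else 0.0), both / len(T)

if_basket = {"WWW-Access"}            # dragged from the target dataset
items = set().union(*transactions) - if_basket
for x in sorted(items):               # "THEN" left generic: try each service X
    c, s = conf_supp(if_basket, x, transactions)
    if c >= 0.5 and s >= 0.25:        # user-specified thresholds
        print(f"IF WWW-Access THEN {x}: conf={c:.2f}, supp={s:.2f}")
```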
4.4 Visual Exploration Using DARE
We have incorporated DARE into the proposed framework as a visualization component/server. DARE is a visualization system built upon a knowledge base of rules [Catarci et al.1999, Santucci and Catarci2002]. The system may be used to develop a visual representation of data. The importance of utilizing correct, complete, and effective visual mappings for building visualizations cannot be overestimated. It is worth recalling that, by definition, Information Visualization is the process of mapping the underlying data into a visual form that assists or triggers one to use one's natural capabilities, mental and visual, thereby gaining insight into and understanding of that data. The definition itself points us to the goal of data mining. In the context of our discussion, the data to be visualized could be the target dataset or the result of a mining task. DARE is capable of analyzing whether a defined visual representation is correct, complete, and effective. In case the specified representation falls short of certain adequacy requirements, DARE is able to build and propose a more adequate visual representation. As an example of the case where the data to be visualized is the result of a mining task, consider once again the following mining results that were obtained from the marketing strategist's metaquerying task:

1. CustCent(CustID, CentID) ← ClustOut1(CustID, ServID), ServCent(ServID, CentID) (Support = 33.33%, Confidence = 100%)
2. ClustOut1(CustID, ServID) ← CustCent(CustID, CentID), ServCent(CentID, ServID) (Support = 44.44%, Confidence = 75%)
3. ServCent(ServID, CentID) ← ClustOut1(ServID, CustID), CustCent(CustID, CentID) (Support = 55.56%, Confidence = 60%)

VidaMine invokes DARE to generate a visualization of the rules. The left side of Fig. 8 shows the Overview (Rules) view generated by the visualization component. In the view, a rule is represented by a point, the confidence value of a rule is represented by size, the support value of a rule is represented by the y value, and the number of items participating in a rule is represented by the x value. The user may interact with the visualization. It is also possible to change the mapping and generate a corresponding visualization from the same viewport.
Fig. 8. Multiple Views: Visualizing Mining Results
Consider that the user interacts with the Overview (Rules) view. On the right side of the same figure is the Detail (Tuples) view, displaying the tuples that correspond to the rule selected in the Overview (Rules) view. In the Detail (Tuples) view, a tuple in the dataset is represented by a point. The attributes of a tuple are mapped as follows: CustID (Customer ID) is mapped to the x-axis, ServID (Service ID) is mapped to the y-axis, CentID (Service Center ID) is mapped to the z-axis, CustGender (Customer Gender) is mapped to color, and CustAge (Customer Age) is mapped to size. One is at liberty to
interact with both or either of the views, and change or experiment with different visualization mappings. All this is meant to enable one to realize some exploration or mining goal.
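The following toy sketch illustrates the kind of adequacy check DARE performs on a visual mapping. The concrete rule used here (every attribute bound to exactly one distinct, known visual variable) is only our illustration, not DARE's actual knowledge base.

```python
# Toy adequacy check on a visual mapping: completeness (all attributes
# mapped), validity (only known visual variables) and no variable reuse.
mapping = {"CustID": "x", "ServID": "y", "CentID": "z",
           "CustGender": "color", "CustAge": "size"}

def check_mapping(attributes, mapping,
                  available=frozenset({"x", "y", "z", "color", "size"})):
    unmapped = set(attributes) - set(mapping)
    unknown = set(mapping.values()) - available
    reused = len(set(mapping.values())) < len(mapping)
    complete = not unmapped and not unknown and not reused
    return complete, unmapped

ok, missing = check_mapping(
    ["CustID", "ServID", "CentID", "CustGender", "CustAge"], mapping)
print("complete" if ok else f"incomplete, unmapped: {missing}")
```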
5 Formal Specification of the Visual Interface
Defining a mapping between the abstract components of a system and the corresponding visual ones is beneficial in many ways. For instance, such a definition facilitates data exchange, process automation, data storage, and the capturing of semantics. In this area, the research work started by considering the definition of an abstract syntax and a formal semantics for the user interface. The abstract syntax is intended to provide a snapshot of the visual environment in terms of compact structures. It is worth observing that the snapshot of the visual environment is a static entity. This abstract syntax reduces the distance to mathematical objects (such as specific data mining functions). In the sequel, we describe an abstract formal specification of the visual interface for each of the visual interaction environments corresponding to clustering, metaqueries, and association rules.
5.1 Clustering
The clustering environment in VidaMine provides various visual widgets for specifying or selecting the parameters characterizing a clustering task. The parameters include: a fixed number of clusters or a measure (of homogeneity, separation, or density); the attributes that will be directly involved in cluster analysis; and supplementary attributes (e.g., for labeling cases in the output). Specifying each such parameter may also involve more specific options/settings. The corresponding environment is shown at the bottom-left part of Fig. 5. Clustering methods are classified in VidaMine according to a taxonomy of abstract features with three levels: the form of the clustering problem, the type of accuracy measure, and the formal definition of the accuracy measure. Some accuracy measures are, in turn, structured into abstract components. Such an approach yields a uniform interface and semantics, and frees the user from the burden of investigating the details of each method. Consider a dataset S = {Oi | i = 1, 2, . . . , N} and a symmetric dissimilarity function diss : S × S → R+ ∪ {0}. Let an accuracy measure be a function m from a universe U of partitions of S to R+. We denote partitions by P. At the top level of the taxonomy, the user may choose between two forms of the clustering problem:

Problem Π1: Given an integer K > 1, find the partition P of S of size K such that m(P) is optimal.
Problem Π2: Given a threshold θ ∈ R+, find the partition P of S of minimum (maximum) cardinality such that m(P) ≤ θ (m(P) ≥ θ).

In other words, sometimes m(P) is optimal when large and at other times it is optimal when small. At the next level in the taxonomy, accuracy measures are classified as homogeneity, separation, or density measures, measuring the dissimilarity between points belonging to the same cluster, the dissimilarity between points belonging to different clusters, and the density of clusters, respectively. Finally, at the lowest level, the abstract components of an accuracy measure can be specified. In the following, we describe the various accuracy measures available in VidaMine.

Accuracy Measures

Homogeneity Measures. A homogeneity measure tries to capture the overall degree of intra-cluster similarity between objects. Although algorithms and the complexity of clustering according to homogeneity measures have been studied in several works, the review [Hansen and Jaumard1997] presents a unifying terminology, which we recall in the sequel. The diameter, radius, star, clique, normalized star, and normalized clique of a cluster C [Hansen and Jaumard1997] are measures of homogeneity defined as

$$\max_{i,j:\,O_i,O_j\in C} \mathrm{diss}(O_i,O_j), \qquad \min_{i:\,O_i\in C}\,\max_{j:\,O_j\in C} \mathrm{diss}(O_i,O_j), \qquad \min_{i:\,O_i\in C}\,\sum_{j:\,O_j\in C} \mathrm{diss}(O_i,O_j),$$

$$\sum_{i,j:\,O_i,O_j\in C} \mathrm{diss}(O_i,O_j), \qquad \frac{\min_{i:\,O_i\in C}\sum_{j:\,O_j\in C} \mathrm{diss}(O_i,O_j)}{|C|-1}, \qquad \frac{\sum_{i,j:\,O_i,O_j\in C} \mathrm{diss}(O_i,O_j)}{|C|\,(|C|-1)},$$
respectively. Cluster homogeneities can be combined into partition homogeneities, which constitute the objective functions to be optimized (notice that, since the above measures are based on dissimilarities, optimizing means minimizing). The design of the clustering component in VidaMine is inspired by the classification of [Hansen and Jaumard1997]. However, in the VidaMine system, the clustering interface presents the user with a set of controls allowing for a generalization of the homogeneity measures above. In fact, the user does not select the desired measure from a list of named functions. Instead, the user explicitly builds the homogeneity accuracy measure of a partition by specifying the constituent column functions. First, the user selects the column function that is applied to all the dissimilarities of an object, thus defining an objectwise homogeneity. Then, the user selects the column function that is applied to all the objectwise homogeneities of objects in a cluster, yielding a clusterwise homogeneity. Finally, the user selects the column function that is applied to all the clusterwise homogeneities of clusters in the partition. At present, column functions are limited to min, max, Σ, and avg. Using this set of functions one can easily build a substantial number of homogeneity measures, including all the homogeneity measures
above. Therefore, a homogeneity accuracy measure $m_{hom} : U \to R^+$ in VidaMine is a function of the form
$$m_{hom}(\mathcal{P}) = \Lambda_{C\in\mathcal{P}}\ \Phi_{i:\,O_i\in C}\ \Omega_{j:\,O_j\in C(O_i)}\ \mathrm{diss}(O_i,O_j), \qquad (1)$$
where $\Omega \in \{\max, \Sigma, \mathrm{avg}\}$ and $\Phi, \Lambda \in \{\min, \max, \Sigma, \mathrm{avg}\}$ are the selected objectwise, clusterwise, and partitionwise column functions, U is the set of all partitions of S, and C(Oi) is the cluster containing object Oi in P.
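The following is an illustrative transcription of Equation (1) in Python, assuming the dissimilarities are held in a dict-of-dicts matrix and clusters contain at least two objects; choosing max (objectwise), min (clusterwise) and sum (partitionwise) reproduces a radius-style measure.

```python
# Column functions available in the interface: min, max, sum (Sigma), avg.
FUNCS = {"min": min, "max": max, "sum": sum,
         "avg": lambda xs: sum(xs) / len(xs)}

def m_hom(partition, diss, objectwise="max", clusterwise="min",
          partitionwise="sum"):
    omega, phi, lam = (FUNCS[objectwise], FUNCS[clusterwise],
                       FUNCS[partitionwise])
    clusterwise_vals = []
    for C in partition:
        # objectwise homogeneity: Omega over each object's dissimilarities
        objw = [omega([diss[i][j] for j in C if j != i]) for i in C]
        clusterwise_vals.append(phi(objw))   # clusterwise homogeneity
    return lam(clusterwise_vals)             # partitionwise combination

diss = {0: {1: 1.0, 2: 4.0}, 1: {0: 1.0, 2: 5.0}, 2: {0: 4.0, 1: 5.0}}
print(m_hom([[0, 1, 2]], diss))  # radius of the single cluster: 4.0
```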
Separation Measures. A separation measure captures the overall degree of inter-cluster dissimilarity between objects. The most popular separation measures for a cluster C are its split, its cut, and its normalized cut, defined as

$$\min_{\substack{i:\,O_i\in C\\ j:\,O_j\notin C}} \mathrm{diss}(O_i,O_j), \qquad \sum_{i:\,O_i\in C}\,\sum_{j:\,O_j\notin C} \mathrm{diss}(O_i,O_j), \qquad \frac{\sum_{i:\,O_i\in C}\sum_{j:\,O_j\notin C} \mathrm{diss}(O_i,O_j)}{|C|\,(N-|C|)},$$
respectively. As for homogeneity, rather than displaying a list of preset accuracy measures, the controls in VidaMine's interface for separation allow for specifying objectwise, clusterwise, and partitionwise separation measures. The user may select the column function that is applied to all the dissimilarities between an object and the objects not in its cluster (objectwise separation), the column function that is applied to all the objectwise separations of objects in a cluster, yielding a clusterwise separation, and the column function that is applied to all the clusterwise separations of clusters in the partition. More formally, a separation accuracy measure $m_{sep} : U \to R^+$ in VidaMine is a function of the form
$$m_{sep}(\mathcal{P}) = \Lambda_{C\in\mathcal{P}}\ \Phi_{i:\,O_i\in C}\ \Omega_{j:\,O_j\notin C(O_i)}\ \mathrm{diss}(O_i,O_j), \qquad (2)$$
where $\Omega \in \{\min, \Sigma, \mathrm{avg}\}$ and $\Phi, \Lambda \in \{\min, \max, \Sigma, \mathrm{avg}\}$, U is the set of all partitions of S, and C(Oi) is the cluster containing object Oi in P.

Density Measures. The roots of this approach go back to the non-parametric probability density estimate known as the kernel estimator [Silverman1986]. The key idea of density estimation is to interpret the number of neighbours Oi of a space object O as an estimate of the density at O. A straightforward but fundamental improvement over counting consists of weighting the neighbours by means of a bump-shaped function, the so-called kernel function, which models the influence an object exerts on the value of the estimate in its neighbourhood. A kernel probability density estimate is a normalized positive real function $\hat\varphi_{h,\psi}(x)$ defined as a sum of radially symmetric "bumps" centered at the data objects and scaled by a "smoothing" factor 1/h:
$$\hat\varphi_{h,\psi}(x) = \frac{1}{Nh} \sum_{i=1}^{N} \psi\!\left(\frac{\mathrm{diss}(O_i,x)}{h}\right),$$
where ψ is a fixed kernel function formalizing the shape of the "bumps" (e.g., the Gaussian function). The idea of clustering based on probability density estimates stems from the intuition that all data objects which can be connected to the same maximum of
$\hat\varphi_{h,\psi}(x)$ by an uphill path and have density greater than a threshold θ can be grouped into a cluster. Objects Oi such that $\hat\varphi_{h,\psi}(O_i) < \theta$ are grouped into a single cluster of noise objects [Ester et al.1996, Hinneburg and Keim1998]. Let $\alpha_{h,\psi}(O_i)$ be the data object (if it exists) nearest to Oi in a neighbourhood of Oi having density greater than Oi, and let ∼ be the equivalence relation generated by α. A density-based partition $\mathcal{P}_\theta$ of S with respect to a density threshold θ is defined as
$$\mathcal{P}_\theta = \{\,C : (\exists x_0 \in S)\ C = \{x \in S : x \sim x_0 \wedge \hat\varphi_{h,\psi}(x) \geq \theta\}\,\} \cup \{\{x \in S : \hat\varphi_{h,\psi}(x) < \theta\}\}.$$
Notice that, if the user selects the density measure, U is restricted to density-based partitions. The density accuracy measure m(P) is defined as $\max_{\mathcal{P}_\theta} \{1-\theta : \mathcal{P}_\theta = \mathcal{P}\}$. In other words, the quality of a partition is determined by the lowest noise "level" that allows one to obtain the partition as a density-based partition.
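The following is a simplified sketch of the density estimate and the threshold θ on a one-dimensional toy dataset; the uphill-path grouping is omitted, and the kernel's normalization constant is dropped, since only the comparison against θ matters here.

```python
# Kernel density estimate with a Gaussian-shaped kernel; objects below the
# density threshold theta form the single noise cluster.
import math

S = [1.0, 1.2, 1.4, 5.0, 5.1, 9.0]
h = 0.5  # smoothing factor

def density(x):
    # (1/(N*h)) * sum_i psi(diss(O_i, x) / h), with psi(u) = exp(-u^2 / 2)
    return sum(math.exp(-((abs(o - x) / h) ** 2) / 2) for o in S) / (len(S) * h)

theta = 0.4
dense = [o for o in S if density(o) >= theta]
noise = [o for o in S if density(o) < theta]
print("dense:", dense)   # [1.0, 1.2, 1.4, 5.0, 5.1]
print("noise:", noise)   # [9.0]
```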
Abstract Syntax and Semantics of Clustering. An abstract syntax for a visual language can be defined in terms of multi-graphs as follows [Erwig1998]. Let α, β be sets representing label types. A directed labeled multi-graph of type (α, β) is a quintuple G = (V, E, l, v, ε) consisting of finite sets of nodes V and edges E, where l : E → V × V maps every edge to the pair of nodes it connects, v : V → α maps every node to its label, and ε : E → β maps every edge to its label. A visual language of type (α, β) is a set of directed labeled multi-graphs of type (α, β). We ignore the geometric properties of the controls and represent only an abstract value for each of them. The language is thus defined by the following, where the suffixes Rad, Spin, Slid, and Com stand for radio button, spin box, slider, and combo box, respectively:
$$\alpha = R^+ \cup \{0\} \cup \{\min, \max, \Sigma, \mathrm{avg}\} \cup \{\Pi_1, \Pi_2\} \cup \{\mathrm{Homogeneity}, \mathrm{Separation}, \mathrm{Density}\} \qquad (3)$$
$$\beta = \emptyset \qquad (4)$$
$$V = \{ProbRad, NClusSpin, HomSlid, SepSlid, DenSlid, AccurRad, HomObjwQCom, HomCluswQCom, HomPartwQCom, SepObjwQCom, SepCluswQCom, SepPartwQCom, KernCom, SmoothSlid\}. \qquad (5)$$
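Since β = ∅, the abstract multi-graph has no edges, and a snapshot of the environment reduces to the node-labelling function v. As plain data, the state discussed next could be captured as follows (an illustrative encoding, not VidaMine's internal one):

```python
# Node labels mirroring the snapshot of Fig. 5 discussed below.
node_labels = {
    "ProbRad": "Pi2",           # threshold form of the clustering problem
    "AccurRad": "Separation",   # type of accuracy measure
    "SepObjwQCom": "min",       # objectwise column function
    "SepCluswQCom": "min",      # clusterwise column function
    "SepPartwQCom": "min",      # partitionwise column function
    "SepSlid": 0.03,            # dissimilarity threshold slider
}
print(node_labels["ProbRad"], node_labels["SepSlid"])
```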
We now map the state of the controls positioned at the bottom-left part of Fig. 5 to a clustering problem by means of the multi-graph language above. The problem type depends on the label of the ProbRad node. In Fig. 5, the "Search Method" radio button selects a threshold problem type; therefore v(ProbRad) = Π2. The type of accuracy measure depends on the label of the AccurRad node. In the same figure, the vertical radio button selects dissimilarity, which translates
to v(AccurRad) = Separation in the abstract multi-graph. The objectwise, clusterwise, and partitionwise combo boxes select the min quantifier; consequently v(SepPartwQCom), v(SepCluswQCom), and v(SepObjwQCom) all equal min in the graph. The value of the dissimilarity slider, v(SepSlid), equals 0.03. Then, if we let spq = v(SepPartwQCom), scq = v(SepCluswQCom), and soq = v(SepObjwQCom), the system returns a partition $\mathcal{P}_{sep}(S)$ such that $m(\mathcal{P}_{sep}(S)) \geq 0.03$ and $\forall \mathcal{P} : m(\mathcal{P}) \geq 0.03 \rightarrow |\mathcal{P}| \leq |\mathcal{P}_{sep}(S)|$, where
$$m(\mathcal{P}) = spq_{C\in\mathcal{P}}\ scq_{i:\,O_i\in C}\ soq_{j:\,O_j\notin C(O_i)}\ \mathrm{diss}(O_i,O_j).$$
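A brute-force sketch of this semantics on a three-object toy dataset follows: all partitions are enumerated, the min/min/min separation measure is evaluated, and a partition of maximum cardinality meeting the 0.03 threshold is returned. The separation measure is undefined for the single-cluster partition, which is therefore skipped.

```python
# Enumerate all set partitions and pick a max-cardinality one whose
# separation measure meets the threshold (illustration only).
def partitions(items):
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
        yield [[first]] + p

def m_sep(P, diss):  # objectwise/clusterwise/partitionwise functions: min
    return min(min(min(diss[i][j] for j in sum(P, []) if j not in C)
                   for i in C) for C in P)

diss = {0: {1: 0.01, 2: 0.9}, 1: {0: 0.01, 2: 0.8}, 2: {0: 0.9, 1: 0.8}}
best = max((P for P in partitions([0, 1, 2])
            if len(P) > 1 and m_sep(P, diss) >= 0.03),
           key=len, default=None)
print(best)  # [[0, 1], [2]]
```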
Metaqueries. Metaqueries [Mitbander et al.1996] (or metapatterns) provide a generic description of a class of patterns that the user may want to discover from the underlying dataset. The input to a metaquerying task is expressed in terms of metaqueries, which are rule patterns of the form T ← L1, . . . , Lm, where T and the Li are literal schemes P(Y1, ..., Yn) and P is either an ordinary predicate name or a predicate variable. The relevance of a rule (including association rules) may be determined by measures of interestingness (or plausibility indices); the most common are confidence and support. Confidence is computed as the ratio between the cardinality of the join defined by all the literal schemes, projected on the variables of the rule's body, and the cardinality of the join defined by the rule's body [Leng and Shen1996]. Support is computed as the maximum fraction, over all literal schemes in the rule's body, of tuples of the literal scheme which can be extended to satisfy the rule's body. In the metaquery environment of VidaMine, the user can specify patterns (or relationships) between or among data tables in an intuitive visual manner. The interface provides "hooks" and "chains" through which users can visually specify the relationships. By simply linking two attributes, the users indicate to the system that they are interested in metarules that relate the two attributes. Therefore, VidaMine enables users to visually construct metaqueries of interest. The left-hand side of Fig. 6 shows the part of the metaquery environment that supports the foregoing specification.

Abstract Syntax and Semantics of Metaquerying. Here a simple abstract syntax for the metaquerying visual interaction environment is defined, together with its semantics, that is, the set of rules discovered by the system as instantiations of the metaqueries corresponding to the visual state of the interface when the "torch" icon is clicked. For consistency with the visual interaction environments for the other mining techniques, the metaquerying visual environment represents metaqueries using relation schemes instead of literal schemes, that is, using named attributes to denote table columns instead of the usual logical variables. The metaqueries which are added to the pool when the "Add Pattern" button is clicked are all constructible metaqueries, given the named and
unnamed relations (i.e., those named "X") in the "Target Space". Note that the links between handles which are represented in the "Target Space" do not have a counterpart in the "IF. . . THEN" expressions in the pool. In fact, such expressions only show the relation schemes, and not the literal schemes, that compose the head and body of a rule. Therefore, only the content of the "Target Space" defines multiple occurrences of variables in a rule. Since a user might click "Add Pattern" several times with different configurations of the "Target Space" before clicking the "Torch" button, for simplicity only one such configuration is considered in the semantics. The notion of abstract syntax, which may be recalled from Section 5.1, is assumed. Moreover, the geometric details of the commonest controls are ignored; they are represented as nodes with a label. In the "Target Space", rectangular frames enclosing attribute or relation names, and lines representing links connecting the frames (ignoring the handles), are nodes of the multi-graph. A node representing a frame is labeled by the relation name or attribute name it represents, i.e., the name appearing inside the frame. Attribute and relation frames may be adjacent, and attribute frames may be connected by lines. Therefore, for relations, attributes, and links, two visual relations need to be represented: adjacent and intersecting. Adjacency is assumed antisymmetric: a frame is adjacent to another if and only if the two frames share a horizontal edge and the first frame is located above the second in the display. In the following, let U and R be universes of attribute names and relation names, respectively, in the database. Let V = {X1, . . . , Xi, . . .} be a countably infinite set of variable symbols, and W = {P1, . . . , Pi, . . .} be a countably infinite set of predicate variable symbols. V and W are assumed disjoint and both disjoint from R. Let also ζ : V → V and λ : V → W be injective functions (from nodes to variable and predicate-variable symbols, respectively); in the sequel, ζ and λ will be used to construct the literals of the rules from the frames. The language is defined by the following:
$$\alpha = R^+ \cup U \cup R \cup \{\text{``X''}\} \qquad (6)$$
$$\beta = \{\mathrm{adjacent}, \mathrm{intersecting}\} \qquad (7)$$
$$V \supset \{ConfSlid, SuppSlid\}. \qquad (8)$$
The set IRS of rules returned by metaquerying is defined as follows:
$$IRS = \{r \in RS : conf(r) \geq v(ConfSlid) \wedge supp(r) \geq v(SuppSlid)\} \qquad (9)$$
$$RS = \{r : (\exists\sigma)(\exists mq \in MQ)\ r = \sigma(mq)\} \qquad (10)$$
$$MQ = \bigcup_{h \in L} \{\,h \leftarrow L \setminus \{h\}\,\} \qquad (11)$$
$$L = \{P(X_1, \ldots, X_m) : (\exists n)\ P = pred(n) \wedge isrel(n) \wedge (\forall i \leq m)(\exists n')\ X_i = \zeta(n') \wedge inschema(n', n)\} \qquad (12)$$
where
$$isadj(n, n') \Leftrightarrow (\exists e)\ l(e) = (n, n') \wedge \varepsilon(e) = \mathrm{adjacent} \qquad (13)$$
$$intersects(n, n') \Leftrightarrow (\exists e)\ l(e) = (n, n') \wedge \varepsilon(e) = \mathrm{intersecting} \qquad (14)$$
$$pred(n) = \begin{cases} \lambda(n) & \text{if } v(n) = \text{``X''},\\ v(n) & \text{otherwise} \end{cases} \qquad (15)$$
$$isconn = isadj^{e} \qquad (16)$$
$$isrel(n) \Leftrightarrow (\exists n')\ isadj(n', n) \qquad (17)$$
$$islink(n) \Leftrightarrow (\exists n')(\exists n'')(\exists n_1)(\exists n_2)\ isrel(n_1) \wedge isrel(n_2) \wedge \neg isconn(n_1, n_2) \wedge isconn(n', n_1) \wedge isconn(n'', n_2) \wedge intersects(n, n') \wedge intersects(n, n'') \qquad (18)$$
$$inschema(n, n') \Leftrightarrow \big(islink(n) \rightarrow (\exists n'')\ isconn(n'', n') \wedge intersects(n, n'')\big) \wedge \big(\neg islink(n) \rightarrow isconn(n', n)\big) \qquad (19)$$
and $isadj^{e}$ is the equivalence relation generated by isadj, that is, the smallest equivalence relation containing isadj. Therefore, an equivalence class contains the nodes corresponding to the frames gathered in one relation scheme in the "Target Space". L is the set of literals defined by the visual configuration (Equation (12)). In each literal, P is the relation name enclosed in a frame, or a distinct predicate variable if the name is "X" (Statements (15) and (17)). Every variable corresponds to an attribute frame that is connected to the relation frame which names the literal, or to a link that intersects such an attribute frame (Statement (19)). The set MQ of metaqueries is obtained from L by generating one metaquery for each literal, having the literal as head and the remaining literals as body. The rules RS instantiating MQ are defined by means of a class of substitutions: in Equation (10), σ is any substitution which, given a metaquery mq, consistently replaces every predicate variable occurring in mq with a relation name. Finally, the rule set IRS is the set of all rules in RS that satisfy the support and confidence thresholds specified by the sliders SuppSlid and ConfSlid.
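The instantiation step of Equation (10) can be sketched as follows; the metaquery, the relation names, and the predicate variable P1 are taken from the running example, and each substitution σ consistently replaces P1 throughout the rule.

```python
# Instantiate a metaquery by substituting relation names for predicate
# variables, consistently across head and body (illustrative sketch).
from itertools import product

relations = ["CustCent", "ServCent", "ClustOut1"]

def instantiate(metaquery, pred_vars, relations):
    head, body = metaquery
    for combo in product(relations, repeat=len(pred_vars)):
        sigma = dict(zip(pred_vars, combo))
        subst = lambda lit: (sigma.get(lit[0], lit[0]), lit[1])
        yield subst(head), [subst(b) for b in body]

mq = (("P1", ("CustID", "CentID")),            # head: P1(CustID, CentID)
      [("ClustOut1", ("CustID", "ServID")),    # body literals
       ("ServCent", ("ServID", "CentID"))])
for rule in instantiate(mq, ["P1"], relations):
    print(rule)   # one rule per substitution of P1
```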
Association Rules. Association rules [Agrawal et al.1993] are a data mining technique used to discover implications between sets of items in a database. Let Ω be a set of items, and T be a database of transactions, that is, a list of elements of 2^Ω. Association rules are implications of the form
$$I_1, \ldots, I_m \rightarrow I, \qquad (20)$$
where $\{I, I_1, \ldots, I_m\} \subseteq \Omega$. If r is the association rule (20) and $t = \{I_1, \ldots, I_m\}$, then r has confidence conf(r) and support supp(r) in T defined by
$$conf(r) = \frac{|\{t' \in T : \{I\} \cup t \subseteq t'\}|}{|\{t' \in T : t \subseteq t'\}|} \qquad (21)$$
$$supp(r) = \frac{|\{t' \in T : \{I\} \cup t \subseteq t'\}|}{|T|}. \qquad (22)$$
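Equations (21) and (22) can be computed directly over a toy transaction database, for example:

```python
# conf and supp per Equations (21)-(22); r is the rule I1,...,Im -> I with
# t = {I1,...,Im}. Items and transactions are invented for illustration.
T = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
t, I = {"a", "b"}, "c"            # rule: a, b -> c

matching_body = [tp for tp in T if t <= tp]
matching_rule = [tp for tp in T if t | {I} <= tp]
conf = len(matching_rule) / len(matching_body)
supp = len(matching_rule) / len(T)
print(conf, supp)                 # 0.5 and 0.25
```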
In the association rule environment of VidaMine, there is the provision of "market baskets". As seen on the left-hand side of Fig. 7, the association rule environment offers two baskets, the IF basket and the THEN basket. The IF basket represents items in the antecedent part of an association rule, and the THEN basket represents items in the consequent part of the rule. Users may "drag and drop" items from the target dataset into the relevant baskets. When finished with the visual construction of a particular association rule, the users may empty both baskets into a pool of association rules that are of interest to them, and may go ahead and construct another rule.

Abstract Syntax and Semantics of Association Rules. The VidaMine system's support for association rules permits the specification of syntactic constraints [Agrawal et al.1993], that is, listing items which must appear in the antecedent or the consequent. Such items are dragged from the "Target Space" to the "IF" or "THEN" basket, and later combined to form the rule in the pool. It is assumed that the abstract graph contains one "item" node for every item occurring in the "IF". . . "THEN" rule, labeled with the item's name. Assuming also nodes representing confidence and support, the set of association rules returned by the system is
$$IAR = \{r \in AR : conf(r) \geq v(ConfSlid) \wedge supp(r) \geq v(SuppSlid)\} \qquad (23)$$
$$AR = \{I_1, \ldots, I_m \rightarrow I : V_{THEN} = \{n\} \rightarrow I = v(n) \wedge (\forall n \in V_{IF})(\exists i \leq m)\ I_i = v(n)\}, \qquad (24)$$
where $V_{IF}$ and $V_{THEN}$ are the sets of "item" nodes in the rule's "IF" and "THEN" parts, respectively.
6 Usability
Various usability methods have already been employed during the development life cycle of VidaMine [Kimani et al.2003]. The feedback obtained from the HCI experts and representative users involved in the usability studies was instrumental in guiding the user interface design process. Usability methods provide an opportunity for problems to be identified and rectified during the development cycle; they can therefore shield a system from exhibiting drawbacks after it becomes operational. Usability methods also provide an avenue through which the system can be improved before (and even after) it becomes operational. VidaMine targets two categories of users: expert users and casual users. In the former category are users who are acquainted with data mining. In
this category, there are users with experience in fields such as data mining, information retrieval, statistics, etc. The distinct feature of users belonging to the latter category is that they have little or no knowledge of mining. These users may be involved in aspects such as management, marketing, administration, etc. The user audience factor was taken into consideration by involving representatives from both user categories in the experiments.
6.1 Heuristic Evaluation
As a way of getting started, a heuristic evaluation was carried out. Heuristic evaluation is a relatively informal evaluation in which the interface is assessed in terms of generic features; it rests on reasonably concise and generic principles that apply to virtually any kind of user interface. The heuristic evaluation involved three HCI experts working with the initial proposal of the data mining user interface on paper. The evaluation was principally based on user interface and usability guidelines from [Nielsen1994a] and [Shneiderman1997]. In the former, Jakob Nielsen highlights the following guidelines or usability heuristics: visibility of system status, real-world mapping, user control and freedom, consistency and standards, error prevention, recognition rather than recall, flexibility and efficiency of use, aesthetic and minimalist design, helping users to recognize and recover from errors, and provision of help and documentation. In the latter reference, Ben Shneiderman presents what he refers to as the "Eight Golden Rules of Dialog Design": consistency, shortcuts, feedback, dialog closure, error handling, reversal of actions, internal locus of control, and reduction of short-term memory load. The HCI experts assessed and worked on the proposed interface in order to ensure that it adhered or would adhere to the guidelines. The following is an analysis of how some of the guidelines have been applied in the design of VidaMine.

– The interface dialogue should be simple and natural. Moreover, the interface design should be based on the user's language/terms. In general, there should be an effective mapping between the interface and the user's conceptual model. In VidaMine, the user interface primarily uses basic data mining terms. Furthermore, the provision of natural tools and mechanisms (such as visual market "baskets" for designing association rules, "hooks" and "chains" for linking attributes in designing metaqueries, "drag and drop" mechanisms, etc.) is part of the effort aimed at achieving effective mappings between the interface and the user's conceptual model.
– The interface should shift the user's mental workload from the cognitive processes to the perceptual processes. The VidaMine user interface offers various mechanisms to support this shift. For practically all inputs, the user does not have to supply the units of measurement. Such inputs include the specification of the relation type, the name of the relation, the attribute type, the number of attributes, the attributes that will be actively used in cluster analysis, the measure of confidence, the measure of support, the level
of homogeneity, the level of separation, the level of density, the number of clusters, the clustering accuracy measures, the attribute for labeling clustering results, etc. Moreover, the system offers interaction controls that help the user become familiar with the range of valid values and input values within that range. Such interaction controls include:
• Sliders for specifying the measure of confidence, the measure of support, the level of homogeneity, the level of separation, the level of density, and the Smoothing clustering accuracy measure.
• List-boxes (and combo-boxes) for inputting the relation type, the name of the relation, the attribute for labeling clustering results, and all the clustering accuracy measures (except the Smoothing clustering accuracy measure).
• Check-boxes for specifying the attributes that will be actively used in cluster analysis.
• Radio buttons, which also help in clarifying the range of valid input, though normally at a higher level of abstraction. In VidaMine, radio buttons have been used to support the specification of attribute types (All/Any; or Some), the selection of the clustering search method (based on the number of clusters; or on thresholds such as homogeneity, separation, and density), the selection of measures of clustering accuracy, the choice of the mechanism to use in designing metaqueries (manual; or automatic), and the selection of the approach to use for exporting results (export by simply saving the results; or export coupled with switching to another task with those results).
Furthermore, presenting query parameters as visual objects (such as data relations and attributes) minimizes the possibility of making mistakes while formulating a query.
– There should be consistent usage and placement of interface design elements. Consistency builds confidence in using the system and also facilitates exploratory learning of the system. In VidaMine, the same information is presented in the same location on all the screens. In fact, the visual interface is uniform across the various environments for metaqueries, association rules, and clustering.
– The system should provide continuous and valuable feedback. One of the feedback mechanisms is triggered when the user puts an item into the "baskets" or empties them: the "baskets" respond to reflect the insertion or the removal. Moreover, the various supported visualizations dynamically update themselves when the user changes (or interacts with) the various user interface and visualization parameters.
– There is a need to provide shortcuts, especially for frequently used operations. Through the user interface, the system provides various shortcut strategies such as, but not limited to, double clicking and single key-press commands.
– There are many situations that could potentially lead to errors. Adopting an interface design that prevents error situations from occurring is of great benefit. In fact, the need for error prevention mechanisms arises before (but does not eliminate) the need to provide valuable error messages. The visual interface offers mechanisms to prevent invalid inputs. As aforementioned, such mechanisms include specification by selection (e.g., using list-boxes, combo-boxes, check-boxes, and radio buttons) and specification through sliders. The user interface also provides some status indicators (e.g., when an item is put in a "basket", the status of the "basket" changes to indicate containment).
6.2 Mock-Up Experiment
Based on the heuristic evaluation, a mock-up user interface of the data mining system was graphically designed. The mock-up comprised various interface screens that would enable the user to "go through (and accomplish)" a mining task. The mock-up interface was presented to a team of ten stakeholders participating in the D2I project, and a simulation of their typical tasks was carried out. The team of stakeholders comprised experts from various fields such as databases, data mining, and HCI. The results were interesting and encouraging, and even included suggestions on how to improve the interface. For instance, the stakeholders suggested that the user interface should allow the user to specify more generic relations when constructing the set of target data. They also suggested that the interface should offer the possibility to put constraints on such a generic relation (e.g., by specifying the number of its attributes). In order to meet these requirements, two visual interaction components were introduced: the "Specification Space" and the "Target Space". As the name suggests, the "Specification Space" supports the user in performing the specification task. The component enables the user to specify a relation to be considered as part of the target data set. The specification can be done in a manner that generates a generic relation. Moreover, through the same space, constraints can be introduced on such a generic relation, such as the relation type, the name of the relation, the type of attributes of interest, and the desired number of attributes. Due to the foregoing visual flexibility and richness in specifying a relation, the visual specification may be envisaged as visual construction. The "Target Space" primarily acts as a container for the relations that the user specifies/constructs. The specific details of how the user realizes the target dataset through the "Specification Space" and the "Target Space" can be found in Section 4.
6.3 User Tests
Later, a simulation prototype of the user interface was developed, with which the user could directly interact. Based on the simulation prototype, some more usability tests were performed. The tests involved five selected users from the universities of Bologna3, Ferrara4, and Modena and Reggio Emilia5. In an attempt to categorize users and their individual differences, Jakob Nielsen [Nielsen1994b] highlights three main dimensions along which users' experience differs: users' general computing experience, users' experience with the specific system, and users' knowledge of the domain. All five selected users had reasonably good general computer experience. Since the specific system, VidaMine, is a fairly new system, the tested users had little knowledge about the corresponding user interface. Since VidaMine is primarily a data mining system, the appropriate domain knowledge is data mining. There were various levels of data mining expertise among the subjects. One of the tested users is a multi-media post-doctorate student with some basic knowledge of data mining. Another user is an image analysis graduate with very little knowledge of data mining. Two of the users are data mining researchers. The last user is a post-doctorate student in pattern recognition with a little knowledge of data mining. The selection of the subjects was intended to reflect the user audience that VidaMine targets, namely casual users, who have little domain knowledge, and expert users, who have domain knowledge. During the experiment, each of the subjects was presented with: the prototype as an application, a user scenario, user tasks, a data schema, and a questionnaire. The user scenario was based on the description found in Section 4. The user tasks corresponded to the foregoing user scenario. The tasks were also based on an operational specification of the system that is described in [Kimani et al.2004] and highlighted in Section 7. The data schema corresponded to the data used in the user scenario. As regards the questionnaire, it had three parts. The first part contained closed questions pertaining to the simplicity or complexity of performing user tasks. The second part also had closed questions, pertaining to user interface design principles. The third part contained open questions pertaining to the strengths, weaknesses, and capabilities of the system/user interface; it was also open to the user's extra/other comments. The questionnaire can be seen in Figure 9. Each user was expected to run and interact with the interface of the prototype with reference to the accompanying documents (user scenario, user tasks, and data schema). After the experiment, the user would fill in the questionnaire. The results obtained from the first part of the questionnaire are seen in Figure 10, in which user tasks are analyzed for each level of simplicity/complexity.

3 http://www-db.deis.unibo.it
4 http://www.unife.it
5 http://www.unimo.it
[Fig. 9 reproduces the post-test questionnaire. Part 1 (items 1-9) asks, on a five-point scale from "not at all difficult" to "very difficult" (with an N/A option), how difficult it was to: specify a set of task-relevant data; construct metarules; construct association rules; understand and interact with the visualization(s) of metarules; understand and interact with the visualization(s) of association rules; specify clustering inputs; understand and interact with clustering output; use the output of a particular task as the input to another task; and perform a mining task in general. Part 2 (items 10-21) asks, on a five-point scale from "strongly disagree" to "strongly agree" (with an N/A option), about: clarity of language/terms; organization of interface design elements; consistency of element usage and placement; valuable responses to user operations; progress information; adequate shortcuts; error prevention; error recognition and recovery; sufficiency of help and documentation; ease of exiting from a task at any time; ease of perceiving or remembering the role of interface elements; and ease of remembering task-relevant aspects. Part 3 (items 22-25) poses open questions on three main likes, three main dislikes, missing functionalities and capabilities, and other comments.]
Fig. 9. Questionnaire
Fig. 10. Simplicity/Complexity of User Tasks
Three users found each of the supported tasks at least fairly easy6 to carry out. In particular, all the users found the clustering output very easy to understand and interact with. Four users found specifying the target dataset and the clustering input very easy to perform. Two users found it very easy to construct metaqueries and association rules. Four users found it at least fairly easy to understand and interact with the visualization(s) of metarules and association rules. Four users also found it at least fairly easy to perform a mining task in general. As for using the output of a particular task as the input to another task, three users found it at least fairly easy to perform. The results obtained from the second part of the questionnaire are seen in Figure 11, in which user interface design aspects are analyzed for each level of adherence7. Three users found each of the tested design aspects at least fairly well-adhered to. Three users observed that consistency was very well applied in the interface design. Four users felt that the natural mapping feature was at least fairly well-adhered to.

6 The phrase "at least": wherever the test results indicate that n users rated a particular interface feature with "at least" a rating level x, each of the n users gave a rating equal to or higher than x.
7 It should be mentioned that, this being a simulation experiment, the subjects were not in a position to assess some of the aspects, e.g., progress information, help and documentation.
[Fig. 11 is a chart titled "Design Principles Per Adherence Level": the y-axis gives the number of users (0 to 5) and the x-axis the adherence levels (from "very well-adhered to" through "fairly well-adhered to", "average", and "poorly adhered to" to "not adhered to"), with one series per design principle: language clarity, element organization, consistency, valuable feedback, progress information, shortcuts, error prevention, error recognition and recovery, help and documentation, memory load minimization, and natural mapping.]
Fig. 11. Adherence of User Interface Design Principles
Four users felt that the user's language was at least fairly well applied. All the users noted that it was at least fairly easy to remember (or acquire) the aspects relevant to a particular mining task, i.e., memory load minimization. Moreover, three users observed that, as they interacted with the prototype, it responded to user operations in a reasonably valuable way. However, it should be observed that one subject was non-committal regarding the assessment of the valuable feedback feature. Three users found the interface elements at least fairly well organized. As for the last part of the questionnaire, two users each mentioned consistency, good layout/organization, and support for visual exploration among the main features they liked most about the interface. The foregoing indicates that the interface is strong in consistency, layout/organization, and visual exploration. On the other hand, some of the users stated similar interface aspects that they considered to be negative. Two users found some of the visual elements
either too small or too big, or allocated too little or too much space. This could be linked to some users' observation that there were many visual elements on the interface. Based on one subject's statement of a negative aspect and another subject's comments, it was apparent that the role of the visual control Number of Attributes in the "Specification Space" was not clear. As for the functionalities supported by the system, three users were satisfied. However, two users did recommend that the system should provide an explicit and visual way of expressing joins in the "Specification Space".
7 Future Work and Conclusions
As part of the implementation, it is worth noting that this research work also considered the provision of an operational specification. This specification entailed defining a concrete syntax of mining tasks and an XML-based communication protocol. The concrete syntax of mining tasks is a high-level syntax of each of the tasks that had been proposed in our usability tests. This syntax describes the click-streams that are allowed to occur during the user's interaction with the interface. The click-streams are dynamic entities. The concrete syntax reduces the distance to the specification of procedural objects. The XML-based protocol serves as a two-way link for exchanging information between any applications that deal with mining aspects. It is interesting to realize that the protocol idea can be exploited not only to import and incorporate new components into the system framework, but also to export the system's components to external systems. More information about the operational specification can be found in [Kimani et al.2004]. User interface design modifications and improvements based on the results obtained from the user tests discussed in Section 6 have already been taken into consideration. There are now plans to support the output of descriptive statistics, principally using pie-charts and histograms, in the clustering environment. There are also considerations to extend the system to support Web Mining. When it comes to visualizing (e.g., target data, data mining results, links, etc.) on the Web, the Web Mining aspiration can draw, to some extent, on some of our previous related research work [Kimani et al.2002]. It is worth mentioning that the set of system features described in this chapter is undergoing an engineering process in order to become part of the commercial suite D2I, distributed by Inspiring Software8. The D2I (Data to Information) environment is intended for handling the data needed by a generic process. Inspiring Software plans to include the system features in the next commercial releases. In this chapter, the need for a framework that supports the entire discovery process has been discussed. The chapter has also highlighted the pivotal role that visualization plays in such a framework. VidaMine, a visual data mining
system with a visual environment that is aimed at supporting the user in the pursuit of knowledge has been described.

8 http://www.cisat-group.com
References [Agrawal et al.1993] Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Buneman, P., Jajodia, S. (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207–216. ACM Press, New York (1993) [Angoss] Angoss. Software corporation (link accessed on June 21st, 2004), http://www.angoss.com [Catarci and Santucci2001] Catarci, T., Santucci, G.: The prototype of the dare system. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2001) [Catarci et al.1999] Catarci, T., Santucci, G., Costabile, M.F., Cruz, I.F.: Foundations of the dare system for drawing adequate representations. In: Proceedings of the International Symposium on Database Applications in Non-Traditional Environments (1999) [Cox et al.1997] Cox, K.C., Eick, S.G., Wills, G.J., Brachman, R.J.: Visual data mining: Recognizing telephone calling fraud. Knowledge Discovery and Data Mining 1(2), 225–231 (1997) [DBMiner] DBMiner. Technology inc. (link accessed on June 21st, 2004), http://www.dbminer.com [Erwig1998] Erwig, M.: Abstract syntax and semantics of visual languages. Journal of Visual Languages and Computing 9, 461–483 (1998) [Ester et al.1996] Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 226–231 (1996) [Eudaptics] Eudaptics. Software gmbh (link accessed on June 21st, 2004), http://www.eudaptics.com [Fayyad et al.2002] Fayyad, U., Grinstein, G.G., Wierse, A. (eds.): Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann Publishers, San Francisco (2002) [Han et al.1996] Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W., Koperski, K., Li, D., Lu, Y., Rajan, A., Stefanovic, N., Xia, B., Zaiane, O.R.: Dbminer: A system for mining knowledge in large relational databases. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining (1996) [Hansen and Jaumard1997] Hansen, P., Jaumard, B.: Cluster analysis and mathematical programming. Mathematical Programming 79, 191–215 (1997) [Hao et al.1999a] Hao, M., Dayal, U., Hsu, M., Becker, J., D’Eletto, R.: A java-based visual mining infrastructure and applications. Technical Report HPL-1999-49, HP Labs (1999a) [Hao et al.1999b] Hao, M., Dayal, U., Hsu, M., D’Eletto, R., Becker, J.: A java-based visual mining infrastructure and applications. In: Proceedings of the IEEE International Symposium on Information Visualization, pp. 124–127. IEEE Computer Society Press, Los Alamitos (1999b)
[Hinneburg and Keim1998] Hinneburg, A., Keim, D.A.: An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 58–65. AAAI Press, Menlo Park (1998) [Kimani et al.2002] Kimani, S., Catarci, T., Cruz, I.: Web rendering systems: Techniques, classification criteria, and challenges. In: Geroimenko, V., Chen, C. (eds.) Visualizing the Semantic Web. Springer, UK (2002) [Kimani et al.2003] Kimani, S., Catarci, T., Santucci, G.: Visual data mining: An experience with the users. In: Stephanidis, C. (ed.) Proceedings of HCI International - Universal Access in HCI: Inclusive Design in the Information Society. Lawrence Erlbaum Associates, Mahwah (2003) [Kimani et al.2004] Kimani, S., Lodi, S., Catarci, T., Santucci, G., Sartori, C.: Vidamine: A visual data mining environment. Journal of Visual Languages and Computing 15(1), 37–67 (2004) [Leng and Shen1996] Leng, B., Shen, W.M.: A metapattern-based automated discovery loop for integrated data mining - unsupervised learning of relational patterns. IEEE Transactions on Knowledge and Data Engineering 8(6), 898–910 (1996) [Mihalisin and Timlin1995] Mihalisin, T., Timlin, J.: Fast robust visual data mining. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 231–234. AAAI Press, Menlo Park (1995) [Mihalisin et al.1991] Mihalisin, T., Timlin, J., Schwegler, J.: Visualization and analysis of multi-variate data: A technique for all fields. In: Proceedings of the International Conference on Visualization (1991) [Mitbander et al.1996] Mitbander, B., Ong, K., Shen, W.M., Zaniolo, C.: Metaqueries for data mining, ch.15. In: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.) Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press (1996) [Nielsen1994a] Nielsen, J.: Heuristic evaluation. In: Nielsen, J., Mack, R.L. (eds.) Usability Inspection Methods. John Wiley and Sons, Chichester (1994a) [Nielsen1994b] Nielsen, J.: Usability Engineering. Academic Press, London (1994b) [Santucci and Catarci2002] Santucci, G., Catarci, T.: Dare: A multidimensional environment for visualizing large sets of medical data. In: Proceedings of the International Conference on Information Visualisation (2002) [SAS] SAS. Enterprise miner (link accessed on June 21st, 2004), http://www.sas.com [Shneiderman1997] Shneiderman, B.: Designing the User Interface: Strategies for Effective Human-Computer Interaction. Addison-Wesley, Reading (1997) [Silverman1986] Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, Boca Raton (1986) [SPSS] SPSS. Clementine (link accessed on June 21st, 2004), http://www.spss.com/clementine [Wills1997] Wills, G.J.: Nicheworks: Interactive visualization of very large graphs. In: Proceedings of the Symposium on Graph Drawing (1997) [Wills1999] Wills, G.J.: Nicheworks: Interactive visualization of very large graphs. Journal of Computational and Graphical Statistics 8(2), 190–212 (1999)
Integrative Visual Data Mining of Biomedical Data: Investigating Cases in Chronic Fatigue Syndrome and Acute Lymphoblastic Leukaemia Paul J. Kennedy2, Simeon J. Simoff1,2, Daniel R. Catchpoole2,3, David B. Skillicorn4, Franco Ubaudi2, and Ahmad Al-Oqaily2 1
School of Computing and Mathematics, University of Western Sydney NSW 2007 Australia [email protected] 2 Faculty of Information Technology, University of Technology, Sydney PO Box 123 Broadway NSW 2007 Australia {paulk, simeon, faubaudi, aaoqaily}@it.uts.edu.au 3 The Oncology Research Unit, The Children’s Hospital at Westmead, Locked Bag 4001, Westmead NSW 2145, Australia [email protected] 4 School of Computing, Queen’s University, Kingston, Canada [email protected]
Abstract. This chapter presents an integrative visual data mining approach to biomedical data. The approach and its supporting methodology are presented at a high level. They combine in a consistent manner a set of visualisation and data mining techniques that operate over an integrated data set of several diverse components, including medical (clinical) data, patient outcome and interview data, corresponding gene expression and SNP data, domain ontologies, and health management data. The practical application of the methodology and the specific data mining techniques engaged are demonstrated on two case studies focused on the biological mechanisms of two different types of diseases: Chronic Fatigue Syndrome and Acute Lymphoblastic Leukaemia, respectively. What the cases have in common is the structure of the data sets.
1 Introduction

Molecular and genomic information are becoming an important part of methods for diagnosing diseases based on biological indicators. There is a very large and increasing level of effort towards improving the overall methodology for utilising the data gathered through gene expression profiling. The efforts are focused on the measurement procedures and data collection technology, experiment designs, and diverse data analysis and mining methods [1]. Some of the best practices have been discussed in [2, 3]. Mining microarray data on its own is a challenging task [4], due, on the one hand, to the superposition of a number of physical processes in the data collection and, on the other, to the need to convert extracted patterns into biological knowledge. Consequently, there has been an increasing interest towards complementary techniques for
analysing simultaneously gene expression data and other data sources, for example literature-based information [5], DNA sequence databases [6], or several sources at once [7]. This increasing tendency towards extending data mining techniques, for example association rule mining [8, 9], is reflected in some of the tools developed recently [10, 11]. These "joint" methods, however, have emerged somewhat on an ad-hoc basis. Though biologists often focus on data collected from microarray-based expression profiles, other molecular data, including the organisation and function of genes in the context of the cell, the physical genome and sequence, and the relationships between species in terms of this organisation, can provide important insights into the phenomenon. Overall, in the biomedical and health sciences, various databases collect these diverse data sets, each providing a basis for knowledge discovery within a specific area of understanding. This is illustrated in Fig. 1. Biomedical and health data, and the patterns discovered from them, often consist of many small interactions contributing to the explanation of the phenomenon. Developing a consistent methodology and the corresponding combinations of supporting algorithms is the aim of the work presented in this chapter.
[Fig. 1 arranges the data sources (SNP data, gene expression data, proteomic data, full blood count, illness symptoms) along the biological spectrum from genotype through gene activity, protein levels, and cellular function to phenotype.]
Fig. 1. Relationship of data source to the biological genotype-phenotype spectrum
Fig. 2 shows the broader picture of the data sources that are involved on the biomedical side in modern healthcare. There is a growing opinion that the analysis of biomedical data requires the integration of various data sources to build up a more complete picture of the various levels of biology, clinical understanding, and optimal patient management. Consequently, there is a need for a consistent methodology that enables the combined analysis of clinical traits, marker genotypes, comprehensive gene expression, and SNP data, in order to dissect the biological mechanisms of complex disease. Recent research also recognises the necessity of automatically utilising existing knowledge compiled in various "omic" electronic libraries in order to understand and interpret the outcomes of microarray and SNP data in the context of existing biological knowledge [12]. A brief overview of the different types of data (data sources), their characteristics, and the issues of integration with the other data is presented in Tables 1-4. We consider seven types of data sources, grouped in four categories:

• Medical and Clinical data sources (including Medical Data, Patient Outcome Data, and Patient Questionnaire Data), presented in Table 1;
• Biological data sources (including Gene Expression Profiles and Single Nucleotide Polymorphisms (SNPs)), presented in Table 2;
• Biological knowledge bases (including Domain Ontologies and other Databases), presented in Table 3;
• Healthcare data sources (including Health Management Data), presented in Table 4.

Ideally, each of these types of data should be present in the integrated data set; however, the final selection depends on the available data and the study scenario.
[Figure: a data-to-knowledge spectrum. Data sources (SNP data, gene expression, proteomic data, pathology tests, illness symptoms, treatment protocol, cost) map onto genotype, gene activity, protein levels, cell/tissue function, prognosis, patient outcomes and health management, serving the molecular biologist, the clinician and the manager respectively.]
Fig. 2. The diverse biomedical and healthcare data sets are the source for knowledge discovery within a specific area of understanding associated with the management of patients
Later in the chapter we present an overview of the general methodology and demonstrate its application in two case studies: the identification of biological markers underlying Chronic Fatigue Syndrome, and the analysis of the sadly common childhood malignancy Acute Lymphoblastic Leukaemia.

Table 1. Medical and Clinical data sources

Medical Data: Prognostic indicators are used for the empirical diagnosis of disease. In the case of ALL patients, this is a risk-based directed therapy [13].
  Characteristics of the data: Patient age, sex, ethnicity, white blood cell count, cytogenetic analysis, cell surface antigens and response to initial chemotherapy.
  Integration issues: Available data is often derived from patients presenting at hospitals and treated on specific drug trials. Data types are mixed but may be available only as unstructured text.

Patient Outcome Data: Studies and trials are generally designed to compare potentially better therapy with therapy that is currently accepted as standard.
  Characteristics of the data: Treatment protocols list drug schedules for patients in different risk categories, and modifications for patients with abnormal response to drugs.
  Integration issues: Therapies and outcome data are in unstructured text and must be encoded into a computer representation, bearing in mind the heterogeneity of response.

Patient Questionnaire Data: Specifically designed questionnaires are used in studies of diseases with a psychosocial basis. Analysis of such data usually provides a starting point for classification of cases and then for further investigation of the existence of a possible biological background.
  Characteristics of the data: Questionnaires usually include both close- and open-ended questions. The close-ended questions generate numerical attributes. The open-ended questions result in unstructured data.
  Integration issues: Open-ended questions generate data which may be further mined using computational linguistic approaches. This may require specialist domain ontologies such as those associated with the UMLS [14].
Table 2. Biological data sources
Gene Expression Profiles: The mRNA expression profile of diseased cells may reflect the unique genetic alterations present, and has been shown to be predictive of clinical and biological characteristics of illness for many diseases. A major issue in these data is unreliable variance estimation, complicated by the intensity-dependent, technology-specific variance [15].
  Characteristics of the data: cDNA microarray is the high-throughput analysis of global gene expression within a biological specimen. Gene expression measurements (e.g. relative levels of expression between tumour and normal cells) are made simultaneously for many thousands of genes.
  Integration issues: Comparing gene expression measurements between different technologies, and between measurements on the same technology at different times, is a challenge handled by normalisation techniques. A specialised markup language for microarray data is described in [16]. Furthermore, the number of replicated microarrays is usually small because of cost and sample availability, resulting in unreliable variance estimation and thus unreliable statistical hypothesis tests.

Single Nucleotide Polymorphisms (SNPs): The analysis of SNPs within the human genome will enhance our understanding of the underlying genetic variation that exists in the human population. Individual SNPs are being associated with specific diseases and have been correlated with altered drug response in pharmacogenomic analyses [17].
  Characteristics of the data: Increasing numbers of examples of single base pair variations within the coding region of genes which, whilst not being mutations that lead to a defective protein, are associated with altered activity of the protein [18]. Larger blocks of genetic variation, called haplotypes, are also being assessed in so-called haploblock studies.
  Integration issues: There is a need to establish statistically significant correlations between SNPs and disease or outcome of treatment through association studies. High-throughput analysis of SNPs, with up to 100,000 different variations, is now achievable.
Table 3. Biological knowledge bases

Domain Ontologies and other Databases: GO [19] and other biological and medical ontologies and databases (e.g. PubMed, TRASER, Swiss-Prot, etc.) are publicly available over the Internet.
  Characteristics of the data: GO [19] is a major public curated vocabulary of over 17,000 terms and allows the association of biological ‘functionality’ with gene products.
  Integration issues: Integration adds context and knowledge about genes. Issues arise when matching records between databases, as the primary key used to index entities often differs depending on the owner of the database.
Table 4. Healthcare data sources

Health Management Data: Retrospective cost-benefit assessments of clinical trials are often conducted by health managers so as to improve broader management strategies and the allocation of financial resources to departments.
  Characteristics of the data: Patient visits to inpatient and outpatient wards/clinics, total cost of medication, efficiency of service delivery, consultation time and study comparison analysis. Quality-of-life measurements; palliation vs. effective cure; the effect of a new screening test with regard to benefit, etc.
  Integration issues: Privacy issues; updating the costs of drugs over time; comparison of different drugs between countries. Assessment is performed retrospectively, with findings difficult to implement in future trials.
2 The “Extract-Explain-Generate” Methodology

We present the general outline of the “Extract-Explain-Generate” methodology, motivated by the multifactorial and multilevel nature of biomedical data. A schematic of the methodology is shown in Fig. 3. The methodology is centred on a technology-mediated, knowledge-based inductive learning process that analyses new observations in the context of the available domain knowledge. As illustrated in Fig. 3, the observations of a new patient and the domain knowledge related to the case, possibly including existing biological hypotheses, are the inputs to this process.
Fig. 3. The “Extract-Explain-Generate” methodology
The output of the process includes one or more medical hypotheses. These hypotheses assist (i) the clinician, to formulate a treatment protocol and understand how a patient differs from previous patients; (ii) the biological researcher, to identify biological markers (possibly genes or other indicators) for future investigation; and (iii) the health manager, to understand the costs associated with treatment. Once validated by these end-users, the hypotheses are used to update the domain knowledge. Domain knowledge can be categorised into three classes: (i) a case base of previous patients, together with the outcomes of the treatment protocols applied; (ii) public knowledge bases of biomedical information (including domain ontologies and databases); and (iii) health management information. The knowledge-based inductive learning step facilitates the reuse of acquired knowledge in the context of prior domain knowledge. Although the processes in the knowledge-based inductive learning step differ for each type of hypothesis and for different problems, they may be grouped under three categories: “extract”, “explain” and “generate hypotheses”. Medical hypotheses (together with treatment outcomes) are used to update the database of cases. Biological hypotheses eventually lead to updates of the knowledge bases involved in the process in Fig. 3.

In order to position and interpret the results of the analysis of microarray data in the context of other existing biological knowledge, we utilise available domain ontologies, electronic libraries and other databases (also referred to in the literature as “omic knowledge” libraries). Currently, the bioinformatics tools that process such sources are restricted to dealing with one type of “omic knowledge”, e.g. a particular gene ontology, or interactions. Promising for our approach are the efforts to develop mechanisms and protocols that can deal with any type of omic knowledge, for example the work on the Omic Space Markup Language (OSML) [12].

Overall, the proposed integrative methodology takes into account and incorporates the biological, clinical and economic aspects of medical treatment. This is also indicated by the explicit presence of the roles of Clinician, Healthcare Manager and Molecular Biologist in Fig. 3. As the “Extract-Explain-Generate” methodology involves one or more data mining and analytics experts in each step, these roles are not explicitly shown in the diagram. The methodology provides a broad framework for constructing consistent instances of case study designs, including the required data mining support for specific cases. We illustrate how these instances are formed in the cases of Chronic Fatigue Syndrome and Acute Lymphoblastic Leukaemia. In the first case study the focus is on the anatomy of the “Extract” step, while in the second the focus is on the anatomy of the “Explain” step. A minimal sketch of the overall control flow of the methodology is given below.
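To make the control flow concrete, here is a purely illustrative skeleton of the loop described above. Every name in it (DomainKnowledge, extract, explain, generate_hypotheses) is our own shorthand for the stages in Fig. 3, not an API from the chapter.

```python
from dataclasses import dataclass, field


@dataclass
class DomainKnowledge:
    """The three classes of domain knowledge named above."""
    case_base: list = field(default_factory=list)        # previous patients + outcomes
    knowledge_bases: list = field(default_factory=list)  # ontologies, public databases
    health_management: list = field(default_factory=list)


def extract(observations, knowledge):
    """'Extract': find global patterns (e.g. clusters over the
    integrated data) and local patterns (e.g. ranked genes)."""
    return {"global": [], "local": []}


def explain(patterns, knowledge):
    """'Explain': relate extracted patterns to prior knowledge."""
    return []


def generate_hypotheses(explanations):
    """'Generate': phrase candidate medical/biological hypotheses."""
    return []


def knowledge_based_inductive_learning(observations, knowledge):
    patterns = extract(observations, knowledge)
    explanations = explain(patterns, knowledge)
    hypotheses = generate_hypotheses(explanations)
    # Hypotheses validated by the clinician, biologist and manager
    # feed back into the domain knowledge (Fig. 3).
    knowledge.case_base.extend(h for h in hypotheses if h.get("validated"))
    return hypotheses
```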
3 Case Study 1: Chronic Fatigue Syndrome

In this section we demonstrate an instance of the application of the “Extract-Explain-Generate” methodology to a biomedical problem: the identification of biological markers
underlying Chronic Fatigue Syndrome. In the following subsections we describe the problem and the goals of our study. We then construct and apply an instance of our methodology to this particular problem and describe the outcomes of the investigation.

3.1 Problem

Chronic Fatigue Syndrome (CFS) [20] is an illness with a primary symptom of debilitating fatigue persisting over a six-month period. Currently, diagnosis of CFS is generally made by clinical assessment of symptoms using a number of surveys measuring functional impairment, quantifiable measurements of fatigue, and the occurrence, duration and severity of symptoms [21]. A primary goal of current research is to derive a definition of the syndrome which goes beyond a clinical assessment of symptoms to an empirical diagnosis founded on an established biological lesion. The motivation for this kind of research is to gain a clearer understanding of the illness and to find empirical guidelines for its diagnosis.

3.2 The Goals of the Study

The goal of our study is to investigate whether there is a biological basis to CFS. To this end we interrogate an integrated dataset of clinical, blood evaluation and gene expression data to identify patterns differentiating fatigued individuals (CFS, plus other fatigued individuals with insufficient severity of symptoms to be classified as suffering from CFS (ISF)) from non-fatigued (NF) individuals. We use publicly available data for CFS and NF individuals from the Critical Assessment of Microarray Data Analysis (CAMDA 2006) competition datasets [22]. The integrated data set that we composed comprises two clinical data sets: one giving survey results for the above-mentioned fatigue and symptom questionnaires, the other giving complete blood evaluation results. These cover 139 CFS/ISF patients and 73 NF individuals. Gene expression data for a subset of the CFS/ISF and NF patients (118 CFS/ISF and 53 NF) was also available (consisting of around ten thousand genes for each sample), together with SNP data and proteomics data. We did not use the SNP or proteomics data in our investigation, although they could easily be incorporated within our methodology. These data cover the full biological spectrum from genotype to phenotype (see Fig. 1).

Investigators have developed a stratification of CFS which characterises its clinical significance [23]. Their initial hypothesis stated that gene expression profiling would allow them to establish prognostic indicators of the syndrome. We have queried this assumption and asked whether there is a biological basis to CFS, or whether it has a purely psychosocial aetiology. In particular, we have focussed on whether we can identify a pathological lesion for CFS in peripheral blood. Put more simply: in what way, and to what extent, are the SNP, gene expression, proteomic and blood chemistry profiles different between non-fatigued (NF) subjects and those with CFS or a fatigue syndrome? These questions influenced the outline of the study scenario and the components included in the integrated data set.
3.3 The Study Scenario

A specific instance of the “Extract-Explain-Generate” methodology was applied to the problem of identifying a biological basis to CFS. It is thought that CFS is unlikely to be caused by a single agent [20]. This multifactorial nature of the problem domain motivates us to extend the “Extract-Explain-Generate” methodology to take a complex systems approach towards the analysis, illustrated in the schematic in Fig. 4.
Fig. 4. An instance of “Extract-Explain-Generate” applied in the Chronic Fatigue Syndrome case study
We use the labels “global” and “local” patterns to distinguish between patterns derived from, and valid over, the integrated data set and patterns generated from a subset of attributes. For example, clusters of patients are global patterns if they are generated from the clinical and gene expression data in the integrated data set. Global patterns are aimed at establishing deep linkages between the attributes, and within and between the components of the integrated data set, that explain or question assumptions about the phenomenon. We therefore label approaches and algorithms seeking them as “constructionist”. If we use just a subset of attributes, such as a list of individual genes but not all genes, then the patterns derived are local. Such reductionist approaches are common in microarray data analysis and are used to discover biomarkers for diseases. Local patterns can be viewed as the output of a reductionist approach in predictive modeling, where one looks for the attributes that allow accurate predictive models to be generated, without necessarily providing an explanation of the underlying phenomenon. We posit that such reductionist approaches are less useful for highly dimensional datasets and for multifactorial diseases, for two reasons. First, classification between classes of patients in high-dimensional datasets is susceptible to the “curse of dimensionality”, where the biological markers (genes) chosen differentiate training examples but do not generalise well to unseen data. This is a result of insufficient patient samples
compared to the number of data items collected per sample: it is infeasible to collect sufficient patient samples for the extremely high-dimensional gene expression data (consisting of thousands of attribute values, i.e. genes, per patient). Secondly, given the predicted ‘multifactorial’ nature of CFS, the illness is likely to be multigenic, governed by small changes in many genes rather than a simple genetic defect involving a single gene. Consequently, we take a data-driven approach towards the interrogation, with the aim of getting a better understanding of the phenomena before phrasing a specific biological hypothesis. This approach aims to avoid the introduction of unnecessary bias.

“Extract” step

The data is pre-processed before the “Extract” stage. In particular, some attributes of the “illness” dataset (the clinical dataset containing the patients’ answers to the illness questionnaires) are omitted because they are (i) skewed, with almost all individuals having the same attribute value; (ii) not deemed useful for the data mining effort; or (iii) calculated by the original researchers, and would bias our efforts. The attributes concerned are “DOB”, “intake classific”, “cluster”, “onset”, “yrs ill”, “race” and “ethnic”. The dependent variable “Empiric” is used as the patient class, and patient subtypes are combined to make three classes: CFS, ISF and NF. To the other clinical dataset, concerning the complete blood evaluation of patients, we add a copy of the “Empiric” attribute. The datasets are linked by the patient identifier attribute “ABTID”. The gene expression datasets are combined into a single dataset for all individuals, and the “Empiric” patient class is linked as described above. Each gene for each patient in the gene expression dataset consists of four attribute values: “Spot Label” with the gene name, and three statistical measures of the gene expression value, including the standard deviation and mean of values within the spot. The statistical measures of the gene expression are normalised over all arrays and patients by multiplying the values by the average value of every gene over all arrays divided by the average value of every gene over the individual array. We create integrated datasets of pairs, and the triplet, of the individual pre-processed datasets.

As discussed above, the “Extract” step in this case takes a complementary constructionist and reductionist approach. The constructionist schema applies a kernel-based clustering and visualisation method to the integrated data set. This method [24] finds a low-dimensional projection of the integrated dataset such that the distance between points in the projection is similar to the distance in the kernel-induced feature space. The linear kernel is used, and additional pre-processing is applied to the clinical datasets: all attribute values are recoded to numeric values, patient class information is omitted from the calculation of the kernel (to obtain an unbiased visualisation), and the data is centred and normalised. The global patterns identified are clusters of patients in the low-dimensional projection of the original data. A sketch of this computation is given below.
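As an illustration of the constructionist computation, the sketch below normalises the arrays and projects the points into three dimensions via the eigendecomposition of the centred linear kernel (the standard kernel-PCA construction, used here as a stand-in for the method of [24]). The per-array normalisation follows one reading of the description above; the data shapes and names are assumptions.

```python
import numpy as np

def normalise_arrays(X):
    """Per-array global-mean scaling (one reading of the normalisation
    described above). X has one row per microarray and one column per
    spot; each row is scaled by (grand mean over all arrays) / (row mean)."""
    grand_mean = X.mean()
    row_means = X.mean(axis=1, keepdims=True)
    return X * (grand_mean / row_means)

def kernel_projection(K, dims=3):
    """Project points into `dims` dimensions so that distances
    approximate those in the kernel-induced feature space
    (centred-kernel eigendecomposition, as in kernel PCA)."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centring matrix
    Kc = J @ K @ J                            # centre the kernel
    evals, evecs = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:dims]    # top principal directions
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))

# Hypothetical usage: rows = patients/arrays, columns = attributes.
X = normalise_arrays(np.random.rand(212, 10000))
K = X @ X.T                                   # linear kernel matrix
coords3d = kernel_projection(K, dims=3)       # points for the 3D view
```

Kernel matrices for the pairwise and triplet integrated datasets are then obtained by simply adding the matrices of the individual datasets, as described next.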
Efficient calculation of the kernel matrix for the gene expression data requires special treatment. Each row of the gene expression dataset represents an individual gene measurement for a particular microarray (i.e. for each patient). The straightforward approach to calculating the linear kernel matrix is to concatenate the rows of the gene expression dataset into a matrix consisting of one row for each array, with a set of attribute values for each spot label (“ARM Dens - Levels”, “MAD - Levels” and “SD - Levels”), and then to calculate the linear kernel by multiplying the matrix with its transpose. This approach and the corresponding algorithms are clearly impractical in our situation because of the large number of genes on each array. We therefore developed a more efficient approach, motivated by computational linguistics, for direct computation of the linear kernel matrix from the gene expression data. The kernel value for two samples (i.e. microarrays) is calculated from sorted lists of genes (spot labels) associated with each array: it is the sum of the products of the attribute values for the genes matching in both lists. Computation of the linear kernel matrices for the integrated datasets is then simply a matter of adding the linear kernel matrices of the individual datasets.

The reductionist schema in this case is based on the Gene Feature Ranking (GFR) method developed by the team. It calculates a rank that measures, for each gene, the separation between the fatigued (CFS/ISF) and non-fatigued (NF) data points. Each gene is assigned a rank corresponding to the Euclidean distance between the normalised averaged “ARM Dens – Levels” and “MAD – Levels” values (in the gene expression dataset) for the 119 patients classified as fatigued and the corresponding averaged values for the 53 non-fatigued patients. Larger ranks correspond to spot labels that better discriminate the classes of patients. Similarly, distances are calculated for the other pair of gene expression measures (“MAD – Levels” and “SD – Levels”). The ranked genes are evaluated through an SMO (Sequential Minimal Optimisation) Support Vector Machine (SVM) classifier [25, 26], with the test error estimated by 10-fold cross-validation, again using the linear kernel function. By analogy with the Newton family of numerical methods for finding the roots of polynomial equations, we developed a search strategy with a variable step size to identify the optimum number of genes that results in the best classification. Sketches of the sorted-list kernel computation and of the GFR ranking are given below.
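Both computations are simple enough to sketch. The following reconstruction is ours, not the authors’ implementation: it assumes each array is stored as a list of (spot label, value) pairs sorted by spot label, and that the expression measures for GFR are supplied as a patients × genes × 2 array; all function and variable names are hypothetical.

```python
import numpy as np

def kernel_entry(arr_a, arr_b):
    """Linear kernel value for two microarrays held as lists of
    (spot_label, value) pairs sorted by spot label: the sum of
    products over genes present in both lists (a merge join, as in
    computing dot products of sparse term vectors in computational
    linguistics)."""
    i = j = 0
    total = 0.0
    while i < len(arr_a) and j < len(arr_b):
        la, va = arr_a[i]
        lb, vb = arr_b[j]
        if la == lb:
            total += va * vb
            i += 1
            j += 1
        elif la < lb:
            i += 1
        else:
            j += 1
    return total

def kernel_matrix(arrays):
    """Full linear kernel matrix; the matrix for an integrated dataset
    is obtained by adding the matrices of the individual datasets."""
    n = len(arrays)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel_entry(arrays[i], arrays[j])
    return K

def gfr_ranks(values, fatigued_mask):
    """Gene Feature Ranking: for each gene, the Euclidean distance
    between the class means (fatigued vs. non-fatigued) of a pair of
    normalised expression measures. `values` has shape
    (n_patients, n_genes, 2), holding e.g. the "ARM Dens - Levels"
    and "MAD - Levels" measures; larger ranks discriminate better."""
    fat_mean = values[fatigued_mask].mean(axis=0)       # (n_genes, 2)
    nf_mean = values[~fatigued_mask].mean(axis=0)       # (n_genes, 2)
    return np.linalg.norm(fat_mean - nf_mean, axis=1)   # (n_genes,)
```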
“Explain” step

The global patterns found with the kernel clustering and visualisation, and the local patterns discovered by the gene feature ranking algorithm, are explored in the “Explain” stage with decision tree and association analysis [27]. In this study we focus on the global patterns by applying decision tree analysis separately to three subsets of the complete blood evaluation clinical dataset, with respect to the patient class as defined by the “Empiric” attribute. Association analysis examined rules where the patient class is the “consequent” of the rule. The details are presented in the Outcomes subsection of this case study.

3.4 The Outcomes

We present the outcomes of the case study following the same structure as in the Study Scenario section, distinguishing the outcomes related to the “Extract” and “Explain” steps respectively.

“Extract” step

The “Extract” step of the methodology identifies patterns in the data for further explanation. The constructionist approach identifies global patterns in the integrated dataset, whilst the reductionist approach, in this problem, looks for patterns in the gene expression data set only.
Fig. 5. Kernel-based clustering and visualisation in 3 dimensions of (a) illness dataset, (b) blood dataset, (c) gene expression dataset and (d) the integrated blood, illness and gene expression dataset. Legend: + = NF patient, F = ISF patient, = CFS patient.
Fig. 5 shows some of the global patterns found as a result of the constructionist part of the “Extract” step. These results are the input to the interactive 3D visual data mining system, which offers various functions for exploring the visual space. The global patterns are evident as points in the 3-dimensional space comprising the first three principal components of the projection of points into the kernel feature space. Results are shown for the individual datasets and for the triplet, but not for the pairs of datasets. The kernel-based visualisation of the illness dataset (i.e. the survey information) in Fig. 5a clearly shows that the NF patients cluster together. This is expected, because medical professionals make the classification of patients into CFS, ISF and NF on the basis of information in the survey data. One CFS patient near this region appears to be clustered incorrectly. However, medical professionals use two different schemes to
classify patients and, using the other classification scheme, this patient is categorised as ISF. In other words, this patient is a borderline case. Less structure is evident in the visualisation of the blood dataset in Fig. 5b. This suggests that there may not be strong biological markers evident in the complete blood evaluation of patients. The clustering of the gene expression dataset in Fig. 5c shows three clear clusters which do not strongly correspond to the patient classes. This also suggests that there may not be a clear biological basis in the gene expression values. These results reinforce the multifactorial notions of this disease. Results from the reductionist (GFR) approach within the “Extract” step are illustrated in Fig. 6, which shows the accuracy of SVM classification with different numbers of the top-ranked genes/spots from the gene feature ranking.
Fig. 6. Accuracy of ranked spots classifiers (x-axis: number of ranked spots; y-axis: classification accuracy). Legend: × = “ARM Dens – Levels” and “SD – Levels”; = “MAD – Levels” and “SD – Levels”.
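A sketch of the kind of evaluation loop behind such a curve is given below. It is a stand-in under stated assumptions: the study used an SMO SVM [25, 26] with a variable-step search over gene counts, whereas this sketch uses scikit-learn’s linear-kernel SVC and a fixed grid of counts for brevity; the data in the usage example is synthetic.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def accuracy_vs_top_k(X, y, ranks, ks):
    """Cross-validated classification accuracy using the top-k
    GFR-ranked genes. X: (n_patients, n_genes) expression matrix;
    y: class labels; ranks: per-gene GFR scores; ks: counts to try."""
    order = np.argsort(ranks)[::-1]       # best-discriminating genes first
    results = {}
    for k in ks:
        cols = order[:k]
        clf = SVC(kernel="linear")        # stand-in for the SMO SVM
        scores = cross_val_score(clf, X[:, cols], y, cv=10)
        results[k] = scores.mean()
    return results

# Hypothetical usage with synthetic data shaped like the study
# (172 gene expression arrays, ~10,000 genes):
rng = np.random.default_rng(0)
X = rng.random((172, 10000))
y = np.array([1] * 119 + [0] * 53)        # fatigued vs. non-fatigued
ranks = rng.random(10000)
print(accuracy_vs_top_k(X, y, ranks, ks=[10, 100, 500, 1000]))
```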
The leftmost points on the graph use the 500 lowest-ranked genes for classification, to show the magnitude of the difference between classification accuracy at the two ends of the ranking scale. The graph shows that many spots are required to reach acceptable classification accuracy. Reductionist approaches like this assume (most likely incorrectly) that the factors affecting the outcome of the classification act independently. If an attribute is strongly correlated with another attribute, the usual advice (Occam’s razor) is to remove it; hence, the fewer attributes, the better. However, genes may not fit well into this modeling scheme, due to the variety of possible interactions between them; they are most likely not independent. The large number of genes necessary to achieve reasonable classification accuracy in Fig. 6 suggests, again, that
there is not a clear biological marker consisting of a small number of genes that discriminates between NF and CFS/ISF patients.

“Explain” step

The goal of the “Explain” step of the methodology is to provide an explanation of, or background to, the patterns found in the “Extract” step. As discussed above, decision tree and association analysis was applied to the clinical datasets. Interrogation of the complete blood evaluation data with decision tree and association analysis indicated few differences at the cellular level between the blood samples obtained from CFS, ISF and NF patients. There were slight imbalances in a range of attributes associated with red blood cells (RBC), and this may be characteristic of a fatigued patient. The ‘imbalances’, however, were mostly within the normal range for these attributes in the general population, and could not independently be used to diagnose a fatigue syndrome. For example, the trained decision tree found that all CFS patients had a Mean Corpuscular Volume (MCV) ≥ 81.15 fl (normal range 86±10 fl). Similarly, the ISF patients were found to have a Mean Corpuscular Haemoglobin (MCH) ≥ 26.45 pg, whereas the normal range is 29.5±2.5 pg. The biological interpretation is that, whilst the RBC attributes are not sufficient to characterise a ‘fatigued’ patient as having a form of anaemia, the imbalances may point to slight inefficiencies in the O2 distribution of CFS and ISF patients. The attributes, however, are not sufficient in themselves to be used as a diagnostic marker for a fatigue syndrome, nor do they necessarily reflect the underlying biological basis of the syndrome. That said, decision tree analysis identified that the NF samples were characterised by CO2 ≥ 21.4 units (58 of the 63 patients also matching MCH > 33.45 and anion gap ≥ 21.4), whereas the CFS patients were characterised by CO2 ≤ 28.9 units. Given that the normal range is 20-30 units, it appears that, for this attribute at least, the difference identified by the decision trees may represent different distributions between the test and control patient cohorts, with both cohorts having values within the range found to be normal in the wider general population. Clearly, however, the biological differences between the blood count and chemistry of the fatigued and NF patients are minimal and not useful as an independent classifier of CFS. The imbalances detected may, however, in combination with the other available data, allow the construction of a multifactorial or multigenic classifier for fatigue syndromes. Indeed, F-test analysis of the MCV and MCH variables indicates that the sample variances between the CFS and NF populations (excluding the ISF samples) were significantly different (p