S. K. Mourya Shalu Gupta
Alpha Science International Ltd. Oxford, U.K.
Data Mining and Data Warehousing 214 pgs. | 85 figs. | 7 tbls.
S. K. Mourya
Shalu Gupta
Department of Computer Science and Engineering
MGM's College of Engineering & Technology, Noida

Copyright © 2013
ALPHA SCIENCE INTERNATIONAL LTD.
7200 The Quorum, Oxford Business Park North, Garsington Road, Oxford OX4 2JZ, U.K.
www.alphasci.com

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of the publisher.

Printed from the camera-ready copy provided by the Authors.

ISBN 978-1-84265-757-7
E-ISBN 978-1-84265-995-3

Printed in India
Dedicated
To our parents, grandparents and most beloved Shilu and our children
S. K. Mourya
Shalu Gupta
Preface

The explosion of information technology, which continues to expand data-driven markets and business, has made data mining an even more relevant topic of study. Books on data mining tend to be either broad and introductory or focused on some very specific technical aspect of the field. Data Mining and Data Warehousing explores, in nine chapters, the core of data mining (classification, clustering and association rules) by offering overviews that include both analysis and insight. It is written for graduate students from various parts of the country studying data mining courses, and is an ideal companion to either an introductory data mining textbook or a technical data mining book. Unlike many other books that mainly focus on the modeling part, this volume discusses all the important—and often neglected—parts before and after modeling.

The book is organized as follows. Chapter 1 gives a brief introduction to data mining, which offers great promise in helping organizations uncover patterns hidden in their data that can be used to predict the behaviour of customers, products and processes; it also touches on various forms of data preprocessing such as data cleaning, missing values and noisy data. In this chapter, the various classifications of data mining systems and the major issues in data mining are also explained, with examples to illustrate them.

Chapter 2 explains data preprocessing and the need for it, covering the forms of data preprocessing, e.g., data cleaning, missing values and noisy data. In this chapter we have also discussed how to handle inconsistent data, and data integration and transformation.

Chapter 3 explains how statistical measures are used in large databases, through measures of central tendency, measures of the dispersion of data, and some graphical techniques used in the analysis of continuous data. Further in the chapter we have also discussed the data cube approach (OLAP).
Chapter 4 deals with the discovery of frequent patterns, associations, and correlation relationships among huge amounts of data, and with how these are useful in selective marketing, decision analysis, and business management. A popular area of application is market basket analysis, which studies the buying habits of customers by searching for sets of items that are frequently purchased together (or in sequence). Association rule mining, which consists of first finding frequent item sets from which strong association rules in the form of A=>B are generated, has also been dealt with.

In Chapter 5, we have explained how classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. While classification predicts categorical labels (classes), prediction models deal with continuous-valued functions.

Chapter 6 explains that many clustering algorithms have been developed. These can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent pattern-based methods), and constraint-based methods. Some algorithms may belong to more than one category.

In Chapter 7, we study a well-accepted definition of the data warehouse and see why more and more organizations are building data warehouses for the analysis of their data. In particular, we study the data cube, a multidimensional data model for data warehouses, and data warehouse architecture and design. Through this chapter we see that the presence of a data warehouse is a very useful precursor to data mining, and that if it is not available, many of the steps involved in data warehousing will have to be undertaken to prepare the data for mining.

Chapter 8 presents an overview of OLAP technology: aggregation, efficient query facilities and its multidimensional aspects. Such an overview is essential for understanding the overall data mining and knowledge discovery process.

Chapter 9 deals with the various issues of privacy and security of data that emerged at a relatively early stage in the development of data mining. This development is not at all surprising given that all activities of data mining revolve around data, and many sensitive issues of accessibility or possible reconstruction of data records exist, along with backup and testing concerns.

S. K. Mourya
Shalu Gupta
Acknowledgements

"Few successful endeavors have ever been made by one person alone, and this book is no exception."

We would like to thank the people and organizations that supported us in the production of this book; the authors are greatly indebted to all of them. There are several individuals and organizations whose support demands special mention, and they are listed in the following. Our special thanks to the MGM family: Mr. Kamalkishor N. Kadam (Chairman), Dr. Mrs. Geeta S. Lathkar (Director) and Prof. Sunil Wagh (VP), for being our inspiration and enabling us to publish this book. Above all, we want to thank our colleagues of the Computer Science and Engineering Department, MGM-Noida, and our family and friends who supported and encouraged us in spite of all the time it took us away from them. We also wish to thank Mr. N. K. Mehra, Director, Narosa Publishing House Pvt. Ltd., for his enthusiasm, patience, and support during our writing of this book. Our Production Editor and his staff also deserve our special thanks for their conscientious efforts regarding production, editing, proof-reading and design. We would like to express our gratitude to the many people who saw us through this book and to all of the reviewers for their invaluable feedback.

S. K. Mourya
Shalu Gupta
Contents

Preface
Acknowledgements

1. Introduction
   1.1 What is Data, Information and Knowledge?
   1.2 What Motivated Data Mining?
   1.3 Data Mining: Overview
       1.3.1 Data collections and data availability
       1.3.2 Some alternative terms for data mining
       1.3.3 Steps for data processing
   1.4 Typical Architecture of a Data Mining System
   1.5 Data Mining: What Kind of Data can be Mined?
       1.5.1 Flat files
       1.5.2 Relational databases
       1.5.3 Data warehouses
       1.5.4 Transaction databases
       1.5.5 Multimedia databases
       1.5.6 Spatial databases
       1.5.7 Time-series databases
       1.5.8 World Wide Web
   1.6 Data Mining: What can be Discovered?
   1.7 Classification of Data Mining Systems
   1.8 Major Issues in Data Mining
   Summary
   Exercises

2. Data Preprocessing
   2.1 Need of Data Processing
   2.2 Form of Data Preprocessing
   2.3 Data Cleaning
       2.3.1 Missing values
       2.3.2 Noisy data
       2.3.3 How to handle inconsistent data?
   2.4 Data Integration and Transformation
       2.4.1 Data integration
       2.4.2 Data transformation
   2.5 Data Reduction
       2.5.1 Data cube aggregation
       2.5.2 Attribute subset selection
       2.5.3 Dimensionality reduction
       2.5.4 Numerosity reduction
       2.5.5 Discretization and concept hierarchy generation
   Summary
   Exercises

3. Statistics and Concept Description in Data Mining
   3.1 Overview
   3.2 Statistics Measures in Large Databases
       3.2.1 Measuring of central tendency
       3.2.2 Measuring dispersion of data
   3.3 Graphical Techniques used in Data Analysis of Continuous Data
   3.4 Concept/Class Description: Characterization and Discrimination
       3.4.1 Methods for concept description
       3.4.2 Differences between concept description in large databases and on-line analytical processing
   3.5 Data Generalization and Summarization based Characterization
       3.5.1 Data cube approach (OLAP)
       3.5.2 Attribute-oriented induction approach (AOI)
   3.6 Mining Class Comparisons: Discrimination between Classes
   Summary
   Exercises

4. Association Rule Mining
   4.1 Introduction
       4.1.1 Frequent pattern analysis
       4.1.2 What is market basket analysis (MBA)?
   4.2 Concepts of Association Rule Mining
   4.3 Frequent Pattern Mining in Association Rules
   4.4 Mining Single-dimensional Boolean Association Rules
       4.4.1 Apriori algorithm
       4.4.2 Improving the efficiency of the Apriori rules
   4.5 Mining Multi-level Association Rules from Transaction Database
       4.5.1 Multi-level association rules
       4.5.2 Approaches to mining multi-level association rules
   4.6 Mining Multi-dimensional Association Rules from Relational Databases and Data Warehouses
       4.6.1 Multi-dimensional association rules
   Summary
   Exercises

5. Classification and Prediction
   5.1 Classification
   5.2 Prediction
   5.3 Why are Classification and Prediction Important?
   5.4 What is Test Data?
   5.5 Issues Regarding Classification and Prediction
       5.5.1 Preparing the data for classification and prediction
       5.5.2 Comparing classification and prediction methods
   5.6 Decision Tree
       5.6.1 How a decision tree works
       5.6.2 Decision tree induction
       5.6.3 What is decision tree learning algorithm?
       5.6.4 ID3
   5.7 Bayesian Classification
       5.7.1 Naïve Bayesian classifiers
       5.7.2 Bayesian networks
   5.8 Neural Networks
       5.8.1 Feed forward
       5.8.2 Backpropagation
   5.9 K-nearest Neighbour Classifiers
   5.10 Genetic Algorithm
   Summary
   Exercises

6. Cluster Analysis
   6.1 Cluster Analysis: Overview
   6.2 Stages of Clustering Process
   6.3 Where do We Need Clustering/Application Areas?
   6.4 Characteristics of Clustering Techniques in Data Mining
   6.5 Data Types in Cluster Analysis
       6.5.1 Data matrix (or object-by-variable structure)
       6.5.2 Dissimilarity matrix (or object-by-object structure)
   6.6 Categories of Clustering Methods
       6.6.1 Partitioning methods
       6.6.2 Hierarchical clustering
       6.6.3 Density based methods
       6.6.4 Grid based methods
       6.6.5 Model based methods: statistical approach, neural network approach and outlier analysis
   Summary
   Exercises

7. Data Warehousing Concepts
   7.1 What is a Data Warehouse?
       7.1.1 Types of data warehouse
       7.1.2 Data warehouse access tools
       7.1.3 Data warehouse advantages
       7.1.4 Differences between operational database systems and data warehouses
       7.1.5 Difference between database (DB) and data warehouse
       7.1.6 Transaction database vs. operational database
   7.2 Multidimensional Data Model
       7.2.1 Data cubes
       7.2.2 Star schema
       7.2.3 Snowflake schema
       7.2.4 Fact constellation schema
       7.2.5 Concept hierarchy
   7.3 Data Warehouse Design
       7.3.1 The process of data warehouse design
   7.4 Architecture of Data Warehouse
       7.4.1 Two-tier architecture of data warehouse
       7.4.2 Three-tier architecture
   7.5 What is Data Mart?
   Summary
   Exercises

8. OLAP Technology and Aggregation
   8.1 Aggregation
       8.1.1 Cube aggregation
       8.1.2 Historical information
   8.2 Query Facility
       8.2.1 Efficient processing of OLAP queries
       8.2.2 Transformation of complex SQL queries
       8.2.3 System building blocks for an efficient query system
       8.2.4 Query performance without impacting transaction processing
   8.3 Online Analytical Processing (OLAP)
       8.3.1 OLAP/OLAM architecture
       8.3.2 Characteristics of OLAP
       8.3.3 OLAP cube life-cycle
       8.3.4 OLAP functions/analytical operations
           8.3.4.1 Roll-up/drill-up or consolidate
           8.3.4.2 Drill-down
           8.3.4.3 Slice and dice
           8.3.4.4 Pivot (rotate)
       8.3.5 OLAP tools
       8.3.6 Types of OLAP servers
           8.3.6.1 Multidimensional OLAP (MOLAP): cube based
           8.3.6.2 Relational OLAP (ROLAP): star schema based
           8.3.6.3 Comparison between MOLAP and ROLAP
           8.3.6.4 Hybrid OLAP (HOLAP)
   8.4 Data Mining vs. OLAP
   Summary
   Exercises

9. Data Mining Security, Backup, Recovery
   9.1 Data Mining Interfaces
       9.1.1 Programmatic interfaces
       9.1.2 Graphical user interface
   9.2 Data Mining and Security
       9.2.1 Identifying the data
       9.2.2 Classifying the data
       9.2.3 Quantifying the value of data
       9.2.4 Identifying data vulnerabilities
       9.2.5 Identifying protective measures and their costs
       9.2.6 Why is security necessary for a data warehouse?
   9.3 Backup and Recovery
   9.4 Tuning Data Warehouse
   9.5 Testing Data Warehouse
       9.5.1 Data warehouse testing responsibilities
       9.5.2 Business requirements and testing
       9.5.3 Data warehousing test plan
       9.5.4 Challenges of data warehouse testing
       9.5.5 Categories of data warehouse testing
   Summary
   Exercises

Model Paper
Glossary
Index
CHAPTER 1
Introduction

We are in an age often referred to as the information age. In this information age, because we believe that information leads to power and success, and thanks to sophisticated technologies such as computers, satellites, etc., we have been collecting tremendous amounts of information. Initially, with the advent of computers and means for mass digital storage, we started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information. Unfortunately, these massive collections of data stored on disparate structures very rapidly became overwhelming. This initial chaos has led to the creation of structured databases and database management systems (DBMS). Efficient database management systems have been very important assets for the management of a large corpus of data and especially for effective and efficient retrieval of particular information from a large collection whenever needed. The proliferation of database management systems has also contributed to the recent massive gathering of all sorts of information.
CHAPTER OBJECTIVES
• What are Data, Information and Knowledge?
• What motivated Data Mining?
• Data Mining – Overview
• Typical Architecture of a Data Mining System
• Data Mining – What kind of Data can be mined?
• Data Mining – What can be discovered?
• Classification of Data Mining Systems
• Major issues in Data Mining
Today, we have far more information than we can handle: from business transactions and scientific data, to satellite pictures, text reports and military intelligence. Information retrieval is simply not enough anymore for decision-making. Confronted with huge collections of data, we have now created new needs to help us make better managerial choices. These needs are automatic summarization of data, extraction of the "essence" of information stored, and the discovery of patterns in raw data.
1.1 WHAT IS DATA, INFORMATION AND KNOWLEDGE?
Data are the fundamental element of cognition, the common denominator on which all constructs are based, and are stored in information systems. Derived from data, and positioned along a continuum that eventually leads to wisdom, are information and knowledge. Data categories are groupings of data with common characteristics or features. Data can be viewed in three stages: data, information and knowledge.

Data: Data are any facts, numbers, or text that can be processed by a computer. Today organizations are accumulating vast and growing amounts of data in different formats and databases. This includes:
1. Operational or transactional data such as sales, cost, inventory, payroll, and accounting.
2. Non-operational data like industry sales, forecast data, and macroeconomic data.
3. Metadata: data about the data itself, such as logical database design or data dictionary definitions, which literally means "data about data."

Information: The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling and when.
Figure 1.1 Extraction of knowledge from data
Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behaviour. Thus a manufacturer or a retailer could determine those items that are most susceptible to promotional efforts.
1.2 WHAT MOTIVATED DATA MINING?
"Necessity is the mother of invention." The major reason that data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from business management, production control, and market analysis, to engineering design and scientific exploration.
1.3 DATA MINING: OVERVIEW
Data mining helps end users extract useful information from large databases. These large databases are present in data warehouses, i.e., the "data mountain," which is presented to data mining tools. In short, data warehousing allows one to build the data mountain. Data mining is the nontrivial extraction of implicit, previously unknown and potentially useful information from the data mountain. Data mining is not specific to any industry – it requires intelligent technologies and the willingness to explore the possibility of hidden knowledge that resides in the data, as shown in figure 1.2. Data mining is also referred to as knowledge discovery in databases (KDD).
Figure 1.2 Flow of data toward producing knowledge
Data mining: What it can't do
Data mining is a tool, not a magic wand. It won't sit in your database watching what happens and send you e-mail to get your attention when it sees an interesting pattern. It doesn't eliminate the need to know your business, to understand your data, or to understand analytical methods. Data mining assists business analysts with finding patterns and relationships in the data — it does not tell you the value of the patterns to the organization. Furthermore, the patterns uncovered by data mining must be verified in the real world.

Data mining can be described in many ways:
• Data mining refers to extracting or "mining" knowledge from large amounts of data.
• Data mining is a process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.
• Data mining is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
• Data mining, as we use the term, is the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules. We assume that the goal of data mining is to allow a corporation to improve its marketing, sales, and customer support operations through a better understanding of its customers.

1.3.1 Data Collections and Data Availability
We have been collecting a myriad of data, from simple numerical measurements and text documents, to more complex information such as spatial data, multimedia channels, and hypertext documents. Here is a non-exclusive list of a variety of information collected in digital form in databases and in flat files.
• Business transactions: Every transaction in the business industry is (often) "memorized" for perpetuity. Such transactions are usually time related and can be inter-business deals such as purchases, exchanges, banking, stock, etc.
• Scientific data: Whether in a Swiss nuclear accelerator laboratory counting particles, in the Canadian forest studying readings from a grizzly bear radio collar, on a South Pole iceberg gathering data about oceanic activity, or in an American university investigating human psychology, our society is amassing colossal amounts of scientific data that need to be analyzed.
• Medical and personal data: From government census to personnel and customer files, very large collections of information are continuously gathered about individuals and groups.
• Surveillance video and pictures: With the amazing collapse of video camera prices, video cameras are becoming ubiquitous. Video tapes from surveillance cameras are usually recycled and thus the content is lost.
• Satellite sensing: There is a countless number of satellites around the globe: some are geo-stationary above a region, and some are orbiting around the Earth, but all are sending a non-stop stream of data to the earth.
• Games: Our society is collecting a tremendous amount of data and statistics about games, players and athletes.
• Digital media: The proliferation of cheap scanners, desktop video cameras and digital cameras is one of the causes of the explosion in digital media repositories.
• CAD and software engineering data: There are a multitude of Computer Assisted Design (CAD) systems for architects to design buildings or engineers to conceive system components or circuits.
• Virtual worlds: There are many applications making use of three-dimensional virtual spaces. These spaces and the objects they contain are described with special languages such as VRML.
• Text reports and memos (e-mail messages): Most of the communications within and between companies or research organizations, or even private people, are based on reports and memos in textual forms often exchanged by e-mail.
• The World Wide Web repositories: Since the inception of the World Wide Web in 1993, documents of all sorts of formats, content and description have been collected and inter-connected with hyperlinks, making it the largest repository of data ever built.

1.3.2 Some Alternative Terms for Data Mining
• Knowledge discovery (mining) in databases (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archeology
• Data dredging
• Information harvesting
• Business intelligence, etc.

1.3.3 Steps for Data Processing
Data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long. "Knowledge mining," a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material.
Figure 1.3 Data mining is the core of knowledge discovery process
As shown in figure 1.3 above, the knowledge discovery in databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:
• Data cleaning: also known as data cleansing, this is a phase in which noisy data and irrelevant data are removed from the collection.
• Data integration: in this stage, multiple data sources, often heterogeneous, may be combined in a common source.
• Data selection: in this step, the data relevant to the analysis is decided on and retrieved from the data collection.
• Data transformation: also known as data consolidation, this is a phase in which the selected data is transformed into forms appropriate for the mining procedure.
• Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns.
• Pattern evaluation: in this step, strictly interesting patterns representing knowledge are identified based on given measures.
• Knowledge representation: the final phase in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
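To make these steps concrete, here is a minimal sketch of such an iterative pipeline. It assumes pandas is available; the toy data sources, column names, and the simple co-occurrence count used as the "mining" step are illustrative assumptions, not the book's own implementation.

```python
# A minimal, illustrative sketch of the KDD steps described above.
import pandas as pd

def clean(df):
    # Data cleaning: drop invalid rows and duplicate records.
    return df.dropna(subset=["customer_id", "item"]).drop_duplicates()

def integrate(*sources):
    # Data integration: combine several sources into one frame.
    return pd.concat(sources, ignore_index=True)

def select(df, columns):
    # Data selection: keep only the attributes relevant to the analysis.
    return df[columns]

def transform(df):
    # Data transformation: normalise item labels into a consistent form.
    return df.assign(item=df["item"].str.strip().str.lower())

def mine(df):
    # "Data mining" placeholder: count how often each pair of items occurs
    # for the same customer (a crude frequent-pattern signal).
    merged = df.merge(df, on="customer_id")
    pairs = merged[merged["item_x"] < merged["item_y"]]
    return pairs.groupby(["item_x", "item_y"]).size().sort_values(ascending=False)

store_a = pd.DataFrame({"customer_id": [1, 1, 2], "item": ["Bread", "Milk ", "bread"]})
store_b = pd.DataFrame({"customer_id": [2, 3, 3], "item": ["Butter", "Milk", "Bread"]})

patterns = mine(transform(select(integrate(clean(store_a), clean(store_b)),
                                 ["customer_id", "item"])))
print(patterns)  # pattern evaluation and knowledge presentation would follow
```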
1.4 TYPICAL ARCHITECTURE OF A DATA MINING SYSTEM
As shown in figure 1.4, the architecture of a typical data mining system may have the following major components:
Figure 1.4 Architecture of a data mining system

• Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
• Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.
• Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).
• Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
• Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process, so as to confine the search to only the interesting patterns.
1.5 DATA MINING: WHAT KIND OF DATA CAN BE MINED?
In principle, data mining is not specific to one type of media or data. Data mining should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data. Indeed, the challenges presented by different types of data vary significantly. Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files. Here are some examples in more detail:

1.5.1 Flat Files
Flat files are actually the most common data source for data mining algorithms, especially at the research level (figure 1.5). Flat files are simple data files in text or binary format with a structure known by the data mining algorithm to be applied. The data in these files can be transactions, time-series data, scientific measurements, etc.
Figure 1.5 Fragments of some relations from a relational database for XYZ store
1.5.2 Relational Databases
Briefly, a relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a unique key. In figure 1.5 we present some relations, Customer, Items, and Borrow, representing business activity in a fictitious video store, the XYZ store. These relations are just a subset of what could be a database for the video store and are given as an example. The most commonly used query language for relational databases is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count.
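As a small illustration of the aggregate functions just mentioned, the sketch below runs them against a toy "items" table for the fictitious XYZ store. The table and column names are made up for the example, and only Python's built-in sqlite3 module is used.

```python
# Aggregate functions (COUNT, SUM, AVG, MIN, MAX) over a toy relational table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_id INTEGER, title TEXT, rental_price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, "Movie A", 3.50), (2, "Movie B", 2.00), (3, "Movie C", 4.25)],
)

row = conn.execute(
    "SELECT COUNT(*), SUM(rental_price), AVG(rental_price), "
    "MIN(rental_price), MAX(rental_price) FROM items"
).fetchone()
print(row)   # (3, 9.75, 3.25, 2.0, 4.25)
```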
1.5.3 Data Warehouses

A data warehouse, as a storehouse, is a repository of data collected from multiple data sources (often heterogeneous) and is intended to be used as a whole under the same unified schema, as in figure 1.6. A data warehouse gives the option to analyze data from different sources under the same roof.
Figure 1.6 Data warehouse
Data warehouses are an important asset for organizations to maintain efficiency, profitability and competitive advantages. Organizations collect data through many sources—Online, Call Center, Sales Leads, Inventory Management. The data collected have degrees of value and business relevance. As data is collected, it is passed through a 'conveyor belt', called the Data Life Cycle Management.

1.5.4 Transaction Databases
A transaction database is a set of records representing transactions, each with a time stamp as given in figure 1.7, an identifier and a set of items. Associated with the transaction files could also be descriptive data for the items.
Figure 1.7 Fragment of a transaction database for the rentals at XYZ store

1.5.5 Multimedia Databases
Multimedia databases include video, images, audio and text media. They can be stored on extended object-relational or object-oriented databases, or simply on a file system. Multimedia is characterized by its high dimensionality, which makes data mining even more challenging. Data mining from multimedia repositories may require computer vision, computer graphics, image interpretation, and natural language processing methodologies.

1.5.6 Spatial Databases
As shown in figure 1.8 spatial databases are databases that, in addition to usual data, store geographical information like maps, and global or regional positioning. Such spatial databases present new challenges to data mining algorithms.
Figure 1.8 Visualization of spatial OLAP

1.5.7 Time-Series Databases
Time-series databases contain time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes causes the need for challenging real-time analysis. Data mining in such databases commonly includes the study of trends and correlations between evolutions of different variables, as well as the prediction of trends and movements of the variables in time.

1.5.8 World Wide Web
The World Wide Web is the most heterogeneous and dynamic repository available. A very large number of authors and publishers are continuously contributing to its growth and metamorphosis, and a massive number of users are accessing its resources daily. Data in the World Wide Web is organized in interconnected documents. These documents can be text, audio, video, raw data, and even applications.
1.6 DATA MINING: WHAT CAN BE DISCOVERED?
The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and large, there are two types of data mining tasks: descriptive data mining tasks that describe the general properties of the existing data, and predictive data mining tasks that attempt to do predictions based on inference on available data. The data mining functionalities and the variety of knowledge they discover are briefly presented below:

Characterization: Data characterization is a summarization of general features of objects in a target class, and produces what are called characteristic rules. The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction.

Discrimination: Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes referred to as the target class and the contrasting class.

Association analysis: Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases, and based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis. (A small numeric illustration of support and confidence follows this list of functionalities.)

Classification: Classification analysis is the organization of data in given classes. Also known as supervised classification, the classification uses given
class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels.

Prediction: Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of predictions: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. Prediction is however more often referred to as the forecast of missing numerical values, or increase/decrease trends in time-related data. The major idea is to use a large number of past values to consider probable future values.

Clustering: Similar to classification, clustering is the organization of data in classes. However, unlike classification, in clustering the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).

Outlier analysis: Outliers are data elements that cannot be grouped in a given class or cluster. Also known as exceptions or surprises, they are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in other domains, and thus can be very significant and their analysis valuable.

Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of time-related data that changes in time. Evolution analysis models evolutionary trends in data, which allows the characterizing, comparing, classifying or clustering of time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.
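Since association analysis is defined above in terms of support and confidence, the short sketch below computes both for a single candidate rule over a handful of made-up transactions; the item names and numbers are illustrative only.

```python
# Support and confidence for one association rule A => B over toy baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

A, B = {"bread"}, {"milk"}
n = len(transactions)

support_AB = sum(1 for t in transactions if A | B <= t) / n   # P(A and B)
support_A = sum(1 for t in transactions if A <= t) / n        # P(A)
confidence = support_AB / support_A                           # P(B | A)

print(f"support(A=>B) = {support_AB:.2f}")     # 0.50 (2 of 4 baskets)
print(f"confidence(A=>B) = {confidence:.2f}")  # 0.67 (2 of the 3 baskets with bread)
```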
1.7 CLASSIFICATION OF DATA MINING SYSTEMS
Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology. There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or are confined to limited data mining functionalities, others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria. Among other classifications are the following:
Figure 1.9 Data mining as a confluence of multiple disciplines
Classification according to the type of data source mined: this classification categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, World Wide Web, etc.

Classification according to the data model drawn on: this classification categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional, etc.

Classification according to the kind of knowledge discovered: this classification categorizes data mining systems based on the kind of knowledge discovered or data mining functionalities, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data mining functionalities together.

Classification according to mining techniques used: Data mining systems employ and provide different techniques. This classification categorizes data mining systems according to the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data warehouse-oriented, etc. The classification can also take into account the degree of user interaction involved in the data mining process, such as query-driven systems, interactive exploratory systems, or autonomous systems. A comprehensive system would provide a wide variety of data mining techniques to fit different situations and options, and offer different degrees of user interaction.
1.8 MAJOR ISSUES IN DATA MINING
Before data mining develops into a conventional, mature and trusted discipline, many still pending issues have to be addressed. Some of these issues are addressed below. Note that these issues are not exclusive and are not ordered in any way. Security and social issues: Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decisionmaking. In addition, when data is collected for customer profiling, user behaviour understanding, correlating personal data with other information, etc.,
large amounts of sensitive and private information about individuals or companies is gathered and stored.

User interface issues: The major issues related to user interfaces and visualization are "screen real-estate", information rendering, and interaction. Interactivity with the data and data mining results is crucial since it provides means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.

Mining methodology issues: These issues pertain to the data mining approaches applied and their limitations. Topics to consider include the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, and the broad analysis needs (when known).

Performance issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data.

Data source issues: There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data since we already have more data than we can handle and we are still collecting data at an even higher rate.
_________________________________________________________________
SUMMARY

Data mining offers great promise in helping organizations uncover patterns hidden in their data that can be used to predict the behaviour of customers, products and processes. However, data mining tools need to be guided by users who understand the business, the data, and the general nature of the analytical methods involved. Realistic expectations can yield rewarding results across a wide range of applications, from improving revenues to reducing costs; the results of data mining can be profitable. Data warehouses are used to consolidate data located in disparate databases. A data warehouse stores large quantities of data by specific categories, so it can be more easily retrieved, interpreted, and sorted by users. Thus in this chapter we have seen an overview of data mining and data warehousing.
_________________________________________________________________
EXERCISES

Q1. Discuss the motivation for data mining. Why is it important?
Q2. Discuss the different data stores on which mining can be performed.
Q3. How is a data warehouse different from a database? How are they similar?
Q4. Define data mining. Specify some synonyms for data mining.
Q5. Are all of the patterns interesting?
Q6. Define each of the following data mining functionalities:
• Characterization
• Discrimination
• Association
• Classification and Prediction
• Clustering
• Evolution Analysis
Q7. How do you classify data mining systems?
Q8. Explain the typical architecture of a data mining system.
Q9. What do you mean by a data warehouse? Explain various access tools used in a data warehouse.
Q10. What are the various issues during data mining?
_________________________________________________________________
CHAPTER 2
Data Preprocessing

Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results.
CHAPTER OBJECTIVES
• Need of Data Processing
• Form of Data Preprocessing
• Data Cleaning
• Data Reduction
“How can the data be preprocessed in order to help/improve the quality of the data and consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process?”
2.1 NEED OF DATA PROCESSING
To get the required information from huge, incomplete, noisy and inconsistent sets of data, it is necessary to use data processing. Every business needs data mining because companies are looking for new ways to access data, and to allow end users access to the data they need for making decisions, serving customers, and gaining the competitive edge. We need data mining to collect data and allow us to analyze it; even if the data are not analyzed immediately, the collected data could be important in the near future. As a database grows, supporting decisions by processing it with traditional query languages is no longer feasible.
2.2 FORM OF DATA PREPROCESSING
Data preprocessing is an important issue for both data warehousing and data mining, as real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data preprocessing is an important step in the knowledge discovery process, because quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision making.
Figure 2.1 Forms of data preprocessing
As shown in figure 2.1, there are a number of data preprocessing techniques:
1. Data cleaning can be applied to remove noise and correct inconsistencies in the data.
2. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.
3. Data transformations, such as normalization, may be applied. For example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements.
4. Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance.
These techniques are not mutually exclusive; they may work together.
2.3 DATA CLEANING
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

2.3.1 Missing Values
Imagine that you need to analyze ABC-company sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. How can you go about filling in the missing values for this attribute? Let's look at the following methods:
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞. If missing values are replaced by, say, "Unknown," then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common—that of "Unknown." Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: For example, suppose that the average income of ABC-company customers is $56,000. Use this value to replace the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
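The following is a minimal sketch of strategies 3 to 5 above (global constant, attribute mean, and class-wise mean), assuming pandas is available; the customer data and column names are made up for the illustration.

```python
# Filling missing income values with a constant, the overall mean, and the
# mean of the same credit-risk class.
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [56000.0, None, 23000.0, None, 61000.0],
})

# 3. Fill with a global constant.
const_fill = df["income"].fillna(-1)

# 4. Fill with the overall attribute mean.
mean_fill = df["income"].fillna(df["income"].mean())

# 5. Fill with the mean of the same class (here, the same credit-risk group).
class_fill = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(mean_fill.tolist())   # missing incomes replaced by the average income
print(class_fill.tolist())  # missing incomes replaced per credit-risk group
```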
2.3.2 Noisy Data
Noise is a random error or variance in a measured variable. Given a numerical attribute such as, say, price, how can we "smooth" out the data to remove the noise? The following techniques can be used:
1. Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it. The sorted values are distributed into a number of "buckets," or bins, as seen in figure 2.2. Binning techniques first sort the data and partition the values into (equi-depth) bins; one can then smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc. (A small sketch follows this list of techniques.)
Figure 2.2 Binning methods for data smoothing
2. Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or “clusters”. Intuitively, values that fall outside of the set of clusters as in figure 2.3 may be considered outliers.
Figure 2.3 Clustering
3. Regression: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the “best” line to fit two attributes (or variables) as in figure 2.4, so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface.
Figure 2.4 Regression
4. Computer and human inspection: This can be called a semi-automated method; in other words, combined computer and human inspection is used to detect suspicious values, which are then checked manually.
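The sketch referred to in technique 1 above shows equi-depth binning with smoothing by bin means and by bin boundaries. The sorted price values are illustrative and only plain Python is used.

```python
# Equi-depth binning with smoothing by bin means and by bin boundaries.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
depth = 4   # equi-depth bins of 4 values each

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]
smoothed_by_mean = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]
smoothed_by_boundary = [
    [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins
]

print(bins)                  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed_by_mean)      # [[9.0, 9.0, 9.0, 9.0], [22.8, ...], [29.2, ...]]
print(smoothed_by_boundary)  # each value replaced by its nearest bin boundary
```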
2.3.3 How to Handle Inconsistent Data?
• Manual correction using external references
• Semi-automatic correction using various tools:
  – to detect violations of known functional dependencies and data constraints
  – to correct redundant data
2.4 DATA INTEGRATION AND TRANSFORMATION
2.4.1 Data Integration
Data integration combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files etc., stored under a unified schema, and that usually reside at a single site.
Figure 2.5 Issues to consider during data integration
There are various issues related to data integration, as follows:
1) Schema integration and object matching: How can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem.
Example: As shown in figure 2.6, how can the data analyst or the computer be sure that Name in one database and Given Name in another refer to the same attribute? The metadata may also be used to help transform the data.

  PID    Name     DOB            PID    Given Name   DOB
  1234   Sunny    06-01-1980     1234   Sunny        06-01-1980
  6791   Shweta   18-08-1995     6791   Shweta       18-08-1995

Figure 2.6 Two database tables showing the same attribute may have different names
2) Redundancy is another important issue. An attribute (such as annual revenue, for instance) may be redundant if it can be "derived" from another attribute or set of attributes.
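A minimal sketch of the schema-integration step for the two tables in figure 2.6 follows: the attribute "Given Name" in one source is mapped onto "Name" before the sources are merged on the shared key PID. It assumes pandas is available and the values are illustrative.

```python
# Resolving the entity identification problem before integrating two sources.
import pandas as pd

db1 = pd.DataFrame({"PID": [1234, 6791], "Name": ["Sunny", "Shweta"],
                    "DOB": ["06-01-1980", "18-08-1995"]})
db2 = pd.DataFrame({"PID": [1234, 6791], "Given Name": ["Sunny", "Shweta"],
                    "DOB": ["06-01-1980", "18-08-1995"]})

# Schema integration: "Given Name" in db2 refers to the same attribute as "Name".
db2 = db2.rename(columns={"Given Name": "Name"})

integrated = pd.merge(db1, db2, on=["PID", "Name", "DOB"], how="outer")
print(integrated)   # one coherent table; matching records are collapsed
```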
2.4.2 Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Data transformation can involve the following:
1) Smoothing: works to remove noise from the data. Such techniques include binning, regression, and clustering.
2) Aggregation: summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.
3) Generalization: low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country. Similarly, values for numerical attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior.
4) Normalization: the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
5) Attribute construction (or feature construction): new attributes are constructed and added from the given set of attributes to help the mining process.
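As a short sketch of the normalization step above, the function below rescales values into the range [0.0, 1.0] using min-max normalization; the income values are made up.

```python
# Min-max normalization of a numeric attribute to [0.0, 1.0].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

incomes = [12000, 35000, 56000, 98000]
print(min_max_normalize(incomes))
# [0.0, 0.267..., 0.511..., 1.0] -- all values now fall within [0.0, 1.0]
```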
2.5 DATA REDUCTION
Data reduction techniques obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results. Strategies for data reduction include the following:

2.5.1 Data Cube Aggregation
Where aggregation operations are applied to the data in the construction of a data cube.

2.5.2 Attribute Subset Selection
Where irrelevant, weakly relevant or redundant attributes or dimensions may be detected and removed.

2.5.3 Dimensionality Reduction
Where encoding mechanisms are used to reduce the data set size.

2.5.4 Numerosity Reduction
Where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
2.5.5 Discretization and Concept Hierarchy Generation
Where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction. ________________________________________________________________
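A brief sketch of discretization with a simple concept hierarchy follows: raw ages are replaced by the higher-level concepts youth, middle-aged and senior. The age boundaries are illustrative assumptions, not prescribed by the book.

```python
# Discretizing a numeric attribute (age) into a small concept hierarchy.
def discretize_age(age):
    if age < 30:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"

ages = [13, 22, 29, 35, 47, 58, 63, 71]
print([discretize_age(a) for a in ages])
# ['youth', 'youth', 'youth', 'middle-aged', 'middle-aged', 'middle-aged',
#  'senior', 'senior']
```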
SUMMARY
We can summarize that data preprocessing is an important issue for both data warehousing and data mining, as real-world data tend to be incomplete, noisy, and inconsistent. Data preprocessing includes data cleaning, data integration, data transformation, and data reduction. Although numerous methods of data preprocessing have been developed, data preprocessing remains an active area of research, due to the huge amount of inconsistent or dirty data and the complexity of the problem. ________________________________________________________________
EXERCISES

Q1. Define data mining. What are the various steps to process the knowledge during data mining?
Q2. Discuss issues to consider during data integration.
Q3. What do you mean by preprocessing? Explain with reference to: (i) Data Cleaning (ii) Data Reduction.
Q4. What is the need of data preprocessing?
Q5. Explain data cleaning.
Q6. Explain data integration and transformation.
Q7. What is data reduction? Enlist the various techniques used for data reduction. Explain any two in brief.
Q8. Explain discretization and concept hierarchy generation.
_______________________________________________________________
CHAPTER 3
Statistics and Concept Description in Data Mining

Statistics is a very important and well-established set of methodologies for data mining tasks. Historically, the first data analysis applications were based on statistics. Statistics offers a variety of methods for data mining, including different types of regression and discriminant analysis.
CHAPTER OBJECTIVES
• Overview
• Statistics Measures in Large Databases
• Concept/Class Description: Characterization and Discrimination
• Data Generalization and Summarization based Characterization
• Mining Class Comparisons: Discrimination between Classes
The main difference between typical statistical applications and data mining lies in the size of the data. In statistics, a "large" data set may contain a few hundred or a thousand data points, whereas in data mining, many millions or even billions of data points are not surprising. Such large databases occur in all real-world applications.
b VERVIEW
3.1
We have the aggregate functions like: Count(), sum(), avg(),max(), and min() in the SQL standards. But, we are going to a explore the data in a multidimensional manner that we need to take care of the data in several perspectives and represent the data in a much simpler manner than a conventional relational database, since we are going to handle huge volume of data. So we have, • • •
3.2
Measuring the Central Tendency Measuring the dispersion of data Graph displays of basic statistical class descriptions
fTATISTICS MEASURES IN LARGE DATABASES
From the data mining point of view, we need to examine how they can be computed efficiently in large databases. In particular, it is necessary to introduce the notions of distributive measure, algebraic measure, and holistic measure. Knowing what kind of measure we are dealing with can help us choose an efficient implementation for it. Statistics is the science of collecting, organizing, analyzing and presenting data. "Knowledge Discovery in Databases" is not much different. For many data preprocessing tasks, users would like to learn about data characteristics regarding both central tendency and dispersion of the data. Measures of central tendency include mean, median, mode, and midrange, while measures of data dispersion include quartiles, inter-quartile range (IQR), and variance. These descriptive statistics are of great help in understanding the distribution of the data. Such measures have been studied extensively in the statistical literature. 3.2.1
Measuring of Central Tendency
The term central tendency refers to the “middle” value or perhaps a typical value of the data, and is measured using the mean, median, mode and midrange. Each of this measure is calculated differently, and the one that is best to use depends upon the situation. • Mean is the most common and most effective numerical measure of the “center” of a set of data is the Arithmetic Mean. Mean with respect to Data mining and data-warehousing is nothing but the weighted arithmetic mean. The weight reflects the significance, importance, or occurrence frequency attached to their respective values. The mean (or average) of a set of data values is the sum of all of the data values divided by the number of data values. That is:
Mean = (Sum of all data values) / (Number of data values)

Symbolically,

    x̄ = (Σx) / n    or    mean(x) = x̄ = (1/n) Σ xi,  i = 1, ..., n

where x̄ (read as 'x bar') is the mean of the set of x values, Σx is the sum of all the x values, and n is the number of x values. In this case, we can compute the mean with the frequency values. Let x1, x2, ..., xn be a set of n values or observations, such as for some attribute like salary. The mean of this set of values is

    x̄ = (x1 + x2 + ... + xn) / n
Example: The marks of students in a class during the first sessional exam were 39, 45, 28, 29, 27, 31, 32. Find the mean marks of the class.

    Mean = (39 + 45 + 28 + 29 + 27 + 31 + 32) / 7 = 231 / 7 = 33

Hence the mean or average marks in the class are 33.
• Median is the middle value in the ordered set; as we all know, it requires arranging the distribution in ascending order and applying the following formula to calculate the median:

    median(x) = x(r+1)                    if m is odd, i.e., m = 2r + 1
    median(x) = (1/2) (x(r) + x(r+1))     if m is even, i.e., m = 2r
Example: The marks of students in a class during the first sessional exam were 39, 45, 28, 29, 27, 31, 32. Find the median of the marks of the class.

To find the median of the marks in the class, we arrange the elements in order from least to greatest: 27, 28, 29, 31, 32, 39, 45. The middle element is the fourth element here, which is 31. Hence the median mark is 31.
Figure 3.1  Measures of central tendency include mean, median, mode

• Mode is the value that occurs most frequently (the value with the highest frequency) in the distribution; it can be compared with the mean and median through figure 3.1.
For moderately skewed data, the mean, median and mode are related approximately by

    mean – mode = 3 × (mean – median)

Example: The marks of students in a class during the first sessional exam were 27, 27, 28, 29, 31, 32, 32, 39, 45. Find the mode of the marks of the class.

Here, 27 and 32 are both modes for the data set, so the data set is bimodal.
Mid-range is the average of the largest and the smallest values in a set.
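As a hedged illustration (added here, not part of the original text), the following short Python sketch computes the four central tendency measures above for the sessional-marks example, using only the standard library's statistics module (Python 3.8+ for multimode()).

import statistics

marks = [39, 45, 28, 29, 27, 31, 32]

mean = statistics.mean(marks)              # (39+45+28+29+27+31+32)/7 = 33
median = statistics.median(marks)          # middle of the sorted values -> 31
modes = statistics.multimode(marks)        # every value occurs once, so all values are modes
midrange = (max(marks) + min(marks)) / 2   # (45 + 27) / 2 = 36

print(mean, median, modes, midrange)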
3.2.2  Measuring the Dispersion of Data
Measures of dispersion, or measures of variability, address the degree to which the scores cluster about the mean.
The degree to which numeric data tend to spread is called the dispersion, or variance, of the data. The most common measures of data dispersion are the five-number summary (based on quartiles), the inter-quartile range, and the standard deviation.
• Five-Number Summary: The five-number summary of a continuous variable consists of the minimum value, the first quartile, the median, the third quartile, and the maximum value. The median, or second quartile, is the middle value of the sorted data. The first quartile, denoted by Q1, is the 25th percentile; the third quartile, denoted by Q3, is the 75th percentile of the sorted data. The distance between the third quartile and the first quartile (Q3 − Q1) is called the IQR and is thus a simple measure of spread that gives the range covered by the middle half of the data. These five numbers display the full range of variation (from minimum to maximum), the common range of variation (from first to third quartile), and a typical value (the median).
• Range: The range is the difference between the largest (max()) and smallest (min()) values, assuming that the data are sorted in increasing numerical order. However, a great deal of information is ignored, and the range is greatly influenced by outliers.
Example: In the case of the ABC company group, whose scores were 6, 6, 10, and 10, we would say, "The scores ranged from 6 to 10," or we could express the range as the difference between 10 and 6 (10 − 6), i.e., 4.
• Variance is the average of the squared deviations from the mean; because it relies on the squared differences of a continuous variable from the mean, a single outlier has a greater impact on the size of the variance than does a value close to the mean. The variance of n observations x1, x2, x3, ..., xn is

    Variance = s² = Σ(x − x̄)² / n

• Standard Deviation: The standard deviation (σ) is the square root of the variance (σ²). In a normal distribution, about 68% of the values fall within one standard deviation of the mean, and about 95% of the values fall within two standard deviations of the mean. Both variance and standard deviation measurements take into account the difference between each value and the mean.

    Standard Deviation = s = √( Σ(x − x̄)² / n )
The basic properties of the standard deviation, σ, as a measure of spread are:
o σ measures spread about the mean and should be used only when the mean is chosen as the measure of center.
o σ = 0 only when there is no spread, that is, when all observations have the same value. Otherwise σ > 0.
• Inter-Quartile Range: The IQR is a robust measure of dispersion. It is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). The IQR is hardly affected by extreme scores; therefore, it is a good measure of spread for skewed distributions. In normally distributed data, the IQR is approximately equal to 1.35 times the standard deviation. This distance is defined as: IQR = Q3 − Q1
A rule of thumb for identifying suspected outliers is to single out values falling at least 1.5 × IQR above the third quartile or below the first quartile.
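The following Python sketch (an added illustration with invented scores; NumPy is an assumed choice) computes the five-number summary, IQR, variance, standard deviation, and the 1.5 × IQR outlier rule of thumb described above.

import numpy as np

values = np.array([6, 6, 10, 10, 12, 15, 18, 21, 40])   # hypothetical scores

q1, median, q3 = np.percentile(values, [25, 50, 75])
five_number_summary = (values.min(), q1, median, q3, values.max())
iqr = q3 - q1                                  # inter-quartile range, Q3 - Q1

variance = values.var()                        # sum((x - mean)^2) / n
std_dev = values.std()                         # square root of the variance

# Suspected outliers: values at least 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(five_number_summary, iqr, variance, std_dev, outliers)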
3.3  GRAPHICAL TECHNIQUES USED IN DATA ANALYSIS OF CONTINUOUS DATA
Graphical representations, such as histograms, boxplots, quantile plots, quantile-quantile plots, scatter plots, and scatter-plot matrices, help convert complex and messy information in large databases into meaningful displays and are thus useful for data preprocessing and mining.
• Frequency Histogram: The histogram is a popular form of graphical display. A horizontal frequency histogram displays classes on the vertical axis and frequencies of the classes on the horizontal axis, as shown in figure 3.2. A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. Typically, the width of each bucket is uniform. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
Figure 3.2  Each bucket is represented by a rectangle whose height is equal to the count or relative frequency of the values in the bucket

Drawing Frequency Histograms
In drawing frequency histograms, bear in mind the following rules:
• Intervals should be equally spaced (see figure 3.3)
• Select intervals to have convenient values
• The number of intervals is usually between 6 and 20
  o Small amounts of data require fewer intervals
  o 10 intervals are sufficient for 50 to 200 readings
Figure 3.3  Example of a histogram
Example: Suppose that the values for a given set of data are grouped into intervals. The intervals and corresponding frequencies are as follows. Draw a frequency histogram for the unit prices ($).

    Unit price ($)    Number of items sold
    40-59             2750
    60-79             4157
    80-99             5089
    100-119           3954
    120-139           387
Solution:
Figure 3.4  Frequency histogram for unit prices ($)
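A small matplotlib sketch (added for illustration; matplotlib is an assumed library choice) that reproduces a bar-style frequency histogram such as figure 3.4 directly from the interval/frequency table above:

import matplotlib.pyplot as plt

intervals = ["40-59", "60-79", "80-99", "100-119", "120-139"]
counts = [2750, 4157, 5089, 3954, 387]

plt.bar(intervals, counts, width=1.0, edgecolor="black")   # touching bars, histogram style
plt.xlabel("Unit price ($)")
plt.ylabel("Number of items sold")
plt.title("Frequency histogram for unit prices ($)")
plt.show()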
• Box Plot: Box plots can be plotted based on the five-number summary and are a useful tool for identifying outliers. A box plot provides an excellent visual summary of many important aspects of a distribution. It is based on the five-number summary, i.e., on the median, quartiles, and extreme values.
Figure 3.5  The basic parts of a box plot

As shown in figure 3.5, the box stretches from the lower hinge (the first quartile) to the upper hinge (the third quartile) and therefore contains the middle half of the scores in the distribution. The median is shown as a line across the box; therefore one quarter of the distribution lies between this line and the bottom of the box. A box plot may be useful in detecting skewness to the right or to the left.
• Normal Probability Plot: The normal probability plot is a graphical univariate technique for assessing whether or not a variable is approximately normally distributed. The data are plotted against a theoretical normal distribution in such a way that the points should form an approximately straight line, as can be seen in the graph in figure 3.6. Departure from this straight line indicates departure from normality. A normal probability plot is also known as a normal quantile-quantile plot, or Q-Q plot.
The ordered data values are plotted on the y-axis against the associated quantiles of the normal distribution on the x-axis. The data are plotted against a presumed 45-degree reference line. If the two sets come from a population with the same distribution, the points should fall approximately along this reference line.
Figure 3.6  A quantile-quantile plot for the set of data

• 2D and 3D Scatter Plots: A scatter plot is a useful graphical method for determining whether there appears to be a relationship, pattern, or trend between two (three in 3D) numeric attributes. To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane.
Figure 3.7  A scatter plot for the set of data
The scatter plot is a useful method for providing a first look at bivariate data to see clusters of points and outliers as shown in figure 3.7, or to explore the possibility of correlation relationships.
3.4  CONCEPT/CLASS DESCRIPTION: CHARACTERIZATION AND DISCRIMINATION
From a data analysis point of view, data mining can be classified into two categories:
• Descriptive data mining describes concepts or task-relevant data sets in concise, summarative, informative, and discriminative forms.
• Predictive data mining, based on data and analysis, constructs models for the database and predicts the trend and properties of unknown data.
Databases usually store large amounts of data in great detail. However, users often like to view sets of summarized data in concise, descriptive terms. Such data descriptions may provide an overall picture of a class of data or distinguish it from a set of comparative classes. Moreover, users like the ease and flexibility of having data sets described at different levels of granularity and from different angles. Such descriptive data mining is called concept description, and forms an important component of data mining. Concept description is sometimes called class description when the concept to be described refers to a class of objects. A concept usually refers to a collection of data such as stereos, frequent buyers, graduate students, and so on. As a data mining task, concept description is not a simple enumeration of the data. Instead, it generates descriptions for characterization and comparison of the data. Characterization provides a concise and succinct summarization of the given collection of data, while concept or class comparison (also known as discrimination) provides descriptions comparing two or more collections of data. Since concept description involves both characterization and comparison, we will study techniques for accomplishing each of these tasks.

3.4.1  Methods for Concept Description
There are four methods used in concept description:
1. Multilevel generalization
2. Summarization
3. Characterization
4. Discrimination
Multilevel generalization is a process that employs concept hierarchies to provide descriptions of data objects at different granularities or levels of detail. It summarizes data by replacing relatively low-level values (such as numeric values for an attribute age) with higher-level concepts (such as young, middle aged and senior). Given the large amount of data stored in databases, it is useful to be able to describe concepts in concise and succinct terms at generalized (rather than low) levels of abstraction. Allowing data sets to be generalized at multiple levels of abstraction facilitates users in examining the general behaviour of the data. Summarization maps data into subsets with associated simple descriptions. It extracts or derives representative information about the database. This may be accomplished by actually retrieving portions of the data. For instance, we can
drill down on sales data summarized by quarter to see the data summarized by month. Similarly, we can roll up on sales data summarized by city to view the data summarized by country.
Characterization is a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class are typically collected by a database query. For example, to study the characteristics of xyz products whose sales increased by 10% in the last year, the data related to such products can be collected by executing an SQL query.
Discrimination is done by comparing the target class with one or a set of comparative classes (often called the contrasting classes). Hence the general features of target class data objects are compared with the general features of objects from the contrasting class(es). The target and contrasting classes can be specified by the user, and the corresponding data objects retrieved through database queries. For example, the user may like to compare the general features of xyz products whose sales increased by 12% in the last year with those whose sales decreased by at least 35% during the same period. The methods used for data discrimination are similar to those used for data characterization.

3.4.2  Differences between Concept Description in Large Databases and On-Line Analytical Processing
• Data warehouses and OLAP tools are based on a multidimensional data model which views data in the form of a data cube, consisting of dimensions (or attributes) and measures (aggregate functions). However, the possible data types of the dimensions and measures for most commercial versions of these systems are restricted. Many current OLAP systems confine dimensions to nonnumeric data. Similarly, measures (such as count(), sum(), average()) in current OLAP systems apply only to numeric data. In contrast, for concept formation, the database attributes can be of various data types, including numeric, nonnumeric, spatial, text or image. Furthermore, the aggregation of attributes in a database may include sophisticated data types, such as the collection of nonnumeric data, the merge of spatial regions, the composition of images, the integration of texts, and the group of object pointers. Therefore, OLAP, with its restrictions on the possible dimension and measure types, represents a simplified model for data analysis. Concept description in databases can handle complex data types of the attributes and their aggregations, as necessary.
• On-line analytical processing in data warehouses is a purely user-controlled process. The selection of dimensions and the application of OLAP operations, such as drill-down, roll-up, dicing, and slicing, are directed and controlled by the users. Although the control in most OLAP systems is quite user-friendly, users do require a good understanding of the role of each dimension. Furthermore, in order to
get a satisfactory description of the data, users may need to specify a long sequence of OLAP operations. In contrast, concept description in data mining strives for a more automated process which helps users determine which dimensions (or attributes) should be included in the analysis, and the degree to which the given data set should be generalized in order to produce an interesting summarization of the data.
3.5  DATA GENERALIZATION AND SUMMARIZATION BASED CHARACTERIZATION

Data and objects in databases often contain detailed information at primitive concept levels. For example, the item relation in a sales database may contain attributes describing low-level item information such as item ID, name, brand, category, supplier, place made, and price. It is useful to be able to summarize a large set of data and present it at a high conceptual level. For example, summarizing a large set of items relating to Christmas season sales provides a general description of such data, which can be very helpful for sales and marketing managers. This requires an important functionality in data mining: data generalization. Data generalization is a process which abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels.
Figure 3.8  Data generalization
Methods for the efficient and flexible generalization of large data sets can be categorized according to two approaches: (1) the data cube approach, and (2) the attribute-oriented induction approach.

3.5.1  Data Cube Approach (OLAP)
In the data cube approach (or OLAP approach) to data generalization, the data for analysis are stored in a multidimensional database, or data cube.
Figure 3.9  3-D view of an OLAP data cube; data can be represented as a 3-dimensional array
In general, the data cube approach materializes data cubes by first identifying expensive computations required for frequently-processed queries. These operations typically involve aggregate functions, such as count(), sum(), average(), and max(). The computations are performed, and their results are stored in data cubes. Such computations may be performed for various levels of data abstraction. These materialized views can then be used for decision support, knowledge discovery, and many other applications.
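As a hedged sketch of this idea (the library choice, column names, and values are assumptions added for illustration), the following pandas snippet materializes count() and sum() aggregates for one cuboid so that later decision-support queries can reuse the pre-computed view instead of rescanning the detailed data.

import pandas as pd

sales = pd.DataFrame({
    "city":    ["Noida", "Noida", "Delhi", "Delhi"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount":  [1200, 1500, 900, 1100],
})

# Materialize the (city, quarter) cuboid once and store the result
cuboid = sales.groupby(["city", "quarter"])["amount"].agg(["count", "sum"])
print(cuboid)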
3.5.2  Attribute-Oriented Induction Approach (AOI)
The attribute-oriented induction (AOI) approach is used for data generalization and summarization-based characterization. It was first proposed in 1989, a few years prior to the introduction of the data cube approach. The data cube approach can be considered as a data warehouse-based, pre-computation-oriented, materialized-view approach. It performs off-line aggregation before an OLAP or data mining query is submitted for processing. The attribute-oriented induction approach, on the other hand, at least in its initial proposal, is a relational database query-oriented, generalization-based, on-line data analysis technique. However, there is no inherent barrier distinguishing the two approaches based on on-line aggregation versus off-line pre-computation. Some aggregations in the data cube can be computed on-line, while off-line pre-computation of the multidimensional space can speed up attribute-oriented induction as well. In fact, data mining systems based on attribute-oriented induction, such as DBMiner, have been optimized to include such off-line pre-computation. The generalization is performed by either:
1. Attribute removal is based on the following rule: If there is a large set of distinct values for an attribute of the initial working relation, but either (1) there is no generalization operator on the attribute (e.g., there is no concept hierarchy defined for the attribute), or (2) its higher level concepts are expressed in terms of other attributes, then the attribute should be removed from the working relation.
2. Attribute generalization is based on the following rule: If there is a large set of distinct values for an attribute in the initial working relation, and there exists a set of generalization operators on the attribute, then a generalization operator should be selected and applied to the attribute.
Aggregation is performed by merging identical, generalized tuples and accumulating their respective counts. This reduces the size of the generalized data set. The resulting generalized relation can be mapped into different forms for presentation to the user, such as charts or rules.
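The following pandas sketch (an added illustration; the table and the concept hierarchy for age are invented) shows the two steps just described: attribute generalization replaces low-level age values with higher-level concepts, and aggregation merges identical generalized tuples while accumulating their counts.

import pandas as pd

relation = pd.DataFrame({
    "age":  [21, 24, 37, 45, 52, 63, 29, 41],
    "city": ["Noida", "Delhi", "Noida", "Delhi", "Noida", "Delhi", "Noida", "Delhi"],
})

def generalize_age(age):
    # A hypothetical concept hierarchy for the attribute age
    if age < 30:
        return "young"
    if age < 55:
        return "middle_aged"
    return "senior"

relation["age"] = relation["age"].map(generalize_age)    # attribute generalization

generalized = (relation.groupby(["age", "city"])         # merge identical generalized tuples
                       .size()
                       .reset_index(name="count"))       # accumulate their counts
print(generalized)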
3.6  MINING CLASS COMPARISONS: DISCRIMINATION BETWEEN CLASSES
Characterization provides a concise and succinct summarization of the given collection of data, while concept or class comparison (also known as discrimination) provides descriptions comparing two or more collections of data. Class discrimination or comparison (hereafter referred to as class comparison) mines descriptions that distinguish a target class from its contrasting classes. Notice that the target and contrasting classes must be comparable in the sense that they share similar dimensions and attributes. For example, the three classes student, address, and item are not comparable. However, the sales in the last three years are comparable classes, and so are computer science students versus information technology students.
The procedure for class comparison is as follows:
1. Data collection: The set of relevant data in the database is collected by query processing and is partitioned respectively into a target class and one or a set of contrasting class(es).
2. Dimension relevance analysis: If there are many dimensions, then dimension relevance analysis should be performed on these classes to select only the highly relevant dimensions for further analysis. Correlation or entropy-based measures can be used for relevance analysis.
3. Synchronous generalization: Generalization is performed on the target class to the level controlled by a user- or expert-specified dimension threshold, which results in a prime target class relation. The concepts in the contrasting class(es) are generalized to the same level as those in the prime target class relation, forming the prime contrasting class(es) relation.
4. Presentation of the derived comparison: The resulting class comparison description can be visualized in the form of tables, graphs, and rules. This presentation usually includes a "contrasting" measure such as count% (percentage count) that reflects the comparison between the target and contrasting classes. The user can adjust the comparison
description by applying drill-down, roll-up, and other OLAP operations to the target and contrasting classes, as desired. The above discussion outlines a general algorithm for mining comparisons in databases. In comparison with characterization, the above algorithm involves synchronous generalization of the target class with the contrasting classes, so that classes are simultaneously compared at the same levels of abstraction.
SUMMARY

Finally, we can conclude that concept description is the most basic form of descriptive data mining. It describes a given set of task-relevant data in a concise and summarized manner, presenting interesting general properties of the data. Concept (or class) description consists of characterization and comparison (or discrimination). The former summarizes and describes a collection of data, called the target class, whereas the latter summarizes and distinguishes one collection of data, called the target class, from other collection(s) of data, collectively called the contrasting class(es).
EXERCISES

Q1. Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
    (a) What is the mean of the data? What is the median?
    (b) What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
    (c) What is the midrange of the data?
    (d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
    (e) Give the five-number summary of the data.
    (f) Show a box plot of the data.
    (g) How is a quantile-quantile plot different from a quantile plot?
Q2. Define Concept Description.
Q3. Explain Data Generalization and Summarization based Characterization (AOI).
Q4. How do you measure the central tendency and dispersion of data?
Q5. Explain Histogram. The following data are a list of prices of commonly sold items at a company. The numbers have been sorted: 1, 1, 5, 5, 5, 8, 8, 10, 10, 15, 15, 15, 20, 20, 20, 20. Make a histogram for price using singleton buckets.
Q6. What is the mode for the numbers 7, 6, 5, 8, 7, 5, 9, 3, 5 and 4?
CHAPTER 4
Association Rule Mining

This chapter introduces the concept of association rules, a form of local pattern discovery in an unsupervised learning system, along with some methods for mining correlations in transactional and relational databases and data warehouses. In addition to introducing the basic concepts, such as market basket analysis, many techniques for frequent item-set mining are presented in an organized way.
CHAPTER OBJECTIVES
• Introduction to Market Basket Analysis
• Concepts of Association Rule Mining
• Frequent Pattern Mining
• Mining Single-Dimensional Boolean Association Rules
• Mining Multilevel Association Rules from Transaction Databases
• Mining Multi-Dimensional Association Rules from Relational Databases and Data Warehouses
Association rules are used to uncover relationships between seemingly unrelated data items. Mining association rules is one of the main application areas of data mining. Given a set of customer transactions on items, the aim is to find correlations between the sales of items. In market basket analysis the miner is interested in finding which sets of products are frequently bought together. In this chapter we discuss market basket analysis and methods to discover association rules. We give an overview of the problem and explain approaches that have been used to attack this problem.
4.1  INTRODUCTION
Association rules were first introduced by Agrawal et al. in 1993 as a means of determining relationships among a set of items in a database. Association rules, like clustering, are a form of unsupervised learning and have been applied to many fields such as retail business, web mining, and text mining. The most challenging part of association rule inference involves finding sets of items which satisfy specific criteria, and in turn are used to infer the rules themselves. The discovery of interesting correlation relationships among huge amounts of business transaction records can help in many business decision-making processes, such as catalog design, cross-marketing, and customer shopping behaviour analysis.

4.1.1  Frequent Pattern Analysis
Frequent patterns are patterns (such as item-sets, subsequences, or substructures) that appear in a data set frequently. For example, a set of items, such as milk and bread, which appear frequently together in a transaction data set is a frequent item-set. A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.
Figure 4.1
A subset lattice for frequent item set mining
From figure 4.1 we can summarize that if {A, B, C, D, E} is the frequent item-set, candidate rules are:

    ABCD -> E, ABD -> C, ACD -> B, BCD -> A, A -> BCD, B -> ACD, CE -> ABD, D -> ABC,
    AB -> CD, AC -> BD, ADE -> BC, BC -> AD, BD -> AC, CD -> AB

Finding associations between products means asking: in a supermarket, which products are frequently bought together? Do some products influence the sales of other products? For example, "75% of people who buy ABCD also buy E" (ABCD -> E), or "people who buy D will also possibly purchase the products ABC" (D -> ABC).

Applications:
- Basket / products
- Patient / symptoms
- Movies / ratings
- Web pages / keywords

4.1.2  What is Market Basket Analysis (MBA)?
The market basket is defined as an item-set bought together by a customer on a single visit to a store. Market basket analysis is a powerful tool for the implementation of cross-selling strategies. Especially in retailing it is essential to discover large baskets, since it deals with thousands of items. Although some algorithms can find large item-sets, they can be inefficient in terms of computational time. MBA analyzes customer buying habits by finding associations between different items that customers place in their "shopping baskets". Market basket analysis does not refer to a single technique; it refers to a set of business problems related to understanding point-of-sale transaction data. Market basket data is transaction data that describes three fundamentally different entities:
• Customers
• Orders (also called purchases or baskets or, in academic papers, item-sets)
• Items
Considered one of the most intuitive applications of association rules, market basket analysis strives to analyze customer buying patterns by finding associations between items that customers put into their baskets.
Figure 4.2
Market Basket Analysis of frequent items purchased helps to customize store layout
For example, one can discover that customers buy milk and bread together and even that some particular brands of milk are more often bought with certain brands of bread, e.g., multigrain bread and soy milk. These and other more interesting (and previously unknown) rules can be used to maximize profits by helping to design successful marketing campaigns, and by customizing store layout. In the case of the milk and bread example, the retailer may not offer discounts for both at the same time, but just for one; the milk can be put at the opposite end of the store with respect to bread, to increase customer traffic so that customers may possibly buy more products.
4.2  CONCEPTS OF ASSOCIATION RULE MINING
The aim of Association Rule Mining (ARM) is to examine the contents of the database and find rules, known as association rules, in the data. Association rules are used to detect common relationships between items, thus making market basket analysis a practical application of rule inference and usage. The source of information for association rule algorithms is often a database viewed as a set of tuples, where each tuple contains a set of items, as shown in table 4.1.
Table 4.1  Contents of a sample database containing five purchase transactions

    Transaction ID    Items
    T100              1, 3, 4
    T200              2, 3, 5
    T300              1, 2, 3, 5
    T400              2, 5
    T500              1, 2, 5
In general, association rule mining can be viewed as a two-step process: 1. Find all frequent item-sets: By definition, each of these item-sets will occur at least as frequently as a predetermined minimum support count, min sup. 2. Generate strong association rules from the frequent item-sets: By definition, these rules must satisfy minimum support and minimum confidence. Let I = {i1, i2,....., im} be a set of items and D be the set of transactions (transactional data set) where each transaction T ⊆ I is associated with an identifier TID and m is the number of items. Let A and B be two sets of items. A transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication in the form A => B where A ⊂ I, B ⊂ I, and A∩B = Ø Continuing our example of market-basket analysis, we represent each product in a store as a Boolean variable, which represents whether an item is present or absent. Each customer’s basket is represented as a Boolean vector, denoting which items are purchased. The vectors are analyzed to find which products are frequently bought together (by different customers), i.e., associated with each other. These co-occurrences are represented in the form of association rules: LHS => RHS [support, confidence]
(4.1)
where the left-hand side (LHS) implies the right-hand side (RHS), with a given value of support and confidence. Support and confidence are used to measure the quality of a given rule, in terms of its usefulness (strength) and certainty. Support tells how many examples (transactions) from the data set that was used to generate the rule include items from both LHS and RHS. Confidence expresses how many examples (transactions) that include items from LHS also include items from RHS. Measured values are most often expressed as percentages. An association rule is considered interesting if it satisfies minimum values of confidence and support, which are specified by the user (domain expert).
Relation between support and confidence:

    Confidence(A => B) = P(B | A) = support_count(A ∪ B) / support_count(A)

• support_count(A ∪ B) is the number of transactions containing the item-set A ∪ B
• support_count(A) is the number of transactions containing the item-set A.
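As an added illustration, the short Python sketch below computes support and confidence for a candidate rule over the five transactions of Table 4.1; the rule {2} => {5} is chosen purely as an example.

transactions = [
    {1, 3, 4},     # T100
    {2, 3, 5},     # T200
    {1, 2, 3, 5},  # T300
    {2, 5},        # T400
    {1, 2, 5},     # T500
]

def support_count(itemset):
    # Number of transactions containing every item of the item-set
    return sum(1 for t in transactions if itemset <= t)

lhs, rhs = {2}, {5}
support = support_count(lhs | rhs) / len(transactions)       # 4/5 = 80%
confidence = support_count(lhs | rhs) / support_count(lhs)   # 4/4 = 100%
print(lhs, "=>", rhs, "[support =", support, ", confidence =", confidence, "]")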
4.3  FREQUENT PATTERN MINING IN ASSOCIATION RULES
Finding such frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data in transactional and relational databases. Market basket analysis is just one form of frequent pattern mining. In fact, there are many kinds of frequent patterns, association rules, and correlation relationships. Frequent pattern mining can be classified in various ways, based on the following criteria:
1. Based on the types of values handled in the rule: If a rule involves associations between the presence or absence of items, it is a Boolean association rule. For example, Rule (4.2)

    buys(X, "computer") => buys(X, "HP printer")        (4.2)

is a Boolean association rule obtained from market basket analysis. If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule.
2. Based on the completeness of patterns to be mined: Different applications may have different requirements regarding the completeness of the patterns to be mined, which in turn can lead to different evaluation and optimization methods. In this chapter, our study of mining methods focuses on mining the complete set of frequent item-sets, closed frequent item-sets, and constrained frequent item-sets.
3. Based on the number of data dimensions involved in the rule: If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule, e.g., Rule (4.2). If a rule references two or more dimensions, such as the dimensions age, income, and buys, then it is a multidimensional association rule. The following rule is an example of a multidimensional rule:

    age(X, "30-39") ^ income(X, "42K-48K") => buys(X, "high resolution TV")        (4.3)
4. Based on the levels of abstraction involved in the rule set: Some methods for association rule mining can find rules at differing levels of abstraction. For example, suppose that a set of association rules mined includes the following rule, where X is a variable representing a customer:

    buys(X, "computer") => buys(X, "HP printer")        (4.4)

The items bought are referenced at different levels of abstraction (e.g., "computer" is a higher-level abstraction of "laptop computer"). At a lower level the rule will be:

    buys(X, "laptop computer") => buys(X, "HP printer")        (4.5)

We refer to the rule set mined as consisting of multilevel association rules. If, instead, the rules within a given set do not reference items or attributes at different levels of abstraction, then the set contains single-level association rules.
5. Based on the kinds of rules to be mined: Frequent pattern analysis can generate various kinds of rules and other interesting relationships. Association rules are the most popular kind of rules generated from frequent patterns. Typically, such mining can generate a large number of rules, many of which are redundant or do not indicate a correlation relationship among item-sets. Thus, the discovered associations can be further analyzed to uncover statistical correlations, leading to correlation rules.

4.4  MINING SINGLE-DIMENSIONAL BOOLEAN ASSOCIATION RULES
If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule, as already discussed in Rule (4.2). The following four steps are used to generate single-dimensional association rules:
1. Prepare input data in the transactional format.
2. Choose items of interest, i.e., item-sets.
3. Compute support counts to evaluate whether selected item-sets are frequent, i.e., whether they satisfy minimum support.
4. Given the frequent item-sets, generate strong association rules that satisfy the minimum confidence by computing the corresponding conditional probabilities (counts).
Since the frequent item-sets used to generate the association rules satisfy the minimum support, the generated rules also satisfy the minimum support. Let us understand the same example again briefly:

    buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]        (4.6)
where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well. A 1% support means that 1% of all of the transactions under analysis showed that computer and software were purchased together. This association rule involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a single predicate are referred to as single-
dimensional association rules. Dropping the predicate notation, the above rule can be written simply as

    computer => software [1%, 50%]        (4.7)
One more example is used to illustrate the concept.
Example: An association rule that describes customers who buy milk and bread:

    buys(X, milk) => buys(X, bread) [25%, 60.0%]        (4.8)
The rule shows that customers who buy milk also buy bread. The direction of the association, from left to right, shows that buying milk "triggers" buying bread. These items are bought together in 25% of store purchases (transactions), and 60% of the baskets that include milk also include bread.

4.4.1  Apriori Algorithm
The Apriori algorithm is one of the most well-known association rule algorithms. This algorithm uses the property that any subset of a large (frequent) item-set is also large. It was proposed by R. Agrawal and R. Srikant in 1994 for mining frequent item-sets for Boolean association rules. The name of the algorithm is based on the fact that it uses prior knowledge of frequent item-set properties, as we shall see in the following. Apriori employs an iterative approach known as a level-wise search, where k-item-sets are used to explore (k+1)-item-sets. Using the Apriori property, the Apriori algorithm proceeds as follows:

Algorithm 4.1  Apriori Algorithm
    Ck: candidate item-sets of size k
    Lk: frequent item-sets of size k
    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
Example: The algorithm described above can be illustrated by the following steps and also diagrammatically through figure 4.3:
– first finds all 1-item-sets, C1
– next, finds among them a set of frequent 1-item-sets, L1
– next extends L1 to generate 2-item-sets, C2
– next finds among these 2-item-sets a set of frequent 2-item-sets, L2
– and repeats the process to obtain L3, L4, etc., until no more frequent k-item-sets can be found.
The finding of each Lk requires one full scan of the database.
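To make the level-wise search concrete, here is a compact Python sketch (an illustration rather than the authors' implementation; it omits the subset-based pruning of full Apriori). It reuses the transactions of Table 4.1 with an assumed minimum support count of 2.

from itertools import chain

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 2, 5}]
min_support = 2   # assumed minimum support count

def frequent_itemsets(transactions, min_support):
    def is_frequent(c):
        return sum(c <= t for t in transactions) >= min_support

    # L1: frequent 1-item-sets
    items = {frozenset([i]) for i in chain.from_iterable(transactions)}
    level = {c for c in items if is_frequent(c)}
    all_frequent, k = set(level), 2
    while level:
        # Ck: candidates obtained by joining frequent (k-1)-item-sets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # one full scan of the database per level to count the candidates
        level = {c for c in candidates if is_frequent(c)}
        all_frequent |= level
        k += 1
    return all_frequent

for itemset in sorted(frequent_itemsets(transactions, min_support), key=len):
    print(sorted(itemset))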
Figure 4.3  Full scanning of the database

4.4.2  Improving the Efficiency of the Apriori Rules
The Apriori algorithm was further modified to improve its efficiency (computational complexity). Below we briefly explain the most important improvements.
• Hashing is used to reduce the size of the candidate k-item-sets, i.e., item-sets generated from frequent item-sets from iteration k−1, Ck, for k > 1. For instance, when scanning D to generate L1 from the candidate 1-item-sets in C1, we can at the same time generate all 2-item-sets for each transaction, hash (map) them into different buckets of the hash table structure, and increase the corresponding bucket counts. A 2-item-set whose corresponding bucket count is below the support threshold cannot be frequent, and thus we can remove it from the candidate set C2. In this way, we reduce the number of candidate 2-item-sets that must be examined to obtain L2.
Table 4.2  Table structure showing different buckets of the hash

• Transaction removal removes transactions that do not contain frequent item-sets. In general, if a transaction does not contain any frequent k-item-sets, it cannot contain any frequent (k+1)-item-sets, and thus it can be removed from the computation of any frequent t-item-sets, where t > k.
• Data set partitioning generates frequent item-sets based on the discovery of frequent item-sets in subsets (partitions) of D. The method has two steps (figure 4.4):
Figure 4.4
Mining by partitioning the data
1. Division of the transactions in D into s non-overlapping subsets and the mining of frequent item-sets in each subset. Based on the support count, all frequent item-sets (for all k) in each subset, referred to as local frequent item-sets, are found. A special data structure, which for each item-set records the TIDs of the transactions that contain the items in this item-set, is used to find all local frequent k-item-sets, for all k = 1, 2, 3, ..., in just
one scan of D. The frequent local item-sets may or may not be frequent in D, but any item-set that is potentially frequent in D must be frequent in at least one subset. Therefore, local frequent item-sets from all subsets become candidate item-sets for D. The collection of all local frequent item-sets is referred to as the global candidate item-sets with respect to D.
2. Computation of frequent item-sets for D based on the global candidate item-sets. One scan of D is performed to find out which of the global candidate item-sets satisfy the support threshold. The size and number of subsets is usually set so that each of the subsets can fit into the main computer memory.
• Sampling generates association rules based on a sampled subset of transactions in D. In this case, a randomly selected subset S of D is used to search for the frequent item-sets. The generation of frequent item-sets from S is more efficient (faster), but some of the rules that would have been generated from D may be missing, and some rules generated from S may not be present in D, i.e., the "accuracy" of the rules may be lower. Usually the size of S is selected so that the transactions can fit into the main memory, and thus only one scan of the data is required (no paging). To reduce the possibility that we will miss some of the frequent item-sets from D when generating frequent item-sets from S, we may use a lower support threshold for S as compared with the support threshold for D. This approach is especially valuable when the association rules are computed on a very frequent basis.
• Mining frequent item-sets without generation of candidate item-sets. One of the main limiting aspects of the Apriori algorithm is that it can still generate a very large number of candidate item-sets. For instance, for 10,000 1-item-sets, the Apriori algorithm generates approximately 10,000,000 candidate 2-item-sets and has to compute and store their occurrence frequencies. When a long frequent pattern is mined, say with 100 items, the Apriori algorithm generates as many as 2^100 candidate item-sets. The other limiting aspect is that the Apriori algorithm may need to repeatedly scan the data set D to check the frequencies of a large set of candidate item-sets, a process that is especially apparent when mining long item-sets, i.e., n+1 scans are required, where n is the length of the longest item-set.
4.5  MINING MULTI-LEVEL ASSOCIATION RULES FROM TRANSACTION DATABASE
Basic association rules can be used to generate interesting knowledge about the relationships between items from transactions. Nevertheless, it may be desirable to extend beyond the transactions using external knowledge about the items to generate the rules. In applications where items form a hierarchy, it may be difficult to find strong association rules at the low level of abstraction due to
sparsity of data in the multidimensional space. Strong association rules usually can be found at the higher level of a hierarchy, but they often represent already known, commonsense knowledge. For instance, the milk and bread rule is likely to have strong support, but it is trivial. At the same time, the rule skim_milk => large_white_bread may be useful, but it may have weak support. The corresponding concept hierarchy is shown in figure 4.5.
Figure 4.5  Multi-level association rule
An association mining algorithm should be able to generate and traverse between the rules at different levels of abstraction. Multi-level association rules are generated by performing a top-down, iterative deepening search. In simple terms, we first find strong rules at the high level(s) in the hierarchy, and then search for lower-level "weaker" rules. For instance, we first generate the milk => bread rule and then concentrate on finding rules that concern breads of different sizes and milks with different fat content. There are two main families of methods for multilevel association-rule mining:
• Methods based on uniform support, where the same minimum support threshold is used to generate rules at all levels of abstraction. In this case, the search for rules is simplified, since we can safely assume that item-sets containing item(s) whose ancestors (in the hierarchy) do not satisfy the minimum support are also not frequent. At the same time, it is very unlikely that items at the lower level of abstraction occur as frequently as those at the higher levels. Consequently, if the minimum support threshold is too high, the algorithm will miss potentially useful associations at lower levels of abstraction. On the other hand, if the threshold is too low, a very large number of potentially uninteresting associations would be generated at the higher levels.
• Methods based on reduced support, an approach that addresses the drawbacks of uniform support. In this case, each level of abstraction is
furnished with its own minimum support threshold. The lower the abstraction level, the smaller the corresponding threshold as shown in figure 4.6.
Figure 4.6 Illustrates the difference between the uniform and reduced support methods
4.5.1  Multilevel Association Rules
As already discussed in Rules (4.4) and (4.5), association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework. A serious side effect of mining multilevel association rules is the generation of many redundant rules across multiple levels of abstraction due to the "ancestor" relationships among items. For example, the concept hierarchy of figure 4.7 has five levels, respectively referred to as levels 0 to 4.
Figure 4.7
A concept hierarchy for XYZ company computer items
Consider the following rules, where "computer" is an ancestor of "laptop computer" based on the concept hierarchy of figure 4.7, and where X is a variable representing customers who purchased items in XYZ company transactions. For multiple-level association rules at differing levels of abstraction, suppose that a set of association rules mined includes the following rule at the higher level:

    buys(X, "computer") => buys(X, "HP printer")        (4.9)

The items bought are referenced at different levels of abstraction (e.g., "computer" is a higher-level abstraction of "laptop computer"). At the lower level the rule will be:

    buys(X, "laptop computer") => buys(X, "HP printer")        (4.10)

Rule (4.9) is an ancestor of Rule (4.10) if Rule (4.9) can be obtained by replacing the items in Rule (4.10) by their ancestors in a concept hierarchy. Here, Rule (4.9) is an ancestor of Rule (4.10) because "computer" is an ancestor of "laptop computer." We refer to the rule set mined as consisting of multilevel association rules.
4.5.2  Approaches to Mining Multi-level Association Rules
There are three approaches to search for multi-level associations using the reduced support method:
– The level-by-level independent method, in which a breadth-first search is performed, i.e., each node in the hierarchy is examined, regardless of whether or not its parent node is found to be frequent.
– The level-cross-filtering by single item method, in which an item at a given level in the hierarchy is examined only if its parent at the preceding level is frequent. In this way, a more specific association is generated from a general one. For example, in figure 4.6, if the minimum support threshold were set to 25%, then reduced_fat_milk, low_fat_milk, and skim_milk would not be considered.
– The level-cross-filtering by k-item-set method, in which a k-item-set at a given level is examined only if its parent k-item-set at the preceding level is frequent.
The level-by-level independent method, in which a breadth-first search is performed, i.e., each node in the hierarchy is examined, regardless of whether or not its parent node is found to be frequent. The level-cross-filtering by single item method, in which an item at a given level in the hierarchy is examined only if its parent at the preceding level is frequent. In this way, a more specific association is generated from a general one. For example, in figure 4.6, if the minimum support threshold were set to 25%, then reduced_fat_milk, low_fat_milk, and skim_milk would not be considered. The level-cross-filtering by k-item-set method, in which a k-itemset at a given level is examined only if its parent k-item-set at the preceding level is frequent.
`INING MULTI-DIMENSIONAL ASSOCIATION RULES FROM RELATIONAL DATABASES AND DATA WAREHOUSES
However, rather than using a transactional database, sales and related information are stored in a relational database or data warehouse. Such data stores are multidimensional, by definition. For instance, in addition to keeping track of the items purchased in sales transactions, a relational database may
record other attributes associated with the items, such as the quantity purchased or the price, or the branch location of the sale. Additional relational information regarding the customers who purchased the items, such as customer age, occupation, credit rating, income, and address, may also be stored. Association rules that involve two or more dimensions or predicates can be referred to as multi-dimensional association rules. The already discussed Rule (4.3) contains three predicates (age, income, and buys), each of which occurs only once in the rule. Hence, we say that it has no repeated predicates. Multidimensional association rules with no repeated predicates are called interdimensional association rules. We can also mine multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates. These rules are called hybrid-dimensional association rules.

4.6.1  Multi-dimensional Association Rules
If a rule references two or more dimensions, such as the dimensions age, income, and buys, then it is a multidimensional association rule, as already discussed in Rule (4.3). Let us understand the same example again. Suppose, instead, that we are given the XYZ company relational database relating to purchases. A data mining system may find association rules like

    age(X, "20-29") ^ income(X, "20K-29K") => buys(X, "CD player") [support = 2%, confidence = 60%]        (4.11)

where X is a variable representing a customer and "^" represents a logical "AND." The rule indicates that, of the XYZ company customers under study, 2% are 20 to 29 years of age with an income of 20,000 to 29,000 and have purchased a CD player at XYZ company. There is a 60% probability that a customer in this age and income group will purchase a CD player. Note that this is an association between more than one attribute, or predicate (i.e., age, income, and buys). Adopting the terminology used in multidimensional databases, where each attribute is referred to as a dimension, the above rule can be referred to as a multidimensional association rule.
_________________________________________________________________
SUMMARY

Finally, we can conclude that the discovery of frequent patterns, association, and correlation relationships among huge amounts of data is useful in selective marketing, decision analysis, and business management. A popular area of application is market basket analysis, which studies the buying habits of customers by searching for sets of items that are frequently purchased together (or in sequence). Association rule mining consists of first finding frequent item-sets from which strong association rules in the form of A => B are generated.
These rules also satisfy a minimum confidence threshold. Associations can be further analyzed to uncover correlation rules, which convey statistical correlations between item-sets A and B.
_________________________________________________________________
EXERCISES

Q1. How do you classify Association Rules?
Q2. Explain the Apriori Algorithm for finding frequent item sets.
Q3. Explain Multi-level Association Rules.
Q4. Explain Association Mining to Correlation Analysis.
Q5. What do you mean by Aggregation and Association Rules?
Q6. What do you mean by multi-dimensional association rules?
Q7. Explain the following:
    i. Mining single-dimensional Boolean association rules from transactional databases.
    ii. The Apriori Algorithm – finding frequent item sets using candidate generation.
CHAPTER 5
Classification & Prediction

Databases are rich with hidden information that can be used for intelligent decision making. Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large. Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions.
CHAPTER OBJECTIVES
• Classification
• Prediction
• Why are Classification and Prediction important?
• What is Test Data?
• Issues regarding Classification and Prediction
• Decision Tree
• Bayesian Classification
• Neural Networks
• K-nearest Neighbour Methods
• Genetic Algorithm
For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation. Many classification and prediction methods have been proposed by researchers in machine learning, pattern recognition, and statistics.
5.1  CLASSIFICATION
Classification analysis is the organization of data into given classes. Also known as supervised classification, it uses given class labels to order the objects in the data collection (figure 5.1). Classification approaches normally use a training set where all objects are already associated with known class labels.
Figure 5.1  The data classification process. (a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules. (b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples
Classification is one of the most common applications for data mining. It corresponds to a task that occurs frequently in everyday life. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks. A classification task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time. In addition to the historical credit rating, the data might track employment history, home ownership or rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case. Take another example, a hospital may want to classify medical patients into those who are at high, medium or low risk of acquiring a certain illness, an opinion polling company may wish to classify people interviewed into those who are likely to vote for each of a number of political parties or are undecided, or we may wish to classify a student project as distinction, merit, pass or fail. “How is the derived model presented?” The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.
5.2  PREDICTION
Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of prediction: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. Prediction, however, more often refers to the forecast of missing numerical values, or of increase/decrease trends in time-related data. The major idea is to use a large number of past values to estimate probable future values.
5.3  WHY ARE CLASSIFICATION AND PREDICTION IMPORTANT?
Previously unseen records should be assigned a class as accurately as possible. For deciding on the type of prediction that is most appropriate, there are two options:
(1) Classification: predicting into what category or class a case falls, or
(2) Regression: predicting what number value a variable will have (if it is a variable that varies with time, it is called time series prediction).
You might use regression to forecast the amount of profitability, and classification to predict which customers might leave.
5.4 WHAT IS TEST DATA?
As already shown in figure 5.1 test data is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. Each row of the training set is called an instance. An instance comprises the values of a number of attributes and the corresponding classification. The training set constitutes the results of a sample of trials that we can use to predict the classification of other (unclassified) instances.
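To make the training/test split concrete, the following is a minimal Python sketch (using the scikit-learn library, with a small hypothetical loan data set invented purely for illustration) of building a model on a training set and validating it on a held-out test set:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labelled data: each instance is (income in thousands, years employed),
# and the class label is "safe" or "risky"
X = [[25, 1], [48, 10], [32, 3], [60, 15], [28, 2], [52, 12], [40, 8], [30, 1]]
y = ["risky", "safe", "risky", "safe", "risky", "safe", "safe", "risky"]

# Divide the given data set into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Learning step: build the classifier from the training set only
model = DecisionTreeClassifier().fit(X_train, y_train)

# Classification step: use the test set to estimate the accuracy of the model
print("Estimated accuracy:", accuracy_score(y_test, model.predict(X_test)))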
5.5 ISSUES REGARDING CLASSIFICATION AND PREDICTION
5.5.1 Preparing the Data for Classification and Prediction
The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification or prediction process.
Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics).
Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related. For example, a strong correlation between attributes A1 and A2 would suggest that one of the two could be removed from further analysis. A database may also contain irrelevant attributes. Attribute subset selection can be used in these cases to find a reduced set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
Data transformation and reduction: The data may be transformed by normalization, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0. In methods that use distance measurements, for example, this would prevent attributes with initially large ranges (like, say, income) from outweighing attributes with initially smaller ranges (such as binary attributes).
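As a small illustration of the data cleaning step, the sketch below (using the pandas library, with a hypothetical table invented for this example) replaces a missing categorical value with the most commonly occurring value for that attribute, and a missing numeric value with the attribute mean:

import pandas as pd

# Hypothetical customer records with missing values
df = pd.DataFrame({
    "income": [30000, 45000, None, 52000, 38000],
    "credit_rating": ["fair", "good", "good", None, "fair"],
})

# Missing categorical value -> most commonly occurring value for that attribute
df["credit_rating"] = df["credit_rating"].fillna(df["credit_rating"].mode()[0])

# Missing numeric value -> attribute mean (one simple replacement strategy)
df["income"] = df["income"].fillna(df["income"].mean())

print(df)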
5.5.2 Comparing Classification and Prediction Methods
Classification and prediction methods can be compared and evaluated according to the following criteria:
• Accuracy: The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data (i.e., tuples without class label information). Similarly, the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new or previously unseen data.
• Speed: This refers to the computational costs involved in generating and using the given classifier or predictor.
• Robustness: This is the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.
• Scalability: This refers to the ability to construct the classifier or predictor efficiently given large amounts of data.
• Interpretability: This refers to the level of understanding and insight that is provided by the classifier or predictor. Interpretability is subjective and therefore more difficult to assess. We discuss some work in this area, such as the extraction of classification rules from a “black box” neural network classifier trained by backpropagation (Section 5.8.2).
These issues are discussed throughout the chapter with respect to the various classification and prediction methods presented.
5.6 DECISION TREE
A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, as shown in figure 5.2; each branch represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules.
Figure 5.2
A Simple Decision Tree showing various levels and branches
5.6.1 How a Decision Tree Works?
A decision tree is a classifier in the form of a tree structure where each node is either:
• a leaf node, indicating a class of instances, or
• a decision node that specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.
Example: Decision making in the London stock market
Suppose that the major factors affecting the London stock market are:
• what it did yesterday;
• what the New York market is doing today;
• bank interest rate;
• unemployment rate;
• England’s prospect at cricket.
Table 5.1 is a small illustrative dataset of six days about the London stock market. Each row gives the answers to the five questions for one day, and the second column shows the observed result (Yes (Y) or No (N) for “It rises today”). Figure 5.3 illustrates a typical decision tree learned from the data in Table 5.1.

Table 5.1 Examples of a small dataset on the London stock market

Instance No. | It rises today | It rose yesterday | NY rises today | Bank rate high | Unemployment high | England is losing
1 | Y | Y | Y | N | N | Y
2 | Y | Y | N | Y | Y | Y
3 | Y | N | N | N | Y | Y
4 | N | Y | N | Y | N | Y
5 | N | N | N | N | N | Y
6 | N | N | N | Y | N | Y
The process of predicting an instance by this decision tree can also be expressed by answering the questions in the following order:
Is unemployment high?
  YES: The London market will rise today.
  NO: Is the New York market rising today?
    YES: The London market will rise today.
    NO: The London market will not rise today.
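The same decision procedure can also be written directly as a small Python function; this is only a sketch of how the tree in figure 5.3 translates into code, not part of the original text:

def london_market_rises(unemployment_high, ny_rises_today):
    """Predict 'Y'/'N' for "It rises today" by following the decision tree."""
    if unemployment_high == "Y":
        return "Y"              # high unemployment branch: market rises
    if ny_rises_today == "Y":
        return "Y"              # otherwise follow the New York market
    return "N"

# Instance 5 of Table 5.1: unemployment not high, New York not rising
print(london_market_rises("N", "N"))   # prints "N", matching the observed result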
Figure 5.3 A decision tree for the London stock market
5.6.2 Decision Tree Induction
Decision tree induction is a typical inductive approach to learning classification knowledge. The key requirements for mining with decision trees are:
• Attribute-value description: The object or case must be expressible in terms of a fixed collection of properties or attributes.
• Predefined classes: The categories to which cases are to be assigned must have been established beforehand (supervised data).
• Discrete classes: A case does or does not belong to a particular class, and there must be far more cases than classes.
• Sufficient data: Usually hundreds or even thousands of training cases.
• “Logical” classification model: A classifier that can be expressed as a decision tree or a set of production rules.
5.6.3 What is a Decision Tree Learning Algorithm?
Decision tree learning algorithms have been successfully used in expert systems for capturing knowledge. The main task performed in these systems is to apply inductive methods to the given attribute values of an unknown object in order to determine an appropriate classification according to decision tree rules.
Algorithm 5.1: Decision Tree
Input:
  T // Decision Tree
  D // Input Database
Output:
  M // Model Prediction
DTProc Algorithm: // Illustrates prediction technique using a decision tree
for each t ∈ D do
  n = root node of T;
  while n is not a leaf node do
    Obtain answer to the question on n applied to t;
    Identify the arc from n which contains the correct answer;
    n = node at the end of this arc;
  Make prediction for t based on the labeling of n;
The three widely used decision tree learning algorithms are ID3, ASSISTANT, and C4.5.
5.6.4 ID3
Interactive Dichotomizer 3, or ID3, uses a basic tree induction algorithm that assigns an attribute to a tree node based on how much information is gained from that node. The ID3 method allows an attribute to have two or more values at a node or splitting point, thereby facilitating formation of an n-ary tree where n can be greater than two. Algorithm 5.2 lists the input and output parameters for BuildDT, the basic tree induction method used in ID3.
Algorithm 5.2: ID3
The input and output parameters for algorithm BuildDT.
Input:
  Data set D = {d1, d2, ..., dn}.
  Collection of attributes A = {a1, a2, ..., am}.
  Domain of each attribute ai: Vi = {vi1, vi2, ..., vik}. Each vij represents a value of the attribute ai. k is the cardinality of the attribute domain Vi.
Output:
  Decision tree T = {t1, t2, ..., ts} ∪ {c1, c2, ..., cr}. Each ti represents an internal node, i.e., an attribute. s is the number of internal nodes. Each cj represents an external or leaf node, i.e., a class or category. r is the number of external nodes.
Algorithm 5.3: BuildDT
Select best splitting criterion ai from set A
Create root node U and assign it label ai
Add a branch to node U for each vij in Vi
Add U to set T
for each branch of node U
{
  Set D' = subset of D created by applying attribute ai to D
  if stopping point reached for this path, then
  {
    Create leaf node ci
    Add ci to set T
  }
  else
  {
    T' = BuildDT(D')
    Add T' to set T
  }
}
BuildDT assumes knowledge of optimal attribute selection and accurate stopping point detection. Ideally, a stopping point is reached when the data set is perfectly classified; however, situations may arise where reaching the accurate stopping point becomes an obstacle to creating a tractable decision tree. In such situations a trade-off is often made between accuracy and performance by setting a stopping criterion at a certain depth.
Example of ID3
Suppose we want ID3 to decide whether the weather is amenable to playing baseball. Over the course of 2 weeks, data is collected to help ID3 build a decision tree. The target classification is "should we play baseball?", which can be yes or no. The weather attributes are outlook, temperature, humidity, and wind speed. They can have the following values:
outlook = {sunny, overcast, rain}
temperature = {hot, mild, cool}
humidity = {high, normal}
wind = {weak, strong}
Examples of the training set S are shown in Table 5.2. We need to find which attribute will be the root node in our decision tree. The gain is calculated for all four attributes:
Gain(S, Outlook) = 0.246
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Table 5.2
Showing example of set S
The Outlook attribute has the highest gain, therefore it is used as the decision attribute in the root node. Since Outlook has three possible values (see figure 5.4), the root node has three branches (sunny, overcast, rain). The next question is "what attribute should be tested at the Sunny branch node?" Since we have used Outlook at the root, we only consider the remaining three attributes: Humidity, Temperature, or Wind.
Ssunny = {D1, D2, D8, D9, D11} = 5 examples from Table 5.2 with outlook = sunny
Gain(Ssunny, Humidity) = 0.970
Gain(Ssunny, Temperature) = 0.570
Gain(Ssunny, Wind) = 0.019
Humidity has the highest gain; therefore, it is used as the decision node. This process goes on until all data is classified perfectly or we run out of attributes.
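The gain values above come from the usual entropy-based criterion, Gain(S, A) = Entropy(S) − Σv (|Sv|/|S|) · Entropy(Sv). A minimal Python sketch of this calculation follows; the four weather records are hypothetical and are not the actual contents of Table 5.2:

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Gain(S, A): entropy of S minus the weighted entropy of each subset S_v."""
    total = len(examples)
    gain = entropy([e[target] for e in examples])
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Hypothetical records in the style of Table 5.2
records = [
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "sunny", "humidity": "normal", "play": "yes"},
    {"outlook": "overcast", "humidity": "high", "play": "yes"},
    {"outlook": "rain", "humidity": "high", "play": "no"},
]
print(information_gain(records, "outlook", "play"))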
Figure 5.4 Final decision tree from Table 5.2
The decision tree can also be expressed in rule format:
IF outlook = sunny AND humidity = high THEN playball = no
IF outlook = rain AND humidity = high THEN playball = no
IF outlook = rain AND wind = strong THEN playball = yes
IF outlook = overcast THEN playball = yes
IF outlook = rain AND wind = weak THEN playball = yes
ID3 has been incorporated in a number of commercial rule-induction packages. Some specific applications include medical diagnosis, credit risk assessment of loan applications, diagnosis of equipment malfunctions by their cause, classification of soybean diseases, and web search classification. As the discussion and example above show, ID3 is easy to use. Its primary use is replacing the expert who would normally build a classification tree by hand. As industry experience has shown, ID3 has been effective.
5.7 BAYESIAN CLASSIFICATION
Bayesian approaches are a fundamentally important data mining technique. Given the probability distribution, a Bayes classifier can provably achieve the optimal result. The Bayesian method is based on probability theory. Bayes Rule is applied here to calculate the posterior from the prior and the likelihood, because the latter two are generally easier to calculate from a probability model. One limitation of Bayesian approaches is the need to estimate probabilities from the training dataset. It is noticeable that in some situations, such as when the decision is clearly based on certain criteria, or when the dataset has a high degree of randomness, Bayesian approaches will not be a good choice.
5.7.1 Naïve Bayesian Classifiers
The Naïve Bayes classifier technique is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naïve Bayes can often outperform more sophisticated classification methods. As indicated in figure 5.5, the objects can be classified as either BLACK or GREY. Our task is to classify new cases as they arrive (i.e., decide to which class label they belong, based on the currently existing objects).
Figure 5.5
Objects are classified to BLACK or GREY
We can then calculate the priors (i.e. the probability of the object among all objects) based on previous experience. Thus:
Prior probability for BLACK ∝ Number of BLACK objects / Total number of objects
Prior probability for GREY ∝ Number of GREY objects / Total number of objects
Since there is a total of 60 objects, 40 of which are BLACK and 20 GREY, our prior probabilities for class membership are:
Prior probability for BLACK ∝ 40/60
Prior probability for GREY ∝ 20/60
Having formulated our prior probability, we are now ready to classify a new object (WHITE circle in figure 5.6). Since the objects are well clustered, it is reasonable to assume that the more BLACK (or GREY) objects in the vicinity of X, the more likely that the new cases belong to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we calculate the number of points in the circle belonging to each class label.
Figure 5.6 Classify the WHITE circle
We can calculate the likelihood:
Likelihood of X given BLACK ∝ Number of BLACK objects in the vicinity of X / Total number of BLACK objects
Likelihood of X given GREY ∝ Number of GREY objects in the vicinity of X / Total number of GREY objects
In Figure 5.6, it is clear that the likelihood of X given GREY is larger than the likelihood of X given BLACK, since the circle encompasses 1 BLACK object and 3 GREY ones. Thus:
Likelihood of X given BLACK ∝ 1/40
Likelihood of X given GREY ∝ 3/20
Although the prior probabilities indicate that X may belong to BLACK (given that there are twice as many BLACK as GREY), the likelihood indicates otherwise: the class membership of X is GREY (given that there are more GREY objects in the vicinity of X than BLACK). In Bayesian analysis, the final classification is produced by combining both sources of information (i.e. the prior and the likelihood) to form a posterior probability using Bayes Rule:
Posterior probability of X being BLACK ∝ Prior probability of BLACK × Likelihood of X given BLACK = 40/60 × 1/40 = 1/60
Posterior probability of X being GREY ∝ Prior probability of GREY × Likelihood of X given GREY = 20/60 × 3/20 = 1/20
Finally, we classify X as GREY since its class membership achieves the larger posterior probability.
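The whole calculation above can be reproduced in a few lines of Python; this sketch simply encodes the object counts from the example and combines prior and likelihood:

# Counts from the example: 60 objects in total, 40 BLACK and 20 GREY;
# the circle drawn around X contains 1 BLACK and 3 GREY objects.
counts = {"BLACK": 40, "GREY": 20}
in_vicinity = {"BLACK": 1, "GREY": 3}
total = sum(counts.values())

posterior = {}
for label in counts:
    prior = counts[label] / total                     # 40/60 and 20/60
    likelihood = in_vicinity[label] / counts[label]   # 1/40 and 3/20
    posterior[label] = prior * likelihood             # unnormalized posterior

print(posterior)                                      # BLACK: 1/60, GREY: 1/20
print("X is classified as", max(posterior, key=posterior.get))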
5.7.2 Bayesian Networks
A Bayesian network is a graphical probabilistic model through which one can acquire, capitalize on and exploit knowledge. Bayesian networks are the natural successors and heirs to the symbolic, connectionist and statistical approaches to Artificial Intelligence and Data Mining. They combine the rigour of a powerful and stable mathematical formalism, the effectiveness of a distributed representation of knowledge, and the readability of rule-based models. Particularly suited to taking uncertainty into consideration, they can as easily be described manually by experts in the field as they can be generated automatically through learning.
Figure 5.7
Diagnostics of lung disease: from causes to symptoms using a small Bayesian network in Bayesia Lab
As viewed in figure 5.7 above, a Bayesian network is used to represent knowledge about a system (technical, computer, economic, biological, sociological, etc.) or to discover this knowledge by analyzing data (learning). Through the network one can then:
• Diagnose: one observes the effects and from this one aims to deduce the probability distribution over possible causes
• Simulate (the system’s behaviour): this is the inverse path; the entry variables (causes) are provided to observe the resulting probability distribution over the effects
• Analyze data
• Make decisions
• Control the system, etc.
5.8 NEURAL NETWORKS
Neural networks are of particular interest because they offer a means of efficiently modeling large and complex problems in which there may be hundreds of predictor variables that have many interactions. (Actual biological neural networks are incomparably more complex.) Neural nets may be used in classification problems (where the output is a categorical variable) or for regressions (where the output variable is continuous). A neural network (figure 5.8) starts with an input layer, where each node corresponds to a predictor variable. These input nodes are connected to a number of nodes in a hidden layer. Each input node is connected to every node in the hidden layer. The nodes in the hidden layer may be connected to nodes in another hidden layer, or to an output layer. The output layer consists of one or more response variables.
Figure 5.8
A Neural Network with one hidden layer
After the input layer, each node takes in a set of inputs, multiplies them by a connection weight Wxy (e.g., the weight from node 1 to 3 is W13 — see Figure 5.9), adds them together, applies a function (called the activation or squashing function) to them, and passes the output to the node(s) in the next layer. For example, the value passed from node 4 to node 6 is: Activation function applied to ([W14 * value of node 1] + [W24 * value of node 2])
Figure 5.9
WXY is the weight from node x to node y
Each node may be viewed as a predictor variable (nodes 1 and 2 in this example) or as a combination of predictor variables (nodes 3 through 6). Node 6 is a nonlinear combination of the values of nodes 1 and 2, because of the activation function on the summed values at the hidden nodes. In fact, if there is a linear activation function but no hidden layer, neural nets are equivalent to a linear regression; and with certain non-linear activation functions, neural nets are equivalent to logistic regression. The connection weights (W’s) are the unknown parameters which are estimated by a training method. Originally, the most common training method was backpropagation; newer methods include conjugate gradient and quasi-Newton methods. One of the most common types of neural network is the feed-forward backpropagation network. For simplicity of discussion, we will assume a single hidden layer. Backpropagation training is simply a version of gradient descent, a type of algorithm that tries to reduce a target value (error, in the case of neural nets) at each step. The algorithm proceeds as follows.
5.8.1 Feed Forward
The value of the output node is calculated based on the input node values and a set of initial weights. The values from the input nodes are combined in the hidden layers, and the values of those nodes are combined to calculate the output value.
5.8.2 Backpropagation
The error in the output is computed by finding the difference between the calculated output and the desired output (i.e., the actual values found in the training set). Next, the error from the output is assigned to the hidden layer nodes proportionally to their weights. This permits an error to be computed for every output node and hidden node in the network. Finally, the error at each of the hidden and output nodes is used by the algorithm to adjust the weight coming into that node to reduce the error. The algorithm is summarized below. The steps involved are expressed in terms of inputs, outputs, and errors, and may seem awkward if this is your first look at neural network learning.
Algorithm 5.4: Backpropagation
Neural network learning for classification or prediction, using the backpropagation algorithm.
Input:
• D, a data set consisting of the training tuples and their associated target values;
• l, the learning rate;
• network, a multilayer feed-forward network.
Output: A trained neural network.
Initialize the weights: The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it, as explained below. The biases are similarly initialized to small random numbers. Each training tuple, X, is processed by the following steps.
Propagate the inputs forward: First, the training tuple is fed to the input layer of the network. The inputs pass through the input units, unchanged. That is, for an input unit, j, its output, Oj, is equal to its input value, Ij. Next, the net input and output of each unit in the hidden and output layers are computed. The net input to a unit in the hidden or output layers is computed as a linear combination of its inputs. To help illustrate this point, a hidden layer or output layer unit is shown in Figure 5.8. Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected to it in the previous layer. Each connection has a weight. To compute the net input to the unit, each input connected to the unit is multiplied by its corresponding weight, and this is summed. Given a unit j in a hidden or output layer, the net input, Ij, to unit j is
Ij = ∑i wij Oi + θj        (5.1)
where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i from the previous layer; and θj is the bias of the unit. The bias acts as a threshold in that it serves to vary the activity of the unit.
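Equation (5.1), followed by a squashing function, is all that a single unit computes. The sketch below shows this for one unit, using the logistic (sigmoid) function as one common choice of activation function; the input values, weights, and bias are hypothetical:

import math

def unit_output(inputs, weights, bias):
    """Net input I_j = sum_i w_ij * O_i + theta_j (Eq. 5.1), squashed by a sigmoid."""
    net_input = sum(w * o for w, o in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net_input))

# A unit with two incoming connections
print(unit_output(inputs=[0.6, 0.1], weights=[0.4, -0.2], bias=0.1))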
5.9 K-NEAREST NEIGHBOUR CLASSIFIERS
When trying to solve new problems, people often look at solutions to similar problems that they have previously solved. K-nearest neighbour (k-NN) is a classification technique that uses a version of this same method. It decides in which class to place a new case by examining some number (the “k” in k-nearest neighbour) of the most similar cases or neighbours (figure 5.10). It counts the number of cases for each class, and assigns the new case to the same class to which most of its neighbours belong.
Figure 5.10 K-nearest neighbour. N is a new case. It would be assigned to the class X because the seven X’s within the ellipse outnumber the two Y’s
The first thing you must do to apply k-NN is to find a measure of the distance between attributes in the data and then calculate it. While this is easy for numeric data, categorical variables need special handling. For example, what is the distance between blue and green? You must then have a way of summing the distance measures for the attributes. Once you can calculate the distance between cases, you then select the set of already classified cases to use as the basis for classifying new cases, decide how large a neighbourhood in which to do the comparisons, and also decide how to count the neighbours themselves (e.g., you might give more weight to nearer neighbors than farther neighbours). KNN (Algorithm 5.5) is a more complicated and robust classification algorithm that labels the target based on the classes of the K-nearest objects. The KNN algorithm is illustrated by the following example.
Algorithm 5.5: KNN Algorithm
Input:
  T    // Training data
  K    // Number of neighbours to use
  x'   // Input object to classify
Output:
  c    // Class to which x' is assigned
N ← ∅
for all v ∈ T do
  if |N| < K then
    N ← N ∪ {v}
  else if ∃ u ∈ N such that d(x', u) ≥ d(x', v) then
    N ← N – {u}
    N ← N ∪ {v}
  end if
end for
c = class to which the most u ∈ N are classified
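A minimal Python sketch of the same idea is given below: it keeps the k training objects closest to the new case, using the Euclidean distance formalized in Equation (5.2) below, and takes a majority vote among them. The small training set is hypothetical:

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, k, x_new):
    """training is a list of (attribute_tuple, class_label) pairs."""
    # Keep the k training objects nearest to x_new
    neighbours = sorted(training, key=lambda item: euclidean(item[0], x_new))[:k]
    # Assign the class held by the majority of those neighbours
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "X"), ((1.2, 0.8), "X"), ((5.0, 5.0), "Y"), ((5.2, 4.8), "Y")]
print(knn_classify(training, k=3, x_new=(1.1, 0.9)))   # prints "X"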
“Closeness” is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is
dist(X1, X2) = √( ∑i=1..n (x1i – x2i)² )        (5.2)
In other words, for each numeric attribute, we take the difference between the corresponding values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it. The square root is taken of the total accumulated distance count. Typically, we normalize the values of each attribute before using Equation (5.2). This helps prevent attributes with initially large ranges (such as income) from outweighing attributes with initially smaller ranges (such as binary attributes). Min-max normalization, for example, can be used to transform a value v of a numeric attribute A to v' in the range [0, 1] by computing
v' = (v – minA) / (maxA – minA)        (5.3)
where minA and maxA are the minimum and maximum values of attribute A.
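Equation (5.3) translates directly into code; the income range used here is made up purely for illustration:

def min_max_normalize(v, min_a, max_a):
    """Eq. (5.3): map a value v of attribute A into the range [0, 1]."""
    return (v - min_a) / (max_a - min_a)

# Normalizing an income of 45,000 observed in a (hypothetical) range of 20,000 to 120,000
print(min_max_normalize(45000, 20000, 120000))   # 0.25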
5.10 GENETIC ALGORITHM
Genetic algorithms are examples of evolutionary computing methods and are optimization-type algorithms. Given a population of potential problem solutions (individuals), evolutionary computing expands this population with new and potentially better solutions. The basis for evolutionary computing algorithms is biological evolution, where over time evolution produces the best or "fittest" individuals. Chromosomes, which are DNA strings, provide the abstract model for a living organism. Subsections of the chromosomes, which are called genes, are used to define different traits of the individual. During reproduction, genes from the parents are combined to produce the genes for the child.

Table 5.3 The process of crossover: (a) single crossover, in which the parent strings 000 000 and 111 111 exchange the segments after a single crossover point to produce the children 000 111 and 111 000; (b) multiple crossover, in which the segments between several crossover points are exchanged
In genetic algorithms, reproduction is defined by precise algorithms that indicate how to combine the given set of individuals to produce new ones. These are called crossover algorithms. Given two individuals (parents) from the population, the crossover technique generates new individuals (offspring or children) by switching subsequences of the strings. Table 5.3 describes the process of crossover.
Algorithm 5.6: Genetic Algorithm
Input:
  P // Initial Population
Output:
  P' // Improved Population
Genetic Algorithm: // Illustrates a genetic algorithm
repeat
  N = |P|;
  P' = ∅;
  repeat
    i1, i2 = select(P);
    o1, o2 = cross(i1, i2);
    o1 = mutate(o1);
    o2 = mutate(o2);
    P' = P' ∪ {o1, o2}
  until |P'| = N;
  P = P';
until termination criteria satisfied;
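As one possible concrete reading of the cross and mutate steps in Algorithm 5.6, the sketch below applies single-point crossover and bit-flip mutation to bit-string individuals; the crossover point, mutation rate, and parent strings are illustrative assumptions rather than part of the original text:

import random

def single_point_crossover(parent1, parent2):
    """Exchange the tails of two equal-length bit strings after a random crossover point."""
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutate(individual, rate=0.05):
    """Flip each bit independently with a small probability."""
    return "".join(bit if random.random() > rate else ("1" if bit == "0" else "0")
                   for bit in individual)

o1, o2 = single_point_crossover("000000", "111111")
print(mutate(o1), mutate(o2))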
Algorithm 5.6 outlines the steps performed by a genetic algorithm. Initially, a population of individuals, P, is created. Although different approaches can be used to perform this step, the individuals typically are generated randomly. From this population, a new population, P', of the same size is created. The algorithm repeatedly selects individuals from which to create new ones. These parents, i1, i2, are then used to produce two offspring, o1, o2, using a crossover process. The offspring may then be mutated. The process continues until the new population satisfies the termination condition.
________________________________________________________________
SUMMARY
Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. While classification predicts categorical labels (classes), prediction models continuous-valued functions. Preprocessing of the data in preparation for classification and prediction can involve data cleaning to reduce noise or handle missing values, relevance analysis to remove irrelevant or redundant attributes, and data transformation, such as generalizing the data to higher-level concepts or normalizing the data. Predictive accuracy, computational speed, robustness, scalability, and interpretability are five criteria for the evaluation of classification and prediction methods.
_______________________________________________________________
EXERCISES
Q1. Define Classification and Prediction.
Q2. What is Bayes theorem? Explain Naïve Bayesian classification.
Q3. Explain different classification methods.
Q4. Explain Bayesian Classification.
Q5. What is Back Propagation? Explain it.
Q6. Explain K-Nearest neighbour classifiers and other methods.
Q7. What is Classification? What is Prediction? Explain them.
Q8. Discuss K-Nearest neighbour classifiers and case-based reasoning.
Q9. How is Prediction different from Classification?
Q10. Compare and contrast Classification methods.
CHAPTER 6
Cluster Analysis
When human beings try to make sense of complex questions, our natural tendency is to break the subject into smaller pieces, each of which can be explained more simply. Clustering is a technique used for combining observed objects into groups or clusters. Clustering is a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data. The categories may be mutually exclusive and exhaustive, or consist of a richer representation such as hierarchical or overlapping categories. The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data.
CHAPTER OBJECTIVES
• Overview
• Where do we need Clustering?
• Characteristics of Clustering Techniques in Data Mining
• Applications of Clustering
• Data types in Cluster Analysis
• Categories of Clustering Methods
Examples of clustering in a knowledge discovery context include discovering homogeneous sub-populations for consumers in marketing databases and identification of sub-categories of spectra from infrared sky measurements. Note that the clusters overlap allowing data points to belong to more than one cluster. The original class labels (denoted by two different colours) have been replaced by “no color” to indicate that the class membership is no longer assumed.
6.1 CLUSTER ANALYSIS: OVERVIEW
Similar to classification, clustering is the organization of data in classes. However, unlike classification, in clustering the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity). Figure 6.1 shows an example of clustering, in which there are twenty points and three different ways of dividing them into clusters.
Figure 6.1
Different ways of clustering the same set of points
Many definitions for clusters have been proposed:
• Set of like elements. Elements from different clusters are not alike.
• The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.
EXAMPLE: An international online catalog company wishes to group its customers based on common features. Company management does not have any predefined labels for these groups. Based on the outcome of the grouping, they will target marketing and advertising campaigns to the different groups. The information they have about the customers includes income, age, number of children, marital status, and education. Table 6.1 shows some tuples from this database for customers in India. Depending on the type of advertising, not all attributes are important. For example, suppose the advertising is for a special sale on children's clothes. They could target the advertising only to the persons with children. One possible clustering is that shown by the divisions of the table. The first group of people has young children and a high school degree, while the second group is similar but has no children. The third group has both children and a college degree.
The last two groups have higher incomes and at least a college degree. The very last group has children. A different clustering would have been found by examining age or marital status.
Table 6.1 Example showing various groups of targeted people
Figure 6.2 Various groups of clusters based on data from Table 6.1
As illustrated in figure 6.2, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is based on the location of the home. Homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house.
Difference between clustering and classification A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse
direction: First partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.
6.2 STAGES OF CLUSTERING PROCESS
In brief, cluster analysis groups data objects into clusters such that objects belonging to the same cluster are similar, while those belonging to different ones are dissimilar. The notions of similarity and dissimilarity will become clear in a later section. The above definition indicates that clustering cannot be a one-step process. The clustering process can be divided into the following stages:
• Data Collection: Includes the careful extraction of relevant data objects from the underlying data sources. In our context, data objects are distinguished by their individual values for a set of attributes (or measures).
• Initial Screening: Refers to the massaging of data after its extraction from the source, or sources. This stage is closely connected to a process widely used in Data Warehousing, called Data Cleaning.
• Representation: Includes the proper preparation of the data in order to become suitable for the clustering algorithm. Here, the similarity measure is chosen, and the characteristics and dimensionality of the data are examined.
• Clustering Tendency: Checks whether the data in hand has a natural tendency to cluster or not. This stage is often ignored, especially in the presence of large data sets.
• Clustering Strategy: Involves the careful choice of clustering algorithm and initial parameters.
• Validation: This is one of the last and, in our opinion, most understudied stages. Validation is often based on manual examination and visual techniques. However, as the amount of data and their dimensionality grow, we have no means to compare the results with preconceived ideas or other clusterings.
• Interpretation: This stage includes the combination of clustering results with other studies, e.g., classification, in order to draw conclusions and suggest further analysis.
6.3 WHERE DO WE NEED CLUSTERING / APPLICATION AREAS?
Clustering has been used in many application domains, including economics, finance, medicine, and marketing. Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. In many fields there are obvious benefits to be had from grouping together similar objects. For example:
• In an economics application we might be interested in finding countries whose economies are similar.
• In a financial application we might wish to find clusters of companies that have similar financial performance.
• In a marketing application we might wish to find clusters of customers with similar buying behaviour.
• In a medical application we might wish to find clusters of patients with similar symptoms.
• In a document retrieval application we might wish to find clusters of documents with related content.
• In a crime analysis application we might look for clusters of high volume crimes such as burglaries or try to cluster together much rarer (but possibly related) crimes such as murders.
Recent uses include examining Web log data to detect usage patterns.
6.4 CHARACTERISTICS OF CLUSTERING TECHNIQUES IN DATA MINING
Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters. It is considered a challenging field of research in which its potential applications pose their own special requirements. In data mining the following are typical requirements of clustering:
• Scalability: Clustering algorithms should work well not only on small data sets containing fewer than 200 data objects, but also on large databases that may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
• Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
• Discovery of clusters with arbitrary shape: Many clustering algorithms based on distance measures tend to find spherical clusters with similar size and density. It is important to develop algorithms that can also detect clusters of arbitrary shape.
• Minimal requirements for domain knowledge to determine input parameters: Cluster analysis requires users to input certain parameters (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters, which are often difficult to determine. This not only burdens users, but it also makes the quality of clustering difficult to control.
• Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may produce clusters of poor quality.
• Incremental clustering and insensitivity to the order of input records: Some clustering algorithms are sensitive to the order of input data; they may generate dramatically different clusters depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input data.
• High dimensionality: Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
• Constraint-based clustering: Real-world applications, such as choosing the locations for a given number of new automatic banking machines (ATMs) in a city, may need to cluster households while considering constraints such as the city’s rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behaviour that satisfy specified constraints.
• Interpretability and usability: Clustering results are expected to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications.
6.5 DATA TYPES IN CLUSTER ANALYSIS
This section looks at the types of data that often occur in cluster analysis and how to preprocess them for such an analysis. Suppose that a data set to be clustered contains n objects, which may represent persons, houses, documents, countries, and so on. Main memory-based clustering algorithms typically operate on either of the following two data structures.
6.5.1 Data Matrix (or object-by-variable structure)
This represents n objects, such as persons, with p variables (also called measurements or attributes), such as age, height, weight, gender, and so on. The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables):
Figure 6.3 Data matrix
6.5.2 Dissimilarity Matrix (or object-by-object structure)
This stores a collection of proximities that are available for all pairs of n objects. It is often represented by an n-by-n table:
Figure 6.4
Dissimilarity matrix
where d(i, j) is the measured difference or dissimilarity between objects i and j. In general, d(i, j) is a nonnegative number that is close to 0 when objects i and j are highly similar or “near” each other, and becomes larger the more they differ. Since d(i, j) = d(j, i), and d(i, i) = 0, we have the matrix shown in figure 6.4. Measures of dissimilarity are discussed throughout this section.
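For numeric data, the dissimilarity matrix can be computed directly from the data matrix; the sketch below uses Euclidean distance as d(i, j) and a tiny hypothetical data matrix of three objects described by two variables:

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dissimilarity_matrix(objects):
    """Build the n-by-n matrix of d(i, j) values, with d(i, i) = 0 and d(i, j) = d(j, i)."""
    n = len(objects)
    return [[euclidean(objects[i], objects[j]) for j in range(n)] for i in range(n)]

objects = [(1.0, 2.0), (2.0, 4.0), (8.0, 8.0)]
for row in dissimilarity_matrix(objects):
    print([round(d, 2) for d in row])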
6.6 CATEGORIES OF CLUSTERING METHODS
There are many clustering methods available, and each of them may give a different grouping of a dataset. The choice of a particular method will depend on the type of output desired, the known performance of the method with particular types of data, the hardware and software facilities available, and the size of the data set. Many clustering algorithms exist in the literature. It is difficult to provide a crisp categorization of clustering methods because these categories may overlap, so that a method may have features from several categories. Nevertheless, it is useful to present a relatively organized picture of the different clustering methods. In general, the major clustering methods can be classified into the following categories:
Figure 6.5 Categories of clustering methods
6.6.1 Partitioning Methods
With partitional clustering, the algorithm creates only one set of clusters. These approaches use the desired number of clusters to drive how the final set is created. Given a database of objects, a partitional clustering algorithm constructs partitions of the data, where each cluster optimizes a clustering criterion, such as the minimization of the sum of squared distances from the mean within each cluster.
The k-Means Method: Centroid-based Technique
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster’s centroid or center of gravity.
Figure 6.6
Clustering of a set of objects based on the k-means method (the mean of each cluster is marked by a “+”)
As viewed in figure 6.6 above, the k-means algorithm partitions the objects so that each cluster’s center is represented by the mean value of the objects in the cluster.
Algorithm 6.1: The k-means partitioning algorithm
Input:
  k: the number of clusters
  D: a data set containing n objects
Output: A set of k clusters
Method:
(1) Arbitrarily choose k objects from D as the initial cluster centers
(2) Repeat
(3) (Re)assign each object to the most similar cluster, based on the mean value of the objects in the cluster
(4) Update the cluster means
(5) Until no change
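A minimal Python sketch of Algorithm 6.1 is given below; it uses Euclidean distance to decide which cluster center is "most similar", and the six two-dimensional points are hypothetical:

import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(data, k, max_iterations=100):
    """Steps (1)-(5) of Algorithm 6.1: choose centers, reassign objects, update means."""
    centers = random.sample(data, k)                       # (1) arbitrary initial centers
    for _ in range(max_iterations):                        # (2) repeat
        clusters = [[] for _ in range(k)]
        for point in data:                                 # (3) reassign to nearest center
            index = min(range(k), key=lambda i: euclidean(point, centers[i]))
            clusters[index].append(point)
        new_centers = [                                    # (4) update the cluster means
            tuple(sum(vals) / len(vals) for vals in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:                         # (5) until no change
            break
        centers = new_centers
    return centers, clusters

data = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = k_means(data, k=2)
print(centers)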
Advantages of k-means
• K-means is relatively scalable and efficient in processing large data sets.
• The computational complexity of the algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Normally, k ≪ n and t ≪ n.