TECH TOPIC 5
DATA MINING: AN ARCHITECTURE PART 1
by W.H. Inmon

[This Tech Topic is divided into two parts because of the length of the topic. It is recommended that if you read the first part you also read the second part, as the two parts logically form a single Tech Topic.]

One of the most important uses of the data warehouse is data mining. Data mining is the process of using raw data to infer important business relationships. Once the business relationships have been discovered, they can be used for business advantage. Certainly a data warehouse has other uses than data mining; however, the fullest use of a data warehouse must include data mining. There are many approaches to data mining, just as there are many approaches to the actual mining of minerals. Minerals historically have been mined in many different ways — by panning for gold, digging mine shafts, strip mining, analyzing satellite photos taken from space, and so forth. In much the same fashion, data mining occurs in many different forms and flavors, each with its own overhead, rewards and probability of success.
A General Approach

The general approach to data mining that will be discussed in this series of Tech Topics is described in Figure APP 1.
[Figure APP 1: a general approach to data exploration/data mining. The approach consists of five activities:
■ infrastructure - locate data, integrate data, scrub data, establish history, identify source data, locate metadata, define granularity, secure hardware platform, secure dbms platform, set objectives, determine frequency of refreshment;
■ exploration - random access of data, heuristic search for patterns;
■ analysis - summary level comparisons, sampling analysis, DSS analyst intuition;
■ interpretation - business relevance of discovered patterns, conditions predicating the pattern, statistical strength of discovered patterns, population the pattern is applicable to;
■ exploitation - business cycles, strength of prediction, factor correlation, seasonality, size of population (time, geography, demographics, other), sales, packaging, new product introduction, pricing, advertising, strategic alliance, delivery, presentation.
The methodology is non linear in that there are constant movements up and down the scale into and out of different activities. The methodology is a heuristic one in that the next step of development depends on the success and results of the current development.]

The diagram in Figure APP 1 is not a methodology, per se. Instead, the diagram represents a broader approach than a single methodology. The steps in the approach shown in Figure APP 1 are as follows:
■ infrastructure preparation,
■ exploration,
■ analysis,
■ interpretation, and
■ exploitation.
Infrastructure Preparation

The first step in data mining is the identification and preparation of the infrastructure. It is in the infrastructure that the actual activity of data mining will occur. The infrastructure contains (at a minimum):
■ a hardware platform,
■ a dbms platform, and
■ one or more tools for data mining.
In almost every case, the hardware platform is separate from the platform that originally contains the data. Said differently, it is very unusual to try to do data mining on the same platform as the operational environment. To be done properly, data needs to be removed from its host environment before it can be properly mined.

Removing data from the host environment is merely the first step in preparing the data mining infrastructure. In order for data mining to be done efficiently, the data itself needs to have undergone a thorough analysis, and in most cases the data needs to have been scrubbed. The scrubbing of the data entails integrating the data, because the operational data will often have come from many sources.

In addition, a metadata infrastructure — a "card catalog" — that sits above the data to be mined is very useful. Unless there is a small amount of data to be mined (which is almost never the case), the metadata serves as a roadmap as to what is and is not contained in the data. The metadata contains such useful information as:
■ what data is contained in the data to be mined,
■ what the source of the data is,
■ how the data has been integrated,
■ the frequency of refreshment, and so forth.
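To make the idea of the metadata "card catalog" concrete, here is a minimal sketch in Python of what one catalog entry might hold. The field names and the sample entry are illustrative assumptions, not a prescribed standard.

```python
from dataclasses import dataclass

@dataclass
class MetadataEntry:
    """One 'card catalog' entry describing a body of data available for mining."""
    table_name: str         # what data is contained in the data to be mined
    source_systems: list    # what the source of the data is
    integration_notes: str  # how the data has been integrated
    granularity: str        # e.g., "one row per sales transaction"
    refresh_frequency: str  # the frequency of refreshment

# A hypothetical catalog with a single entry.
catalog = [
    MetadataEntry(
        table_name="sales_history",
        source_systems=["order_entry", "point_of_sale"],
        integration_notes="customer ids reconciled; all amounts converted to USD",
        granularity="one row per line item",
        refresh_frequency="monthly",
    ),
]

# The catalog serves as a roadmap: the analyst scans it before touching the data.
for entry in catalog:
    print(entry.table_name, "-", entry.granularity, "-", entry.refresh_frequency)
```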
Granularity

One of the biggest issues of creating the data mining infrastructure is the granularity of the data. The finer the level of granularity, the greater the chance that unusual and never before noticed relationships of data will be discovered. But the finer the level of granularity, the more units of data there will be. There often are so many units of data that important relationships hide behind the sheer volume of data. Therefore, raising the level of granularity (that is, summarizing the data to a coarser level) can greatly help in cutting through the volume to discover important relationships of data. There is a very important trade-off to be made here between the very fine level of detail that is possible and the need to manage volumes of data. Making this trade-off properly is one of the reasons why data mining is an art, not a science.
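As a small illustration of the trade-off, the sketch below rolls hypothetical transaction-level detail up to a coarser, customer-per-day grain. The records and field names are invented for the example.

```python
from collections import defaultdict

# Fine granularity: one record per transaction (hypothetical data).
transactions = [
    {"customer": "C1", "day": "1997-01-02", "amount": 19.95},
    {"customer": "C1", "day": "1997-01-02", "amount": 4.50},
    {"customer": "C2", "day": "1997-01-03", "amount": 250.00},
]

# Raising the level of granularity: one unit of data per customer per day.
daily_totals = defaultdict(float)
for t in transactions:
    daily_totals[(t["customer"], t["day"])] += t["amount"]

# Fewer units of data to manage, but the intra-day detail is lost.
for (customer, day), total in sorted(daily_totals.items()):
    print(customer, day, round(total, 2))
```

The coarser grain is easier to scan, but any relationship that lives below the customer/day level can no longer be seen, which is exactly the trade-off described above.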
Exploration

Once the infrastructure has been established, the first steps in exploration can commence. There are as many ways to approach data exploration as there are to approach the actual discovery of minerals. Some of the approaches to the discovery of important relationships include:
■ analyzing summary data and "sniffing out" unusual occurrences and patterns,
■ sampling data and analyzing the samples to discover patterns that have not been detected before,
■ taking advantage of the intuition and experience of the experienced DSS analyst,
■ doing random access of data,
■ heuristically searching for patterns, etc.
Analysis

Once the patterns have been discovered, there needs to be a thorough analysis of each pattern. Some patterns are not statistically very strong, while other patterns are very strong. The stronger the pattern, the better the chance that it will form a basis for exploitation. On the other hand, if a pattern is not strong today, but its strength is increasing over time, then this kind of pattern may be of great interest because it may be a clue as to how to anticipate the marketplace.

Strength of pattern is not the only thing that needs to be considered. A second consideration is whether the pattern is a "false positive". A false positive is a correlation between two or more variables that is valid, but random. Given the large number of occurrences of data and the large number of variables, it is inevitable that there will be some number of false positive correlations of data. A third consideration is whether a valid correlation of variables has any business significance. It is entirely possible there will be a valid correlation between two variables that is not a false positive, but for which there is no business significance. These, then, are some of the analysis activities that follow exploration.
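A minimal sketch of how the strength of a discovered pattern might be checked, using a simple permutation test written against invented data; the variable names and the number of trials are assumptions chosen for illustration. If randomly shuffled data matches the observed correlation often, the pattern is a candidate false positive.

```python
import random
import statistics

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def permutation_p_value(x, y, trials=1000):
    """How often does a random shuffle match the observed correlation?
    A high proportion suggests the pattern may be a false positive."""
    observed = abs(pearson(x, y))
    y = list(y)  # work on a copy; the caller's data is untouched
    hits = 0
    for _ in range(trials):
        random.shuffle(y)
        if abs(pearson(x, y)) >= observed:
            hits += 1
    return hits / trials

promo_spend = [1, 2, 3, 4, 5, 6, 7, 8]          # hypothetical variable A
units_sold = [11, 13, 12, 16, 18, 17, 20, 22]   # hypothetical variable B
print("strength r =", round(pearson(promo_spend, units_sold), 3))
print("p-value ~", permutation_p_value(promo_spend, units_sold))
```

A low p-value here says only that the pattern is statistically strong; whether it is a false positive with no business significance still requires the business judgment described above.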
Interpretation

Once the patterns have been discovered and analyzed, the next step is to interpret them. Without interpretation, the patterns that have been discovered are fairly useless. In order to interpret the patterns, it is necessary to combine technical and business expertise. An interpretation without both elements is a fairly pointless exercise. Some of the considerations in the interpretation of the patterns include:
■ the larger business cycles of the business,
■ the seasonality of the business,
■ the population to which the pattern is applicable,
■ the strength of the pattern and the ability to use the pattern as a basis for predicting future behavior,
■ the size of the population the pattern applies to, and
■ other important external correlations to the pattern, such as:
♦ time — of week, of day, of month, of year, etc.,
♦ geography — where the pattern is applicable, and
♦ demographics — the group of people the pattern is applicable to, and so forth.
Once the pattern has been discovered and the interpretation has been made, the process of data mining is prepared to enter the last phase.
Exploitation

Exploitation of a discovered pattern is both a business activity and a technical activity. The easiest way that a discovered pattern can be exploited is to use the pattern as a predictor of behavior. Once the behavior pattern is determined for a segment of the population served by a company, the pattern can be used as a basis for prediction. Once the population is identified and the conditions under which the behavior will predictably occur are defined, the business is in a position to exploit the information. There are many ways the business can exploit the information:
■ by making sales offers,
■ by packaging products to appeal to the predicted audience,
■ by introducing new products,
■ by pricing products in an unusual way,
■ by advertising to appeal to the predicted audience,
■ by delivering services and/or products creatively,
■ by presenting products and services to cater to the predicted audience, and so forth.
In addition to using patterns to position sales and products in a competitive and novel fashion, the measurement of patterns over time is another useful way that pattern processing can be exploited. Even if a detected pattern does not have a strong correlation, or if only a small population shows the characteristics of the pattern, if the pattern is growing stronger over time or if the population exhibiting the pattern is growing, the company can start to plan for the anticipated growth. Measuring the strength or weakness of a pattern over time, or the growth or shrinkage of the population exhibiting the characteristics of the pattern over time, is an excellent way to gauge changes in product line, pricing, etc. Yet another way that patterns can lead to commercial advantage is in distinguishing the populations that correlate to the pattern. In other words, if it can be determined that some populations correlate to a pattern and other populations do not, then the business person can position advertising and promotions with bull's eye accuracy, thereby improving the rate of success and reducing the cost of sales.
The Approach to Data Exploration and Data Mining

The approach that has been outlined in Figure APP 1 is one in which the activities appear to be linear — i.e., the activities appear to flow from one activity to another in a prescribed order. While there may actually be some project that flows as described, it is much more normal for the activities to be executed in a heuristic, non linear manner. First one activity is accomplished, then another activity is done. The first activity is repeated and another activity commences, and so forth. There is no implied order to the activities shown in Figure APP 1. Instead, upon the completion of an activity, any other activity may commence, and even activities that have been previously done may have to be redone. Such is the nature of heuristic processing.
Data Mining/Data Exploration

What is data mining and data exploration? Figure 1.1 gives a simple definition of data mining and data exploration.
[Figure 1.1: what is data mining/data exploration? Data mining/data exploration - the usage of historical data to discover and exploit important business relationships]

Data mining is the use of historical data to discover patterns of behavior and to exploit those patterns in the commercial environment. Typically the audience being considered by data miners is the consumer, and the sales that have been made are the focus of the mining activity. However, there is no reason data mining cannot be used in the manufacturing, telecommunications, insurance, banking and other environments. The notion behind data mining is that important relationships of transactions and other vestiges of customer activity lie hidden away in the everyday transactions executed by the customer. Older systems have captured a wealth of data as they executed the basics of a transaction. It is data exploration and data mining that discover those relationships and enable them to be exploited commercially in novel and unforeseen ways. In order to understand the techniques for data mining and data exploration, it is necessary to recognize the people who are doing data mining and data exploration. Figure 1.2 shows that there are two primary groups of people who engage in data mining and data exploration.
[Figure 1.2: data mining is done by explorers and farmers]
Figure 1.2 also shows that there are two distinct groups of people that do data mining and data exploration — farmers and explorers. At any one moment in time an individual is one or the other. Over time, an individual may act in both capacities.
Explorers

Explorers are depicted in Figure 1.3.

[Figure 1.3: explorers - access data infrequently; don't know what they want; look at things randomly; look at lots of data; often find nothing; occasionally find huge nuggets]

Explorers are DSS analysts who don't know what they want, but have an idea that there is something in the data worth pursuing. Explorers typically look at large amounts of data, and frequently they find nothing. Explorers look for interesting patterns of data in a non repetitive fashion. Even though they typically find nothing, on occasion explorers find huge nuggets of information. Explorers look at data in very unpredictable and very unusual fashions. Many important findings are made by explorers, although the results they achieve are unpredictable.
Farmers

Farmers are also DSS analysts, just as explorers are. But farmers have a very different approach to data mining and data exploration. Figure 1.4 depicts farmers.

[Figure 1.4: farmers - know what they are looking for; frequently look for things; have a repetitive pattern of access; look for small amounts of data; frequently find small flakes of gold]
Farmers look for small amounts of data, and they frequently find what they are looking for. Farmers access data frequently and in a very predictable pattern, and they seldom come up with any major insight. Instead, farmers find small flakes of gold when they do mining. Farmers follow the lead of explorers, as seen in Figure 1.5.
[Figure 1.5: where the explorer leads, the farmers follow]

Explorers look over vast seas of data. When explorers find something of value, they turn over their findings to farmers. Farmers, in turn, attempt to repeat the success of the explorer, but in a predictable, repetitive manner, not in the random manner of the explorer. Said differently, the explorer discovers what to look for, and the farmer executes the search once it is established that something valuable exists in the data on which mining and exploration is done.
Macro/Micro Exploration

Data mining and exploration are effectively done at both the micro and the macro level, as seen in Figure 1.6.

[Figure 1.6: there are two types of exploration - macro exploration at the summary level and micro exploration at the detailed level]
Different types of analysis and different types of conclusions can be drawn from the summary and detailed data on which mining and exploration can be done. Figure 1.6 shows that macro exploration is done at the summary level, and micro analysis is done at the detailed level. Both types of data are needed for data mining and exploration. Any mining and exploration effort that excludes one or the other type of data greatly reduces the chances of success.

Summary data is good for looking at long-term trends and getting the larger picture. Once the larger picture is understood, the DSS analyst knows where to go to productively mine and explore the detailed data. The summary data serves as a roadmap to where productive data mining and exploration can be done. Without summary data, productive detailed mining and exploration is simply guesswork. The limitation of mining and exploration at the summary level is that no detailed analysis or correlation of data can be done. With summary data, the process of mining and exploration can only go so far before the lack of detail hampers the analysis.

Detailed data is required for the in-depth analysis and correlation of data. There is no question as to the worth of detailed data. However, there usually is so much detailed data that the DSS analyst drowns in detail unless a preliminary analysis has been done at the summary level. The single most difficult issue of analyzing detailed data is that there are massive volumes that must be considered. The cost, overhead and time consumed by dealing with massive volumes of detailed data is such that the most effective way to deal with detailed data is to start with summary data.

There is then a symbiotic relationship between the summary data and detailed data that comes to the attention of the explorer and the farmer. Exploration of data is an iterative process that involves at least three elements, as described in Figure 1.7.
[Figure 1.7: exploration is a constant loop of iterative hypothesis and verification between the business itself, summary data, and detailed data]
The process of exploration is a constant movement of focus from summary data to detailed data to the business environment. The activities shift from one environment to the next in a random order, based upon the heuristic processing that is occurring. Trying to do exploration with any one of the elements missing will greatly inhibit the explorer and reduce the chances of success.
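A minimal sketch of one turn of this loop, on invented data: summarize at the macro level, let the summary point to an odd spot, then drill into the detail for just that spot. The records and the "most deviant month" rule are illustrative assumptions.

```python
# Hypothetical detail: (month, store, amount) sales records.
detail = [
    ("Jan", "north", 100), ("Jan", "south", 120),
    ("Feb", "north", 95),  ("Feb", "south", 115),
    ("Mar", "north", 40),  ("Mar", "south", 240),  # something odd in March
]

# Macro pass: summarize by month.
summary = {}
for month, store, amount in detail:
    summary[month] = summary.get(month, 0) + amount

# The summary acts as the roadmap: find the month that deviates most.
avg = sum(summary.values()) / len(summary)
odd_month = max(summary, key=lambda m: abs(summary[m] - avg))

# Micro pass: drill into the detail for just that month.
print("drilling into", odd_month)
for month, store, amount in detail:
    if month == odd_month:
        print(" ", store, amount)
```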
Correlating Data

The intellectual basis for data mining is the correlation of data. Figure 1.8 graphically depicts the correlation of some data values.
[Figure 1.8: correlating data is at the basis of exploration - when there is a correlation there may be the opportunity for exploitation; when there is no correlation of data, the opportunity for business exploitation is diminished or non existent]

When data is correlated, there is a message. In some cases — the very strongest — a cause and effect relationship is inferred. In the cause and effect relationship, one data occurrence is said to have caused another occurrence of data. When cause and effect can be inferred it is easy to predict business phenomena; however, the cause and effect relationship is not very common.

The more normal case is not cause and effect but a coincidental relationship. Perhaps two occurrences arise because of an unknown common heritage. In other words, the variables are not related in a direct cause and effect relationship; instead, some event has caused both variables to occur. In this case there is a very strong relationship between two variables even though it is not a cause and effect relationship.

A third important possibility is that of an indirect relationship. In an indirect relationship, two variables are related, although not in a direct causal relationship. Variable A was generated by event ABC. Event ABC caused event BCD to occur, which in turn caused variable B to come into existence. In this case, there is a relationship between A and B, but the relationship is an indirect one, and certainly not a cause and effect relationship.
Another relationship between the existence of two variables is the very indirect relationship. In a very indirect relationship there is indeed some relationship between the existence of variable A and variable B, but the relationship is unknown and is not obvious to any reasonable form of analysis.

The next form of relationship between two variables (A and B) is one that is purely random. For the given set of data there happens to be a relationship between A and B, but there is no valid reason for the relationship, and for another set of data the relationship may very well not exist. The existence of the relationship is merely a random artifact of the set of data under analysis. When there is a lot of data and a lot of variables, it is inevitable that a fair number of "false positive" relationships will be discovered that are mathematically valid but are purely a function of the data on which the analysis is being done.

Finally, there is the possibility of a valid relationship between two variables that has a mathematical basis, but for which there is no business basis. The variables may in fact have a mathematically sound relationship, but there is no valid business reason why there should be a correlation. This last case is very interesting. Should there in fact be a valid business basis for the relationship, and should the DSS analyst discover what that basis is, then there may well be a nugget of wisdom waiting. But if in fact there is no valid business basis for the correlation, the correlation may well be a red herring.

The stronger the relationship, the better the chance that there will be the opportunity to exploit the correlation. Conversely, the weaker the relationship, the less chance there will be for exploitation. However, weak relationships should not be discarded out of hand. When a weak relationship is discovered and that relationship is becoming stronger over time, there is the opportunity to anticipate a marketplace, and this opportunity is the epitome of the success that can follow data mining and exploration. Therefore, looking at weak relationships is not a waste of time if in fact those relationships are increasing in strength over time.

Another factor to be considered in examining relationships is the population over which the relationship is valid. In some cases, there may be a weak correlation of data over the general population, but a very strong correlation over a selected population. In this case, segmented marketing may present a real and undiscovered opportunity. For these reasons, weak relationships may be as interesting as very strong and very obvious relationships.
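The sketch below illustrates watching a weak relationship over time: the same correlation is computed per quarter on invented observations, so that a strengthening trend stands out. The quarterly data is fabricated to show the effect.

```python
import statistics

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical paired observations of variables A and B, quarter by quarter.
quarters = {
    "Q1": ([1, 2, 3, 4, 5], [3, 1, 4, 2, 5]),
    "Q2": ([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]),
    "Q3": ([1, 2, 3, 4, 5], [1, 2, 3, 5, 4]),
    "Q4": ([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),
}

# In this toy data the correlation strengthens quarter over quarter,
# exactly the kind of weak-but-growing relationship worth watching.
for quarter, (a, b) in quarters.items():
    print(quarter, round(pearson(a, b), 2))
```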
Ways to Correlate Data

The simplest way to correlate data is to ask, for a given selection of data, how often variable B exists when variable A exists. This simple approach is used quite often and forms the basis of discovering relationships. There are, however, many refinements to the way that data can be correlated. Figure 1.9 shows some of the refinements.
[Figure 1.9: there are many ways to correlate data - unit of data versus unit of data; unit of data versus time (Jan, Feb, Mar, Apr ...; Mon 8:00 am, Tues 9:00 am, Wed 10:00 am, Thurs 11:00 am ...); unit of data versus groups of data; unit of data versus geography; unit of data versus external data (e.g., the Dow-Jones average); unit of data versus demographics]
Figure 1.9 shows that data can be correlated from one element to another, from one element to another over different time periods, from one element to a group of elements, from one element over a geographic area, from one element to external data, from one element to a demographically segmented population, and so forth. Furthermore, the same element can be correlated over multiple variables at the same time — time, geography, demographics, etc.

As an example of correlating an element of data to another element of data, consider the analysis of sales where the amount of the sale is correlated to whether the sale is paid for in cash or with a credit card. When a sale is below a certain amount, it is found that the payment is made with cash. When the sale is over a certain amount, the payment is made with a credit card. And when the sale is between certain parameters, the sale is paid for with either credit card or cash.

As an example of a variable being correlated to time, consider the analysis of airline flights throughout the year. The length of the flight and the cost of the flight can be correlated against the month of the year that a passenger flies. Do people make more expensive trips in January? In February? As the holidays approach, do people make shorter and less expensive trips? What exactly is the correlation?

As an example of correlating data against groups of other data, consider an analyst who wants to know if the purchase of automobiles correlates to the sale of large ticket items in general, such as washers and dryers, television sets and refrigerators.

The correlation of units of data to geography is a very common one in which data is matched against local patterns of activity and consumption. The comparison of the beer drinking habits of Southerners versus Southwesterners is an example of correlating units of data to geography.

One of the most useful types of correlations is that of correlating internal data to external data. The value of this kind of correlation is demonstrated by the comparison of internal sales figures to industry-wide sales figures.
And finally, as an example of demographic correlation, the savings rate for college-educated people can be correlated against the savings rate for non college-educated people. There are an infinite number of combinations of correlations that can be done. Some correlations are very revealing, and some are merely interesting but have no potential for exploitation. The reason why correlations are so intriguing is that they hold the key to commercial exploitation. Figure 1.10 illustrates the sequence of activities that starts with correlation and ends with competitive advantage.
[Figure 1.10: why pattern recognition is so important - detailed pattern analysis and recognition leads to the determination of trends; the determination of trends leads to predictable patterns of consumption or business activity over large audiences; predictable patterns of business activity lead to the ability to anticipate the marketplace; the ability to anticipate the marketplace leads to competitive advantage]
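As a minimal sketch of the simplest correlation question named earlier (how often variable B exists when variable A exists), consider the following, computed over invented market baskets:

```python
# Hypothetical market baskets: does B tend to appear when A appears?
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A"},
    {"B", "C"},
    {"A", "B", "D"},
]

with_a = [b for b in baskets if "A" in b]
both = [b for b in with_a if "B" in b]

# "How often does variable B exist when variable A exists?"
print("P(B | A) =", len(both) / len(with_a))   # 3/4 in this toy data
```

The same counting idea extends directly to the refinements of Figure 1.9: condition the count on a time period, a geographic area, or a demographic segment instead of on variable A alone.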
Different Kinds of Trends

There are a multitude of trends. Some of them are useful; some are not. In no small part the usefulness of a trend depends on how permanent the trend is. Figure 1.11 suggests the difference in trends.

[Figure 1.11: different kinds of trends - long term trends, short term trends, sub trends; each is important and each has its own particular place in gaining competitive advantage]
The most interesting trends are the long-term trends. Long-term trends are interesting in that once detected they can be used for market or behavioral anticipation. Once anticipation can be achieved, it is easy to position for market advantage. But long-term trends tend to be obvious, and the competition most likely will have noticed the trend as well.

Short-term trends hold promise in two ways. If they can be detected early enough, they will become useful tools of competition only to the company that has detected them. But there are problems with short-term trends:
■ by definition, they have a short life, and
■ they often compete with many other trends for attention and are hard to detect.
Exploiting short-term trends requires great agility on the part of the corporation. But because no one else is likely to have noticed the trend, the advantage afforded can be large. Another way of looking at trends is that a large trend is merely a part of a series of much smaller trends. Each of the smaller trends can be exploited on its own.
Data Warehouse and Data Mining/Data Exploration

Data mining and exploration can be done on any body of data. It is not necessary to have a data warehouse for the purpose of data mining. But there are some very good reasons why a data warehouse is — easily — the best foundation for data mining and data exploration. Figure 1.12 shows that the data warehouse sits at the base of data mining.

[Figure 1.12: the data warehouse sets the stage for data exploration and mining - heuristic analysis against the data warehouse yields trends and patterns]

The data warehouse provides the foundation for heuristic analysis. The results of the heuristic analysis are created and analyzed as the basis for the next iteration of analysis. Figure 1.13 shows that many iterations of heuristic analysis occur.
[Figure 1.13: because data in the data warehouse is static and is not updated, the DSS analyst can change the hypothesis (hypothesis #1, #2, #3, #4) and determine the results without worrying about the effect of changing data]

As the DSS analyst goes from one hypothesis to another, the DSS analyst needs to hold the data constant. The DSS analyst changes the hypothesis and reruns the analysis. As long as the data is held constant, the DSS analyst knows that the changes in results from one iteration to another are due to the changes in the hypothesis. In other words, the data needs to be held constant if iterative development is to proceed on a sound basis. Consider what happens when data is being updated, as seen in Figure 1.14.
[Figure 1.14: in the operational environment where update is occurring, the DSS analyst never knows whether the different results that are achieved are because of the changing hypothesis or because of changes that have occurred to the underlying data]

In Figure 1.14, update is regularly occurring to data at the same time that heuristic analysis is being done. The changes in the results the DSS analyst sees are always questionable. The DSS analyst never knows whether the changes in results are a function of the changing of the hypothesis, the changing of the underlying data, or some combination of both. The first reason, then, why the data warehouse provides such a sound basis for data mining and data exploration is that the foundation of data is not constantly changing. But that reason — as important as it is — is not the only reason why the data warehouse provides such a good basis for data mining.
Perhaps the most compelling case that can be made for a data warehouse as a basis for data mining and data exploration is that the data in the warehouse is integrated. The very essence of the data warehouse is that of integration. To properly build a data warehouse requires an arduous amount of work because data is not merely “thrown into” a data warehouse. Instead, data is carefully and willfully integrated as it is placed into the warehouse. Figure 1.15 shows the integration that occurs as data is placed into the warehouse.
[Figure 1.15: without a data warehouse, there is a need for a massive integration effort before the process of data mining can commence - integrate, scrub, reconcile, restructure, convert, change dbms, change operating systems, summarize, merge, supply default values, and so forth]
When the DSS analyst wants to operate from a non data warehouse foundation, the first task the DSS analyst faces is having to "scrub" and integrate the data. This is a daunting task and holds up progress for a lengthy amount of time. But when a DSS analyst operates on a data warehouse, the data has already been scrubbed and integrated. The DSS analyst can get right to work with no major delays. Stability of data and integration are not the only reasons why the data warehouse forms a sound foundation for data mining and exploration. The rich amount of history is another reason why the data warehouse invites data mining. Figure 1.16 shows that one of the characteristics of the data warehouse is a robust amount of historical data.
[Figure 1.16: without a data warehouse, there is a need to go back and find and reconstruct massive amounts of historical data (1985, 1986, 1987, 1988 ... 1996) before hypotheses can be tested]
If the DSS analyst attempts to go outside of the data warehouse to do data mining, the DSS analyst faces the task of locating historical data. For a variety of reasons, historical data is difficult to gather. Some of those reasons are:
■ historical data gets lost,
■ historical data is placed on a medium that does not age well, and the data becomes impossible to physically read,
■ the metadata that describes the content of historical data is lost, and the structure of the historical data becomes impossible to decipher,
■ the context of the historical data is lost,
■ the programs that created and manipulated the historical data become misplaced,
■ the versions of the dbms that the historical data is stored under become out of date and are discarded, and so forth.
There are many challenges facing the DSS analyst in trying to go backward in time and reclaim historical data. The data warehouse, on the other hand, has the historical data neatly and readily available. Another reason why the data warehouse sets the stage for effective data mining and data exploration is that the data warehouse contains both summary and detail data, as illustrated in Figure 1.17.
[Figure 1.17: with a data warehouse there is a rich supply of both detailed and summary data]

The DSS analyst is able to immediately start doing effective macro analysis of data using the summary data found in the data warehouse. If the DSS analyst must start from the foundation of operational or legacy data, there often is very little summary data. Only detailed data resides (to any great extent) in the operational or legacy environment. Therefore, the DSS analyst can do only micro analysis of data when there is no data warehouse, and starting with micro analysis of data in the data mining adventure is risky at best.
Figure 1.18 shows the ability of the data warehouse to support both micro and macro data mining and data exploration activities.
[Figure 1.18: the different levels of summarization and detail are supported by the existence of both summarized and detailed data - macro analysis against summary data, micro analysis against detail data, all within the data warehouse]
Analyzing Historical Data

There is a notion that history is not a valid source for analysis of trends. The idea is that today's business is different from yesterday's business, so looking at yesterday's data tells us only about the past, not the future. Indeed, if a company goes through a radical change in business, then yesterday's data does not point the way to tomorrow's opportunity. However, radical business change is not the way that most businesses conduct affairs. While the external aspects of a business do change frequently, the fundamentals of the business remain constant. History is required to be able to measure and fathom the economic cycles to which all businesses are subject. You cannot understand business cycles by looking at this month's or this quarter's data. Historical data is the underpinning of understanding the long-term trends that the business is engulfed in. Figure 1.19 shows the basis of history for the understanding of long-term trends.
[Figure 1.19: at the macro level of analysis, history - summary data by month (Jan, Apr, Jul, Oct ...) - is needed to understand the larger business cycles: seasonality and industrial cyclicity]

In addition to history being the key to comprehending long-term business cycles, history is also the key to understanding the seasonality of sales and other business activity. These long-term trends are best understood in terms of summary data, not detailed data. Historical data can be used at the detailed, micro level as well. Historical data can be used to track the pattern of activity of individual consumers. Figure 1.20 shows the usage of detailed historical data.
[Figure 1.20: at a micro level, detailed history is important because consumers are creatures of habit. It is very unusual for a consumer to undergo a massive change of habits; therefore, the consumption habits of the past are an excellent indicator of future habits of consumption]

Individuals are creatures of habit. The likelihood that buying habits, consumption habits, recreational habits, living style habits, and so forth will be the same tomorrow as they were today is a valid assumption for most people, once the individual's habits are formed in the first place. It is very unusual for an individual to make a great departure from a lifelong habit once that habit has been established. Therefore, on an individual basis, the history of consumption and other activities for an individual is an excellent predictor of what the future habits of the individual will be.
Volumes of Data — The Biggest Challenge

The single largest challenge the DSS analyst has in doing effective data mining and data exploration is coming to grips with the volumes of data that accompany the data warehouse. There inevitably are masses and masses of data. The cost of manipulating, storing, understanding, scanning, loading, and so forth is enormous. Even the simplest activity becomes painful in the face of the large volumes that await the DSS analyst.
Figure 1.21 shows that piles of data await the DSS analyst as the first challenge in doing data mining.
[Figure 1.21: the biggest challenge of data exploration/data mining - drowning in a sea of data]

There are many problems with the volumes of data the DSS analyst must maneuver through. But the most insidious problem is that the volumes of data mask the important relationships and correlations that the DSS analyst is looking for. Figure 1.22 shows that hiding among the occurrences of data are important patterns that the DSS analyst is seeking.
[Figure 1.22: large volumes of data hide important relationships]
There are other problems with trying to ferret out patterns of activity and behavior from volumes of data. There is the problem of "false positives" occurring when there is a massive amount of data. A "false positive" is a relationship that is valid, but only randomly so. Figure 1.23 shows the problem of "false positives" arising with a large amount of data.
[Figure 1.23: with enough data, there are bound to be some "false positives" that are detected - even though the correlation is real and mathematically valid, the business basis is nonexistent]

False positives occur simply because there is so much data. A false positive is a mathematically accurate pattern that has no basis in the business itself. When there is a lot of data, it is impossible for there not to be false positives. Given enough data, correlations will begin to appear simply because there is so much data, and for no other reason. Therefore, the DSS analyst always needs to interject the sensibility of business into the equation. Just because the numbers tell a story is no indication that the story has a basis in the business itself. The task of the DSS analyst is to find the needle in the haystack — where there is a mathematical basis for a correlation and where there is a business basis as well. Figure 1.24 shows an analyst looking for the needle.
[Figure 1.24: finding useful business patterns is like finding the needle in the haystack]
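The arithmetic behind the needle-in-the-haystack problem can be sketched directly. The numbers below are illustrative assumptions; the point is that testing many variable pairs at a fixed significance level guarantees some purely random "finds". One conservative remedy, the Bonferroni adjustment, is shown as well; it is a standard statistical technique, not one named in this Tech Topic.

```python
# With many candidate variable pairs, some "valid" correlations are
# guaranteed to appear by chance alone.
variables = 200
pairs = variables * (variables - 1) // 2   # every pair of variables tested
alpha = 0.05                               # significance threshold per test

expected_false_positives = pairs * alpha
print("pairs tested:", pairs)                                  # 19900
print("expected false positives:", expected_false_positives)  # ~995

# One conservative remedy (Bonferroni): tighten the per-test threshold.
print("Bonferroni-adjusted alpha:", alpha / pairs)
```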
Sampling Techniques

One of the most effective ways of dealing with the massive volumes of data found in the data warehouse is to use sampling techniques. By using a sample, the DSS analyst can significantly reduce the overhead and cost of doing analysis against data. In addition, the DSS analyst can reduce the turnaround time required for any pass against the data. Figure 1.25 suggests the importance of sampling.

[Figure 1.25: one way to weed through the volume of data is to select a sample of data and do analysis on the sample]

While there are many advantages to using a sample for analysis of data, there are some disadvantages. Some of the disadvantages are:
■ whenever a sample is chosen, bias will be introduced. In the best of cases the bias will be a known factor and will not unduly influence the analytical work done against the data,
■ the sample needs to be periodically refreshed,
■ the sample cannot be used for general purpose processing,
■ the sample cannot produce exact results, only approximations, and so forth.
For all of the factors that limit the usefulness of sampling, the advantages of sampling far outweigh the disadvantages insofar as the work of the DSS analyst is concerned. The first and most basic question the DSS analyst faces upon beginning to use samples for data mining and data exploration is: what kind of sample should be chosen? A simple technique is to choose a random sample, as shown in Figure 1.26.
[Figure 1.26: a randomly chosen sample often yields random positive patterns, which may or may not be relevant to the business equation. Choosing a random sample is a good way to start general exploration]

A random sample can be chosen by simply marching through the data warehouse in a sequential manner and selecting every nth record. Inevitably there will be some bias in using this technique, but as long as the bias is not too severe, the technique works just fine. One of the problems of choosing a random sample of data on which to do data mining and data exploration is that it is still possible to get false positives. It is true that fewer false positive correlations will appear using a random sample than if the full population were used for analysis. But false positives are not eliminated by the selection of random samples.
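A minimal sketch of the "every nth record" technique just described, over invented warehouse rows:

```python
def every_nth(records, n, start=0):
    """Systematic sample: march through the data and keep every nth record.
    Some bias is possible if the data has a periodic order."""
    return records[start::n]

# Hypothetical warehouse rows.
rows = [{"id": i, "amount": i * 1.5} for i in range(1, 101)]
sample = every_nth(rows, n=10)
print(len(sample), "of", len(rows), "records sampled")
```

Note that `records[start::n]` assumes the data can be addressed sequentially; against a real warehouse table the same idea would be expressed in whatever access language the dbms provides.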
Judgment Samples

An appealing alternative to the selection of a random sample is choosing a non random sample (or a "judgment sample"). Figure 1.27 shows the selection of a judgment sample of data for the purpose of doing data mining and data exploration.

[Figure 1.27: a non random sample is a good way to start a refined, directed analysis. More powerful conclusions can be reached more quickly starting with a non random sample]
The choice of the data entering the judgment sample can be a powerful tool for the DSS analyst. The DSS analyst can use the selection of the data that goes into the judgment sample in order to qualify the correlations that will be made. In other words, the DSS analyst can look at a selected subset of data and find the correlations that apply only to that subset. In doing so, the DSS analyst can start to create profiles for the different subpopulations of the corporation. The judgment sample can be made along any lines the DSS analyst deems useful. Figure 1.28 shows some of the many different ways the DSS analyst can choose a judgment sample.

[Figure 1.28: there are many criteria on which to select the data - by time, by geography, by customer demographics, by data qualification, by multiple criteria]

On the one hand, the DSS analyst can use the judgment sample in order to qualify the data and, in doing so, start to understand the different subpopulations. On the other hand, the DSS analyst can be limited by the judgment sample as to the conclusions that can be drawn and the patterns that are discovered. Any pattern that is discovered must always be qualified by stating that the pattern applies only to the subpopulation represented by the judgment sample. One technique to overcome the limitations of selecting a judgment sample is to analyze the judgment sample, then to reanalyze the base population of data once the pattern has been observed. Figure 1.29 shows this technique.
[Figure 1.29: the conclusions that are drawn from the analysis of a sample are applicable to ONLY the sample. In order to generalize the conclusions, the hypothesis needs to be run against the general population]

One of the interesting uses of this technique is to measure the difference between the strength of the correlation against the general population versus the strength of the correlation against the judgment sample. In some cases the correlation will be very strong using the judgment sample data, and weak or nonexistent using the general population. Knowing that a correlation applies only to a subset of a population is a fact that can be used quite effectively in achieving a position of marketplace exploitation.
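A minimal sketch of this compare-the-strengths step, on invented customer records: the same measure is computed against the judgment sample and against the base population. The region split and the promotion "hit rate" measure are assumptions chosen for illustration.

```python
# Hypothetical customer records: a judgment sample of one region,
# validated afterward against the base population.
population = [
    {"region": "west", "promo": True,  "bought": True},
    {"region": "west", "promo": True,  "bought": True},
    {"region": "west", "promo": False, "bought": False},
    {"region": "east", "promo": True,  "bought": False},
    {"region": "east", "promo": False, "bought": False},
    {"region": "east", "promo": True,  "bought": True},
]

def hit_rate(records):
    """Of the customers who saw the promotion, what share bought?"""
    saw = [r for r in records if r["promo"]]
    return sum(r["bought"] for r in saw) / len(saw)

judgment_sample = [r for r in population if r["region"] == "west"]
print("judgment sample:", round(hit_rate(judgment_sample), 2))  # strong here
print("population     :", round(hit_rate(population), 2))       # weaker overall
```

When the pattern is strong in the sample but weak in the population, that gap is itself the finding: the correlation belongs to the subpopulation, which is the basis for segmented marketing.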
In the same vein, one judgment sample may exhibit one degree of correlation while another sample exhibits a different degree of correlation, as seen in Figure 1.30.
[Figure 1.30: the sample of data that has been chosen (sample A, sample B, sample C; pattern abc) may well exhibit patterns of correlation that other samples do not. Stated differently, some patterns of data can be detected when looking at a sample that cannot be detected when looking at the population in general. Asking the question "what is different about this sample compared to another sample" can be a very enlightening thing to do]

This difference in degrees of correlation against different sample bases is the basis for doing "bull's eye" marketing, in which one demographic group is singled out for exploitation.
Refreshing Sample Data

While it is necessary to keep the base of data stable in doing heuristic analysis, it is also necessary to periodically refresh the data in order to capture recently emerging trends. Figure 1.31 shows that periodic refreshment must be done.

[Figure 1.31: periodic refreshment of the sample is required in order to keep the sample as up to date as possible]

Fortunately, when looking at long-term trends, it is not necessary to refresh the data on which analysis is being done very frequently. Depending on the data and the analysis being done on the data, it may be necessary to refresh the data no more frequently than every quarter or so.
The general approach to the usage of sampling in data mining and data exploration is summarized in Figure 1.32.

[Figure 1.32: a typical sequence of events - 1: a sample is created from the base population; 2: the sample is iteratively hypothesized against and tested; 3: a pattern is discovered; 4: the strength and validity of the pattern is tested against the base population]
Using Summary Data

Undoubtedly, using sampling techniques is the most popular and the most effective approach to handling the volumes of data that arise in the world of data warehousing and data mining. The use of sampling should be in the bag of tricks of every data warehouse administrator. There are, however, other effective ways to discover patterns of interest to the DSS analyst besides sampling techniques. Indeed, the techniques that will be discussed can be used in conjunction with sampling techniques quite handily.

Using summary data is a very useful technique for finding patterns of interest to the DSS analyst. At first the usage of summary data may seem contradictory because, by definition, summary data does not contain detailed data, and where there is no detailed data, it is questionable how interesting patterns can be located. But summary data can act as a "water witch." In the old West, when it was desired to find water, a pioneer cut a forked stick from a tree and set about "dowsing." The pioneer would walk about holding the forked twig in front of him and would stop when the twig "dipped." It was at this point that the water witch would declare that the place where the twitching of the forked stick occurred was the place to dig for water. In much the same vein, the DSS analyst can use summary data as a forked twig that indicates where to look for interesting business patterns. The difference is that there is a rational basis for using summary data as an indicator of where to look for nuggets of information.
As an example of how summary data can be used, consider the diagram shown in Figure 1.33.

[Figure 1.33: looking at summary data can lead to interesting observations, which in turn can lead to the discovery of patterns - why was there a peak here? is this a believable long term trend? why was there a valley here? why was there such a falloff?]

Figure 1.33 shows a simple summary chart over a period of time. There are several obvious places where the DSS analyst might start to look for interesting correlations of data:
■ at the bottom of a trough, where the trend bottomed out,
■ at the top of a peak, where the trend reached a new high,
■ at the tops of several peaks, to see if a consistent trend is forming, and so forth.
In short, the simple summary graph suggests many likely places to start to look for trends. Summary data can then be used to find the needle in the haystack, or at least to indicate where in the haystack there is a productive place to look for needles. But a simple graph is only one way that summary data can be used. Figure 1.34 shows that a very useful way to analyze summary data is to look at two summarizations and compare them against each other.
[Figure 1.34: comparisons of different kinds of summary data - total industry sales for the year versus our company's annual sales - can be very elucidating. Patterns may emerge that would otherwise go undetected]
In the case of Figure 1.34, total industry numbers are being compared to numbers generated inside the corporation. By comparing and contrasting these two sets of numbers, many interesting observations can be made that might otherwise escape the attention of the DSS analyst. Figure 1.35 shows some of the things the alert DSS analyst might see.

[Figure 1.35: the kinds of observations that come from comparing industry averages to internal numbers (total industry sales for the year versus our company's annual sales) and the questions that can be raised - a: at this point we are selling significantly higher than the industry average, why? b: at this point we are selling significantly lower than the industry average, why? c: at this point we are selling significantly higher than the industry average, why? The questions lead to the discovery of patterns, or at least to the ability to ask the questions that will lead to the discovery of patterns]

The alert DSS analyst looks for places where the current trend runs with or against the industry average. At these points, the DSS analyst is tempted to ask: what are we doing differently than the rest of the industry? When we are succeeding at a faster rate than everyone else, what are we doing right? And when we are failing at a faster rate than the rest of the industry, what are we doing wrong? The contrast with other summary numbers can provide an excellent place to begin closely examining detailed data in search of interesting correlations.
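A minimal sketch of this kind of comparison on invented figures: compute the company's share of industry sales per period and flag the periods that deviate noticeably from the average share. The 20 percent threshold is an arbitrary illustrative choice.

```python
# Hypothetical monthly figures: company sales vs. total industry sales.
industry = [100, 110, 120, 130, 125, 140]
company = [10, 11, 15, 13, 9, 14]

# Market share per month; flag months that deviate from the average share.
shares = [c / i for c, i in zip(company, industry)]
avg_share = sum(shares) / len(shares)

for month, share in enumerate(shares, start=1):
    if abs(share - avg_share) > 0.2 * avg_share:  # illustrative threshold
        direction = "above" if share > avg_share else "below"
        print(f"month {month}: share {share:.3f} is well {direction} average - why?")
```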
Intuition

But summary data is not the only place to start. A very basic place to start, which is often overlooked, is to trust the intuition of the DSS analyst. In many cases, when a DSS analyst sees data, the analyst will have a feeling that something interesting is going on that the analyst cannot put an immediate finger on. Something speaks out to the DSS analyst to look further. Figure 1.36 shows that in some cases the DSS analyst will know where to look more deeply for correlations even though the DSS analyst cannot tell you exactly why a deeper search is warranted.
[Figure 1.36: "this just seems like a good place to start digging. I can just feel the gold under here...." - the intuition of an experienced DSS analyst is always a good place to start looking for nuggets]

The DSS analyst will not be correct in every case. But in some cases the DSS analyst can cut through masses of data simply by trusting his/her intuition.
Further Subdividing Data

Another technique (which is really a variation of sampling) is to further subdivide data so that only relevant fields appear in the data being studied. Figure 1.37 illustrates this technique.

[Figure 1.37: one way to achieve a different perspective of data is to look at subsets of attributes rather than at the entire row of data. This technique cuts down on the volume of data that must be handled while creating a basis for analysis]

In Figure 1.37 the DSS analyst has started the analysis by stripping out only two variables of data. In doing so, there is much less data to be managed. The stripped out data can be efficiently stored and analyzed. As long as the DSS analyst wants to look ONLY at the data that has been stripped out, there is no problem; the analysis can proceed smoothly. But the minute the DSS analyst wishes to correlate the stripped out data with the other data it was originally stored with, the limitations of this approach become manifest. Once the stripped data is removed from its source, it cannot easily be correlated with any other data that it is related to.
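A minimal sketch of stripping out just the fields under study, on invented rows; the field names are hypothetical.

```python
# Hypothetical wide rows; the analyst strips out just two variables.
rows = [
    {"cust": "C1", "amount": 19.95, "region": "west", "channel": "web"},
    {"cust": "C2", "amount": 4.50, "region": "east", "channel": "store"},
    {"cust": "C3", "amount": 250.0, "region": "west", "channel": "store"},
]

# Projection: keep only the fields under study - far less data to manage.
stripped = [(r["amount"], r["region"]) for r in rows]
print(stripped)

# Caveat from the text: once stripped, these pairs can no longer be joined
# back to the other attributes unless a key (e.g., cust) is also kept.
```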
In the same vein, another way to achieve a unique perspective of data is to look at the data by type of data, rather than by actual occurrence. Figure 1.38 shows this approach.
[Figure 1.38: another good way to achieve an interesting perspective of data is to categorize data into classes (type A, type B, type C, ...) and count the occurrences of data in each class. Depending on how the categorization is done, the comparison of the number of occurrences that reside in each class can produce very interesting results]

Figure 1.38 shows that data is grouped into classes. Once the data is grouped into classes, the DSS analyst does studies on the classes of data rather than on the raw data itself. In such a fashion, data can be studied and managed efficiently and succinctly.

[This Tech Topic is Part 1 in the series on data mining and data exploration. In this Tech Topic we have explored the origins of data mining, the considerations of the infrastructure of data mining and the activities of exploration. Part 2 of this Tech Topic will explore how to exploit the patterns discovered in the exploration.]
© 1997 William H. Inmon, all rights reserved