485 52 270KB
English Pages 22 Year 1997
T
E
C
H
T
O
P
I
C
6
DATA MINING: EXPLORING THE DATA PART 2 by W.H. Inmon
[This Tech Topic is part 2 in the series of Tech Topics on data mining and data exploration. It is assumed that the reader has already read Part 1.]
Data Correlation The basis for data mining and data exploration is the correlation of data. When data can be correlated mathematically and from a business basis, assumptions can be made and the basis for commercial exploitation is formed. The groundwork for correlation is the mathematical relationship between two or more variables. There are a wide variety of correlations of data. Figure 2.1 shows some simple types of correlations of data.
a perfect correlation between two variables
a strong correlation between two variables
a weak correlation between two variables different strengths of correlation Figure 2.1 The first correlation in Figure 2.1 is a perfect correlation of data. In this case, for every occurrence of A there is an occurrence of B, and vice versa. Such an occurrence seldom happens, but when it does, the perfect correlation forms a very sound basis for exploitation. A more normal case is a strong correlation of data, which is represented by the second set of data shown in Figure 2.1. In the second set of data, in most cases, where there is an A there is also a B. But in a few cases A will exist when there is no B, and B will sometimes exist where there is no A. This correlation is fairly common and also forms a sound basis for further exploitation. The third correlation shown in Figure 2.1 is a weak correlation. In a weak correlation, in some cases when there exists an A there will also exist a B, or in some cases where there exists a B there will also exist an A. But in many cases A and B will exist independently. As shown, the weak correlation between A and B does not form a particularly good case for commercial exploitation. And, if that is all there is, then nothing more can be done.
© 1997 William H. Inmon, all rights reserved
1
But in a way, weak correlations are the most interesting of all for some very good reasons. The allure of weak correlations is that they may point to important trends that are as yet undiscovered, and because they are undiscovered, can lead to opportunities for exploitation that are as yet unknown. Therefore, there are two very important aspects of weak correlations that are worth exploring: ■ Is the correlation growing more significant over time? ■ Is the correlation very strong for a subset of either A or B? If the weak correlation is growing stronger over time, it is entirely possible that the weak correlation is a harbinger of large new trends that are just now developing. If such is the case, there may be massive opportunities for exploitation. Of course, how weak the correlation is and how fast the strength of the correlation is increasing is very relevant. If a correlation is increasing in strength at glacial speed, it will be very difficult to exploit the correlation. And if the correlation is so weak that it is useless, it will be a long time for the correlation to become strong enough to be able to be exploited. The second case where a weak correlation is of interest is where there is a weak correlation for the entire population, but a quite strong correlation for a subset of the population. In other words, if there are other characteristics of A that can be used to select a subpopulation of A, and if after having selected that subpopulation, the correlation between A and B becomes much stronger, then there will most likely be a significant opportunity for commercial exploitation by targeting the subpopulation of A that strongly correlates to B. For these reasons, weak correlations are some of the most interesting of all the different ways that data can be correlated. Of course, where there is no correlation of data at all, there is little chance for exploitation through the techniques of data mining. There are of course other types of correlations other than correlations based on existence criteria. The correlations that have been discussed are based on whether two variables exist in the presence of each other. Another very common type of correlation is not based on existence at all, but is based on values. As an example, suppose A and B always exist in each other's presence. When A has a value greater than 50, B has a value greater than 100, and when A has a value greater than 100, B has a value greater than 200, and so forth. The correlation is measured not in terms of existence of variables, but in terms of the values of the variables compared to each other.
Multivariate Analysis Of course there are correlations between more than two variables. Figure 2.2 suggests a simple form of multivariate analysis.
multivariate analysis Figure 2.2 The relationships that can be divined doing multivariate analysis can be as interesting to the DSS analyst doing data mining and data exploration on the more common dual variable analysis. When
2
© 1997 William H. Inmon, all rights reserved
the DSS analyst discovers a multivariate correlation, there is the potential for exploitation just as there is in the case of a correlation between two variables. However, the discovery of a multivariate correlation is a difficult thing to accomplish in a data warehouse for two very important reasons: ■ the volume of data in the data warehouse makes analysis of multiple variables a very difficult thing to do, and ■ the number of variables is usually so many that it is not clear which combination of variables should be analyzed together. In addition, even when a multivariate correlation is discovered, the underlying business case may be very spurious. There is then great potential in doing multivariate analysis in a data mining and data exploration effort. However, the DSS analyst should be aware of the obstacles that await.
The Spectrum of Correlation Whether dual variables are correlated or multiple variables are correlated, the result will be a spectrum of strengths of correlation. Figure 2.3 depicts the spectrum of correlation that lies ahead for the DSS analyst.
no correlation very strong whatsoever correlation there is a large spectrum of correlation between variables Figure 2.3 Corresponding to the spectrum of correlation is the spectrum of opportunity that awaits the DSS analyst that would exploit a correlation. As a general rule, the stronger the correlation, the greater the chance that the correlation is already well known. The greater the chance the correlation is well known, the greater the chance it is already being exploited. In other words, for well known correlations, even though the correlation is strong, the opportunity for exploitation is minimal because the competition is already making use of the relationship. However, if a strong relationship develops between two variables that are not well known, then there is a major opportunity for exploitation. For correlations that are not so strong there is a real opportunity for exploitation, if the correlation: ■ has not been widely discovered, ■ is growing in strength, or ■ is very strong for an identifiable subset of the population being studied. The DSS analyst should be aware of the spectrum of strength of correlation and should be aware of what opportunities are possible based on the strength of the correlation.
© 1997 William H. Inmon, all rights reserved
3
The Business Relationship Even where there is a valid mathematical relationship that has been discovered, there is no guarantee that this relationship will lead to an opportunity for exploitation. Figure 2.4 shows that, after the mathematical correlation has been established, there must be an analysis of the underlying business relationship.
f(x) dx business x=1, 10 mathematical just because there is a mathematical relationship between two or more variables does not necessarily mean that there is a business relationship. If there is no business relationship, exploitation will be very difficult to do Figure 2.4 Some of the possibilities in the analyzing of the underlying business relationship are: ■ the correlation is a false positive and has no underlying business relationship whatsoever, ■ the correlation is a previously unknown relationship and is ripe for exploitation, ■ the correlation has a business relationship at its basis, but the business relationship is well known and is already being exploited to the fullest, and ■ the correlation is mathematically valid and has no underlying business relationship, but the relationship is so strong that the correlation will still present opportunities for exploitation. Each of these circumstances will be discussed. In the case where there is a false positive and where there is no business relationship whatsoever, it is important that the DSS analyst know because the DSS analyst can save a large amount of resources by not attempting to try to exploit an opportunity that is not viable. In the case where the DSS analyst has discovered a previously unknown correlation and there is a business basis for the relationship, the DSS analyst has a rare and powerful opportunity for exploitation. In the case where there is both a mathematical relationship and a business basis for the relationship and where the relationship is well known, it is unlikely that there will be an opportunity for exploitation simply because the relationship has already been exploited. In the case where there is a strong mathematical correlation, but no apparent business basis, there are plenty of opportunities. One opportunity is to examine the business basis very carefully to discover whether there really is a basis, however subtle. Where there is a subtle business basis, there will be
4
© 1997 William H. Inmon, all rights reserved
plenty of opportunity for exploitation because it is unlikely that this business relationship will have been found. Even in the case where there is no discernible business basis for a correlation, if the mathematical relationship is strong enough and is consistent enough, there will still be an opportunity for exploitation through the sheer strength of the relationship.
Looking for Correlations The easiest place to start to look for correlations is in the most obvious places. Consider the scenario shown in Figure 2.5.
FIVE AND DIME BUS
foot traffic at the five and dime is heavier on rainy days the starting place to look for correlations is the obvious place Figure 2.5 In Figure 2.5 it is seen that on rainy days that shops near bus stops will get more walk-in traffic. There may not be any purchases that take place, but there will be an abundance of foot traffic. There are many cases where the correlation is obvious — in the summer, a lot of beer is consumed. In the winter, snow chains are frequently sold. In the spring, lawn mowers are popular, and so forth. In the case of obvious correlations, the opportunity lies not so much in the discovery of the correlation but in the creation of novel ways to exploit the correlation.
© 1997 William H. Inmon, all rights reserved
5
As an example of a macro analysis of summary data and its use to discover correlations, consider the graph in Figure 2.6.
$
1 bil 900 mil 800 mil 700 mil 600 mil 500 mil 400 mil 300 mil 200 mil 100 mil Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec retail sales peak at Christmas time an obvious pattern Figure 2.6
Figure 2.6 shows that for a retailer, sales progress throughout the year and peak in December, at the time of the Christmas holidays. The summary table suggests that when this year's highs are greater than last year's highs and this year's lows are less than last year's lows, there might be some interesting correlations of data that could be observed. At an even more macro level, consider the profitability over a long period of time of several insurance companies, as seen in Figure 2.7.
100 mil
50 mil
0
-50 mil
1987 1988 1989 1990 1991 1992 1993 1994 1995 1996
insurance company A insurance company B insurance company C insurance company D
comparing the profitability of insurance companies over time - how have they all been alike? - how have they been different? Figure 2.7
6
© 1997 William H. Inmon, all rights reserved
In Figure 2.7, four insurance companies have shown their profitability over a decade. The obvious points of interest (and the most likely place to look for interesting correlations) are: ■ points where one company falls lower than all other companies or where one company soars higher than other companies, and ■ points where one company is operating on a different trend line than other companies. These macro observations lead to places where productive correlations can be found.
Finding Interesting Correlations at the Micro Level Looking at macro indicators is a good way to find where in the grand scheme of things interesting correlations will reside, but analysis at the macro level can never get to the level of detail that will satisfy the DSS analyst. Analysis must proceed to the micro level in order to be productive. Figure 2.8 shows accumulation of a large number of sales data. sale amt = 2.98 sale amt = 65.98 sale amt = 3.87 sale amt = 48.76 sale amt = 3.97 sale amt = 2.76 sale amt = 1.87 sale amt = 3.75 sale amt = 4.97 sale amt = 5.01 sale amt = 2.98 sale amt = 4.65 sale amt = 3.97 sale amt = 2.86 sale amt = 1.86 sale amt = 2.97
sale amt = 4.65 sale amt = 3.65 sale amt = 2.87 sale amt = 3.97 sale amt = 2.87 sale amt = 78.32 sale amt = 3.54 sale amt = 4.33 sale amt = 4.72 sale amt = 3.34 sale amt = 1.32 sale amt = 3.37 sale amt = 3.97 sale amt = 1.45 sale amt = 2.22 sale amt = 8.42
sale amt = 4.98 sale amt = 3.22 sale amt = 2.01 sale amt = 3.97 sale amt = 2.72 sale amt = 4.16 sale amt = 3.84 sale amt = 1.76 sale amt = 3.77 sale amt = 1.22 sale amt = 4.42 sale amt = 3.20 sale amt = 3.97 sale amt = 1.53 sale amt = 3.66 sale amt = 1.10
sale amt = 5.56 sale amt = 3.09 sale amt = 5.92 sale amt = 4.46 sale amt = 3.31 sale amt = 1.29 sale amt = 3.97 sale amt = 1.79 sale amt = 3.29 sale amt = 1.56 sale amt = 2.87 sale amt = 2.97 sale amt = 3.97 sale amt = 1.72 sale amt = 3.97 sale amt = 2.73
sale amt = 3.09 sale amt = 3.37 sale amt = 1.87 sale amt = 4.92 sale amt = 17.29 sale amt = 2.33 sale amt = 3.97 sale amt = 1.79 sale amt = 3.28 sale amt = 2.38 sale amt = 1.29 sale amt = 3.41 sale amt = 1.87 sale amt = 1.75 sale amt = 2.96 sale amt = 3.19
sale amt = 1.71 sale amt = 3.09 sale amt = 2.86 sale amt = 2.19 sale amt = 3.19 sale amt = 2.37 sale amt = 2.18 sale amt = 1.98 sale amt = 2.07 sale amt = .98 sale amt = 1.62 sale amt = 3.82 sale amt = 28.27 sale amt = .93 sale amt = 1.67 sale amt = 2.97
find all sales amounts greater than 10.00 Figure 2.8 One of the ways to find where there are interesting correlations is to ask — where are the exceptions to the norm? In the case of the data found in Figure 2.8, the DSS analyst has honed in on all sales greater than $10.00. There are five sales greater than $10.00. In fact, the sales greater than $10.00 are so significantly larger than all the other sales that there must be something interesting about the sales. The DSS analyst is tipped off to look at such things as: ■ What items were purchased? ■ Were the items purchased as a group? ■ Who were the items purchased by? ■ When were the items purchased? These questions may not lead to any interesting observations, but there is something unusual about the purchases greater than $10.00. Discovering what other data correlates to the sale amount is a productive place to start the analysis of the data.
© 1997 William H. Inmon, all rights reserved
7
In the same vein, finding the smallest and the largest sales may well lead to productive results. Figure 2.9 shows this simple analysis. sale amt = 2.98 sale amt = 65.98 sale amt = 3.87 sale amt = 48.76 sale amt = 3.97 sale amt = 2.76 sale amt = 1.87 sale amt = 3.75 sale amt = 4.97 sale amt = 5.01 sale amt = 2.98 sale amt = 4.65 sale amt = 3.97 sale amt = 2.86 sale amt = 1.86 sale amt = 2.97
sale amt = 4.65 sale amt = 3.65 sale amt = 2.87 sale amt = 3.97 sale amt = 2.87 sale amt = 78.32 sale amt = 3.54 sale amt = 4.33 sale amt = 4.72 sale amt = 3.34 sale amt = 1.32 sale amt = 3.37 sale amt = 3.97 sale amt = 1.45 sale amt = 2.22 sale amt = 8.42
sale amt = 4.98 sale amt = 3.22 sale amt = 2.01 sale amt = 3.97 sale amt = 2.72 sale amt = 4.16 sale amt = 3.84 sale amt = 1.76 sale amt = 3.77 sale amt = 1.22 sale amt = 4.42 sale amt = 3.20 sale amt = 3.97 sale amt = 1.53 sale amt = 3.66 sale amt = 1.10
sale amt = 5.56 sale amt = 3.09 sale amt = 5.92 sale amt = 4.46 sale amt = 3.31 sale amt = 1.29 sale amt = 3.97 sale amt = 1.79 sale amt = 3.29 sale amt = 1.56 sale amt = 2.87 sale amt = 2.97 sale amt = 3.97 sale amt = 1.72 sale amt = 3.97 sale amt = 2.73
sale amt = 3.09 sale amt = 3.37 sale amt = 1.87 sale amt = 4.92 sale amt = 17.29 sale amt = 2.33 sale amt = 3.97 sale amt = 1.79 sale amt = 3.28 sale amt = 2.38 sale amt = 1.29 sale amt = 3.41 sale amt = 1.87 sale amt = 1.75 sale amt = 2.96 sale amt = 3.19
sale amt = 1.71 sale amt = 3.09 sale amt = 2.86 sale amt = 2.19 sale amt = 3.19 sale amt = 2.37 sale amt = 2.18 sale amt = 1.98 sale amt = 2.07 sale amt = .98 sale amt = 1.62 sale amt = 3.82 sale amt = 28.27 sale amt = .93 sale amt = 1.67 sale amt = 2.97
find the largest and the smallest Figure 2.9 And along the same line, looking for the five largest (or the five smallest) may be productive. Figure 2.10 shows this criteria.
sale amt = 2.98 sale amt = 65.98 sale amt = 3.87 sale amt = 48.76 sale amt = 3.97 sale amt = 2.76 sale amt = 1.87 sale amt = 3.75 sale amt = 4.97 sale amt = 5.01 sale amt = 2.98 sale amt = 4.65 sale amt = 3.97 sale amt = 2.86 sale amt = 1.86 sale amt = 2.97
sale amt = 4.65 sale amt = 3.65 sale amt = 2.87 sale amt = 3.97 sale amt = 2.87 sale amt = 78.32 sale amt = 3.54 sale amt = 4.33 sale amt = 4.72 sale amt = 3.34 sale amt = 1.32 sale amt = 3.37 sale amt = 3.97 sale amt = 1.45 sale amt = 2.22 sale amt = 8.42
sale amt = 4.98 sale amt = 3.22 sale amt = 2.01 sale amt = 3.97 sale amt = 2.72 sale amt = 4.16 sale amt = 3.84 sale amt = 1.76 sale amt = 3.77 sale amt = 1.22 sale amt = 4.42 sale amt = 3.20 sale amt = 3.97 sale amt = 1.53 sale amt = 3.66 sale amt = 1.10
sale amt = 5.56 sale amt = 3.09 sale amt = 5.92 sale amt = 4.46 sale amt = 3.31 sale amt = 1.29 sale amt = 3.97 sale amt = 1.79 sale amt = 3.29 sale amt = 1.56 sale amt = 2.87 sale amt = 2.97 sale amt = 3.97 sale amt = 1.72 sale amt = 3.97 sale amt = 2.73
find the five largest Figure 2.10
8
© 1997 William H. Inmon, all rights reserved
sale amt = 3.09 sale amt = 3.37 sale amt = 1.87 sale amt = 4.92 sale amt = 17.29 sale amt = 2.33 sale amt = 3.97 sale amt = 1.79 sale amt = 3.28 sale amt = 2.38 sale amt = 1.29 sale amt = 3.41 sale amt = 1.87 sale amt = 1.75 sale amt = 2.96 sale amt = 3.19
sale amt = 1.71 sale amt = 3.09 sale amt = 2.86 sale amt = 2.19 sale amt = 3.19 sale amt = 2.37 sale amt = 2.18 sale amt = 1.98 sale amt = 2.07 sale amt = .98 sale amt = 1.62 sale amt = 3.82 sale amt = 28.27 sale amt = .93 sale amt = 1.67 sale amt = 2.97
Yet another way to analyze the sales data, looking for a useful way to understand the data and to determine where there might be interesting correlations, is to create a profile of the data. Certainly averages and medians (i.e., mid points) can be calculated, and those numbers are meaningful. But another meaningful way to characterize the data is in terms of a “profile.” Figure 2.11 depicts a simple profile of the sales data. sale amt = 2.98 sale amt = 65.98 sale amt = 3.87 sale amt = 48.76 sale amt = 3.97 sale amt = 2.76 sale amt = 1.87 sale amt = 3.75 sale amt = 4.97 sale amt = 5.01 sale amt = 2.98 sale amt = 4.65 sale amt = 3.97 sale amt = 2.86 sale amt = 1.86 sale amt = 2.97
sale amt = 4.65 sale amt = 3.65 sale amt = 2.87 sale amt = 3.97 sale amt = 2.87 sale amt = 78.32 sale amt = 3.54 sale amt = 4.33 sale amt = 4.72 sale amt = 3.34 sale amt = 1.32 sale amt = 3.37 sale amt = 3.97 sale amt = 1.45 sale amt = 2.22 sale amt = 8.42
sale amt = 4.98 sale amt = 3.22 sale amt = 2.01 sale amt = 3.97 sale amt = 2.72 sale amt = 4.16 sale amt = 3.84 sale amt = 1.76 sale amt = 3.77 sale amt = 1.22 sale amt = 4.42 sale amt = 3.20 sale amt = 3.97 sale amt = 1.53 sale amt = 3.66 sale amt = 1.10
from .00 to .99 from 1.00 to 1.99 from 2.00 to 2.99 from 3.00 to 3.99 from 4.00 to 4.99 from 5.00 to 5.99 from 6.00 to 6.99 greater than 6.99
sale amt = 5.56 sale amt = 3.09 sale amt = 5.92 sale amt = 4.46 sale amt = 3.31 sale amt = 1.29 sale amt = 3.97 sale amt = 1.79 sale amt = 3.29 sale amt = 1.56 sale amt = 2.87 sale amt = 2.97 sale amt = 3.97 sale amt = 1.72 sale amt = 3.97 sale amt = 2.73
2 21 23 31 10 3 0 6
sale amt = 3.09 sale amt = 3.37 sale amt = 1.87 sale amt = 4.92 sale amt = 17.29 sale amt = 2.33 sale amt = 3.97 sale amt = 1.79 sale amt = 3.28 sale amt = 2.38 sale amt = 1.29 sale amt = 3.41 sale amt = 1.87 sale amt = 1.75 sale amt = 2.96 sale amt = 3.19
sale amt = 1.71 sale amt = 3.09 sale amt = 2.86 sale amt = 2.19 sale amt = 3.19 sale amt = 2.37 sale amt = 2.18 sale amt = 1.98 sale amt = 2.07 sale amt = .98 sale amt = 1.62 sale amt = 3.82 sale amt = 28.27 sale amt = .93 sale amt = 1.67 sale amt = 2.97
create a profile
Figure 2.11 The profile is useful to determine if anything unusual is happening to the data. Within the population of the data itself there may be hidden trends and ratios. The profile created in Figure 2.11 shows that the vast majority of sales are between $1.00 and $4.00. Anything outside of that range is an anomaly. The sales form a slightly skewed bell curve around the $3.00 sales mark. The use of a profile is to characterize the masses of data in a perspective that is not immediately obvious from examining the details of the data directly.
© 1997 William H. Inmon, all rights reserved
9
Still another way to do a detailed analysis of data is to use a scatter chart. Figure 2.12 shows a simple scatter chart. shelf time - 5 days, shelf time - 1 days, shelf time - 20 days, shelf time - 3 days, shelf time - 21 days, shelf time - 10 days, shelf time - 3 days, shelf time - 35 days, shelf time - 13 days, shelf time - 5 days, shelf time - 18 days, shelf time - 16 days, shelf time - 17 days, shelf time - 8 days, shelf time - 10 days, shelf time - 12 days, shelf time - 6 days, shelf time - 4 days,
cost - 10.99 cost - .75 cost - 89.95 cost - 1.75 cost - 65.00 cost - 15.95 cost - 59.99 cost - 5.95 cost - 3.75 cost - 2.99 cost - 3.76 cost - 17.96 cost - 2.87 cost - 8.95 cost - 17.75 cost - 98.00 cost - 6.97 cost - 2.99
shelf time - 4 days, shelf time - 1 days, shelf time - 22 days, shelf time - 2 days, shelf time - 27 days, shelf time - 10 days, shelf time - 32 days, shelf time - 4 days, shelf time - 8 days, shelf time - 3 days, shelf time - 6 days, shelf time - 13 days, shelf time - 1 days, shelf time - 3 days, shelf time - 11 days, shelf time - 23 days, shelf time - 3 days, shelf time - 2 days,
cost - 12.67 cost - 1.19 cost - 65.98 cost - 2.21 cost - 90.00 cost - 16.00 cost - 65.98 cost - 6.95 cost - 3.79 cost - 2.76 cost - 3.87 cost - 19.44 cost - 1.77 cost - 2.61 cost - 49.95 cost - 97.34 cost - 8.97 cost - 4.55
shelf time - 7 days, shelf time - 3 days, shelf time - 24 days, shelf time - 4 days, shelf time - 18 days, shelf time - 13 days, shelf time - 27 days, shelf time - 14 days, shelf time - 6 days, shelf time - 9 days, shelf time - 3 days, shelf time - 18 days, shelf time - 2 days, shelf time - 3 days, shelf time - 15 days, shelf time - 19 days, shelf time - 4 days, shelf time - 1 days,
.
40 35 30 days
.
25
..
20 15 10 5
.. .. ... . .. . ..... ... ...... . . . ..
.
cost - 16.82 cost - 5.98 cost - 4.82 cost - 3.56 cost - 156.33 cost - 14.98 cost - 129.34 cost - 4.21 cost - 5.88 cost - 6.98 cost - 1.29 cost - 89.99 cost - 2.86 cost - 12.87 cost - 56.99 cost - 65.49 cost - 7.99 cost - 5.98
. . .. .. ... .. .
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 dollars
correlating the shelf time of a product against the cost of the item at final sale Figure 2.12 In Figure 2.12, the shelf time of an item that has been sold is correlated with the price of the item. Two noticeable trends emerge from the scatter chart. Figure 2.13 shows those two trends.
.
40 35 30 days
.
25 20 15 10 5
..
.. .. ... . . . . . . .... ... ..... . . . ..
.
. . .. .. ... .. .
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 dollars
identifying the major trends using a scatter chart Figure 2.13
10
© 1997 William H. Inmon, all rights reserved
The two trends shows that some high-cost items sell very quickly. The length of time other items spend on the shelf (roughly) varies according to the price of the item. The contrast between the two trends that have been detected by the scatter chart are of real interest to the DSS analyst. The natural question to ask is — for slow-moving items, what characteristic is there about the item that separates it from a fast-moving item? If the DSS analyst can make that determination, the business person will have a real chance to price and advertise items so that the amount of shelf time will be minimal, or that the company’s profits for long shelf-time items will be optimal. Still another way to analyze detailed data is by creating ratios. Figure 2.14 shows some ratios that have been created from the shelf time versus cost data. shelf time - 5 days, shelf time - 1 day, shelf time - 20 days, shelf time - 3 days, shelf time - 21 days, shelf time - 10 days, shelf time - 3 days, shelf time - 35 days, shelf time - 13 days, shelf time - 5 days, shelf time - 18 days, shelf time - 16 days, shelf time - 17 days, shelf time - 8 days, shelf time - 10 days, shelf time - 12 days, shelf time - 6 days, shelf time - 4 days,
cost - 10.99 cost - .75 cost - 89.95 cost - 1.75 cost - 65.00 cost - 15.95 cost - 59.99 cost - 5.95 cost - 3.75 cost - 2.99 cost - 3.76 cost - 17.96 cost - 2.87 cost - 8.95 cost - 17.75 cost - 98.00 cost - 6.97 cost - 2.99
shelf time - 4 days, shelf time - 1 day, shelf time - 22 days, shelf time - 2 days, shelf time - 27 days, shelf time - 10 days, shelf time - 32 days, shelf time - 4 days, shelf time - 8 days, shelf time - 3 days, shelf time - 6 days, shelf time - 13 days, shelf time - 1 days shelf time - 3 days, shelf time - 11 days, shelf time - 23 days, shelf time - 3 days, shelf time - 2 days,
cost - 12.67 cost - 1.19 cost - 65.98 cost - 2..21 cost - 90.00 cost - 16.00 cost - 65.98 cost - 6.95 cost - 3.79 cost - 2.76 cost - 3.87 cost - 19.44 cost - 1.77 cost - 2.61 cost - 49.95 cost - 97.34 cost - 8.97 cost - 4.55
shelf time - 7 days, shelf time - 3 day, shelf time - 24 days, shelf time - 4 days, shelf time - 18 days, shelf time - 13 days, shelf time - 27 days, shelf time - 14 days, shelf time - 6 days, shelf time - 9 days, shelf time - 3 days, shelf time - 18 days, shelf time - 2 days, shelf time - 3 days, shelf time - 15 days, shelf time - 19 days, shelf time - 4 days, shelf time - 1 day,
cost - 16.82 cost - 5.98 cost - 4.82 cost - 3.56 cost - 156.33 cost - 14.98 cost - 129.34 cost - 4.21 cost - 5.88 cost - 6.98 cost - 1.29 cost - 89.99 cost - 2.86 cost - 12.87 cost - 56.99 cost - 65.49 cost - 7.99 cost - 5.98
creating ratios can be very elucidating cost/shelf time = 4.97 for the entire population cost/shelf time by dollar value from .00 to 2.00 from 2.01 to 4.00 from 4.01 to 6.00 from 6.01 to 10.00 from 10.01 to 15.00 from 15.01 to 50.00 greater than 50.00
.54 1.02 1.08 2.17 4.96 10.98 10.42
creating ratios can be one metric that can bring to light many different examples that are of interest Figure 2.14 The first ratio shown in Figure 2.14 is that of the ratio of cost divided by shelf time for the entire population. The value that has been calculated is $4.97. The second set of ratios that have been calculated in Figure 2.14 are a profile set of ratios that calculate shelf time by dollar value. The set of profiles shown in Figure 2.14 shows that as the price of an item grows larger, the longer the item can be expected to remain on the shelf. Such profiles of ratios can be very elucidating.
© 1997 William H. Inmon, all rights reserved
11
There is a word of caution about ratios. Ratios can be very deceiving. A ratio says nothing about the number of calculations that went into the ratio. A ratio representing 1,000,000 calculations is a much more stable and believable ratio than one that represents 10 calculations. Figure 2.15 makes this point.
cost/shelf time by dollar value -
number of occurrences
12,980 from .00 to 2.00 .54 9,871 from 2.01 to 4.00 1.02 14,971 from 4.01 to 6.00 1.08 7,517 from 6.01 to 10.00 2.17 2,981 from 10.01 to 15.00 4.96 1,987 from 15.01 to 50.00 10.98 716 greater than 50.00 10.42 in order to be most effective, ratios should be accompanied by the number of occurrences used in calculating the ratio Figure 2.15 In Figure 2.15 some ratios have as many as 15,000 calculations that were made to form the ratio. Other ratios have had less than 1,000 calculations as the basis for their calculation. The more calculations a ratio represents, the more believable the ratio. The micro techniques of data analysis that have been discussed are valid and appropriate for data in the data warehouse environment. However, given the volumes of data that are found in a data warehouse and the volumes of data on which data mining and data exploration is done, it is unreasonable to think that micro analysis will be done on the entire body of data. There simply is too much data to entertain such an activity. Instead, from a strategic standpoint, it is necessary to break the body of data into subpopulations for analysis, then use the techniques of micro analysis on those subpopulations. The selection of the subpopulation is one of the secrets to doing effective data mining. As a rule, it is wise to select intuitive subsets. Looking at demographics to select one subset of the population is a good place to start.
12
© 1997 William H. Inmon, all rights reserved
Figure 2.16 shows the strategic approach to the selection of subpopulations.
subset D
subset C
subset A subset B
breaking the large mass into subsets and identifying the distinguishing characteristics of each subset is an extremely useful thing to do. Each subset will have its own peculiarities, and the differences between subsets can present great marketing opportunities Figure 2.16 Even if it were possible to do a massive analysis across huge volumes of data, the presence and discovery of false positives makes such an approach unwise. As a rule the large population of data is divided into large subsets. The subsets then set the parameters for profiles of data across the population. Figure 2.17 shows some of the common ways that subdivision of data and the creation of profiles can be done. 12 9
3 6
accidents between: 5:00 am - 6:00 am - 23 6:00 am - 7:00 am - 98 7:00 am - 8:00 am -103 8:00 am - 9:00 am -291 ........................................ small businesses by county Otero - 951 El Paso - 8,975 Terrell - 10,876 Bexar - 128,081 Austin - 87,917 ................................
profiles are a good way to characterize a population. Three common ways of creating a population is by time, by location, or by demographics
population by characteristic female, 21-45 years, college degree female, 46 and older, college degree male, 21-45 years, college degree male, 46 and older, college degree
- 29 - 34 - 62 - 13
Figure 2.17
© 1997 William H. Inmon, all rights reserved
13
Figure 2.17 shows that profiles can be created along the boundaries of time, geography, or demographics of the population. These three ways of creating subsets of a population are very normal and correspond to the ways that businesses react with their customers. However, there is nothing that says that subpopulations which are studied by the data analyst have to be broken up in one dimension. It is entirely possible (even likely) that multiple dimensions can be used to sub divide data, as seen in Figure 2.18.
of course, the dimensions of demographics can be combined. In this case, time, location, and demographics are represented in a single profile 1980 Texas - females over 21 - 12,987,245 Texas - females under 21 - 3,776,430 Texas - males over 21 - 12,389,339 Texas - males under 21 - 3,798,229 Arkansas - females over 21 - 2,872,228 Arkansas - females under 21 - 640,449 Arkansas - males over 21 - 2,640,991 Arkansas - males under 21 - 625,981 .............................................................. ...............................................................
1990 Texas - females over 21 - 13,997,302 Texas - females under 21 - 3,807,208 Texas - males over 21 - 12,902,810 Texas - males under 21 - 3,982,228 Arkansas - females over 21 - 2,972,220 Arkansas - females under 21 - 643.990 Arkansas - males over 21 - 2,972,220 Arkansas - males under 21 - 638,449 .............................................................. ...............................................................
Figure 2.18 In Figure 2.18 the three dimensions have been used to subdivide the population that is studied by the DSS analyst. There is one set of data for 1980, another for 1990, and so forth. Within the 1980 data, there is a subdivision of data by state, then by gender within state. Each of the subclassifications of data has then has its micro analysis of data done on it.
Exploitation Once the patterns have been discovered and the subpopulations been discerned, how does the DSS analyst go about exploiting the correlations that have been discovered? The first thing the DSS analyst can do is to look in the right place. And exactly where are the right places? (Or at least where have been the most productive places to look for exploitation?) Figure 2.19 illustrates the most productive places to look for opportunities for exploitation.
14
© 1997 William H. Inmon, all rights reserved
UP
$
CASHIER
as near to the mainline business as possible
$ as near to the cash flow of the company as is possible
as near to the customer as possible
1.75
as near to the sale as possible
where to do data mining? Figure 2.19 Figure 2.19 shows that the most productive places to look for opportunities for exploitation have been as near to: ■ the mainline business as possible, ■ the cash flow of the business as possible, ■ the customer as possible, and ■ the sale as possible. The mainline business is, of course, different for every company. A manufacturer looks to the efficient production and packaging of goods. A retailer looks to the stocking of the SKU and its tracking during the sales process. A banker looks to the analysis of profitability. An insurance executive looks
© 1997 William H. Inmon, all rights reserved
15
to the selling of new policies, the handling of claims, the risk of going into new lines of business. The telecommunications manager looks to the acquisition of market share and the satisfaction of the customer. Each business is different in this regard. There indeed are novel ways to use data mining and data exploration outside of the arenas discussed; however, the classical use of the techniques of mining have been in the arena discussed. The essence of the data miner is to be able to anticipate the marketplace and to be able to anticipate it before the competition does so. In anticipation, the data miner is able to achieve marketplace opportunity. Figure 2.20 shows that through data mining the corporation knows where the marketplace is heading or is able to discover niches in the marketplace that have been otherwise overlooked.
anticipating the marketplace - direct sales - new products - new sales - establish market share - niche advertising - influence buying patterns - establish initial buying pattern Figure 2.20 Exactly how can the marketplace be anticipated? There are a host of ways that anticipation can be done, including: ■ direct sales opportunities, ■ the creation of new products and new packaging of old products, ■ new sales to undiscovered market niches, ■ establishment of market share and market momentum, ■ niche advertising, ■ the influencing of buying patterns of the customer, and ■ the establishment of initial buying patterns (in the hopes that the initial buying pattern will be the start of a long term habit). Through the use of correlations, the DSS analyst creatively uses the correlation to position products and packages to take advantage of the information about the habits of consumers.
16
© 1997 William H. Inmon, all rights reserved
To be a bit more specific about how exploitation might occur, consider the suggestions made in Figure 2.21.
event A
event B
when event A happens, event B usually happens - make sure that facilitators of A are in sync with facilitators of B - tie A and B together as a product - tie A and B together in a package - tie A and B together in an advertisement - promote A and B together - restock A and B on the same schedule - position A and B closely together in the same location - when a change is detected in A anticipate the same change in B - continue to check the strength of the A and B relationship - continue to check the relationship of A and B with other variables - attempt to understand the business basis of A and B Figure 2.21
Reusing Exploration There is no doubt that data mining and exploration requires a large amount of resources, however successful or unsuccessful the mining and exploration might be. Figure 2.22 shows that the data mining and data exploration effort consumes considerable resources.
market share
analysis
a lot of work goes into creating an effective exploitation of information Figure 2.22 In order to make the most use of resources as possible, it is a wise decision to try to salvage as much of the data mining and data exploration as possible. At the end of the data mining and data exploration effort, rather than throw everything away, trying to put as much in order as possible so that the next data mining and data exploration effort that arises can make use of work that has already been done is a wise policy. If every new data mining and data exploration effort must build their entire infrastructure from scratch, there is the likelihood of massive amounts of wasted and duplication of effort. Figure 2.23 shows that there is considerable infrastructure that might be saved from one data mining effort to the next.
© 1997 William H. Inmon, all rights reserved
17
market share
analysis
when it comes time to do another data mining/data exploration effort, it is a waste to have to recreate the entire infrastructure from scratch all over again Figure 2.23 The first step of doing data mining and data exploration from the data warehouse is that of the analysis done by the explorer. The types of work results that might be saved by the explorer include: ■ intermediate analysis results, ■ analytical programs, and ■ a metadata record of the iterative processing that has occurred. Figure 2.24 shows the results of analysis that might be saved in preparation for the next exploration.
analysis metadata
intermediate analyses
analytical programs
metadata
storing intermediate analysis, analytical programs, and metadata are all important to enabling the data miner/data explorer to be able to build on the work that has already been done Figure 2.24 In addition to the intermediate results that might be saved, there are some other dimensions of heuristic processing that might be saved as well. Figure 2.25 suggest some of the DSS parameters that may be of use to future DSS analysts.
18
© 1997 William H. Inmon, all rights reserved
analysis metadata
intermediate analyses
analytical programs
metadata
of special importance are - assumptions - sequence of processing - description of iterations - structure of data - state of base data there are some aspects of processing that are peculiar to the world of DSS analytical processing that are not found elsewhere in the world of operational or other DSS processing Figure 2.25 In Figure 2.25 it is suggested that such things as: ■ assumptions, ■ sequence of iterative processing, ■ a description of each of the iterations, ■ a description of the structure of the data read and the results achieved, and ■ a description of the base data and its state are all useful descriptions for the analyst trying to reconstruct the work that has been done and to salvage partial results. The analytical results that have been achieved are also of interest to the explorer who wishes to build on previous efforts. Figure 2.26 shows some of the details that describe the results that might be saved.
analysis analytical results analytical results if analytical results are to be reused - reports certain information about the analysis - spreadsheets must be retained and stored in an - temporary analysis accessible format - distilled data - description of content - multi dimensional data - sequence of iteration - analyst name - time of analysis - creating program - assumptions - interpretation of result - next iteration embarked on - related documentation - physical name of analysis - location of analysis - size of analysis
Figure 2.26
© 1997 William H. Inmon, all rights reserved
19
And finally, the results achieved by the attempts at exploitation are measured in the marketplace. Figure 2.27 shows the traditional marketplace measurements.
market share
analysis
measuring the results of exploitation - increased market share - increased sales - increased profitability - expanded product line - increased revenue - increased "pull through" sales Figure 2.27
Tools The tools that are used for data mining and data exploration can be divided into one of several camps. Figure 2.28 shows the different classifications of tools useful in the world of data mining and data exploration.
tools for data mining/exploration market share
analysis 3
2
4 1 1 - DATA WAREHOUSE
3 - ANALYSIS PREPARATION
- data warehouse integration, transformation - data warehouse metadata - relational dbms - parallel hardware platforms 2 - ANALYSIS - DSS analyst metadata tools - query formulation - query storage - spreadsheet - multidimensional dbms - departmental platforms/ work stations - graphical displays - statistical processing - data selection, transport
- graphical display
4 - EXPLOITATION - corporate measurement of results
some of the tools for data mining/data exploration
Figure 2.28
20
© 1997 William H. Inmon, all rights reserved
Summary The two Tech Topics on data mining and data exploration began with a discussion of an approach to data mining. The steps in the approach are: ■ infrastructure development ■ exploration ■ analysis ■ interpretation ■ exploitation The steps in the approach are heuristically derived. As such, there is no particular order in which the steps are executed. There are two types of DSS analyst — explorers and farmers. Explorers are people who do not know what they want, but will know it when they find it. Farmers are people who know what they want and regularly find it. Both types of people are served in the interest of data mining and data exploration. Data mining and data exploration occurs at two levels — micro and macro. Both levels of exploration are necessary in order to be effective. The basis of data mining is correlating different types of data to each other. There are many ways to correlate different types of data — by units of data, by time, by geography, by demographics, and so forth. When correlations are spotted, trends can be developed. There are different kinds of trends: ■ long term trends, ■ short term trends, and ■ sub trends, etc. The data warehouse sets the stage for data mining and data exploration for a variety of reasons, such as: ■ The stability of the data in the data warehouse. The DSS analyst knows that the changes in results are a result of the changes in hypothesis, not in the data when using a data warehouse. ■ The integrated data found in the warehouse. ■ The historical data found in the warehouse. ■ The metadata that describes the content of the warehouse. ■ The summary data found in the warehouse, and so forth. The largest problem the DSS analyst has in doing data mining and data exploration is that of the management of the volume of data found in the data warehouse. There is so much data that the DSS analyst drowns in it. Not only does the volume of data hide important relationships, the sheer volume of data ensures that some “false positives” will be found. Sampling is a good technique to cut down the volumes of data that need to be manipulated. Random samples can be drawn. Judgment samples can be drawn. If the judgment samples are drawn along the lines of demographic analysis, the sampling can be a very powerful technique. Summary data can be very instructive as to directing the DSS analyst where to look for interesting correlations. The intuition of the experienced DSS analyst should not be discounted as well.
© 1997 William H. Inmon, all rights reserved
21
Correlations between variables can be analyzed in many ways. One way is through the existence of the variables to each other. Another way is through the analysis of the values of the variables as they occur with each other. Once the mathematical basis is discovered, the business basis for the relationship needs to be examined. In some cases, there will be no business basis, and in other cases there will be a business basis. When the business basis of the relationship is discovered, the result can be interpreted and exploitation can occur. Analysis occurs at a macro level and a micro level. At the micro level there are many ways to analyze relationships of data. Some of the ways include: ■ looking at the mean, ■ creating a profile, ■ looking at the highest and lowest, ■ looking at the median, ■ creating scatter diagrams, ■ creating ratios, and so forth. Profiles are a good way to characterize a population. Profiles can be made in many ways — by time, by geography, and by demographics, or by a combination of the characteristics. The best chance of exploitation classically has been as near to : ■ the mainline business as possible, ■ the cash flow of the company as possible, ■ the customer as possible, or ■ the sale as possible. The marketplace can be anticipated in many ways. Some of the classical positionings of the marketplace have been: ■ in direct sales, ■ in new products and new packages, ■ in finding new sales, ■ in establishing market share, ■ in niche advertising and promotions, ■ influencing buying patterns, ■ in establishing new buying patterns, and so forth.
22
© 1997 William H. Inmon, all rights reserved