/N '";f'UT) a XT : 0 and that their sum is unity, that is, notation.
TABLE 2.6.3
Observed and Expected Frequencies of Lengths of Run

Length       1    2    3    4     5    6    7    8    ≥9   Total
Obs. freq.  110   40   26   16    2    4    1    1    0    200
Exp. freq.  100   50   25   12.5  6.2  3.1  1.6  0.8  0.8  200
P₁ + P₂ + P₃ + . . . + P_k = 1        (2.6.1)
(The three dots are read "and so on.") This sum can be written in a shorter form,

Σ_{j=1}^k Pⱼ = 1        (2.6.2)

by use of the summation sign Σ. The expression Σ_{j=1}^k denotes the sum of the terms immediately following the Σ, from the term with j = 1 to the term with j = k. The sample size is denoted by n. Expected frequencies are therefore nPⱼ.

EXAMPLE 2.6.1—The probability distribution of lengths of run is a discrete distribution in which the number of distinct lengths—values of Xⱼ—has no limit. Given an endless supply of random digits, you might sometimes find a run of length 20, or 53, or any stated length. If you add the Pⱼ values, 1/2, 1/4, 1/8, and so on, you will find that the sum gets nearer and nearer to 1. Hence the result ΣPⱼ = 1 in (2.6.2) holds for such distributions also. By the time you add runs of length 9 you will have reached a total slightly greater than 0.998.

EXAMPLE 2.6.2—(i) Form a cumulative frequency distribution from the sample frequency distribution of 200 lengths of run in table 2.6.3. Cumulate from the longest run downwards; i.e., find the number of runs ≥8, ≥7, and so on. Find the cumulative percent frequencies. What percent of runs are of length (ii) greater than 2; (iii) less than 2; (iv) between 3 and 6, inclusive, i.e., of length 3, 4, 5, and 6? Ans. (ii) 25%, (iii) 55%, (iv) 24%.

EXAMPLE 2.6.3—How well do the estimates from example 2.6.2 compare with the correct percentages obtained from the probability distribution? Ans. Quite well; the correct percentages are (ii) 25%, (iii) 50%, (iv) 23.4%.
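As a numerical check on table 2.6.3, here is a minimal sketch in Python; it assumes only Pⱼ = 1/2ʲ and n = 200, as stated above.

```python
# Expected frequencies of run lengths for n = 200 runs, with P(j) = 1/2**j
n = 200
exp = [n * 0.5**j for j in range(1, 9)]   # lengths 1 through 8
tail = n - sum(exp)                       # lengths >= 9, by subtraction

for length, f in enumerate(exp, start=1):
    print(length, round(f, 1))
print(">=9", round(tail, 1))
# 100, 50, 25, 12.5, 6.2, 3.1, 1.6, 0.8, and 0.8 for >=9,
# matching the "Exp. freq." row of table 2.6.3
```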
TECHNICAL TERMS
class limits
continuous data
cumulative frequency distribution
cumulative frequency polygon
discrete data
expected frequencies
histogram
linear interpolation
median
probability distribution
sample size
skew (positive)
true class limits
3 The Mean and Standard Deviation
3.1—Arithmetic mean. The frequency distribution provides the most complete description of the nature of the variation in a population or in a sample drawn from it. For many purposes for which data are collected, however, the chief interest lies in estimating the average of the values of X in a population—the average number of hours worked per week in an industry, the average weekly expenditure on food of a family of four, the average corn acres planted per farm. The average palatability scores of the three foods were the criteria used in comparing the foods in the comparative study in section 1.9.

Given a random sample of size n from a population, the most commonly used estimate of the population mean is the arithmetic mean X̄ of the n sample values. Algebraically, X̄, called X bar, is expressed as

X̄ = (X₁ + X₂ + . . . + Xₙ)/n = (1/n) Σ_{i=1}^n Xᵢ        (3.1.1)
Also, investigators often want to estimate the population total of a variable (e.g., total number employed, total state yield of wheat). If the population contains N individuals, NX̄ is the estimate of the population total derived from the sample mean.

As an example of the sample mean, the weights of a sample of 11 forty-year-old men were 148, 154, 158, 160, 161, 162, 166, 170, 182, 195, and 236 lb. The arithmetic mean is

X̄ = (148 + 154 + . . . + 236)/11 = 1892/11 = 172 lb

Note that only 3 of the 11 weights exceed the sample mean. With a skew distribution, X̄ is not near the middle observation (in this case 162 lb) when they are arranged in increasing order. The 11 deviations of the weights from their mean are, in pounds,

−24, −18, −14, −12, −11, −10, −6, −2, +10, +23, +64

The sum of these deviations is zero. This result, which always holds, is a consequence of the way in which X̄ was defined. The result is easily shown as follows.
In this book, deviations from the sample mean are represented by lowercase letters. That is,
x₁ = X₁ − X̄
x₂ = X₂ − X̄
. . .
xₙ = Xₙ − X̄

Add the above equations:

Σ_{i=1}^n xᵢ = Σ_{i=1}^n Xᵢ − nX̄ = 0        (3.1.2)
since X̄ = ΣXᵢ/n as defined in (3.1.1). With a large sample, time may be saved when computing X̄ by first forming a frequency distribution of the individual Xs. If a particular value Xⱼ occurs in the sample fⱼ times, the sum of these Xⱼ values is fⱼXⱼ. Hence,

X̄ = (1/n)(f₁X₁ + f₂X₂ + . . . + f_kX_k) = (1/n) Σ_{j=1}^k fⱼXⱼ        (3.1.3)
As before, k in (3.1.3) is the number of distinct values of X. For example, in the ratings of the palatability of the food product C by a sample of n = 25 males, 1 male gave a score of +3, 2 gave +2, 7 gave +1, 8 gave 0, 5 gave −1, and 2 gave −2. Hence the average sample rating is

X̄ = (1/25)[(1)(3) + (2)(2) + (7)(1) + (8)(0) + (5)(−1) + (2)(−2)] = 5/25 = 0.2

Since every sample observation is in some frequency class, it follows that

Σ_{j=1}^k fⱼ = n        Σ_{j=1}^k (fⱼ/n) = 1        (3.1.4)

The sample relative frequency fⱼ/n is an estimate of the probability Pⱼ that X has the value Xⱼ in this population.
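A short sketch checking (3.1.1) against (3.1.3) with the palatability ratings just given:

```python
# Palatability scores for food C from a sample of n = 25 males:
# 1 scored +3, 2 scored +2, 7 scored +1, 8 scored 0, 5 scored -1, 2 scored -2
freq = {3: 1, 2: 2, 1: 7, 0: 8, -1: 5, -2: 2}
n = sum(freq.values())                                   # sum of f_j = 25

mean_grouped = sum(f * x for x, f in freq.items()) / n   # formula (3.1.3)

scores = [x for x, f in freq.items() for _ in range(f)]  # expand to 25 observations
mean_direct = sum(scores) / len(scores)                  # formula (3.1.1)

print(n, mean_grouped, mean_direct)   # 25 0.2 0.2 — the two formulas agree
```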
3.2—Population mean. The Greek symbol μ is used to denote the population mean of a variable X. With a discrete variable, consider a population that contains N members. As the sample size n is increased until n = N, the relative frequency fⱼ/n of class j becomes the probability Pⱼ that X has the value Xⱼ in this population. Hence, when applied to the complete population, (3.1.3) becomes

μ = Σ_{j=1}^k PⱼXⱼ        (3.2.1)
In discrete populations, (3.2.1) is used to define μ even if N, the number of units in the population, is infinite. For example, the population of random digits is conceptually infinite if we envisage a process that produces a never ending supply of random digits. But there are only 10 distinct values of X, that is, 0, 1, 2, . . ., 9, with Pⱼ = 0.1 for each distinct digit. Then, by (3.2.1),

μ = (0.1)(0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9) = (0.1)(45) = 4.5

One result that follows from the definition of μ is
Σ_{j=1}^k Pⱼ(Xⱼ − μ) = Σ_{j=1}^k PⱼXⱼ − μ Σ_{j=1}^k Pⱼ = 0        (3.2.2)
since Σ_{j=1}^k Pⱼ = 1 from (2.6.2). Result (3.2.2) is the population analog of result (3.1.2) that Σxᵢ = 0.

EXAMPLE 3.2.1—The following are the lengths of 20 runs of odd or even random digits drawn from table A 1: 1, 4, 1, 1, 3, 6, 2, 2, 3, 1, 1, 5, 2, 2, 1, 1, 1, 2, 1, 1. Find the sample arithmetic mean (i) by adding the numbers, (ii) by forming a frequency distribution and using (3.1.3). Ans. Both methods give 41/20 = 2.05.

EXAMPLE 3.2.2—The weights of 12 staminate hemp plants in early April at College Station, Texas, were approximately (1): 13, 11, 16, 5, 3, 18, 9, 9, 8, 6, 27, and 7 g. Calculate the sample mean, 11 g, and the deviations from it. Verify that Σx = 0.

EXAMPLE 3.2.3—Ten patients troubled with sleeplessness each received a nightly dose of a sedative for a 2-week period. In another 2-week period they received no sedative (2). The average hours of sleep for each patient during each period were as follows:

Patient      1    2    3    4    5    6    7    8    9    10
Sedative    1.3  1.1  6.2  3.6  4.9  1.4  6.6  4.5  4.3  6.1
None        0.6  1.1  2.5  2.8  2.9  3.0  3.2  4.7  5.5  6.2
(i) Calculate the mean hours of sleep X̄_S under sedative and X̄_N under none. (ii) Find the ten differences D = (sedative − none). Show that their mean D̄ equals X̄_S − X̄_N. (iii) Prove algebraically that this always holds.

EXAMPLE 3.2.4—In repeated throws of a six-sided die, appearance of a 1 is scored 1; appearance of a 2, 3, or 4 is scored 2; and appearance of a 5 or 6 is scored 3. If Xⱼ denotes the score (j = 1, 2, 3), (i) write down the probability distribution of the Xⱼ, (ii) find the population mean μ, (iii) verify that ΣPⱼ(Xⱼ − μ) = 0. Ans. (i) For Xⱼ = 1, 2, 3, Pⱼ = 1/6, 3/6, 2/6; (ii) μ = 13/6.

EXAMPLE 3.2.5—In a table of truly random digits the probability Pⱼ of a run of length j of even or odd digits is known to be 1/2ʲ. With an infinite supply of random digits we could conceivably get runs of any length, so the number of distinct lengths of run in this population is infinite. By adding the terms in ΣPⱼXⱼ = Σ j/2ʲ for j = 1, 2, 3, . . ., satisfy yourself that the population mean μ is 2. By j = 10 you should have reached a total equal to 1.9883 and by j = 14, 1.9990. The sum steadily approaches but never exceeds 2 as j increases.

EXAMPLE 3.2.6—In a card game each card from 2 to 10 is scored at its face value, while ace, king, queen, and jack are each scored 10. (i) Write down the probability distribution of scores in repeated sampling with replacement from a well-shuffled pack. (ii) Show that the population mean score μ = 94/13. (iii) Write down the deviations Xⱼ − μ. (It may help to express the successive scores as 26/13, 39/13, and so on.) Verify that ΣPⱼ(Xⱼ − μ) = 0.
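The convergence claimed in example 3.2.5 is easy to check numerically; the sketch below reproduces the partial sums 1.9883 and 1.9990 quoted there.

```python
# Partial sums of mu = sum over j of j * P_j, with P_j = 1/2**j
total = 0.0
for j in range(1, 15):
    total += j / 2**j
    if j in (10, 14):
        print(j, round(total, 4))   # 10 -> 1.9883, 14 -> 1.999
# The exact limit is 2: the series sum of j/2^j for j >= 1 equals 2.
```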
EXAMPLE 3.2.7—J. U. McGuire (3) counted the numbers X of European corn borers in 1296 corn plants. The following is the sample frequency distribution:

X     0     1     2     3    4    5    6   7   8   9   ≥10
f    423   414   253   117   53   22   4   5   3   2    0
Calculate the sample mean X̄ = 1.31. This is another skew distribution, with 65% of the observations less than the mean and only 35% greater.
3.3—Population standard deviation. Roughly speaking, the population standard deviation is a measure of the amount of variation among the values of X in a population. Along with the mean, the standard deviation is important as a descriptive statistic. In describing a variable population, often the first two things that we want to know are: What is the average level of X? How variable is X? Moreover, the standard deviation plays a dominant role in determining how accurately we can estimate the population mean from a sample of a given size. To illustrate, suppose we are planning a sample and would like to be almost certain to estimate μ from X̄ correct to within one unit. If all the values of X in the population lie between 98 and 102, it looks as if a small sample (say n < 10) ought to do the job. If the values of X are spread over the range from 0 to 5000, we face a much tougher problem.

It is not obvious how to construct a measure of the amount of variation in a population. Presumably, this measure will depend on the sizes of the deviations X − μ of the individual values of X from the population mean. The larger these deviations, the more variable the population will be. The average of these deviations, ΣPⱼ(Xⱼ − μ), is not a possible measure, since by (3.2.2) it is always zero. The average of the absolute deviations, ΣPⱼ|Xⱼ − μ|, could be used, but for several reasons a different measure was adopted around 1800—the population standard deviation. It is denoted by the Greek letter σ and defined as
σ = √[Σ_{j=1}^k Pⱼ(Xⱼ − μ)²]        (3.3.1)

We square each deviation, multiply by the probability Pⱼ of its occurrence, add, and finally take the square root. For the population of random digits, μ = 4.5. The successive deviations Xⱼ − μ are −4.5, −3.5, −2.5, −1.5, −0.5, +0.5, +1.5, +2.5, +3.5, +4.5. Each deviation has probability Pⱼ = 0.1. Thus, we get

σ = {(0.1)[(−4.5)² + (−3.5)² + (−2.5)² + (−1.5)² + (−0.5)² + (+0.5)² + (+1.5)² + (+2.5)² + (+3.5)² + (+4.5)²]}^{1/2}
σ = √8.25 = 2.872

The quantity
σ² = Σ_{j=1}^k Pⱼ(Xⱼ − μ)²        (3.3.2)
is called the population variance. The following result enables us to calculate σ² and hence σ without finding any deviations Xⱼ − μ. Expand the quadratic terms (Xⱼ − μ)² in (3.3.2). This step gives

σ² = Σ_{j=1}^k PⱼXⱼ² − 2μ Σ_{j=1}^k PⱼXⱼ + μ² Σ_{j=1}^k Pⱼ        (3.3.3)
Since ΣPⱼXⱼ = μ and ΣPⱼ = 1, (3.3.3) may be rewritten

σ² = Σ_{j=1}^k PⱼXⱼ² − μ²
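As a check on this identity, the following sketch computes σ² for the random-digit population both from definition (3.3.2) and from the shortcut form just derived:

```python
# Population of random digits: X = 0..9, each with probability P = 0.1
P = 0.1
mu = sum(P * x for x in range(10))                       # 4.5

var_def = sum(P * (x - mu) ** 2 for x in range(10))      # definition (3.3.2)
var_short = sum(P * x * x for x in range(10)) - mu ** 2  # shortcut: sum P*X^2 - mu^2

print(mu, var_def, var_short, var_def ** 0.5)
# 4.5  8.25  8.25  2.872... — agreeing with sigma = 2.872 found above
```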
. . . for δ = 1.5, the power increases from 0.32 to 0.59; (ii) for δ = 2, from 0.52 to 0.76.

EXAMPLE 5.6.4—In calculating the power of a two-tailed test at the 5% or a lower significance level, verify from (5.6.2) that the probability of Z being less than Z₁ can be regarded as zero with little loss of accuracy if δ = √n(μ_A − μ₀)/σ > 0.5. For this value of δ the power is only 0.079. Thus we rarely need two areas from table A 3 when calculating the power.
TABLE 5.6.1
Power of a Test of the Null Hypothesis μ = μ₀ Made from the Mean X̄ of a Normal Sample with Known σ

                                      δ = √n(μ_A − μ₀)/σ
Level of Test   One- or Two-Tailed   1.5    2     2.5    3     3.5
0.05            one                  0.44   0.64  0.80   0.91  0.97
0.05            two                  0.32   0.52  0.71   0.85  0.94
0.01            one                  0.20   0.37  0.57   0.75  0.88
0.01            two                  0.14   0.22  0.47   0.61  0.82
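Entries of table 5.6.1 can be reproduced with a few lines of code. This is a minimal sketch assuming the standard normal-theory power formula, in which the power is the sum of the two tail areas Φ(−Z_α + δ) and Φ(−Z_α − δ):

```python
from math import erf, sqrt

def phi(x):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power(delta, z_alpha, two_tailed=True):
    # z_alpha: critical normal deviate (1.960 for a two-tailed 5% test,
    # 1.645 for a one-tailed 5% test)
    p = phi(-z_alpha + delta)
    if two_tailed:
        p += phi(-z_alpha - delta)  # usually negligible, as example 5.6.4 notes
    return p

print(round(power(1.5, 1.960), 2))                    # 0.32, two-tailed 5%
print(round(power(1.5, 1.645, two_tailed=False), 2))  # 0.44, one-tailed 5%
```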
5.7—Testing a mean when σ is not known. In most applications in which a null hypothesis about a population mean is to be tested, the population standard deviation σ is not known. In such cases we use the sample standard deviation s as an estimate of σ and replace the normal deviate by

t = √n(X̄ − μ₀)/s        with (n − 1) df
For two-tailed tests the significance levels of t with different degrees of freedom are read directly from table A 4. In terms of X̄, H₀ is rejected at level α if X̄ lies outside the limits μ₀ ± t_α s/√n. For a one-tailed test at level α, read the t value for level 2α from table A 4.

The following interesting historical example is also of statistical interest in that the data analyzed Xᵢ were not direct measurements but rather contrasts (in this case, differences) of direct measurements. Arthur Young (1) conducted a comparative experiment in 1764. He chose an area (1 acre or 1/2 acre) in each of seven fields on his farm. He divided each area into halves, "the soil exactly the same in both." On one half he broadcast the wheat seed (the old husbandry) and on the other half he drilled the wheat in rows (the new husbandry). All expenses on each half and the returns at harvest were recorded. Table 5.7.1 shows the differences Xᵢ in profit per acre (pounds sterling) for drilling minus broadcasting on the seven fields. On eye inspection, drilling does not look promising. The natural test is a two-sided test of the null hypothesis μ = 0. The sample data in table 5.7.1 give s = 1.607. Hence

t = √n X̄/s = √7(−1.01)/1.607 = −1.663        with 6 df

The null hypothesis that the two methods give equal profit in the long run cannot be rejected. From table A 4 the significance probability (the probability that |t| ≥ 1.663 given H₀) is about 0.15. This sort of analysis of differences is often called a paired-comparison t test. The following illustrates a one-tailed paired-comparison t test of a nonzero μ₀.

On 14 farms in Boone County, Iowa, in 1950, the effect of spraying against corn borers was evaluated by measuring the corn yields on both sprayed and unsprayed strips in a field on each farm. The 14 differences (sprayed minus unsprayed) in yields (bushels per acre) were as follows:

−5.7, 3.7, 6.4, 1.5, 4.3, 4.8, 3.3, 3.6, 0.5, 5.0, 24.0, 8.8, 4.5, 1.1

The sample mean difference was X̄ = 4.70 bu/acre, with s = 6.48 bu/acre. The cost of commercial spraying was $3/acre and the 1950 crop sold at about $1.50/bu. So a gain of 2 bu/acre would pay for the spraying.

TABLE 5.7.1
Differences (drilled minus broadcasting) in Profit per Acre (pounds sterling)
Field     1      2      3      4      5      6      7     Total    Mean
Xᵢ      −0.3   −0.3   −2.3   −0.7   +1.7   −2.2   −3.0    −7.1    −1.01
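To make the computation concrete, here is a minimal sketch of Young's paired-comparison t test on the differences of table 5.7.1 (scipy is used only for the t-distribution tail area):

```python
from math import sqrt
from scipy import stats

# Differences in profit per acre (drilled minus broadcast), table 5.7.1
x = [-0.3, -0.3, -2.3, -0.7, 1.7, -2.2, -3.0]
n = len(x)
xbar = sum(x) / n                                        # -1.014
s = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))    # 1.607

t = sqrt(n) * xbar / s               # about -1.67 (the text's -1.663 uses
                                     # the rounded mean -1.01)
p = 2 * stats.t.sf(abs(t), df=n - 1) # two-tailed significance probability, ~0.15
print(round(t, 2), round(p, 2))
```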
Does the sample furnish strong evidence that spraying is profitable? To examine this question, we perform a one-sided test with H₀: μ = 2, H_A: μ > 2. The test criterion is

t = √14(4.7 − 2.0)/6.48 = 1.56        with 13 df
The one-tailed 5% level in table A 4 is 1.771, the significance probability for t = 1.56 being about 0.075, not quite at the 5% level. The lower 95% limit in a one-sided confidence interval is 4.7 − (1.771)(1.73) = 1.64 bu/acre. As noted previously, the confidence interval agrees with the verdict of the test in indicating that there might be a small monetary loss from spraying.

In the 14 differences, note that the eleventh difference, 24.0, is much larger than any of the others and looks out of line. Tests described in sections 15.4 and 15.5 confirm that this value probably comes from a population with a mean different from the other 13 differences, although we know of no explanation. This discrepancy casts doubt on the appropriateness of the paired-comparison t test. Perhaps these 14 differences should not be considered as a random sample of size 14 from a single normal population. If we believed that outlying differences of the magnitude of 24.0 could be expected to occur with some regularity in careful reruns of this experiment, we might prefer to consider the data as a sample from a nonnormal population and analyze using a nonparametric procedure such as the sign test or the Wilcoxon test described in chapter 8.

EXAMPLE 5.7.1—Samples of blood were taken from each of 8 patients. In each sample, the serum albumen content of the blood was determined by each of two laboratory methods A and B. The objective was to discover whether there was a consistent difference in the amount of serum albumen found by the two methods. The 8 differences (A − B) were as follows: 0.6, 0.7, 0.8, 0.9, 0.3, 0.5, −0.5, 1.3, the units being g/100 ml. Compute t to test the null hypothesis (H₀) that the population mean of these differences is zero and report the approximate value of your significance probability. What is the conclusion? Ans. t = 3.09 with 7 df, P between 0.025 and 0.01. Method A has a systematic tendency to give higher values.

EXAMPLE 5.7.2—In an investigation of the effect of feeding 10 µg of vitamin B₁₂ per pound of ration to growing swine (5), 8 lots (each with 6 pigs) were fed in pairs. The pairs were distinguished by being fed different levels of an antibiotic that did not interact with the vitamin; that is, the differences were not affected by the antibiotic. The average daily gains (to about 200 lb liveweight) are summarized as follows:

                              Pairs of Lots
Ration            1      2      3      4      5      6      7      8
With B₁₂        1.60   1.68   1.75   1.64   1.75   1.79   1.78   1.77
Without B₁₂     1.56   1.52   1.52   1.49   1.59   1.56   1.60   1.56
Difference, D   0.04   0.16   0.23   0.15   0.16   0.23   0.18   0.21
For the differences, calculate the statistics D̄ = 0.170 lb/day and s_D̄ = 0.0217 lb/day. Clearly, the effect of B₁₂ is statistically significant.

EXAMPLE 5.7.3—The effect of B₁₂ seems to be a stimulation of the metabolic processes, including appetite. The pigs eat more and grow faster. In the experiment above, the cost of the additional amount of feed eaten, including that of the vitamin, corresponded to about 0.130 lb/day of gain. Make a one-sided test of H₀: μ_D ≤ 0.130 lb/day. Ans. t = 1.843, just short of the 5% level.

EXAMPLE 5.7.4—In example 5.7.3, what is the lower 95% one-sided confidence limit for μ_D? Ans. 0.129 lb/day of gain, indicating (apart from a 5% chance) only a trivial money loss from B₁₂.
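A similar sketch covers the one-sided test of examples 5.7.3 and 5.7.4, using the B₁₂ differences listed above:

```python
from math import sqrt
from scipy import stats

d = [0.04, 0.16, 0.23, 0.15, 0.16, 0.23, 0.18, 0.21]   # daily-gain differences
n = len(d)
dbar = sum(d) / n                                      # 0.170 lb/day
s = sqrt(sum((x - dbar) ** 2 for x in d) / (n - 1))
se = s / sqrt(n)                                       # 0.0217 lb/day

t = (dbar - 0.130) / se                # 1.843, one-sided test of mu_D = 0.130
t05 = stats.t.ppf(0.95, df=n - 1)      # one-tailed 5% point with 7 df, 1.895
lower = dbar - t05 * se                # lower 95% one-sided limit, 0.129
print(round(t, 3), round(t05, 3), round(lower, 3))
```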
5.8—Other tests of significance. As we have noted, the method of making a significance test is determined by the null and the alternative hypotheses. Suppose we have data that are thought to be a random sample from a normal distribution. In preceding sections we showed how to make one-tailed and two-tailed tests when the null hypothesis is μ = μ₀ and the alternative is μ = μ_A. In statistical analysis, there are many different null and alternative hypotheses that are on occasion useful to consider and therefore many different tests of significance. Five are listed:
1. With the same data, our interest might be in the amount of variability in the population, not in the population mean. The null hypothesis H₀ might be that the population variance has a known value σ₀², while H_A is that it has some other value. As might be anticipated, this test (section 5.11) is made from the sample value of s².
2. In other circumstances we might have no hypothesis about the values of μ or σ² and no reason to doubt the randomness of the sample. Our question might be, Is the population normal? For problems in which the alternative says no more than that the sample is drawn from some kind of nonnormal population, a test of goodness of fit is given (section 5.12).
3. With the same null hypothesis, the alternative might be slightly more specific, namely that the data come from a skewed, asymmetrical distribution. For this alternative, a test for skewness (section 5.13) is used.
4. In other applications, the alternative might be that the distribution differs from the normal primarily in having longer tails. A test criterion whose aim is to detect this type of departure from normality (kurtosis) appears in section 5.14.
5. The tests described in references (2), (3), and (4) usually are conducted to determine whether or not analyses that assume a normally distributed population, such as those of sections 5.2 to 5.8, are appropriate. When one of these tests, or perhaps the graphical test described in section 4.13, indicates the assumption of normality is not appropriate, often so-called nonparametric procedures are used. Some of these procedures are described in chapter 8.
5.9—Frequency distribution of s². In studying the variability of populations and the precision of measuring instruments and of mass production processes, we face the problem of estimating σ²—in particular, of constructing confidence intervals for σ² from the sample variances s² and of testing null hypotheses about the value of σ². As before, s² = Σ(Xᵢ − X̄)²/(n − 1). If the Xᵢ are randomly drawn from a normal distribution, the frequency distribution of the quantity (n − 1)s²/σ² is another widely used distribution, the chi-square distribution with (n − 1) df.

Chi-square (χ²) with ν degrees of freedom is defined as the distribution of the sum of squares of ν independent standard normal deviates. Thus, if Z₁, Z₂, . . ., Z_ν are independent standard normal deviates, the quantity
Z₁² + Z₂² + . . . + Z_ν²        (5.9.1)
follows the chi-square distribution with ν df. The form of the distribution has been worked out mathematically. Table A 5 presents the percentage points of the distribution—the values of χ² that are exceeded with the stated probabilities 0.995, 0.990, and so on.

The relation between s² and χ² is made a little clearer by the following algebra. Remember that (n − 1)s² is the sum of squares of deviations, Σ(Xᵢ − X̄)², which may be written Σ[(Xᵢ − μ) − (X̄ − μ)]², where μ is the population mean. Hence,

(n − 1)s²/σ² = (X₁ − μ)²/σ² + (X₂ − μ)²/σ² + . . . + (Xₙ − μ)²/σ² − n(X̄ − μ)²/σ²        (5.9.2)

Now the quantities (Xᵢ − μ)/σ are all in standard measure—they are standard normal deviates Zᵢ when the data are a random sample from a normal population with mean μ and standard deviation σ. And √n(X̄ − μ)/σ is another normal deviate, since the standard error of X̄ is σ/√n. Hence we may write

(n − 1)s²/σ² = Z₁² + Z₂² + . . . + Zₙ² − Z²ₙ₊₁        (5.9.3)
Thus (n − 1)s²/σ² follows the chi-square distribution with ν = n − 1 df.

To find 95% confidence limits for σ², we need χ²₀.₉₇₅ and χ²₀.₀₂₅, the values of chi-square exceeded with probabilities 0.975 and 0.025, respectively. The probability that a value of χ² drawn at random lies between these limits is 0.95. Since χ² = νs²/σ², the probability is 95% that when our sample was drawn,

χ²₀.₉₇₅ < νs²/σ² < χ²₀.₀₂₅        (5.10.1)

Analogously to what we found in section 5.3, these inequalities are equivalent to

νs²/χ²₀.₀₂₅ < σ² < νs²/χ²₀.₉₇₅        (5.10.2)
Expression (5.10.2) is the general formula for two-sided 95% confidence limits for σ². With s² computed from a sample of size n, ν = n − 1. As an example, we show how to compute the 95% limits for σ², given one of the s² values obtained from n = 10 swine gains. For 9 df, table A 5 gives 19.02 for χ²₀.₀₂₅ and 2.70 for χ²₀.₉₇₅. Hence, from (5.10.2) the 95% interval for σ² is 9s²/19.02 < σ² < 9s²/2.70.

To test the null hypothesis H₀: σ² ≤ σ₀² against the alternative σ² > σ₀², compute

χ² = νs²/σ₀² = Σx²/σ₀²        (5.11.1)

with significance at the 5% level if χ² > χ²₀.₀₅₀ with ν df. As an example, suppose
an investigator has used for years a stock of inbred rats whose weights have σ₀ = 26 g. The investigator considers switching to a cheaper source of supply although the new rats may show greater variability, which would be undesirable. A sample of 20 new rats gave s² = 23,297/19, s = 35.0 g, in line with suspicions. As a check, the investigator tests H₀: σ = 26 g against H_A: σ > 26 g.

χ² = 23,297/26² = 34.46        (df = 19)
In table A 5, χ²₀.₀₅₀ = 30.14, so the null hypothesis is rejected. To test H₀: σ² ≥ σ₀² against the alternative σ² < σ₀², reject at the 5% level if χ² = Σx²/σ₀² < χ²₀.₉₅₀. For a two-sided test, reject if either χ² < χ²₀.₉₇₅ or χ² > χ²₀.₀₂₅.

For example, the 511 samples of means X̄ of 10 swine gains in weight in table 4.6.1 gave a sample variance s²_X̄ = 10.201 with 510 df. Since the original population (table 4.14.1) had σ² = 100, the theoretical variance σ²_X̄ of means X̄ of samples of size 10 equals 100/10 = 10. Let us see if that value is accepted by the two-sided 5% level test. Table A 5 stops at ν = 100. For values of ν > 100, an approximation is that the quantity

Z = √(2χ²) − √(2ν − 1)        (5.11.2)
is a standard normal deviate. In our example χ² = (510)(10.201)/10 = 520.25; ν = 510. Hence

Z = √1040.5 − √1019 = 0.335

This value lies far short of the two-tailed 5% level, 1.96. The null hypothesis is accepted.

. . . the estimated variance is multiplied by (ν + 3)/(ν + 1), where ν is the degrees of freedom that the experimental plan provides. Thus we compare

(18.57)(10)/8 = 23.2        (86.57)(17)/15 = 98.1
D. R. Cox (13) suggests the multiplier (ν + 1)²/ν². This gives almost the same results, imposing a slightly higher penalty when ν is small. A comparison like the above from a single experiment is not very precise, particularly if n is small. The results of several paired experiments using the same criterion for pairing give a more accurate picture of the success of the pairing. If the criterion has no correlation with the response variable, a small loss in accuracy results from pairing due to the adjustment for degrees of freedom. A substantial loss in accuracy may even occur if the criterion is badly chosen so that members of a pair are negatively correlated.

When analyzing the results of a comparison of two procedures, the investigator must know whether samples are paired or independent and must use the appropriate analysis. Sometimes a worker with paired data forgets they are paired when it comes to analysis and carries out the statistical analysis as if the two samples were independent. This mistake is serious if the pairing has been effective. In the virus lesions example, the worker would be using 2s²/n = 91.43/8 = 11.43 as the variance of D̄ instead of 18.57/8 = 2.32. The mistake throws away all the advantage of the pairing. Differences that are actually significant may be found nonsignificant, and confidence intervals will be too wide.

Analysis of independent samples as if they were paired seems to be rare in practice. If the members of each sample are in essentially random order so that the pairs are a random selection, the computed s²_D may be shown to be an unbiased estimate of 2σ². Thus the analysis still provides an unbiased estimate of the variance of X̄₁ − X̄₂ and a valid t test. There is a slight loss in sensitivity, since t tests are based on (n − 1) df instead of 2(n − 1) df. As for assumptions, pairing has the advantage that its t test does not require σ₁ = σ₂. "Random" pairing of independent samples has been suggested as a means of obtaining tests and confidence limits when the investigator knows that σ₁ and σ₂ are unequal.

Artificial pairing of the results by arranging each sample in descending order and pairing the top two, the next two, and so on, produces a great underestimation of the true variance of D. This effect may be illustrated by the first two random samples of swine gains from table 4.11.1. The population variance σ² is 100, giving 2σ² = 200. In table 6.13.1 this method of artificial pairing has been employed. Instead of the correct value of 200 for 2σ², we get an estimate s²_D of only 8.0. Since s_D̄ = √(8.0/10) = 0.894, the t value for testing D̄ is t = 6.3/0.894 = 7.04 with 9 df. This gives a P value of much less than 0.1%, although the two samples were drawn from the same population.

TABLE 6.13.1
Two Samples of 10 Swine Gains Arranged in Descending Order, to Illustrate Erroneous Conclusions from Artificial Pairing

Sample 1      57   53   39   39   36   34   33   29   24   12     Mean = 35.6
Sample 2      53   44   32   31   30   30   24   19   19   11     Mean = 29.3
Differences    4    9    7    8    6    4    9   10    5    1     Mean = 6.3

ΣD² − (ΣD)²/10 = 469 − 63²/10 = 72.1        s²_D = 72.1/9 = 8.0
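The numbers in table 6.13.1 are easy to verify; the sketch below also shows how far the artificially paired s²_D falls below the true 2σ² = 200:

```python
from math import sqrt

s1 = [57, 53, 39, 39, 36, 34, 33, 29, 24, 12]   # already in descending order
s2 = [53, 44, 32, 31, 30, 30, 24, 19, 19, 11]
d = [a - b for a, b in zip(s1, s2)]

n = len(d)
ss = sum(x * x for x in d) - sum(d) ** 2 / n    # 469 - 63**2/10 = 72.1
s2_d = ss / (n - 1)                             # 8.0, versus the true 200
t = (sum(d) / n) / sqrt(s2_d / n)               # 6.3/0.894 = 7.04 with 9 df
print(s2_d, round(t, 2))
```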
EXAMPLE 6.13.1—In testing the effects of two painkillers on the ability of young men to tolerate pain from a narrow beam of light directed at the arm, each subject was first rated several times as to the amount of heat energy he bore without complaining of discomfort. The subjects were then paired according to these initial scores. In a later experiment the amounts of energy received that caused the subject to complain were as follows (A and B denoting the treatments):

Pair    1    2    3    4    5    6    7    8    9     Sums
A      15    2    4    1    5    7    1    0   −3      32
B       6    7    3    4    3    2    3    0   −6      22
To simplify calculations, 30 was subtracted from each original score. Show that for appraising the effectiveness of the pairing, comparable variances are 22.5 for the paired experiment and 44.6 for independent groups (after allowing for the difference in degrees of freedom). The preliminary work in rating the subjects reduced the number of subjects needed by almost one-half.

EXAMPLE 6.13.2—In a previous experiment comparing two routes A and B for driving home from an office (example 6.3.3), pairing was by days of the week. The times taken (−23 min) for the 10 pairs were as follows:

A       5.7   3.2   1.8   2.3   2.1   0.9   3.1   2.8   7.3   8.4
B       2.4   2.8   1.9   2.0   0.9   0.3   3.6   1.8   5.8   7.3
A − B   3.3   0.4  −0.1   0.3   1.2   0.6  −0.5   1.0   1.5   1.1
Show that if the 10 nights on which route A was used had been drawn at random from the 20 nights available, the variance of the mean difference would have been about 8 times as high as with this pairing.

EXAMPLE 6.13.3—If pairing has not reduced the variance so that s²_D = 2s², . . .

. . . Z > −(√n/σ)(μ_A − μ₀) + Z_α
where Z_α denotes the two-tailed significance level, with Z₀.₁₀ = 1.645, Z₀.₀₅ = 1.960, and Z₀.₀₁ = 2.576. With a paired experiment, apply this result to the population of differences between the members of a pair. In terms of differences, H₀ becomes μ_D = 0 and H_A is μ_D = δ, which we assume > 0. The test criterion is √n D̄/σ_D. Hence, (5.6.2) is written

Z < −√n δ/σ_D − Z_α        or        Z > −√n δ/σ_D + Z_α        (6.14.1)
For power values of any substantial size, the probability that Z lies in the first region will be found to be negligible and we need consider only the second. As an example, suppose we want the power to be 0.80 in a two-tailed test at the 5% level so that Z₀.₀₅ = 1.960. From table A 3, the normal deviate exceeded by 80% of the distribution is Z = −0.842. Hence, in (6.14.1) we want

−0.842 = −√n δ/σ_D + 1.960        or        √n δ/σ_D = 2.802        n = 7.9 σ²_D/δ²

Note that with √n δ/σ_D = 2.802, the first region in (6.14.1) is Z < −4.762, whose probability is negligible as predicted. To obtain a general formula, verify that if the power of the test is to be P′, the lower boundary of the second region in (6.14.1) is −Z_{2(1−P′)}. If we write β = 2(1 − P′), the formula for the needed sample size n is

n = (Z_α + Z_β)² σ²_D/δ²        (6.14.2)
In a one-tailed test at level α, (5.6.3) shows that in terms of differences the power is the probability that a normal deviate Z lies in the region

Z > −√n δ/σ_D + Z_{2α}        (6.14.3)

Hence, with a one-tailed test,

n = (Z_{2α} + Z_β)² σ²_D/δ²        (6.14.4)
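The arithmetic of (6.14.2) is simple enough to script. A minimal sketch, with the normal quantiles obtained from scipy rather than table A 3:

```python
from math import ceil
from scipy import stats

def paired_sample_size(delta, sigma_d, alpha=0.05, power=0.80):
    # n = (Z_alpha + Z_beta)^2 * sigma_D^2 / delta^2, formula (6.14.2)
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-tailed critical value
    beta = 2 * (1 - power)
    z_beta = stats.norm.ppf(1 - beta / 2)     # deviate exceeded with prob. beta/2
    return ceil((z_alpha + z_beta) ** 2 * sigma_d ** 2 / delta ** 2)

# (1.960 + 0.842)^2 = 7.85, so about 7.9 pairs per unit of (sigma_D/delta)^2:
print(paired_sample_size(delta=1.0, sigma_d=1.0))   # 8
```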
For two independent samples of size n, the test criterion is √n(X̄₁ − X̄₂)/(√2 σ) instead of √n D̄/σ_D for paired samples. Hence, the only change in (6.14.2) and (6.14.4) for n is that 2σ² replaces σ²_D.

. . . To make a one-tailed test, we find Z and use only a single tail of the normal table. For a continuity correction in such a one-tailed test, subtract 0.5 from |f − F|.

In interpreting the results of these χ² tests in observational studies, caution is necessary, as noted in chapter 1. The two groups being compared may differ in numerous ways, some of which may be wholly or partly responsible for an observed significant difference. For instance, pipe smokers and nonsmokers may differ to some extent in their economic levels, residence (urban or rural), and eating and drinking habits, and these variables may be related to the risk of dying. Before claiming that a significant difference is caused by the variable under study, it is the investigator's responsibility to produce evidence that disturbing variables of this type could not have produced the difference. Of course, the same responsibility rests with the investigator who has done a controlled experiment. But the device of randomization and the greater flexibility that usually prevails in controlled experimentation make it easier to ensure against misleading conclusions from disturbing influences.

The preceding χ² and Z methods are approximate, the approximation becoming poorer as the sample size decreases. Fisher (9) has shown how to compute an exact test of significance. For accurate work the exact test should be used if (i) the total sample size N is less than 20 or (ii) N lies between 20 and 40 and the smallest expected number is less than 5. For those who encounter these conditions frequently, reference (10), which gives tables of the exact tests covering these cases, is recommended.

EXAMPLE 7.11.1—In a study as to whether cancer of the breast tends to run in families, Murphy and Abbey (11) investigated the frequency of breast cancer found in relatives of (i) women with breast cancer and (ii) a comparison group of women without breast cancer. The data below, slightly altered for easy calculation, refer to the mothers of the subjects.
                            Breast Cancer in Subject
Breast Cancer in Mother      Yes      No      Total
Yes                            7       3        10
No                           193     197       390
Total                        200     200       400
Calculate χ² and P (i) without correction and (ii) with correction for continuity for testing the null hypothesis that the frequency of cancer in mothers is the same in the two classes of subjects. Ans. (i) χ² = 1.64, P ≈ 0.20; (ii) χ²_c = 0.92, P_c ≈ 0.34. Note that the correction for continuity always increases P, that is, makes the difference less significant. With these data, a case can be made for either a two-tailed test or a one-tailed test on the
supposition that it is very unlikely that breast cancer in mothers would decrease the risk of breast cancer in the daughters. With a one-tailed test, P_c = 0.17, neither test giving evidence of a relation between breast cancer in subject and mother.

EXAMPLE 7.11.2—C. H. Richardson has furnished the following numbers of aphids (Aphis rumicis L.) dead and alive after spraying with two concentrations of solutions of sodium oleate:

                  Concentration of Sodium Oleate (%)
                   0.65      1.10      Total
Dead                55        62        117
Alive               13         3         16
Total               68        65        133
Percent dead       80.9      95.4
Has the higher concentration given a significantly different percent kill? Ans. χ²_c = 5.76, P < 0.025.

EXAMPLE 7.11.3—In 1943 a sample of about 1 in 1000 families in Iowa was asked about the canning of fruits or vegetables during the preceding season. Of the 392 rural families, 378 had done canning, while of the 300 urban families, 274 had canned. Calculate 95% confidence limits for the difference in the percentages of rural and urban families who had canned. Ans. 1.42% and 8.78% (without correction for continuity).
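For example 7.11.1, the χ² test with and without the continuity correction can be checked as follows (scipy's chi2_contingency computes both forms):

```python
from scipy.stats import chi2_contingency

# Mothers with/without breast cancer, by subject group (example 7.11.1)
table = [[7, 3],
         [193, 197]]

chi2, p, dof, expected = chi2_contingency(table, correction=False)
chi2_c, p_c, _, _ = chi2_contingency(table, correction=True)
print(round(chi2, 2), round(p, 2))      # 1.64 0.20, uncorrected
print(round(chi2_c, 2), round(p_c, 2))  # 0.92 0.34, with continuity correction
```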
7.12—Test of the independence of two attributes. The preceding test is sometimes described as a test of the independence of two attributes. A sample of people of a particular ethnic type might be classified into two classes according to hair color and also into two classes according to color of eyes. We might ask, Are color of hair and color of eyes independent? Similarly, the numerical example in the previous section might be referred to as a test of the question, Is the risk of dying independent of smoking pipes?

In this way of speaking, the word independent carries the same meaning as it does in Rule 3 in the theory of probability. Let p_A be the probability that a member of a population possesses attribute A and p_B the probability that the member possesses attribute B. If the attributes are independent, the probability of possessing both attributes is p_A p_B. Thus, on the null hypothesis of independence, the probabilities in the four cells of the 2 × 2 contingency table are as follows:

                         Attribute A
Attribute B      (1) Present    (2) Absent    Total
(1) Present        p_A p_B       q_A p_B       p_B
(2) Absent         p_A q_B       q_A q_B       q_B
Total                p_A           q_A          1
Two points emerge from this table. The null hypothesis can be tested either by comparing the proportions of cases in which B is present in columns (1) and (2) or by comparing the proportions of cases in which A is present in rows (1) and (2). These two χ² tests are exactly the same, as is easily verified.
Second, the table provides a check on the rule given for calculating the expected number in any cell. In a single sample of size N, we expect to find Np_A p_B members possessing both A and B. The sample total in column (1) is our best estimate of Np_A, while that in row (1) similarly estimates Np_B. Thus (column total)(row total)/(grand total) is (Np_A)(Np_B)/N = Np_A p_B, as required.
7.13—Sample size for comparing two proportions. The question, How large a sample do I need? is naturally of great interest to investigators. For comparing two means, an approach that is often helpful is given in section 6.14. It should be reviewed carefully, since the same principle applies to the comparison of two proportions. The approach assumes that a test of significance of the difference between the two proportions is planned and that future actions will depend on whether the test shows a significant difference or not. Consequently, if the true difference p₂ − p₁ is as large as some amount δ chosen by the investigator, the test is desired to have a high probability P′ of declaring a significant result.

Formula (6.14.2) for n, the size of each sample, can be applied. For two independent samples, use δ = p₂ − p₁ and σ²_D = p₁q₁ + p₂q₂, which gives

n = (Z_α + Z_β)²(p₁q₁ + p₂q₂)/(p₂ − p₁)²        (7.13.1)

where Z_α is the normal deviate corresponding to the significance level to be used in the test, β = 2(1 − P′), and Z_β is the normal deviate corresponding to the two-tailed probability β. Table 6.14.1 gives (Z_α + Z_β)² for the commonest values of α and β.

In using this formula, we substitute the best advance estimate of p₁q₁ + p₂q₂ in the numerator. For instance, suppose that a standard antibiotic has been found to protect about 50% of experimental animals against a certain disease. Some new antibiotics become available that seem likely to be superior. In comparing a new antibiotic with the standard, we would like a probability P′ = 0.9 of finding a significant difference in a one-tailed test at the 5% level if the new antibiotic will protect 80% of the animals in the population. For these conditions, table 6.14.1 gives (Z_α + Z_β)² as 8.6. Hence
n = (8.6)[(50)(50) + (80)(20)]/30² = 39.2

Thus, 40 animals should be used for each antibiotic. Some calculations of this type will soon convince you of the sad fact that large samples are necessary to detect small differences between two percentages. When resources are limited, it is sometimes wise, before beginning the experiment, to calculate the probability that a significant result will be found. Suppose that an experimenter is interested in the values p₁ = 0.8, p₂ = 0.9 but cannot make n > 100. If formula (7.13.1) is solved for Z_β, we find

Z_β = √n(p₂ − p₁)/√(p₁q₁ + p₂q₂) − Z_α = (10)(0.1)/0.5 − Z_α = 2 − Z_α
If the experimenter intends a two-tailed 5% test, Z_α ≈ 2 so that Z_β = 0. This gives β = 1 and P′ = 1 − β/2 = 0.5. The proposed experiment has only a 50-50 chance of finding a significant difference in this situation. Formula (7.13.1), although a large sample approximation, should be accurate enough for practical use, even though there is usually some uncertainty about the values of p₁ and p₂ to insert in the formula. Reference (12) gives tables of n based on a more accurate approximation.

EXAMPLE 7.13.1—One difficulty in estimating sample size in biological work is that the proportions given by a standard treatment may vary over time. An experimenter has found that the standard treatment has a failure rate lying between p₁ = 30% and p₁ = 40%. With a new treatment whose failure rate is 20% lower than the standard, what sample sizes are needed to make P′ = 0.9 in a two-tailed 5% test? Ans. n = 79 when p₁ = 30% and n = 105 when p₁ = 40%.

EXAMPLE 7.13.2—The question of sample size was critical in planning the 1954 trial of the Salk poliomyelitis vaccine (13), since it was unlikely that the trial could be repeated and an extremely large sample of children would obviously be necessary. Various estimates of sample size were therefore made. One estimate assumed that the probability an unprotected child would contract paralytic polio was 0.0003, or 0.03%. If the vaccine was 50% effective (that is, decreased this probability to 0.00015, or 0.015%), it was desired to have a 90% chance of finding a significant difference at the 5% level in a two-tailed test. How many children are required? Ans. 210,000 in each group (vaccinated and unprotected).

EXAMPLE 7.13.3—An investigator has p₁ = 0.4 and usually conducts experiments with n = 25. In a one-tailed test at the 5% level, what is the chance of obtaining a significant result if (i) p₂ = 0.5, (ii) p₂ = 0.6? Ans. (i) 0.18, (ii) 0.42.
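A sketch wrapping (7.13.1) in a function; it reproduces the antibiotic example (n = 40) and example 7.13.1 (n = 79):

```python
from math import ceil
from scipy import stats

def n_per_group(p1, p2, alpha=0.05, power=0.90, one_tailed=False):
    # Formula (7.13.1): n = (Z_alpha + Z_beta)^2 (p1 q1 + p2 q2) / (p2 - p1)^2
    z_alpha = stats.norm.ppf(1 - alpha) if one_tailed else stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)   # deviate exceeded with probability beta/2
    num = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return ceil(num / (p2 - p1) ** 2)

print(n_per_group(0.5, 0.8, one_tailed=True))   # 40 animals per antibiotic
print(n_per_group(0.3, 0.1))                    # 79, as in example 7.13.1
```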
7.14—The Poisson distribution. As we have seen, the binomial distribution tends to the normal distribution as n increases for any fixed value of p. The value of n needed to make the normal approximation a good one depends on the value of p; this value of n is smallest when p = 0.5. For p < 0.5, a general rule, usually conservative, is that the normal approximation is adequate if the mean μ = np is greater than 15. In many applications, however, we are studying rare events, so that even if n is large, the mean np is much less than 15. The binomial distribution then remains noticeably skew and the normal approximation is unsatisfactory.

A different approximation for such cases was developed by S. D. Poisson (14). He worked out the limiting form of the binomial distribution when n tends to infinity and p tends to zero at the same time in such a way that μ = np is constant. The binomial expression for the probability of r successes tends to the simpler form,
P(r) = μ^r e^{−μ}/r!        r = 0, 1, 2, . . .        (7.14.1)
where e = 2.71828 is the base of natural logarithms. The initial terms in the Poisson distribution are

P(0) = e^{−μ}        P(1) = μe^{−μ}        P(2) = (μ²/2!)e^{−μ}        P(3) = (μ³/3!)e^{−μ}
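To see how well the Poisson limit approximates the binomial for rare events, the sketch below compares the two for n = 1000 and p = 0.003 (so μ = np = 3); these parameter values are illustrative, not from the text:

```python
from math import comb, exp, factorial

n, p = 1000, 0.003          # illustrative values: large n, small p
mu = n * p                  # 3.0

for r in range(5):
    binom = comb(n, r) * p**r * (1 - p)**(n - r)
    poisson = mu**r * exp(-mu) / factorial(r)   # formula (7.14.1)
    print(r, round(binom, 4), round(poisson, 4))
# The two columns agree to about three decimals; e.g., P(0) = 0.0496 vs 0.0498.
```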
EXAMPLE 10.7.1—In table 10.1.1, subtract each sister's height from her brother's; then compute the sum of squares of deviations of the differences. Use the results given in table 10.1.1 to verify (10.7.4) as applied to differences.

EXAMPLE 10.7.2—You have a random sample of n pairs (X₁ᵢ, X₂ᵢ) in which the correlation between X₁ᵢ and X₂ᵢ is ρ. What is the correlation between the sample means X̄₁ and X̄₂? Ans. ρ, the same as the correlation between members of an individual pair.

EXAMPLE 10.7.3—The members of a sample, X₁, X₂, . . ., Xₙ, all have variance . . .

. . . (Here νᵢ = nᵢ − 1.) The approximate degrees of freedom in the estimated variance Σλᵢ²sᵢ²/nᵢ are

df = (Σvᵢ)²/(Σvᵢ²/νᵢ)        (12.10.3)

where vᵢ = λᵢ²sᵢ²/nᵢ.
As an example with unequal nᵢ, the public school expenditures per pupil per state in five regions of the United States in 1977 (10) are shown in table 12.10.2.
TABLE 12.10.1
Analysis of Variance with Samples of Unequal Sizes

Source of Variation    df       Sum of Squares                     Mean Square    F
Between classes        a − 1    Σᵢ X²ᵢ./nᵢ − X²../N                  s₁²         s₁²/s²
Within classes         N − a    ΣΣ X²ᵢⱼ − Σᵢ X²ᵢ./nᵢ                 s²
Total                  N − 1    ΣΣ X²ᵢⱼ − X²../N
TABLE 12.10.2
Public School Expenditures per Pupil per State (in $1000)

        Northeast   Southeast   South Central   North Central   Mountain Pacific
          1.33        1.66         1.16            1.74             1.76
          1.26        1.37         1.07            1.78             1.75
          2.33        1.21         1.25            1.39             1.60
          2.10        1.21         1.11            1.28             1.69
          1.44        1.19         1.15            1.88             1.42
          1.55        1.48         1.15            1.27             1.60
          1.89        1.19         1.16            1.67             1.56
          1.88                     1.26            1.40             1.24
          1.86                     1.30            1.51             1.45
          1.99                                     1.74             1.35
                                                   1.53             1.16
Total    17.63        9.31        10.61           17.19            16.58
Mean      1.763       1.330        1.179           1.563            1.507
nᵢ       10           7            9              11               11
sᵢ²       0.1240      0.0335       0.0057          0.0448           0.0404
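As a check on the Satterthwaite computation worked in example 12.10.1 below, this sketch applies (12.10.3) to the within-region variances of table 12.10.2:

```python
n = [10, 7, 9, 11, 11]                        # states per region
s2 = [0.1240, 0.0335, 0.0057, 0.0448, 0.0404]

nu = [ni - 1 for ni in n]                     # within-region df
v = [nui * s2i for nui, s2i in zip(nu, s2)]   # v_i = nu_i * s_i^2

df = sum(v) ** 2 / sum(vi ** 2 / nui for vi, nui in zip(v, nu))
print(round(sum(v), 3), round(df))   # 2.215 and df = 27, versus 43 df
                                     # if the sigma_i^2 were assumed equal
```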
Do the average expenditures differ from region to region? The analysis of variance (table 12.10.3) shows F highly significant. The variance of a regional mean is σ²/nᵢ. These variances differ from region to region, but since the nᵢ do not differ greatly, we use the average variance of a regional mean for t tests. This average is σ²/n_h, where n_h is the harmonic mean of the nᵢ, in this case 9.33. Hence the estimated standard error of a regional mean is taken as √(0.0515/9.33) = 0.0743. With 43 df, the difference between two means required for significance at the 5% level is √2(0.0743)(2.02) = 0.212. Using this yardstick, average expenditure per pupil in the Northeast exceeds that in any other region except the North Central, while the North Central exceeds the southern regions.

Since the within-region values of sᵢ² look quite dissimilar, a safer procedure when comparing two class means is to estimate the standard error of the difference in means using only the data from these two classes. With this method the mean expenditures for the Northeast and Mountain Pacific regions no longer differ significantly.

EXAMPLE 12.10.1—How many degrees of freedom does Satterthwaite's method assign to the pooled mean square within regions in the F test in table 12.10.3? Ans. Since the pooled within-regions sum of squares is Σνᵢsᵢ², we take vᵢ = νᵢsᵢ² in (12.10.3), with Σvᵢ = 2.215. We get df = 2.215²/0.1818 = 27, instead of 43 df in the analysis of variance, which assumes that the σᵢ² are equal.

. . . P > 0.25 from table A 14, part II. The hypothesis of a linear relation is accepted. (A further test on these data is made in section 19.4.)

As mentioned in section 9.15, a weighted regression is sometimes used with a single sample. In the population model Yᵢ = α + βXᵢ + εᵢ, the investigator may have reason to believe that V(εᵢ) is not constant but is of the form λᵢσ², where λᵢ is known, being perhaps a simple function of Xᵢ. In this situation the general principle of least squares chooses a and b by minimizing
TABLE 12.11.3
Subdivision of Rates Sum of Squares into Regression and Deviations

Source of Variation    df    Sum of Squares    Mean Square      F
Linear regression       1        15,700          15,700       102.6
Deviations              2           393             196         1.28
Within rates           37         5,651             153
Σ_{i=1}^n wᵢ(Yᵢ − a − bXᵢ)²        (12.11.3)
where wᵢ = 1/λᵢ. The principle is to weight inversely as the residual variances. This principle leads to the estimates b = Σwᵢxᵢyᵢ/Σwᵢxᵢ² and a = ȳ − bx̄, where ȳ, x̄ are weighted means and the deviations xᵢ, yᵢ are taken from the weighted means. Furthermore, if the model holds, the residual mean square

Σwᵢ(Yᵢ − Ŷᵢ)²/(n − 2)        (12.11.4)

is an unbiased estimate of σ² with (n − 2) df, while V(b) = σ²/Σwᵢxᵢ². Thus the standard tests of significance and methods for finding confidence limits can be used if the λᵢ are known.

EXAMPLE 12.11.1—The fact that in table 12.11.1 each rate is twice the preceding rate might suggest that a linear regression of the Yᵢ on the logs of the rates or on the numbers 1, 2, 3, 4 was anticipated. Fit a weighted regression of the Yᵢ on Xᵢ = 1, 2, 3, 4. Does the reduction in sum of squares due to regression exceed 15,700? Ans. No. The reduction is practically the same—15,684. On this X-scale, however, the slope is steeper between X = 3 and X = 4.
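A minimal sketch of the weighted fit; since table 12.11.1 is not reproduced here, the data and weights below are hypothetical, chosen only to illustrate the formulas:

```python
# Weighted least squares as in (12.11.3): minimize sum of w_i (Y_i - a - b X_i)^2
X = [1, 2, 3, 4]                 # hypothetical X values
Y = [12.0, 15.5, 21.0, 30.5]     # hypothetical responses
lam = [1, 2, 4, 8]               # hypothetical lambda_i: V(e_i) = lambda_i * sigma^2
w = [1 / l for l in lam]         # weight inversely as the residual variance

sw = sum(w)
xbar = sum(wi * xi for wi, xi in zip(w, X)) / sw   # weighted means
ybar = sum(wi * yi for wi, yi in zip(w, Y)) / sw

# b = sum w x y / sum w x^2, with deviations taken from the weighted means
b = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, X, Y))
     / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, X)))
a = ybar - b * xbar
print(round(a, 3), round(b, 3))
```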
12.12—Testing effects suggested by the data. The rule that a comparison L is declared significant at the 5% level if L/s_L exceeds t₀.₀₅ is recommended for any comparisons that the experiment was designed to make. Sometimes in examining the treatment means we notice a combination that we did not intend to test but which seems unexpectedly large. If we construct the corresponding L, use of the t test for testing L/s_L is invalid, since we selected L for testing solely because it looked large.

Scheffé (11) provides a conservative test for this situation that works both for equal and unequal nᵢ. With a means, the rule is: Declare |L|/s_L significant only if it exceeds √[(a − 1)F₀.₀₅], where F₀.₀₅ is the 5% level of F for degrees of freedom ν₁ = a − 1, ν₂ = N − a, which equals a(n − 1) in classes with equal n. In more complex experiments, ν₂ is the number of error degrees of freedom provided by the experiment. Scheffé's test agrees with the t test when a = 2. It requires a substantially higher value of L/s_L for statistical significance when a > 2—for example, √[(a − 1)F] = 3.08 when a = 5 and 4.11 when a = 10, as against 1.96 for t in experiments with many error degrees of freedom.

Scheffé's test allows us to test any number of comparisons that are picked out by inspection. The probability of finding an erroneous significant result (Type I error) in any of these tests is at most 0.05. A test with this property is said to control the experimentwise error rate at 5%. This test supplies the protection against Type I error that the investigator presumably wants when a comparison catches the eye simply because it looks unexpectedly large. The penalty in having a stiffer rule before declaring a result significant is, of course, loss of power relative to the t test, a penalty that the investigator may be willing to incur when faced with a result puzzling and unexpected.

It has been suggested that control of the experimentwise error rate is appropriate in exploratory studies in which the investigator has no comparisons planned in advance but merely wishes to summarize what the data seem to suggest. Even in such studies, however, the analysis often concentrates on seeking some simple pattern or structure in the data that seems plausible rather than on picking out comparisons that look large. Thus, experimentwise control may give too much protection against Type I errors at the expense of the powers of any tests made.
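The Scheffé multiplier is easily computed directly; the sketch below reproduces the values quoted above for experiments with many error degrees of freedom:

```python
from math import sqrt
from scipy import stats

def scheffe_critical(a, df_error):
    # Declare |L|/s_L significant if it exceeds sqrt((a - 1) * F_0.05)
    f05 = stats.f.ppf(0.95, a - 1, df_error)
    return sqrt((a - 1) * f05)

for a in (2, 5, 10):
    print(a, round(scheffe_critical(a, 10000), 2))
# a = 2 gives 1.96 (same as t); a = 5 gives 3.08; a = 10 gives 4.11
```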
12.13—Inspection of all differences between pairs of means. Sometimes the classes are not ordered in any way, for example, varieties of a crop or different makes of gloves. In such cases the objective of the statistical analysis is often either (i) to rank the class means, noting whether the means ranked at the top are really different from one another and from the rest of the means, or (ii) to examine whether the class means all have a common μ or whether they fall into two or more groups having different μs. For either purpose, initial tests of the differences between some or all pairs of means may be helpful. In the doughnut data in table 12.2.1, the means for the four fats (in increasing order) are as follows:

Fat                     4     1     3     2
Mean grams absorbed    62    72    76    85        (LSD = 12.1)
                       ---------
                             ---------
                                   ---------
The standard error of the difference between two means, √(2s²/n), is ±5.80 with 20 df (table 12.2.1). The 5% value of t with 20 df is 2.086. Hence the difference between a specific pair of means is significant at the 5% level if it exceeds (2.086)(5.80) = 12.1. The highest mean, 85 for fat 2, is significantly greater than the means 72 for fat 1 and 62 for fat 4. The mean 76 for fat 3 is significantly greater than the mean 62 for fat 4. None of the other three differences between pairs reaches 12.1. The solid lines in the table above connect pairs of means judged not significantly different. The quantity 12.1, which serves as a criterion, is called the least significant difference (LSD). Similarly, 95% confidence limits for the population difference between any pair of means are given by adding ±12.1 to the observed difference.

Objections to use of the LSD in multiple comparisons have been raised for many years. Suppose that all the population means μᵢ are equal so that there are no real differences. Five classes, for instance, have ten possible comparisons between pairs of means. The probability that at least one of the ten exceeds the LSD is bound to be greater than 0.05; it can be shown to be about 0.29. With ten means (45 comparisons among pairs), the probability of finding at least one significant difference is about 0.63, and with 15 means it is around 0.83. When the μᵢ are all equal, the LSD method still has the basic property of a
test of significance, namely, about 5% of the tested differences will erroneously be declared significant. The trouble is that when many differences are tested in an experiment, some that appear significant are almost certain to be found. If these are reported and attract attention, the test procedure loses its valuable property of protecting the investigator against making erroneous claims.

TECHNICAL TERMS
comparison
contrast
experimentwise error rate
fixed effects model
least significant difference
model I
multiple comparison
one-way classifications
orthogonal
test of linearity

REFERENCES
1. Lowe, B. 1935. Data, Iowa Agric. Exp. Stn.
2. Richardson, R., et al. J. Nutr. 44 (1951):371.
3. Snedecor, G. W. 1934. Analysis of Variance and Covariance. Ames.
4. Fisher, R. A., and Yates, F. 1938. Statistical Tables. Oliver & Boyd, Edinburgh.
5. Query. Biometrics 5 (1949):250.
6. Hansberry, T. R., and Richardson, C. H. Iowa State Coll. J. Sci. 10 (1935):27.
7. Rothamsted Experimental Station Report. 1936, p. 289.
8. Rothamsted Experimental Station Report. 1937, p. 212.
9. Box, G. E. P. Ann. Math. Stat. 25 (1954):290.
10. Statistical Abstract of the United States. U.S. Bur. of Census 152 (1977).
11. Scheffé, H. 1959. The Analysis of Variance. Wiley, New York.
13 Analysis of Variance: The Random Effects Model
13.1—Model II: random effects. With some types of single classification data, the model used and the objectives of the analysis differ from those under model I. Suppose that we wish to determine the average content of some chemical in a large population of leaves. We select a random sample of a leaves from the population. For each selected leaf, n independent determinations of the chemical content are made, giving N = an observations in all. The leaves are the classes, and the individual determinations are the members of a class. In model II, the chemical content found for the jth determination from the ith leaf is written as

Xᵢⱼ = μ + Aᵢ + εᵢⱼ        (i = 1, . . ., a; j = 1, . . ., n)        (13.1.1)

where Aᵢ ~ N(0, σ²_A), εᵢⱼ ~ N(0, σ²). The symbol μ is the mean chemical content of the population of leaves—the quantity to be estimated. The symbol Aᵢ represents the difference between the chemical content of the ith leaf and the average content over the population. By including this term, we take into account the fact that the content varies from leaf to leaf. Every leaf in the population has its value of Aᵢ, so we may think of Aᵢ as a random variable with a distribution over the population. This distribution has mean 0, since the Aᵢ are defined as deviations from the population mean. In the simplest version of model II, it is assumed in addition that the Aᵢ are normally distributed with variance σ²_A. Hence, we have written Aᵢ ~ N(0, σ²_A).

What about the term εᵢⱼ? This term is needed because (i) the determination is subject to an error of measurement; and (ii) if the determination is made on a small piece of the leaf, its content may differ from that of the leaf as a whole. The εᵢⱼ and the Aᵢ are assumed independent. The further assumption εᵢⱼ ~ N(0, σ²) . . .

. . . P > 0.10. If the suspected outlier is the lowest observation, the criterion is (X₂ − X₁)/(Xₙ₋₁ − X₁).
Two suspicious outliers may be tested by repeated use of either of these tests. First, test the most extreme value; then test the second most extreme value in the sample of size (n − 1) formed by omitting the most extreme value. Make both tests no matter what the verdict of the first test is, since a second outlier may mask a first outlier if both lie on the same side. Consider the data (n = 10):

−1.88, −0.85, −0.50, −0.42, −0.26, 0.02, 0.24, 0.44, 4.74, 5.86

For the most extreme value, Dixon's criterion is (X₁₀ − X₉)/(X₁₀ − X₂) = (5.86 − 4.74)/[5.86 − (−0.85)] = (5.86 − 4.74)/(5.86 + 0.85) = 0.17—nowhere near the 5% level. But the second extreme gives (4.74 − 0.44)/(4.74 + 0.85) = 0.769. For n = 9, the 1% level of Dixon's criterion is 0.635. Thus, both are regarded as outliers. The MNR method gives the same results.

Since we are now making more than one test, the levels in tables A 15(i) and A 16 must be changed to keep the probability of finding no spurious outlier at the
5% or 1% levels. Rosner (8) has given the 5%, 1%, and 0.5% levels for the ESD (MNR) method with two outliers. Rather than test and delete observations from the most extreme inwards, the masking effect can be avoided by testing points from the least extreme to the most extreme. If a point is determined to be an outlier, it is deleted from the sample with all points more extreme than it. Hawkins (9) discusses how to achieve a constant α level using this procedure.
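A sketch of the two-outlier check on these data, assuming Dixon's ratios as defined above (significance levels must still be read from the printed tables):

```python
def dixon_high(x):
    # ratio (X_n - X_{n-1}) / (X_n - X_2) for a suspected high outlier
    x = sorted(x)
    return (x[-1] - x[-2]) / (x[-1] - x[1])

data = [-1.88, -0.85, -0.50, -0.42, -0.26, 0.02, 0.24, 0.44, 4.74, 5.86]

r1 = dixon_high(data)        # about 0.17 — not significant; 5.86 is "masked"
r2 = dixon_high(data[:-1])   # 0.769 for n = 9, beyond the 1% level 0.635
print(round(r1, 3), round(r2, 3))
# Testing the second extreme exposes both 4.74 and 5.86 as outliers.
```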
15.5—Suspected outliers in one-way or two-way classifications. If Xu is the y'th observation in the z'th class in a one-way classification, its residual from the standard model is du = Xjj — XL, since Xtj = X(. If we expect that within-class variances will be heterogeneous, we may decide to test a large residual by comparing it only with other residuals in the same class. Then table A 15(z) or A 16 can be used, since we are dealing with a single sample. If within-class variances look homogeneous, however, a more powerful test is obtained by defining the MNR as
MNR = max |dij| / √(residual SS)    (15.5.1)
Regarding the distribution of this MNR, any dij is correlated with the other dik in its class but uncorrelated with the dlm in other classes. Its significance levels in table A15 lie between those in column (i) MNR and those in column (ii) MNR, which would apply if all dij were mutually independent. In this use of table A15, note that n is the total number of observations over all classes. The significance levels are usually much closer to column (ii) than to column (i). With 5 classes and 4 in a class, we read table A15 at n = 20.

In a two-way classification with one observation per cell, the residuals that we examine when checking on the presence of extreme values are

dij = Xij - X̄i. - X̄.j + X̄..

Table A17 gives the 5% and 1% levels of the MNR = max |dij| / √(ΣΣd²ij) for a two-way classification, computed by Stefansky (6). The table is indexed by the number of rows and columns in the two-way classification. As an example of its use, we take an experiment (10) in which the gross error is an impossible value that should have been spotted at once and rejected but was not. The data are ratios of dry weight to wet weight of wheat grain, which of course cannot exceed 1. In table 15.5.1 the MNR test detects this outlier, 1.04 (P < 0.01).

If an observation is clearly aberrant but no explanation can be found, it is not obvious what is best to do. Occasionally, the aberrant observation can be repeated; a subject who gave an aberrant reading under treatment 3 might take the treatment again on a later occasion with only a minor departure from the experimental plan. If it is decided to treat the outlier as a missing observation, this decision and the value of the outlier should be reported.
TABLE 15.5.1
Ratio of Dry to Wet Grain in an Experiment with Four Treatments, Four Blocks

         Data (Nitrogen Applied)          Residuals (Nitrogen Applied)
Block    None   Early   Middle   Late     None     Early    Middle    Late
1        0.72   0.73    0.73     0.79      0.021   -0.079    0.011     0.046
2        0.72   0.78    0.72     0.72      0.029   -0.021    0.009    -0.016
3        0.70   1.04    0.76     0.76     -0.071    0.159   -0.031    -0.056
4        0.73   0.76    0.74     0.78      0.021   -0.059    0.011     0.026

Residual SS = ΣΣd²ij = 0.0497        max |dij| = 0.159
MNR = 0.159/√0.0497 = 0.713          (1% level = 0.665)
The investigator should also analyze the data with the outlier present to learn which conclusions are affected by presence or absence of the outlier. Sometimes an investigator knows from past experience that occasional wild observations occur, though the process is otherwise stable. Except in such cases, statisticians warn against automatic rejection rules based on tests of significance, particularly if there appear to be several outliers. The apparent outliers may reflect distributions of the observations that are skew or have long tails and are better handled by methods being developed for nonnormal distributions. Ferguson (11) has noted that the tests for skewness and kurtosis (sections 5.13, 5.14) are good tests for detecting the presence of numerous outliers. Barnett and Lewis (12) give an excellent review of techniques for detecting outliers. Robust estimation techniques have also been used in detecting aberrant observations. Such techniques automatically give less weight to data values that are unusual in relation to the bulk of the data. Books by Hampel, Ronchetti, Rousseeuw, and Stahel (13) and Huber (14) contain details of robust estimation for the one-sample case as well as the more general regression problem.
15.6—Correlations between the errors. If care is not taken, an experiment may be conducted in a way that induces positive correlations between the errors for different replicates of the same treatment. In an industrial experiment, all the replications of a given treatment might be processed at the same time by the same technicians to cut down the chance of mistakes or to save money. Any differences that exist between the batches of raw materials used with different treatments or in the working methods of the technicians create positive correlations within treatments. In the simplest case these situations are represented mathematically by supposing an intraclass correlation ρI between any pair of errors within the same treatment. In the absence of real treatment effects, the mean square between treatments is an unbiased estimate of σ²[1 + (n - 1)ρI], where n is the number of replications, while the error mean square is an unbiased estimate of σ²(1 - ρI).
To sum up: Recall that the residuals dij in the analysis of variance of the Xij are the deviations of the Xij from an additive model. If we actually have a quadratic nonadditive model, the dij would be expected to have a linear regression on the product d̂id̂j = (X̄i. - X̄..)(X̄.j - X̄..), since the latter is the sample estimate of the nonadditive component. Tukey gave two further valuable results: (i) The regression coefficient of dij on d̂id̂j is approximately B = (1 - p)/X̄.., so the power p to which Xij must be raised to produce additivity is estimated by p̂ = 1 - BX̄... Anscombe and Tukey (18) warn, however, that a body of data rarely determines p closely. Try several values of p, including any suggested by subject matter knowledge. (ii) Tukey gives a test of the null hypothesis that the population value of B is zero. For this, we perform a test of the regression coefficient B. From the residual sum of squares in the analysis of variance of X we subtract a term with 1 df that represents the regression of dij on d̂id̂j. Table 15.8.1 shows this partition of the residual sum of squares and the resulting F test of B. The table uses the results: . . .

b2 = [(Σx2y)(Σx1²) - (Σx1y)(Σx1x2)]/D    (17.2.10)
The numerical example in table 17.2.1 comes from an investigation (1) of the sources from which corn plants in various Iowa soils obtain their phosphorus. The concentrations in parts per million (ppm) of inorganic (X1) and organic (X2) phosphorus in the soils and the phosphorus content Y of corn grown in the soils were measured for 17 soils. The observations, means, sums of squared deviations, and sums of cross products of deviations are also given in table 17.2.1. Substitution of the numerical values into (17.2.8) through (17.2.10) gives

D = (1519.3)(2888.5) - 835.7² = 3,690,104

b1 = [(1867.4)(2888.5) - (757.5)(835.7)]/3,690,104 = 1.2902    (17.2.11)
b2 = [(757.5)(1519.3) - (1867.4)(835.7)]/3,690,104 = -0.1110    (17.2.12)
From (17.2.4), b0 = 76.18 - (1.2902)(11.07) - (-0.1110)(41.18) = 66.47
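These figures are easy to verify by machine. The following minimal sketch simply restates (17.2.8) through (17.2.10) and (17.2.4) with the tabulated sums of squares and products; the variable names are ours, not part of the original analysis:

```python
# Two-variable regression coefficients from sums of squares and products
# (phosphorus data of table 17.2.1).
Sx1x1, Sx2x2, Sx1x2 = 1519.3, 2888.5, 835.7   # Σx1², Σx2², Σx1x2
Sx1y, Sx2y = 1867.4, 757.5                    # Σx1y, Σx2y
x1bar, x2bar, ybar = 11.07, 41.18, 76.18      # sample means

D = Sx1x1 * Sx2x2 - Sx1x2**2                  # 3,690,104
b1 = (Sx1y * Sx2x2 - Sx2y * Sx1x2) / D        # 1.2902
b2 = (Sx2y * Sx1x1 - Sx1y * Sx1x2) / D        # -0.1110
b0 = ybar - b1 * x1bar - b2 * x2bar           # 66.47
print(D, b1, b2, b0)
```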
TABLE 17.2.1
Inorganic Phosphorus X1, Organic Phosphorus X2, and Estimated Plant-Available Phosphorus Y in 17 Iowa Soils at 20°C (ppm)

Soil Sample    X1      X2      Y       Ŷ       Y - Ŷ
 1             0.4     53      64      61.1     2.9
 2             0.4     23      60      64.4    -4.4
 3             3.1     19      71      68.4     2.6
 4             0.6     34      61      63.5    -2.5
 5             4.7     24      54      69.9   -15.9
 6             1.7     65      77      61.4    15.6
 7             9.4     44      81      73.7     7.3
 8            10.1     31      93      76.1    16.9
 9            11.6     29      93      78.2    14.8
10            12.6     58      51      76.3   -25.3
11            10.9     37      76      76.4    -0.4
12            23.1     46      96      91.2     4.8
13            23.1     50      77      90.7   -13.7
14            21.6     44      93      89.4     3.6
15            23.1     56      95      90.0     5.0
16             1.9     36      54      64.9   -10.9
17            29.9     51      99      99.4     0.4
Total        188.2    700    1295    1295.0     0.0
Mean         11.07   41.18   76.18

Σx1² = 1519.3     Σx1x2 = 835.7    Σx2² = 2888.5
Σx1y = 1867.4     Σx2y = 757.5     Σy² = 4426.5
The estimated equation is

Ŷ = 66.47 + 1.2902X1 - 0.1110X2    (17.2.13)
The Ŷi for a particular vector (Xi1, Xi2) is the estimated mean of Y for the distribution of outcomes associated with (Xi1, Xi2). One can also consider Ŷi to be the prediction for the Yi observed at (Xi1, Xi2). Therefore, the Ŷ values are sometimes called predicted values. Most often, they are called, simply, the Y-hat values. The value of Ŷ is given for each soil sample in table 17.2.1. For instance, for soil 1,

Ŷ1 = 66.47 + (1.2902)(0.4) + (-0.1110)(53) = 61.1 ppm

The observed Y = 64 ppm deviates by +2.9 ppm from the estimate. The residuals Y - Ŷ in table 17.2.1 measure the deviations of the sample points from the regression estimates. The estimates and corresponding deviations are considered further in later sections.

The model with two X-variables and a limited sample of observations provides an introduction to the calculations associated with multiple regression and also demonstrates that with larger models and samples the calculations will be considerable. Almost always such calculations will be done on computers by software designed for such work. The purpose here is to concentrate on the meaning and use of the statistics and to outline the computations that the computers perform. The use of matrix notation is now common in describing the calculations involved in fitting large linear models. An appendix to this chapter provides an introduction to the notation. Some familiarity with matrix operations will allow access to many applications in the literature.

The model with two X-variables (17.2.1) is now extended to k variables in n observations. The model equation for the ith observation is
Yi = β0 + β1Xi1 + β2Xi2 + · · · + βkXik + εi    (17.2.14)
There are n such equations, one for each observation, and these can be summarized in matrix notation as

Y = Xβ + ε    (17.2.15)
where Y is an n × 1 vector, X is an n × (k + 1) matrix, β is a (k + 1)-dimensional column vector, and ε is an n-dimensional column vector. That is,

| Y1 |   | 1  X11  X12  . . .  X1k | | β0 |   | ε1 |
| Y2 | = | 1  X21  X22  . . .  X2k | | β1 | + | ε2 |
| .  |   | .   .    .           .  | | .  |   | .  |
| Yn |   | 1  Xn1  Xn2  . . .  Xnk | | βk |   | εn |
The elements of the vector Y are the observed responses, and the elements of the X matrix are the observed values of the explanatory variables. The method of least squares seeks estimates of the βs that minimize the sum of squared deviations between the fitted and observed responses. The method leads to a system of normal equations

X'Xb = X'Y    (17.2.16)
The X'X matrix (pronounced X prime X or X transpose X) is

        |  n      ΣXi1       ΣXi2      . . .   ΣXik    |
        | ΣXi1    ΣXi1²      ΣXi1Xi2   . . .   ΣXi1Xik |
X'X =   | ΣXi2    ΣXi1Xi2    ΣXi2²     . . .   ΣXi2Xik |
        |  .        .          .                 .     |
        | ΣXik    ΣXi1Xik    ΣXi2Xik   . . .   ΣXik²   |

The entries in X'X are sums of squares and sums of products of the X-variables. The matrix is symmetric with sums of squares on the diagonal and sums of products as off-diagonal entries. The X-variable associated with β0 is always 1, so the entry in the upper left of X'X is the sum of squares of n ones. The entries in X'Y are sums of products of X-variables with the Y-variable:

        |  ΣYi    |
X'Y =   | ΣXi1Yi  |
        |   .     |
        | ΣXikYi  |
The estimate of β is the solution to the system (17.2.16) and is

b = (X'X)⁻¹X'Y    (17.2.17)
The solution assumes that the inverse of X'X is defined. This assumption will be retained throughout. The normal equations are sometimes reduced to a smaller set of equations by expressing the Y- and X-variables as deviations from their respective means. The reduced normal equations are

(x'x)br = x'y    (17.2.18)
and their solution is

br = (x'x)⁻¹x'y    (17.2.19)

where
        | Σx1²     Σx1x2    . . .   Σx1xk |
x'x =   | Σx1x2    Σx2²     . . .   Σx2xk |
        |   .        .                .   |
        | Σx1xk    Σx2xk    . . .   Σxk²  |

        | Σx1y |           | b1 |
x'y =   | Σx2y |     br =  | b2 |
        |  .   |           | .  |
        | Σxky |           | bk |

and Σxiy is an abbreviation for the sum of products Σ xti yt over the n observations. The system (17.2.17) contains k + 1 equations and the unknowns (b0, b1, . . . , bk), while the system (17.2.18) contains k equations and the unknowns (b1, b2, . . . , bk). The latter system has better numerical properties and is the system actually used by most computer software. The equation to estimate β0 is eliminated when the variables are expressed as deviations. The value of b0 is defined by the equation

b0 = Ȳ - b1X̄1 - b2X̄2 - · · · - bkX̄k    (17.2.20)
The vector containing the n estimated values is Ŷ = Xb, where b is defined in (17.2.17). Because b = (X'X)⁻¹X'Y, the expression for Ŷ can be written as

Ŷ = X(X'X)⁻¹X'Y

The predicted value corresponding to the ith observation is

Ŷi = b0 + b1Xi1 + b2Xi2 + · · · + bkXik    (17.2.21)
The vector of differences between the observed and predicted values, often called the vector of residuals, is Y - Ŷ = Y - Xb. The vector of residuals can also be written as

Y - Ŷ = [I - X(X'X)⁻¹X']Y    (17.2.22)
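For readers following along by machine, here is a minimal numpy sketch of (17.2.16) through (17.2.22) on a small subset of the phosphorus observations; the subset and variable names are ours, chosen only to keep the example short:

```python
import numpy as np

# Five observations (soils 1, 3, 7, 12, 17 of table 17.2.1); columns 1, X1, X2.
X = np.column_stack([np.ones(5),
                     [0.4, 3.1, 9.4, 23.1, 29.9],
                     [53.0, 19, 44, 46, 51]])
Y = np.array([64.0, 71, 81, 96, 99])

b = np.linalg.solve(X.T @ X, X.T @ Y)    # normal equations (17.2.16)-(17.2.17)
H = X @ np.linalg.inv(X.T @ X) @ X.T     # projection matrix X(X'X)⁻¹X'
d = (np.eye(5) - H) @ Y                  # residuals, equation (17.2.22)
print(b)
print(d, Y - X @ b)                      # the two residual forms agree
```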
Using the data from table 17.2.1, the reduced normal equations are

| 1519.3    835.7 | | b1 |   | 1867.4 |
|  835.7   2888.5 | | b2 | = |  757.5 |

and the solution is

| b1 |   |  0.000783   -0.000226 | | 1867.4 |   |  1.2902 |
| b2 | = | -0.000226    0.000412 | |  757.5 | = | -0.1110 |    (17.2.23)
If the full normal equations are used, the system is

|  17.0    188.2      700.0 | | b0 |   |  1295.0 |
| 188.2   3602.78    8585.1 | | b1 | = | 16203.8 |
| 700.0   8585.1    31712.0 | | b2 |   | 54081.0 |

and the solution is

| b0 |   |  0.646369    0.000660   -0.014446 | |  1295.0 |   | 66.4654 |
| b1 | = |  0.000660    0.000783   -0.000226 | | 16203.8 | = |  1.2902 |    (17.2.24)
| b2 |   | -0.014446   -0.000226    0.000412 | | 54081.0 |   | -0.1110 |
Observe that the lower right 2 × 2 submatrix of the 3 × 3 inverse matrix is equal to the inverse associated with the reduced normal equations. Also, the estimated coefficients are identical to those calculated in (17.2.11) and (17.2.12). Some properties of the residuals are discussed in the next section.
17.3—The analysis of variance. In the multiple regression model (17.2.14), the deviations of the Ys from the population regression plane are normal random variables with mean zero and variance σ²y·x. An unbiased estimator of σ²y·x is

s²y·x = Σ(Yi - Ŷi)²/(n - k - 1) = (Y - Ŷ)'(Y - Ŷ)/(n - k - 1)    (17.3.0)
where n is the number of observations and k + 1 is the number of βs estimated in fitting the model. Expression (17.3.0) is the generalization of expression (9.3.7) to the model with more than one independent variable. If k = 1, then k + 1 = 2 and expression (17.3.0) reduces to (9.3.7). In the phosphorus example, n = 17, k + 1 = 3, β' = (β0, β1, β2), and (n - 1 - k) is 14.

The residual sum of squares, Σ(Yi - Ŷi)², also called the error sum of squares, can be computed in two ways. If the individual deviations are available, as in the last column of table 17.2.1, then their sum of squares can be found directly. In the example, Σ(Yi - Ŷi)² = 2101.3. A second method of computing Σ(Yi - Ŷi)² uses the definition of residuals. Expressing the X-variables as deviations from their means, the estimated value for the ith observation is

Ŷi = Ȳ + b1xi1 + b2xi2 + · · · + bkxik    (17.3.1)
Because the sample means of the xij are zero, the sample mean of the Ŷ values is Ȳ, and the sample mean of the residuals is zero. Let ŷi = Ŷi - Ȳ and denote the residuals by di = Yi - Ŷi. Then

yi = Yi - Ȳ = ŷi + di    (17.3.2)
An important result, proved later in this section, is 2y2 = 2y2 + Xd2
(17.3.3)
or in matrix notation
y'y = y'y + d'd This result states that the sum of squares of the deviations of the Fs from their mean can be split into two parts: (/) the sum of squares of the fitted values correc¬ ted for the mean and (//) the sum of squares of the deviations from the fitted values. The sum of squares, 'Ey2, is often called the sum of squares due to regres¬ sion. Another important result, also proved later in this section, is 'Ey2 — b]Exnyi + b2Exi2yj 4- • • • + bkExikyi
(17.3.4)
or in matrix notation

ŷ'ŷ = br'x'y

Words often used to describe this result are that the sum of squares due to regression is equal to the sum of products of the estimates times the right-hand sides of the normal equations. Using (17.3.4) in (17.3.3), the sum of squared deviations is

Σd² = Σy² - b1Σxi1yi - b2Σxi2yi - · · · - bkΣxikyi    (17.3.5)
or in matrix notation

d'd = y'y - br'x'y

For the phosphorus example,

Σd² = 4426.5 - (1.2902)(1867.4) - (-0.1110)(757.5) = 2101.3

The mean square of the deviations is s²y·x = 2101.3/14 = 150.1. This is the sample estimate of the population variance σ²y·x. The calculations given here are summarized in the analysis of variance of table 17.3.1.

TABLE 17.3.1
Analysis of Variance of Phosphorus Data

Source of Variation    df    Sum of Squares    Mean Square    F
Regression              2        2325.2          1162.6      7.75
Deviations             14        2101.3           150.1
Total                  16        4426.5

Sometimes an investigator is not confident initially that any of the Xs are related to Y. In this event, a test of the hypothesis β1 = β2 = · · · = βk = 0 is helpful. This test is made in table 17.3.1 and is based on the result that, if β1 through βk are zero, the ratio

F = (regression mean square)/(deviations mean square)

is distributed as F with k and (n - k - 1) df. In the phosphorus example, F = 1162.6/150.1 = 7.75 has 2 and 14 degrees of freedom. Table A 14(1) indicates that only 1% of the Fs will exceed 6.51 in this situation when the null hypothesis is true; therefore, the hypothesis that β1 = β2 = 0 is rejected.
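The analysis of variance entries can be reproduced by machine; the following is a minimal sketch, with the data retyped from table 17.2.1 and variable names of our own choosing:

```python
import numpy as np

# Phosphorus data of table 17.2.1.
X1 = np.array([0.4, 0.4, 3.1, 0.6, 4.7, 1.7, 9.4, 10.1, 11.6, 12.6,
               10.9, 23.1, 23.1, 21.6, 23.1, 1.9, 29.9])
X2 = np.array([53, 23, 19, 34, 24, 65, 44, 31, 29, 58,
               37, 46, 50, 44, 56, 36, 51.0])
Y  = np.array([64, 60, 71, 61, 54, 77, 81, 93, 93, 51,
               76, 96, 77, 93, 95, 54, 99.0])

n, k = 17, 2
X = np.column_stack([np.ones(n), X1, X2])
b = np.linalg.lstsq(X, Y, rcond=None)[0]
total_ss = np.sum((Y - Y.mean())**2)             # 4426.5
error_ss = np.sum((Y - X @ b)**2)                # 2101.3
regr_ss = total_ss - error_ss                    # 2325.2
F = (regr_ss / k) / (error_ss / (n - k - 1))     # 7.75
print(regr_ss, error_ss, F)
```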
Proofs of the results (17.3.3) and (17.3.4) are given here for the two-variable model and then extended to the k-variable model. For the two-variable model recall that

ŷ = Ŷ - Ȳ = b1x1 + b2x2
y = ŷ + d
d = y - b1x1 - b2x2

The normal equations (17.2.6) and (17.2.7) may be rewritten in the form
y=y + d d = y - bxxx - b2x2 The normal equations (17.2.6) and (17.2.7) may be rewritten in the form 2x,(.y — bxxx - b2x2) = lxxd = 0
(17.3.6)
2x:2(;; — bxxx — b2x2) = 2 x2d = 0
(17.3.7)
These results show that the deviations d have zero sample correlation with any X-variable. This is not surprising since d represents the part of Y that is not linearly related to either X1 or X2. If (17.3.6) is multiplied by b1, (17.3.7) multiplied by b2, and the two products added together, the result is

Σ(b1x1 + b2x2)d = Σŷd = 0    (17.3.8)
This shows that the deviations are also uncorrelated with the predicted values. The result (17.3.3) is proved by using (17.3.8):

Σy² = Σ(ŷ + d)² = Σŷ² + 2Σŷd + Σd² = Σŷ² + Σd²

The result (17.3.4) can be obtained as follows. Now

Σŷ² = Σ(b1x1 + b2x2)² = b1²Σx1² + 2b1b2Σx1x2 + b2²Σx2²    (17.3.9)
If the first of the normal equations (17.3.6) is multiplied by bx, (17.3.7) multi¬ plied by b2, and the two products are added together, the result is bx2xx + 2bxb2lxxx2 -)- b^Xi — bx2xxy -(- b2lx2y
(17.3.10)
Substituting (17.3.10) into (17.3.9) establishes (17.3.4). In matrix notation, the reduced normal equations corresponding to (17.3.6) and (17.3.7) are

x'xbr = x'y

or

x'(y - xbr) = x'd = 0

This establishes the general result that the sum of cross products of the deviations with every explanatory variable is zero. Result (17.3.4) for the general case is proved by noting that

y'y = (ŷ + d)'(ŷ + d) = ŷ'ŷ + d'ŷ + ŷ'd + d'd
    = ŷ'ŷ + d'xbr + br'x'd + d'd = ŷ'ŷ + d'd    (17.3.11)
The correlation coefficient R between Y and Ŷ is called the multiple correlation coefficient between Y and the Xs. From (17.3.5) and (17.3.8) it follows that Σyŷ = Σŷ², so the sample value of R is always equal to or greater than zero. Furthermore,

R² = (Σyŷ)²/[(Σy²)(Σŷ²)] = (Σŷ²)²/[(Σy²)(Σŷ²)] = Σŷ²/Σy²

In table 17.3.1, R² = 2325.2/4426.5 = 0.53. The value of F can be expressed in terms of R² since Σŷ² = R²Σy² and Σd² = (1 - R²)Σy². Therefore,

F = (n - k - 1)R²/[k(1 - R²)]

The F test is also a test of the hypothesis that the population R = 0.
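As a quick numerical check of this identity against table 17.3.1 (our arithmetic):

```python
# F recovered from R² alone, phosphorus example (n = 17, k = 2).
R2 = 2325.2 / 4426.5                     # Σŷ²/Σy² ≈ 0.525
F = (17 - 2 - 1) * R2 / (2 * (1 - R2))   # ≈ 7.75, as in table 17.3.1
print(R2, F)
```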
17.4—Extension of the analysis of variance. Suppose that we have fitted a regression on k X-variables but think it possible that only p < k of the variables contribute to the regression. We might express this idea by the null hypothesis that βp+1, βp+2, . . . , βk are all zero, making no assumptions about β1, β2, . . . , βp. The analysis of variance in section 17.3 is easily extended to provide an F test of this hypothesis.

The method is as follows. Compute the regression of Y on all k X-variables. Then compute the regression of Y on X1, X2, . . . , Xp, omitting the X-variables with coefficients hypothesized to be zero. For each regression the sum of squared deviations (also called the error sum of squares or the residual sum of squares) will be part of the output of a computer program. Let ESSp denote the residual sum of squares for the regression using p variables and ESSk the corresponding residual sum of squares for the regression using k variables. The model containing k variables is called the full model, and the model containing p variables is called the reduced model. A test of the hypothesis that

βp+1 = βp+2 = · · · = βk = 0
is accomplished by forming

F = [(ESSp - ESSk)/(k - p)] / [ESSk/(n - k - 1)]

If the hypothesis is true, the test statistic is distributed as F with (k - p) and (n - k - 1) degrees of freedom. The calculations are summarized in table 17.4.1.

To illustrate these calculations we use the data in table 17.2.1 expanded to include a third X-variable and an additional Y-variable. The augmented data set is given in table 17.4.2. The variable X3 is another measure of organic phosphorus. The dependent variable Y1 is the plant-available phosphorus measured at a soil temperature of 20°C from table 17.2.1. The other Y-variable will be used in section 17.5. Use of the reduced normal equations (17.2.18) in this example results in three equations to be solved:
b1Σx1² + b2Σx1x2 + b3Σx1x3 = Σx1y
b1Σx1x2 + b2Σx2² + b3Σx2x3 = Σx2y
b1Σx1x3 + b2Σx2x3 + b3Σx3² = Σx3y

b1(1519.30) + b2(835.69) - b3(42.62) = 1867.39
b1(835.69) + b2(2888.47) + b3(2034.94) = 757.47
-b1(42.62) + b2(2034.94) + b3(28963.88) = 338.94

The solution of this set of equations is

(b1, b2, b3) = (1.30145, -0.13034, 0.02277)
The total corrected sum of squares is 4426.5 and the sum of squares due to regression is 2339.3, leaving the error sum of squares ESS3 = 2087.2.

TABLE 17.4.1
Analysis of Variance for Testing a Subset of Coefficients

Source of Variation                                df           Sum of Squares    Mean Square
Residuals using X1, X2, . . . , Xp                 n - p - 1    ESSp
Additional amount due to Xp+1, Xp+2, . . . , Xk
  after X1, X2, . . . , Xp                         k - p        ESSp - ESSk       (ESSp - ESSk)/(k - p)
Residuals using X1, X2, . . . , Xk                 n - k - 1    ESSk

F = (ESSp - ESSk)/[(k - p)s²y·x]        df = (k - p), (n - k - 1)
TABLE 17.4.2
Phosphorus Fractions in Various Calcareous Soils and Estimated Plant-Available Phosphorus at Two Soil Temperatures

              Phosphorus Fractions        Estimated Plant-Available
              in Soil (ppm)*              Phosphorus in Soil (ppm)
Soil Sample   X1      X2     X3           Y1 at 20°C    Y2 at 35°C
 1             0.4    53     158           64             93
 2             0.4    23     163           60             73
 3             3.1    19      37           71             38
 4             0.6    34     157           61            109
 5             4.7    24      59           54             54
 6             1.7    65     123           77            107
 7             9.4    44      46           81             99
 8            10.1    31     117           93             94
 9            11.6    29     173           93             66
10            12.6    58     112           51            126
11            10.9    37     111           76             75
12            23.1    46     114           96            108
13            23.1    50     134           77             90
14            21.6    44      73           93             72
15            23.1    56     168           95             90
16             1.9    36     143           54             82
17            29.9    51     124           99            120

*X1 = inorganic phosphorus by Bray and Kurtz method. X2 = organic phosphorus soluble in K2CO3 and hydrolyzed by hypobromite. X3 = organic phosphorus soluble in K2CO3 and not hydrolyzed by hypobromite.
For this experiment, it was stated that "the primary objective of the present investigation was to determine whether there exists an independent effect of soil organic phosphorus on the phosphorus nutrition of plants" (1). That is, the experimenters wished to know if X2 and X3 are effective in predicting Y after allowing for the effect of X1. We check on the effect of soil organic phosphorus by testing the hypothesis that β2 = β3 = 0 in the model containing X1, X2, and X3. The calculations are summarized in the following analysis of variance:

Source                       df    Sum of Squares    Mean Square    F
Due to X1 alone               1        2295.2
Due to X2, X3 after X1        2          44.1           22.0       0.14
Error                        13        2087.2          160.6
Total                        16        4426.5

The sum of squares when X1 is considered alone is (Σx1y)²/Σx1² = (1867.39)²/1519.3 = 2295.2. The F value for testing β2 = β3 = 0 is 0.14 (P is about 0.87), so these two forms of organic phosphorus show little relation to plant-available phosphorus.

Coming back to our two-variable example of sections 17.2 and 17.3, it is clear that we can use this method to make F tests of the two hypotheses, (i) β1 = 0 and (ii) β2 = 0 (see example 17.4.2). In the next section we demonstrate how to perform t tests of these hypotheses and to construct confidence limits for the βi.
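A hedged sketch of how this full-versus-reduced comparison might be scripted (the data are retyped from table 17.4.2; the helper function is ours):

```python
import numpy as np

X1 = np.array([0.4, 0.4, 3.1, 0.6, 4.7, 1.7, 9.4, 10.1, 11.6, 12.6,
               10.9, 23.1, 23.1, 21.6, 23.1, 1.9, 29.9])
X2 = np.array([53, 23, 19, 34, 24, 65, 44, 31, 29, 58,
               37, 46, 50, 44, 56, 36, 51.0])
X3 = np.array([158, 163, 37, 157, 59, 123, 46, 117, 173, 112,
               111, 114, 134, 73, 168, 143, 124.0])
Y  = np.array([64, 60, 71, 61, 54, 77, 81, 93, 93, 51,
               76, 96, 77, 93, 95, 54, 99.0])

def error_ss(cols):
    # Residual sum of squares after a least-squares fit with an intercept.
    X = np.column_stack([np.ones(17)] + cols)
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    return np.sum((Y - X @ b)**2)

n, k, p = 17, 3, 1
ESSp = error_ss([X1])              # reduced model: X1 only
ESSk = error_ss([X1, X2, X3])      # full model: ESS3 ≈ 2087.2
F = ((ESSp - ESSk) / (k - p)) / (ESSk / (n - k - 1))   # ≈ 0.14
print(ESSp, ESSk, F)
```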
EXAMPLE 17.4.1—The following data come from a study of the relation between the number of moths caught on successive nights in a light trap and two weather variates. The variates are Y = log(number of moths + 1), X1 = maximum temperature (previous day) in °F, X2 = average wind speed. The sums of squares and products of deviations from 73 observations are

Σx1² = 14.03    Σx1x2 = 1.99    Σx1y = 2.07
Σx2² = 2.07     Σx2y = -0.64    Σy² = 3.552
(i) Calculate b1 and b2. Do their signs look reasonable? (ii) Calculate the analysis of variance and test H0: β1 = β2 = 0. (iii) By the extended analysis of variance, perform separate F tests of b1 and b2. Ans. (i) b1 = 0.222, b2 = -0.522. (ii) Analysis of variance:
Source                      df    Sum of Squares    Mean Square    F        P
Regression on X1 and X2      2        0.793            0.396      10.05   0.0001
Residuals                   70        2.759            0.0394
Total                       72        3.552
(iii) Extended analysis of variance:
Source          df    Sum of Squares    Mean Square    F      P
X1 alone         1        0.305
X2 after X1      1        0.488            0.488       12.4   0.0008
Residuals       70        2.759            0.0394

X2 alone         1        0.198
X1 after X2      1        0.595            0.595       15.1   0.0002
Residuals       70        2.759            0.0394

Both b1 and b2 are clearly significant in the presence of the other X-variable.
Several points in example 17.4.1 are worth noting. The F value for testing X1 when X2 is ignored, namely 0.305/0.0394 = 7.7, differs from F = 15.1 when X1 is tested with X2 included. This property is characteristic of multiple regression. The value of b1 changes from 0.148 to 0.222 and b2 from -0.309 to -0.522 when the other X is included in the regression. The bs both increase in absolute size when the other variable is included because the correlations of X1 and X2 with Y have opposite signs, yet X1 and X2 are positively correlated. The important point here is that the sign and size of the coefficient representing the effect of a particular variable in multiple regression depend on the other variables present in the model.

EXAMPLE 17.4.2—In the phosphorus example in tables 17.2.1 and 17.3.1, perform F tests of b1 and b2. Ans. For b1, F = 14.2. For b2, F = 0.2: this type of organic phosphorus seems unrelated to plant-available phosphorus.

EXAMPLE 17.4.3—Sometimes the relation between Y and the Xs is not linear but can be made approximately linear by a transformation of the variable before analysis. Let W be the weight of a hen's egg, L its maximum length, and B its maximum breadth. Schrek (3) noted that if the shape of an egg is an ellipse rotated about its major axis (the length), the relation should be W = cLB², where c is a constant. In the log scale the relationship should be log W = β0 + β1(log L) + β2(log B), and β1
and β2 should approach values of 1 and 2, respectively. The following are the measurements from 10 eggs. To provide easier data for calculation, we have taken Y = 10(log W - 1.6), X1 = 10(log L - 1.7), X2 = 10(log B - 1.6).

X1      X2      Y         X1      X2      Y
0.51    0.42    1.10      0.10    0.62    0.76
0.79    0.38    1.03      1.32    0.72    2.23
0.52    0.47    0.66      0.99    0.48    1.37
0.10    0.37    0.33      1.00    0.96    2.42
0.12    0.87    1.37      0.73    1.05    2.42

ΣX1 = 6.18      ΣX2 = 6.34      ΣY = 13.69
Σx1² = 1.625    Σx1x2 = 0.196   Σx2² = 0.573
Σx1y = 2.017    Σx2y = 1.397    Σy² = 5.091
(i) Calculate the sample regression and show the analysis of variance. (ii) Compare the actual and predicted weights for the first and last eggs. Ans. (i) Ŷ = -0.573 + 0.988X1 + 2.100X2. Clearly, Ŷ = -0.517 + 1.0X1 + 2.0X2 would do about as well.

Source               df    Sum of Squares    Mean Square    F
Due to regression     2        4.926            2.463      104.4
Residual              7        0.165            0.0236
Total                 9        5.091
(ii) First egg (on original scale): W = 51.3, Ŵ = 48.0 (this prediction is the poorest of the 10). Last egg: W = 69.5, Ŵ = 68.4.
17.5—Variances and covariances of the regression coefficients. In chapter 9, with only one X-variable, the estimated variance of the slope is V̂(b1) = s²y·x/Σx². See equations (9.2.3) and (9.3.8). With more than one X-variable we estimate a vector of bs. Associated with that vector is a matrix containing the variances and covariances of the bs. The matrix contains the variances of the bs on the diagonal and the covariances among the bs as off-diagonal elements. It is called the covariance matrix of (b1, b2, . . . , bk) and is

(x'x)⁻¹σ²y·x    (17.5.1)
The covariance matrix is estimated by

(x'x)⁻¹s²y·x    (17.5.2)
where s²y·x is defined in (17.3.0). This covariance matrix for the phosphorus example of table 17.2.1 is
               |  0.000783   -0.000226 |           |  0.1175   -0.0340 |
(x'x)⁻¹s²y·x = | -0.000226    0.000412 | (150.1) = | -0.0340    0.0618 |    (17.5.3)
The elements on the diagonal of this matrix are the estimated variances of b1 and b2, respectively. The off-diagonal element is the covariance between b1 and b2. The covariance matrix of the entire vector b = (b0, b1, . . . , bk)' is

V(b) = (X'X)⁻¹σ²y·x    (17.5.4)
This square covariance matrix of order (k + 1) is estimated by

V̂(b) = (X'X)⁻¹s²y·x    (17.5.5)
For the phosphorus example of table 17.2.1,

         | 97.0149    0.0990   -2.1683 |
V̂(b) =   |  0.0990    0.1175   -0.0340 |    (17.5.6)
         | -2.1683   -0.0340    0.0618 |
The upper left element of this matrix is the estimated variance of b0. The lower right 2 × 2 matrix is the same as (17.5.3). The expression (17.5.5) is often abbreviated to V̂, with elements vij. Thus, v00 is the estimated variance of b0, and v12 is the estimated covariance between b1 and b2. The estimated standard errors for the bs are the square roots of the variances and are denoted by sbi. The quantity

t = (bi - βi)/sbi

is distributed as Student's t with (n - k - 1) degrees of freedom, the degrees of freedom used to find s²y·x. The tests for β1 = 0 and β2 = 0 in the phosphorus experiment are

t1 = b1/sb1 = 1.2902/0.3428 = 3.76        P = 0.002
t2 = b2/sb2 = -0.1110/0.2486 = -0.45      P = 0.66

both with 14 degrees of freedom. The t tests are equivalent to the F tests given by the extended analysis of variance (see example 17.4.2, where you should find F = t²). Evidently in the population sampled, the fraction of inorganic phosphorus is the better predictor of the plant-available phosphorus when both inorganic and organic phosphorus are considered together in a model. The experiment indicates "that soil organic phosphorus per se is not available to plants. Presumably, the organic phosphorus is of appreciable availability to plants only upon mineralization, and in the experiments the rate of mineralization at 20°C was too low to be of measurable importance" (1).

The covariance matrix can be used to construct a test of a linear combination of coefficients. Let such a combination be

L'b = Σ Libi
where L' = (L0, L1, . . . , Lk) and b = (b0, b1, . . . , bk)'. Then

V̂(L'b) = L'V̂L = Σ Σ LiLjvij = Σ Li²vii + 2 Σ Σ(i<j) LiLjvij

. . . (xn+1,1, xn+1,2, . . . , xn+1,k). We use s(μ̂y·x,n+1) or s(μ̂y·x) to denote the square root of the estimated variance.
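For readers reproducing the variance computations of section 17.5 by machine, a hedged numpy sketch follows (the data are retyped from table 17.2.1; the variable names are ours):

```python
import numpy as np

# Covariance matrix (17.5.5), standard errors, and t tests for the
# phosphorus regression.
X1 = np.array([0.4, 0.4, 3.1, 0.6, 4.7, 1.7, 9.4, 10.1, 11.6, 12.6,
               10.9, 23.1, 23.1, 21.6, 23.1, 1.9, 29.9])
X2 = np.array([53, 23, 19, 34, 24, 65, 44, 31, 29, 58,
               37, 46, 50, 44, 56, 36, 51.0])
Y  = np.array([64, 60, 71, 61, 54, 77, 81, 93, 93, 51,
               76, 96, 77, 93, 95, 54, 99.0])

X = np.column_stack([np.ones(17), X1, X2])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
s2 = np.sum((Y - X @ b)**2) / (17 - 2 - 1)    # s²y·x = 150.1
V = XtX_inv * s2                              # compare with (17.5.6)
se = np.sqrt(np.diag(V))
print(V)
print(b / se)                                 # t for b1 ≈ 3.76, for b2 ≈ -0.45
```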
When expression (17.8.1) is used to predict the value of Y for a new specimen, the estimated variance of the prediction error is

s²(Ŷn+1) = s²y·x[1 + 1/n + xn+1(x'x)⁻¹x'n+1] = s²y·x[1 + Xn+1(X'X)⁻¹X'n+1]    (17.8.4)
As in section 9.9, we use s(Ŷn+1) to denote the square root of the estimated variance. To illustrate the calculation, suppose that we wish to estimate the population mean of Y at (X1, X2) = (4, 24) for the model of the phosphorus example of table 17.2.1. The matrix (x'x)⁻¹ is given in expression (17.2.23), n = 17, (x1, x2) = (-7.1, -17.2), and sy·x = 12.25. Hence,

s(μ̂y·x) = (12.25)[1/17 + (0.0007828)(7.1²) + (0.0004117)(17.2²) + 2(-0.0002265)(-7.1)(-17.2)]^(1/2) = 4.97 ppm
If the objective is to predict a new observation for (X1, X2) = (4, 24), the standard error of prediction is

s(Ŷn+1) = (12.25)(1 + 1/17 + 0.1059)^(1/2) = 13.22

EXAMPLE 17.8.1—For a soil with the relatively high values of X1 = 30, X2 = 50, calculate the standard errors of (i) the estimate of the population mean, (ii) the prediction of Y for a new individual soil. Ans. (i) 6.65 ppm, (ii) 13.94 ppm.

EXAMPLE 17.8.2—What is the standard error of the predicted value of Y for a new individual soil with X1 = 30 if X1 alone is included in the regression prediction? Ans. 13.62 ppm, slightly smaller than when X2 is included.

EXAMPLE 17.8.3—In the eggs example, 17.4.3, find the standard error of log W for the predicted value for a new egg that has the sample average values of X1 and X2. Note that Y = 10(log W - 1.6) and that sy·x is given in the answer to example 17.4.3. Ans. 0.0161.
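A minimal sketch of these two standard-error calculations (our code, using the rounded values quoted above):

```python
import numpy as np

# Standard errors at (X1, X2) = (4, 24), using the inverse matrix from
# (17.2.23) and s_y·x = 12.25, as rounded in the text.
xx_inv = np.array([[0.0007828, -0.0002265],
                   [-0.0002265, 0.0004117]])
x_new = np.array([-7.1, -17.2])          # deviations of (4, 24) from the means
s = 12.25
q = x_new @ xx_inv @ x_new               # x(x'x)⁻¹x' ≈ 0.106
print(s * np.sqrt(1/17 + q))             # ≈ 4.97 ppm, s.e. of estimated mean
print(s * np.sqrt(1 + 1/17 + q))         # ≈ 13.22 ppm, s.e. of prediction
```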
17.9—Interpretation of regression coefficients. In observational studies, multiple regression analyses are used extensively in attempts to disentangle and measure the effects of different X-variables on some response Y. However, there are important limitations on what can be learned by this technique. In a multiple regression, as we have seen, b1 is the univariate regression of Y on the deviations of X1 from its linear regression on X2, X3, . . . , Xk. When the X-variables are highly intercorrelated, both the observed size of b1 and its meaning may be puzzling. To illustrate, table 17.9.1 shows three variables: Y = gross national product, X1 = total number employed, X2 = total population (excluding institutions) aged 14 years or over. The data are for the 16 years from 1947 to 1962. These data were taken from Longley's (6) study of the performance of computer programs in multiple regression when there are some high intercorrelations. Our data have been rounded. High correlations are r1y = 0.983, r2y = 0.986, r12 = 0.953.
TABLE 17.9.1
Regression of Y = GNP (billions of dollars) on X1 = Total Employment (millions) and X2 = Population Aged 14 and Over (millions), 1947-1962

No.   X1     X2    Y     Ŷ        No.   X1     X2    Y     Ŷ
1     60.3   108   234   246.4     9    66.0   117   396   393.3
2     61.1   109   259   265.0    10    67.9   119   419   434.5
3     60.2   110   258   260.7    11    68.2   120   443   446.3
4     61.2   112   285   289.7    12    66.5   122   445   439.0
5     63.2   112   329   316.6    13    68.7   123   483   476.4
6     63.6   113   347   329.8    14    69.6   125   503   504.1
7     65.0   115   365   364.2    15    69.3   128   518   523.5
8     63.8   116   363   355.9    16    70.6   130   555   556.6

Σx1² = 186.33      Σx1x2 = 341.625     Σx2² = 689.9375
Σx1y = 5,171.95    Σx2y = 9,978.625    Σy² = 148,517.75
We might expect total employment (X1) and the GNP (Y) to be highly correlated, as they are (r1y = 0.983), with b1 = 27.8 billion dollars per million workers when X1 alone is fitted. However, the correlation between the GNP and the total population over 14 is slightly higher still (r2y = 0.986), which is puzzling. Also, when both variates are fitted, the coefficient for employment becomes 13.45 (less than half). Thus the size of a regression coefficient and even its sign can change according to the other Xs that are included in the regression. The prediction equation is

Ŷ = 13.4511X1 + 7.8027X2 - 1407.401

with sy·x = 9.152, about 2.4%. Both b1 and b2 are highly significant. Thus, predictions of the GNP are quite good, but it is not easy to interpret the bs, particularly b2, in any causal sense. Incidentally, the residuals Y - Ŷ calculated from table 17.9.1 show the typical pattern that suggests some year-to-year correlation: four negative values are followed by five positive values, with three negatives at the end. In the phosphorus study, on the other hand, the regression coefficient b1 on inorganic phosphorus changed very little from the univariate to the trivariate regression, giving us more confidence in its numerical value.
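A brief sketch of this coefficient shift, computed from the sums of squares and products beneath table 17.9.1 (variable names ours):

```python
import numpy as np

# Univariate and bivariate coefficients for employment (X1).
S11, S12, S22 = 186.33, 341.625, 689.9375     # Σx1², Σx1x2, Σx2²
S1y, S2y = 5171.95, 9978.625                  # Σx1y, Σx2y

b1_alone = S1y / S11                          # ≈ 27.8
b = np.linalg.solve(np.array([[S11, S12],
                              [S12, S22]]),
                    np.array([S1y, S2y]))     # ≈ (13.45, 7.80)
print(b1_alone, b)
```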
17.10—Omitted X-variables. It is possible that the fitted regression may not include some X-variables that are present in the population regression. These variables may be thought to be unimportant, not feasible to measure, or unknown to the investigator. To take the easiest case, suppose that a regression of Y on X1 alone is fitted to a random sample, when the correct model is

Y = β0 + β1X1 + β2X2 + ε
where the residual ε has mean zero and is uncorrelated with X1 and X2. In repeated samples in which the Xs are fixed,
E(b1) = E(Σx1y/Σx1²) = β1 + β2(Σx1x2/Σx1²) = β1 + β2δ̂21    (17.10.1)

where δ̂21 = Σx1x2/Σx1² is the sample linear regression of X2 on X1. (If X2 has a linear regression on X1 in the population and we average further over random samples, we get E(b1) = β1 + β2δ21.)
In a bivariate regression in which a third variable X3 is omitted,

E(b1) = β1 + β3δ̂31    (17.10.2)

where δ̂31 is the corresponding coefficient of X1 in the sample regression of X3 on X1 and X2. . . .

. . . + (β2 + β3δ32)X2 + β3(X3 - μ3.12) + ε    (2)
TABLE 17.10.1
Prediction from a Regression Model with an Omitted Variable

Observation    X2    Y = 1 + 3X2    X1    Ŷ1      Y - Ŷ1
1               1         4          0     2.5     +1.5
2               2         7          2     9.5     -2.5
3               4        13          3    13.0      0.0
4               6        19          5    20.0     -1.0
5               7        22          5    20.0     +2.0
Total          20        65         15    65.0      0.0
Mean            4        13          3    13.0      0.0

Σx1² = 18    Σx1y = 63    Σx1x2 = 21    b1 = 63/18 = 3.5    δ̂21 = 21/18 = 7/6
It follows that if a trivariate sample regression of Y on X1, X2, and (X3 - μ3.12) is fitted, the sample regression coefficients are unbiased estimates of the population regression coefficients in equation (2). (ii) Write down the normal equations for this trivariate regression and verify that the equations for the coefficients of X1 and X2 are exactly the same as the normal equations for b1 and b2 in a bivariate regression of Y on X1 and X2 with X3 omitted. This shows that in repeated samples with fixed Xs, b1 is an unbiased estimate of β1 + β3δ31, and similarly for b2.
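The arithmetic of table 17.10.1 can be confirmed with a few lines of plain Python (our code): with the true model Y = 1 + 3X2, so that β1 = 0 and β2 = 3, the univariate coefficient on X1 estimates β2δ21 = 3(7/6) = 3.5 rather than β1.

```python
# Verifying table 17.10.1: the coefficient on X1 picks up the effect of
# the omitted variable X2 through their correlation.
X1 = [0, 2, 3, 5, 5]
X2 = [1, 2, 4, 6, 7]
Y = [1 + 3 * x2 for x2 in X2]

n = len(X1)
x1 = [x - sum(X1) / n for x in X1]                                 # deviations
b1 = sum(a * y for a, y in zip(x1, Y)) / sum(a * a for a in x1)    # 63/18 = 3.5
d21 = sum(a * x for a, x in zip(x1, X2)) / sum(a * a for a in x1)  # 21/18 = 7/6
print(b1, 3 * d21)                                                 # both 3.5
```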
17.11—Effects possibly causal. In many problems the variables X1 and X2 are thought to have causal effects on Y. We would like to learn how much Y will be increased or decreased by a given change ΔX1 in X1. The estimate of this amount suggested by a bivariate regression equation is b1ΔX1. As we have seen, this quantity is actually an estimate of (β1 + β3