THE ESSENCE OF BIOMETRY
THE ESSENCE OF BIOMETRY BY
JOHN STANLEY PROFESSOR OF ZOOLOGY, McGILL UNIVERSITY
McGILL UNIVERSITY PRESS MONTREAL 1963
© Copyright, Canada, 1963 by McGILL UNIVERSITY PRESS All rights reserved PRINTED IN THE UNITED KINGDOM
TO PROFESSOR GEORGE J. SPENCER DEPARTMENT OF ZOOLOGY UNIVERSITY OF BRITISH COLUMBIA MY LIFE-LONG FRIEND AND BELOVED TEACHER
PREFACE

This text has grown out of many, many interviews with graduate students who have come seeking help in the understanding of various statistical methods. These students say with monotonous regularity that the texts they have studied are of a high intellectual quality, but that they do not really make the matter clear. In this text, therefore, I have tried to adopt the style I would use in a personal interview. At the risk of seeming wordy, I have made the explanations rather full. Each method is illustrated with a fully worked-out example, and the notations are purposely non-standard. The symbol σ is commonly used for the standard deviation of a sample, but it is often also used, perhaps on the next page, for a standard error. The novice does not always realize that a standard error is a kind of standard deviation, and this bothers him. Therefore, I have used a form of notation, perhaps naive, but so simple that virtually anyone can follow it. The standard deviation is, crudely, SD, and the standard error of the mean of X is SEx̄, and so on. One can get into difficulties here because the standard error of the regression coefficient of Y on X can become SEbyx, but still, cumbersome though it may be, this symbol does say what it means.

I am indebted to Professor Sir Ronald A. Fisher, F.R.S., Cambridge, and to Dr. Frank Yates, F.R.S., Rothamsted, also to Messrs. Oliver and Boyd Ltd., Edinburgh, for permission to reprint Tables II, III, IV, V, VIII, IX and IX2 from their book Statistical Tables for Biological, Agricultural and Medical Research. Professor C. I. Bliss of the Connecticut Agricultural Experiment Station, New Haven, and the editors of the Annals of Applied Biology very kindly gave me permission to make much use of Professor Bliss' work on LD50, for Chapter XIII. The Iowa State University Press has allowed me to reprint (in abridged form) Table 1.3.1 of George W. Snedecor's Statistical Methods (copyright 1956). This forms Table XIX of this text. For this permission, I am most grateful. I have leaned rather heavily, too, on M. J. Moroney's excellent Penguin book, Facts from Figures, for the subject matter of Chapter X.

I am also indebted to the many students who have come to me with problems, because the writing of this book grew out of the discussions we had when I attempted to help them. Last, but by no means least, I am grateful to my wife, who, knowing no biometry, was willing to sit so patiently while I worked out verbally the explanations I have here put in writing.
J.S. Montreal, April 1962
TABLE OF CONTENTS
PREFACE
LIST OF FIGURES
LIST OF TABLES
I      The Theory of Probability
II     Samples, Universes, Distributions
III    The Normal Frequency Distribution
IV     The Basic Concept of Significance in Statistical Analysis
V      The Binomial Distribution
VI     The Poisson Distribution
VII    The Method of Least Squares
VIII   Regression
IX     Correlation
X      Rank Correlation
XI     Contingency or Association
XII    The Analysis of Variance
XIII   Fifty Per Cent Lethal Dose, or LD50
BIBLIOGRAPHY
INDEX
LIST OF FIGURES
1    Frequency distribution of hatching times
2    A basically normal frequency distribution
3    An almost exact normal frequency distribution
4    A binomial frequency distribution
5    Examples of Poisson frequency distributions
6    Data of Table XXII (fitted least-squares line)
7    Raw data and fitted sigmoid curve from Tables XXIV and XXV
8    Straight-line plot of loge(AZ-1) versus T [equation (6)]
9    An isometric three-dimensional projection of the data of Table XXVI
10   Data of Fig. 9
11   Data of Table XXVI
12   Raw data of Table XXVIII
13   Swarms of points as correlation diagrams
14   A generalized dosage-mortality curve
15   Provisional regression line
LIST OF TABLES
I        Composition of the contents of an urn
II       Probabilities and expectations for coins
III      Distribution of hatching times
IV       Lengths of insects
V        Probabilities—Table IV
VI       Almost exact normal frequency distribution
VII      Calculation—statistics of a normal frequency distribution
VIII     Additions to Table VII—fitting of distribution
IX       Ordinates of the normal distribution
X        Chi-squared test applied to Tables VII and VIII
XI       Distribution of chi-squared
XII      Normal probability integral
XIII     Distribution of T
XIV      Variance ratio
XV       Pascal's triangle to K = 12
XVI      Binomial frequency distribution—calculation of ordinates
XVII     Fictitious data of an experimental distribution
XVIII    Chi-squared to test Table XVII
XIX      95 per cent confidence limits—binomial distribution
XX       Fitting of a Poisson distribution
XXI      Chi-squared test—Poisson distribution
XXII     Fictitious data—method of least squares
XXIII    Fictitious data—fitting a straight line
XXIV     Growth of a population—fitting of a sigmoid curve
XXV      Calculations—fitting a sigmoid curve
XXVI     Data of Fig. 9 and Fig. 10
XXVII    Data of Fig. 11
XXVIII   Fictitious data—regression coefficient
XXIX     Calculations for SEbyx, long method
XXX      Calculations for 95 per cent confidence zone—Table XXVIII
XXXI     A partial reprint of Table XXVIII
XXXII    Data of Table XXVII as a correlation table
XXXIII   Pre-fabricated correlation table
XXXIV    Transformation of r to Z
XXXV     Averaging of a set of values of r
XXXVI    Table XXXIII prepared for ranking
XXXVII   Preliminary ranks—converting to assigned ranks
XXXVIII  Pairing off the values of Y and X
XXXIX    Judging of eight products by two judges
XL       Judging of nine products by seven judges
XLI      Calculations leading to the value of W—Table XL
XLII     Paired comparisons of six products
XLIII    Hair colour versus preference for a certain food
XLIV     Observed and calculated cell-frequencies—Table XLIII
XLV      Table similar to Table XLIV but showing Yates' correction
XLVI     Improvement or non-improvement of patients according to hospital floors
XLVII    Weights of fifty animals reared on a certain food
XLVIII   Arrangement of Table XLVII—five lots of ten animals
XLIX     Between-lots sum of squares—Table XLVIII
L        Analysis of variance—Table XLVIII
LI       Moulting of fifty per cent of fifth instar larvae to sixth instar—Series I
LII      Sigmoid dosage-mortality curve transformation to a straight line
LIII     Weighting coefficients and probit values—final adjustments
LIV      Calculating LD50 and byx—Table LI
LV       Moulting of fifty per cent of fifth instar larvae to sixth instar—Series II
LVI      Calculations—Tables LI and LV
LVII     Corrected probit values—one hundred or zero per cent kill
I

THE THEORY OF PROBABILITY

THE WHOLE STRUCTURE of statistical analysis is based on the mathematical theory of probability, and a careful reading of this first chapter will greatly assist in the understanding of material in the subsequent chapters. Unfortunately, it is quite difficult to formulate a satisfactory definition of what is meant by `the probability of the occurrence of an event,' but let us begin with the following clarifications.

Consider an urn containing one white ball and nine black balls, and assume that the contents of this urn have been thoroughly stirred and mixed. Let us dip into the urn without looking at it and without attempting in any way to exercise preference in picking out a particular ball. In other words, let us make what is called a random draw. Under these conditions, we are just as likely to pick one ball as another. If we make an immense number of such random draws, we would expect intuitively that each ball in the urn would appear an equal number of times, assuming of course that we replace the drawn ball and stir well each time before making the next draw. If we make only a few such draws, say 25, there will of course be some small differences in the frequencies with which individual balls appear. Nevertheless, it seems reasonable to suppose that as the number of draws approaches infinity, so will the frequencies of appearance of the individual balls approach equality. Under such conditions, we say that there exists equal likelihood of drawing each ball.

Let us arrange the urn again, stir the contents, and make one random draw. We feel intuitively that we have `one chance out of ten' of lifting out a white ball and `nine chances out of ten' of picking a black one. Any intelligent person would probably agree with this. It is something basic in human thinking. If this is so, we may say that the a priori probability of drawing a white ball from this urn on a single random draw is Pw = 1/10, while the a priori probability that the ball will turn out to be black is Pb = 9/10. These are called a priori probabilities because, knowing the composition of the contents of the urn, we feel that we can calculate the probabilities before we make the draws. There is no easy way to justify this feeling about a priori probabilities—it seems to be instinctive. We can support our intuition by making an immense number of draws and noting that a white ball does turn up in very nearly one-tenth of the draws, and a black ball in close to nine-tenths. This observation may confer a pleasant feeling of cleverness, but it does not really prove anything. Parenthetically, it should be pointed
out that in problems involving geometrical probability there may be several answers, depending upon the initial assumptions as to equal likelihood. One must be very careful in such cases to avoid being led astray by one's intuition. In the simple example of our urn, there is only one possible assumption as to equal likelihood, and, over a long series of random draws, we shall find that Pw is very close to (or equal to) .1, while Pb is similarly close to .9.

Suppose now that the urn is handed to us with the statement that it contains one white ball and nine black balls. If this is true we would expect, on making, say, one hundred random draws, to obtain results very close to ten white balls and ninety black. Suppose that we doubt the statement about the contents, and make the hundred draws, and obtain twenty-one white and seventy-nine black balls in all. This is almost in the ratio of two white to eight black, and represents a considerable divergence from what we would have expected if the statement about the contents were true. Such a divergence from expectation could have come about through the operation of two causes: (a) there really are one white and nine black balls in the urn, but pure chance has intervened to give us the peculiar results we have, or (b) the statement is not true, and there are two white and eight black balls in the urn. We could, in fact, make a statistical test, the so-called chi-squared test, to see which is the more likely reason for the divergence. If this test is made, it will show that the probability of a divergence as large as this arising purely by chance is less than P = .001. That is to say, it is so unlikely that we cannot believe it can have occurred. The probability of the divergence from expectation on the basis of the urn containing three white and seven black is equally small. We conclude therefore that the statement is not true, and that the urn does in fact contain two white and eight black balls. We can now set up new a posteriori probabilities of Pw = .2 for a white ball and Pb = .8 for a black ball. Note that a priori probabilities are set up before making the draws, on the basis of our prior knowledge of the situation, whereas a posteriori probabilities are set up after making the draws, on the basis of the results of the draws.

It should now be realised that in the world of probability (and in statistical analysis) nothing is ever absolutely certain. All that exist are the greater or lesser probabilities that such-and-such is true. We obtain data, and we analyze them. We find that the probabilities that some things are true are so high that for all practical purposes we can assume them to be true. In other cases the probabilities are small, and again, for all practical purposes, we feel that these things are not true. In research, then, we measure our degree of certainty about our conclusions on the basis of probabilities, and we obtain these probabilities by more or less routine mathematical treatment of the raw data. These operations are statistical analysis.

A very intelligent and clear-thinking person can work out many simple problems in probability by intuition, but intuition can be very deceptive. Thus it is that some basic rules have been worked out for the manipulation of probabilities in order to avoid errors. Following some necessary definitions, we shall come to a discussion of these rules.
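Before passing on to these definitions, the divergence just described can be put into figures. The few lines of Python below are an illustration added to this edition, not part of the original text; the chi-squared test itself is not explained until Chapter IV, and the critical value 10.83 (chi-squared for P = .001 with one degree of freedom) is quoted from standard tables.

```python
# One hundred random draws gave 21 white and 79 black balls; the urn was said
# to contain 1 white and 9 black, so we expected 10 white and 90 black.
observed = {"white": 21, "black": 79}
expected = {"white": 10, "black": 90}

# Chi-squared = sum of (observed - expected)^2 / expected over the categories.
chi_sq = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
print(round(chi_sq, 2))   # 13.44

# 13.44 exceeds 10.83, the chi-squared value for P = .001 with one degree of
# freedom, so a divergence this large arises by pure chance less often than
# once in a thousand such experiments.
print(chi_sq > 10.83)     # True
```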
Basic definitions

Event. In the theory of probability, an event is any phenomenon which, a given set of conditions having been arranged and a given set of actions having been carried out, may or may not occur. In our urn example, the conditions were the composition of the contents of the urn, and the actions were the making of a random draw.

Trial. A trial is a set of actions which, when carried out under a given set of conditions, may or may not bring about the occurrence of an event. Trials are almost always `random' trials, that is to say (in simple terms) the person making the trial does not inject any bias into the operation. In drawing a ball out of our urn, no attempt would be made to feel around for a white ball. One would just take any ball.

Success. A trial is said to be a success if, when carried out, the hoped-for event comes about.

Failure. A trial is said to be a failure if the hoped-for event does not come about. Usually failure is the direct opposite of success, but there may be a whole array of possible outcomes. A trial is a success or failure depending upon one's prior thinking. To a marksman, hitting the target is a success; to a man facing a firing squad, missing is a success!

Probability. A rigid definition is difficult to write. One might say that the probability of the success of a trial is a numerical measure of the likelihood that the trial will succeed, but this definition is circular. The usual definition is this: If an event can take place in a total of N ways, of which M lead to a favourable outcome, the probability of success is M/N. In our urn, if we number the individual balls W1, B2, B3, B4, B5, B6, B7, B8, B9, B10, the event of picking a ball can occur in ten ways, of which one can lead to success in getting a white ball, and nine can lead to success in getting a black ball. For a white ball, N = 10, M = 1, and Pw = .1, while Pb = .9.

Probabilities of success and failure. Probabilities are always expressed as numerical values lying between zero and one. The probability of success is always called P, and that of failure is Q. The probability of an event which is certain to occur is P = 1; if it is certain to fail, we have Q = 1. In the urn, if all the balls are black, we are certain to draw a black ball: Pb = 10/10 = 1, Pw = 0, Qb = 0, Qw = 1. Obviously, in all cases, P + Q = 1, P = (1 - Q), and Q = (1 - P). Probability values need not be simple fractions such as one-half or one-quarter. If a rifleman shoots 777 times at a target, and obtains 253 hits, the a posteriori single-shot probability is 253/777 = .32561132 (to eight places).

Mutually exclusive events. Two or more events, any one of which may occur on a single trial, are said to be mutually exclusive if the occurrence of any one of them precludes the occurrence of any other one. In the case of the tossing of coins, the appearance of head or tail forms a set of two mutually-exclusive events. If the coin shows a
head, it cannot show a tail, and vice versa. The concept of mutually exclusive events is very important in the theory of probability.

Equally likely events. Two or more events are said to be equally likely if, as the number of trials is increased without limit, the frequency of occurrence of the events approaches equality. In common practice, events are usually considered to be equally likely unless there is good reason to think otherwise, but one must not make this assumption carelessly. Many paradoxes in the theory of probability are based on incorrect assumptions as to equal likelihood.

Unconditional probability. The probability of the success or failure of a trial is said to be unconditional if that probability is not contingent in any way upon the outcome of some other trial. Thus, if a coin is tossed in the hope that it will show a head, and a die is thrown in the hope that it will show five, the two are quite unconditional because neither of the trials can possibly have any effect upon the other. Problems involving unconditional probability are always easier to solve than those involving conditional probability.

Conditional probability. The probability of the success or failure of a trial is said to be conditional if the success or failure is contingent in some way upon the outcome of some other (and usually prior) trial. Thus, the probability of a man dying of anthrax is contingent upon his first contracting the disease, for which there exists some probability. This probability may itself be contingent, because the man must be in a place in which he might contract anthrax. In more complex cases we may have a situation in which event B can occur with one probability if another event A succeeds, and with a different probability if A fails. There are then four possibilities:

A succeeds and B succeeds    probability = P(AB),
A succeeds and B fails       probability = P(AB'),
A fails and B succeeds       probability = P(A'B),
A fails and B fails          probability = P(A'B').
The sum of these four probabilities, since these are all the things which can occur, must be one. That is to say, if the trial is made, something, anything, is bound and certain to occur. Compound probability and compound events. A compound event is one consisting of two or more sub-events, all of which must occur if the main compound event is to come about. If the probabilities are conditional, the sub-events must occur in the proper succession, but if the sub-events are unconditional, they may occur contemporaneously. The example given above, that of a man dying of anthrax, is a typical compound event with conditional probability. A compound event with unconditional probability would be the showing of two heads when two coins are tossed, either together or in succession. For each coin there is an unconditional probability P = .5 of showing a head. Neither coin is affected by the other. It does not matter
how they are tossed as long as both show heads, in order to fabricate the compound event of showing two heads.

It is commonly asserted that, in compound events with conditional probability, the total probability for the success of the compound event may depend upon the order in which the sub-events occur. The example often used is that of the man who is shot and then dies. This probability is obviously greater than that a man will die and then be shot. This is in no way surprising, because the two compound events are not the same. In the first case, we have the (for our purposes) unconditional probability that a man will be shot. Call this Ps. There is then the conditional probability that, having been shot, he will die. Call this Psd. The total compound probability that the man will be shot and then die is (as we shall see later) P = (Ps)(Psd). In the second case, we have the unconditional probability that, during some period under consideration, the man will die. Call this Pd. There is then the very small conditional probability that someone will shoot the corpse. Call this Pds. The total compound probability that the man will die and then be shot is P = (Pd)(Pds), and there is no reason why this should be equal to (Ps)(Psd). Why, indeed, should it be?

Independent trials. A trial is said to be independent if the probability of success or failure is not affected in any way by previous or contemporary trials. Thus, successive tosses of a coin are independent trials. For any and all fair tosses of a coin, the probability of showing a head or a tail is for ever and immutably P = .5.

Dependent trials. A trial is said to be dependent if its outcome is affected in some way by previous or contemporary trials. This is often the case in weapon trials. In the first trial the gun is new and unworn, but in all later trials there is the effect of wear of the bore, and the single-shot probability of hits usually goes down as the series of trials proceeds.

Parenthetically, the failure to understand that successive independent trials really are independent has been the ruin of countless gamblers. The unthinking person says `Jones has tossed ten heads in succession. It is now time we got a tail, and I shall bet on it.' In other words, this person is saying that the tossing of ten heads in a row has in some way made the eleventh toss a dependent trial, with the probability of showing a tail greater than .5. The fallacy lies here. If, before any tossing is done, we ask `What is the probability that Jones can toss ten heads in a row?,' the answer would be P = .5^10 = .000,976,562, or about one chance in a thousand. We would be quite justified in betting heavily that Jones can't do it. If, on the other hand, Jones has already tossed nine heads in a row, he has done most of the job. He has already brought about an event having the very low probability of occurrence of P = .5^9, or about two chances in a thousand. The probability that the next toss will show another head is still P = .5. Who are we to say that Jones can't finish the job? If the number of successive heads is not too large, it would be equally unwise to bet that Jones can keep it up. If, however, the number of successive heads becomes very large, say fifty, we would be quite justified in thinking that the initially-assumed
a priori probability of a head (P = .5) was wrong, i.e. we would assume either that the coin is unbalanced, or that Jones is cheating. We might then set up a new a posteriori probability for heads, greater than .5, and bet on it. If Jones should toss five thousand heads in a row, we can be sure (but not utterly certain) that the coin is unbalanced, or that Jones is no man for the innocent to toss coins with.
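A small simulation makes the same point about independence. The sketch below is an added illustration, not the book's; it tosses a fair coin in groups of ten and shows that, although ten heads in a row occur only about once in a thousand groups, the tenth toss of a group that has already begun with nine heads is still a fair toss.

```python
import random

random.seed(1)
groups = 200_000

ten_heads = 0        # groups in which all ten tosses are heads
nine_heads = 0       # groups whose first nine tosses are all heads
then_head = 0        # ...and whose tenth toss is also a head

for _ in range(groups):
    tosses = [random.random() < 0.5 for _ in range(10)]   # True means a head
    if all(tosses[:9]):
        nine_heads += 1
        if tosses[9]:
            then_head += 1
    if all(tosses):
        ten_heads += 1

print(ten_heads / groups)        # near .5 ** 10 = .000977
print(then_head / nine_heads)    # near .5: the tenth toss is unaffected by the first nine
```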
Some basic rules of probability and some simple examples

Simple, mutually exclusive events

Let there be N mutually exclusive events, A1, A2, A3, A4, ..., AN, and let the unconditional probabilities for the success of the individual events be P1, P2, P3, P4, ..., PN. Then, on a single random trial, the probability that some one of these events, without specifying which one, will occur is P = P1 + P2 + P3 + P4 + ... + PN. Obviously, if all the P's are equal, P = NPi, where Pi is any one of the P's. If, of course, the sum of P1 to PN is one, there is no other possibility on making the trial but that some one of the events will occur. This sum can never be greater than one, but it can be less than one if the operator has not taken all possible outcomes into account.

Example. Consider an urn containing some white and some coloured balls, to a total of one thousand, where the numbers and a priori probabilities are as shown in Table I.

TABLE I.—Composition of the contents of the urn (a priori probabilities for one random draw).

Colour of ball   Number in urn   Success                  Failure
Red                    10        Pr = .010 = 10/1000       .990
Green                  15        Pg = .015 = 15/1000       .985
Purple                 25        Pp = .025 = 25/1000       .975
Yellow                 75        Py = .075 = 75/1000       .925
Blue                  130        Pb = .130 = 130/1000      .870
White                 745        Pw = .745 = 745/1000      .255
Total                1000                 1.000
The probability of getting any ball except white is clearly

Pnon-w = Pr + Pg + Pp + Py + Pb = .010 + .015 + .025 + .075 + .130 = .255.

But this is the same as being certain to get some ball, but not white, or

Pnon-w = 1 - Pw = 1 - .745 = .255,

and this is the same as simply failing to get white, or Qw = .255. Similarly, the probability of getting either red, or white, or blue is

Prwb = Pr + Pw + Pb = .010 + .745 + .130 = .885.
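For readers who care to check such additions mechanically, the sketch below (an added illustration, not the book's) restates Table I and the addition rule for mutually exclusive events in Python; the colour names and variables are simply labels.

```python
# Contents of the urn of Table I (one thousand balls).
count = {"red": 10, "green": 15, "purple": 25, "yellow": 75, "blue": 130, "white": 745}
prob = {colour: n / sum(count.values()) for colour, n in count.items()}

# Any ball except white: add the probabilities of the mutually exclusive colours...
p_non_white = sum(p for colour, p in prob.items() if colour != "white")
print(round(p_non_white, 3))                 # 0.255

# ...which is the same thing as failing to get white.
print(round(1 - prob["white"], 3))           # 0.255

# Either red, or white, or blue:
print(round(prob["red"] + prob["white"] + prob["blue"], 3))   # 0.885
```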
Compound probabilities

Let there be N non-mutually-exclusive events A1, A2, A3, ..., AN, and let the individual unconditional probabilities of the occurrence of these events on a single random trial be P1, P2, P3, ..., PN. Then, on a single random trial, the probability Pjkl that some selection of these events (comprising events Aj, Ak, and Al) will occur is Pjkl = (Pj)(Pk)(Pl). Obviously, if all the P's are equal, the probability that a specified set of them numbering M will occur is P^M.

Example. Six people are drawn at random from the street and seated round a table. They are given the pseudonyms Jones, Brown, Smith, Robertson, Leroy, and Stevens. Since the sex ratio of human beings on the street from which these people were drawn is exactly .5, one is as likely to select a woman as a man, in each case. One person from this room is drawn at random. What is the probability that this person will be a woman, called Jones, and born on a Monday? We have the probability that it is a woman = P1 = .5; the probability that she is called Jones = 1/6 = .16667; the probability that she was born on a Monday = 1/7 = .14285. Then

P = (.5)(.16667)(.14285) = .011904. This is also (1/2)(1/6)(1/7), or 1/84 = .011904.
A compound event with conditional probability

One thousand seeds of similar size and feel are placed in a bag. Of these, 275 are radish seeds. These radish seeds are of such quality that, if planted, 85 per cent will sprout. That is to say, Ps = .85. One seed is picked at random. What is the probability (Prs) that it will produce a radish plant? The unconditional a priori probability of picking a radish seed is 275/1000 = .275 = Pr. Therefore Prs = (.275)(.85) = .23375.

What is the probability that, after planting, we shall not get a radish plant? This event can occur in two ways:
(a) we pick a non-radish seed, Pnon-r = 1 - Pr = Qr = 1 - .275 = .725;
(b) we pick a radish seed, but it does not sprout, Pnon-s = Pr(1 - Ps) = (Pr)(Qs) = (.275)(.15) = .04125.
Now these are mutually exclusive events. If one way is used, the other way cannot be used. We therefore add these two probabilities, and we have Pno plant = .725 + .04125 = .76625. Note that .76625 + .23375 = 1. That is to say, we are certain either to get a radish plant, or not to get one.
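The radish-seed arithmetic can be sketched the same way. The Python below is an added illustration; the variable names are mine and stand for the probabilities defined in the example above.

```python
p_radish = 275 / 1000     # probability of picking a radish seed
p_sprout = 0.85           # probability that a radish seed, once planted, sprouts

# Both sub-events must succeed, so the probabilities multiply.
p_plant = p_radish * p_sprout

# Failure can happen in two mutually exclusive ways, so those probabilities add.
p_no_plant = (1 - p_radish) + p_radish * (1 - p_sprout)

print(round(p_plant, 5))                 # 0.23375
print(round(p_no_plant, 5))              # 0.76625
print(round(p_plant + p_no_plant, 5))    # 1.0 -- a plant or no plant, one or the other is certain
```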
The problem of ways of occurrence

Many compound events may occur in a number of ways, the probabilities of which must be added, as above, to obtain the total probability of the compound event. In the case of tossing groups of coins, for example, the compound event of two heads and two tails can occur as HHTT, TTHH, HTHT, THTH, THHT, and HTTH, or a total of six ways. In simple cases, one can write down the number of ways immediately; in more complex cases, the number must be calculated by the theory of permutations and combinations. If the student is to do much work with compound events, he should become familiar with this theory, which can be found in any text on algebra. The matter is well discussed in Fry's Probability and Its Engineering Uses.¹

1. T. C. Fry, Probability and Its Engineering Uses (New York, Van Nostrand, 1928).

Multiple successes in multiple trials

Consider the operation of the repeated tossing of a set of N coins. If the set be tossed M times, on how many occasions, out of M, will the set show zero, one, two, three, up to N heads, assuming the coins to be evenly balanced? What is the probability that, on a single toss of the mass of N coins, zero, one, two, three, up to N heads will show?

For simplicity, let us begin by tossing just two coins (N = 2). The results of one toss might be HH, HT or TT. Experience will show these results will occur, on the average, in the ratio of 1:2:1. Why should this be so? Consider HH (two heads). For convenience, we may think of making the trial by tossing first one coin and then the other. To obtain HH in the two tosses, the first toss must show a head, for which Ph = .5. Considered by itself, the probability that the second toss will show a head is also Ph = .5. The probability of the compound event HH is then Phh = (Ph)(Ph) = .25.

In the case of HT, this can be obtained in two ways: HT and TH. Each way has a compound probability of Pht = Pth = .25, and, as these are mutually exclusive events, we add the probabilities to obtain Ph&t = .50. The probability of TT is the same as that of HH, or Ptt = .25. The three probabilities are then P2 heads + Ph&t + P2 tails = .25 + .50 + .25 = 1, and the events occur in the ratio of 1:2:1. If we were to toss this set of two coins 160 times, the expected numbers of the times we would see the various showings would be:

2 heads (HH),     number = 160(.25) =  40
1 head, 1 tail,   number = 160(.50) =  80
2 tails,          number = 160(.25) =  40
                            Total      160

Consider next the tossing of a set of three coins: the results may be three heads, or two heads and one tail, or one head and two tails, or three tails. Following the
reasoning used for two coins, the probability of showing three heads or three tails is Phhh = Pttt = (.5)^3 = .125. The showings HHT, HTH, and THH are merely three ways of obtaining two heads and one tail. Each way has a probability of .125, for a total of .375. Similarly, HTT, THT and TTH are three ways of showing one head and two tails, for a total probability of .375. Summarizing, we have:

3 heads and 0 tails    Phhh = .125
2 heads and 1 tail     Phht = .375
1 head and 2 tails     Phtt = .375
0 heads and 3 tails    Pttt = .125
                       Total = 1.000

A little thought will show that these values are those of the successive terms of (.5 + .5)^3 = .5^3 + 3(.5)^2(.5) + 3(.5)(.5)^2 + .5^3 = .125 + .375 + .375 + .125 = 1.000. This is the famous binomial theorem. The expectations for numbers of various showings in M tosses are M times the probabilities. Therefore, if we toss the set of coins ten thousand times, the various showings would occur, in theory, in the numbers shown below:

3 heads and 0 tails     1250
2 heads and 1 tail      3750
1 head and 2 tails      3750
0 heads and 3 tails     1250
              Total = 10000

The above is a simple example, in which the coins are evenly balanced; heads or tails are equally likely to occur. Suppose, however, that we had seven coins so made that Ph = .57 and Pt = .43. Arranging the expansion so that the results run from zero heads to seven heads, and setting Ph = P, Pt = Q, we have

(Q + P)^7 = (.43 + .57)^7
  = (.43)^7 + 7(.43)^6(.57) + (7·6/1·2)(.43)^5(.57)^2 + (7·6·5/1·2·3)(.43)^4(.57)^3
    + (7·6·5·4/1·2·3·4)(.43)^3(.57)^4 + (7·6·5·4·3/1·2·3·4·5)(.43)^2(.57)^5
    + (7·6·5·4·3·2/1·2·3·4·5·6)(.43)(.57)^6 + (7·6·5·4·3·2·1/1·2·3·4·5·6·7)(.57)^7.
From this expansion, the terms now being evaluated, the probabilities and expectations for tossing the set of seven coins ten thousand times are as shown in Table II.

TABLE II.—Probabilities and expectations for seven unbalanced coins (Ph = .57, Pt = .43), the set tossed ten thousand times.

Number of heads   Number of tails   Prob.       Theoretical expectation   Rounded expectation
      0                 7           .002718             27.18                     27
      1                 6           .025222            252.22                    252
      2                 5           .100302           1003.03                   1003
      3                 4           .221598           2215.98                   2216
      4                 3           .293747           2937.47                   2938
      5                 2           .233631           2336.31                   2336
      6                 1           .103233           1032.33                   1032
      7                 0           .019549            195.49                    196
    Totals                         1.000000          10000.00                  10000
The above would be a laborious business if N were large or if either P or Q were very small. We shall see a better method of calculation for large N when we come to a discussion of the binomial theorem and distribution and of the use of Pascal's Triangle (Chapter V).

At least J successes in M trials

Consider again the tossing of the above set of seven unbalanced coins. What is the probability that, when the set is tossed, it will show, not exactly five heads, but at least five heads? That is to say, we shall accept showings of five, six, or seven heads, but reject those of zero, one, two, three, or four heads. Inasmuch as the showings of five or more heads are all mutually exclusive events, we have merely to add the probabilities of their individual occurrence: P≥5 = .233631 + .103233 + .019549 = .356413. This could also be calculated as P≥5 = 1 - (P0h + P1h + P2h + P3h + P4h) = 1 - (.002718 + .025222 + .100302 + .221598 + .293747) = .356413.
The most interesting case is that for the probability of at least one head, i.e. the probability of not missing. In the above example, we would be interested in any number of heads greater than zero heads. This means that we must add all the probabilities from P1h to P7h. This sum is of course 1 - P0h, or 1 - Q^7 = 1 - (1 - P)^7. Generalizing, the probability of not missing in N trials is 1 - (1 - P)^N, where P is the single-trial probability of success. This is quite important in military operations, especially in bombing raids.
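For larger sets of coins the expansion need not be written out term by term: the coefficients are the ordinary combinations. The Python sketch below is an added illustration (it uses the library function math.comb, available in Python 3.8 and later); it reproduces the probabilities of Table II and the `at least five heads' and `at least one head' figures just discussed.

```python
from math import comb

p, q, n, tosses = 0.57, 0.43, 7, 10_000

# P(exactly k heads) = C(n, k) * p**k * q**(n - k), the k-th term of (q + p)**n.
prob = [comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]

for k, pk in enumerate(prob):
    # e.g. k = 4 gives a probability of about .2937 and an expectation of about 2937
    print(k, round(pk, 6), round(tosses * pk, 1))

print(round(sum(prob[5:]), 5))   # at least five heads: about .35641
print(round(1 - q ** n, 5))      # at least one head, 1 - (1 - P)**N: about .98045
```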
Expectation versus probability

In the example set forth in Table II, it will be seen that the most likely outcome is that of four heads and three tails (P4h = .293747). But this probability is less than .5. Therefore, on a single random trial, we are more likely to get something other than four heads and three tails, even though, on a long series of trials, it will appear more often than any other single outcome. One must not, then, fall into the trap of thinking that the outcome with the largest individual probability of occurrence is the event most likely to happen on a single trial. It will not be so unless its probability is greater than .5. If we use a set of four thousand balanced coins, and toss them en masse, the event with the highest probability of occurrence is that of two thousand heads and two thousand tails. However, the chances of this actually occurring are extremely small. There are so many other possibilities in competition with it, and their aggregate probability is nearly equal to one. If one wishes to assume a safe position with respect to such matters, one should always work in terms of the probability of not failing to get the desired result.

Bayes' theorem

This celebrated theorem was first published in 1763 by the Rev. Thomas Bayes, and may be set forth as follows: Let C1, C2, C3, ..., CN be all the mutually exclusive causes which could operate to bring about an event E, which event is known to have occurred. Let p1, p2, p3, ..., pN be the respective a priori probabilities that each cause will operate on a given occasion. Let P1, P2, P3, ..., PN be the respective conditional probabilities that each cause, having operated, will bring about the event E. Then, if the event is known to have occurred, the probability that it did so due to the action of the Kth cause is

PK = pKPK / (p1P1 + p2P2 + p3P3 + ... + pNPN).
Example. Suppose that in some region, the only possible causes of death of some kind of animal are the diseases A, B, C, D, and E, and that the appropriate probabilities are

Disease A:  p1 = .21,  P1 = .77
Disease B:  p2 = .37,  P2 = .23
Disease C:  p3 = .16,  P3 = .37
Disease D:  p4 = .11,  P4 = .14
Disease E:  p5 = .15,  P5 = .39

The animal has died. The probability that it died because of disease C is then

PC = p3P3 / (p1P1 + p2P2 + p3P3 + p4P4 + p5P5) = .0592/.3799 = .15583.
Had we wished to know the probability of death from disease B, we would have calculated the same fraction as above, but with p2P2 in the numerator.
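Bayes' formula is easy to mis-handle by hand, so a short mechanical rendering of the disease example may be useful. The Python below is an added sketch; the two dictionaries simply hold the p's and P's listed above.

```python
# p[i]: a priori probability that disease i operates;
# P[i]: conditional probability that disease i, having operated, kills the animal.
p = {"A": 0.21, "B": 0.37, "C": 0.16, "D": 0.11, "E": 0.15}
P = {"A": 0.77, "B": 0.23, "C": 0.37, "D": 0.14, "E": 0.39}

denominator = sum(p[d] * P[d] for d in p)            # .3799

def probability_of_cause(disease):
    """Bayes' theorem: P(cause K | death) = pK*PK divided by the sum of all pi*Pi."""
    return p[disease] * P[disease] / denominator

print(round(probability_of_cause("C"), 5))   # 0.15583, as in the text
print(round(probability_of_cause("B"), 5))   # 0.22401, with p2*P2 in the numerator
```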
Bayes' theorem has been the subject of much debate and some criticism, but it remains the only formula available for this sort of problem. The difficulty arises from the fact that we seldom have any accurate information about the values of the p's. There is then a tendency to assume that they are all equal. Invalid assumptions as to equal likelihood can lead to wholly ridiculous conclusions when Bayes' formula is used. Many amusing paradoxes are based on this fact.
II

SAMPLES, UNIVERSES, DISTRIBUTIONS

WHEN A RESEARCH WORKER begins to attack a problem, he invariably finds that he cannot make all the observations he would like to make; the time, money, and unlimited help that would be needed are not available. He must content himself with doing the best he can with limited resources. It is his hope that, in the end, he will be able to make reasonably true statements about the nature of the problem on the basis of an examination of perhaps only a fraction of the potentially available data. That is to say, he takes a sample from the universe, carries out a statistical study of the sample, and extrapolates to the universe. Extrapolation is always risky, but he hedges by talking in terms of probabilities and not in terms of certainties. He does not say that the mean value of X is .734; rather that, with 95 per cent certainty, the mean value of X lies between .734 + .01 and .734 - .01. These plus or minus values are called fiducial limits. The great virtue of statistical analysis is that we can make statements of this sort, and can establish fiducial limits.
Further definitions

Universe. In the sense in which it is used by the statistician, the word universe means the compendium of all possible observations which might be made in connection with a given enquiry, given unlimited time, energy, and speed of operation. Some universes may be so small as to yield inadequate data (observations on some rare species, for example). Universes are commonly so large that one cannot make all the possible observations, and must work in terms of a sample; see below.

Sample.
A sample comprises a limited group of observations drawn from a
universe.

Random sample. A random sample is one so taken that every observation which might be included has an equal chance of being included. In practice, it is quite difficult to be sure that one really has a random sample. One must not select observations by any sort of subjective process whereby one says `I shall take this observation, but not that one.' One must set up some procedure which overrides one's predilections and determines which observations shall be taken in a random, non-subjective way. For example, if a mass of powder is to be sampled, one does not just take some off the top. This part may have too many small grains in it. Rather, one stirs the mass, cuts the pile in four, and takes one of the quarters. This process is repeated until a
sample of suitable size is obtained. In ore-dressing laboratories, there are gadgets for doing this. In the case of a large mass of data from which a few observations are to be drawn, the observations should be numbered, the numbers placed in a hat, and a selection made by lot. One can also use a random number table. This is a table containing the digits from zero to nine, usually arranged in blocks of five digits. Within the blocks, the digits are so arranged that there is no pattern. At any place at which a digit is printed, the digit found there is equally likely to be anything from zero to nine. A few blocks of a random number table look like this: 51133 05375 97610 68618 09134 56139 27083 16639 65762 12498 41731. In using the table, if less than one hundred observations are required, one takes the last two digits of a series of the random numbers and uses these as the numbers of the observations to be taken. Thus, from the small list above, one would take observations 33, 75, 10, 18, 34, 39, 83, 39, 62, 98, 31, and so on. If more than 99 observations are required, use the last three digits. A fairly good random number table can be made up as needed by taking the last digits of the values in a table of logarithms, arranging these in blocks of five. Beginning with log10(180), for example, we have the values 25527, 25551, 25575, 25600, 25624. These yield one block (71504). Randomness will be increased if one jumps around in the table of logarithms rather than selecting successive logs. Random numbers are often used in work with electronic computers. The machine may be programmed to generate the numbers, or it may look them up in a table which has been placed in its `memory.'

Where it is desired to draw a fairly random sample, or a set of samples from a field, bush, culture, or the like, the sampling effort should be spread evenly over the area or through the volume, so that all parts have an equal chance, in theory, of being included. Sampling is a science in itself, and the limited scope of this book does not permit an extended discussion.²

2. See R. G. D. Steel and J. H. Torrie, Principles and Procedures of Statistics (New York, McGraw-Hill, 1960), for further discussion.

Variable (variate). When observations are made, they will be recorded as values defining such things as size, weight, colour, length, and so on. Such values are the values of variables, sometimes called variates. Variables may be of several kinds.

Continuous variables. If all the values of a continuous variable be tabulated, they will form an unbroken series from the smallest to the largest. The interval between successive values is, in theory, infinitely small; in practice, suitably small. The heights of men form, in theory, a continuous series, but the values of coins do not.

Discontinuous variables. These can assume only some values, usually integral values. The numbers of heads seen when a mass of coins is tossed form a set of values
of a discontinuous variable. One can have two heads, or three heads, but not two and a half.

Ordinal variable. Ordinal variables plot as values which can be placed in some sort of order of merit, or which can be, as we say, ranked. They are much used in consumer preference-tests. One may be able to say that one soup tastes better than another, but one cannot assign it a numerical value.

Nominal variables. These are variables with a `yes' or `no' quality. An animal may be dead or alive, male or female, and so on. One cannot assign numerical values in most cases.

Origin and range of variables. Some variables assume values, in theory, from -∞ to +∞, but in practice, one uses only a limited range. Thus, the mathematical distribution for the heights of men ranges from -∞ to +∞, but in practice, there are no mature men only one foot high, or fifty feet high. Some variables can be referred to a definite origin, or zero-point (lengths of fish, for example). Others are commonly referred to an arbitrary origin. Time is usually so referred to, often as time since the beginning of the experiment.

Frequency distribution. If one plots a large series of values of a variable, with the value as abscissa and the frequency of occurrence of each value as ordinate, the resultant graph is a frequency distribution. It is usually in the form of a sort of humped curve, though hollow J-shaped distributions are not unknown. There are a number of more or less `standard' distributions for which the mathematical background is well known. These are the ones commonly used in statistical work. The commonest are the normal, the binomial, and the Poisson, but many others are known. Distributions are usually plotted in the form of continuous curves, but may also be shown as histograms.

Example. Table III shows the times required for the hatching of 186 eggs of the flour beetle, Tribolium castaneum, at 32°C. and 75 per cent relative humidity, the time being measured in hours from the mean time of laying.

TABLE III.—Distribution of hatching times (in hours) of 186 eggs of Tribolium castaneum at 32°C. and 75 per cent relative humidity.

Time,   Number      Time,   Number      Time,   Number      Time,   Number
hours   hatching    hours   hatching    hours   hatching    hours   hatching
 69        0         76       18         83        8         90        3
 70        1         77       25         84        6         91        2
 71        1         78       23         85        5         92        1
 72        2         79       24         86        3         93        0
 73        1         80       20         87        3         94        1
 74        3         81       12         88        5         95        0
 75       11         82        7         89        1         96        0

It will be seen, for example,
that one egg managed to hatch after only 70 hours, while 23 eggs required 78 hours, and two of them as long as 91 hours. The data of Table III are plotted as a smoothed distribution in Fig. 1a and as a histogram in Fig. 1b. This distribution is too asymmetrical to fit a normal distribution. It would probably fit a Poisson or a binomial somewhat better.
[Fig. 1a and 1b.—Frequency distribution of the hatching times in hours of 186 eggs of the flour beetle Tribolium castaneum at 32°C. and 75 per cent relative humidity. The data are plotted as a smooth curve in (a) and as a histogram in (b); the horizontal axis shows hours required, after laying, to develop to hatching.]
The general shapes and characteristics of frequency distributions

Symmetrical distributions. These are distributions in which both sides are the same (mirror images) about the middle ordinate. In practice, one seldom sees a really symmetrical distribution. They are nearly all skewed.

Skewed distributions. These are distributions which are steep on one side and have a long `tail' on the other side. If the tail is on the right side, as it is in Fig. 1, the distribution is said to be skewed right, and vice versa. Biological phenomena are very commonly skewed right. There always seem to be some stragglers which prolong the distribution into a tail. Extremely skewed distributions, which are rare, are the shape of a reversed letter J. The ultimate in skewing (right) is shown by a distribution all the values of which come at the extreme left abscissa. Such are virtually never seen.

The mean. This is what the ordinary man calls the `average.' It is calculated by adding up all the values and dividing by the number of observations. There are other
kinds of means, such as the harmonic mean or the root-mean-square, used for special purposes.

The mode. This is the high point of the curve. Its position can often be roughly determined by inspection, and ways exist to calculate it. A curve with only one high point is said to be uni-modal, but curves may be bi-modal (with two modes) or even multi-modal. In some cases, bi-modal distributions can be split into two uni-modal distributions, but this is usually difficult or impossible to do with any certainty. If there are numerous repeated modes, we have a periodic function, of which sin X is an example. Such are best handled by periodogram analysis or Fourier analysis. These are not discussed in this book.

The median. The median observation divides the frequencies into two groups, with as many larger than the median as there are smaller than the median. Its position is difficult to estimate by eye from a curve, but it can be found by calculation.

The range. This is the extent of spread of the distribution, i.e. the region between the smallest and largest observations. In the distribution of Fig. 1, the range is from 70 to 94 in terms of the raw data, though in theory it is from -∞ to +∞.

Kurtosis. This is a numerical measure of the degree of flattening of the top of the curve, and is applied principally in the case of normal distributions. The top of such must have a definite shape, depending upon the degree of central tendency. If it is too flat, it is said to be platykurtic, and if too sharp, it is leptokurtic. It is almost impossible, except in extreme cases, to estimate the degree of kurtosis by eye, but it can be calculated.

Central tendency. The central tendency of a distribution is a measure of the extent to which the observations tend to cluster around the mean of the distribution. We shall see later that the mean is the best single value to take as representative of all the various values making up the distribution. It is, as it were, the `spokesman' of the distribution. If the distribution has a narrow range and a high mode, the values of most of the observations are close to that of the mean. We have then a high degree of central tendency (small dispersion) and we may feel some fair degree of confidence that the mean does indeed represent the distribution. With a wide flat distribution, the majority of the values deviate quite widely from the mean, which may then have scarcely any validity as a representative of the distribution. The common measure of central tendency is the standard deviation, though other measures have been used, in particular the (now-outmoded) probable error.

Classes, class intervals. It will be noted that in Table III, we did not tabulate all the exact times at which the various eggs hatched. Some may have required, say, 81.1 hours, others 81.2, and so on. Rather than tabulate all the values, we lumped the data into groups, or classes (sometimes called categories). Thus all the eggs requiring from 79.5 to 80.5 hours to hatch were arbitrarily assigned 80.0 hours, the mid-value of the class. This is done not so much because one could not measure more precisely, but to reduce the effort needed for analysis and to summarize the data.
Except for very special cases, the class interval, i.e. the change in value of the independent variable from one class to the next, should always be the same within one set of data. Unequal class intervals, or observations unevenly spaced on a time-scale, may make the data almost impossible to analyse. One should select a class interval in such a way that the data can be assembled in about twenty or fewer classes. More may have to be used in large problems, and there is no bar to having five hundred if need be. Broadening the class interval to reduce the number of classes inevitably blurs the precision, though there are adjustments such as Sheppard's correction which may be used to compensate for lumping in classes.
When tabulating data, one may meet awkward cases falling exactly on the boundaries between classes. These should be distributed into the neighbouring classes as closely as possible in the ratio of the size of the two relevant classes. If this is not feasible, a random distribution right and left will do.
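The lumping of raw measurements into classes is easily mechanised. The Python sketch below is an added illustration of one way to do it (the function name and the sample values are invented); it assigns each measurement the mid-value of its class, as was done in Table III, and simply pushes boundary cases into the upper class rather than sharing them between neighbours as the text recommends.

```python
from collections import Counter

def tabulate(values, interval, origin):
    """Group measurements into classes of equal width, keyed by class mid-value."""
    freq = Counter()
    for v in values:
        index = int((v - origin) // interval)        # which class the value falls in
        mid = origin + (index + 0.5) * interval      # mid-value assigned to that class
        freq[round(mid, 6)] += 1
    return dict(sorted(freq.items()))

# Hatching times measured to the nearest tenth of an hour, lumped into one-hour
# classes 69.5-70.5, 70.5-71.5, ... so that each class is labelled 70, 71, ...
times = [70.2, 75.9, 76.4, 76.6, 78.1, 78.3, 78.5, 80.0, 91.2, 91.8]
print(tabulate(times, interval=1.0, origin=69.5))
# {70.0: 1, 76.0: 2, 77.0: 1, 78.0: 2, 79.0: 1, 80.0: 1, 91.0: 1, 92.0: 1}
```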
The relation of probability to the form of a frequency distribution

Suppose that one has drawn a random sample of 126 insects, the lengths of which one can measure with some certainty to the nearest .1 mm. Suppose that these measurements tabulate as in Table IV and plot as in Fig. 2.

TABLE IV.—Lengths of 126 insects drawn as a random sample.

Length, nearest .1 mm.   No. of insects     Length, nearest .1 mm.   No. of insects
        5.0                     0                   5.8                    19
        5.1                     1                   5.9                    15
        5.2                     3                   6.0                    10
        5.3                     5                   6.1                     5
        5.4                    10                   6.2                     3
        5.5                    15                   6.3                     1
        5.6                    19                   6.4                     0
        5.7                    20                   6.5                     0

[Fig. 2.—A basically normal frequency distribution, plotted from the data of Table IV. The vertical columns show the probabilities that a single observation, drawn at random from these data, would have a value appropriate to that column; the horizontal axis shows lengths of insects to the nearest 0.1 mm.]

The sum of all the observations, each value multiplied by its frequency, is 718.2, and the observations number 126. The mean, commonly symbolised as x̄, is then x̄ = 718.2/126 = 5.7 mm. Now we can think of the curved line of the distribution, together with the bottom horizontal line, as enclosing an area of 126 units, one for each insect measured. Suppose now that, after measuring, we throw all the 126 insects into a beaker and then draw one of them at random, and
remeasure it. It might have a length anywhere from 5.1 mm to 6.3 mm. In the beaker there are 126 insects in all, of which ten have a length of 5.4 mm. It follows that the a priori probability that the single insect we have just drawn will be 5.4 mm. long, is P5.4 = 10/126 = .07937 (approx.). Similarly, the a priori probability that it is, say, 5.6 mm. long, is P5.6 = 19/126 = .15079 (approx.). That is to say, the areas in each class-column represent the probabilities that a single observation made at random from this set of insects will have a length appropriate to that class. If then we divide all the separate frequencies for each class by 126, we shall obtain a new curve bounding an area equal to one. Then each class-column will directly represent the probability of occurrence. These probabilities are shown in Table V.

TABLE V.—Probabilities for each class, for the data of Table IV.

Length, nearest .1 mm.   No. in class   Probability     Length, nearest .1 mm.   No. in class   Probability
        5.0                    0          .00000                5.8                   19          .15079
        5.1                    1          .00794                5.9                   15          .11905
        5.2                    3          .02381                6.0                   10          .07937
        5.3                    5          .03968                6.1                    5          .03968
        5.4                   10          .07937                6.2                    3          .02381
        5.5                   15          .11905                6.3                    1          .00794
        5.6                   19          .15079                6.4                    0          .00000
        5.7                   20          .15873                6.5                    0          .00000

Total probability = 1.00000
Consider now the probability that a single observation, drawn at random, will lie between 5.45 and 5.95, i.e. that it will lie somewhere in the classes 5.5, 5.6, 5.7, 5.8 and 5.9. If it lies in one class, it cannot lie in another. Inclusion in the various classes forms a set of mutually exclusive events. We add the individual probabilities to obtain the overall probability that some one of the events will occur. The result would be P = .11905 + .15079 + .15873 + .15079 + .11905 = .69841. This observation, lying between 5.95 and 5.45, would differ from the mean value (5.7) by less than, or as much as, 5.70 - 5.45 = .25 mm., or 5.95 - 5.70 = .25 mm.

We might instead ask what the probability is that a single observation made at random would differ from the mean by as much as or more than .25 mm. The previous observation lay somewhere in the middle part of the distribution. This second observation will lie somewhere in one of the two tails, i.e. elsewhere than in the central part. The probability of this occurring would be that of failing to lie in the central part, or P = 1 - .69841 = .30159. A little thought will show that, if we had much finer categories, we should be able to find such a deviation from the mean that, if a single observation were then made at random, its deviation would be as likely to exceed this amount as to fall short of it. For this deviation, the above probabilities (.69841 and .30159) would both equal .5000.
This particular deviation, located on either side of the mean, was called the probable error. The probable error is not used now and has been replaced, for practical purposes, by the standard deviation, whose value is (probable error)/.67449. The two probable error abscissae divide the distribution into three parts: a central part with an inclusion probability of .5, and two tails, each with an inclusion probability of .25. The total is, of course, one.

We have written the above largely in terms of the actual values of observations. In practice one works, not with actual values, but with deviations from the mean. An insect of length 5.4 mm. represents a deviation of 5.7 - 5.4 = .3 mm. Deviations are usually measured from the true mean, but an assumed mean, commonly called Z, may be used for convenience. Compensations are made to allow for the fact that Z is not x̄, the true mean. As we shall see later, the use of an assumed mean greatly simplifies some statistical calculations. The statistician thinks in terms of deviations from the mean rather than in terms of actual values, because it is deviations that are important. A man eleven feet high may be interesting, but the really important thing is not that he is eleven feet high, but that he is about five feet taller than the mean height, taking this for the moment as six feet. The operation of probability in a given case produces deviations, not actual values, and all the formulae of statistics are built in terms of deviations.

You will now realize that in a distribution with a large central tendency (small dispersion) the probability of large deviations is small, and that of small deviations is large. The probable error hugs the mean closely, as does the standard deviation. Thus the curve is steep on the sides and pointed at the top, but its top must still be the proper shape, in terms of kurtosis. With small central tendency, deviations are likely to be large rather than small, and the standard deviation may be well out from the mean on each side. If it is too far out, as we shall see later, the mean is not worth anything. With this information at hand, we are now in a position to understand a discussion of actual distributions, and the next chapter deals with the famous normal frequency distribution, sometimes called the Gaussian distribution.
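The figure .67449 quoted above can be verified from the normal probability integral: a deviation of .67449 standard deviations on either side of the mean encloses exactly half of a normal distribution. The check below is an added sketch in Python, using the error function from the standard library.

```python
from math import erf, sqrt

# For a normal distribution, the fraction of observations lying within
# z standard deviations of the mean is erf(z / sqrt(2)).
z = 0.67449                          # the probable error, in units of the standard deviation
central = erf(z / sqrt(2))

print(round(central, 4))             # 0.5: the central part holds half the observations
print(round((1 - central) / 2, 4))   # 0.25 in each tail, as described above
```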
III

THE NORMAL FREQUENCY DISTRIBUTION

THE VARIABLES of a vast array of natural phenomena tend to be distributed in fairly close agreement with what is known as the normal frequency distribution, though it is rare to find a case in which the agreement is exact. The majority of biological distributions seem to be somewhat skewed right, i.e. they are steep on the left, with a long tail on the right. However, if the skew is not too great one can often neglect it and treat the data as if they did belong to a symmetrical normal distribution. If doubt exists, tests are available (the chi-squared test) to check the extent of agreement or lack of it.

Table VI and Fig. 3 show some data specially calculated to provide an almost exact normal distribution. In Table VI, the symbol f indicates the number of times each value turned up. It is the frequency. These frequencies in Table VI are not quite exact as they have been slightly rounded off.

TABLE VI.—Data calculated for an almost exact normal frequency distribution.

Value of X    f = Frequency of these values
    1                  0
    2                  2
    3                  9
    4                 27
    5                 65
    6                121
    7                176
    8                200
    9                176
   10                121
   11                 65
   12                 27
   13                  9
   14                  2
   15                  0
           Total f = 1000
Characteristics of the normal frequency distribution
1. The distribution is symmetrical, with the mean, mode, and median coinciding; in this example, atx = 8. 2. In theory, the range is from — co to + oo, but Total f _ 1000 in practice, it is limited; in this example, from two to fourteen. 3. The tip of the curve, at the coincidence of mean, mode, and median, is rounded. It has a special shape, and if this is exactly what it should be, the kurtosis is zero. 4. If each ordinate is considered as a measure of the probability of the occurrence of such a value of the dependent variable at that abscissa, the area under the curve equals one. 5. On each side of the mean, distant from it by an amount equal to the standard deviation, there is a point of inflexion. It is at SD = 2 in this example. 6. The distribution may have an infinite number of shapes, depending upon the value of the standard deviation. The larger the standard deviation, the flatter the curve. 21
THE ESSENCE OF BIOMETRY
22
200 ~ean. mode and median
-- 150 -
Z
r3.
.D 100 standard des tat ion
c
standard deviation
50
0
•
2 3 4 5 6 7 8 9 10
0
11
13
12
14
15
Values of X, the independent variable Fig. 3.—An almost exact normal frequency distribution, plotted from the data of Table VI, showing the mean, mode, median and standard deviation.
7. Because of the infinite number of possible shapes, data to be used in calculations relating to the curve are tabulated, not in terms of absolute values of the independent variable, or in terms of deviations from the mean, but in terms of the ratio: deviation . standard deviation
As a result, one table will do for all normal frequency distributions.
Otherwise a separate table would be needed for each standard deviation. 8. Where x = a value of the independent variable, y = a value of the dependent variable, usually a frequency, f, x = the mean value of x, SD = the standard deviation, N = the number of observations in the distribution, = E
f,
e = the common constant 2.7182818, ir(pi) = 3.14159, the equation of the curve is Y_
N (SD)J2 n
e-(x—x)2/2SD2 '
and the curve is plotted in terms of deviations from x with x as its origin. As stated in Chapter II, there are a number of statistics which may be calculated for a normal distribution, and we proceed now to deal with these. The distribution of Table VI is not really suited to a demonstration of these calculations, as there is no skew, no kurtosis, and the class-interval is one. Table VII shows the same data,
THE NORMAL FREQUENCY DISTRIBUTION
23
somewhat distorted (to make the distribution less perfect), re-tabulated with a classinterval C = 2. The mean The mean is commonly calculated by adding up all the observations and dividing the total by the number of observations. Where the observations are aggregated in classes or categories, we have frequencies for the occurrence of each value of x, and so the grand total is the summation of E f (x), while N is the total of the frequencies. Thus
x—
f (x) N
9066 — 9.066. 1000
This is the method of the man in the street and is quite usable for a small quantity of data, but where the mass of data is large, the group deviation method is preferable. In this method, one takes an assumed mean located by guess-work somewhere near the true mean. This assumed mean is called Z. If a suitable value, located about the middle of the table, be taken for Z, the deviations from 2 (and also their squares) can be written out at once as shown in columns 5 and 6 of Table VII. Note that we take VII.—Data for the calculation of the statistics of a normal frequency distribution. The raw data are in columns 1 and 2.
TABLE
Column numbers for reference 1
2
3
4
5
6
7
8
9
x
j
f(x)
Ef
D'
(D')2
dev*
dev2
f (dev2)
0 2 4 6 8 10 12 14 16 18 20
0 4 47 197 281 242 134 63 24 7 1
0 8 188 1182 2248 2420 1608 882 384 126 20
0 4 51 248 529 771 905 968 992 999 1000
—5 —4 —3 —2 —1 0 +1 +2 +3 +4 +5
25 16 9 4 1 0 1 4 9 16 25
—9.066 —7.066 —5.066 —3.066 —1.066 +0.934 +2.934 +4.934 + 6.934 +8.934 +10.934
+82.1923 +49.9283 +25.6644 +9.4003 +1.1364 +0.8723 +8.6084 +24.3444 + 48.0804 +79.8164 +119.5524
+0.0000 + 199.7132 --1206.2268 —1851.8591 +319.3284 +211.0966 -I-1153.5256 =1533.6972 —1153.9296 +558.7148 +119.5524
* These will be used later for calculating the s andard deviation.
no account of class-interval. This will be adjusted later, and at the same time we shall make an adjustment for the fact that 2 is not the true mean, N. The formula now becomes C(QfD')
— 2+ N
24
THE ESSENCE OF BIOMETRY
Taking Z at, say, ten, and summing the products of column 2 times column 5, taking note of the signs, we have
x = 10 — 2(467)/1000 = 9.066. The group deviation method is enormously quicker where there is a large mass of data. If all the values of x should end in the same decimal, calculation can be simplified by having Z end in the same decimal. The standard deviation The standard deviation is a measure of the central tendency or, if we think the other way around, of the dispersion of a distribution. It also marks the position of the point of inflexion on either side of the mean. One might think that we could get a measure of dispersion by finding the sum of the deviations from the mean. This will not work because, if the mean is properly determined, the sum of the deviations (taking note of signs) will be zero. Indeed, this is a good check on the mean. Furthermore, it is reasonable to feel that large deviations should be given extra importance, more than their mere size would indicate. After all, it is the large deviations which make for large dispersion. We could eliminate the difficulty of the mutual cancellation of the deviations and also assign more importance to the large deviations, if we squared all the deviations before summing. We could then divide by N to obtain an average squared deviation. This is the variance. The square root of the variance is the standard deviation. By this method, the formula is
SD = J>f(x --)7)2 N Note that the symbol SD is used for standard deviation. The symbol commonly used is the Greek letter a (lower-case sigma). However, there is often confusion between the use of this symbol for standard deviation and for standard error. The standard deviation of one distribution may be the standard error of another. To experts, this presents no problem, but it is confusing to those beginning the study of statistics. Therefore, though the symbolism is not orthodox, it seems advisable to use SD for standard deviation, and SE for standard error. Subscripts will be used, if needed, to identify the statistic to which the symbol refers. We may thus have SE2, which is the standard error of the mean, x. SEby, would be standard error of the regression coefficient of y on x. In Table VII, column 7 shows the deviations from the true mean (x = 9.066) and column 8 shows their squares. Multiplying each squared deviation by the frequency of its occurrence (column 2) and summing, we obtain E f (x — 5)2 = 8307.6634, whence SD = x/8307.6634/1000 = ± 2.88230175. Obviously this is a long and cumbersome method with many observations, and the formula is not adapted to machine calculation (though it may be the formula of choice
THE NORMAL FREQUENCY DISTRIBUTION
25
with some more modern calculating machines, which have automatic squaring and back transfer from the multiplier dials to the keyboard). The formula can be written as
SD
E fix — x)2 =
=J
J f(x2 — 2xx + x2)
N
N
f Efx2 \J N
x2
a form more suited to machine calculation with older machines. Using this formula, we calculate from Table VII by multiplying each value in column 3 by that in column 1 and add to obtain > fx2 = 90500. We already have x2 = 9.0662, and
0500 SD = 19 ~J 1000
9.0662 = +2.8823.
Even this method can be a long process if the values of x end in awkward decimals and there are many values. In such cases, the group deviation method, as applied to the calculation of the mean, can be used. Again we use an assumed mean, Z The class interval is C = 2. The deviations from Z are D', and their squares are Di2 ; the formula becomes
jC2 [E f(D1)2 SD _ \f(
(f(D1))2]
L
The value of E f (D')2 = 2295 is easily obtained by summing the products of the values in columns 2 and 6 of Table VII, and
ND) _ .467 is already known from
the calculation of x; whence
SD =
2295 .4672 ~4 = ±2.8823. J [1000 —
Sheppard's correction for grouping into classes When observations are grouped into classes or categories, as is done in Table VII, we make the tacit assumption that all observations in a given class have the same value as that of the mid-point, or median value, of the class. Due to the fact that the distribution is higher at the mean, the true mean value of a class is closer to the mean of the distribution than the mid-point of that class. We are therefore assigning arbitary values to our observations which are a little further from the mean, right and left, than they should be. This disperses the values outward, and so makes the standard deviation slightly too large. The effect is not great if the class interval is small, but may be appreciable with a large class interval. To adjust for this, Sheppard's correction subtracts from the square of the standard deviation an amount equal to one-twelfth
26
THE ESSENCE OF BIOMETRY
of the square of the class interval, i.e. C2/I2. The formula for the standard deviation so corrected is then
SD, = SD2 —
C2 12
In the example from Table VII, this would be
SD, = '/8.3076 — 22/12 = ± 2.8239. Sheppard's correction is at best an approximation. It should not be used if the standard deviation will later be used in significance tests. Many workers never use it at all, and unless the class interval is quite large, it seems scarcely worthwhile. Student's correction for degrees of freedom Many students find the concept of the number of degrees of freedom quite difficult to understand. One can give abstruse mathematical explanations, but such are out of place here. Let us try to give a simple but convincing explanation. Consider the values of x in column 1 of Table VII. The assumption is that these values occurred with the frequencies noted because they were random observations. The values might have been anything from zero to twenty, and any value might have had any frequency. But, given a total value for > f (x), is this so? Are the values of x and their associated frequencies wholly free? The answer is that, if we are to obtain in the end the fixed value > f (x) = 9066 (for this example), ten of the frequencies could be anything but the eleventh has got to be such that E./ (x) = 9066. Thus, as we add down column 3 of Table VII, we shall find that when we have added on 126 opposite x = 18, the total is 9046. Therefore, inasmuch as the next value of x is x = 20, the next value off must be f = 1. That is to say, the last value off is restricted. The restricted, non-free value does not have to be the last one, it can be any one, but one must be restricted and N — 1 are free. Thus we say that, for a standard deviation, the number of degrees of freedom is d.f. = (N — I). To apply it to the formula, one merely substitutes (N — 1) for N in the denominator. The difference between N and N — 1 is negligible if N is large, but may be important if N is small. In some statistical calculations, the matter of the number of degrees of freedom is most important, and we shall return to it often. The standard error of the mean This is not a logical place to deal with the standard error of the mean, a statistic used in many significance tests. The concept requires careful thought and clear explanation. We shall deal with it in Chapter IV. The mode The mode is the high point of the curve and is found in that class having the highest class-frequency, the modal class. In Table VII, the modal class is obvious on inspection,
THE NORMAL FREQUENCY DISTRIBUTION
27
being the class for which x = 8, with class frequency of 281. However, the determination of the exact value of x at the mode is a matter of calculation. An exact calculation is difficult, requiring specially fitted curves, but a good approximate value can be determined as follows: Let L,..= lower limit of modal class (in example, L. = 7), Fa = class-frequency in the class next above the modal (Fa = 242), Fb = class-frequency in the class next below the modal (Fb = 197), C = class-interval (C = 2); mode = L. +
Fa +a Fb (C) ,
and, for our example, this is mode = 7 +
(242)(2) 19 7 — 8.1025. 242 + 19
The median The median observation is that which divides the N observations into two groups of equal number, half greater than the median, half smaller. That is to say, a vertical line erected at the median value of x divides the area under the curve of the distribution into two equal halves. Its position is quite difficult to determine by inspection, especially in skewed curves. Its value can by found exactly only by elaborate calculations, but approximately as follows: Let L11e = lower limit of the median class (7 in the example), Na1e = serial number of the median observation in the median class (1■1,7e = 252), Føe = class frequency of the median class (F1te = 281), C = class interval (C = 2); then
median = L.,. + (N"1e — )C . Føe
We find the median class by successively summing the frequencies of column 2 in Table VII until we override the median frequency, which will be five hundred in this case (the total frequency being one thousand). This is the class for which x = 8. The total frequency to the end of this class, as seen from column 4, is 529. The total frequency to the beginning of this class is 248. The five hundredth observation is
THE ESSENCE OF BIOMETRY
28
obviously 252 beyond the beginning of the class, i.e. at 248 + 252 = 500. Thus Nme = 252, (252 — (2)] — 8.7900. median = 7 + L 281,)
whence
That is to say, within the limits of this approximate method, there are as many observations for which x is greater than 8.7900 as there are with x less than this value. Another approximation to the mode and median It is sometimes recommended that the mode be calculated from the very approximate formula mode = 3(median) — 2(mean). We have: mean _ 9.066, median = 8.7900, mode _ 8.1025. Using the above formula, the mode would be 8.2380 and the median, 8.7448, using in each case the value of the mean and the previously calculated value of the median and the mode respectively. These are not good approximations. The median is usually, as in this example, within the interval mean-to-mode, but cases can be found in which this is not so. In such cases, the above formula gives wholly incorrect values. It seems preferable then to calculate the mode and median directly, as previously shown. Skewness and kurtosis Skewness, or rather the absence of it, is a measure of the symmetry of the distribution. A distribution skewed to the left has a long tail on the left and is steep on the right, and vice versa. The degree of skewness can be very roughly calculated as Sk
_ mean — (3 median — 2 mean) 3(mean — median) = SD SD
this method breaks down if the median does not lie between the mean and the mode. In using the formula, if Sk is positive, the distribution is skewed right, and vice versa. In the case of the example from Table VII Sk _
3(9.066 — 8.7900) = + .2873. 2.8823
The distribution is then slightly skewed right, as will be seen from a more exact method to follow. This method uses the first, second, third, and fourth Moments* The expression moment is borrowed from physics. If a mass can be rotated about a pivot, the moment is the product of the mass times the distance from the pivot. If a body be in balance about a pivot, the first moment, i.e. the sum of all such moments, is zero, and a frequency distribution is thus balanced about its mean value (Mi = 0). The second moment in physics is similarly the sum of the products of each mass squared, times the distance from the pivot. Thus, frequency in statistics corresponds to distance from the pivot in physics, and the value of X corresponds to mass.
THE NORMAL FREQUENCY DISTRIBUTION
29
about x. These are calculated as M1 = E f (x - x)/N = 0
(M1 is always zero),
M2 = E f (x — x)21N
(This is SD', already calculated = 8.3076), M3 = Ef(x - x)3/N, M4 = E f (x — X)4/11.
Thus far we have seen (in the discussion of the standard deviation) that M1, the sum of the deviations about the mean, is always zero. M2 arises in the calculation of the standard deviation, and can be obtained by summing column 9 of Table VII and dividing by N. M3 can be obtained from the sum of the products of columns 7 and 9, while M4 is obtained from the sum of the products of colums 8 and 9. Thus M1 = 0, M2 = 8307.6634/1000 = 8.3076, M3 = 11,908.8396/1000 =11.9088, M4 = 220,519.5784/1000 = 220.5192. The above operations can be done quite quickly if one has access to a calculating machine with automatic squaring and transfer to a storage dial; it is not too laborious in any case. Some short-cut formulae have been proposed for calculating the moments, using an assumed mean. These are more trouble than they are worth. We now calculate M3 G1 = +.4973.
M2./
If G1 is zero, the distribution is symmetrical, with no skew. If G1 is less than zero, the distribution is skewed left, with a long tail on the left, and vice versa. As G1 = +.4873, the distribution of the example is skewed right. We then compute
ll
G2 = [-2: M — 3 (_ +.1952 in this example). z
If this is zero, the shape of the tip of the curve of the distribution is exactly what it should be for a normal distribution, and there is no kurtosis. The shape of course is that for the particular value of the standard deviation. If G2 is less than zero, the curve is platykurtic, or too flat, while if G2 is greater than zero, the curve is too sharp, leptokurtic, as it is in this example.
30
THE ESSENCE OF BIOMETRY
The general regime of calculation Quite commonly, one wishes to calculate only the mean and the standard deviation, from which the standard error can then be calculated. If this is so, one should use ari assumed mean Z, with deviations D' and D'2. If, however, skew and kurtosis are to be calculated, one might as well resign oneself to the computation of the four moments. The mean and standard deviation will then come out as by-products. Fitting the curve to a set of data, normal distribution As has already been stated covering the characteristics of the normal frequency distribution (see p. 22, §7), data descriptive of the normal distribution are tabulated in terms of the ratio: deviation/standard deviation. For example, suppose that we had a normal distribution such that x = 10, SD = ±2.00. When looking up some figure in a table dealing with the normal distribution, for a value x = 14 one would enter the table not at x = 14, nor at the deviation +4, but at the value dev/SD = 4/2 = 2. This variable (dev/SD) is often called tau (r), but Fisher and Yates in their Statistical Tables for Biological, Agricultural and Medical Research,3 Table II, call it X. This might be somewhat confusing in connection with our Table VII, where x is the independent variable. Therefore, in our Table IX (below), which is an abbreviated version of Fisher and Yates' Table II, it is called D/SD, in the upper left-hand cell of the table. In fitting a normal distribution to the data of Table VII, what we shall do is determine the theoretical ordinate for each value of x, in terms of a perfect normal distribution for which x = 9.066 and SD = +2.8823. Table VIII shows the work TABLE VIII.-The data of Table VII, with additions to show how the distribution is fitted. Column numbers for reference
1
2
3
4
5
6
7
8
x
dev
dev/SD
T. ord.
ord.
C x (ord.)
fo
fe
0 2 4 6 8 10 12 14 16 18 20
-9.066 -7.066 -5.066 -3.066 -1.066 +.934 +2.934 +4.934 -L.6.934 +8.934 +10.934
3.1454 2.4515 1.7576 1.0637 .3698 .3240 1.0179 1.7118 2.4057 3.0996 3.7935
.000l0(?) .01974 .08516 .22661 .37253 .37852 .23762 .09219 .01950 .00010 .00001(?)
.0347 6.8487 29.5458 78.6213 129.2475 131.3257 82.4411 31.9849 6.7654 .0347(?) .0035(?)
.0694 13.6974 59.0916 157.2426 258.4950 262.6514 164.8822 63.9698 13.5308 .0694(?) .0069(?)
0 4 47 197 281 242 134 63 24 7 1
0 14 59 157 258 263 165 64 14 0(?) 0(?)
3. R. A. Fisher and F. Yates, Statistical Tables for Biological, Agricultural and Medical Research (Edinburgh, Oliver and Boyd Ltd., 1957).
31
THE NORMAL FREQUENCY DISTRIBUTION
necessary for the fitting. Columns 1 and 2 show x and the deviations (dev) from the true mean (x). Column 3 shows the ratios dev/SD, and column 4 shows the table ordinates found by using Table IX. Each table ordinate must be multiplied by a factor N/SD to make it fit our particular set of data, where N = 1000 and SD = 2.8823. This process yields the values of column 5. We then make allowance for the fact that the class interval is C = 2, multiplying the values of column 5 by two to obtain those of column 6. These are then rounded off to the nearest whole number to obtain the final fitted values of column 8, contrasting with the observed values of column 7 taken from Table VII. For example, taking the value x = 12, the deviation is 2.934, whence the ratio dev/SD = 2.934/2.8823 = 1.0179. Opposite 1.01, and by interpolation in Table IX, the table ordinate is .23762. This is multiplied by 1000/2.8823 = 346.9452 to yield 82.4411, as in column 5. This, times C = 2, gives 164.8830 of column 6, and the rounded fitted value is then 165. The calculated ordinates of column 8 may make it look as if the curve were not quite symmetrical, but this is due to the fact that we do not have the exact mean = median = mode = 9.000 at any tabulated value of x. Furthermore, it should be noted that the calculated frequencies, as tabulated, add to only 994. The six missing observations are contained in the tails of the theoretical distribution. TABLE
IX.-The ordinates of the normal distribution.'"
D/SD
.00
.01
.02
.03
.04
.05
.06
.07
.08
.09
0.0 0.1 0.3 1.0 1.7 2.4 3.0
.3989 .3870 .3814 .2420 .0940 .0224 .0044
.3989 .3965 .3802 .2396 .0925 .0219 .0033
.3989 .3961 .3790 .2371 .0909 .0213 .0024
.3988 .3956 .3778 .2347 .0893 .0208 .0017
.3986 .3951 .3765 .2323 .0878 .0203 .0012
.3984 .3945 .3752 .2299 .0863 .0198 .0009
.3982 .3939 .3739 .2275 .0848 .0194 .0006
.3980 .3932 .3725 .2251 .0833 .0189 .0004
.3977 .3925 .3712 .2227 .0818 .0184 .0003
.3973 .3918 .3697 .2203 .0804 .0180 .0002
Abr'dged from Table II of Fisher and Yates' Statistical Tables for Biological, Agricultural and Medical Research, published by Oliver and Boyd Ltd., Edinburgh, and reproduced here by permission of the authors and publishers.
IV THE BASIC CONCEPT OF SIGNIFICANCE IN STATISTICAL ANALYSIS
LET us CONSIDER two populations of the same species of fish, living in two lakes, A and B. We have been told that the fish in A are, on the whole, larger than those in B. We wish to check this assertion. We therefore take two adequate random samples al and bl from the two lakes, and calculate the mean lengths of the fish in these samples. These are x01 and Ab,, and, as was rather expected, ;1 is larger than xb1. Does this prove the assertion that the fish in A are larger than those in B? The man in the street will say it does, but should the research worker? The answer is, of course, that he most certainly should not do so, unless he proves the point by an appropriate significance test. Suppose we were to take two more samples, yielding ;2 and 4 2. Inasmuch as the samples are random, any fish might get into them, and it might well happen that, if the fish in the two lakes are not very markedly different in size, 4 2 might be greater than XQ2! This would be disconcerting to those who had made dogmatic assertions about the sizes of the fish in the two lakes, wouldn't it? Let us go back to the first two samples. If z,1 is greater than xb,, the difference can arise through the actions of two and only two causes: i) the fish in lake A really are larger than those in lake B, ii) it just so happened that such fish were included in the two samples so as to make it appear this way; that is to say, the difference arose by chance. There is always the risk of this in sampling. The smaller the samples, and the less the real difference (if there is a difference), the greater the risk. We could settle the question by calculating the probability that the difference arose by chance. If this probability is very small, we can conclude with reasonable certainty that the difference is real. If, on the other hand, the probability is large, we can conclude per contra that the difference is probably not significant. This reasoning is the basis of all significance tests used in statistical analysis. Such tests of significance are of immense importance in research work, and one should never make ex cathedra statements about differences between values without adequate backing from appropriate significance tests. It is common knowledge that such statements are often made; indeed, they are the backbone of the advertising industry. But no reputable scientist should do this sort of thing. Occasionally it will be found that a test for significance 32
SIGNIFICANCE IN STATISTICAL ANALYSIS
33
is inconclusive. If one has the courage of one's convictions in the matter, it is permissible to make an assertion, but one must point out at the same time that the data are insufficient to prove the point. One of the most useful tests is the chi-squared test (x2 test), discussed below. The chi-squared test This testis used to answer such questions as 'Do the data of this frequency distribution agree with this or that prior hypothesis?' We might apply the test to see if the data of Table VII follow a normal frequency distribution. They rather look as if they do, but there are some differences as seen from columns 7 and 8 of Table VIII. Are these differences really serious, or have they perhaps arisen by chance? If they have probably arisen by chance, we need not be too worried, and we can feel reasonably certain that the data do follow a normal distribution. As a matter of fact, we shall soon see that there is less than one chance in a thousand that these differences can have arisen by chance. We have no grounds then for asserting that the data of Table VII follow a normal distribution. Obviously any measure of the goodness-of-fit between the data of columns 7 and 8 of Table VIII must take into account the sum total of the differences between the frequencies of occurrence of the various values of x. Furthermore, just as in the case of the calculation of the standard deviation, we ought to give extra importance to the larger differences. We can do this by forming the successive differences: frequencycalculated minus frequency-observed, or (fc — fo) for the successive values of x. We then square these to form (ft, — . 0)2. However we must adjust each of these values for the value off, because, obviously if ff were generally large, the total difference would be large, whereas if fc were generally small, the difference would be also. This process yields a succession of values of the form tic — fo)2/fc. The sum of these is chi-squared. The formula is then
,2 = V r(Jc fc—DIJ When the value of chi-squared is known, one enters a table opposite the proper number of degrees of freedom (see below), to find a probability value P at the top of the table (see Table XI). This is the probability that a chi-squared value as large as or larger than that calculated could have arisen by chance, which is what we have been seeking. A point arises here: how small must P be so that we can feel reasonably sure that the difference has not arisen by chance, and is therefore real? There is no absolute value. No matter how small P may be, there is always some probability that chance has intervened (unless P = 0). In general, statisticians accept a value P = .05, or one chance in twenty. If P is greater than this, it is generally assumed that the differences did arise by chance, and are not real. In some cases, for greater surety, one may use P = .01. In publications, one uses some such expression as `The difference is significant at the .05 level' and quotes the degrees of freedom. Table X shows the calculations for
THE ESSENCE OF BIOMETRY
34
a chi-squared test, checking the data of column 7 of Table VIII, against those of column 8 in the same table. In determining the number of degrees of freedom, one first examines the classes at the ends of the table. If there are classes with very small frequencies such as zero or, say, less than five, these are commonly lumped into a single artificial class at each end of the table. Each proper class from the middle (unlumped) part of the table is given 1 d.f., each lumped artificial class is given 1 d.f., and one is then subtracted from the total d.f. In Table X, the first class, for which x = 0, must be eliminated entirely as there is nothing to work on. The last three classes, for which x = 16, 18, and 20, may be lumped and given, as a whole, 1 d.f. There are then 7 d.f. for the classes for which x = 2, 4, 6, 8, 10, 12, and 14, making a total of 8 d.f. Chi-squared will then be allotted 8 — 1 = 7 d.f. TABLE
X.—The chi-squared test applied to the data of Tables VII and VIII. Column numbers for reference
1
2
3
x
fc
fo
4 ~! 5 (fc — Jo) (fc — fo)-
6 (fc — fo)2
7 d.f.
fc
2 4 6 8 10 12 14
0 14 59 157 258 263 165 64
0 4 47 197 281 242 134 63
0 10 12 40 23 21 31 1
0 100 144 1600 529 441 961 1
0 7.1429 2.4407 10.1911 2.0504 1.6768 5.8242 .0156
I6
14
24
10
100
18 20
0 0
7 1
7 1
49 1
7.1429 (omit) ao oo
0
1 I I 1 I 1 1
Last three classes, lumped 14
32 I
Total chi-squared
18
I
324
23.1428
52.4845
1
8 d.f. = 8 — 1 =7
In columns 2 and 3 of Table X, the calculated and observed frequencies are shown. Column 4 shows the difference between calculated and observed frequencies. This difference is squared in each case in column 5, and divided by L in column 6. The sum of column 6 is chi-squared = 52.4845. Referring to Table XI, an abbreviated chi-squared table from Fisher and Yates, we see opposite 7 d.f. that the largest chi-squared in the table (under P = .001) is only
SIGNIFICANCE IN STATISTICAL ANALYSIS
35
TABLE XI.-The distribution of chi-squared.* n
.95
.80
.50
.20
.10
.05
.02
.01
.001
1 2 3 4 5
.00393 .103 .352 .711 1.145
.0642 .446 1.005 1.649 2.343
.455 1.386 2.366 3.357 4.351
1.642 3.219 4.642 5.989 7.289
2.706 4.605 6.251 7.779 9.236
3.841 5.991 7.815 9.488 11.070
5.412 7.824 9.837 11.668 13.388
6.635 9.210 11.345 13.277 15.086
10.827 13.815 16.266 18.467 20.515
6 7 8 9 10
1.635 2.167 2.733 3.325 3.940
3.070 3.822 4.594 5.380 6.179
5.348 6.346 7.344 8.343 9.342
8.558 9.803 11.030 12.242 13.442
10.645 12.017 13.362 14.684 15.987
12.592 14.067 15.507 16.919 18.307
15.033 16.622 18.168 19.679 21.161
16.812 18.475 20.090 21.666 23.209
22.457 24.322 26.125 27.877 29.588
11 12 13 14 15
4.575 5.226 5.892 6.571 7.261
6.989 7.807 8.634 9.467 10.307
10.341 11.340 12.340 13.339 14.339
14.631 15.812 16.985 18.151 19.311
17.275 18.549 19.812 21.064 22.307
19.675 21.026 22.362 23.685 24.996
22.618 24.054 25.471 26.873 28.259
24.725 26.217 27.688 29.141 30.578
31.264 32.909 34.528 36.123 37.697
16 17 18 19 20
7.962 8.672 9.390 10.117 10.851
11.152 12.002 12.857 13.716 14.578
15.338 16.338 17.338 18.338 19.337
20.465 21.615 22.760 23.900 25.038
23.542 24.769 25.989 27.204 28.412
26.296 27.587 28.869 30.144 31.410
29.633 30.995 32.346 33.687 35.020
32.000 33.409 34.805 36.191 37.566
39.252 40.790 42.312 43.820 45.315
21 22 23 24 25
11.591 12.338 13.091 13.848 14.611
15.445 16.314 17.187 18.062 18.940
20.337 21.337 22.337 23.337 24.337
26.171 27.301 28.429 29.553 30.675
29.615 30.813 32.007 33.196 34.382
32.671 33.924 34.172 36.415 37.652
36.343 37.659 38.968 40.270 41.566
38.932 40.289 41.638 42.980 44.314
46.797 48.268 49.728 51.179 52.620
26 27 28 29 30
15.379 16.151 16.928 17.708 18.493
19.820 20.703 21.588 22.475 23.364
25.336 26.336 27.336 28.336 29.336
31.795 32.912 34.027 35.139 36.250
35.563 36.741 37.916 39.087 40.256
38.885 40.113 41.337 42.557 43.773
42.856 44.140 45.419 46.693 47.962
45.642 46.963 48.278 49.588 50.892
54.052 55.476 56.893 58.302 59.703
32 34 36 38 40
20.072 21.664 23.269 24.884 26.509
25.148 26.938 28.735 30.537 32.345
31.336 33.336 35.336 37.335 39.335
38.466 40.676 42.879 45.076 47.269
42.585 44.903 47.212 49.513 51.805
46.194 48.602 50.999 53.384 55.759
50.487 52.995 55.489 57.969 60.436
53.486 56.061 58.619 61.162 63.691
62.487 65.247 67.985 70.703 73.402
42 44 46 48 50
28.144 29.787 31.439 33.098 34.764
34.157 35.974 37.795 39.621 41.449
41.335 43.335 45.335 47.335 49.335
49.456 51.639 53.818 55.993 58.164
54.090 56.369 58.641 60.907 63.167
58.124 60.481 62.830 65.171 67.505
62.892 65.337 67.771 70.197 72.613
66.206 68.710 71.201 73.683 76.154
76.084 78.750 81.400 84.037 86.661
* Abridged from Table IV of Fisher and Yates' Statistical Tables for Biological, Agricu tural and Medical Research, published by Oliver and Boyd Ltd., Edinburgh, and by permission of the authors and publishers. 4
THE ESSENCE OF BIOMETRY
36
TABLE
XI. (cont.)
n
.95
.80
.50
.20
.10
.05
.02
.01
.001
52 54 56 58 60
36.437 38.116 39.801 41.492 43.188
43.281 45.117 46.955 48.797 50.641
51.335 53.335 55.335 57.335 59.335
60.332 62.496 64.658 66.816 68.972
65.422 67.673 69.919 72.160 74.397
69.832 72.153 74.468 76.778 79.082
75.021 77.422 79.815 82.201 84.580
78.616 81.069 83.513 85.950 88.379
89.272 91.872 94.461 97.039 99.607
62 64 66 68 70
44.889 46.595 48.305 50.020 51.739
52.487 54.336 56.188 58.042 59.898
61.335 63.335 65.335 67.335 69.334
71.125 73.276 75.424 77.571 79.715
76.630 78.860 81.085 83.308 85.527
81.381 83.675 85.965 88.250 90.531
86.953 90.802 89.320 93.217 91.681 95.626 94.037 98.028 96.388 100.425
102.166 104.716 107.258 109.791 112.317
Note. For odd values of n between thirty and seventy, the mean of the tabular values for (n - 1) and (n + I) may be taken. For larger values of n, the expression 1/2x2 - V2n - 1 may be used as a normal deviate, with unit variance, remembering that the probability for chi-squared corresponds to that of a single tail of the normal curve.
24.322. We would have to go still further out to the right, to a value of P much less than P = .001, to find a chi-squared as large as or larger than 52.4845. That is to say, the probability that a chi-squared as large as or larger than 52.4845 could have arisen by chance, is less than P = .001, or less than one chance in a thousand. It follows that differences as large as or larger than those which exist between the observed and calculated frequencies can scarcely have arisen by chance. Therefore, the observed data are not a good fit to the calculated normal frequency distribution. If our observed frequencies had been those from one experiment, and the calculated frequencies, those from another experiment, we would not be justified in pooling these two sets of data, because they do not come from the same universe and are not two examples of the same distribution. The observed data might fit a Poisson distribution quite well, being a little skewed. Chi-squared is a most useful test, and we shall come back to it often. Its form is not always as simple as that which we have just seen. It should be strongly emphasised that the values used to form chi-squared must be frequencies generated by the action of probability. One could not therefore use the chi-squared test to see if a cam had the same shape as a master cam by calling the radii of one cam, say, r, and of the other ro, and then forming anything like (ro - ra)2/ro. Summing a number of such formulations will produce something that looks like a chi-squared, but the table will not work with it. The standard error of the mean Our study of the chi-squared test has given us some appreciation of the concept of significance and we are now in a position to examine the standard error of the mean, a most important statistic. This necessitates a clear understanding of the relation of
SIGNIFICANCE IN STATISTICAL ANALYSIS
37
statistics which refer to the universe versus those which refer to samples drawn from that universe. Let us consider a universe consisting of say, one million normally distributed objects, the mean length of which is to be determined. If the objects happen to be insects difficult to catch, it is clearly impossible to measure each one, add all the lengths, and divide the total by one million. We must have recourse then to the taking of samples. Let there be m random samples S„ 52, S3, S4, ... , S, with respective means x1, x2, x 3, x4,• • • , x,„• Let the standard deviations be SDI, SD2, SD3, SD4 ... , The numbers of observations used for the respective samples are n1, n2, n3, n4, ... , n,„, and the n's are not necessarily equal. It will be found that if the samples are random, and the n's are adequately large, the means will all have much the same value, as will also the standard deviations. If there are enough means to plot a frequency distribution, it will be found that the means themselves form a normal frequency distribution. They cluster around a mean of means which is an approximation to X„, the true mean (which we are seeking) of the universe. If so many means are used that all the observations of the universe are involved, these means will actually cluster around X„. Thus, each sample mean may be thought of as a single observation. Averaged together, they provide a grand mean which is a good approximation to X. Indeed, each single mean is an approximation to X„. We have then, two kinds of distributions here. One kind, within each sample, consists of a set of n observations, clustering around the sample mean, according to the sample standard deviation. In the other kind we have sample means, clustering about an approximation to X. with another, special standard deviation. This special standard deviation is the standard error of the mean. If we knew its value, we would know how our sample means clustered around X„, or at least, around a good approximation to X. Without taking all the one million observations of our universe, we cannot calculate an exact value for this SEx. any more than we can calculate X„ itself, but we can easily get a good approximation to it from the standard deviation of any adequate random sample drawn from the universe. We have m such samples, and any one of them will do for this purpose. To obtain a good approximation to the standard error of the mean, we divide the standard deviation of any adequate random sample drawn from the universe by the square root of the number of observations in that sample. Thus, for the jth sample = SD; SERJ _ (approximately)SER.
Jn;
For example, let us suppose that X„ is actually X„ = 10.876. We might find that, for the jth sample, which uses 144 observations, x;= 10.995 and SD; = 1.32. Then SER f = 1.32/12 = .110. That is to say, in terms of the information from this particular sample, X„ is close to 10.995, and sample means from this universe might be expected to distribute about this value with a standard deviation of ±.110. If we use another
38
THE ESSENCE OF BIOMETRY
sample, say the kth or the gth, we would, of course, obtain slightly different values, but they will all be acceptable approximations to X„ and to SExu. That is the beauty of this process. Now it is characteristic of the normal frequency distribution that close to 99 per cent of random observations fall within plus or minus three times the standard deviation on either side of the mean. But any sample mean is a single observation, approximating the universe mean. Therefore, a single sample-mean will, with a probability of P = .99, lie within + 3(SD) of X. The SD in this case is the standard deviation with which sample means are distributed about X„, and this is closely approximated by any standard error of the mean obtained from any adequate random sample. Thus, in this case, we can feel 99 per cent sure that X. lies somewhere in the range x;— 3(SE,j) and x1 + 3(SEXf), or in the range 10.995 — .33 = 10.665 and 10.995 + .33 = 11.325. We don't know X. exactly, but we are 99 per cent sure that it is in this range. It can be seen that if the data are such that this range is very narrow, we have located X„ quite satisfactorily. These values, at either end of the range within which X„ lies, are called fiducial limits about Xu, and this is the common way of locating values in statistical work. It is almost never possible to calculate exact values because one cannot make all the necessary observations. If you object that this is a rather hazy business, one can only reply that it is the only way to do it. If the limits you obtain are not to your liking, being too far apart, go and get more data, and narrow them! It will now be clear that values derived from the analysis of data are often quite improperly reported. A research worker will take a sample (i.e. do a limited number of experiments, as opposed to an almost infinite number) and report that the mean value of this variable is such-and-such. True, his sample mean may well be a fair approximation to the universe mean, which he could have obtained from a huge number of experiments, but how good is it? We do not know unless he quotes at least the standard error of the mean, so that fiducial limits may be calculated. If he reports only the mean and standard deviation, he must also report the number of observations so that the standard error (and then the fiducial limits) can be calculated. Therefore, in reporting experimental mean values, give also, at the very least, the standard deviation and the number of contributing observations. To do the job properly, calculate the fiducial limits within which the universe value must lie. One does not have to do much reading to see that this rule is observed only too seldom. The validity of the mean of a single sample A mean can be calculated from any set of numbers but, if the variance (SD2) of the set is very large, the mean may not be a valid representative of the set. It is usually assumed that a mean, to be valid, should be at least 2.5 times its standard deviation. In the example we have been using, x = 9.066 and SD = 2.8823. The ratio 2/SD is 3.145, and the mean is therefore valid. In the old days, when one thought in terms of the probable error, it used to said that a mean should be at least three times the probable error. The probable error is .67449(SD) and is now seldom used.
SIGNIFICANCE IN STATISTICAL ANALYSIS
39
The significance of the difference between two means; large samples where d.f. > 30 Assume that two means have been determined: R. = 10.95, Na = 49, SDa = 1.4, while Xb = 11.3, Nb = 64, SDb = 2.4. Then SER = .20 and SEE,, = .30. The difference (zb — Ra) = 11.30 — 10.95 = .35. Is this difference real, arising from a fundamental difference between the two samples, or is it due to chance? If it is due to chance, another pair of similar means might show quite a different picture. We must think c!early here. When we ask if two means are significantly different we are really asking if they could, or could not, have come from the same universe. Now two universes may differ from each other in three ways, as far as this matter is concerned. They may have (a) the same means, but different variances, or (b) the same variances but different means, or (c) different variances and different means. Under (a) the two distributions would plot about the same mean, but one distribution would be flatter than the other. Under (b) the two would have the same shape, but would centre about different means. Under (c) one would be flatter than the other, and they would centre about different means. In some simple cases, we are concerned only with the mere magnitude of the two means. In such a case we may use a simple formula which neglects the possibility of differences in variances of the two parent universes. We calculate the standard error of the difference between these two means as SEdirr = SEa + SEy, assuming that the two variables R. and JCb are not correlated (see Chapter 1X). If the difference is at least three times SEdirr, we can say that the two means do differ significantly in magnitude. In this case SEdirr = J.202 + .302 = .3605. Then diff/SEdirr = .971, less than three. The two means do not differ significantly in mere magnitude. In point of fact, of course, samples are likely to differ both in variance and in the numbers of observations involved. We could test for difference in variance by means of the F test (Chapter IV) but without doing this, we can compensate for differences in Na and Nb and for possible differences in variance, using the formula SEdirr =
Na (SE!) +
Nb
Nb Na
(SE,),
which may be written also as SEdirr=
SDI SDb Nb +—. Na
In the example above, this yields SEdirr = J6å(•20)2 + 11(.30)2 = .3849.
THE ESSENCE OF BIOMETRY
40
Then diff/SEdiff = .909, less than three. This value is not quite the same as that previously obtained, but this is not surprising, because we have asked a different question. Using the first formula, we asked `Do these two means differ in mere magnitude?' In the second case, we asked `Do the samples from which these means were calculated come from a pair of universes having the same mean and variance, or-what amounts to the same thing-do they come from a single universe?' The use of the normal frequency area or probability table in tests such as the above We have stated that, if the difference between two means is three times the standard error of the difference, the two means certainly differ and do not come from the same universe. The probability is P = .99. But what about cases in which diff/SEdirf is somewhat less than three? How certain can we then be in making statements about significance? Such cases can be dealt with by the use of a table which gives the probability of the occurrence of deviations as large as (or larger than) various multiples of the standard deviation, the fact that the standard error is a kind of standard deviation being kept in mind. Such a table, like that used to calculate the ordinates of the normal curve, is tabulated in terms of the ratio dev/SD but, instead of giving the ordinates, it gives the probability that a single observation, made at random, will differ from the mean by an amount as large as or larger than the selected deviation. Table XII is an abbreviated version of Table II, of Fisher and Yates. Consider a distribution in which x = 12 and SD = 2.3. What is the probability that a single observation, made at random, will lie outside the range eight to sixteen, that is to say, TABLE X11.-The normal probability integral (table of the area of one tail beyond an abscissa tabulated as dev/SD).'
_ dev a
00
.01
.02
.03
.04
.05
.06
.07
.08
.09
.50000 .46017 .42074 .38209 .34458 .30854 .27425 .24196 .21186 .18406 .15866 .22750 .0213499 .0431671 .0679333
.49601 .45620 .41683 .37828 .34090 .30503 .27093 .23885 .20897 .18141 .15625 .22216 13062 30359 75465
.49202 .45224 .41294 .37448 .33724 .30153 .26763 .23576 .20611 .17879 .15386 .21692 12639 29099 71779
.48803 .44828 .40905 .37070 .33360 .29806 .26435 .23270 .20327 .17619 .15151 .21178 12228 27888 68267
.48405 .44433 .40517 .36693 .32997 .29460 .26109 .22965 .20045 .17361 .14917 .20675 11829 26726 64920
.48006 .44038 .40129 .36317 .32636 .29116 .25785 .22663 .19766 .17106 .14686 .20182 11442 25609 61731
.47608 .43644 .39743 .35942 .32276 .28774 .25463 .22363 .19489 .16853 .14457 .19699 11067 24536 58693
.47210 .43251 .39358 .35569 .31918 .28434 .25143 .22065 .19215 .16602 .14231 .19226 10703 23507 55799
.46812 .42858 .38974 .35197 .31561 .28096 .24825 .21770 .18943 .16354 .14007 .18763 10350 22518 53043
.46414 .42465 .38591 .34827 .31207 .27760 .24510 .21476 .18673 .16109 .13786 .18309 10008 21569 50418
SD
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 2.0 3.0 4.0 4.8
* Abridged, with slight changes of format, from Table Ill of Fisher and Yates' Statistical Tables published by Oliver and Boyd Ltd., Edinburgh, and by permission of the authors and publishers.
for Biological, Agricultural and Medical Research,
SIGNIFICANCE IN STATISTICAL ANALYSIS
41
will lie outside g — 4 and x + 4? The deviation is four, so that dev/SD = 4/2.3 = 1.739. Entering Table XII, we find opposite dev/SD = 1.7 the value P/2 (in the body of the table) = .14231. Opposite dev/SD = 1.8 we have .14007. By interpolation, the value appropriate to dev/SD = 1.739 is .14143. This is the half-probability, so that P = .2829, and this is the probability that the single observation will lie beyond the limits x = 8 and x = 16. The probability that it will be either greater than eight or less than sixteen is clearly 1 — .2829 = .7171. When we come to compare two means, we can use the table in the same way. We have recently calculated diff/SEd rr = .35/.3849 = .909. In Table XII, opposite dev/SD = .91 we find the half-value .18141, so that P = .36282. This is much larger than the commonly accepted boundary value of P = .05 between significance and nonsignificance. We conclude therefore that the difference Re, — JCa = .35 might very well have arisen by chance. There is therefore no significant difference between the two means. The null hypothesis The test just performed is usually thought of in terms of a gull hypothesis, the hypothesis in this case being `There is no significant difference between åa and If P, however it is determined, is greater than P = .05, it is considered that the difference arose by chance, and the null hypothesis (as in this case) is confirmed. Fisher's T test In the example given above, we made use of a normal frequency area table to test the significance of a difference between two means. This is satisfactory if the number of degrees of freedom is fairly large, say thirty or more. If, however, the number of degrees of freedom is small, the use of this table does not give correct results. To obtain correct results using an area table, one would require a special table for each number of degrees of freedom, clearly an impracticable business. Therefore, with only a few degrees of freedom, one uses a table (such as Table XIII) of the distribution of a quantity T, and such a table is accurate right down to only 1 d.f. The value T with which one enters such a table is the same dev/SD with which we are familiar (or dev/SE, depending upon the problem). The T test is so convenient that, in practice, one uses it on all occasions, and does not bother with the area table at all. The significance of the difference between two means, small samples, by T test In order to calculate T we require formulae which will take account of the small number of degrees of freedom, and will allow for possible differences in the size of the two samples. The formulae must also allow for possible differences in variance of the two samples. Depending upon what has already been calculated in a given case, there are a number of equivalent formulae as follows: Formula No. 1. Use this where SD. and SDI are already known, and where
42
THE ESSENCE OF BIOMETRY
Na and Nb have been used to calculate them (but not elsewhere). The entered with Na + Nb degrees of freedom.
J
N0Nb
(xa — 4) T—
T table is
N a + Nb JNa(SDa) + Nb(SDb) J Na + Nb — 2
Formula No. 2. Use this when SE! and SEb are already known, and where N. and Nb have been used to calculate SDa and SDb (but not for converting these to SEa and SEb). Enter the T table with Na + Nb degrees of freedom.
J
(Xa — Xb)
N aNb
N.+ Nb
!Na(SEE) + Nb(SEb) J N a + Nb — 2
Formula No. 3. Use this where SD! and SDb have been calculated using Student's correction, with Na — 1 and Nb — 1 degrees of freedom for calculating SD, and SDb respectively, but Na and N6 are used elsewhere. Enter the T table with Na + Nb degrees of freedom.
J
xb)
NaNb
N,,+ Nb T— V(N.-1)SD! + (Nb — 1)SD6 Na + Nb — 2 (x° —
Formula No. 4. Use this when SE, and SEb have been calculated, and Na — 1 and Nb — 1 degrees of freedom have been used to calculate SD. and SDb, but not elsewhere, and not for converting SD's to
SE's. xb)
T—
J NaNb Na + Nb
N0(N0 — 1)SE; + Nb(Nb — 1)SEb
J
N.+ Nb -2
Example of a T test, using formula No. 4. Consider the following two samples, A and B. Sample A N.= 12, d.f. (for Na — 1, Student's correction) = 11, JCa = 17.50, E f (x— g.)2 = 141.518, SD! = 141.518/11 = 12.8653, SE! = 12.8653/12 = 1.0721 (note 12 d.f. used here), SEa = 1.0354.
SIGNIFICANCE IN STATISTICAL ANALYSIS
43
Sample B Nb = 14, d.f. (for Nb - 1, Student's correction) = 13, b = 15.50, Ef(x - xb) 2 = 125.7515, SDI = 125.7515/13 = 9.6732, SDb = 3.1102, SEI = 9.6732/14 (note 14 d.f. used here) = .6909, SEb = .8312. (17.50 - 15.50)
T-
/(11)(12)(1.0721)
!(12)(14) 12+14
+ (14)(13)(.6909)
- 1.5233
12+14-2 We enter Table XIII with 12 + 14 - 2 = 24 d.f., and opposite 24 d.f. we find .2 over 1.318. By interpolation, the value for T = 1.5233 is
P = .1 over 1.711, P = P = .148.
This means that, for the data of this example, a difference as large as or larger than 2.00 could occur between these means, purely by chance, as often as 148 times per thousand trials. This value of P is greater than P = .05, so we assume that the null hypothesis (go - Xb = 0) is confirmed. The two means do not differ significantly. TABLE XIII.-The distribution of T. Probability n= d.f.
.9
.8
.7
1 2 3 4 5
.158 .142 .137 .134 .132
.325 .289 .277 .271 .267
.510 .445 .424 .414 .408
.727 1.000 1.376 1.963 3.078 6.314 12.706 31.821 63.657 636.619 .617 .816 1.061 1.386 1.886 2.920 4.303 6.965 9.925 31.598 .584 .765 .978 1.250 1.638 2.353 3.182 4.541 5.841 12.924 .569 .741 .941 1.190 1.533 2.132 2.776 3.747 4.604 8.610 .559 .727 .920 1.156 1.476 2.015 2.571 3.365 4.032 6.869
21 22 23 24 25
.127 .127 .127 .127 .127
.257 .256 .256 .256 .256
.391 .390 .390 .390 .390
.532 .532 .532 .531 .531
.686 .686 .685 .685 .684
.859 .858 .858 .857 .856
1.063 1.061 1.060 1.059 1.058
1.323 1.321 1.319 1.318 1.316
1.721 1.717 1.714 1.711 1.708
2.080 2.074 2.069 2.064 2.060
2.518 2.508 2.500 2.492 2.485
2.831 2.819 2.807 2.797 2.787
3.819 3.792 3.767 3.745 3.725
40 60 120 co
.126 .126 .126 .126
.255 .254 .254 .253
.388 .387 .386 .385
.529 .527 .526 .524
.681 .679 .677 .674
.851 .848 .845 .842
1.050 1.046 1.041 1.036
1.303 1.296 1.289 1.282
1.684 1.671 1.658 1.645
2.021 2.000 1.980 1.960
2.423 2.390 2.358 2.326
2.704 2.660 2.617 2.576
3.551 3.460 3.373 3.291
.6
.5
.4
.3
.2
.1
.05
.02
.01
.001
• Abridged from Table III of Fisher and Yates' Statistical Tables for Biological, Agricultural and Medical Research, published by Oliver and Boyd Ltd., Edinburgh, and by permission of the authors and publishers. Snedecor's
F test, or the variance-ratio test
Two universes may differ from each other in many ways, but in particular they may have different means or different variances (the variance is the square of the standard deviation, SD2). The significance of a difference between means may be
44
THE ESSENCE OF BIOMETRY
determined by the T test. The significance of a difference between variances is determined by the F test, devised by G. W. Snedecor, and named the F test in honour of Sir Ronald A. Fisher, author with F. Yates of the well-known Statistical Tables for Biological, Agricultural and Medical Research, from which the author has been privileged to quote so many tables in the present book. This test is carried out not in terms of the difference between two variances, V. and Vb, but in terms of the ratio between them, the greater variance always being placed in the numerator of the fraction. Thus F
_ greater variance lesser variance
Having obtained F, one then chooses a suitable significance level, the table being arranged with one page for P = .20, one for P = .10, and so on. Across the top of each page (see Table XIV) are the degrees of freedom of the universe (or sample) which yielded the greater variance. Down the left side of the table are the degrees of freedom of the universe (or sample) with the lesser variance. Having selected a page, one finds the d.f. for the lesser variance on the left, and moves across that row to come under the d.f. for the greater variance at the top of the table. At the intersection of row and column, one finds a value of F. If this is too small, one turns to the next page (or pages) with a smaller value of P and tries again in the same way. Eventually, one finds a page in which the F value at the proper place is a close approximation to the one resulting from the F test formula. P is then the value heading this page. This matter will be clear from the example given below. Example of an F test Consider two distributions (samples or universes) which have the same mean, „ = åb = 3.1471. The variance of A is V. = SD! = .9435, and Vb = SDI = .3198. The degrees of freedom for the sample with the greater variance (.9435) are Na = 12 (appearing in Table XIV as d.f.9) while those for the sample with the lesser variance (.3198) are Nb = 26 (appearing in Table XIV as d.f.r). Then F = .9435/.3198 = 2.9502. Looking in Table XIV, in the abbreviated version of its first `page' under `20 points,' we find, opposite d.f., = 26 and under d.f.9 = 12, a value F = 1.47, which is too small. We therefore move down Table XIV, equivalent to moving to a page further on in Fisher and Yates' Table V. In the homologous place under P = .10, we find F = 1.81, still too small. Finally, under P = .01, in the homologous place, we find F = 2.96, almost exactly our value of 2.9502. Therefore, the probability that a variance ratio as large as or larger than 2.9502 could have arisen by chance from the data we have in this example is P = .01, a very small probability. We conclude therefore that it did not in fact arise by chance, and that the two variances do differ significantly, i.e. the null hypothesis V. = Vb is not confirmed. Thinking in another way, we could say that sample A was drawn from a universe having a mean of 3.1741, but a variance of approximately .9435, while sample B was drawn from a universe having also a mean of 3.1741, but a smaller variance, i.e. Vb = .3198.
45
SIGNIFICANCE IN STATISTICAL ANALYSIS TABLE
XIV.-Variance ratio.*
20% points of e"-= (= F), equivalent to P = .20 (Fisher and Yates, page 47) d.f.,, d.f.t 25 26 27
1
2
3
4
5
6
8
12
24
1.73 1.73 1.73
1.72 1.71 1.71
1.66 1.66 1.66
1.62 1.62 1.61
1.59 1.58 1.58
1.56 1.56 1.55
1.52 1.52 1.51
1.47 1.47 1.46
1.41 1.40 1.40
1.32 1.31 1.30
1.82 1.81 1.80
1.69 1.68 1.67
1.52 1.50 1.49
2.16 2.15 2.13
1.96 1.95 1.93
1.71 1.69 1.67
3.32 3.29 3.26
2.99 2.96 2.93
2.62 2.58 2.55
2.17 2.13 2.10
4.91 4.83 4.76
4.31 4.24 4.17
3.66 3.59 3.52
2.89 2.82 2.75
10 % points, equivalent to P = .10, Fisher and Yates, page 49 25 26 27
2.92 2.91 2.90
2.53 2.52 2.51
2.32 2.31 2.30
2.18 2.17 2.17
2.09 2.08 2.07
2.02 2.01 2.00
1.93 1.92 1.91
5% points, equivalent to P = .05, Fisher and Yates, page 51 25 26 27
4.24 4.22 4.21
3.38 3.37 3.35
2.99 2.98 2.96
2.76 2.74 2.73
2.60 2.59 2.57
2.49 2.47 2.46
2.34 2.32 2.30
1% points, equivalent to P = .01, Fisher and Yates, page 53 25 26 27
7.77 7.72 7.68
5.57 5.53 5.49
4.68 4.64 4.60
4.18 4.14 4.11
3.86 3.82 3.78
3.63 3.59 3.56
0. % points, equivalent to P = .001 25 26 27
13.88 13.74 13.61
9.22 9.12 9.02
7.45 7.36 7.27
6.49 6.41 6.33
5.88 5.80 5.73
5.46 5.38 5.31
* Abridged, with some changes of format, from Table V of Fisher and Yates' Statistical Tables for Biological, Agricultural and Medical Research, published by Oliver and Boyd Ltd., Edinburgh, and by permission of the authors and publishers.
Which of the above tests to use in a given case? There is no set rule as to which of the above tests one should use in a given case. If one is checking the presumably steady increase of some variable by taking sets of measurements at intervals, all one is interested in is, is it bigger than it was yesterday? In such a case, one would simply carry out a T test to see if the mean value today is significantly larger than it was yesterday. If, on the other hand, one had two lots of insects, collected in two different places and possibly not really of the same strain or species, one is interested in the background from which these samples have been drawn. One would therefore probably carry out an F test as well as a T test. If one were operating a manufacturing process producing an article to stringent size-standards, one would naturally be interested in obtaining the correct mean size.
46
THE ESSENCE OF BIOMETRY
One would also like to know how the variability stood up, day by day. Thus one would compare the output, day by day, by T tests and F tests. If slightly oversize articles are saleable and the undersize ones are not, one would want to be sure that the distribution is not skewed right as this would mean an excessive percentage of rejects. If manufacture is to be extremely precise, and costs are high, one might strive for a markedly leptokurtic distribution with a small standard deviation, with, of course, the correct mean. This would cluster the sizes close to the mean, and reduce the percentage of over- and undersize articles.
v
THE BINOMIAL DISTRIBUTION The relation of the binomial to the normal distribution The normal frequency distribution is a distribution of a continuous variable. In theory, x may have any value from — co to + co, though the range of values may be limited in practice. This means that the steps between successive values of x may be infinitely small, and it is necessary then to tabulate the data in classes or categories. The binomial distribution is a distribution of a discontinuous variable, and x (for all practical purposes) can assume only integral values, and usually only positive values. In point of fact, the variable which is distributed is the frequency of occurrence of some kind of event: heads or tails, dead or alive, male or female, and so on. Obviously, as noted previously, one can have three heads or seven heads, but not two and a half. In Chapter I, dealing with probability, we discussed the expansion of (P + Q)", the binomial expansion, and used it to calculate the results to be expected when tossing groups of coins. If we toss only three coins at one toss, we might get any one of HHH, (HHT or HTT), or TTT. If we make this trial many times, these results will show in the ratio of 1:2: 1. If we toss a mass of 1,000,000 coins, we would obtain anything from 1,000,000 heads and zero tails (enormously unlikely), through 500,000 heads and 500,000 tails (most the probable result), to all tails and no heads (again enormously unlikely). We then showed that the possible results could be arranged in order, and the probabilities of their occurrence worked out from the binomial expansion. The expansion is easy to set out if n is small, but difficult if n is large. In discussing the binomial distribution we shall use the notation P = probability of success, Q = probability of failure, (P + Q) = 1, K = number of coins tossed in a batch at one time (this is what we have so far called n). N = number of times the batch of K coins is tossed, x = number of heads (or tails if you wish) appearing on the table after one toss of a batch of K coins. Then the probabilities of the various numbers of heads (zero heads, one head, two heads, ... , K heads) on a purely theoretical basis, are the values of the successive 47
48
THE ESSENCE OF BIOMETRY
terms of the expansion of (Q + P)". Out of N tosses of the batch of K coins, the frequencies of occurrence of zero heads, one head, two heads, etc., are N(Q + P)x, evaluating the successive terms. If P = Q, the distribution is symmetrical and, for large K, looks very like the normal frequency distribution. Indeed, as K approaches infinity, it can be shown that the binomial distribution approaches the normal, with a mode located at K/2. It does not become exactly the normal curve, but the difference is negligible. If P is greater than Q, the curve is skewed left, and vice versa. At this stage, before going on to discuss the calculation of the mean, the standard deviation, and the standard error of the mean, we must be sure that we are clear as to what we are dealing with. In any given investigation we may have an approximate binomial distribution brought into being through the operation of P, Q, N, and K. Behind the scenes, as it were, there is an exact distribution, calculable from these parameters. The theoretical distribution is really the universe from which our sample is drawn. In the normal distribution, there is only one way to calculate the mean, the standard deviation, and the standard error, i.e. by carrying out certain standard summations and divisions as explained in Chapter III. In the case of the binomial distribution, exactly the same operations can be carried out on the experimental data, yielding Y, SD, and SEE for the sample. For the theoretical distribution, however, owing to the nature of the distribution, these statistics can be calculated by very simple methods on a theoretical basis. Of course one could calculate them by `brute force' methods and get the same results, but there would be no point in doing so. Bear in mind that these simple methods (which will be shown later) apply only to the theoretical distribution, and must not be used to calculate the statistics of the experimental distribution. This matter will become clear when we see how to calculate the values for a theoretical distribution, given the values of P, Q, N, and K. In order to do this we require Pascal's triangle, as shown (in part) in Table XV. Each value in the body of Pascal's triangle is calculated by adding together the TABLE
K
XV.—Pascal's triangle to K = 12 Binomial coefficients
1 2 1 3 1 4 4 1 I 5 5 6 1 6 15 7 21 1 7 8 I 28 56 8 9 1 9 36 84 10 45 120 210 1 10 11 1 165 330 11 55 12 1 220 495 792 12 66
1
I
2 3
I
3 6
10
1 4 21
35 70
126
84
126
462 924
36
330 792
1 8
120
210 462
1 7
28
56
252
1 6
15
20
35
I 5
10
1 10
45 165
495
1
9 220
1
lI
55 66
1
12
I
THE BINOMIAL DISTRIBUTION
49
two numbers to the right and above it, and to the left and above it. Thus, opposite K = 12, 924 = 462 + 462.
Consider now a binomial expansion, which begins (0 + P) R
_
K(K " Q K + hQ - Y+
- I)
K _ , , K(K
P+
Q
- 1)(K - 2) QK -P 3 3 + etc. ~ 3
The values in Pascal's triangle are the coefficients 1,
K,
K(K - 1) 2! '
and the values Q",
'P, Q"_
K(K -1)(K - 2)
3!
()K-2P2
QE-3p3,
etc.
must be calculated. This process is best set up in table form, as in Table XVI, where these needed values are calculated for an experiment in which 5,000 lots of animals are reared, each lot consisting of seven animals. The probability of survival throughout rearing is P = .55, while the probability of death is Q = .45. TABLE XVI.-Calculation of the ordinates of a binomial frequency distribution where P = .55, Q = .45, K = 7, and N = 5000. Sure. in lot =X
0 2 3 4 5 6 7
Colarur numbers for reference
1
2
Powers of P
Powers of Q
P° = 1.000000 Pr = .550000 P 2 = .302500 P3 = .166375 Pa = .091506 P5 = .050328 P6 = .027681 P 7 = .015224
3
4
PQ P.T. Probs.
1 Q6 = .008304 .004567 7 Q5 = .018453 .005582 21 Q4 = .041006 .006822 35 Q3 = .091125 .008338 35 Q2 = .202500 .010191 21 Q 1 =_ .450000 .012456 7 Q° = 1.000000 .015224 1 Q7 = .003737 .003737
Totals obtained Theoretical totals
5
6
7
fr, cafe. Rounded f,,
.003737 18.685 .031969 159.845 .117222 586.110 .238770 1193.850 .291830 1459.150 .214011 1070.055 .087192 435.960 76.120 .015224
19 160 586 1194 1459 1070 436 76
.999955 4999.775
5000
1.000000 5000.000
5000
Note. The total of column 5 should be almost exactly 1.000000, and that of columns 6 and 7 should be, in each case, the number of lots, N. Always check the total of column 5 before proceeding further.
Referring to Table XVI, column 1 shows the successive powers of P, and column 2, those of Q. Owing to the iterated multiplication, one should use as many decimal places as the machine will hold for forming these values. Column 3 shows the products of the values in column 1 by those in column 2. In column 4 are the values from Pascal's triangle for the appropriate value of K, being K = 7 in this case. Column 5 shows the products of the values in column 3 by those in column 4. These are the
50
THE ESSENCE OF BIOMETRY
probabilities that a lot chosen at random will show from zero to seven survivors. There being 5,000 lots, column 6 shows the expectations, i.e. the numbers of lots (out of 5,000) which show zero to seven survivors. One cannot have fractional parts of a survivor, so the values of column 6 are rounded off to the nearest integral values in column 7. These values in column 7 are then the ordinates of a theoretical binomial distribution at x = 0 to x = 7, where P = .55, Q = .45, K = 7, and N = 5,000. The statistics of a binomial distribution The number of animals in a lot being K, and the probability of surviving being P, it is obvious that the average lot will have KP survivors; or, thinking of number of survivors as the variable x, we find that
x=KP. In this case this yields x = 7(.55) = 3.85. Doing the calculation in the ordinary way, we would find from table XVI that f (x) 19248
x — L' N = 5000 — 3.8496. The slight discrepancy is due to losses in the tail. It can be shown that SD=\IPQK whence SD = x/(.55)(.45)(7) = 1.3163. If this is calculated in the ordinary way, we would find 2
Ef(N x) — 1.3163, the same value.
SD =
The standard error of the mean is, of course, the standard deviation divided by the square root of the number of observations. Now each lot of animals reared can be thought of as an attempt to find the mean (x = 3.85), and there are N = 5,000 such lots. Then SD
SE, = =1.3163/5000 = .01862 SEI = .000347 approx. One is not generally much concerned with the skewness or kurtosis of a binomial distribution. All binomial distributions are skewed unless P = Q. However, it can be shown that skewness is measured by G1
_
Q —P
SD
—
.45—.55
1.3163
.0760,
51
THE BINOMIAL DISTRIBUTION
and the distribution is skewed to the left (more lots with many survivors), as may be seen from Fig. 4. The kurtosis is measured by G2 =
1 — 6PQ 1
— 6(.55)(.45) 1.31632
SD2
— .2799.
In a binomial distribution, skewness is zero for the symmetrical curve, and approaches zero as N becomes infinite. The kurtosis is zero only as N approaches infinity. 1500 0----
'
.
.
3 ~ 1000 v, Il '5 r2 . ■
0
\\
\
-
\
\
'0N
0 /
°
-° 6' '> 500 L f-.
-0 ,
.
.
/ /
_
/
\\ N.
/
5.
N.
N.
N.
o`
.. .
0'
1
2
3
4
5
6
7
= number of survivors in a lot Fig. 4.—A binomial frequency distribution, plotted from the data of Table XVI, where P = .55 and Q = .45. The dotted curve shows the symmetrical distribution for P = Q = .50. Note that one really should not draw in the continuous curves because, in the sense in which we are dealing, the binomial is not a continuous distribution, but exists only for integral values of X. Comparison of the above theoretical distribution with an experimental distribution We have now obtained a theoretical distribution under the conditions that P = .55, Q = .45, K = 7, and N = 5,000. If we actually did obtain 5,000 cages and rear 35,000 animals, seven to a cage, we would not, of course, obtain quite these same frequencies for various numbers of survivors. We shall need to determine whether the distribution resulting from our rearings is actually a proper binomial distribution, one in which P and Q have operated in every class with the same force. A perfect distribution, such as that calculated in Table XVI, is said to be homogeneous. The chi-squared test for homogeneity of an experimental distribution Let us suppose that we have actually carried out rearings as above, with 5,000 cages, seven animals per cage, and with P and Q presumably equal to .55 and .45. The data we might obtain are in Table XVII. We set up chi-squared as in Table XVIII, there being 8 — 1 = 7 degrees of freedom. Chi-squared = 11.6070 whence, from a table of chi-squared, P (the probability that differences as large as or larger than those we have between the observed frequencies 5
52
THE ESSENCE OF BIOMETRY
of Table XVII and the calculated frequencies of Table XVIII) is about .120. That is to say, the difference between our experimental distribution and a perfect homogeneous distribution with P = .55 and Q = .45, very probably arose by chance. Therefore our experimental distribution is very probably homogeneous. XVII.—Fictitious data of an experimental distribution, 5,000 lots, seven animals per lot.
TABLE
No. of sure%
No. of lots
No. of sarv.
No. of lots
No. of sum
No. of lots
No. of sarv.
No. of lots
0 I
11 177
2 3
590 1145
4 5
1492 1080
6 7
445 60
In an actual case, of course, one would not know P and Q a priori, but P is easily found by calculating the mean of the distribution in the ordinary way. This is then divided by K, yielding P at once. Q is then (1 — P). A theoretical distribution would then be calculated with these values of P and Q and the known values of N and K, and a chi-squared test carried out between the observed and calculated distributions. In this example, the mean of the distribution of Table XVII is 3.85, so that the theoretical distribution of Table XVI is what would be calculated after carrying out the rearings. XVIII.—Calculation of chi-squared to test the homogeneity of the distribution of Table XVII.
TABLE
fo
(Tab. XVII)
(Je — r)
(fe — f,)2
Cfc— for
(Tab. XVI) 19 160 586 1194 1459 1070 436 76
11 177 590 1145 1492 1080 445 60
8 17 4 49 33 10 9 16
64 289 16 2401 1089 100 81 256
3.3684 1.8063 .0273 2.0109 .7464 .0935 .1858 3.3684
fc
d.f. = 8 — 1 = 7
Total x2 =
fc
11.6070
Chi-squared test for a general difference between two binomial distributions In the example just discussed, a chi-squared test was carried out between an experimental and a theoretical distribution. One might have two experimental distributions and need to know if they are substantially the same. This can be tested by the chisquared test, using either one as the calculated one, the other as the observed.
THE BINOMIAL DISTRIBUTION
53
Chi-squared test for the significance of the difference between average survival of animals, where the data form two binomial distributions One commonly meets the case in which animals are reared on, say, two different diets, and are placed in lots, in cages. One might also have two different kinds of cages, or rear in two different rooms, or the like. One is interested in possible significant differences between average percentages of survival under the two treatments. Consider, then, two experiments, each using one hundred batches of twenty animals. In experiment A the animals are fed on diet A, and in experiment B on some other diet. The data are: Diet A: survival = deaths = B: Diet survival = deaths =
1743/2000 257/2000 1452/2000 548/2000
=87.15 %, = 12.85 %, = 72.60 %, = 27.40 %,
P= .8715 Q = .1285 P = .7260 Q = .2740
Diet A appears to be the better. Is this a real difference, or is it due to chance? We may consider either experiment as the calculated one and the other, as the observed. It will make no difference to the value of the resultant chi-squared. We have 2 _ [Surv.(A) — Surv.(B)]2 [Dead(A) — Dead(B)]2 Surv.(A) + Dead(A) whence (1743 — 1452)2 (257 — 548)2 2 = 378.0816. + 1743 257 From a chi-squared table it will be found that, for 1 d.f. P is much Iess than .001. That is to say, there is less than one chance in a thousand that the differences between the results from diet A and diet B could have arisen by chance. Thus diet A is significantly better than diet B. Note that the degrees of freedom amount to only one. This is so because, having fixed the survivors and dead for one diet, those for the other are then fixed, if the value of chi-squared is to be obtained. Thus only one category is free to vary. Fiducial limits for the universe of which a binomial distribution is a sample Owing to its `either-or' character, the binomial distribution is applicable to the analysis of opinion polls, voting, acceptance tests in quality control, and similar material. For example, one may wish to determine what percentage of the general populace likes some commerical product. One cannot ring all the door bells in the country, so one takes a sample of, say, one thousand representative families and records their responses—`yes' or `no'. Similarly, on the basis of a sample of parts made by a machine, if Y per cent are defective, what are the limits on the percentage of rejects in the millions of parts this machine may make in a week? These limits can be determined by calculation, but are usually found from a table.
THE ESSENCE OF BIOMETRY
54
Table XIX is an abbreviated version of Table 1.3.1 from Snedecor's Statistical Methods.4 If we know the percentage of `yes' answers from a sample of given size, this table permits us to find the fiducial limits for a 95 per cent certainty, in the general population from which the sample was drawn. XIX.-95 per cent confidence limits (fiducial limits) for a binomial distribution. The values in the body of the table are percentages.*
TABLE
Column numbers for reference 1
2
I
3
4
=f
50
100
Fraction saying `yes' = f/N
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
52-79 54-80 56-82 57-84 59-85 62-87 64-88 66-90 69-91 71-93 73-94 76-95 78-97 81-98 83-99 86-100 89-100 93-100
24-43 25-44 26-45 27-46 28-47 28-48 29-49 30-50 31-51 32-52 33-53 34-54 35-55 36-56 37-57 38-58 39-59 40-60
.33 .34 .35 .36 .37 .38 .39 .40 .41 .42 .43 .44 .45 .46 .47 .48 .49 .50
Number saying `yes'
Numbered questioned =N
5
I
6
Numbered questioned =N
250
1000
27-39 28-40 29-41 30-42 31-43 32-44 33-45 34-46 35-47 36-48 37-49 38-50 39-51 40-52 41-53 42-54 43-55 44-56
30-36 31-37 32-38 33-39 34-40 35-41 36-42 37-43 38-44 39-45 40-46 41-47 42-48 43-49 44-50 45-51 46-52 47-53
* Abridged, with some change of format, from Table 1.3.1 of George W. Snedecor's Statistical published by the Iowa State University Press. Reproduced here by permission of the publishers. Note. In column 1, if f exceeds 50, enter the table with (100 — f) and subtract each confidence limit from 100. In column 4, if f/N exceeds 50%, enter the table with (100 — f/N) and subtract each confidence limit from 100.
Methods,
Let us suppose that we have taken a sample of one thousand observations, and found 480, or 48 per cent of `yes' answers. In another part of the country, with a similar sample, 61 per cent say `yes.' In a third part of the country, with a small sample of only 50 families, 33, or 66 per cent say that they like the product. For the first sample of 1,000 families, 48 per cent said `yes.' Enter Table XIX at .48 in column 4. Under 1,000 in column 6, find the two confidence limits, 45 and 51. We can then feel 95 per cent certain that, in the populace as a whole, in this part of the 4. G. W. Snedecor, Statistical Methods (Ames, Iowa State University Press, 1959).
THE BINOMIAL DISTRIBUTION
55
country, somewhere between 45 per cent and 51 per cent of the people would say `yes' if asked. For the second sample of 1,000 people, 61 per cent said `yes'. Subtract: (100 — 61) = 39. Enter the table in column 4 at .39 and find the two limits, 36 and 42. In this case we can feel 95 per cent sure that, in this part of the country, somewhere between 58 per cent and 64 per cent of the populace would say `yes' if asked. In the third sample, we questioned only fifty people. Thirty-three, or 66 per cent said `yes.' Enter the table in column 1 at 33, and under 50 in column 2, find the limits 52 and 79. These limits are rather wide apart, due to the small size of the sample. A complete check in this part of the country might have shown a percentage of `yes' answers of 52 per cent, 53 per cent, 60 per cent, or anything up to 79 per cent. There is a 21- per cent chance that the percentage might be less than 52 per cent or more than 79 per cent. Obviously this sort of thing makes nonsense of such assertions as `Three out of four doctors prefer such-and-such cigarettes.' Unless one knows the size of the sample taken, nothing can be said as to the response of all doctors. If, as one suspects is often the case, the sample is hand-picked, the true percentage might be anything from zero per cent to one hundred per cent. In point of fact, if a hundred doctors are questioned, and 75 per cent say `yes,' the fiducial limits would be 65 per cent and 83 per cent. This is quite a wide margin. The author does not expect to see a TV commercial in which fiducial limits are mentioned! This sort of thing again points up the fact that, in doing research and in reporting upon the results, one must avoid making ex cathedra statements that `this' or `that' is exactly so. All one can really say is that a value probably lies between such and such limits. Beyond this one should not go, although many research workers do so. The binomial theorem applied to a problem in genetics Not uncommonly, a mutation arises in some plant or animal in a laboratory. One performs crosses to see if one, two, or more pairs of alleles are involved. The chi-squared test can be used in such cases. Thinking in terms of Mendel's famous three-to-one ratio for a character determined by a single pair of alleles, let us say that the parental generation is TT and ti, and the F1 generation is all Tt (heterozygous tall). The F2 generation is one (TT), two (Tt), one (tt), and these frequencies are obviously those of a binomial distribution where K = 2. Having obtained observed frequencies from crosses, we can then carry out a chi-squared test between observed and calculated frequencies, with d.f. = 3 1 = 2. If P is less than .05 (neglecting effects of crossing-over, differential mortality, etc.) then the mutation is not due to a single pair of alleles. Where two pairs of alleles are involved, the FZ generation will segregate as, say TTRR = 1 tall, round, TTRr = 2 tall, round, TTrr = 1 tall, wrinkled,
TIRR = 2 tall, round, TIRr = 4 tall, round, Ttrr = 2 tall, wrinkled,
ttRR = 1 dwarf, round, ttRr = 2 dwarf, round, ttrr = 1 dwarf, wrinkled.
56
THE ESSENCE OF BIOMETRY
These genotypes can be sorted out into the familiar nine (tall, round), three (tall, wrinkled), three (dwarf, round), one (dwarf, wrinkled) which is the usual 9:3:3: 1 ratio. We could carry out a chi-squared test between the frequencies observed and calculated. One could, however, think of the situation in another way, taking the possession of a dominant gene as a head and of a recessive gene as a tail (in terms of tossing coins). We would then have TTRR = 4 dominant, 0 recessive TTRr = 3 dominant, 1 recessive TtRR = 3 dominant, 1 recessive TTrr = 2 dominant, 2 recessive TtRr = 2 dominant, 2 recessive ttRR = 2 dominant, 2 recessive Ttrr = 1 dominant, 3 recessive ttRr = 1 dominant, 3 recessive ttrr = 0 dominant, 4 recessive
1 2 4 2 1 4 6 1 2 4 2 1
and these are the frequencies of a binomial distribution for which K = 4. The general use of the binomial distribution There is a marked tendency among research workers to cling to the normal frequency distribution, even when the plot is obviously skewed (when a chi-squared test would actually show poor agreement with a normal distribution). This is perhaps because people are more familiar with the normal distribution. When, however, there is any sort of `either-or' situation (animals may be dead or alive, male or female, and so on) one should not neglect the binomial distribution. Where P, the probability of survival (of being male, of answering `yes') is very small, one should think of the Poisson distribution. If in doubt, simply carry out a chi-squared test to see which distribution best fits the facts.
VI
THE POISSON DISTRIBUTION is a special case of the binomial distribution in which P or Q is very small, possibly approaching zero. It will be remembered from Chapter V that, where P is the probability of success and Q that of failure, the probabilities that a set of K coins when tossed will show zero, one, two, three, ... , K heads are shown by the successive terms of the binomial expansion. The general term for j heads is : THE POISSON DISTRIBUTION
P—
K(K-1)(K- 2)(K-3)... (K—j+1) Qk _ i pl.
If the set be tossed N times, the expectations for the numbers of times the set will show zero, one, two, three, ... , K heads, are N times the probabilities. The expectations and probabilities are easy to calculate using Pascal's Triangle if P or Q is not too small. If P or Q be very small, the higher powers of P or Q become so small as to be difficult to handle. Such a case would arise when dealing with the occurrence of a very rare insect in field plots, for example. In such a case we use the Poisson distribution, of which the general term is Ne-M Mx
x! where
N = numbers of samples taken, M = mean number of specimens found per sample, x _ number of specimens in a sample under consideration, corresponding to the number of heads in one toss of a mass of coins, e = 2.7182818, fx = number of samples, out of N, containing x specimens.
If S be the total number of specimens found, then M = SIN. The Poisson distribution, as in the case of the binomial distribution, is a distribution of a discontinuous variate. One may have two specimens in a sample, but not two and a half. The distribution has only one parameter, M, the average number of specimens per sample. In a theoretical Poisson distribution, the variance is always equal to M, but of course in an experimental distribution (which may not be quite 57
THE ESSENCE OF BIOMETRY
58
perfect, and not quite homogeneous) one must calculate the variance (which is the square of the standard deviation) in the ordinary way. The value of P may not be known a priori, and the exact size of the sample is not important, but the samples must be of reasonable size and all of the same size. They must of course be good random samples. As in the case of the binomial distribution, it is implied that the distribution is built up by the operation of a probability, P, which operates equally for all samples. This may or may not succeed in putting a specimen in a given sample, but it is always the same value of P. Thus, we can use the chisquared test for homogeneity of a Poisson distribution, just as we did for the binomial distribution. Poisson distributions are always markedly skewed, usually to the right, due to the small value of P. If Q is very small, they are skewed left. To calculate the ordinates of a Poisson distribution, one need know only N, the total number of samples taken, and S, the total number of specimens found. M can then be calculated as S/N. As in the case of the binomial distribution, the calculations may be set forth in an orderly fashion in a table. Table XX shows how this may be done. XX.-The fitting of a Poisson distribution where N = 1,500, S = 4,500 and M = 3.
TABLE
Column numbers for reference
1 x
2
3
Multipliers
calc. f
0 1 2
1.000000 3/1 = 3.000000 3/2 - 1.500000
74.680604 224.041812 336.062718
3 4 5 6 7 8 9 10 11 12 13 14
3/3 = 1.000000 3/4 = .750000 3/5 = .600000 3/6 = .500000 3/7 = .428571 3/8 = .375000 3/9 = .333333 3/10 = .300000 3/11 = .272727 3/12 = .250000 3/13 = .230769 3/14 = .214286
336.062718 252.047039 151.228223 75.614111 32.406015 12.152256 4.050749 1.215224 .331424 .082856 .019121 .004097
Totals
4
1499.999000
rounded f
75 224 336 If M is integral, there Jy are always two equal 336 ordinates, thus. 252 151 76 32 12 4 1 1 (I is lost in the tail) 1500
Example of fitting a Poisson distribution Suppose that 1,500 samples have been taken, in the form of small field plots. A total of 4,500 insects has been found. Then N = 1,500, S = 4,500, and M = 3. We
THE POISSON DISTRIBUTION
59
calculate first fo = Ne —M = 1,500e-3 = 74.680604. We then set up the multipliers M/ I , M/2, M/3, etc. It is then easy to show that f1 = (M/ 1)& .f2 = (M/2)fi , f3 (M/3)f2i and, in general, fi = (M/j)f f _ 1. Column 1 shows the number of insects in a plot, column 2 shows the multipliers. Each value in column 3 is obtained by multiplying the value one row up in column 3 by the multiplier opposite in column 2. For example, when x = 8, 12.152256 = (32.406015)(.375000). Beyond x = 10 the frequencies become fractional and, in terms of the rounded frequencies, total to one. In column 3, the total of 1499.9990 shows that a small frequency is still located out in the tail, for values of x greater than x > 14. The total number of insects found is, of course, the sum of the products of columns 1 and 4, and equals 4,489. The other eleven insects (to make the total of 4,500) are in the samples with calculated fractional frequencies, in the tail. Chi-squared test for the homogeneity of a Poisson distribution Exactly as in the case of the binomial distribution, we may enquire as to whether or not an experimentally obtained Poisson distribution is homogeneous; i.e. has P, the probability that a specimen will be included in a given sample, operated evenly all through the situation? XXI.—The chi-squared test for the homogeneity of an experimental Poisson distribution for N — 400, S = 800, M = 2.
TABLE
Column numbers for reference
I
2
3
X
le
le
4 (fe
—
5 fo)
(fc
—
6 f°)2
(fr
— for
7 d.f.
fe 0
I 2 3 4 5 6 7
54 108 108 72 36 14 5 )6 1
50 115 106 79 32 10 7 )8 1
4 7 2 7 4 4
16 49 4 49 16 16
.2962 .4537 .0370 .6806 .4444 1.1429
1 1 1 1 1 1
2
4
.6667
1
x2 =
3.7216 d.f. 7 — 1 = 6
Suppose that we take 400 samples from a grain-bin and find a total of 800 insects. Then N = 400, S = 800, and M = 2.00. The numbers of samples out of the 400 taken, containing various numbers of insects, are as in column 3 of Table XXI, and column 2 of that table shows the frequencies we would have if there existed a perfect homogeneous Poisson distribution (in this case) for which N = 400, S = 800 and M = 2.00. Carrying out the usual chi-squared test as shown, and lumping the last
60
THE ESSENCE OF BIOMETRY
two classes (opposite X = 6 and X = 7), x2 = 3.7213. With 7 — 1 = 6 d.f., P is greater than .70. That is to say, there is a very high probability that the differences we have between our experimental distribution and a homogeneous distribution with the same parameters, are due simply to chance. Our experimental distribution is then almost certainly homogeneous. The variance and standard error of a Poisson distribution The variance of a homogeneous Poisson distribution is equal to M, so that the standard deviation is the square root of M. What do we divide this by to obtain the standard error? We must think carefully. Suppose that we have two assistants, Jones and Smith. To Jones we say: `Go to that grain-bin in the barn, and take out 400 one-quart samples. Count the insects in each sample, and tabulate the results. When you have done this, pour all the grain and insects into one large container.' Jones does this and reports that he apparently gets a Poisson distribution, for which M = 2, the frequencies being those of column 3 of Table XXI. To Smith we say: `Go into the barn and find a single 400-quart sample from the grain-bin. Jones has got it ready for you in another container. Tell me how many insects there are in it.' Jones does this, and reports 800 insects, or an average of two per quart. Now Jones' efforts lead us to believe, on the basis of a distribution formed from 400 one-quart samples, that the average density of the insects in the bin is close to two per quart; that is to say, if we think of the bin as the universe, M„ is approximately two. Smith's efforts similarly lead us to believe, on the basis of a single, huge 400quart sample, that M„ is approximately equal to two. We can think then of Jones' 400 quarts as really one observation. Therefore, in transforming the SD to SE, we divide by the square root of one. Therefore, SE = SD = for the sample mean of a Poisson distribution. The standard error of S is similarly JS. Now let us assume a second bin of grain for which M, as determined in the same way, is M = 2.41. Is this second bin significantly more heavily infested than the first one? Let us call the two bins A and B. We have then Bin A: Ma = 2,
Su = 800,
SES, = ./800,
Bin B: Mb = 2.41,
Sb = 964,
SES` = N/964.
As the variances are not very different, we may use the simplest formula (see Chapter IV) for the standard error of the difference SEd " = ,/SEU2 + SEb = ,/1764 = ±42. Then T = diff/SE,;rr = 164/42 = 3.90. We may use this formula because large S's are virtually normally distributed. There
THE POISSON DISTRIBUTION
6I
is only one degree of freedom as we have really only two observations, Sa and Sb. From a T table, it is seen that P is in the region of P = .2, so that as far as our
data are concerned any difference is due to chance, and we cannot show any significant difference between the two bins. The chi-squared test for differences in over-all mortality Animals are often reared in batches, in cages, and as an animal may be either dead or alive, the frequencies with which the cages show various numbers of living and dead may be distributed according to a Poisson distribution, if the probability of death is small. Let us suppose that we have reared two sets of 50,000 animals, sets A and B, using two different diets. The results might be
Ra = 50,000 La = 49,780 Da = 220 Na = 50 Ma = 4.4
Total animals attempted to rear Total survived Total died Number of batches used Dead per batch (M = DIN)
Rb = 50,000 Lb = 49,620 Db = 380 Nb = 50 Mb = 7.6
Superficially, the diet for A appears to be better than that for B, but is it really better, or is the appearance of greater survival due merely to chance? We set up a small chi-squared test, using either A or B as calculated (La
—
Lb)- (Da — D Db)
r— La a
z
a
whence 7.-
(49780 — 49620)2 (220 — 380)2 = 116.88. 49780 220 +
From a chi-squared table, P is seen to be less than .001, with 1 d.f.; that is to say, there is less than one chance in a thousand that the differences we have can be due to chance. They are therefore almost certainly real, and diet A is significantly better than diet B. We shall see later (Chapter XI) that we can answer in a slightly different way the same question: whether, in this case, it made any difference which diet was used. This will be done with the aid of a two-by-two contingency table. The test will show that it most certainly did make a difference, but it will not say which diet was the better, or by how much. We would deduce, however, that diet A was the better. The shape of the Poisson distribution As stated, the Poisson distribution is usually markedly skewed, the more so, the smaller the value of M. If M should be zero, the plotted curve would be simply a vertical line at the origin because all samples would be empty. Fig. 5 shows plots of Poisson distributions for 1,500 samples (N = 1,500) and M = 1, 2, 3, 4, 5, 6, 7, and 8. The increased skewness as M decreases is very marked.
62
THE ESSENCE OF BIOMETRY
k 600 U
500 O
-o 400 4–, o Co 0 300 r •~ 200 t 0
,C4
100 •
0
0
2 3 4 5 6 7 8 9 ]0 11 12 13 14 15 16
17 18 19
X = number of specimens found in a sample Fig. 5.—Examples of Poisson frequency distributions, plotted on the bases that 1,500 samples are taken (N = 1,500), and M = 1, 2, 3, 4, 5, 6, 7, and 8. Note that the smaller the M, the greater is the skew right. As in the case of Fig. 4, we really should not draw in the continuous curves, but they are put in here to make the distributions leap to the eye.
VII
THE METHOD OF LEAST SQUARES IN CHAPTER VIII, we shall discuss the problem of regression. This subject is often taught as a sort of by-product of correlation, and it often proves to be a stumbling block to those who have not met it before. In point of fact, regression is quite easy to understand if one approaches it via the method of least squares, and then follows on with correlation as an outcome of regression. There exists a considerable mathematical background for the method of least squares. This will not be discussed, as an understanding of the background is not necessary for the use of the method. For those interested, there are many good texts on the subject. Let us consider a dependent variable, Y, which assumes a series of approximate values corresponding to a series of values of an independent variable, X. We assume that the function Y = 1(x) is single-valued. That is to say, for a given value of X, there is only one value of Y. For the moment, also, we shall assume that there is no experimental error, and that a plot of Y versus X forms a straight line sloping upward to the right. In practice, of course, there are experimental errors. For a given value of X we do not obtain an exact and unique value of Y. Thus, in practice, the plot is not a smooth straight line. It assumes the form of a number of points lying around a straight line, but not quite on it. Our problem is this: given the plotted points, what is the equation of the line which gives the best fit to these points? There is the feeling that, if it were not for the experimental error which scatters the points, they would lie on this line. One might be disposed to think that the best line would be any one for which the sum of the deviations between the points and the line is zero. This is one line which represents the points, but it is not the best line. The best line is that for which the sum of the squares of the deviations is as small as possible. It is this peculiarity of the bestfitted line which gives rise to the term, the method of least squares.
When can the method of least squares be used? The method of least squares can be used to fit a line to a set of points under any circumstances in which the fitted line has an equation which is a polynomial, or can be converted to such by some transformation. A polynomial is, of course, an equation of the form Y= a+bX+cX 2 +dX 3 +eX4 +..., 63
64
THE ESSENCE OF BIOMETRY
and for a straight line it assumes the form Y = a + bX,
where a is the intercept on the axis of X = 0, and b is the slope. If the line slopes up to the right, b is positive; negative if it slopes down. If the polynomial has more arbitrary constants than a and b, the line is curved, and the more constants one uses, the more complicated the curvature can be. The process of carrying out the method of least TABLE XXII.-Fictitious data for fitting squares evaluates the arbitrary constants so as the equation Y = a + bX by to give the best-fitting line. There is no point the method of least squares. in using more constants than are required; it X }Cale merely adds to the work. If one does use too Yobs Yo - Yo many constants, it will be found that they 0 .6089 +.1911 .80 assume very small, even vanishingly small 1 .8622 - .1622 .70 2 1.1155 -.2155 .90 values (showing that they are not needed). 1.3688 +.1312 3 1.50 One can think of the constants thus: a 1.6221 -.2221 4 1.40 moves the curve up or down on the page, b 1.8754 -1-.4246 5 2.30 2.1287 +.0713 6 2.20 determines the general slope, while c, d, and 7 2.3820 -.2820 2.10 so on deal with the curvature. 8 2.70 2.6353 x.0647 Let us begin by fitting a straight line to Total deviations in Y +.0011 the fictitious data of Table XXII, plotted as Fig. 6. We shall use the polynomial Y = a .0000 Theoretical total + bX, and must determine the proper values of a and b. In point of fact, we shall find that a = .6089 while b = +.2533, so that the desired equation is Y = .6090 + .2533X. The equation being Y = a + bX, we first substitute into this the pairs of values from Table XXII, obtaining 2.20= a +6b 1.50= a +3b .80=a+Ob 2.10 =a+7b 1.40 =a +4b .70 =a+lb 2.70= a +8b 2.30 = a+5b .90 =a+2b But here we have nine equations in only two unknowns and, except under special conditions, no pair of values for a and b exists which will exactly satisfy these equations. For example, if we determine a = .80 from the first equation, this will yield b = - .10 from the second equation, and b = +.05 from the third. Clearly then, we must reduce the nine equations to only two equations in the same two unknowns. These two equations, the normal equations, can be solved to obtain unique values of a and b. To find the normal equation of a, multiply each equation through by the coefficient of a in that equation, and add together all the resultant equations. Inasmuch as the coefficient of a is one, the normal equation of a is simply the sum of all the equations, and the normal equation of a is 14.60 = 9a + 36b.
THE METHOD OF LEAST SQUARES
65
Similarly, multiplying each equation through by the coefficient of b in that equation and adding, the normal equation of b is 73.60 = 36a + 204b. Solving a = .6090 and
b = .2533,
whence Y = .6090 + .2533X. From this equation, the calculated values in the third column of Table XXII have been computed. Had we carried more decimal places, the total deviation would have been zero. A check to see that this is so is an advisable test when fitting by the method of least squares. The fitted line is shown in Fig. 6. 3
•
•
•
•
3
4
•
•
0 0
2
5
6
7
8
Values of X, the independent variable Fig. 6.—The data of Table XXII, showing a fitted least-squares line.
A point should be noted in the normal equations. They have a sort of pattern in them. This can be seen if we write a larger set of normal equations such as 19.9 = 6a + 15b + 55c, 57.1 = 15a+ 55b + 225c, 206.9 = 55a + 225b + 979c. Note how the coefficient 55 steps down diagonally to the left from equation to equation, as do also 15 and 225. This pattern is a useful check on the equations. In actual practice, the operator does not write out all the set of original equations, but it is useful to write out one or two just to see if one has the operation set up properly. The experienced person gets the normal equations directly from the original data, thus 36 = sum of X; 9 = count of number of equations; 14.60 = sum of Yobs ; 73.6 = sum of successive values of 36=sum of X;
Yobs times
0, 1, 2, 3, etc.;
204=sum of X2.
THE ESSENCE OF BIOMETRY
66
Relation between a least-squares fitted line and the mean In a distribution, the deviations from the mean (X) total zero, and the sum of the squares of the deviations about the mean is less than about any other value. The same things are true of a least-squares fitted line. In fact, the least-squares fitted line can be thought of as a sort of sliding mean, moving through the swarm of points. It is the best representative of the swarm that one can obtain, just as the single mean is the best representative of a set of values. It is characteristic of the fitted line that it usually does not pass exactly through any data-point (though it may do so, by chance). The least-squares straight line with a predetermined intercept on the Y-axis Sometimes the nature of the experiment makes it clear that the value of Y when X = 0 must be such-and-such. That is to say, a is predetermined. Suppose that in the example of Table XXII, we must have a = .5. We then simply set the value a _ .5 in all the original equations, and find the normal equations in the usual way. There will then be only one normal equation because there is only one unknown (b). This equation will be, in this case, 73.6 = 4.5 + 204b, from which b = .3387, and the fitted equation becomes Y = .5+ .3387X. The least-squares line with a predetermined slope This implies that b is predetermined. We therefore insert the predetermined value of b in all the original equations, and proceed as before. If we set b = .2500, the fitted equation will turn out to be Y = .6222 + .2500X. A quicker method of fitting the straight line The method of least-squares, using normal equations, is clumsy, and the procedure can be much simplified; but this quicker method works only for straight lines. So far, we have had only one value of Y for each value of X. We carry out the quick method on this assumption, i.e. that the frequency (f) for each value of X is one. From Table XXII we calculate
Ef=9 fX = 36 X=36/9=4
E fY=14.60 Y = 14.60/9 = 1.6222 E fXY= 73.60
Z fX 2 = 204 It can then be shown that the slope of the required line (b) is b
E fX Z —X(EfY) 73.6 — 4(14.6) E fX — X(Q fX)
204 — 4(36)
— •2533.
THE METHOD OF LEAST SQUARES
67
The equation is then set up as Y = F + b(X - X) =1.6222 + .2533(X - 4), which reduces to Y = .6090 + .2533X as before. We shall see later that b is the regression coefficient b,,X, being the regression of Y on X. Now it might be that we have more than one observation on Y for each value of X. We might have made repeated determinations and have data as tabulated in Table XXIII. Here we see that we made eleven observations to find Y when X = 6, and obtained a mean value of Y = 2.20. Now the values off are not one but as shown in column 3 of Table XXIII. TABLE XXIII.-Fictitious data for fitting a straight line, with weights (frequencies) not all equal to one. A'
X2
f
Y
XY
0 1
0 I
1 3
.80 .70
0.00 0.70
2
4
2
.90
1.80
3
9
7
1.50
4.50
4 5 6 7 8
16 25 36 49 64
4 3 11 2 3
1.40 2.30 2.20 2.10 2.70
5.60 11.50 13.20 14.70 21.60
Ef = 36 EJX= 163 X = 163/36 = 4.5278 EfY = 64.2000 1 = 1.7833 EfXY = 333.50 EfX 2 = 899
If the table is arranged as shown, triple products, not obtainable on many calculating machines, are avoided. From Table XXIII, we have - 4.5278(64.2000) = .2660 b _ 333.500 899 - 4.5278(163) whence
Y _ Y+ b(X - X) = 1.7833 + .2660(X - 4.5278), Y = .5789 + .2660X
Fitting curved lines by the use of higher polynomials If it is desired to have the fitted line somewhat curved, one uses a polynomial of higher order such as Y = a + bX + cX2, or Y = a + bX + cX2 + dX 3, and so on. There will be one normal equation for each arbitrary constant, and each normal equation is found in the usual way. The resultant set of normal equations is then solved simultaneously to obtain the best values of a, b, c, etc. Note that the quick method discussed above is applicable only to the fitting of straight lines. 7
THE ESSENCE OF BIOMETRY
68
Fitting higher polynomials where frequencies or weights are involved In this operation, as the quick method cannot be used, one really sets up an observational equation for each observation. One does this by multiplying each observational equation by the appropriate frequency or weight. Thus, if we had wished to fit the polynomial Y = a + bX + cX 2 to the data of Table XXIII, the set of original equations would be 1(.80 = a + Ob+ Oc) =
.80= a
3(.70 =a+1b+1c)=
2.10 = 3a + 3b + 3c
2(.90 = a + 2b + 4c) =
1.80 =2a + 46 + 8c
7(1.50= a+3b+ 9c) =
10.50= 7a + 216 + 63c
and so on, down to 3(2.70 = a + 8b + 64c) =
8.10 = 3a + 24b + 192c.
These are then reduced to three normal equations, and the values of a, b, and c, determined. Fitting when the equation to be used is not initially a polynomial If the equation which it is desired to use is not initially a polynomial, it may be possible to transform it into one by some simple method. If necessary, it may be expanded into a series, such as Taylor's expansion, retaining only a few terms. Sometimes a logarithmic transformation will work. Consider for example a common equation of growth: Y = kX'" where k and m are the constants which must be determined. Taking logs to base 10 logo Y =log10k + m log10X. Setting k' = log10k, Y' = log10 Y, and X' = Iog10X, the equation becomes Y' = k' + mX'. This is a polynomial, immediately amenable to fitting by the quick method. One must of course remember to reverse the transformation from logs back to numbers, after fitting. Many other transformations are available, such as trigonometric types. These will be found in any text on the method of least-squares. Fitting sigmoid curves The sigmoid, or S-shaped curve, is very common in biological work, and it is useful to know how to fit it. Consider a population which grows at a rate k1, proportional to N, the number of
THE METHOD OF LEAST SQUARES
69
individuals in the population. If growth were uninhibited, the equation of population growth would be dN = N, dT k1 whence ek, T N= — C
'
where C is the constant of integration. In practice, this is unrealistic. The population would soon reach astronomical numbers. In actuality, resistance terms begin to operate in the form of starvation, crowding, increased death rates, and so on. These slow the growth of the population, and in time (in theory after an infinite time, in practice, after some time) the population numbers reach a steady state and N becomes asymptotic to a fixed value. The equation now becomes dN =k1N + k2N 2 (1) dT where k2 is the resistance term. Separating the variables and integrating Ck1 N= e-k'T + Ck
2
(2)
Dividing equation (2) by Ck 2 k1 N = -k,T k2 • e Ck2 + 1
(3)
This curve will have a sigmoid shape, with a point of inflexion. This point of inflexion is the only identifiable point on the curve. We therefore change the origin to this point by setting 1 /Ck2 = 1, and obtain A 1 + e-k'"
(4)
where A = k1/k2 is the asymptote and T' is time (T) measured from the new origin. This equation can be fitted by the method of least squares, using all the observations available, or, as we shall see, it can be fitted by a shorter method using only three selected observations. Using this quicker method, the curve can be fitted under two sets of conditions: i) Some experimental values of N and T are known, but A and the value of N when T = 0 are not known, ii) Some experimental values are known as above, but A is predetermined.
70
THE ESSENCE OF BIOMETRY
Fitting under case (i) We have T' = T — To, accounting for the change of origin. Setting Z = 1/N, equation (4) becomes A/N = AZ = 1 + e- x,(T-T°)
(5)
whence loge(AZ — 1) = —MT — To).
(6)
If equation (6) were plotted, and if we presume we know the values of A and To, we would find it to be a straight line, as in Fig. 8. We do not know the slope of this line a priori but call it k. We then select from the data three equally spaced observations. These will plot as points on the line. Call them pi, p2, and p3. They should be equally spaced on the time-scale, and they must be chosen so that, when plotted in relation to a sigmoid curve, the line pt —p2, when prolonged, passes above p3. The meaning of this will be clearer in the example to follow. Then, for these three points loge(AZ1 — 1) = k(T1 — To)
(7)
loge(AZ2 —1) = k(T2 — To)
(8)
loge(AZ3 — 1) = k(T3 — T0).
(9)
But, because the points are equally spaced on the time scale, (T1 — T2) = (T2 — T3), so that, assembling equations (7) to (9) in pairs loge(AZ1 — 1) — loge(AZ2 — 1) = k(T1 — T2)
(10)
loge(AZ2 — 1) — loge(AZ3 — 1) = k(T2 — T3)
(11)
AZ1 — 1 AZ2 — 1 and (AZ1 — 1)(AZ3 — 1) = (AZ 2 — 1)2 , AZ2 — 1 AZ3 - 1'
(12)
whence
and A2(Z I Z 3 —ZZ)—A(Z 1 — 2Z2 +Z3)=0.
This is possible only if .4 = 0, a trivial solution, or if A=
Z1 — 2Z2 + Z3 Z1Z3 — ZZ
(13)
Equation (13) provides a means of calculating a value of A in terms of the three selected observations. It is not of course the best possible value of A, but if pt, p2, and p3 are judiciously selected, it is a good approximate value. From equations (7) and (9) loge(AZ1 — 1) — loge(AZ3 — 1) = k(T1 — T3)
(14)
71
THE METHOD OF LEAST SQUARES
whence loge f
k-
(AZ, —
L(AZ3 — 1) 1)1
(15)
(T1 — T3)
From equation (9)
kT, — TO
loge(AZ3 — 1) +k
(16)
Example of fitting a sigmoid curve under case (i) 100
r'
0 p.2
50
l
25
0 10
0
15
20
25
30
35
40
45
50
55
60
T = time in days since beginning of experiment Fig. 7.—The raw data and fitted sigmoid curve from Tables XXIV and XXV. Note how the line pi p2, if produced, passes above p3; it must do this if the quick method of fitting is to be practicable. TABLE XXIV.—Growth of a population, to which a sigmoid curve is to be fitted. T
N
5 10 15 20 25 30 35 40 45 50 55 60
1 5 10 18 33 52 70 84 92 97 98 99
Selected points
PI
P2
p3
Assume that we have a population which has been counted every five days, and shows growth in numbers as in Table XXIV and Fig. 7. We first select three equally spaced observations, p1 at (T = 10, IV = 5), p2 at (T = 35, N = 70), and p3 at (T = 60, N = 99), and mark these on the roughly plotted sigmoid curve as in Fig. 7. We note that the line pt—p2 passes safely above p3. We do not select (T = 5, N = 1) as the first point, because the first observation may be somewhat vague as it can only be zero or one. It is essential to perform the check with the line pt P2 otherwise one may later encounter the impossible situation of having to find logarithms of negative numbers.
72
THE ESSENCE OF BIOMETRY
We now require logs to base e. These are of course available in tables but can be more easily calculated by finding first the log to base ten. This value is then multiplied by l/log10e = 2.3025851. For example, 1og1098.9609 = 1.995463. Loge98.9609 = 1.995463(2.3025851) = 4.594723. Where one must obtain the log to base e of a decimal quantity, the characteristic will be negative, thus log10 .922348 = 1.964896. This must first be transformed to a wholly negative quantity by subtracting the mantissa from the characteristic thus: I - .964896 = -.035104. Multiplying this by 2.3025851 yields loge .922348 = -.080830. Table XXV shows the calculations for fitting the curve. Actually only the values TABLE XXV.-Calculations for fitting a sigmoid curve to the data of Table XXIV. Column numbers for reference
1
2
3
4
5
6
T
N
Z = 1/N
loge(AZ - 1)
p
Neale
5
1
10 15 20 25 30 35 40 45 50 55 60
5 10 18 33 52 70 84 92 97 98 99
1.000000
.2000000 .1000000 .0555556 .0303030 .019231 .014286 .011905 .010869 .010309 .010204 .010101
+4.594664
+2.943964 +2.196732 +1.515807 + .707510 - .080958 - .848737 -1.661008 -2.448663 -3.490216 -3.915077 -4.641419
(pi)
(p2)
(ps)
2.406 5.000 10.101 19.349 33.871 52.229 70.046 83.266 91.377 95.745 97.938 99.000
dealing with pi , p2 , and p3 need be worked out, but the others are shown in order that the straight-line logarithmic transformation (equation (6)) may be plotted as in Fig. 8. From equation (13)
A - Z 1 - 2Z2 + Z3 - .200000 - 2(.014286) + .010101 - 99.954848. Z1Z3 - Z? (.200000)(.010101) - (.014286)2
(17)
From equation (15) 18.9921811
k
- loge [ .009705 J 60 - 10
From equation (16) 7,0
.151708.
= (- .151708)(60) + 4.641419 -.151708
= 29.405575.
(18)
(19)
THE METHOD OF LEAST SQUARES
> X X v
+5
•
+4 {-3
r-I
i-~
ö
73
+2 +1
E ö U
0 P- 2
—1
ö —2 —3 N
•
—4 —5 0
5
10
20
15
25
30
35
40
45
50
55
60
T = time in days since beginning of experiment Fig. 8.—The straight-line plot of log,(AZ-1) versus T, from equation (6) and Table XXV, showing the three selected fitting-points, p1, p2, and pa. Setting T' = (T
— To), equation (4) becomes _ A N 1 + e-k,(T-To) •
Substituting the values of
A,'k, and To where k = ( — k1), we have 99.954848 N =1
+ e-.151708(T-29.405575)
which reduces to N=
99.954848 1 + e4.461061 -.151708T
We can now check upon the suitability of the points p1, p2, and p3 which we selected for use in fitting. Knowing A, To, and k = ( — k1), we can calculate the values of column 4 of Table XXV, and so fit the straight line of equation (6) as shown in Fig. 8. It is seen that the line is a good fit to the points, with the exception of the point (T = 5, N = 1), which we did not select. A chi-squared test will show that P is greater than .99, so any small differences are due to chance. It should be noted, however, that this quick method is usable only if the raw data do lie quite closely on a sigmoid curve. If this is not the case, one will obtain different curves depending upon which trio of points is used in fitting. This method has been much used for curves of human populations, where the data lie smoothly on sigmoid curves. From equation (22), the values of column 6 of Table XXV have been calculated. The sigmoid curve is shown in Fig. 7. In this case chi-squared = 2.1134, and with 11 d.f., P again is greater than .99, showing an excellent fit.
74
THE ESSENCE OF BIOMETRY
Fitting the sigmoid curve under case (ii) This is the case in which we have values of T and N, but wish to fix the asymptote. We have a choice of procedure. We could fix the value of A, and then carry out the preceding method to obtain k and T0, but, in all probability, the three equations (7), (8), and (9) would be incompatible, with three equations and two unknowns (k and To). What we have done is to place another degree of restraint upon the problem. Now, not only must the curve be in the right place on the page and have the proper curvatures, but it must also have the proper asymptote. We must therefore put in another arbitrary constant to take care of this, and fit an equation of the form A `23) N + ek'+k2T'+k3T'2* 1 Re-arranging equation (23), and taking logs to base e, we have k1 + k2T + k3T 2 = loge(A — N)/N. If now, as an example, we fix A = 100 and use again the same selected points, pl , p2, and p3, we have (24) k1 + 10k2 + 100k3 = loge(100 — 5)/5 (25) k1 + 35k2 + 1225k3 = loge(100 — 70)/70 (26) 2 kl + 60k + 3600k3 = loge(100 — 99)/99. These three equations in three unknowns are compatible and are simply solved as three ordinary simultaneous equations to obtain the values of the k's. These values are substituted back in equation (23), and the fitting is finished. Fitting the sigmoid curve when both the asymptote, and the value of N when T = 0, are fixed In this case, we have added still another restraint. We could perhaps cope with the situation by adding still another arbitrary constant, using an equation of the form
N=
A ek'+k2T+k3T2+k4T37 1+
but it would be found that k4 is almost vanishingly small, and difficult to handle. If we attempt to set up equations of the forms (24) to (26), we would have three equations in two unknowns because A is known and also k1. The latter is known because we have fixed the value of N when T = 0. We are thus forced to use the method of least-squares. The procedure is as follows. Let us fix the value of A at A = 100, and the value of N when T = 0 at Art. = 1. The equation we are going to use is (23). Substituting the above values into it, we have 100 1 1 + ek' whence eki = 99 and k1 = loge(99) = 4.5951.
THE METHOD OF LEAST SQUARES
75
We have then a basic equation of the form of (24) 4.5951 + Tk2 + T2 k3 =1oge[(100 — N)/(N)] into which we substitute all the pairs of values of N and T from Table XXV. There will be twelve such equations, beginning 4.5951 + 5k2 + 25k3 = loge[(100 — 1)/(1)], 4.5951 + 10k2 + 100k3 = loge[(100 — 5)/(5)], 4.5951 + 15k2 + 225k3 = loge[(100 — 10)/(10)], 4.5951 + 20k2 + 400k3 =1oge[(100 — 18)/(18)], 4.5951 + 25k2 + 625k3 =1oge[(100 — 33)/(33)], and so on, down to 4.5951 + 60k2 + 3600k3 = loge[(100 — 99)/(99)]. The values of the right-hand sides are determined and combined with the fixed value k1 = 4.5951. We have then twelve ordinary polynomial equations which can be solved by the method of least-squares to obtain the best values of k2 and k3. Other possible methods of fitting Sigmoid curves have been fitted by calculating the successive increments of the ordinates. In Table XXV these would be 1, 4, 5, 8, 15, 19, 18, 14, 8, 5, 2, 1. These will plot as something approaching a normal frequency distribution. This is then fitted as a normal, binomial, or Poisson, whichever fits best as determined by a chisquared test. Calculated ordinates are thus obtained. These when summed successively give calculated ordinates for the sigmoid curve. However, no equation for the sigmoid curve is obtained.
VIII REGRESSION UNDER THE HEADING of the method of least squares, we have studied the problem of fitting straight Iines where we have either a single value of Y for each value of X, or have only a few duplicate values of Y for each value of X. It is quite common to find in research that for each value of X there are perhaps numerous values of Y, arranged in the form of a normal frequency distribution of Y values. This would be the case, for example, if we measured the heights of children of various ages. For each age we would find a distribution of normally distributed heights. Per contra, arranging the data in another way, for each height, we would find a normal distribution of ages. Each of these distributions has a mean, a variance, a standard deviation and a standard error of the mean. A set of Y-values belonging to a given X-value is called an X-array of Y's, or more briefly, an X-array of Y. Similarly, a set of X-values appropriate to a given Y-value is called a Y-array of X. In regression then, we come to the problem of fitting a straight line (or perhaps even a curve) through a set of values which consists of a number of X-arrays of Y (or Y-arrays of X). If the appropriate line is straight, we have linear regression. If it is curved, we have curvilinear regression. The latter is a more difficult matter, and is not discussed in this book. Let us consider then the problem of the relation of height to age in schoolchildren. In such a study, it seems more reasonable to consider age as the independent variable (X). We go therefore to a school and select groups of children of various ages, five, six, seven, eight years old, and so on. We then measure all the children and tabulate the results as heights for the various ages. We shall find, as has been said, that for any age the heights are reasonably normally distributed. We can then, by methods which are discussed below, fit a straight line to express the average relationship between height and age. This line will have a slope, and this slope is bha, the regression coefficient of height on age. If we had tabulated the data in terms of the ages for various heights, we would obtain another straight line, the regression line of age on height, and another regression coefficient, bah, not necessarily equal to bha. It can be shown that the two will be different unless they are both equal to one, in which case the slope of both lines is 45 degrees. We shall assume in these fittings that the variances of all the Y-arrays of X (and the X-arrays of Y) are equal, and that
76
REGRESSION
77
the arrays are normal distributions. These assumptions may not be quite correct, but unless they are made, the mathematics of regression becomes quite intractable. The two variables, Y = H = height, and X = age = A, bear very different relations to the regression line. Taking X = age as the independent variable, small errors in the measurements of Y = height will not greatly affect the regression line, provided the errors are random and not systematic. They will be, as it were, swallowed up in the variances of the X-arrays of Y. They will not significantly change the mean value of Y for any X. On the other hand, any error in noting down the age for a given group will seriously affect the regression line. It will cause a whole X-array of Yto be incorrectly tabulated. Furthermore, grouping X into categories may seriously affect the regression line. Provided that no silly errors have been made in the tabulation of the data, the regression line does not depend upon the particular values of X which are chosen. One may measure children of five, six, seven, eight years and so on, or children of five-and-ahalf, six-and-a-half, seven-and-a-half, and so on. Some ages may be missed out, and others measured more intensively. However, within a group of children of a given age, it is essential that there be no bias. One must not concentrate upon tall children nine years old, and neglect the shorter ones. In many cases, both regression lines have some meaning. The regression of height on age is obviously meaningful; that of age on height is less so. In other cases, one of the regression lines may have no real meaning. For example, there exists a regression of income on age, because one usually has a greater income when older. There must be a regression of age on income, but it is largely meaningless, even though one can calculate it. Where we have a situation (as under the discussion of the method of least squares) in which there is a unique value of Y for each value of X, there exists a tight functional relationship between Y and X. Knowing X, we can calculate an exact value of Y. In regression, however, the situation is less exact. For each value of X we have not a single value of Y, but an array of Y's. Knowing X we obtain a mean value of Y for that X. This looser relationship is measured by the correlation coefficient, ryx, a number varying between —1 and + I. If Y increases as X increases, r, lies between 0 and + 1, while if Y decreases as X increases, ryx lies between 0 and —1. If Y is unrelated to X, ryx = 0. We shall see later that ryx is the harmonic mean of the two regression coefficients, i.e. ryx = J(byx)(bxy)• However, no one would calculate ry, in this way unless it so happened that the values of by, and bxy were known. The equation of the regression line When regression is linear, the regression line takes the form Y = a + byx(X)
(1)
78
THE ESSENCE OF BIOMETRY
and this is the form one obtains when fitting by the method of least squares. In regression work, however, it is customary to write the equation in the form Y = 17+ b(X — X).
(2)
It is easy to see that the two forms are the same. Multiplying out the bracket of equation (2), we have Y = t` + b,..r(X) — ,:,(X)
(3)
Y = [V — b,x(X)] + b,,X(X)•
(4)
whence Setting [ — byx(X)] = a, equation (4) is the same as equation (1). The regression line as the least-squares line fitted to the means of the X-arrays of Y
.—Ir
Let us assume that each X-array of Y (in some set of data) has a fixed frequency distribution of Y-values. We could symbolize this distribution somewhat crudely as TABLE XXVI.—The data of Fig. 9 and Fig. 10, tabulated as the individual frequencies a little histogram. For reasons which will be of values of Y versus X. seen later, we plot this with the abscissae Y XY X2 X vertically, and the histogram is to be I imagined as standing out of the page towards 1 1 1 the reader. It would appear then as 1 1 2
*
—mv
*** ***
m--
1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
*
3 4 5 6 8 10 12 14 15 18 21 24 27 28 32 36 40 44
1 1 1 4 4 4 4 4 9 9 9 9 9 16 16 16 16 16
E f(X) = 130 X = 2.50
EJ(Y) = 312 Y =6.00
>f(XY) = 910
m ----Mh en - ----M V1 en -ti
= 52
Y1
We can now arrange a number of these plots in relation to a set of values of X, to form a pattern which we shall see later is a correlation table (see Chapter IX). In Fig. 9 some histograms are so arranged, and in Fig. 10 they are reduced to numbers or frequencies. In Fig. 11 each X-array of Y has been further reduced to its mean value V. The data of Fig. 9 are shown in Table XXVI, and those of Fig. 11 in Table XXVII. Using the quick method for fitting by least-squares, we find that for both Table XXVI and Table XXVII (see page 81),
E f(X 2) = 390,
REGRESSION
79
whence
b ,~ '
=>f(XY) —X Ef(Y) -910—(2.5)(312)-2.000
Ef(X) — X Ef(X)
390 — (2.5)(130)
and Y= Y+ byx(X — X), or Y = 6.00 + (2.0)(X — 2.5), or Y= 1 + 2X.
6 >, 5 4 a) 3
~--2 4 3 5 01 Values of X, the independent variable Fig. 9.—An isometric three-dimensional projection of the data of Table XXVI, showing the histograms of the X-arrays of Y. The sloping line is the fitted regression line of Yon X, namely, Y = 1 + 2X. It is seen, then, that the regression line through a set of X-arrays of Y is the leastsquares fitted straight line through the means of the X-arrays of Y. That is to say, it is fitted to the values of Y. We shall see later, under the discussion of correlation in Chapter IX, that a correlation table can be set up, and all the relevant statistics—the means, standard deviations, standard errors, byx and bxy—calculated en masse. At this point therefore, reference might well be made to Chapter IX. Significance tests relating to byx, 7, etc. Having obtained a regression coefficient and the regression equation, there are a number of questions one might ask about significance, namely:
80
THE ESSENCE OF BIOMETRY
1. Does bvx differ significantly from zero? That is to say, is there any regression? 2. Does byx differ significantly from some other fixed value which is not itself a regression coefficient, but merely a number having no variance? 3. Does bvx differ significantly from some other regression coefficient, say b' yx? 4. What are the fiducial limits about by,, the regression coefficient of the universe from which the sample was drawn? 5. What are the fiducial limits about Y„, the mean value of Yin the universe from which the sample was drawn ? 12 11
Values of Y, the dependent variable
10 9 8 7 6 5 4 3 2 1 0 ' 0
2
1
Values of
3
4
5
X, the independent variable
Fig. 10.—The data of Fig. 9, re-drawn in more conventional form, as a correlation diagram.
6. What are similar fiducial limits about some other value of YC,,,c,,, this being a calculated value of Yin the universe, other than Y„? Unfortunately, one cannot perform a chi-squared test to see if the regression line is a good fit to the data because the ordinates of the line are not probability-generated frequencies, as they must be if chi-squared is to be used. To carry out tests as above we shall require a standard error of byx and also a standard error of the difference between brx and another regression coefficient. The
REGRESSION
81
Value of means of the X-arrays of Y, the dependent variable
calculation of the standard error of byx requires the prior calculation of a quantity called the standard error of estimate, which we shall symbolize as SEE. This quantity is really the standard deviation of the deviations of the calculated values of Y from the observed values. The process of calculation is really the same as that required for an ordinary standard deviation except that, instead of taking deviations from the mean Y, we take them individually from the observed values of Yin the form ( Ycalc frequency of may 9 = 13 Yobs). Just as in the case of an ordinary standard deviation, these are squared and 8 multiplied by the frequency of occurrence. Then the summation is divided by the / frequency number of degrees of freedom. However, 7 of array — 13
TABLE XXVII.—The
data of Fig. 11, tabulated as the means of the X-arrays of Y. (See page 77.)
6 / frequency or emy = 11 a
5 4 frequency of array 13
3
1 0 1
Y
f
XY
X2
1 2 3 4
3 5 7 9
13 13 13 13
3 10 21 36
1 4 9 16
inasmuch as we have used Y and by., in calculating the values of Ya le (because these are obtained from the regression equation) we have introduced two restrictions and lost two degrees of freedom. We therefore divide by E f — 2) usually called (N — 2). The formula is then
2
0
X
2
3
4
Values of X, the independent variable
SEE' —
f ( Y` 17o)2. N- 2
It is one of the more laborious calculations in the field of statistical analysis. In calculating the standard error of byx from the standard error of estimate (SEE), we must take into account the variances of both Y and X. We have already taken into account the variance of Y in forming the numerator of the above formula. We can take in the variance of X by dividing SEE' by E f (X — X)', and so SEE' SEbyx= ~ f(X — X)2. Fig. 11.—The data of Table XXVII, derived from Table XXVI, in which each X-array of Y has been reduced to its mean value. Each mean value is then considered to have the total frequency of its X-array, i.e. f _ 13. (See page 77.)
82
THE ESSENCE OF BIOMETRY
A rigorous derivation of the above formula is somewhat difficult, and the above is merely a plausible account. Those interested in something better should consult Fisher's Statistical Methods for Research Workers.' As stated, SEE is very laborious to calculate, and SEby,, even more so, but if the correlation coefficient, ryx, is available, a simpler formula can be used N
SEE = SDy\/I — ryx
N
If N is large, so that N has almost the same value as N — 2, this reduces to SEE = SDy\/1 —
r3 .
If (N — 1) degrees of freedom have been used to calculate SDy, as would be the case if one had only a small sample, the formula becomes
The quantity SDy is simply the ordinary standard deviation of Y Y)2 — I SD,,= Ef(YN , calculated of course from the original observed values of Y, not from YYa IC. Example of the calculation of regression coefficients and significance tests
—
N-1 2. N—
ryx
0
6
>< li
5
grammes a t age
SEE = SDS,\/1
4 3 2
....
U. 111-1 . 1111 MIME, WAIN=
Consider the data of Table XXVIII = 1 and Fig. 12, representing a fictitious study å, of the relation of the weight in grammes 0 4 6 2 3 5 0 1 (Y) to age in months (X) of a small animal. Clearly the weight (column 2) II X = age in months does increase as the animals age, but 12.—The raw data of Table XXVIII, not completely regularly. The four-month Fig. showing the fitted regression line (solid animals are not as heavy as one would line); the two fiducial limit lines for slope (dotted lines) and the fiducial limits for Ye expect, while the three-month animals (curved solid lines). The fit of the regression seem rather too heavy. From column 3 line is poor, due to the aberrant value at it can be seen that we have one animal X = 3. (See page 87.) weighing 2 gm. at two months, and three weighing 5 gm. at three months. Columns 4 to 10 are shown to facilitate calculations on a machine which does not have iterated multiplication or automatic squaring, as do the more modern machines. 5. R. A. Fisher, Statistical Methods for Research Workers (New York, Hafner, 1958).
REGRESSION TABLE
83
XXVIII.-Fictitious data relating to the regression coefficient. Column numbers for reference
1
2
3
4
5
6
X
Y
f
X2
XY
Y2
1
1 2 5 3 4
2 1 3 2 2
I 4 9 16 25
I 4 15 12 20
1
2 3 4 5
4 25 9 16
7
9
8
10
(Y - F) (Y-Y)2 (X - A') (X -X)2 -2.3 -1.3 +1.7 - .3 + .7
5.29 1.69 2.89 .09 .49
-2.1 -1.1 - .1 + .9 +1.9
4.41 1.21 .01 .81 3.61
From Table XXVIII, using the quick method of least squares
Ef= 10
EfY=33
E fX= 31 X = 3.1
Y=3.3
E fX2 = 115,
E fY 2 = 131 E fXY= 115
E f(X-X)2 = 18.90 Ef (Y - Y)2 = 22.10
SDx = x/18.90/10 = 1.3748, SD, = x/22.10/10 = 1.4866,
whence
E fXY-X(EfY) = 115 -(3.1)(33)-.6720. E fX2 - X(> fX) 115 - (3.1)(31)
byx -
Interchanging X and Y bxy
_ E fXY - Y(EfX) - 115 - (3.3)(31) - .5747. E f Y2 - Y(E f Y) 131 - (3.3)(33)
The equation of the regression line of Y on X is Y = 7+ byx(X-X)=3.3+.6720(X-3.1) or Y = 1.2168+.6720X. The regression line of X on Y is X = X + bxy(Y - 7) = 3.1 + .5747(Y - 3.3) or X = 1.2035 + .5747Y. As we shall see later in Chapter IX, the correlation coefficient is ryx = -V(byx)(bxy) = J(.6720)(.5747) = .6214. Calculation of the standard error of estimate SEE, by the long method Table XXIX shows the calculations necessary for this. The values of Ycaic are determined from the regression equation of Y on X. Note the large value of (Yob, Kalo) at X = 3. This is so large that it will seriously affect the value of SEby,,. The degrees of freedom to be used are, as stated above, (Ef - 2) = 10 - 2 = 8. As a check on the fitting, note that the summation of column 5 times column 3 is zero. This is an excellent check on a least-squares fit. 7
84
THE ESSENCE OF BIOMETRY XXIX.—Showing the calculations for SEE.,„ by the long method.
TABLE
X
Yo
f
Ye
(Yo — Ye)
(Yo — Ye)2
1 2 3 4 5
1 2 5 3 4
2 1 3 2 2
1.8888 2.5608 3.2328 3.9048 4.5768
— .8888 — .5608 +1.7672 — .9048 — .5768
.78997 .31450 3.12300 .81866 .33270
From the data of Table XXIX, we have j
2Y`)Z \I13.56616 1.30221. =
SEEbYx = J Ef
Calculation of SEEbYx by the short method, using r yx Knowing that N = 10, ryx = .6214, and SD,, = 1.4866, SEEbYx = SDy J(1 — /Ix) N N
= 1.4866x/(1 — .62142)(10/8) =1.30221.
The standard error of byx This follows immediately as SEbyx =
SEEbYx _ 1.30221 V18.90 — ±.2995. f(X — X)2
Correction for small samples The above has been carried out in terms of a large sample, with more than, say, thirty degrees of freedom, but actually we have only 8 d.f. A correction can be made by multiplying SEb by (N — 1)/N. This factor would be 9/10 or .900, and SEb,„ then becomes ±.26955. Does byx differ significantly from zero? If byx does not differ significantly from zero, there is no regression. That is to say, there is no relationship such that Y increases (or decreases) as X increases. Once byx is known, this is the first test one should carry out. It is done by a T test, and T = byx SEbyx We have then T=
.6720
.2995
— 2.24.
REGRESSION
85
From a table of T (see Fisher and Yates, Statistical Tables or Table XII of this book) with 8 d.f., by interpolation, P = .0570. This is only slightly larger than .05 so that it is doubtful if byx differs significantly from zero. Does byx differ significantly from some other fixed value not zero? This test applies to some other fixed value (not zero), the value being one that is exactly determinable, i.e. that has no variance. Under these conditions T - byx SE
—G ' byx
where G is the fixed value for comparison. Choosing a fixed value G = .6000 , — .6720 — .6000 = .2404. .2995
T
From a T table, with 8 d.f. P is greater than .05, so that byx does not differ significantly from an exact value G = .6000. Does byx differ significantly from another regression coefficient, say b' yx? The problem becomes more complicated in this case because we are comparing byx with another quantity which has itself a variance (SDb yx)2. We must therefore calculate the standard error of the difference between two regression coefficients. This is one of the most laborious calculations in the whole of elementary statistics. Let us set up two sets of data, using j and k to identify them, so that we do not become confused with reiterated use of the symbol b. Let N;and Nk = the number of observations in J and K,
E fDy, = the sum of the frequencies times the deviations of Y from Yin K, E fDyk = the same sum for K, E fDx! = the sum of the frequencies times the deviations of X from I in J, E fDxk = the same sum for K, ryxj = the correlation coefficient between Y and X in J, ryxk = the same for K, SD„, SDyk, SDxj and SDxk = the four standard deviations for Y and X in J and K. Then SEaur -
1
i
JNSD,(1 — ry2x.,) + NkSDyk(1 — ryk) NJ + Nk — 4 NJSD,2J + NkSDX •
[
J
86
THE ESSENCE OF BIOMETRY
Other formulae have been proposed that are mathematically identical with the above. The formula quoted here cannot be simplified any further; fortunately, it is quite straight forward. To save space, let us suppose that its value is .0135 and that byx1 = .8500, while byxk = .6420. Is the difference significant?
T = diff/SEa~.rr =
.8500 .6420 -.2080/.0135 = 15.41. ,0135
If Ni = 22 and Nk = say 44, we enter the T table with 22 + 44 — 4 = 62 d.f., and P is less than .001. That is to say, the difference could scarcely have arisen by chance, and byx differs significantly from byxk• Fiducial limits about byx., the universe regression coefficient
From the data of Table XXVIII, we made an approximation to the true regression coefficient of the universe, in the form of byx = .6720, calculating this from a sample in the form of the ten observations of Table XXVIII. We now wish to establish fiducial limits about the true regression coefficient, with a degree of certainty of P = .95. These limits will be upper limit _ byx + T os(SEby), lower limit = byx — T 05(SEbYx), where T 05 is the value in a T table under P = .05, with (in this case) 8 d.f. This value will be found to be T = 2.306. The limits are then, for 95 per cent certainty, upper limits = .6720 + (2.306)(.2995) = 1.3626, lower limit = .6720 — (2.306)(.2995) = —.0186. The spread of these limits is very wide, and we have a very poor approximation to the true regression coefficient of the universe. This is due partly to having too small a sample and partly to the large discrepancy at X = 3. Had we wanted 99 per cent limits, we would have used T01 = 3.355 in the formula, and for fifty per cent limits we would have used T.50 = .706. It will be noted that the higher the level of confidence the wider the limits. If one insists upon one hundred per cent confidence (P = 1), the limits are — oo and + co, quite worthless of course. Does Yobs differ significantly from Yu?
One might think that SEy would be calculated by dividing SDy by the square root of N in the usual way, but this is not good enough in this case. We are dealing, not with a single array of Y's clustered around the mean of that array, but with a whole series of X-arrays of Y, each belonging to a separate value of X. This brings the standard error of estimate (SEE) into the formula and
SE- =
SEE
= (i n this example)
1.~ 1 = .4118. 10
REGRESSION
87
The fiducial limits about Y. are then upper limit = Y + T 05(SE) = 3.3 + 2.306(.4118) = 4.2496, lower limit = Y — T05(SE0 = 3.3 — 2.306(.4118) = 2.3504. These are wide limits, and we have a poor approximation to K. Y itself is 3.3/.4118 = 8.01 times its own standard error. It is thus a valid mean in terms of the sample from which it was drawn, but a poor representative of the universe mean. Note that Yobs has the same value as Y.,,~, but neither one is equal to E.
Fiducial limits about some value of Y. other than Yu Inasmuch as any YY = Y + byx(X — 1, its standard error (which is not the same as that of Ye) must be built up taking into account not only SEy but also SEby,. It must also take into account the position of the particular Y. in terms of its particular X-value. The derivation of the requisite formula is complicated and cannot be given here, but we have SEE2 + SEE 2(X — X)2 1 + (X — X)2 SE SE= _ SEE2 f N f(X — X)2 LN Ef(X — 1)2J , whence
1 (X — SE,,, _ SEE JIN+
X)2
Ef(X— X)2
X in the above formula is the X-value for the particular Y chosen for study. To establish fiducial limits (95 per cent certainty) about any given Yca,c , we proceed as before, and upper limit = Y + T05(SEy~), lower limit = Y, — T05(SE3,c). By calculating these limits for each value of YYs1 (appropriate to the various values of X), we establish two curved lines limiting a 95 per cent confidence zone about the regression line. The precision of Y, is always greatest near the mean Y, and decreases as the value of Y, diverges from Y Thus this confidence zone is narrowest near Y and widest at the ends of the regression line. The relevant calculations are shown in Table XXX, and the zone is plotted in Fig. 12 (see page 82). This figure also shows the two lines determined by the limiting value of byx. It should be very clear to the reader that this whole example gives an extremely poor approximation to the true situation, which exists (presumably) in the universe from which this sample (from Table XXVIII) was drawn. This example was purposely made so that a poor result would be obtained, in order to drive home the fallacy of reporting unanalysed results. Cursory inspection of Table XXVIII might lead one to feel intuitively that a presumption of increase in
00 00
XXX.-Calculations for establishing the 95 per cent confidence zone about the regression line of Y on X for the data of Table XXVIII. Column numbers for reference
1
2
3
4 (X - ,Q)2
X
(X - 1) (X -- R)2 Z f(X __ ß)2
1
-2.1
2 3 4 5
-1.1 -.1 +.9 +1.9
Note. I =
4.41 1.21 .01 .81 3.61 3.1; f (X -
.23333 .06402 .00053 .04286 .19100 X)2
5 1 (X - X )2 N + zf(X - X)2
.33333 .16402 .10053 .14286 .29100
6 Square root of col. 5
.5773 .4050 .3171 .3780 .5394
7
8
SEE
Yeate
times col. = col. 4, 6 = SE,Ica,,, Table XXIX
.7518 .5274 .4129 .4922 .7024
= 18.90; SEE = 1.30221; N = 10; T.05 at 8 d.f. = 2.306,
1.8888 2.5608 3.2328 3.9048 4.5768
9
10
11
T.05
Upper
Lower
times SEy,
limit
limit
1.7337 1.2162 .9521 1.1350 1.6197
3.6225 3.7770 4.1849 5.0398 6.1965
.1551 1.3446 2.2807 2.7698 2.9571
n from Table XXIX.
AlliaNIOIII30aDHaSSa aH.L
TABLE
REGRESSION
89
weight with age is confirmed. As we have seen however, the data do not support this at all, and anyone who so reported would be making a wholly unjustified statement. Possibly this kind of animal does increase in weight as it ages, but our data do not prove it, and it is no use pretending that they do. Non-linear regression On the basis of the discussion of the method of least-squares in Chapter VII, we can fit many kinds of least-squares lines. These are the regression lines if the relationship between Y and X is not a straight line. Unfortunately there exists no valid criterion of linearity of regression. Blakeman's criterion of linearity of regression is now known to be invalid. If the swarm of points is curved, there may still exist firm relationships between Y and X, but an ordinary ryz may come out as almost zero. Obviously, all significance tests are much more complicated if the regression line is not straight, and discussion concerning them is out of place in this elementary book. For those who wish to explore the matter further, there is a discussion in Chapter XVI of Steele and Torrie's Principles and Procedures of Statistics. In some cases, where the swarm is rather V-shaped, one can fit a straight regression line to each arm of the V. One can always fit a curved regression line by the method of least-squares, and this will permit the calculation of values of Yc,lc. This may be all that is needed.
IX CORRELATION CONSIDER a set of points on a plot, these being the plots of pairs of values of Y and X. Suppose that the points all fall exactly on a straight line which slopes up to the right at 45 degrees. In such a situation, there are no deviations between Yca,c and Y,bs, SEEb,,,y = 0, and b,x = b = 1. That is to say, the dependent variable is so related to the independent variable that, for all values of X, Y equals X. There is thus a close one-to-one relationship, and ryx = +1. If the slope of the line is not exactly 45 degrees an increment of X will produce an increment of Y either greater or less than that of X (depending upon whether byx is greater or Iess than 1). In either case ryx will be less than +1. In the extreme case, where Y is a constant for all values of X, the regression line is horizontal, by„ = 0, bxy = oo and ryx = 0. At the opposite extreme, X is a constant, and the values of Y form a vertical line at the appropriate abscissa. In this case, by„ = co, bxy = 0, and again ryx = 0. Now the regression line of Y on X minimizes the squares of the deviations (Y Ya), while the regression line of X on Y minimizes the squares of the deviations (X. Xo). These two sets of deviations can have the same value only if the slope of the regression line is 45 degrees, when r,x = +1 and byx = bxy = 1. This is why, in virtually all cases, there are two regression lines, two regression coefficients, and byx is not equal to bxy. If the plotted points do not lie exactly on a straight line, the relationship between Y and Xis less close and exact. The extent of the closeness is measured by the correlation coefficient, ryx, and its standard error. In practice, the swarm of points never lies exactly on a straight line, even with linear regression. Its plot looks like the shadow of a swarm of gnats projected onto is the page, as in Fig. 13a, 13b, and 13c. If the swarm is circular, as in Fig. 13b, zero. If it slopes up to the right as in Fig. 13a, ryx lies between 0 and +1, while if it slopes down to the right, as in Fig. 13c, ryx lies between —1 and 0. In obtaining data for a correlation, one mar relate X to some scale and then obtain values of Y for each value of X. In practice, however, Y and X are usually random values which pair off together. The data are then put in classes for both Yand X to reduce the size of the tabulation. Thus, for each value of X there is a normally distributed X-array of Y, and for each value of Y there is a normally distributed Y-array of X. Every array of both Y and X distributes about its own mean value. If we lump the values of Y and of X into categories, for any given category of X, ryx
90
CORRELATION
91
c
»:: • • • ' • •:•1 • ~•..• •~~ • • •
•
I
•
•
• •
• ..••s::~ ::: • • •.•.;
•'::..... ..•••••• • • • •
••
•
,, •
• ,
.
. Fig. 13.—Swarms of points, plotted for pairs of observations of Y and X, as correlation diagrams. Under a, 0 < bys < oo and < +1. Under b, bys has no determinable value 0 < rys — and rys = 0. Under c, 0 > bys > —00 and 0 > rys > —1.
then, there will be a frequency with which a given category of Y is associated with that value of X, and vice versa. Thus if we choose X as X = 15, say, there could be 71 cases for which Y = 22 when X = 15. This value 71 is the frequency. It follows that, if we were to plot Y and X as the usual co-ordinates on paper and then erect the frequencies vertically to the paper, the correlation table would take on the appearance of a forest of frequency spikes (see Fig. 9). If there were enough of these, they would form a dome-shaped surface (really a three-dimensional normal frequency surface) shaped something like the side of a Rugby football. The practical calculation of ryx, or the product-moment correlation coefficient, is a somewhat laborious process subject to large chances of error. We shall therefore approach it by easy stages using, initially, a method which no one TABLE XXXI.-A partial reprint of Table XXVIII. would actually use in practice. This X Y f XY XY2 simple method will, however, make matters clear. Let us use the data of 2 1 1 1 1 1 Table XXVIII, in which, for every 2 2 1 4 4 4 3 3 5 15 25 9 value of X, the associated values of 3 2 9 4 16 12 Y are all the same, and vice versa. 2 20 5 4 25 16 For convenience, Table XXVIII is re-printed here as Table XXXI.
Ef = 10 EfX= 31 X = 3.1 by„
fY = 33 Y= 3.3
E fX2 =115 ~f XY=115 >fY2 =131
—
>fXX Y— ED". = .6720,
EfX — X >fX bxy — E fX Y — E fX fY — Y EfY
-
.5747,
THE ESSENCE OF BIOMETRY
92
whence ryx = V(byx)(bxy) = .6214. This method is not, however, feasible if there are many observations in the table, with many frequencies. If there are, it is difficult to keep track of the calculations; mistakes creep in, and one does not obtain all the statistics one may require. Let us now demonstrate a simple example of the product-moment method. The correlation coefficient by the product-moment method Table XXXII shows the data of Table XXXI arranged as a correlation table with the frequencies in the cells, together with the deviations from the means X = 3.1 and Y = 3.3. In each cell the uppermost number is the frequency, the middle number is (X — X), and the bottom number is (Y — Y). These deviations are deviations from the true means, as stated. As is customary in correlation work, the values of Y are arranged in descending order. From Table XXXII, we may now calculate SDx = _f LfDx — V18.90/10 =1.3748, \ \f f f N SDy = .J EfDy - V22.10/10 = 1.4866, VVV N and, using a formula we have not seen before,
'' fDxDy
12.70
— — = .6214, ' (E fDX)(E fD y) V(18.90)(22.10)
r .x
and
[
SD,' (.6214)(1.4866) _ _ .6720, b b' — yx 4[SDx] 1.37 8 _
SDx _ (.6214)(1.3748)
b'— r,, SD,]
1.4866
.5747,
Y = 3.3 + .6720(X — 3.1), X = 3.1 + .5747(Y — 3.3). This kind of tabulation is satisfactory if not too many cells are filled, and if the means are such that the deviations from them are not too awkward to handle. With large masses of data, exactly the same problem arises as was seen in the case of the calculation of the mean and the standard deviation, and it can be dealt with in the same way, i.e. by the use of assumed means and a group deviation procedure. Table XXXIII shows as large a table as the page will permit. The added observational data are in italic type. The remainder of the table, in bold-faced type, is
TABLE
XXXII.-The data of Table XXVIII arranged as a correlation table. X
I
I
2
3
I
(X -
Y
(Y- 7)
5
-4-1.7
-2.1
-1.1
1=
Dx = Dy
4
+.7
3
-.3
2
-1.3
1
- 2.3
EI= Dx = D2x
=
DsDy
=
=
4
I
5
X)
-.1
+.9
+1.9
3 -.1
-1-1.7 2 +1.9 +.7 2 +.9 -.3
I -1.1 -1.3 2 -2.1 -2.3 2
1
3
2
2
-2.1
-1.1
- .1
+ .9
+1.9
4.41
1.21
.01
.81
3.61
+4.83
-4-1.43
-.17
-.27
+1.33
Ef
Dy
D222y
DxDy
3
-1-1.7
2.89
-.17
2
-1-.7
.49
+1.33
2
-.3
.09
-.27
I
-1.3
1.69
+1.43
2
-2.3
5.29
+4.83
N=
EfD1i =
10
E fDx = 0 £. fD2x = 18 90 = + 12.70
E fDxDy
0
NOI1d731111O0
1
E fD2y = E fDxDy = 22.10 +12.70
Check to see that 2'fDy = E fDx=O; E fDxDy is same value in row and column, as N must also be. '.o w
94
THE ESSENCE OF BIOMETRY
presumed to be prefabricated. Such tables can be made up, ready-mimeographed, and bound in pads. For those who might wish to extend Table XXXIII for their own purposes, it should be noted that each row of values of Dx Dy is extended by increasing the values at each end, by successive additions of the value adjacent to the 0 column. Thus, the uppermost row to the right, which reads 0, + 3, + 6, + 9, extends as + 12, +15, + 18, adding on +3 each time. A row above this would be 0, +4, +8, +12, +16, +20, and so on. With the information from Table XXXIII, we are now in a position to compute all the relevant statistics as follows: N= 42,
Zx =6,
Zy =
~ fDx = —1,
fDz = + 57,
9,
class interval for X =Cx =2,
E fDy= +5,
and Cy =3.
> JDXDy = +33.
E fD2y _ + 57 (this identity is accidental in this example).
For subsequent calculations, we shall need some quantities we may call Jz, Jy, Jx, and Jy, as follows: Jz = E JDx JN = —1/42 = —.023809, (4)2 = +.000567,
Jx = Cx(Jz) = —.047618,
Jy = fDy/N = 5/42 = +.119048, (Jy)2 = +.014172, Jy = Cy(Jy) = +.357144, XC = Zx + J. = 6 — .047618 = 5.952382, fD
SD', = (in class units)
N x (Jz)2
57
42
.000576 = ± 1.164721,
SD„ = (in true units) SD;(CC) = ±2.329442. Y=Zy +Jy = 9+.357144=9.357144 SD; = (in class units) f E 4!2 (J,)2 = j1.357143 — .014172 = ±1.158866, SD, = (in true units) SD;(Cy) = ±3.476598.
rz —L
fNxDy (Jz)(J;) , — 33/42 — (—.023809)(.119048)
(SDr)(SD;)
(1.164721)(1.158866)
= .5842,
also ryx = J(byx)(bxy) = .5842. y = , IN(1 .(1 — .58422)(42) SEEb x SD —24') — 3.476598 40 N_ SE _ SEEbyx = 2.8913 42 ~/N
+.4461.
2.8913.
XXXIII.—A pre-fabricated correlation table. The assumed means, Z. and Zr,, are placed between the double lines running through the centre. In the main body of the table, the upper italic figures are frequencies entered by the computor. The lower figures are the products (Dy) (Dx) for each cell. In the right-hand third of the table and the lower third are other preprinted figures as indicated (top and right).
TABLE
Ef
Zx
D2y
-13
+9
+2
+4
-1-1
-1-1
0
0
—1
+1
—2
+4
—3
+9
N
E fDy
E fD2y
E fDyD.,
42
+5
+57
± 33
0
2
4
6
8
10
12
18
—9
—6
—3
0
+3
1 +6
-1 9
15
—6
—4
—2
2 0
1 +2
+4
+6
12
—3
—2
2 —1
3 0
4 +1
1 -1-2
-1-3
9
0
1 0
3 0
5 0
3 0
2 0
0
6
+2
6
+3
+1
2 0
1 —1
—2
—3
3
+6
2 +4
1 +2
0
—2
—4
—6
0
+9
+6
+3
0
—3
—6
—9
0
4
12
12
9
5
0
Dx
—3
—2
—1
0
+1
+2
+3
E fDx
=
—1
D2x
+9
+4
+1
0
+1
+4
+9
E fD2x =
+ 57
0
+10
+6
0
+5
+12
0
X
fDyDx
Dy
Y
1
4
1
+6
10
-1-4
14
0 +7
10
+ 10
3
0
0
E
fDyDx
NOI1V I32IZIOJ
Zu
-16
1
E fDx Dy = +33
1/40
96
THE ESSENCE OF BIOMETRY
The fiducial limits about Pu, for 95 per cent certainty (where T.05 at 40 d.f. = 2.021) are then upper limit = 7 + T05(SEy) = 9.3571 + (2.021)(.4461) = 10.2585, lower limit = F — T05(SEE) = 9.3571 — (2.021)(.4461) = 8.4557.
byx = ryx(SDyISDx) = .5842(3.4766/2.3295) = .8719, bxy = ryx(SDx/SDy) = .5842(2.3295/3.4766) = .3914. SEby,c = ~
Ebyr
1)2 — E1~X—X
±.0192.
Note that we do not use > fDx in the numerator, as this involves deviations from Zx, and here we have them from 1. We note that f (X — 1)2 = N(SDx)2 = 227.9046, whence SEb,= ±.0192. For 95 per cent certainty, where T.05 at 40 d.f. = 2.021, the fiducial limits about byx are
I
upper limit = byx + T.os(SEbyx) = .8719 + (2.021)(.0192) = .9107, lower limit = byx — T05(SEbyx) = .8719 — (2.021)(.0192) = .8331. In order to test to see if there is any regression
b T
SEby,~
.8719 45.41. .0192 —
At 40 d.f., P is much less than .001, and therefore byx differs significantly from zero, and there is regression. One could deal with the whole problem in reverse, considering X as the dependent variable, by exchanging X for Yin all the above calculations. Does ryx differ significantly from zero? This test is carried out to see if there is really any correlation, and is done by the now familiar T test where T=r 4(1
r2 ) = .5842- r (1
42 = 4.5524. .58422)
Where N = 42, N — 2 = 40, from a table of the distribution of T, P is much less than P = .001, and thus the value ryx = .5842 does differ from zero. The significance of the difference between two correlation coefficients When testing the significance of the difference between two means, one can use a standard error of the difference built up from the standard errors of the two means concerned. It is not possible to do this in the case of the significance of the difference between two correlation coefficients, because ryx is not normally distributed. The procedure is to transform the two is to a variable Z, which is normally distributed,
CORRELATION
97
carry out the desired test, and then, if necessary, transform back to r's. This Z should not be confused with Zx, which is an assumed mean. Where, for simplicity of the formulas, we write Z = Z,.x and r = r,,x, it can be shown that 3
5
Z=r+ 3 +
7
+ t +..., 5 7
which is equivalent to Z=
loge(1 + r) - loge(1 - r) 2
Even this is not too easy to compute, and one uses a table such as Table XXXIV. TABLE
XXXIV.-The transformation of r to
z'
z
.00
.01
.02
.03
.04
.05
.06
.07
.08
.09
.0 .1 .2 .3 4 .5 .6 .7 .8 .9 1.0 1.5 2.0 2.5 2.9
.0000 .0997 .1974 .2913 .3800 .4621 .5370 .6044 .6640 .7163 .7616 .9051 .96403 .98661 .99396
.0100 .1096 .2070 .3004 .3885 .4699 .5441 .6107 .6696 .7211 .7658 9069 .96473 .98688 .99408
.0200 .1194 .2165 .3095 .3969 .4777 .5511 .6169 .6751 .7259 .7699 .9087 .96541 .98714 .99420
.0300 .1293 .2260 .3185 .4053 .4854 .5580 .6231 .6805 .7306 .7739 .9104 .96609 .98739 .99431
.0400 .1391 .2355 .3275 .4136 .4930 .5649 .6291 .6858 .7352 .7779 .9121 .96675 .98764 .99443
.0500 .1489 .2449 .3364 .4219 .5005 .5717 .6351 .6911 .7398 .7818 .9138 .96739 .98788 .99454
.0599 .1586 .2543 .3452 .4301 .5080 .5784 .6411 .6963 .7443 .7857 .9154 .96803 .98812 .99464
.0699 .1684 .2636 .3540 .4382 .5154 .5850 .6469 .7014 .7487 .7895 .9170 .96865 .98835 .99475
.0798 .1781 .2729 .3627 .4462 .5227 .5915 .6527 .7064 .7531 .7932 .9186 .96926 .98858 .99485
.0898 .1877 .2821 .3714 .4542 .5299 .5980 .6584 .7114 .7574 .7969 .9201 .96986 .98881 .99495
* Abridged from Table VIII of Fisher and Yates' Statistical Tables for Biological, Agricultural and Medical Research, published by Oliver and Boyd Ltd , Edinburgh, and by permission of the authors and publishers.
The characteristics of Z: 1. It is dependent upon r, but while r varies from -1 to + 1, Z varies from - co to + co. 2. Where r is small, below r = .5 (approximately), Z is nearly equal to r. 3. Where r is large, Z is greater than r and increases rapidly. It thus provides a more delicate testing variate at these important higher values. 4. The standard error of Z is very simple, being
SE..-
1
/N- 3
and is thus independent of the value of Z. It follows that, when several values of ryx are considered which use the same value of N, all the Z's resulting from
THE ESSENCE OF BIOMETRY
98
transformation to Z have the same standard error. Furthermore, all differences between such ryx's having the same N lead to the same standard error of the difference between their Z's after transformation. Example. From Table XXXII, we determined a value ryx = .5842, where N = 42. Does it differ significantly from, say, ryx = .6169 where N' = 44? Using Table XXXIV, we find that, for ryx = .5842, Zyx = (by interpolation) .6688, and for ryx = .6169, ZZx = .7200.
SE~yx, with d.f. = 42, is 1f 42 — 3 = ±.1601, SEi;,,, with 44 d.f. = 1/.J44 — 3 = ±.1562. As the two numbers of degrees of freedom are much the same, as are also the standard errors (whence by implication, the variances), we may use the simplest formula for the standard error of the difference SEdirr = V SEz ,x + SEZ.yx
= V.16012 + .15622 = +.22506.
Then, as usual, T=
dill = .0512/.2251 = .2275. SEdi rr
As Z is normally distributed and is independent of the number of degrees of freedom, we may as well use d.f. = oo in entering the T table. We find on doing so that P is greater than .8, so that the difference between the two Z's (and thus between the two r's) almost certainly arose by chance, and the two r's do not differ significantly. We could, if we wished, pool the data for the two r's, as they can be considered to come from the same universe. Fiducial limits about ryx , the true correlation coefficient of the universe We have determined a correlation coefficient ryx = .5842, and we must suppose that this is an approximation to the ryx„ of the universe from which this sample was drawn. In order to calculate the fiducial limits, we must work through the Z-transformation, find the fiducial limits about Zyx., and then back-transform to find those about rm. We have now that ryx = .5842, for which Zyx = .6688 and SEzy,, = ±.1601. The fiducial limits about ZM are then (d.f. = 42 — 2 = 40, T 05 = 2.201), upper limit = .6688 + (2.201)(.1601) =1.0212, lower limit = .6688 — (2.201)(.1601) = .3164. Converting these limiting values of Zyx back to values of ryx, using Table XXXIV in reverse, the limits for ryx are: upper limit = .7703 (approx); lower limit = .3062 (approx).
CORRELATION
99
Averaging several values of ryx to obtain a more precise estimate If one wishes to obtain a more exact value for the correlation coefficient for a given situation, one might carry out a number of sets of measurements, proposing to average these to obtain Fyx. This must be done by converting all the r's to Z's, averaging the Z's, and then back-converting from Zyx to F.yx. Before this can be done, however, one must be sure that all the r's are from the same universe, otherwise they cannot legitimately be averaged. Note here that it is not sufficient to say that the r's must be from the same universe if they are derived from measurements made on the same material. The universe in this case is a statistical concept implying data having a certain mean, standard deviation, etc. By chance (operating perhaps through non-random sampling, or through samples of inadequate size) it is possible to obtain statistics which do not come from the same universe even though derived from the same mass of data. The test we are about to perform (to see if the set of r's form a homogeneous set from the same universe) will not pick out the aberrant r's, but one can usually identify these, checking them if necessary by means of a T test against other r's in the set, as will be explained later. What, in effect, we shall do is to convert the r's to Z's, and then carry out a special chi-squared test with the null hypothesis, `There is no difference between this set of Z's and a homogeneous set.' Table XXXV shows the values of ryx from six samples, A to F (column 4). N and the degrees of freedom are shown in colums 2 and 3, and the conversions to Z in column 5. TABLE
XXXV.—Illustrating the averaging of a set of values of ryx. Column numbers for reference
1
2
3
4
5
Sample designation
N
d.f. = N — 3
ry x
Zyx
(N — 3)Zyx
(N — 3)Z2yx
A B C D E F
42 35 24 61 39 43
39 32 21 58 36 40
.5717 .6169 .6231 .5580 .6107 .5784
.6500 .7200 .7300 .6300 .7100 .6600
25.35 23.04 15.33 36.54 25.56 26.40
16.4775 16.5888 11.1909 23.0202 18.1476 17.4240
152.22
102.8490
Totals
226
6
7
8
Zyx = 152.22 _ 226 .6735398
We now weight each Z by a weight equal to the reciprocal of the variance of Z, the variance of Z in this case being SE?. One might ask `Why not use SD??' Remember that we are thinking of the variance of a number of Z's clustering around an approximation to Zyx,,. They are distributed around this quantity with a standard deviation which is SEE. Each SE_yx is an approximation to this (see the discussion under the standard error of the mean, in Chapter IV). 8
100
THE ESSENCE OF BIOMETRY
The reciprocal of the variance of Z is 1/(1/(N — 3)) = N — 3. We obtain a weighted mean of the Z's thus Zyx Zyx (weighted) _ (N 33 — 152.22/226 = .6735398.
The weighted values of Z are shown in column 6 of Table XXXV, and the values of (N — 3)4 are shown in column 7. The special form of chi-squared can be shown to be, x2 = 1[(N — 3)4J — Zyx E [(N — 3)Zyx], where the number of degrees of freedom is the number of Z's used, minus one. This is in this case 6 — 1 = 5. From Table XXXV, X2 = 102.8490 — .6735(152.22) = .3289. From a chi-squared table, with 5 d.f. P is greater than .99, so that we can be virtually certain that there is no difference between the set of Z's (and thus between the set of r's) and a homogeneous set. We then convert 4, back to Pyx = .5873. This is the mean value of the r's of Table XXXV. Where there are many r's to average, one should take note of the fact that each value of Z is very slightly too large. Some rather uncertain corrections have been proposed to allow for this, as discussed in Snedecor's Statistical Methods.
X
RANK CORRELATION THE PRODUCT-MOMENT CORRELATION, r px, discussed
in Chapter IX, dealt with data in the form of frequencies of occurrence of pairs of values of Y and X. The values of Y and X were objective values, arranged on scales. They might have been hours, or pounds, or what-you-will. Frequently, however, one does not have values like this. One has only subjectively rated values, or perhaps scores from some judging process. One gets such things in beauty contests and in consumer-preference trials. One may find that a soup has been given, say, a score of eight, but this is not a simple thing such as a weight or length. However, one may nevertheless wish to know if two variables of this sort are related in some way rather like a correlation, even though all one can do with them is to place them in some sort of order of merit. One may wish to know, perhaps, if two or more judges (who are compiling scores) are consistent in their judging. Such problems can be dealt with by the use of Spearman's rank correlation, symbolized by r,. The process can really be understood only by the use of an example, and for this purpose we shall use the data of Table XXXIII, thinking of the values of Y and X TABLE XXXVI.—The
data of Table XXXIII, shorn of irrelevant material, preliminary to ranking as done in Table XXXVII. Colunm numbers for reference 1 I 2 I 3 I 4 I 5 I 6 I 7 Values of X
Y
0
2
4
6
8
10
12
21 18 15 12 9 6 3 0
0
0
0
0
0
0
2 3 6
2 3 5 2
1 4 3 1
0 1 1 1
12
9
Total X frequencies
1 1 2 4
2
101
5
0
1 4 10 14 10 3 0
1
12
Total Y frequencies
0
42
102
THE ESSENCE OF BIOMETRY TABLE XXXVII.-Setting up the preliminary ranks and converting to assigned ranks. Column numbers for reference
1
2
3
4
5
6
7
8
9
10
11
12
X
Prel. rank
Ass'd. rank
X
Prel. rank
Ass'd. rank
Y
Prel. rank
Ass'd. rank
Y
Frel. rank
Ass'd. rank
10
1
3
6
25
20.5
18
1
1
9
25
22.5
10
2
3
6
26
20.5
15
2
3.5
9
26
22.5
10 10 10
3 4 5
3 3 3
4 4 4
27 28 29
32.5 32.5 32.5
15 15 15
3 4 5
3.5 3.5 3.5
9 9 9
27 28 29
22.5 22.5 22.5
8 8 8 8 8 8 8 8 8
6 7 8 9 10 11 12 13 14
10 10 10 10 10 10 10 10 10
4 4 4 4 4 4 4 4 4
30 31 32 33 34 35 36 37 38
32.5 32.5 32.5 32.5 32.5 32.5 32.5 32.5 32.5
12 12 12 12 12 12 12 12 12
6 7 8 9 10 11 12 13 14
10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5
6 6 6 6 6 6 6 6 6 6
30 31 32 33 34 35 36 37 38 39
34.5 34.5 34.5 34.5 34.5 34.5 34.5 34.5 34.5 34.5
6
15
20.5
2
39
40.5
12
15
10.5
3
40
41.0
6 6
16 17
20.5 20.5
2 2
40 41
40.5 40.5
9 9
16 17
22.5 22.5
3 3
41 42
41.0 41.0
6
18
20.5
2
42
40.5
9
18
22.5
6 6 6 6 6 6
19 20 21 22 23 24
20.5 20.5 20.5 20.5 20.5 20.5
9 9 9 9 9 9
19 20 21 22 23 24
22.5 22.5 22.5 22.5 22.5 22.5
in that table as if they were scores obtained by some judging process. Y might be the colour of the food, and X its taste, for example. For our present purposes, Table XXXIII is much cluttered with irrelevant material. Table XXXVI shows the same data, shorn of this. The first stage in the process is to arrange the values of Y and then (separately) those of X into ranks. This is best done in two steps. We first assign preliminary ranks giving rank 1 to the highest value of X and a rank of 2 to the next highest, down to a rank of 42 for the lowest value of X. In column 6 of Table XXXVI, it is seen that there are five observations for which X = 10. These occupy then the ranks 1, 2, 3, 4, and 5 of column 2, Table XXXVII. The nine cases for which X = 8 (column 5) then occupy the next nine ranks, i.e. rank 6 to rank 14
RANK CORRELATION
103
XXXVIII.-Pairing off the values of Y and X, with other calculations leading to r..
TABLE
Column numbers for reference 1
2
3
4
5
6
7
X
ARx
Y
ARy
f
(ARx - ARy)
(ARx - ARy)2
Front col.
6, Table XXXVI 18 15 12 9
1 3.5 10.5 22.5
1 1 1 2
+ 2.0 - .5 - 7.5 -19.5
4.0 .25 56.25 380.25
15 12 9 6
3.5 10.5 22.5 34.5
1 4 3 I
+ 6.5 - .5 -12.5 -24.5
42.25 .25 156.25 600.25
15 12 9 6
3.5 10.5 22.5 34.5
2 3 5 2
±17.0 +10.0 - 2.5 -14.5
289.00 100.00 6.25 210.25
12 9 6 3
10.5 22.5 34.5 41.0
2 3 6 1
+ 22.0 10.0 - 2.0 - 8.5
484.00 100.00 4.00 72.25
9 6 3
22.5 34.5 41.0
1 1 2
+18.0 + 6.0 - .5
324.00 36.00 .25
10 10 10 10
3 3 3 3
Front col. 5, Table XXXVI
8 8 8 8
10 10 10 10
From col. 4, Table XXXVI
6 6 6 6 Fromt col.
4 4 4 4
20.5 20.5 20.5 20.5 3, Table XXXVI 32.5 32.5 32.5 32.5
Front col. 2, Table XXXVI
2 2 2
40.5 40.5 40.5
inclusive. Finally, as per column 2 of Table XXXVI, the four cases for which X = 2 occupy the last four ranks i.e. ranks 39 to 42 inclusive. We then do exactly the same thing with the values of Y, but quite separately and with no reference to what we have done with the values of X. All this is shown in Table XXXVII. We must now convert the preliminary ranks to assigned ranks and pair off the values of Y and X, as is done in Tables XXXVII and XXXVIII. The assigned rank for a value of X or Y is the average of the smallest and largest preliminary ranks for that value of X or Y. For example: the assigned rank for Y = 12 is (6 + 15)/2 = 10.5. Table XXXVIII is made up by working down the colums of Table XXXVI, cell by cell, beginning with the second cell in column 6. In this cell, we have one case for
THE ESSENCE OF BIOMETRY
104
which X = 10 (assigned rank = 3) paired up with Y = 18 (assigned rank = 1). These data produce the first entry in Table XXXVIII, i.e. 10, 3, 18, 1, 1. The last entry in Table XXXVIII comes from the bottom cell in column 2 of Table XXXVI and is 2, 40.5, 3, 41.0, 2, meaning X = 2, AR, = 40.5, Y = 3, AR, = 41.0 and there are two cases like this, where X = 2 and Y = 3. In columns 6 and 7 of Table XXXVIII, the values of the difference and the difference squared are calculated. Multiplying each value in column 7 by the frequency in column 5 and summing, we obtain E (ARx — ARy)2(f) = 4987.75. It can then be shown that (where n = E f = 42)
— 1 — 6 E (f)(ARx - ARy)2 —
(6)(4987.75) —1— .40416 = .5958. 42(422 — 1) —
n(n2 — 1)
It will be remembered that ryx was .5842, a not dissimilar value. Are two observers or judges consistent in their assessments? Let us suppose that we must assess the quality of a number of products, and that we decide to employ judges to do the work for us. Obviously we require the judges (two in this case) to have the same scale of values, or nearly so. Let us employ Jones and Smith and ask them to assess products A to H by putting them in order of merit. They do this, and the results are as shown in Table XXXIX. TABLE XXX1X.—The judging of eight products by two judges, Jones and Smith. Product symbol A B C D E F G H
Ranking by Ranking by Jones = AR' Smith = AR, (ARt — AR,) (ARt — AR,)" I 2 3 4 5 6 7 8
1 3 5 2 4 8 7 6
0 I 2 2 1 2 0 2
0 1 4 4 1 4 0 4
Total
18
In this case, there is only one judgement of each thing for each judge, so that .f is always f = 1, and 6E ARS )2 - .7857. 8(82-1) t'S =1
(AR. -
In the matter of the agreement, or lack of it, between Jones and Smith, we have
T= rs f 1_r 2—.7857\
1_.78572-
3.1110.
RANK CORRELATION
105
From a T table, where the degrees of freedom are 8 — 2 = 6, P (by interpolation) is about .0214, less than P _ .05. That is to say, there is only this very small possibility that the extent of the differences between Jones and Smith has arisen by chance. We conclude then that the differences are due to lack of agreement in the scale of values, and Jones and Smith are not a pair of consistent judges. Strictly speaking, we should not use this simple method if n is less than ten, but it is often so used. Ferguson, in Statistical Analysis in Psychology and Education,6 discusses other, more elaborate methods. The case of many judges assessing numerous products In consumer trials, one may have a great many judges and many products. The `tasting bee' is an example, where groups of people sip soup or eat bits of cake, and then write out scores, placing the products in order of merit. Many of these trials are not worth the effort involved because the judges are not consistent, but they are generally accepted as good clean fun. Let us examine the case of M judges (say M = 7), assessing nine products (N = 9). The results are as shown in Table XL. TABLE
The case of many judges assessing numerous products
In consumer trials, one may have a great many judges and many products. The `tasting bee' is an example, where groups of people sip soup or eat bits of cake, and then write out scores, placing the products in order of merit. Many of these trials are not worth the effort involved because the judges are not consistent, but they are generally accepted as good clean fun. Let us examine the case of M judges (say M = 7), assessing nine products (N = 9). The results are as shown in Table XL.

TABLE XL.—The judging of nine products by seven judges.

                                Product symbols
Judge          A    B    C    D    E    F    G    H    J    Totals
Knowles        1    3    5    2    6    9    7    8    4      45
Lang           3    1    5    4    2    8    6    9    7      45
Moore          1    2    4    3    8    6    7    9    5      45
Nilson         4    3    1    5    6    2    7    8    9      45
Oliver         1    2    3    9    4    5    6    7    8      45
Perry          1    3    2    4    5    9    8    7    6      45
Quigley        2    1    5    3    4    8    7    9    6      45
Rank totals   13   15   25   30   35   47   48   57   45     315
Now it is obvious that the total of ranks for each judge must be
1 + 2 + 3 + 4 + 5 + ... + N = N(N + 1)/2 = (9)(10)/2 = 45.
This is so because we make the assumption that each judge must give a product a definite rank, and there are (in this example) no tied ranks. There being M judges, the grand total of ranks is
total ranks = MN(N + 1)/2 = (7)(45) = 315.
6. G. A. Ferguson, Statistical Analysis in Psychology and Education (Toronto, McGraw-Hill, 1959).
If there is no real discrimination on the part of the judges, they will just write down anything, and each product would then get a sort of average rank, which is
average total rank = M(N + 1)/2 = (7)(10)/2 = 35.
If, on the other hand, the judges exercise perfect discrimination and are all in complete agreement, one of the products would get all rank 1, another would get all rank 2, and so on down to the worst product, which would be universally rated as rank N. This would be rank 9 in this example. This done, the rank totals could be arranged in a series, running from the smallest, to which all judges gave rank 1, to the largest, to which all gave rank N (N equals 9); this series would be: M, 2M, 3M, 4M, ..., 9M. It follows that the total difference between the actually-observed rank totals and those which would have been seen if there had been no discrimination is a measure of the extent to which the judges did in fact exercise discrimination. Let the difference for a product be Dr, and its square Dr². It can be shown that the greatest possible sum of squares is
Σ Dr²(max) = M²(N³ − N)/12.
For our example, Σ Dr²(max) = M²(N³ − N)/12 = 7²(9³ − 9)/12 = 2940. In Table XLI, the actual sum of squares of differences is calculated, and is found to be 1906. With these two values we may now calculate W, the coefficient of concordance, as the ratio between the actual sum of squares and the maximum possible sum,
W = Σ Dr² / Σ Dr²(max) = 1906/2940 = .6483.
Inasmuch as W can run from zero (with no agreement) to plus one (with perfect agreement), it would appear that our judges do have some measure of agreement in their judging. Is this agreement significant, or did it arise by chance?
TABLE XLI.—Calculations leading to the value of W, from the data of Table XL.

Product   Actual rank total = Rr   35 − Rr = Dr   (35 − Rr)² = Dr²
A                 13                   +22              484
B                 15                   +20              400
C                 25                   +10              100
D                 30                   + 5               25
E                 35                     0                0
F                 47                   −12              144
G                 48                   −13              169
H                 57                   −22              484
J                 45                   −10              100
                                              Total     1906
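The arithmetic of Table XLI is easily mechanized. The short Python sketch below (standard library only; the rankings are those of Table XL) recovers the rank totals, Σ Dr² and W.

rankings = [                        # one row per judge, columns are products A..J (Table XL)
    [1, 3, 5, 2, 6, 9, 7, 8, 4],    # Knowles
    [3, 1, 5, 4, 2, 8, 6, 9, 7],    # Lang
    [1, 2, 4, 3, 8, 6, 7, 9, 5],    # Moore
    [4, 3, 1, 5, 6, 2, 7, 8, 9],    # Nilson
    [1, 2, 3, 9, 4, 5, 6, 7, 8],    # Oliver
    [1, 3, 2, 4, 5, 9, 8, 7, 6],    # Perry
    [2, 1, 5, 3, 4, 8, 7, 9, 6],    # Quigley
]
M, N = len(rankings), len(rankings[0])
rank_totals = [sum(col) for col in zip(*rankings)]      # 13, 15, 25, 30, 35, 47, 48, 57, 45
average = M * (N + 1) / 2                               # 35
sum_d2 = sum((average - r) ** 2 for r in rank_totals)   # 1906
max_d2 = M ** 2 * (N ** 3 - N) / 12                     # 2940
W = sum_d2 / max_d2                                     # .6483
print(rank_totals, sum_d2, round(W, 4))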
Testing the significance of W
Several methods are available for testing the significance of W. If N is seven or more, we can use a form of the chi-squared test, in which
χ² = M(N − 1)W = (7)(8)(.6483) = 36.3048.
As is commonly the case in a chi-squared test, the degrees of freedom are one less than
the number of categories contributing to the total, or in this case 9 − 1 = 8. From a table of chi-squared it is found that with 8 d.f., P is less than .001. That is to say, there is only this small probability that the total difference between our judges' results and one in which there was no discrimination could have arisen by chance. Clearly the difference is not due to chance, and our judges do exercise discrimination. That is to say, W differs significantly from zero. In this connection, one may also use Snedecor's F test (see Chapter IV), and for this we require an adjusted form of W. The adjustment is made by reducing Σ Dr² by one and increasing Σ Dr²(max) by two, and then re-calculating W as:
W(adj) = 1905/2942 = .6475.
It can then be shown that
F = (M − 1)W(adj) / (1 − W(adj)) = (6)(.6475)/.3525 = 11.0213.
It will be noted from Chapter IV that, in passing to a table to obtain P in terms of F, two sets of degrees of freedom must be used: that for the variable with the greater variance, and that for the variable with the lesser variance. With this present usage of F, these are calculated as
d.f. for variable with greater variance = [M(N − 1) − 2]/M = (56 − 2)/7 = 7.7142,
d.f. for variable with lesser variance = (M − 1)[M(N − 1) − 2]/M = (6)(7.7142) = 46.2852.
In this application, the numbers of degrees of freedom are commonly fractional, and one must use (for great accuracy) interpolation in the F table. Using an F table with the above sets of degrees of freedom, we have P less than .001, showing again that our judges are a satisfactorily consistent group, exercising discrimination.

The average rank-correlation
One might feel that one could obtain a more accurate value of rs by determining it for every possible pair of judges and then averaging the results. This is a practicable operation if there are only two or three judges, but very cumbersome if there are many. For example, with eight judges there would be 8C2 = (8)(7)/2 = 28 pairs. Fortunately there is a simpler method, in that
rs(average) = (MW − 1)/(M − 1) = [(7)(.6483) − 1]/(7 − 1) = .5897.
This does not differ much from the previously obtained value of rs = .5958.
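A short Python sketch of the same significance calculations may be useful; the figures are those found above, and the adjusted sums (1905 and 2942) are simply taken as given.

M, N, W = 7, 9, 0.6483
chi_squared = M * (N - 1) * W               # = 36.30, with N - 1 = 8 d.f.
W_adj = 1905 / 2942                         # = .6475
F = (M - 1) * W_adj / (1 - W_adj)           # = 11.02
df_greater = (M * (N - 1) - 2) / M          # = 7.7142
df_lesser = (M - 1) * df_greater            # = 46.2852
rs_average = (M * W - 1) / (M - 1)          # = .5897
print(chi_squared, F, df_greater, df_lesser, rs_average)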
Putting the products in final order of merit
Once we have assured ourselves that the judges are a consistent group and have thus done their jobs properly, we may put the products in a final order of merit, and in Table XL the order would be A, B, C, D, E, J, F, G, H. A would be the best product, having a total rank of 13, and H the poorest, with a total rank of 57. We may then assign new ranks, 1, 2, 3, etc.

The coefficient of consistence
If one takes three quantities, A > B > C, and makes measurements of them with some instrument, one will always find that A is greater than B, B is greater than C, and finally C is less than A. This is so because the instrument does not think. It works simply and logically. People are not so invariant in their actions. Jones may well say that he prefers an orange to an apple because he likes the cool juicy nature of the orange. He then says that he likes an apple more than a pear because he enjoys the firm consistency of the apple and its tart taste. When, however, we question him about his preference for pears versus oranges, he says he likes pears more than oranges (an inconsistent choice). Questioned further, he says that he makes this choice because he likes the delicate aroma of the pear. Clearly the inconsistency comes about because Jones changes his criteria. Such judgements may be mathematically inconsistent, but are really emotionally quite reasonable. The fact that such inconsistent judgements can occur is the basis of Kendall's coefficient of consistence, K. Suppose that we have six products, A, B, C, D, E, and F. We present these for judging in every possible combination, asking which of each pair is preferred. For the pair (A, B), if A is preferred to B, we write `1,' and if B is thought to be better than A, we write `0.' We can then arrange the results in a table, as in Table XLII.
TABLE XLII.—Paired comparisons of six products.

          A        B        C        D        E        F       Row total = X   (X − X̄)   (X − X̄)²
A         *       1 (1)    1 (2)    0 (3)    1 (4)    1 (5)          4            1.5        2.25
B       0 (6)      *       1 (7)    1 (8)    1 (9)    0 (10)         3             .5         .25
C       0 (11)   0 (12)     *       1 (13)   1 (14)   1 (15)         3             .5         .25
D       1 (16)   0 (17)   0 (18)     *       0 (19)   0 (20)         1            1.5        2.25
E       0 (21)   0 (22)   0 (23)   1 (24)     *       1 (25)         2             .5         .25
F       0 (26)   1 (27)   0 (28)   1 (29)   0 (30)     *             2             .5         .25
The cells in this table contain the symbols `1' and `0' for all the comparative judgements of the products down the left-hand side versus those across the top. Inasmuch as one cannot compare anything with itself, the diagonal row of cells is blank. For purposes of explanation, each cell is then numbered (in brackets) so that we can refer to a given cell. Let us symbolize the judgement `A is better than B' as A > B, and begin to fill the cells of the table. Suppose A > B, then we enter 1 in cell (1), 0 in cell (6). Suppose now that, for the sake of argument, B > C, then 1 in (7), 0 in (12); but if A > B > C, then A > C, and 1 in (2), 0 in (11). Suppose now that C > D, then 1 in (13), 0 in (18). But, if A > B > C > D, then B > D, and so 1 in (8), 0 in (17); and from the above A > D, and we should have 1 in (3), 0 in (16). In point of fact, we note 0 in (3) and 1 in (16), indicating the inconsistent judgement that, while A > B > C > D, D > A (!). If the reader will follow out the whole train of reasoning in the manner shown above, he will find that if we had perfectly consistent judgement whereby A > B > C > D > E > F, with no inconsistent reversals, all the cells above the diagonal would contain the figure 1, and all those below it, 0. Were this the case in Table XLII, the row totals (X) would be 5, 4, 3, 2, 1, 0. X̄ would equal 2.5, and the variance of X would be V = SD²X = 17.50/6 = 2.9167. This is the maximum value which V can attain for N = 6 (six things judged). If, as is actually the case, there are some inconsistent judgements, V is reduced in value; in Table XLII it is reduced to only .9167. The minimum possible value of V (which occurs when there is a maximum number of inconsistencies) depends upon whether N is odd or even. If N be odd, each row total is (N − 1)/2, and V = 0. If N be even, Vmin = .25. This may seem curious, but if the reader will set up a little table with N = 4, A > B, B > C, C > D, and A < C, A < D, and B < D, the row totals will be 1, 1, 2, 2; X̄ will be 1.5, and V will be .25. If the introduction of inconsistencies reduces the value of V, we can use this fact to measure the extent of inconsistent judgement by setting up Kendall's K as
K = V/(range of V),
and a little algebra will show that
for N (odd): K = 12V/(N² − 1);   for N (even): K = (12V − 3)/(N² − 4).
In the example of Table XLII, where N = 6 (even) and V = .9167,
K = [12(.9167) − 3]/(36 − 4) = .25.
The meaning of K is this: K is the proportion of consistent judgements out of the maximum possible number of such. We get at it by working with the inconsistent judgements. It can be shown that the number of inconsistencies is
I = (N³ − N − 12NV)/24 = [216 − 6 − 72(.9167)]/24 = 6.
The maximum possible number of inconsistencies occurs when V = 0 for N (odd) and V = .25 for N (even), and we have
for N (odd): Imax = (N³ − N)/24;   for N (even): Imax = (N³ − 4N)/24.
In our example of Table XLII, Imax = 8, I = 6, and so the proportion of inconsistent judgements is 6/8 = .75, while the proportion of consistent judgements is K = (8 − 6)/8 = .25, as previously calculated. Unfortunately, we cannot use a chi-squared test to see if the value of K is significant, because the row totals are not independent of each other. So far a suitable test has not been discovered. The whole matter of this sort of comparison of judgements can be extended to numerous judges, and a ready source of discussion of this is Facts from Figures.7
7. M. J. Moroney, Facts from Figures (Harmondsworth, Penguin Books Ltd., 1958).
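The whole of the calculation for Table XLII can be put into a few lines of Python (standard library only). In the sketch below a preference of the row product over the column product is recorded as 1, exactly as in the table.

prefs = {   # (row, column): 1 if the row product is preferred to the column product
    ('A', 'B'): 1, ('A', 'C'): 1, ('A', 'D'): 0, ('A', 'E'): 1, ('A', 'F'): 1,
    ('B', 'C'): 1, ('B', 'D'): 1, ('B', 'E'): 1, ('B', 'F'): 0,
    ('C', 'D'): 1, ('C', 'E'): 1, ('C', 'F'): 1,
    ('D', 'E'): 0, ('D', 'F'): 0,
    ('E', 'F'): 1,
}
items = ['A', 'B', 'C', 'D', 'E', 'F']
N = len(items)
wins = {i: 0 for i in items}               # the row totals X of Table XLII
for (i, j), v in prefs.items():
    wins[i if v == 1 else j] += 1
X = [wins[i] for i in items]               # 4, 3, 3, 1, 2, 2
mean_x = sum(X) / N                        # 2.5
V = sum((x - mean_x) ** 2 for x in X) / N  # .9167
K = 12 * V / (N ** 2 - 1) if N % 2 else (12 * V - 3) / (N ** 2 - 4)   # .25
inconsistencies = (N ** 3 - N - 12 * N * V) / 24                      # 6
print(X, V, K, inconsistencies)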
XI
CONTINGENCY OR ASSOCIATION
THE product-moment correlation coefficient, ryx, is
designed to measure the extent of the relationship between two variables, y and x, when more or less exact values of y and x can be determined, and when one has many such values. In other cases it is not possible to assign exact values. One may have to work with attributes such as `large,' `small,' `nice-tasting,' and so on. The number of classes may be too small for the calculation of a valid ryx. It may be possible to do no more than arrange the variables in an order of merit—in ranks. In this case Spearman's rank correlation coefficient, rs, is useful, but it is only an approximation. The contingency method to be described below is a form of the chi-squared test which determines whether there is, or is not, a significant degree of relationship between two or more variables. It does not indicate the degree of relationship, or whether the relationship is direct, reverse, true, or spurious. Familiarity with the problem in hand, however, may enable the research worker to decide as to the nature of the relationship once its existence has been shown. The question as to whether gentlemen do or do not prefer blondes is now perhaps a little passé but, if the reader wishes to study the matter, he must know how to do contingency tests! In carrying out massive investigations where numerous factors are studied, it is often of the highest importance that the investigator know which of these many factors really contribute to the end result. Much money and time can be saved if one eliminates the unimportant factors. One might find, for example, that the fact that certain parents smoke heavily has no effect upon the incidence of, say, deafness in their offspring. It is then unnecessary to ask further questions about this point, and money and time are saved. The contingency method has other advantages. It will work with as few as two categories, in which case there is only one degree of freedom. It is also easy to operate.

The two-by-two contingency table
The simplest form of contingency is that involving the two-by-two table in which there are only four cells. It serves to study any problem in which we have, say, two kinds of people, each with two possible ways of action. We might for example have men and women who do and do not smoke. We might have patients both under 21 and over 21 who have been given cyclopropane, and some who have not. Table XLIII shows some fictitious data for a study of possible relationship between hair colour and
preference for some food. The numbers in the cells are the frequencies with which the two kinds of people say they do and do not like this food.

TABLE XLIII.—Fictitious data for hair colour versus preference for a certain food.

Hair colour       Like the food    Do not like it    Row totals
Blonde                  80               211             291
Brunette                61                15              76
Column totals          141               226             367
It will be seen that 367 persons were questioned: 291 blondes and 76 brunettes. The majority of the blondes said they did not like the food (211/291) while most of the brunettes said they did like it (61/76). Is there any significant relationship, then, between hair colour and liking for this food? Now, obviously the number of persons questioned has nothing whatever to do with the answer, provided we have taken a random sample and have questioned a reasonably large number of people. Furthermore, the row and column totals do not affect the answer. Therefore, all these totals always remain inviolate in contingency calculations. They must never be altered. It is the frequencies within the cells which provide the nub of the problem. They are what they are because of the degree of relationship which may or may not exist. If there had been no association at all between hair colour and liking for the food, the cell frequencies would have been simply what one might expect on the basis of chance. That is to say, one could put all the numbers equal to or less than 291 in a hat, and draw out one of them to fill the cell which actually contains the number eighty. Once this number is fixed, if the row and column totals are to remain inviolate, all the other cell frequencies are also fixed. This shows clearly why the two-by-two table has only one degree of freedom.* To test for association, we must first determine what the cell frequencies would have been if there had been no association. Then, calling the observed cell frequencies Fo and the calculated frequencies Fc, we can apply the chi-squared test to see if there is any significant difference between the cell frequencies appropriate to `no association' and those we actually have. If there is such a difference, we conclude that some association exists. For simplicity, we will temporarily call the observed frequencies A', B', C', and D' and the calculated frequencies A, B, C, and D, as shown in Table XLIV. It can then be shown that the cell frequencies will be those of no association if
A/B = (A' + C')/(B' + D'),   C/D = (A' + C')/(B' + D'),
A/C = (A' + B')/(C' + D'),   and   B/D = (A' + B')/(C' + D'),
* It has been suggested to the writer that Table XLIII is a poor example. The expressed choices may change from day to day, and so may the hair colour. This has been taken care of; these are standardized ladies, with fixed hair colour and immutable preferences. It took years to find them!
whence
A = (A' + B')(A' + C')/N,   B = (A' + B')(B' + D')/N,
C = (C' + D')(A' + C')/N,   D = (C' + D')(B' + D')/N,
where N = grand total of the table (367, in this case). That is to say, the calculated frequency in each cell is the product of the row total times the column total appropriate to that cell, divided by N. The results for the data of Table XLIII are shown in Table XLIV. When the calculated frequencies have been found, one should always check the marginal totals to make sure that they are unchanged.

TABLE XLIV.—Observed and calculated cell-frequencies from Table XLIII.

Hair colour       Like the food        Do not like it       Row totals
Blonde            A' =  80.00          B' = 211.00              291
                  A  = 111.80          B  = 179.20              291
Brunette          C' =  61.00          D' =  15.00               76
                  C  =  29.20          D  =  46.80               76
Column totals          141.00               226.00              367.00
                       141.00               226.00              367.00
It is now obvious that, formulated in the usual way, chi-squared is
χ² = (A − A')²/A + (B − B')²/B + (C − C')²/C + (D − D')²/D
   = (80.00 − 111.80)²/111.80 + (211.00 − 179.20)²/179.20 + (61.00 − 29.20)²/29.20 + (15.00 − 46.80)²/46.80
   = 70.9274, with one degree of freedom.
It will be found in carrying out such a calculation as the above that all the numerators are equal (in this case equal to 1011.24), thus simplifying the work. From a chi-squared table it will be found that P is less than .001. That is to say, there is only this small chance that the differences between the observed totals and random totals (with no association) could have arisen by chance. Clearly, then, they did not arise by chance, and there is some degree of association between hair colour and liking for the food. It will be noted that cells C and D contribute most to the chi-squared total, so that association is proved more from the brunette data than from the blonde.
Furthermore, there are excess observed frequencies in the B and C cells, so that one would judge that it is the brunettes who like the food, and the blondes who do not.

Direct calculation of chi-squared from the raw data
The above method is somewhat lengthy, but it did provide an explanation of the underlying meaning of a two-by-two contingency table. Chi-squared can be more easily calculated directly from the raw data by the following formula
χ² = N(A'D' − B'C')² / [(A' + B')(C' + D')(A' + C')(B' + D')] = (367)(80 × 15 − 211 × 61)² / (291 × 76 × 141 × 226) = 70.93,
as previously calculated.
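The same computation, by both routes, takes only a few lines of Python (standard library only); the cell frequencies are those of Table XLIII.

a, b = 80, 211        # blondes: like the food / do not like it
c, d = 61, 15         # brunettes: like the food / do not like it
n = a + b + c + d     # 367
# calculated (`no association') frequencies: row total times column total, over n
A = (a + b) * (a + c) / n     # 111.80
B = (a + b) * (b + d) / n     # 179.20
C = (c + d) * (a + c) / n     # 29.20
D = (c + d) * (b + d) / n     # 46.80
chi_squared = sum((obs - exp) ** 2 / exp
                  for obs, exp in zip((a, b, c, d), (A, B, C, D)))
# the direct formula from the raw data
chi_squared_direct = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(chi_squared, 2), round(chi_squared_direct, 2))   # both about 70.93, with 1 d.f.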
Yates' correction where there are small cell-frequencies
So far, we have calculated chi-squared without worrying about the fact that, while chi-squared is a continuous distribution, the cell frequencies are very far from such. We have only the values 15, 61, 80, and 211, nothing like a continuous distribution! Some correction should therefore be made to compensate for this fact, but it is not worth making unless a cell frequency is less than five in a two-by-two table. The rules for the correction are simple:
1. Add .5 to the observed frequency in any cell in which the observed frequency is less than the calculated frequency;
2. Subtract .5 from the observed frequency if it is greater than the calculated frequency;
3. Do this in every appropriate cell if you do it in any cell, but do not do it unless there is a cell or cells with observed frequency less than five;
4. Calculate chi-squared as before, using the now-adjusted observed frequencies.
Table XLV shows another set of data similar to that in Table XLIV, but with small cell frequencies. The observed frequencies are A', B', C', and D', the adjusted cell-frequencies are A", B", C", and D", and the calculated frequencies are A, B, C, and D.

TABLE XLV.—A table similar to Table XLIV, but showing Yates' correction.

Hair colour       Like the food         Do not like it       Row totals
Blonde            A'  =  2              B'  = 11                 13
                  A'' =  2.5            B'' = 10.5               13
                  A   =  7.4848         B   =  5.5152            13
Brunette          C'  = 17              D'  =  3                 20
                  C'' = 16.5            D'' =  3.5               20
                  C   = 11.5152         D   =  8.4848            20
Column totals          19                    14                  33
The values for chi-squared are then: without Yates' correction, χ² = 15.63; with Yates' correction, χ² = 12.92. Generally speaking, Yates' correction brings the observed frequencies closer to the calculated values, and so reduces chi-squared and increases P. This gives then a rather conservative answer, tending to deny association. The effect is negligible in most cases unless the observed frequencies are fairly well below ten. There is some argument as to the real value of Yates' correction. It would seem that it is an approximation, but it can be useful. Be it noted that if the observed frequencies are very close to the calculated values, the application of Yates' correction may have too strong an effect and may even give a false answer.

The factorial method for the direct calculation of P
The above methods get at P by calculating chi-squared and then using a chi-squared table to determine P. One can calculate P directly from the raw data by the factorial method, but only for a two-by-two table. It will be of interest to see the basis of the method in terms of the theory of probability (see Chapter I). Assume a two-by-two table, laid out as in Table XLIV, with the cells named
A  B
C  D
In the ordinary sense of tossing coins, let p be the probability that an event will occur on a single trial, and P_M be the probability that it will occur just M times in N trials. Then
P_M = [N!/(M!(N − M)!)] p^M q^(N−M).
If, for a two-by-two table, M be the value of the frequency A, and N be the row total (A + B), then the probability that the upper left-hand cell will contain the frequency A is
P_A = [(A + B)!/(A!B!)] p^A q^B,
and, if the marginal total is to be fixed, this will at once fix the B cell. Similarly, for the lower row of the table,
P_C = [(C + D)!/(C!D!)] p^C q^D.
It follows that the probability of obtaining the four given frequencies of such a table is, where the total frequency is (A + B + C + D),
P_ABCD = [(A + B)!(C + D)!/(A!B!C!D!)] p^(A+C) q^(B+D).
This will, however, be of unknown value unless p is known, and unfortunately it is not known. All is not lost, however, because the unknown factor involving p and q will be the same for all two-by-two tables with the same marginal totals, and therefore the probability P_ABCD will be dependent in a given case upon a quantity 1/(A!B!C!D!). The total of this, for all possible combinations of A, B, C, and D which will give the same marginal totals, can be shown to be
N!/[(A + B)!(C + D)!(A + C)!(B + D)!],
where N in this case is A + B + C + D. It follows that the probability of obtaining any given set of four cell frequencies is
P_ABCD = [1/(A!B!C!D!)] ÷ {N!/[(A + B)!(C + D)!(A + C)!(B + D)!]}
       = (A + B)!(C + D)!(A + C)!(B + D)! / [(A + B + C + D)!A!B!C!D!].
These factorials are of course very large, but their logarithms to base ten have been tabulated, as in Table XXX of Fisher and Yates' Statistical Tables. A table for even larger values may be found in Arkin and Colton's Tables for Statisticians.8 For the example of Table XLV, not using Yates' correction, we would have
P_ABCD = (13!)(20!)(19!)(14!) / [(33!)(2!)(11!)(17!)(3!)].
Taking logs to base ten, we have

        Numerator                          Denominator
log10(13!) =  9.7942803               log10(33!) = 36.9386857
log10(20!) = 18.3861246               log10(2!)  =  0.3010300
log10(19!) = 17.0850946               log10(11!) =  7.6011557
log10(14!) = 10.9404084               log10(17!) = 14.5510685
                                      log10(3!)  =  0.7781513
log10(Num.) = 56.2059079              log10(Den.) = 60.1700912
8. H. Arkin and R. R. Colton, Tables for Statisticians (New York, Barnes and Noble, 1959).
Then, calling this example case (i), we have
log10(P_i) = 56.2059079 − 60.1700912 = −3.9641833, whence P_i = .0001086.
This value of P is the probability that exactly the deviations from random values that we have in Table XLV will occur on a single trial. What we require is the probability that deviations as large as these or larger than these will occur. Such a probability may be calculated by finding the probabilities of all less probable sets of frequencies in a table such as Table XLV, and summing them. Table XLV could be symbolized as
2, 11
17, 3
The less probable sets of frequencies, with deviations from random still greater than in Table XLV, are
1, 12
18, 2
for case (ii) and
0, 13
19, 1
for case (iii). If the marginal totals are to be preserved (and they must be), there are no other possibilities. In the general case, one can obtain the less probable sets of frequencies by successively reducing the least cell frequency by one until it reaches zero. Applying the factorial method to cases (ii) and (iii) we find that P_ii = .000,003,017 and P_iii = .000,000,024. Then
P = .000,108,600 + .000,003,017 + .000,000,024 = .000,111,857.
This is close to the value which one would find from a chi-squared table, with 1 d.f. and χ² = 15.63, though this is beyond the range of any normal table. This value of P, just found, is the exact value. Clearly the factorial method is not practicable if the least cell frequency is large. If it should be, say 53, one would have less probable states using 52, 51, 50, etc., right down to zero, requiring 54 calculations. One should not use the factorial method if the least frequency is larger than five. Under these conditions, the chi-squared method will give a sufficiently accurate result. In a borderline case, one might use the factorial method, if a very exact value of P is required.
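With a computer the factorial method needs no log tables at all. The following Python sketch (standard library only) sums the probabilities of Table XLV and of its two less probable sets of frequencies, stepping the least cell frequency down to zero while the marginal totals are held fixed.

from math import factorial as fact

def p_table(a, b, c, d):
    # probability of exactly these four cell frequencies, with the margins fixed
    return (fact(a + b) * fact(c + d) * fact(a + c) * fact(b + d)) / (
        fact(a + b + c + d) * fact(a) * fact(b) * fact(c) * fact(d))

a, b, c, d = 2, 11, 17, 3          # observed frequencies of Table XLV
exact_p = 0.0
while a >= 0:
    exact_p += p_table(a, b, c, d)
    a, b, c, d = a - 1, b + 1, c + 1, d - 1    # next less probable set, margins preserved
print(exact_p)                     # about .000112, the exact P of the text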
Larger contingency tables
The two-by-two contingency table is of great value, but sometimes larger tables must be used, i.e. two-by-N tables. In such cases one could extend the reasoning used thus far in this chapter, but easier methods are preferable. Let us look into the entirely fictitious case in which patients, presumably identical otherwise, are placed on various floors of a hospital (not on the floors, of course, but in rooms located on various floors). Does this have anything to do with their improvement or lack of improvement? In other words, is there any association between the floor location of a patient's room and his improvement or lack of it? Table XLVI shows a relevant set of data, the frequencies in the cells being the number of patients who did (or did not) improve, being located on the various floors. For each column of the table, signifying a floor, we form the quantities (A' + B'), (A')/(A' + B'), and (A')(A')/(A' + B'). In the right-hand column we form first A = the row total of the (A')'s, and then B = the row total of the (B')'s.

TABLE XLVI.—Fictitious data for the improvement (or non-improvement) of patients in rooms on various floors of a hospital.

                                     Floor on which room is located
                                  I        II       III      IV        V        VI      Row totals and other quantities
Patient is improved = A'         27        11        11        3        5        10     A = Σ A' = 67
Patient is not improved = B'     28         4         9       21       14        12     B = Σ B' = 88
Column totals A' + B'            55        15        20       24       19        22     A + B = N = 155
The ratio (A')/(A' + B')        .4909    .7333     .5500    .1250    .2632     .4545    A/(A + B) = .4323
The quantity (A')(A')/(A' + B') 13.2542   8.0663    6.0500   .3750   1.3160    4.5450   (A)(A)/(A + B) = 28.9641

Note. (A)(A)/(A + B) is not the row total of the last row of the table, but is determined from the values of A and A + B above it in the right-hand column. This is important.
Now the quantity (A + B) is what it says, and is not the row total in practice, though it is in theory. The next quantity, (A)/(A + B), is emphatically not the row total, but is calculated by dividing A by (A + B). Similarly, the last quantity is not a row total, but is calculated from A and (A + B). This point has perhaps been laboured, but research workers have been known to do it the wrong way!
It can now be shown that
χ² = {Σ [(A')²/(A' + B')] − (A)(A)/(A + B)} / {[A/(A + B)][1 − A/(A + B)]} = (33.6066 − 28.9641)/[(.4323)(.5677)] = 18.92.
The degrees of freedom are (R − 1)(C − 1), where R = number of rows and C = number of columns. They number then (6 − 1)(2 − 1) = 5. From a table of chi-squared with 5 d.f., P is less than .01. It follows that the deviations from `no effect of being on a given floor' are not likely to be due to chance. Therefore, in this fictitious hospital, it does matter which floor you are on. After a casual examination of the table, one would try to get on either floor II or floor III. Where X is the weight (in pounds) of chocolates given to the admitting nurse, if X is about five, one might be on floor II. Where X = .5, little will be done for you.

Other statistics expressing relationships
A number of statistics have been devised, rather similar to the contingency coefficient and related to the product-moment correlation coefficient, some of them for special purposes. Ferguson's Statistical Methods, Chapter XIII, gives a clear discussion of some of these. Be it noted that the correlation ratio is now known to be invalid.
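For the hospital data of Table XLVI, the whole chi-squared calculation can be written as the following short Python sketch (standard library only); it reproduces, to rounding, the value found above.

improved     = [27, 11, 11, 3, 5, 10]     # A' by floor I..VI
not_improved = [28,  4,  9, 21, 14, 12]   # B' by floor I..VI

A = sum(improved)                          # 67
B = sum(not_improved)                      # 88
p = A / (A + B)                            # .4323
sum_term = sum(a * a / (a + b) for a, b in zip(improved, not_improved))   # about 33.61
chi_squared = (sum_term - A * A / (A + B)) / (p * (1 - p))                # about 18.9
df = (2 - 1) * (len(improved) - 1)                                        # 5
print(round(chi_squared, 2), df)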
XII
THE ANALYSIS OF VARIANCE
THIS VERY IMPORTANT and most useful method of statistical analysis was invented by R. A. Fisher (now Sir Ronald Fisher), and an account of it was first published in 1923. Since that time it has been used in a vast variety of work. Knowledge of its ramifications has so expanded that a proper study would require a large book in itself. In this present text we can attempt no more than an elementary introduction, from which the reader can go on to further study should he wish to do so. This introduction has been developed in such a way as to avoid as much as possible the rather intricate notations which have been devised elsewhere.
The first question the student always asks is: `What is the analysis of variance, and what can one do with it?' Let us try to find the answers. Let us suppose that we are going to study the effect of a certain food on the weight of some kind of animal (at maturity). We therefore obtain fifty presumably-identical young animals and rear them on the food, all in one batch and all looked after by one man. We would find various weights at maturity such as, say, in Table XLVII. X is the weight at maturity, and f shows the number of animals (out of fifty) at each given weight, measured in pounds. From Table XLVII we calculate
Σ f = N = 50,   Σ fX = 1751,   X̄ = 1751/50 = 35.02,
Σ f(X − X̄)² = 2274.9800,   SD² = 2274.9800/50 = 45.50.
In ordinary statistical analysis, the quantity SD² = 45.50 is what is usually called the variance; the analysis of variance, however, is a technique which partitions the sum of squares Σ f(X − X̄)² (not the variance SD²) into a number of parts which are thought to be related to differing causal circumstances. These sums of squares are then used to obtain estimates of variance in the ordinary sense.
We see from Table XLVII that there is some variation of mature weight, resulting in the sum of the squares of deviations (= 2274.9800) from the mean weight. Had all the animals had the same weight, there would, of course, have been no deviations and
no sum of squares. If we now assumed that all the animals had, in every particular, identically the same treatment, the only source of this variation must be the inherent variability of the animals themselves. That is to say (to return to the term `variance'), the whole variance of the result is due to inter-animal variance. There is no other source.

TABLE XLVII.—Fictitious data, being the weights (at maturity) of fifty animals reared on a certain food.

Weight,        (X − X̄)   (X − X̄)²  |  Weight,        (X − X̄)   (X − X̄)²
lb. = X    f                         |  lb. = X    f
  20       1   −15.02    225.6004    |    35       6     −.02       .0004
  21       0                         |    36       0
  22       1   −13.02    169.5204    |    37       4    +1.98      3.9204
  23       0                         |    38       2    +2.98      8.8404
  24       1   −11.02    121.4404    |    39       1    +3.98     15.8404
  25       1   −10.02    100.4004    |    40       1    +4.98     24.8004
  26       2    −9.02     81.3604    |    41       4    +5.98     35.7604
  27       3    −8.02     64.3204    |    42       2    +6.98     48.7204
  28       1    −7.02     49.2804    |    43       5    +7.98     63.6804
  29       2    −6.02     36.2404    |    44       0
  30       0                         |    45       0
  31       4    −4.02     16.1604    |    46       1   +10.98    120.5604
  32       1    −3.02      9.1204    |    47       1   +11.98    143.5204
  33       3    −2.02      4.0804    |    48       0
  34       2    −1.02      1.0404    |    49       1   +13.98    195.4404
Generally speaking, however, we don't do experiments in this way. We may divide the animals into lots, and put a different man in charge of each lot. Let us suppose that we did the experiment by dividing the animals into five lots of ten animals each. These lots are to be looked after respectively by Jones, Brown, Smith, Green, and White. We shall assume that, apart from this change, all the animals receive the same treatment. We have now introduced another cause of variability. We have the variability within each lot leading to what is known as the within-lots sum of squares, and also the new between-lots variability leading to the between-lots sum of squares. The between-lots sum of squares and the within-lots sum of squares add up to the overall sum of squares which, as we shall see, is 2274.98. Now these two sums of squares have different numbers of degrees of freedom. The between-lots sum has (5 - 1) = 4 d.f., whereas the within-lots sum has 5(10 - 1) = 45 d.f. The whole experiment has (50 - 1) = 49 d.f., and it will be noted that 45 + 4 = 49. We shall now proceed to compute the two sums of squares and sort them out together with their degrees of freedom; this is what one does in the analysis of variance. We shall then see that certain significance tests can be carried out by means of variances determined from the sums of squares. Let us, for a preliminary canter, think about the between-lots sum of squares. We could hardly expect all the lot-means to have the same value; it would be reasonable
to expect some variation among them. However, if this is greater than it ought to be, in terms of the number of degrees of freedom (d.f. = 4), we begin to suspect that all is not well. We begin to think that the five lots did not all receive the same treatment and so, if we consider the matter in the formal statistical sense, the lot-means do not all come from the same universe. Looking at the situation another way, if the within-lots sum of squares is larger than it should be in terms of 45 d.f., we may suspect that the trouble lies in the fact that the animals themselves are more variable than we had assumed, while Jones, Brown, Smith, et al. are doing their best. We shall of course check by significance tests. The experiment, set up in this new way, is tabulated in Table XLVIII.
In making the calculations of Table XLVIII, we used a modification of the group deviation method of finding the standard deviation, SD² = Σ (X − X̄)²/N. We require only the numerator of this fraction. Using an assumed mean, Z (see Chapter III), and taking deviations (D') from this, we have
SD² = Σ (D')²/N − [Σ (D')/N]²,
whence
Σ (X − X̄)² = N(SD)² = Σ (D')² − [Σ (D')]²/N.
Now if we set Z = 0, the deviations (D') from it are simply the values of X (the weights in the columns of Table XLVIII), and Σ (X − X̄)² = Σ X² − (Σ X)²/N for each column. For the column headed `Jones,' we have then that Σ (X²)Jones = 11804 and (Σ X)²/N for Jones = 114244/10 = 11424.4. Thus Σ (X − X̄)²Jones = 11804 − 11424.4 = 379.6. Similar calculations are carried out for the other columns. In the extreme right column at the bottom of the table, the first and third values are row totals, and the others are calculated as for any other column. We are now in a position to calculate the sums of squares needed.

The within-lots sum of squares
Here we are not concerned with variability between lots, but only with that within each lot, between the animals in the lot. That is to say, for each lot we determine the value of Σ (X − X̄)², yielding 379.6, 601.6, 539.6, 266.4, and 404.9. Adding these, we obtain 2192.1, the within-lots sum of squares. Note that this is not 2274.98, which is the grand sum of squares (within-lots plus between-lots).

The between-lots sum of squares
Here we have no interest in the variability within lots. We are concerned with the variability of the lot-means about their grand mean. We could obtain this value by pretending that all the animals in each given lot had the same weight (the lot-mean for that lot). We then calculate the sum of the squares of the deviations of the lot-means from the grand mean, and multiply this total by ten because we have ten
TABLE XLVIII.—The data of Table XLVII, arranged as if the animals had been reared in five lots of ten animals.

                              Name of man in charge of each lot
                       Jones     Brown     Smith     Green     White     Calculations from the rows, etc.
Weight of animals        40        47        22        27        41
at maturity,             35        42        24        31        49
in pounds                41        37        31        34        31
                         37        26        35        43        27
                         26        41        46        35        43
                         33        33        41        29        37
                         43        34        35        28        39
                         27        43        37        32        29
                         25        20        43        33        35
                         31        35        38        42        38

Calculations from the columns, etc.:
Σ X =                   338       358       352       334       369      Σ Σ X = 1751
X̄ = Σ X/10 =           33.8      35.8      35.2      33.4      36.9     1751/50 = 35.02
Σ (X²) =              11804     13418     12930     11422     14021     Σ Σ X² = 63595
(Σ X)² =             114244    128164    123904    111556    136161     (1751)² = 3066001
(Σ X)²/10 =          11424.4   12816.4   12390.4   11155.6   13616.1    3066001/50 = 61320.02
Σ (X²) − (Σ X)²/10 =   379.6     601.6     539.6     266.4     404.9     63595 − 61320.02 = 2274.98
animals in each lot. The calculations are shown in Table XLIX. The value is seen to be (10)(8.2880) = 82.880, and 82.880 + 2192.1 = 2274.98; that is to say, between-lots sum plus within-lots sum equals total sum of squares.

TABLE XLIX.—Calculation of the between-lots sum of squares from the data of Table XLVIII.

Lot looked after by   Lot mean   Deviation of lot-mean from grand mean of 35.02   Deviation squared   Number of animals in lot = f
Jones                   33.8                     −1.22                                  1.4884                  10
Brown                   35.8                     + .78                                   .6084                  10
Smith                   35.2                     + .18                                   .0324                  10
Green                   33.4                     −1.62                                  2.6244                  10
White                   36.9                     +1.88                                  3.5344                  10
                                                                          Total         8.2880
The degrees of freedom
For the total sum of squares of deviations from the grand mean, amounting to 2274.98, we have lost 1 d.f. because we do have a fixed grand mean = 35.02. The degrees of freedom are then 50 − 1 = 49. For the within-lots sum we have five lots, each of which has lost 1 d.f. because it has a fixed lot-mean. The total is then 5(10 − 1) = 45. For the between-lots sum we have five lots, and have lost 1 d.f. because of the fixed grand mean, leaving 5 − 1 = 4 d.f. We can now set up the table of the analysis of variance of this example, as in Table L.

TABLE L.—The analysis of variance of the data of Table XLVIII.

Source of variance (sum of squares)   Sum of squares   Degrees of freedom   Estimate of variance
Between lots                               82.88                4           82.88/4 = 20.72
Within lots                              2192.10               45           2192.10/45 = 48.71
Total                                    2274.98               49
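The partition set out in Table L can be reproduced with a short Python sketch (standard library only); the weights are those of Table XLVIII, and the final line computes the ratio of the two variance estimates used in the F test discussed below.

lots = {
    'Jones': [40, 35, 41, 37, 26, 33, 43, 27, 25, 31],
    'Brown': [47, 42, 37, 26, 41, 33, 34, 43, 20, 35],
    'Smith': [22, 24, 31, 35, 46, 41, 35, 37, 43, 38],
    'Green': [27, 31, 34, 43, 35, 29, 28, 32, 33, 42],
    'White': [41, 49, 31, 27, 43, 37, 39, 29, 35, 38],
}
all_x = [x for lot in lots.values() for x in lot]
N = len(all_x)                                        # 50
grand_mean = sum(all_x) / N                           # 35.02
total_ss = sum((x - grand_mean) ** 2 for x in all_x)  # 2274.98
within_ss = sum(sum((x - sum(lot) / len(lot)) ** 2 for x in lot)
                for lot in lots.values())             # 2192.10
between_ss = sum(len(lot) * (sum(lot) / len(lot) - grand_mean) ** 2
                 for lot in lots.values())            # 82.88
k = len(lots)
between_estimate = between_ss / (k - 1)               # 20.72, with 4 d.f.
within_estimate = within_ss / (N - k)                 # 48.71, with 45 d.f.
print(round(total_ss, 2), round(between_ss, 2), round(within_ss, 2),
      round(between_estimate / within_estimate, 3))   # the last value is F, about .425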
It will be seen in Table L that we have neatly partitioned the total sum of squares into two parts, belonging to the within-lots and the between-lots. From these we have obtained two estimates of variance: the value, 20.72, derived from the between-lots sum of squares with 4 d.f., and 48.71, derived from the within-lots sum of squares with 45 d.f. The question now arises: `Did our five assistants all do their rearing work in the same way?' Had we been testing dosages of a drug, we would have asked: `Did it
make any difference what dosage we used?' This sort of question can be answered by the use of Snedecor's F test (see Chapter IV). Let us look at the matter thus: all these fifty animals presumably came from one universe in the statistical sense. This universe had some variance, and here we are speaking of variance in the usual sense, i.e. as the SD². We have made two estimates of this variance: the inter-class variance of 20.72 and the intra-class variance of 48.71. We could not expect these to be the same except in the trivial and unlikely situation in which all fifty animals reared to the same weight, when SD² would be zero. It is reasonable to expect on the grounds of chance alone that the two variance estimates would differ, but they should not be too different, i.e. not more than one would expect, given four and forty-five degrees of freedom. If they do differ more than this, we might well begin to feel that the five lots of animals are not from the same universe, and that Jones, Brown, Smith, et al. are going to be called in for some special treatment themselves! We therefore set up the null hypothesis that the inter-class variance is not greater than might be expected in relation to that of the intra-class, given 4 and 45 d.f.; that is to say, the five lot-means do not differ significantly from five lot-means derived from a single universe. In the common usage of the F test, the greater estimate of variance is put in the numerator, the lesser in the denominator, and
F = (greater estimate of variance)/(lesser estimate of variance).
In the analysis of variance, however, the variance estimate derived from the between-lots sum of squares is always placed in the numerator, whether or not it is the `greater' estimate, and we have in this case
F = (variance estimate from between-lots sum of squares)/(variance estimate from within-lots sum of squares) = 20.72/48.71 = .425.
Inasmuch as the expected value of F for full confirmation of the null hypothesis is F = 1.00, the hypothesis is certainly confirmed in this case, and there is no need even to look at an F table. Values of F less than 1.00 are not common, but they do occur if one has variable material in lots with similar lot-means.

The T test and the significance of the difference between lot-means
Had the null hypothesis not been confirmed, one might wish to carry out tests for the significance of the differences between pairs of lot-means. There is some doubt as to how to apply the T test here, but further discussion on the matter is beyond the scope of this elementary book.

More complicated cases of the analysis of variance
The example just finished is an example of what is known as the one-way analysis, but two-way and even higher types do exist. In the two-way type, one might study the
effect of two different kinds of treatment upon a given object. There are then three sums of squares: the `between-treatments' sum, the `within-treatments' sum, and the `interaction' sum, which expresses the fact that what happens from one treatment is not independent of the other. These higher types all require the use of more or less complex notations and, again, are beyond the scope of this book. Ferguson's Statistical Analysis deals very clearly with them and should be consulted.

Why use the analysis of variance?
At the beginning of this chapter we raised the common question: `What is the analysis of variance, and what can one do with it?' It is hoped that some indication of the answer has been given. It permits one to test treatments, and more than one treatment at a time. When the final answer has a variance (as it always does), it permits one to pin-point the source of the variability, and then, perhaps, to improve one's technique so as to reduce variability from the culprit source. To give but one example, if the results of toxicological experiments are less certain when rats from one source are used than they are with other rats, an analysis of variance may show that the trouble lies in variable animals, not in non-standard technique. The ideal answer, which one never gets, is one which has no standard error at all. Where there is a standard error, it exists because all the various factors in the experiment have combined together to make the answer not quite exact. The analysis of variance will tell you what has been going on. If, in the experiment of Table XLVIII, the null hypothesis had not been confirmed because of poor work on the part of Jones, Smith, et al., we can think of discharging them all. In this particular example, we would try to get more-closely standardized animals.
XIII
FIFTY PER CENT LETHAL DOSE, OR LD50
LET US CONSIDER the important and common problem of the reaction of experimental animals or plants to increasing dosages of some environmental factor, be it a drug, temperature, time, or what-not, which, if sufficiently large, will kill (or otherwise affect) the experimental subjects. Through common experience, we know that some animals are very easily killed. They drop dead at the sight of the laboratory bottle, so to speak. Others from the same lot are very hardy. The majority succumb to an average dose of a lethal substance. It is obvious that if we wish to obtain some measure of the effectiveness of a poison, it would be no use talking about the smallest dose or even the largest dose. We ought rather to talk in terms of some average dose. In point of fact, it turns out that the best measure is that dosage which will, on the average, kill fifty per cent of the subjects exposed to it. This is called the fifty per cent lethal dose, or LD50. This concept is very widely used. On looking up the toxicity of DDT, for example, one reads that the oral LD50 for rats is 113 mg/kg. That is to say that, on the average, if rats are given 113 mg. of DDT per kilogram of body weight, fifty per cent of such rats will die. One may even think of exposure to `time,' and say that if Tribolium eggs are held at 30°C. and 75 per cent relative humidity, fifty per cent will hatch after exposure to 116.86 hours of time. Here, time is the `drug,' and the egg dies as an egg and is reborn as a larva.
The following account of LD50 is based on the exposition by C. I. Bliss in `The Calculation of the Dosage Mortality Curve' (Annals of Applied Biology). Let us suppose that we dose a very large number of experimental animals with some toxic substance. For any given dosage, there will be a variation in susceptibility within each lot of animals. Thus, a given dosage will split each lot into two parts, the dead and the living. The relative proportions of these two groups will depend upon the distribution of susceptibilities within the average lot of animals. The situation after dosing could then be plotted as something like curve A of Fig. 14. In this curve, at a dosage X, the distribution is split into a shaded group (the living) and an unshaded group (the dead). The kill-percentage is the percentage ratio of unshaded to total area under the curve, and curve A will locate one point on curve B. Let us repeat the experiment many times, using increasing dosages, with a new and presumably standard lot of animals for each dosage. Calculating and plotting the percentages of kill, we obtain the sigmoid kill-percentage curve, B.
[Fig. 14.—A generalised dosage-mortality curve, A (the normal frequency distribution of individual susceptibilities), its summated kill-percentage sigmoid curve, B, and the straight line formed from this by the probit transformation, C. The abscissae are marked as X = dosage, as deviations from the mean dosage, as the ratio deviation/standard deviation, and as probit dosages; the ordinate of curve B is the cumulative percentage kill.]
Now, a difficulty supervenes. Assuming that A is a normal frequency distribution, we would measure the dosage not in absolute terms, but in terms of (deviation from mean dosage)/(standard deviation), the usual way for such curves. The fifty per cent kill would be at the mean, with zero deviation of dosage from the mean, and dosages larger than the mean would plot as positive abscissae. But dosages less than the mean, using negative deviations, would plot as negative abscissae, and this is inconvenient.
To avoid this problem, a new system of statistical units is set up, called probits. This system is so arranged that the zero of the usual dev/SD is set at +5, and it works out so that a .01 per cent kill has a probit of 0 and a 50 per cent kill has a probit of 5.0, while a 99.99 per cent kill has a probit of 10.0. In most cases, use of a probit table changes the sigmoid curve to a straight line. If it does not do so, one can plot against log dosage, or some other function. The curve is sometimes convex upward. In effect then, one starts with a sigmoid dosage-mortality curve plotted in terms of absolute dosages, and transforms this to a substantially straight line in terms of (deviation of dosage from mean dosage)/(standard deviation of dosages). No doubt the above is not, at the moment, too clear to the reader, but it will become so as we work through an example. Let us suppose that numerous batches, each consisting of 25 eggs of the flour beetle, Tribolium confusum, are put in an incubator at 30°C. and 75 per cent relative humidity. All eggs have been laid in a very brief and negligible period of time, and can therefore be considered as being of the same age at any given moment. As time passes, these eggs hatch and pass successively through first, second, third, fourth, and fifth instars. We shall determine the moment at which the `dosage' of time shall have been sufficient to cause fifty per cent to moult from fifth to sixth instar. One can think of the insects dying as fifth instar larvae to become sixth instar `corpses.' This fifty per cent time might perhaps more appropriately be called `moulting time fifty per cent,' but we shall continue to refer to it as an LD50 to avoid confusion. Table LI shows the data from an actual experiment. Seven exactly replicated batches, each of 25 eggs, were set up. By the time the fifth instar was reached, some had died through natural causes, thus conveniently providing us with seven experimental lots not all of the same size, as shown in column 2 of this table. Column 1 shows the `time dosage' in hours since laying, and these values are coded in column 3 to reduce the labour of calculations to be made. The coding in this case is: `Divide by ten and subtract 64.'
TABLE LI.—Data and calculations for the time at which fifty per cent of fifth instar larvae have moulted to sixth instar, Series I.

Columns, for reference: (1) time since laying, in hours; (2) no. of insects; (3) X = coded hour; (4) no. of sixth instar; (5) P; (6) Q; (7) Yprob; (8) Yline; (9) Yw (working probit); (10) W = weight; (11) Ycalc; (12) expected P; (13) PN.

 (1)    (2)   (3)   (4)    (5)      (6)     (7)     (8)     (9)    (10)     (11)    (12)     (13)
 640    21     0     0    .000    1.000      —     1.80    1.51    0.17    1.74    .001     .021
 650    19     1     0    .000    1.000      —     2.80    2.41    1.74    2.76    .013     .247
 660    15     2     1    .067     .933    3.50    3.80    3.55    5.55    3.77    .109    1.635
 670    17     3     8    .471     .529    4.93    4.80    4.93   10.67    4.79    .417    7.089
 680    19     4    17    .895     .105    6.25    5.80    6.17    9.55    5.80    .788   14.972
 690    17     5    15    .882     .118    6.19    6.80    5.76    3.06    6.82    .966   16.422
 700    20     6    20   1.000     .000      —     7.80    8.12    0.49    7.83    .998   19.960
Column 4 shows the numbers (at each successive ten-hour observation) which have moulted to the sixth instar. Had we been studying a poison, these would be the `dead.' Column 5 shows these values as decimal fractions of the values in column 2, and these fractions can be thought of as the a posteriori probabilities that an insect will be in the sixth instar at each observation. Column 6 shows Q = (1 − P).

A brief summary of the method
As the calculation of LD50 is somewhat involved, it might be as well here to outline it briefly before plunging into an example. We have then a cumulative `mortality' curve in the data of columns 3 and 4. Using a table, we convert the values of column 5 to probits and plot the results against column 3, obtaining a reasonably straight line. We then fit a provisional straight line to this by eye and obtain values for column 8. These are called the Yline values. Using another table, we convert the values of Yline to Yw (working values of Y), in column 9. Having found (also through a table) the `weights' of column 10, we then calculate, by the method of least squares, a final straight line. From this we can then calculate LD50 and its standard error. We can also check the goodness-of-fit by a chi-squared test. If the fit is not good, we use the final line as a second provisional line, and do the whole job over again. If the fit is still bad, we must then conclude that the raw data are inadequate to determine a good value of LD50. Let us now do an example.

Example of the calculation of LD50 from the data of Table LI
Table LII is an abbreviated probit table. Selecting P = .067 opposite X = 2, we find a probit value of 3.50 as entered in column 7. P = .882 yields a probit of 6.19, and so we fill up column 7. We cannot get probit values at X = 0, 1, and 6 as the table does not go quite to P = 0 or P = 1, but see Table LVII on the subject. We plot the values of column 7 against X and obtain Fig. 15 and, as the fit is fairly obvious, we run in by eye the straight line as shown in that figure, as the provisional line. Had the fit been less easy, we would have calculated this line by the method of least-squares, setting the values of column 2 arbitrarily as all equal to one (an unweighted fit).
[Fig. 15.—The `provisional' regression line, fitted by eye to the data of columns 1 and 7 of Table LI. Abscissa: dosage, i.e. time in hours since the laying of the eggs (640 to 710), with the coded dosage X marked beneath; ordinate: probit of P. The line crosses probit 5.0 near LD50 = 672.079.]
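Where a printed probit table is not at hand, the transformation of column 5 can be sketched in Python; the probit is simply 5 plus the normal deviate corresponding to the proportion P (NormalDist is in the standard library from Python 3.8 on).

from statistics import NormalDist

def probit(p):
    return 5 + NormalDist().inv_cdf(p)

for p in (0.067, 0.471, 0.895, 0.882):
    print(p, round(probit(p), 2))    # about 3.50, 4.93, 6.25, 6.19 in turn, as in column 7 of Table LI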
TABLE LII.—Transformation of the sigmoid dosage-mortality curve to a straight line.*

  P      .000     .001     .002     .003     .004     .005     .006     .007     .008     .009
.000     .0000   1.9089   2.1218   2.2522   2.3479   2.4242   2.4879   2.5427   2.5911   2.6344
.01     2.6737   2.7096   2.7429   2.7738   2.8027   2.8299   2.8556   2.8799   2.9031   2.9251
.06     3.4452   3.4536   3.4618   3.4699   3.4780   3.4859   3.4937   3.5015   3.5091   3.5167
.10     3.7184   3.7241   3.7298   3.7354   3.7409   3.7464   3.7519   3.7574   3.7628   3.7681
.14     3.9197   3.9242   3.9286   3.9331   3.9375   3.9419   3.9463   3.9506   3.9550   3.9593
.20     4.1584   4.1619   4.1655   4.1690   4.1726   4.1761   4.1796   4.1831   4.1866   4.1901
.39     4.7207   4.7233   4.7259   4.7285   4.7311   4.7337   4.7363   4.7389   4.7415   4.7441
.41     4.7724   4.7750   4.7776   4.7802   4.7827   4.7853   4.7879   4.7904   4.7930   4.7955
.47     4.9247   4.9272   4.9298   4.9323   4.9348   4.9373   4.9398   4.9423   4.9448   4.9473
.50     5.0000   5.0025   5.0050   5.0075   5.0100   5.0125   5.0150   5.0175   5.0201   5.0226
.55     5.1257   5.1282   5.1307   5.1332   5.1358   5.1383   5.1408   5.1434   5.1459   5.1484
.69     5.4959   5.4987   5.5015   5.5044   5.5072   5.5101   5.5129   5.5158   5.5187   5.5215
.78     5.7722   5.7756   5.7790   5.7824   5.7858   5.7892   5.7926   5.7961   5.7995   5.8030
.88     6.1750   6.1800   6.1850   6.1901   6.1952   6.2004   6.2055   6.2107   6.2160   6.2212
.89     6.2265   6.2319   6.2372   6.2426   6.2481   6.2536   6.2591   6.2646   6.2702   6.2759
.90     6.2816   6.2873   6.2930   6.2988   6.3047   6.3106   6.3165   6.3225   6.3285   6.3346
.93     6.4758   6.4833   6.4909   6.4985   6.5063   6.5141   6.5220   6.5301   6.5382   6.5464
.96     6.7507   6.7624   6.7744   6.7866   6.7991   6.8119   6.8250   6.8384   6.8522   6.8663
.98     7.0537   7.0558   7.0579   7.0600   7.0621   7.0642   7.0663   7.0684   7.0706   7.0727
.998    7.8782   7.8943   7.9112   7.9290   7.9478   7.9677   7.9889   8.0115   8.0357   8.0618
.999    8.0902   8.1214   8.1559   8.1947   8.2389   8.2905   8.3528   8.4316   8.5401   8.7190
* Abridged, with slightly altered format, from Table IX of Fisher and Yates' Statistical Tables for Biological, Agricultural and Medical Research, published by Oliver and Boyd Ltd., Edinburgh, and by permission of the authors and publishers.
We then pick off two readily-seen points on this line, i.e. X = 0, Y = 1.80 and X = 6, Y = 7.80. Then
Slope = (7.80 − 1.80)/(6 − 0) = 1.000 (in this particular case only).
Knowing the intercept on the Y-axis and the slope, we can then fill in the Yline values of column 8. These values are not suitable for the calculation of the final line, and must be converted to `working probits' as in column 9. For this conversion we use Table LIII, and the process is as follows: a Yline value is selected and found in column 1 of Table LIII. Opposite it in column 4, we find the maximum working probit. This is decreased by Q (from column 6 of Table LI) times the range from column 3 of Table LIII. For example, selecting Yline = 3.80, the maximum working probit is seen to be 8.3571, Q = .933, and the range is found to be 5.1497. Then
Yw = 8.3571 − (.933)(5.1497) = 3.5524 (3.55 approximately).
Interpolation may be needed if one cannot find the exact value of Yline in column 1 of Table LIII. If Q = 0, no decrease is necessary, and the maximum working probit is entered directly in column 9 of Table LI.
The weights are easily calculated. Opposite the Yline value in column 1 of Table LIII, find the weighting coefficient in column 5. Multiply this by the appropriate value from column 2 of Table LI. Thus, for Yline = 3.80, W = (15)(.37031) = 5.55465 (5.55 approximately), as shown in column 10 of Table LI. Be it noted that, as one fills in column 10, the weights at first increase and then decrease. The weight is at the maximum at LD50 which, as seen, must be near 670 hours, at X = 3. Actually, it is at 672.079 ± 1.763 hours. We are now ready to fit the final line.
TABLE LIII.—Weighting coefficients and probit values to be used for final adjustments.*

Expected probit      Minimum working probit       Range, used to      Maximum working probit,     Weighting coefficient,
= Yline, Table LI    (not used in this example)   calculate Yw        used to calculate Yw        used to calculate W
      1.8                   1.5118                  419.4000               420.9000                     .00828
      2.5                   2.1457                   57.0500                59.2000                     .04979
      2.8                   2.4081                   28.1890                30.5970                     .09179
      3.3                   2.8261                   10.6330                13.4590                     .20774
      3.8                   3.2074                    5.1497                 8.3571                     .37031
      4.1                   3.4083                    3.7582                 7.1665                     .47144
      4.8                   3.7241                    2.5573                 6.2814                     .62741
      4.9                   3.7407                    2.5192                 6.2599                     .63431
      5.7                   3.2724                    3.2025                 6.4749                     .53159
      5.8                   3.0794                    3.4519                 6.5313                     .50260
      6.5                  −0.7050                    7.7210                 7.0158                     .26907
      6.8                  −5.4110                   12.6660                 7.2551                     .17994
      7.3                 −27.6230                   35.3020                 7.6786                     .07564
      7.8                −118.2200                  126.3400                 8.1228                     .02458
* Abridged from Table IX2 of Fisher and Yates' Statistical Tables for Biological, Agricultural published by Oliver and Boyd Ltd., Edinburgh, and by permission of the authors and publishers.
and Medical Research,
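For those who wish to check the conversion by machine, here is a minimal Python sketch of the working-probit and weight calculations just described. The function names and dictionary layout are mine, not the author's; only the Table LIII row for Yline = 3.80, the one used in the example above, is included.

```python
# A minimal sketch of the working-probit and weight calculations.
# Only the Table LIII row for Yline = 3.8 is included; the layout is illustrative.
TABLE_LIII = {
    # Yline: (range, maximum working probit, weighting coefficient)
    3.8: (5.1497, 8.3571, 0.37031),
}

def working_probit(y_line, q):
    """Yw = maximum working probit - Q * range."""
    rng, max_wp, _ = TABLE_LIII[y_line]
    return max_wp - q * rng

def weight(y_line, n):
    """W = N * weighting coefficient, N being the number of insects at that dosage."""
    _, _, coeff = TABLE_LIII[y_line]
    return n * coeff

print(round(working_probit(3.8, 0.933), 4))  # 3.5524, as in the text
print(round(weight(3.8, 15), 5))             # 5.55465, as in the text
```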
For convenience, in case a calculating machine which will do triple products is not available, the data from Table LI are shown again in Table LIV, with Yw² and XYw added. The line we are going to fit is simply the least-squares regression line between Yw and X (see Chapters VII and VIII), and so we calculate
ΣW = 31.230, ΣWX = 101.290, X̄ = 3.2434, ΣWY = 157.284, Ȳ = 5.0363,
ΣWX² = 366.910, ΣWY² = 837.160 (needed later for the chi-squared test), ΣWXY = 549.103.
In the usual way, the slope of the line is then
byx = [ΣWXY - X̄(ΣWY)] / [ΣWX² - X̄(ΣWX)] = [549.103 - (3.2434)(157.284)] / [366.910 - (3.2434)(101.290)] = 1.01525.
The slope of the provisional line was byx = 1.0000, a good approximation. The equation of the regression line is then
Y = Ȳ + byx(X - X̄) = 5.0363 + (1.0152)(X - 3.2434), i.e. Y = 1.7436 + 1.0152X.
From this equation, the values of column 11 (Ycalc) can be filled in. This line cuts the line Yline = 5.00 at
LD50 = X̄ + (5.00 - Ȳ)/byx = 3.2434 + (5.00 - 5.0363)/1.0152 = 3.2079 (coded X-units).
TABLE LIV.-Data for calculating LD50 and byx, from Table LI.
X    X²    W       Yw     Yw²      XYw
0    0     0.17    1.51    2.280    0.00
1    1     1.74    2.41    5.808    2.41
2    4     5.55    3.55   12.602    7.10
3    9    10.67    4.93   24.305   14.79
4    16    9.55    6.17   38.069   24.68
5    25    3.06    5.76   33.178   28.80
6    36    0.49    8.12   65.934   48.72
The original coding was: `Divide by ten and subtract 64.' Reversing this procedure we have: LD50 (decoded) = (3.2079 + 64)(10) = 672.079 hr. This means that, in this experiment, at 672 hours after the laying of the eggs, fifty per cent of the resultant larvae will, on the average, be in the sixth instar. The other fifty per cent will be still in the fifth instar, and we have the situation homologous to that of the `dead' and the `living' in toxicology.
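As a check on the arithmetic, the whole fit and the decoding of LD50 can be reproduced with the short Python sketch below. The variable names are mine, and no claim is made that this is how the original computation was carried out; the X, W, and Yw columns are simply those of Table LIV.

```python
# A sketch of the weighted least-squares fit and the decoding of LD50,
# using the X, W and Yw columns of Table LIV.
X  = [0, 1, 2, 3, 4, 5, 6]
W  = [0.17, 1.74, 5.55, 10.67, 9.55, 3.06, 0.49]
Yw = [1.51, 2.41, 3.55, 4.93, 6.17, 5.76, 8.12]

sW   = sum(W)
sWX  = sum(w * x for w, x in zip(W, X))
sWY  = sum(w * y for w, y in zip(W, Yw))
sWX2 = sum(w * x * x for w, x in zip(W, X))
sWXY = sum(w * x * y for w, x, y in zip(W, X, Yw))

x_bar, y_bar = sWX / sW, sWY / sW
b_yx = (sWXY - x_bar * sWY) / (sWX2 - x_bar * sWX)   # slope, about 1.015
ld50_coded = x_bar + (5.00 - y_bar) / b_yx           # about 3.208 coded units
ld50_hours = (ld50_coded + 64) * 10                  # decoded: about 672 hr
print(round(b_yx, 4), round(ld50_coded, 4), round(ld50_hours, 2))
```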
Significance tests related to LD50
Many significance tests have been devised relating to the fifty per cent lethal dose, dealing with the goodness-of-fit of the regression line, the accuracy of determination of LD50, intercomparisons of dosage-mortality curves, and so on. We can discuss only some of them here.
The goodness-of-fit of the final regression line*
The chi-squared test determines whether or not the final fitted regression line is a good fit to the original data (transformed to a straight line). For simplicity in writing the rather cumbersome formulae to follow, let
A = ΣWX² - X̄(ΣWX) = 38.39107 in the example of Table LI,
B = ΣWXY - X̄(ΣWY) = 38.97673,
C = ΣWY² - Ȳ(ΣWY) = 45.03444.
It can now be shown that
χ² = C - B·byx = 45.03444 - (38.97673)(1.0152) = 5.46331.
We must now determine the number of degrees of freedom for this chi-squared. Taking the Ycalc values of column 11 of Table LI, we pass them backwards through a probit table to obtain the `expected' values of P as in column 12 of Table LI. Each is then multiplied by the value of N in the same row, from column 2, to produce the values of PN in column 13 (Table LI). All values of PN less than 4.00 are lumped under 1 d.f. at the top and bottom of column 13. In this case we have only those at the top, i.e. PN = .021, .247 and 1.635. Each of the other values of PN leads to 1 d.f., for a total, then, of 4 + 1 = 5 d.f. However, we have placed two restrictions on our work, by using byx and Ȳ in determining the values of Ycalc, and so must sacrifice a further 2 d.f., leaving only 3 d.f. for the chi-squared test. At 3 d.f., with χ² = 5.463, the P value from the chi-squared table lies between .10 and .20, so that any discrepancies between the regression line and the transformed data may well have arisen by chance. Therefore the line is a good fit to the transformed data.

Estimating the accuracy of determination of LD50
Having gone to all this trouble to determine the value of LD50, we naturally want to know if it is an accurate value. Four cases must be considered:
a) Chi-squared is small, so that the regression line is a good fit, and Ȳ is close to 5.00 in value. This is the case for our example of Table LI.
b) The regression line is not as good a fit, chi-squared being larger, but Ȳ is still close to 5.00 in value.
c) Chi-squared is small, but Ȳ is not close to 5.00 in value.
d) Chi-squared is not small, and Ȳ is not close to 5.00 in value.
For cases (a) and (b) we use what may be called a large-sample technique, calculating the standard error of LD50, and using this as a measure of accuracy. For cases (c) and (d) it is not justifiable to use simply the standard error. Rather, one must calculate the fiducial limits about the supposed true value of LD50 in the universe from which the data are presumed to be drawn. Our example comes under case (a) because the regression line is well determined, and Ȳ = 5.0363. To save printing the data for other examples, however, we shall run the data of Table LI through the four cases, though only the results in the first case are proper end-results.
* We can carry out a chi-squared test here because the ordinates (Y) are probability-generated.
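Before going through the four cases, the quantities A, B, C and the goodness-of-fit chi-squared can be verified with the following Python sketch. It is illustrative only; the constants are the rounded Series I sums quoted earlier, so the last decimals differ slightly from the text's values.

```python
# A sketch of the goodness-of-fit chi-squared for the final regression line (Series I).
sW, sWX, sWY       = 31.230, 101.290, 157.284
sWX2, sWY2, sWXY   = 366.910, 837.160, 549.103
x_bar, y_bar, b_yx = 3.2434, 5.0363, 1.0152

A = sWX2 - x_bar * sWX     # about 38.39
B = sWXY - x_bar * sWY     # about 38.97
C = sWY2 - y_bar * sWY     # about 45.03
chi2 = C - B * b_yx        # about 5.5; the text, carrying more decimals, gets 5.463
print(round(A, 3), round(B, 3), round(C, 3), round(chi2, 3))
```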
Case (a)
In this case, we determine
SELD50 = (1/byx) √[ 1/ΣW + (LD50 - X̄)²/A ]
= (1/1.0152) √[ 1/31.230 + (3.2079 - 3.2434)²/38.39107 ] = ±.1763 in ten-hour units.
In terms of one-hour units, the precision is, of course, only one tenth as great, so that SELD50 = ±1.7630 hr. Now LD50/SE = 672.079/1.7630 = 381.2, so there can be no doubt at all that the value of 672.079 is an excellent determination.

Case (b)
Our data do not belong under case (b), but we may go through the motions with them, as it were. In case (b), chi-squared is relatively large, and repeated operation of the regression-line fitting does not reduce it. This indicates that there are causes of irregularity in the data other than the errors of random sampling. There is, so to speak, `Summat wrong someplace.' We must compensate for this by multiplying the simple variance of LD50 by a factor chi-squared/n, where n is the number of degrees of freedom one would use for the determination of the standard error. This would be five, in this case. (We used only three for chi-squared, but that was a special operation.) Here we use 7 minus 1 d.f., because we used Ȳ, and minus another 1 d.f., because we used byx in calculating the values of Ycalc. If we used this method we would then have
SELD50 = √[(1.7630²)(5.4633)/5] = ±1.8426 hr.

Case (c)
In this case, chi-squared is satisfactorily small, but Ȳ would not be close to 5.00 in value. Again, this is not our example, but we show the process, which is basically similar to the determination of fiducial limits around a regression line as in Chapter VIII and for Fig. 12. It can be shown that
Limits = X̄ + [B(5 - Ȳ) ± T√{A[(B·byx - T²)/ΣW + (5 - Ȳ)²]}] / (B·byx - T²).
From the usual T table, we take T as a value appropriate to the level of confidence desired, using an infinite number of degrees of freedom, i.e. assuming a normal distribution of errors. Thus, if we wish to work in terms of 95 per cent certainty, we use T.05 = 1.96. The two limits are then
upper Limit, coded = 3.56766, decoded = 675.67,
lower Limit, coded = 2.83982, decoded = 668.40.
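The standard error of case (a) and the fiducial limits of case (c) can be checked with the sketch below. It is a sketch only, under the assumption that the Series I constants quoted above are used; the variable names are mine.

```python
import math

# A sketch of case (a), the standard error of LD50, and case (c), the 95 per cent
# fiducial limits, using the Series I constants quoted in the text.
sW, A, B = 31.230, 38.39107, 38.97673
b_yx, x_bar, y_bar = 1.0152, 3.2434, 5.0363
ld50_coded = 3.2079
T = 1.96                                  # 95 per cent, infinite d.f.

# Case (a): standard error of LD50, in coded (ten-hour) units, converted to hours.
se_coded = (1.0 / b_yx) * math.sqrt(1.0 / sW + (ld50_coded - x_bar) ** 2 / A)
print(round(se_coded * 10, 3))            # about 1.763 hr

# Case (c): fiducial limits about LD50, coded, then decoded to hours.
g = B * b_yx - T ** 2
root = T * math.sqrt(A * (g / sW + (5.0 - y_bar) ** 2))
upper = x_bar + (B * (5.0 - y_bar) + root) / g
lower = x_bar + (B * (5.0 - y_bar) - root) / g
print(round((upper + 64) * 10, 2), round((lower + 64) * 10, 2))  # about 675.7 and 668.4
```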
That is to say, we are now 95 per cent sure that the true value of LD50 lies somewhere between 668.40 hr. and 675.67 hr. We do not, of course, know exactly where it lies, and we can never know. At least, as they say in the `Westerns,' we have got it `corralled.'

Case (d)
In this case, again, we must make a compensation, as we did under case (b), and this is done by multiplying T by the factor χ²/n, which is again 5.4633/5 = 1.0927. This adjusted value of T is then used in the formula of case (c).

Inter-comparisons between LD50's and between dosage-mortality curves
This is a large and complex subject. Were we to deal with it fully, this chapter would be the largest in the book. We may mention some of the types of tests which are done, and discuss a few of them. Bliss' publications deal with the situation in extenso.
The toxicologist, doing daily routine tests of drugs, may well determine LD50's, and may even investigate the significance of the difference between pairs of LD50's. He is more likely, however, to be interested in intercomparing whole sets of dosage-mortality data. Commonly, for drugs which are repeatedly tested, extremely accurately known dosage-mortality curves will have been worked out, and a given sample of a drug will be compared with the appropriate standard curve, its effect being examined not only at LD50 but at many dosages. For insecticides, one may study LD95 or other values. Sometimes the average minimum lethal dose is of interest.
The accurately determined dosage-mortality curve mentioned above may be thought of as a curve in which the regression line fits the transformed data exactly (for all practical purposes). That is to say, it is thought of as a set of data with no variance. We mentioned such a thing in the case of the comparison of byx with another value having no variance (Chapter VIII). The toxicologist may also examine the matters of significance of differences between the slopes and positions of pairs of regression lines fitted to the transformed data of dosage-mortality experiments.
The biologist, using the concept of LD50 to determine median transformation times (as in the example of Table LI) or perhaps median maturity times for such subjects as fish, will not have highly precise data for comparison, and he will be interested primarily in checking the significance of the difference between two LD50's. He may also intercompare whole sets of data, where both sets have a variance.

The significance of the difference between two LD50's
Under cases (a) and (c) above, one has the standard errors of the two LD50's, and no problem arises. If the difference is at least three times the standard error of the
difference, the difference can be considered significant. If one wishes to use a T test, this is done in the usual way, with T = diff/SEdiff.
To provide data for intercomparisons, Table LV shows the data of Series II, very similar to those of Series I shown in Table LI. Table LVI shows results calculated from Tables LI and LV, together with results for the pooled Series I and II. It is possible to pool these two sets of data because chi-squared tests show that they are from the same universe, and the coding was arranged to be the same in both tables. One can pool without similar coding, but it means re-calculation as opposed to simple addition.
TABLE LV.-Data and calculations for the time at which fifty per cent of fifth instar larvae have moulted to sixth instar, this being the data for Series II for comparison with Series I shown in Table LI.

Hours since laying   No. of insects (N)   X (coded hour)   No. in sixth instar     P       Q      Yprob   Yline   Yw      W       Ycalc   Exp. P    PN
640                  17                   0                0                      .000    1.000    -      2.50    2.15     .85    2.37    .004      .068
650                  18                   1                0                      .000    1.000    -      3.30    2.83    3.74    3.15    .032      .576
660                  20                   2                4                      .200     .800   4.16    4.10    4.16    9.43    3.94    .145     2.900
670                  16                   3                8                      .500     .500   5.00    4.90    5.00   10.15    4.73    .394     6.304
680                  20                   4                11                     .550     .450   5.13    5.70    5.03   10.63    5.52    .698    13.960
690                  16                   5                15                     .938     .062   6.54    6.50    6.54    4.31    6.31    .905    14.480
700                  20                   6                20                    1.000     .000    -      7.30    7.59    1.51    7.09    .982    19.640
Even with similar coding some precautions are necessary, as indicated by the footnote to Table LVI. That is to say, the means X̄ and Ȳ must be re-calculated from the pooled values of Series I and II, as must also be LD50, its standard error, and chi-squared for the goodness-of-fit of the regression line. In point of fact, we are doing a little fudging here by this pooling. If we calculated a new regression line especially for Series I plus II, we would do it by pooling the original data and going through the LD50 technique. This would give us weights (W) appropriate to the pooled raw data. When we produce a `synthetic' regression line by the pooling of the sums, as in Table LVI, we are using the two sets of weights which belong to Series I and II. This will produce some small discrepancies in our later bookkeeping, as we shall see. However, this short-cut pooling is quite all right if the two series are very similar. If they are not from the same universe, as shown by a chi-squared test (see below), we would not in any case pool them.
Turning back now to the matter of the significance of the difference between the LD50's for Series I and II, the difference is 673.435 - 672.079 = 1.356. The standard error of the difference is, by a simple formula,
SEdiff = √(SE_I² + SE_II²) = √(1.7630² + 1.7947²) = ±2.21577.
This is actually greater than the difference, so that the difference is not significant. Both sets of data belong to the same universe, and there is no significant difference between the two LD50's. Calculating T, we have T = 1.356/2.21577 = .612, with (7 - 2) + (7 - 2) = 10 d.f., and P, from a T table, is between .500 and .600. This again shows no significant difference. Whatever difference there is almost certainly arose by chance.
TABLE LVI.-Quantities calculated from the data of Tables LI and LV for Series I, II, and I plus II.
Quantity          Series I       Series II      Series I + II
ΣW                31.23000       40.62000       71.85000
ΣWX               101.29000      126.18000      227.47000
X̄ (coded)         3.24335        3.10635        3.16590*
ΣWY               157.28360      195.50770      352.79130
Ȳ (coded)         5.03629        4.81309        4.91011*
ΣWXY              549.10250      664.86980      1213.97230
ΣWX²              366.91000      465.00000      831.91000
ΣWY²              837.16026      991.10661      1828.26687
A                 38.39107       73.04076       111.43183
B                 38.97673       57.55446       96.53119
C                 45.03444       50.11045       95.14489
byx               1.01525        .78798         .86628*****
χ²**              5.46331        4.75687        11.52185*
LD50 (decoded)    672.07900      673.43500      672.69670*
SELD50            ±1.76300       ±1.79470       ±1.27160*
d.f.***           3              3              8****

* These values are to be determined from the data in the column above them, not from crosswise summation.
** For the goodness-of-fit of the final regression line.
*** d.f. for the above χ².
**** d.f. is 8 because 2 d.f. are lost only once.
***** byx here = (BI + BII)/(AI + AII); see text.
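The form of the calculation for the difference between the two LD50's can be sketched briefly in Python; the decoded values and standard errors are those of Table LVI, and the layout is mine rather than the author's.

```python
import math

# A sketch of the test of the difference between the Series I and Series II LD50's,
# using the decoded values and standard errors of Table LVI.
ld50_I,  se_I  = 672.079, 1.7630
ld50_II, se_II = 673.435, 1.7947

diff    = ld50_II - ld50_I                   # 1.356 hr
se_diff = math.sqrt(se_I ** 2 + se_II ** 2)  # standard error of the difference
t       = diff / se_diff                     # compare with a T table at 10 d.f.
print(round(diff, 3), round(se_diff, 3), round(t, 3))
```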
Chi-squared tests between two sets of dosage-mortality data
We have two sets of data, Series I and Series II. To these we have fitted regression lines. Do these two lines differ from each other (a) in a general sense, including both position and slope, and (b) do they differ in position or slope? We first make the general test to see if, including both slope and position, there is any significant difference between the regression lines fitted to the data of Series I versus Series II. We therefore calculate the chi-squared values for the fits of the three regression lines, for Series I, II, and I plus II, as shown in Table LVI. We have already shown how this may be done for Series I. It can then be shown that the chi-squared value for the difference between Series I and Series II is
χ²diff = χ²(I+II) - χ²I - χ²II = 11.52185 - 5.46331 - 4.75687 = 1.30167,
and this is with eight degrees of freedom. One might think it should be six, but it must be remembered that the two degrees of freedom for the restrictions of Ȳ and byx are lost only once. At 8 d.f., P is seen, from a chi-squared table, to be almost 1.000, so that there is no difference between the two lines, in the general sense.
We may split the above chi-squared into two parts, one for position (in terms of Ȳ), and the other for slope (in terms of byx). In theory, the sum of these two chi-squareds should be the same as the general chi-squared, namely 1.30167. As we predicted, however, having used two sets of weights to produce a synthetic regression line for Series I plus II, this identity will not be quite realized. For simplicity of notation, calling the chi-squared for position χ²ȳ, and that for slope χ²b, we have, where bc = byx for Series I plus II,
χ²ȳ = [(ȲI - ȲII) - bc(X̄I - X̄II)]² / [1/ΣWI + 1/ΣWII + (X̄I - X̄II)²/(AI + AII)] = .192303,
χ²b = (bI - bII)² / (1/AI + 1/AII) = 1.29978.
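The split into position and slope components can be verified with the following sketch, using the quantities of Table LVI; the variable names are mine, not the author's.

```python
# A sketch of the splitting of the between-series chi-squared into a position
# component and a slope component, using the quantities of Table LVI.
sW_I, sW_II = 31.23000, 40.62000
x_I,  x_II  = 3.24335, 3.10635     # coded means, Series I and II
y_I,  y_II  = 5.03629, 4.81309
A_I,  A_II  = 38.39107, 73.04076
B_I,  B_II  = 38.97673, 57.55446
b_I,  b_II  = 1.01525, 0.78798

b_c = (B_I + B_II) / (A_I + A_II)   # common slope, about 0.866

chi2_position = ((y_I - y_II) - b_c * (x_I - x_II)) ** 2 / (
    1.0 / sW_I + 1.0 / sW_II + (x_I - x_II) ** 2 / (A_I + A_II))   # about 0.192
chi2_slope = (b_I - b_II) ** 2 / (1.0 / A_I + 1.0 / A_II)          # about 1.300
print(round(b_c, 5), round(chi2_position, 4), round(chi2_slope, 4))
```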
Be it noted that the close similarity between χ²b and the general chi-squared is accidental. Actually we have χ²ȳ + χ²b = 1.49208, not quite 1.30167. This shows that our pooling, and the use of the two sets of weights, has disturbed things somewhat. Those who wish to pursue this topic further should consult the papers of C. I. Bliss.

Special adjustments for zero per cent and one hundred per cent kill
The mathematical background which leads to the calculation of the values in a probit table breaks down somewhat at the extremes of the table, i.e. for zero per cent and one hundred per cent kill. For very precise work, one may wish to make special adjustments. This is done by subtracting a correction from the Yline values where there is zero per cent kill, and adding it where there is one hundred per cent kill. Table LVII shows such corrections, where the Yline values with which one enters the table are those such as in column 8 of Table LI for, say, 640, 650, and 700 hours. For a dosage of 700 hours of time, one enters directly with Yline = 7.80, and the correction to be added is +.3228, making the corrected probit 7.80 + .3228 = 8.1228. At 640 hours, the Yline is too small for Table LVII. Therefore we add to 5.00 the value (five minus probit) and enter the table with the result: with the probit at 640 hours = 1.80, this is 5 + (5 - 1.80) = 8.20. The correction, as seen from Table LVII, is .2882, and it is subtracted, so that the corrected value is 1.80 - .2882 = 1.5118. These corrected values would then be used in the further LD50 calculations.
TABLE LVII.-Corrected probit values to use when one hundred per cent kill or zero per cent kill is observed experimentally.*
Yline probit   Correction   Corrected probit     Yline probit   Correction   Corrected probit     Yline probit   Correction   Corrected probit
5.5            0.8764       6.3764               6.7            0.4739       7.1739               7.9            0.3134       8.2134
5.6            0.8230       6.4230               6.8            0.4551       7.2551               8.0            0.3046       8.3046
5.7            0.7749       6.4749               6.9            0.4376       7.3376               8.1            0.2962       8.3962
5.8            0.7313       6.5313               7.0            0.4214       7.4214               8.2            0.2882       8.4882
5.9            0.6917       6.5917               7.1            0.4062       7.5062               8.3            0.2806       8.5806
6.0            0.6557       6.6557               7.2            0.3919       7.5919               8.4            0.2734       8.6734
6.1            0.6227       6.7227               7.3            0.3786       7.6786               8.5            0.2665       8.7665
6.2            0.5926       6.7926               7.4            0.3660       7.7660               8.6            0.2600       8.8600
6.3            0.5649       6.8649               7.5            0.3543       7.8543               8.7            0.2538       8.9538
6.4            0.5394       6.9394               7.6            0.3432       7.9432               8.8            0.2478       9.0478
6.5            0.5158       7.0158               7.7            0.3327       8.0327               8.9            0.2421       9.1421
6.6            0.4940       7.0940               7.8            0.3228       8.1228
* Abridged from Table II of `The Calculation of the Dosage-Mortality Curve' by C. I. Bliss, published in the Annals of Applied Biology, XXII (1935), 134-67. Reproduced here by permission of the author and publishers.
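The use of Table LVII can be illustrated with the short sketch below. Only the two corrections used in the example above are included, and the helper function is an illustrative assumption, not part of the original method.

```python
# A sketch of the end-point adjustment, using the two Table LVII entries quoted above.
CORRECTIONS = {7.8: 0.3228, 8.2: 0.2882}   # abridged; see Table LVII for the full set

def corrected_probit(y_line, hundred_per_cent_kill):
    """For 100% kill, add the correction; for 0% kill, enter the table with
    5 + (5 - Yline) and subtract the correction from Yline."""
    if hundred_per_cent_kill:
        return y_line + CORRECTIONS[y_line]
    entry = round(5.0 + (5.0 - y_line), 1)   # reflect about 5.00 to enter the table
    return y_line - CORRECTIONS[entry]

print(round(corrected_probit(7.8, True), 4))    # 8.1228, as at 700 hours
print(round(corrected_probit(1.8, False), 4))   # 1.5118, as at 640 hours
```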
BIBLIOGRAPHY
ARKIN, H., and COLTON, R. R., Tables for Statisticians, New York, Barnes and Noble, 1959.
BLISS, C. I., `The Method of Probits,' Science, 79 (1934), 38-9.
BLISS, C. I., `The Method of Probits: a Correction,' Science, 79 (1934), 409-410.
BLISS, C. I., `Estimating the Dosage-Mortality Curve,' Journal of Economic Entomology, 28 (1935), 646-7.
BLISS, C. I., `The Calculation of the Dosage-Mortality Curve,' Annals of Applied Biology, 22 (1935), 134-67.
BLISS, C. I., `The Comparison of Dosage-Mortality Data,' Annals of Applied Biology, 22 (1935), 307-33.
BLISS, C. I., `The Determination of the Dosage-Mortality Curve from Small Numbers,' Quarterly Journal of Pharmacy and Pharmacology, 11 (1938), 192-216.
FERGUSON, G. A., Statistical Analysis in Psychology and Education, Toronto, McGraw-Hill, 1959.
FISHER, R. A., Statistical Methods for Research Workers, New York, Hafner, 1958.
FISHER, R. A., and YATES, F., Statistical Tables for Biological, Agricultural and Medical Research, Edinburgh, Oliver and Boyd, 1957.
FRY, T. C., Probability and Its Engineering Uses, New York, Van Nostrand, 1928.
MORONEY, M. J., Facts from Figures, Harmondsworth, Penguin Books Ltd., 1958.
SNEDECOR, GEORGE W., Statistical Methods, Ames, Iowa State University Press, 1959.
STEELE, R. G. D., and TORRIE, J. H., Principles and Procedures of Statistics, New York, McGraw-Hill, 1960.
THIS BOOK IS SET IN TIMES NEW ROMAN AND IS PRINTED BY SANTYPE LIMITED SALISBURY, ENGLAND THE DUST JACKET AND COVER ARE BY ALLAN HARRISON