The Distribution of Correlation Ratios Calculated from Random Data



English Pages 6 Year 1925


Vol. 11, 1925    STATISTICS: H. HOTELLING    657

of the periodicity curves, and 1924 and 1925 have once more been years of severe epidemics. This merely constitutes another instance of the danger of estimating the value of preventive work without taking into consideration every possible factor. A complete table of the correlation coefficients (with their probable errors) for cholera and rainfall in each of the three groups is appended. Further work on the same subject, and including similar observations for the other Provinces and States in British India, is nearing completion and will be published separately.41

CORRELATION COEFFICIENTS FOR CHOLERA AND RAINFALL, GROUPS I, II AND III

Group I
  With no lag                 r = +0.378 ± 0.038
  With one month's lag        r = +0.291 ± 0.040
  With two months' lag        r = +0.071 ± 0.044

Group II
  With no lag                 r = +0.027 ± 0.044
  With one month's lag        r = +0.346 ± 0.039
  With two months' lag        r = +0.477 ± 0.040
  With three months' lag      r = +0.372 ± 0.038
  With four months' lag       r = +0.132 ± 0.043

Group III
  With no lag                 r = -0.060 ± 0.044
  With one month's lag        r = +0.184 ± 0.043
  With two months' lag        r = +0.338 ± 0.039
  With three months' lag      r = +0.253 ± 0.041

THE DISTRIBUTION OF CORRELATION RATIOS CALCULATED FROM RANDOM DATA

By HAROLD HOTELLING

FOOD RESEARCH INSTITUTE, STANFORD UNIVERSITY

Communicated August 28, 1925

An investigator calculates from his data a correlation ratio, hoping to establish between the populations from which his samples are drawn a relation constituting a natural or economic law. It is pertinent to ask, "What is the probability of obtaining such a result if the relation in question does not hold?" Whatever the particular measure of correlation used, the frequency distribution of its values when derived from random samples of material which is really uncorrelated supplies a test of the strength of the argument.

658    STATISTICS: H. HOTELLING    PROC. N. A. S.

In 1909 "Student" gave an empirical formula for the distribution of the correlation coefficient in the absence of true correlation.1 This formula was in 1915 provided with a proof by R. A. Fisher,2 who generalized it to cover the errors of sampling in cases where true correlation exists and who, in 1921, published3 the corresponding distribution of intraclass correlations. "The brevity of [the list of statistical functions whose distributions are known] is emphasized by the absence of investigation of other important statistics, such as the regression coefficients, multiple correlations and the correlation ratio," Fisher remarked.4 Later he published studies of the distributions of regression coefficients5 and partial correlation coefficients,6 and gave the distribution of multiple correlation coefficients7 in the absence of true correlation. In the present paper, the distribution of the correlation ratios of samples of uncorrelated material is found. The result holds also for the multiple correlation ratio.

Notation and Definition.-Let $p$ be the number of categories into which the range of the independent variables is divided. Let the number of observations falling within the $i$th category be $n_i$, and let the observed values of the dependent variable be $Y_{i1}, Y_{i2}, \ldots, Y_{i n_i}$ for this category.

Let the total number of observations be $N$; then $N = \sum_{i=1}^{p} n_i$. The mean of the $Y$s will be represented by $\bar{Y}$ and their deviations from this mean by $y_{11}, \ldots, y_{p n_p}$, so that

$$\sum_{k=1}^{p} \sum_{l=1}^{n_k} Y_{kl} = N\bar{Y} \tag{1}$$

and

$$Y_{ij} = \bar{Y} + y_{ij}. \tag{2}$$

The mean of $y_{i1}, y_{i2}, \ldots, y_{i n_i}$ will be denoted by $\bar{y}_i$, and the respective deviations from it of these quantities by $z_{i1}, z_{i2}, \ldots, z_{i n_i}$, so that

$$y_{ij} = \bar{y}_i + z_{ij}. \tag{3}$$

The correlation ratio $\eta$ which we shall consider is the positive square root of

$$\eta^2 = \frac{\sum_{i=1}^{p} n_i \bar{y}_i^{\,2}}{\sum_{i=1}^{p} \sum_{j=1}^{n_i} y_{ij}^{2}}. \tag{4}$$

It necessarily lies between 0 and 1.
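As an illustration of definition (4), the following is a minimal Python sketch and not part of the original paper. The function name `correlation_ratio` and the representation of the data as a list of lists (one inner list per category) are assumptions made for this example; every category is assumed non-empty and the observations are assumed not all equal.

```python
def correlation_ratio(groups):
    """Return eta per equation (4).

    `groups` is a list of p lists; the i-th inner list holds the
    n_i observed values Y_i1, ..., Y_in_i (hypothetical layout).
    """
    all_values = [y for group in groups for y in group]
    n_total = len(all_values)                      # N
    grand_mean = sum(all_values) / n_total         # Y-bar

    # Numerator of (4): sum over categories of n_i * (category mean - Y-bar)^2
    between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    # Denominator of (4): sum of squared deviations of every Y from Y-bar
    total = sum((y - grand_mean) ** 2 for y in all_values)

    return (between / total) ** 0.5


if __name__ == "__main__":
    sample = [[1, 2, 3], [4, 5, 6]]
    print(correlation_ratio(sample))  # approximately 0.878
```

Replacing every observation `y` by, say, `3 * y + 7` multiplies both sums in (4) by the same factor, so the value returned is unchanged apart from rounding.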

Geometric Interpretation.-With Fisher we assume that the population from which samples are drawn has a normal distribution, so that the probability of obtaining a particular set of values $Y_{11}, \ldots, Y_{p n_p}$, to within an infinitesimal $\varepsilon$ in each case, is

$$\left(\frac{h\varepsilon}{\sqrt{\pi}}\right)^{N} e^{-h^{2} \sum (Y_{ij} - a)^{2}}, \tag{5}$$

the summation being taken over the whole set of $Y$s. Here $h$ and $a$ are constants which we do not need to know.

Let $Y_{11}, \ldots, Y_{p n_p}$ be cartesian coordinates in space of $N$ dimensions. The loci along which (5) is constant are hyperspheres (of $N - 1$ dimensions) having a common center at the point $A$ whose coordinates all equal $a$. These hyperspheres meet the hyperplane (1) in concentric hyperspheres of $N - 2$ dimensions about the point $B$ whose coordinates all equal $\bar{Y}$. If we translate the axes by means of (2) to $B$ as origin, the equations in the new coordinates of these $(N - 2)$-dimensional hyperspheres will be

$$\sum_{i=1}^{p} \sum_{j=1}^{n_i} y_{ij} = 0 \tag{6}$$

and

$$\sum_{i=1}^{p} \sum_{j=1}^{n_i} y_{ij}^{2} = \text{constant}. \tag{7}$$

Now it is evident from the definition that the correlation ratio is invariant under the operations of adding a constant to each of the $Y$s and multiplying by a constant. Hence we may assume, first, that the point representing our set of observations lies on the hyperplane (6), and second, that it lies on an arbitrary one of the hyperspheres (7). This specialization does not disturb the ratio of the frequencies of any two values of $\eta$. Let us assume that the right member of (7) is unity. The probability that the correlation ratio, taken at random from uncorrelated material, lies between given limits is then proportional to the volume of the corresponding portion of the hypersphere whose equations are (6) and

$$\sum_{i=1}^{p} \sum_{j=1}^{n_i} y_{ij}^{2} = 1. \tag{8}$$

Selection of Parameters on the Hypersphere.-Let us introduce quantities $\alpha$, $s_i$ and $t_{ij}$ such that

$$\eta = \cos\alpha, \tag{9.1}$$

$$\bar{y}_i = s_i \cos\alpha, \tag{9.2}$$

and

$$z_{ij} = t_{ij} \sin\alpha. \tag{9.3}$$


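Since $\sum_{i=1}^{p} n_i \bar{y}_i$ and $\sum_{j=1}^{n_i} z_{ij}$ vanish, we have

$$\sum_{i=1}^{p} n_i s_i = 0 \tag{10}$$

and

$$\sum_{j=1}^{n_i} t_{ij} = 0 \qquad (i = 1, 2, \ldots, p). \tag{11}$$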

Substituting (8) and (9.2) in (4) we have, from (9.1),

$$\sum_{i=1}^{p} n_i s_i^{2} = 1. \tag{12}$$

Substituting (9.2) and (9.3) in (3) and the result in (8) we find, with the help of (12) and (11),

$$\sum_{i=1}^{p} \sum_{j=1}^{n_i} t_{ij}^{2} = 1. \tag{13}$$
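Written out, the substitution leading to (13) may be followed term by term. This is a worked expansion supplied for clarity, using only (3), (8), (9.2), (9.3), (11) and (12), and assuming $\sin\alpha \neq 0$. Putting $y_{ij} = s_i \cos\alpha + t_{ij} \sin\alpha$ in (8) gives

$$\cos^{2}\alpha \sum_{i=1}^{p} n_i s_i^{2} + 2 \sin\alpha \cos\alpha \sum_{i=1}^{p} s_i \sum_{j=1}^{n_i} t_{ij} + \sin^{2}\alpha \sum_{i=1}^{p} \sum_{j=1}^{n_i} t_{ij}^{2} = 1.$$

By (12) the first sum is unity, and by (11) the inner sum of the middle term vanishes for every $i$, so that

$$\cos^{2}\alpha + \sin^{2}\alpha \sum_{i=1}^{p} \sum_{j=1}^{n_i} t_{ij}^{2} = 1,$$

whence the double sum of the $t_{ij}^{2}$ must equal unity.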

The relations (10) and (12) enable us to express $s_1, \ldots, s_p$ as functions of $p - 2$ parameters $u_1, \ldots, u_{p-2}$:

$$s_i = s_i(u_1, \ldots, u_{p-2}). \tag{14}$$

Likewise (11) and (13) enable us to express $t_{11}, \ldots, t_{p n_p}$ in terms of $N - p - 1$ new parameters $u_{p-1}, \ldots, u_{N-3}$:

$$t_{ij} = t_{ij}(u_{p-1}, \ldots, u_{N-3}). \tag{15}$$

Combining (3), (9), (14) and (15) we have

$$y_{ij} = s_i(u_1, \ldots, u_{p-2}) \cos\alpha + t_{ij}(u_{p-1}, \ldots, u_{N-3}) \sin\alpha \tag{16}$$

as the equations of the hypersphere (6), (8) in terms of the parameters $u_1, u_2, \ldots, u_{N-3}$ and $\alpha$.

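The element of volume is given by8

$$\sqrt{g}\; du_1\, du_2 \cdots du_{N-3}\, d\alpha, \tag{17}$$

where $g$ is the symmetric determinant of order $N - 2$ in which the element in the $q$th row and $r$th column is

$$g_{qr} = \sum_{i=1}^{p} \sum_{j=1}^{n_i} \frac{\partial y_{ij}}{\partial u_q} \frac{\partial y_{ij}}{\partial u_r}, \tag{18}$$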

this formula holding even for the last row and column if we put $\alpha = u_{N-2}$. From (16) and (18) it is evident that for each of the subscripts $q$ and $r$ which is one of the numbers $1, 2, \ldots, p - 2$, $g_{qr}$ has a factor $\cos\alpha$, and that for each subscript which is greater than $p - 2$ and less than $N - 2$, $g_{qr}$ has a factor $\sin\alpha$. These are the only factors of $g_{qr}$ which involve $\alpha$ unless $q$ or $r$ equals $N - 2$. In that case we have, if $r \neq N - 2$,

$$g_{N-2,\,r} = g_{r,\,N-2} = \sum_{i=1}^{p} \sum_{j=1}^{n_i} \left(-s_i \sin\alpha + t_{ij} \cos\alpha\right) \left(\frac{\partial s_i}{\partial u_r} \cos\alpha + \frac{\partial t_{ij}}{\partial u_r} \sin\alpha\right),$$

which reduces to zero with the help of (11) and the relations obtained by differentiating (12), (11) and (13) with respect to $u_r$. We have also

$$g_{N-2,\,N-2} = \sum_{i=1}^{p} \sum_{j=1}^{n_i} \left(-s_i \sin\alpha + t_{ij} \cos\alpha\right)^{2},$$

which, by (12), (11) and (13), equals unity. Thus $g$ has a factor $\cos\alpha$ for each of its first $p - 2$ rows and its first $p - 2$ columns and a factor $\sin\alpha$ for each of its next $N - p - 1$ rows and columns, and does not otherwise contain $\alpha$. Hence

$$g = f(u_1, \ldots, u_{N-3}) \cos^{2(p-2)}\alpha \, \sin^{2(N-p-1)}\alpha.$$

Integrating with respect to $u_1, \ldots, u_{N-3}$ we find that the probability that the correlation ratio lies between $\eta$ and $\eta + d\eta$ is a constant multiplied by

$$\cos^{p-2}\alpha \, \sin^{N-p-1}\alpha \, d\alpha = -\eta^{p-2} (1 - \eta^{2})^{\frac{N-p-2}{2}} \, d\eta,$$

since $\eta = \cos\alpha$. Evaluating the constant multiplier by means of the condition that the correlation ratio certainly lies between 0 and 1, we find

$$\frac{2\, \Gamma\!\left(\frac{N-1}{2}\right)}{\Gamma\!\left(\frac{p-1}{2}\right) \Gamma\!\left(\frac{N-p}{2}\right)} \, \eta^{p-2} (1 - \eta^{2})^{\frac{N-p-2}{2}} \, d\eta$$

as the probability sought. The probability that $\eta$ will by chance exceed a particular value $\eta'$ is of course the integral of this expression from $\eta'$ to 1. The integration can be carried out as indicated by Fisher for a similar expression.7

For the multiple correlation ratio let $p$ represent the number of cells into which the field of the independent variables is divided. The same argument then holds, with the same result.

1 Probable Error of a Correlation Coefficient, Biometrika, Cambridge, England, 6, 302-310.
2 On the Frequency Distribution of the Values of the Correlation Coefficient in Samples from an Indefinitely Large Population, Biometrika, 10, 507-521.
3 On the "Probable Error" of a Coefficient of Correlation Deduced from a Small Sample, Metron, Ferrara, Italy, 1, Part 4, pp. 3-32.

662    BIOLOGY: STEARN, STURDIVANT AND STEARN    PROC. N. A. S.

4 On the Mathematical Foundations of Theoretical Statistics, Phil. Trans., London, A222, 315.
5 The Goodness of Fit of Regression Formulae and the Distribution of Regression Coefficients, J. R. Stat. Soc., London, 85, 597-612.
6 The Distribution of the Partial Correlation Coefficient, Metron, 3, 329 (1924).
7 The Influence of Rainfall on the Yield of Wheat at Rothamsted, Phil. Trans., London, B213, 91, 93 (1923).
8 This important expression for the volume element has been used in lectures by Professors O. Veblen and L. P. Eisenhart. I do not find it in any of the treatises on Calculus, Analysis or Differential Geometry, save for the special case in which the manifold of integration is a surface. It may readily be proved by showing first that (17) is a relative invariant under arbitrary transformations of the parameters; and second, that if the parameters of the hypersurface are orthogonal at a point, (17) becomes at this point the simple expression for the volume element in cartesian coordinates.
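The probability found in the paper for the correlation ratio of uncorrelated material,

$$\frac{2\, \Gamma\!\left(\frac{N-1}{2}\right)}{\Gamma\!\left(\frac{p-1}{2}\right) \Gamma\!\left(\frac{N-p}{2}\right)} \, \eta^{p-2} (1 - \eta^{2})^{\frac{N-p-2}{2}} \, d\eta,$$

and the chance of exceeding a given value $\eta'$ can be evaluated numerically. The following is a minimal Python sketch under stated assumptions, not part of the original paper: the names `eta_density` and `tail_probability` are hypothetical, a plain midpoint rule stands in for the method of integration indicated by Fisher, and $p \geq 2$ with $N \geq p + 2$ is assumed so that the integrand stays bounded.

```python
import math


def eta_density(eta, n_total, p):
    """Density of the correlation ratio at `eta` for uncorrelated material.

    n_total is N (number of observations), p the number of categories.
    Assumes 0 <= eta < 1, p >= 2 and n_total >= p + 2.
    """
    # Logarithm of the constant 2 * Gamma((N-1)/2) / (Gamma((p-1)/2) * Gamma((N-p)/2))
    log_const = (
        math.log(2.0)
        + math.lgamma((n_total - 1) / 2)
        - math.lgamma((p - 1) / 2)
        - math.lgamma((n_total - p) / 2)
    )
    return (
        math.exp(log_const)
        * eta ** (p - 2)
        * (1.0 - eta ** 2) ** ((n_total - p - 2) / 2)
    )


def tail_probability(eta_prime, n_total, p, steps=20000):
    """Approximate P(eta > eta_prime) by the midpoint rule on [eta_prime, 1]."""
    width = (1.0 - eta_prime) / steps
    return width * sum(
        eta_density(eta_prime + (k + 0.5) * width, n_total, p)
        for k in range(steps)
    )


if __name__ == "__main__":
    # Chance that eta exceeds 0.5 with N = 10 observations in p = 3 categories
    print(tail_probability(0.5, 10, 3))
```

For $p = 3$ the integral has the closed form $(1 - \eta'^{2})^{(N-p)/2}$, which gives a convenient check on the numerical result.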

THE LIFE HISTORY OF A MICRO-PARASITE ISOLATED FROM CARCINOMATOUS GROWTHS

By E. W. STEARN, B. F. STURDIVANT AND A. E. STEARN

LABORATORY OF THE PASADENA HOSPITAL AND THE GATES CHEMICAL LABORATORY OF CALIFORNIA INSTITUTE OF TECHNOLOGY

Communicated September 15, 1925

The recent awakening of active interest in the possible parasitic origin of carcinoma has prompted the publication of this preliminary report upon an investigation in this direction, which is still in progress. It should be mentioned in advance that Young,1 Louden and McCormack2 and Scott3 (who present also the unpublished work of Glover), and Nuzum4 have all described an organism possessing certain salient features in common with that here described. Young traces the presence of an organism from a "plasm" stage through several organized, "morphologically recognizable" stages. Louden and McCormack have isolated an organism with a very complex life history, which results are analogous to those of men like Lohnis and Smith and others who worked on life cycles of bacteria. Nuzum's extensive work on the micrococcus isolated by him may represent one life stage of the same organism which has attained a certain stabilization of form. The present research was, however, entirely independent of previous ones; indeed the organism was obtained by us before we knew of the publications here referred to. The results of this research are thus in some measure a confirmation and in greater measure an extension of the work of these investigators. The organism described in this paper exhibits a complex life history, and it possesses morphological characteristics in its parasitic environ-