Statistical Inference
Michael Oakes
University of Sussex
Epidemiology Resources Inc. 1990
Copyright © 1986 by M. Oakes
All rights reserved.
No part of this book may be reproduced, stored in a retrieval system, or transcribed in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.

Library of Congress Cataloging-in-Publication Data:

Oakes, Michael W.
Statistical inference.
Reprint. Originally published: Chichester; New York: Wiley, 1986.
Includes bibliographical references.
1. Social sciences — Statistical methods. 2. Probabilities. 3. Psychometrics. I. Title.
HA29.O18 1990 519.5'4'0243 90-2834

ISBN 0-917227-04-2
Epidemiology Resources Inc.
P.O. Box 57
Chestnut Hill, MA 02167

Printed in the United States of America
Available for purchase in North America only.
Contents

Foreword  vii
Preface  xi

PART I: ON SIGNIFICANCE TESTS

1. The Logic of Significance Tests  3
   1.1 An Outline of the Significance Test  4
   1.2 The Power of Significance Tests  7
   1.3 The Meaning of Associated Probability and Type 1 Error  14

2. A Critique of Significance Tests  22
   2.1 Decision versus Inference  22
   2.2 The Asymmetry of Significance Tests  30
   2.3 The Corroborative Power of Significance Tests  38
   2.4 Significance Tests and Effect Size  49
   2.5 An Evaluation  66

3. Intuitive Statistical Judgements  75
   3.1 The Representativeness Heuristic  76
   3.2 Psychologists' Misconceptions of Significance Tests  79
   3.3 An Alternative Explanation of Low-power Designs  82
   3.4 Intuitions of Effect Size from Statistical Significance  86
   3.5 Intuitions of Strength of Association from a Correlation Coefficient  88

PART II: SCHOOLS OF STATISTICAL INFERENCE

4. Theories of Probability  97
   4.1 The Probability Calculus  97
   4.2 The Classical Interpretation of Probability  100
   4.3 The Relative Frequency Theory of Probability  101
   4.4 Fiducial Probability  103
   4.5 Logical Probability  104
   4.6 Personal Probability  106

5. Further Points of Difference  110
   5.1 The Nature and Location of Statistical Inference  110
   5.2 Inference versus Decision; Initial versus Final Precision  113
   5.3 The Role of Prior Information  114
   5.4 The Relevance of the Stopping Rule  115

6. An Evaluation of the Major Schools of Statistical Inference  118
   6.1 Fisherian Inference  118
   6.2 A Critique of Fisherian Inference  121
   6.3 The Neyman-Pearson School  123
   6.4 A Critique of the Neyman-Pearson School  124
   6.5 An Outline of Bayesian Inference  129
   6.6 A Critique of Bayesian Inference  134
   6.7 Likelihood Inference  142
   6.8 A Critique of Likelihood Inference  144
   6.9 Review  145

PART III

7. The Role of Statistical Inference in Social Research  151
   7.1 The Technical Legitimacy of Statistical Inference  151
   7.2 On Meta-analysis  157
   7.3 The Relevance of Statistical Inference  163

Epilogue  171
References and Bibliography  172
Author Index  183
Foreword
This is the only book that I would consider required reading for anyone who must deal with statistics on a regular basis. It describes and contrasts, in a simple, nontechnical fashion, the competing schools of statistical inference. Since many (if not most) readers will have been exposed only to the methods of the Neyman-Pearson school, with a strong emphasis on significance tests, Oakes devotes Part I of his book to laying bare the logical problems of Neyman-Pearson significance testing. Part II then describes the major competing schools of inference. In an evenhanded spirit, Part III subjects these competitors to critical scrutiny.

I am, of course, not in accord with all that Oakes says. In particular, I take strong exception to his blanket condemnation of meta-analysis. Oakes may have been incensed by some abuses of meta-analysis in the social sciences, and he outlines a number of important problems in carrying out meta-analyses (most notably the "file drawer" problem). Nevertheless, I have seen enough intelligent applications in the health sciences to be convinced that meta-analysis can be a valuable adjunct to narrative reviews. I refer especially to the clinical-trial overviews of medical therapy emanating from the Oxford collaborative groups; see Peto (1987) for some specific references and a detailed defense of meta-analysis in this context. My view is almost diametrically opposed to that of Oakes, as I think failure to provide a quantitative summary in a review is akin to failing to provide a quantitative summary of the data when reporting a single study. In both situations, the summary should be thoughtfully chosen and interpreted, but the potential for abuse of a summary method should not impugn its proper use, any more than (say) the potential for abuse of morphine should lead us to withhold it from a patient in agony (the latter state perhaps induced by trying to make sense out of certain purely narrative reviews).

I realize of course that my last argument could be employed in defense of significance tests. In fact, I am not one of those who would ban these tests, but my views coincide with those of Oakes and of Rothman (1986) insofar as we agree that significance tests appear to have produced much more harm than good in the social and health sciences. Much of this harm could have been avoided had researchers been made to understand the limitations of and the alternatives to the Neyman-Pearsonian outlook.
The statistical crime of the twentieth century is that, for at least four decades following the conquest of statistics by the Neyman-Pearson school in the 1930's, the existence of alternative schools was rarely even mentioned, let alone their methods taught, in American introductory statistics courses. (The situation in the British Commonwealth, with its stronger Fisherian influence, may have been a little better, but introductory students still remained largely cut off from non-frequentist approaches.) I am aware of only one elementary textbook from this era (Folks, 1978) that adequately acknowledged the depth of the conflicts about statistical inference. Likewise, most advanced textbooks simply presented a single point of view (usually frequentist) and gave only passing reference to alternatives; while the texts by Barnett (1973) and Cox and Hinkley (1974) are notable exceptions, the latter at least is too mathematically sophisticated for most nonstatisticians.

Although Statistical Inference should be accessible to anyone who has had an elementary statistics course, the illustrations focus on study objectives and techniques found much more frequently in the social sciences than in epidemiology. For epidemiology courses, I have found it essential to supplement chapters 2 and 3 with epidemiologic critiques of statistical methods. My primary supplement to Chapter 2 is Poole's "Beyond the Confidence Interval" (1987), which provides a real epidemiologic example of the misleading nature of significance tests, and describes an important concept not mentioned by Oakes, the p-value function. I follow this with Goodman and Royall's "Evidence and Scientific Research" (1988), which exposes some of the logical incoherencies of significance testing when viewed from a likelihood perspective; this article also aids in moving from the critique of Neyman-Pearsonian methods to the description of alternative methods. Section 2.4 is uncritical in its discussion of correlation and "variance explained," and the examples are far removed from epidemiologic concepts of effect size, such as relative risk, so I supplement or replace the section with the critique by Greenland et al. (1986). Finally, the interpretative relationship between the sample size and the p value, discussed by Oakes in section 2.1, has been greatly clarified (from a likelihood/Bayesian perspective) by Royall (1986); this article can be read alongside chapter 5.

I hope that Statistical Inference marks an historical turning point from the chronic misrepresentation of statistics as a monolithic state to a more accurate portrayal as a primitive territory inhabited by fiercely independent and often feuding tribes. But if monolithic portrayals continue to dominate teaching, this book will be all the more important in fighting such distortions.

Sander Greenland
Department of Epidemiology
UCLA School of Public Health
Los Angeles, CA 90024-1772
December, 1989
References
Barnett, V. (1973; 2nd ed. 1982). Comparative Statistical Inference. New York: Wiley.

Cox, D. R., and Hinkley, D. V. (1974). Theoretical Statistics. New York: Chapman and Hall.

Folks, J. L. (1978). Ideas of Statistics. New York: Wiley.

Goodman, S. N., and Royall, R. M. (1988). Evidence and scientific research. American Journal of Public Health, 78, 1568-1574.

Greenland, S., Schlesselman, J. J., and Criqui, M. H. (1986). The fallacy of employing standardized regression coefficients and correlations as measures of effect. American Journal of Epidemiology, 123, 203-208.

Peto, R. (1987). Why do we need systematic overviews of randomized trials? Statistics in Medicine, 6, 233-240.

Poole, C. P. (1987). Beyond the confidence interval. American Journal of Public Health, 77, 195-199 (see also comment and rejoinder, pp. 491-493 of the same volume).

Rothman, K. J. (1986). Modern Epidemiology. Boston: Little, Brown.

Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. The American Statistician, 40, 313-315 (see also correspondence in vol. 41, 245-247).
Preface

This book is intended to be a commentary on the use of statistical inference in the social, biological, and behavioral sciences. I have in mind a wide variety of readers. I hope those who teach statistics will find something of value in the pages that follow, but I have especially set out to clarify some statistical issues for the consumer — the writer and reader of the empirical literature. Although the book is not intended as a textbook, I think it could be used for a graduate course in quantitative methods.

Throughout, I have been at pains to make the exposition as non-technical as possible. A sound understanding of no more than the t-test will equip the reader for most of the argument. I write as a psychologist and hence draw upon that discipline for many of the examples examined. I should emphasize, however, that the issues are quite general. Students of biology, sociology, medicine, public health, education, political science, and geography — indeed any of the empirical sciences — should find the discussion relevant. Many of the examples are theoretically banal. This is a conscious decision; I wish to retain the comprehension and interest of the non-specialist reader and to prevent the specialist from being distracted by theoretical niceties. The statistical issues generalize not only across disciplines but also across levels of theoretical sophistication within disciplines.

The organization of the book is from the specific to the general. Thus in Part I the relative merits of significance testing and interval estimation are assessed from within the framework of Neyman-Pearson orthodoxy. In Part II, four competing accounts of statistical inference are assessed. In Part III the scientific usefulness of each of these approaches to statistical inference is examined.

Many researchers retain an infatuation with significance tests despite the formidable arguments that have been presented against them. In Chapters 1-3 I marshal these arguments, and append a few of my own, including empirical evidence, in an attempt to kill the beast. Even so, I suspect that the headless corpse will continue to flail through journal pages for years to come.
It is a common complaint of the scientist that his subject is in a state of crisis, but it is comparatively rare to find an appreciation of the fact that the discipline of statistics is similarly strife-torn. The typical reader of statistics textbooks could be forgiven for thinking that the logic and role of statistical inference are unproblematic and that the acquisition of suitable significance-testing recipes is all that is required of him. The desirability of re-educating researchers in this respect was pointed out in a book review of a statistics text:

    A more fundamental criticism is that the book, as almost all other elementary statistics texts, presents statistics as if it were a body of coherent technical knowledge, like the principles of oscilloscope operation. In fact statistics is a collection of warring factions, with deep disagreements over fundamentals, and it seems dishonest not to point this out. Elementary statistics books conspire to produce psychologists who are very able to do five-way analysis of variance while remaining incapable of coherent discussion on the problems associated with Bayes' theorem or with the attempt to delimit the applicability of probability theory. (Dusoir, 1980, p. 315)
Despite the passage of years Dusoir's comments remain apt; the chief purpose of this book is to satisfy the need that he identified. Fisher and Neyman were at loggerheads throughout their careers. Bayesians claim advantages over both approaches and likelihood theorists feel that the Bayesians overreach themselves. With the exception of the growing minority who espouse the Bayesian argument, most research scientists know little of Bayesian inference apart from the notion that it is "subjective." In Part II I seek to explain the philosophical issues that underlie the controversies in current statistical theory. Chapters 4 and 5 lay the foundations for the critical assessment of Fisherian, Neyman-Pearson, Bayesian, and likelihood approaches to inference that is the subject of Chapter 6. In Part III I briefly examine the issue of the relevance of statistical inference, of whatever kind, to scientific research. I devote a section to a new bête noire: meta-analysis. It should be stifled at birth.

This book has been an appallingly long time in preparation. I blame the extent and difficulty of the literature in which I have immersed myself; others, more insightful, allude to my indolence. Either way, there have been times when I believed that my major contribution has been to provide empirical evidence in support of Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.

Through the years I have gained immeasurably from discussions with friends, of whom I should like to mention Rod Bond, Alistair Chalmers, Rob Eastwood, Aaron Sloman, and Mervyn Thomas. Harold Dale encouraged me to research the area. I am especially grateful to Edward Minium who read a draft of this work, encouraged
me, and offered constructive criticism. I am also indebted to Nancy Dreyer and Kenneth Rothman who have made this paperback edition possible. Their enthusiasm for my efforts is greatly appreciated.

Finally, as the reader may already have noticed, I must admit to the use of male generic pronouns. I plead that my sole reason is an impatience, both as reader and writer, with expressions such as 's/he' and 'himself or herself.' No doubt my usage will cause offense; I can only apologize and ask that my arguments be judged on their merits.
Michael Oakes
December, 1989
PART I
ON SIGNIFICANCE TESTS
Chapter 1
The Logic of Significance Tests
Introduction
As the title suggests, this book is an examination of the manner in which researchers employ statistical techniques in order to facilitate the development of explanatory theory. We distinguish at the outset those uses of statistics where the object is some form of inferential process, from those where the intention behind the presentation of statistics is to convey in summary form some aspect or other of a mass of empirical data. This latter so-called 'descriptive' use of statistics is relatively uncontroversial and will not be considered, except tangentially, in what follows. This distinction is at once hard to maintain; we shall certainly be interested in the ways researchers use measures of association (correlation, for example); the emphasis, however, will be on how such statistics are used to throw light upon theories. Moreover, as we shall see later, the whole notion of statistical inference is problematic.

In short, the typical use of such techniques as analysis of variance, regression, non-parametric methods, etc. will be the domain of interest. It should be pointed out, however, that the various techniques, per se, are not of central concern (I shall not, for example, be discussing the relative robustness of tests nor the degree of bias of estimators). The reasons are twofold. First, this is not an essay on theoretical statistics; the techniques under consideration have the status merely of tools in the research enterprise. Second, we shall, for the most part, be operating at a more conceptual than technical level. For example, I shall be assessing the role of significance tests in general and comparing it with the contribution of estimation procedures in general.

It will be natural to devote much of the discussion to an analysis of the significance test in social and behavioural research, because it remains the case that the predominant use to which statistical techniques are put is to compute the statistical significance or otherwise of data culled from an empirical study. Other issues of relevance to the evaluation of research evidence arise more or less naturally from a critique of the concept of 'significance'.

That significance tests were an important ingredient of psychological
research reports some 30 years ago was demonstrated by Sterling (1959) who combed four leading journals to investigate the incidence of tests of significance in one calendar year. He found that of the 124 research reports in Experimental Psychology (1955), 106 (85%) used tests of significance. The percentages for the other journals were as follows: Comparative and Physiological Psychology (1956) 80%, Clinical Psychology (1955) 77%, and Social Psychology (1955) 82%. By 1975 significance tests were, if anything, more prevalent in the psychological literature (Oakes, 1979), and inspection of the journals from other social sciences tells the same story. This is of interest because since Sterling's survey a number of articles have been published questioning the use made by psychologists of significance tests and their applicability to research in the social sciences in general. (Many of the critical contributions to the debate appear in a book edited by Morrison and Henkel, 1970.)

In Chapters 2 and 3 we shall be examining the validity of the arguments against significance testing, speculating as to reasons why statistical procedures remain relatively unchanged, and, by means of examples from the literature and the presentation of empirical data, illustrating the likely consequences for theory development of misunderstanding the contribution of statistical analysis to empirical research.

In order to facilitate the ensuing discussion it will be helpful to give a brief summary of the principal features of a significance test. The vehicle for this exposition will be the t-test for independent means, but the essentials are the same for all tests of significance.
1.1 An Outline of the Significance Test
Suppose an investigator were interested in whether violence portrayed on television has any effect on the degree of aggression subsequently exhibited by child viewers. He may take 20 typical schoolchildren and randomly assign them, 10 each, to two treatment conditions. These are X, the experimental condition, where the children view a particularly violent episode of a well-known television crime series, and Y, the control condition, where the other 10 children view a documentary programme on the musical instruments favoured by various Asian cultures. After watching their respective programmes the children are left alone with a large doll and trained observers note the number of times each child 'behaves aggressively' towards the doll during a fixed time-span.

Suppose our investigator entertains the substantive hypothesis that children tend to imitate behaviour which they observe to be reinforced. Then the significance test procedure runs like this: derivable from the substantive hypothesis is the statistical hypothesis H_A that the mean number of aggressive behaviours of children in condition X differs from the mean number of aggressive behaviours of children in condition Y. That is, we have the statistical hypothesis H_A: μ_x ≠ μ_y, for which our investigator would like confirmation.
Now the negation of the statistical hypothesis H_A is another statistical hypothesis which may be labelled H_0. Thus H_0: μ_x = μ_y. As H_0 and H_A are mutually exclusive and exhaustive it follows by logical implication that if H_0 is denied then H_A is affirmed. The investigator seeks therefore to deny H_0, or to 'nullify' it, and it is for this reason that it is known as the null hypothesis. However, both H_0 and H_A refer to population parameters and only data from two samples will be available. Furthermore, it is impossible to obtain sample data that are absolutely inconsistent with either statistical hypothesis [1]. A particular sample statistic is only more or less likely given the truth of a statistical hypothesis — rather than possible or impossible. It remains to explain what is meant by 'more or less likely'.

The significance test is based upon the notion that the particular sample statistic (in this case (X̄ − Ȳ), the difference between the sample means) is but one instance of an infinitely large number of sample statistics that would arise if the experiment were replicated, under the same conditions, an infinite number of times. The differences between these sample statistics would simply reflect the vagaries of random sampling from an infinite population. Statistical theory has demonstrated that by using information from the samples (in this case information regarding variance) and by making a few assumptions (in this case that the populations are normally distributed and with equal variance, for example) a sampling distribution can be constructed. One other assumption is made — that the null hypothesis (here μ_x = μ_y) is true.

If these assumptions are met in our example it is known that the distribution formed by plotting the difference of the two means over an infinite number of hypothetical replications would be normal with mean zero. The standard deviation of this distribution is known as the standard error of the difference of two independent means and is estimated from the sample standard deviations. The distance of a particular sample means difference from zero can therefore be expressed as a number of estimated standard errors above or below the mean. This standardized, derived sample statistic,

    t = [(X̄ − Ȳ) − (μ_x − μ_y)] / s_{X̄−Ȳ},
is distributed according to Student's t distribution with (n_1 − 1) + (n_2 − 1) degrees of freedom (where, as usual, n_1 and n_2 are the respective sizes of the two samples and s_{X̄−Ȳ} is the estimated standard error of the difference between two independent means). A sample t statistic may therefore take any value in the range −∞ to +∞, and the relative frequency, in an infinite number of sampling replications, with which t would fall within any specified interval can be readily determined. As t may take any value it is clear that no particular sample t statistic is logically inconsistent with the truth of the null hypothesis — but equally clearly, some t values will occur more frequently than others. That is, some
t values are more consistent with the truth of the null hypothesis than others. The significance test embodies the principle that there is some range of t values which is so inconsistent with the truth of the null hypothesis that indeed the null hypothesis should be rejected. In other words, there exist possible values of t such that one is left with this choice: either (i) the assumptions of the test are true (including the assumption H_0: μ_x = μ_y) and a rare event has occurred, or (ii) one of the assumptions of the test is untrue (specifically, μ_x ≠ μ_y) [2].

The point at which a t statistic becomes sufficiently rare to merit the rejection of the null hypothesis is a matter of convention and is usually set at that absolute value beyond which t values would occur but 5% of the time in successive replications when the null hypothesis is true. This is illustrated in Figure 1.1.1. Each of the shaded portions in Figure 1.1.1 represents 2½% of the total area under the curve (which is, of course, symmetric about zero). When t is distributed with 18 degrees of freedom (d.f.), as in our example, the absolute value of t beyond which lies 5% of the curve is 2.10. This absolute value, which must be exceeded for the null hypothesis to be rejected, will be referred to as the critical value. Likewise, the relative frequency of t values which is considered to be too small to be consistent with the truth of the null hypothesis will be referred to as the significance level (denoted by the symbol α) and, as indicated above, it is frequently set at either 0.05 or 0.01.

Let us imagine that our psychologist performs his experiment having set α at 0.05, and derives a t value of 2.55. This is therefore deemed to be inconsistent with the truth of the null hypothesis and is termed a significant result. Further, the obtained value of t happens to be such that 0.02 of the t distribution is more extreme (adding both tails). This value, 0.02, I shall refer to as the associated probability of the sample statistic.

To recapitulate, a significance test is a procedure whereby a sample statistic is compared with a distribution formed by the hypothetical infinite replication of the sampling programme, each time plotting that statistic under the assumption that the null hypothesis is true. I have illustrated this where the interest lay in two samples from which the t statistic was derived, but the procedure is the same in principle when we wish to test the significance of a linear regression, or of an association between two categorical variables, or whatever.
Figure 1.1.1. Distribution of t with 18 d.f.
When the observed statistic is such that it would occur less than a predetermined (100 × α)% of the time under random sampling if the null hypothesis were true, then the result is significant and the null hypothesis is rejected in favour of its logical complement, the alternate hypothesis, H_A.

It is evident that associated with the significance test are two possible types of error. We may reject a null hypothesis when indeed a rare event has occurred and the null hypothesis is actually true. This is known as a type 1 error, and of all the occasions on which the null hypothesis is true it will occur (100 × α)% of the time. The second way in which a mistake can be made occurs when the null hypothesis is actually false, but the sample statistic is not significant and is therefore considered consistent with the truth of the null hypothesis. In this event the null hypothesis is accepted (wrongly) and the alternate hypothesis is not entertained. This is termed a type 2 error and the relative frequency with which it occurs is denoted by β. The particular value of β depends, among other things, on the actual size of the true effect. The complement of the type 2 error is the frequency with which a significant result is obtained when the null hypothesis is false. This is termed the power of the test and its probability is (1 − β). This too is determined relative to a particular hypothesized size of effect.

That, in outline, is the structure of the significance test. I have already presented evidence to suggest that it finds favour with a great many researchers. Perhaps this is not surprising when one considers that the man who first brought it to the attention of the general academic community held that 'every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis' (Fisher, 1935).
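As an illustrative aside, the two figures quoted for this worked example — the critical value of 2.10 and the associated probability of 0.02 — are easily checked by computer. The following sketch (in Python, assuming the widely available scipy library; it is not part of the original argument, merely a check of the arithmetic) reproduces them.

    # Checking the figures quoted in section 1.1 for the independent-means t-test:
    # 18 degrees of freedom, two-tailed significance level of 0.05.
    from scipy import stats

    df = 18                                   # (n1 - 1) + (n2 - 1), with n1 = n2 = 10
    alpha = 0.05

    t_crit = stats.t.ppf(1 - alpha / 2, df)   # critical value, approx. 2.10
    t_obs = 2.55                              # the observed statistic in the example
    p_assoc = 2 * stats.t.sf(t_obs, df)       # associated probability, approx. 0.02

    print(round(t_crit, 2), round(p_assoc, 3))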
1.2 The Power of Significance Tests
We may define the power of a test of the null hypothesis as the probability that it will lead to the rejection of the null hypothesis when the null hypothesis is indeed false. We have seen that power is the complement of the type 2 error (finding no effect when the effect exists). In any particular context power is a function of three factors:

(1) the significance level adopted (the type 1 error);
(2) the reliability of the sample data;
(3) the size of the treatment effect.

The interrelationship of these factors may be illustrated with reference to the example given earlier. The psychologist, it will be remembered, wished to detect an increase in aggressive behaviour in children exposed to a violent television programme. If such an effect exists the probability of his obtaining results leading to the rejection of the null hypothesis is the power of his test. He could increase the power of his test by making his alternative hypothesis directional [3]. This can be viewed as a special case of altering the significance
level. Clearly, the larger his selected α level the easier it will be to obtain a significant result. Thus the psychologist could, if he wished, set his type 1 error at 0.2, for example. There is evidence that psychologists are unwilling to depart so radically from the conventional values of 0.05 and 0.01. This may be because they believe that in the interests of scientific caution they should guard against committing a type 1 error almost at any cost. Or it may be that they suspect, probably correctly, that any research hypothesis tested at the 20% level of significance would be highly unlikely to be published in a reputable psychology journal.

The second factor affecting the power of this hypothetical experiment is the reliability of the sample data. In our example the most appropriate measure of this reliability is the standard error of the difference between two means:

    s_{X̄−Ȳ} = √(s²/n_1 + s²/n_2)
where s² is the pooled estimate of the (assumed equal) population variances. Obviously the reliability of a sample statistic is proportional to the inverse of its standard error. So for increased power the standard error (of the difference between two independent means, here) must be reduced. It is evident from the formula that the standard error is reduced by increasing n_1 and n_2 and/or reducing s². Error variance is typically reduced by experimental control (which in practice usually means holding assumed extraneous variables constant) or by improved experimental design (of which blocking is the most familiar example). Thirdly, and perhaps most obviously, the likelihood of detection of an experimental effect depends upon the size of that effect. Large effects are relatively likely to be detected, smaller effects, less so.

Let us suppose, then, that the psychologist of our example can only obtain 20 children for his experiment. Further, suppose he standardizes the environmental conditions by ensuring that both control and experimental groups have the same stimulus toys in very similar rooms, and that he has so much faith in his 'imitation theory' that he expects the two groups to differ in mean number of aggressive behaviours by as much as one standard deviation. Now if the psychologist performed this experiment, and even if the two population means differed by one standard deviation (which in most psychological contexts is a large effect), he would have a barely better than even chance of obtaining a non-directional significant result at the 0.05 level. The power of this test is, in fact, 0.56 [4]. One could speculate as to whether any self-respecting psychologist would consider an experiment with so small a chance of 'success' worth performing.

Such speculation, however, need only be brief; psychologists do perform significance tests of very low power. Cohen (1962) defined small, medium, and large effects as 0.25σ, 0.5σ, and 1σ respectively. He applied these definitions to the research plans that appeared in one volume of the Journal of Abnormal
and Social Psychology and computed the power of the significance tests used with a 0.05 criterion of significance (non-directional) and the actual sample sizes employed. The results of this review were frankly alarming; the average power for large effects was 0.83, for medium effects, 0.48, and for small effects average power was 0.18. 'Taking medium effects as a conventional reference point, only one-fourth had as good as 0.60 power and the lowest quarter had less than 0.32 power!' (Cohen, 1965, p. 98).

It is curious, to say the least, that psychologists should undertake investigations which give them so little a priori likelihood of obtaining the significance level that they need to confer respectability upon their findings. If it is the case that psychologists typically employ tests with little regard for power we may distinguish five interrelated consequences, as follows.
As remarked earlier, from any significance test two types of error may result: the 'finding' of an effect when no true effect exists (type 1 error), and the failure to find a true effect (type 2 error). For any determined design and fixed sample size the probabilities of making these two types of error are inversely related. That is, any attempt to further guard oneself against making a type 1 error by, say, setting α = 0.01 rather than α = 0.05 automatically increases the probability of making a type 2 error. Similarly, the probability of making a type 2 error cannot be decreased (that is, the power of the test cannot be increased) without correspondingly increasing the probability of a type 1 error being committed. Power can, of course, be increased by improved experimental design and by increasing sample size.

Now implicit in the conventionally low probability of type 1 errors (0.05 or 0.01) is the view that scientific caution demands a safeguard against 'false positives', that is, erroneously discovering an effect, a relationship, an improvement, which does not really exist. At the same time, however, there is an implicit recognition that the stringency of such a safeguard must be balanced by a concern to find these effects if they do, in fact, exist. If this were not so the type 1 error could be set at a vanishingly low level of probability — and, moreover, there would be little justification for performing a statistical test at all; one could merely consult two-digit random number tables and announce a significant finding if 01, 02, 03, 04, or 05 appeared. Such a test would have the conventional type 1 error of 0.05 but the type 2 error would be 0.95. Hence the use of statistical tests.

The extent to which scientific caution need be exercised and the importance of discovery of an effect (alternatively, the costs of making type 1 and type 2 errors) will vary from situation to situation. This would imply that conventional significance levels should be abandoned and that with any particular piece of research α should be set with regard to the costs in hand. It might be objected that the estimation of costs of type 1 and type 2 errors is a subjective matter (and this is plainly the case in all but the most applied,
and usually non-social scientific, issues) and that the balancing of these should not be free to vary from researcher to researcher. In reply one could suggest that as long as researchers report the associated probability of their findings the reader, if he does not approve of the significance level adopted, can apply his own. Alternatively, one could accept that the balancing of the two types of error is subjective and that therefore a balance should be set conventionally. Cohen (1965) tentatively suggests the ratio of type 1 error to type 2 error should be 1 to 4. It should be noted, however, that this strategy would also involve a departure from conventional significance levels. The above discussion is predicated, of course, upon the setting, presumably by convention, of the size of the hypothesized effect.

The point is that in current psychological practice the sense of a balance between the two types of error is lost. Adherence to conventional levels of significance and neglect of power considerations imply an arbitrary and cross-situational assessment of the cost of a type 1 error with scant regard to the cost of a type 2 error. We shall later see that this state of affairs can be at least partly attributed to the influence of Fisher upon the development of statistical practice. The effect is most serious when power is necessarily low. In certain clinical areas, for example, the number of patients suffering from a particular disorder may be relatively small. Here there is a case for inflating the type 1 error although 'I here intend no advocacy of low-power studies ... but if the exigencies of reality put a psychologist in the position of reporting a statistical test of trivial power, he owes it to himself and to his audience to point the fact out. What I do advocate is that he know the power of the test' (Cohen, 1965, p. 97).
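Knowing the power of a test is, in computational terms, not demanding. The power of 0.56 quoted for the worked example (10 children per group, a true difference of one standard deviation, two-tailed α = 0.05) can be recovered from the non-central t distribution; the sketch below is illustrative only and again assumes Python with the scipy library.

    # Power of the independent-means t-test in the worked example:
    # n = 10 per group, a true effect of one standard deviation, alpha = 0.05.
    from scipy import stats

    n1 = n2 = 10
    d = 1.0                                    # hypothesized effect size in sd units
    alpha = 0.05
    df = n1 + n2 - 2

    t_crit = stats.t.ppf(1 - alpha / 2, df)    # approx. 2.10
    ncp = d / (1 / n1 + 1 / n2) ** 0.5         # non-centrality parameter, approx. 2.24

    # Probability that |t| exceeds the critical value when the effect is real:
    power = stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)
    print(round(power, 2))                     # approx. 0.56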
(2) Misinterpretation of non-significant results
We have just argued that if researchers are to continue to employ tests of significance (and this is begging the question which is faced squarely in Chapter 2) they should abandon their slavish adherence to conventional significance levels and pay due regard to power considerations. Failure to do so, if combined with a tendency to perform low-power tests, will lead to an unacceptably high proportion of spuriously non-significant results. This has two closely related implications.

First, and obviously, new and theoretically or practically important relationships will be less likely to be discovered. This will hinder the advance of science and perhaps lead to the frustration and demoralization of researchers.

Second, where lack of effect is hypothesized by a researcher a successful result, i.e. a non-significant statistic, admits of ambiguous interpretation. It may be due to the fact that no such effect exists, or alternatively the power of the test may be such that the a priori chance of finding such an effect, given
its existence, was inadequate. The literature abounds with examples of ambiguous non-significant findings in this sense. An example is the work of Ellsworth and Langer (1976) who investigated whether the extent to which people indulge in helping behaviour depends upon (a) the victim staring at the subject, and (b) the ambiguity of the situation. What was essentially a 2 × 2 analysis of variance design with 12 observations per cell yielded a result for the main effect (staring) of F(1,44) = 1.32: 'The hypothesis was confirmed: there was no main effect for staring' (Ellsworth and Langer, 1976, p. 117). If, however, we take the Popperian view that scientists should subject their hypotheses to severe tests this particular example is not very impressive. On the assumption that staring does affect helping but is a small effect (Cohen, 1965), as perhaps could have been hypothesized, Ellsworth and Langer gave themselves one chance in ten of finding a significant result at the 0.05 level! Even assuming that an effect of staring would have to be of medium size in order to be theoretically important, the test employed by Ellsworth and Langer had a power of only 0.4.

The above argument has a complementary aspect. If power is computed it makes sense, contrary to popular belief, to assert a hypothesis of no difference. The work of Nisbett and Temoshok (1976) illustrates this point. They note that several investigators have proposed that there are consistent individual differences in style of response to external stimuli. Proposed cognitive style dimensions have included 'field dependence', 'automatization', and 'stimulus binding'. The concepts and terminology of the various approaches are similar enough to have prompted the suggestion in the literature that the three cognitive styles are related or even identical. Nisbett and Temoshok investigated this claim by computing correlations among tasks used to assess the respective styles and found that only 5 out of 36 correlations were both significant and positive. They report 'scant evidence for generality of the external cognitive style dimension'. What they failed to report was that if the three cognitive styles are 'related or even identical' medium correlations of 0.3, at least, might have been expected. The size of their sample was such that the power to detect a correlation of this size, significant at the 0.05 level, was 0.92. They therefore possess very strong evidence that the three cognitive styles are not particularly related. A non-significant result when power is high is strong evidence that the hypothesized effect is not present (but see [5]).
(3) An increase in the number of type 1 errors in the literature
It should be clear that if researchers habitually employ tests of low power they will not obtain significant results as frequently as they otherwise might. What is perhaps not so obvious is that those research endeavours which do reach significance may contain a disproportionately high number of type 1 errors. It is sometimes thought that one advantage of the 0.05 (or 0.01) significance convention is that of all the significant findings in the literature a known small proportion, namely 5% (or 1%), are actually false rejections of the null hypothesis. This is simply not the case. Consider, for example, Table 1.2.1.
Table 1.2.1  Hypothetical outcomes of 5000 experiments with average power of 0.5 and α set at 0.05

                          State of World
    Decision        Null hypothesis true     Null hypothesis false
    Accept H_0      3800                     500 (type 2 errors)
    Reject H_0      200 (type 1 errors)      500
Let us suppose that in a given year 5000 experiments are performed, in 4000 of which the null hypothesis is actually true, so that in the other 1000 the alternative hypothesis is true [5]. It is evident that application of the conventional 0.05 level of significance will lead, on average, to 3800 correct acceptances of the null hypothesis and 200 incorrect rejections of the null hypothesis (type 1 errors). If, however, the power of the tests of real effects is as low as 0.5 (a value suggested by Cohen's (1962) findings for medium effects) only 500 of the 1000 experiments will result in correct rejections of the null hypothesis. It follows, then, that if all the significant results were published, the proportion of them that would be type 1 errors is not 5% but 29%!

From this example it can be seen that the proportion of type 1 errors in a set of significant results is a function not only of the conventional significance level but also of the power of the tests and also the proportion of cases in which the alternative hypothesis is true (what could be termed the 'a priori probability' of the alternative hypothesis).
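The arithmetic behind Table 1.2.1 generalizes readily, as the following sketch shows (in Python; the figures are the hypothetical ones of the table, not empirical estimates).

    # Proportion of 'significant' results that are type 1 errors, using the
    # hypothetical figures of Table 1.2.1.
    alpha = 0.05            # conventional type 1 error rate
    power = 0.5             # average power against real effects
    p_alt = 1000 / 5000     # proportion of experiments in which the null is false

    false_positives = alpha * (1 - p_alt)   # 0.04 of all experiments (200 of 5000)
    true_positives = power * p_alt          # 0.10 of all experiments (500 of 5000)

    proportion_type1 = false_positives / (false_positives + true_positives)
    print(round(proportion_type1, 2))       # approx. 0.29, i.e. 29%, not 5%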
(4) Sample instability, imprecise estimation
The power of a significance test is closely (and inversely) related to the standard error of the sample statistic. When power is low we have already seen that the discovery of a true effect is rendered relatively improbable. This is because the obtained sample statistic is certain to be unreliable. It follows from this that estimation of the corresponding population parameter will be necessarily imprecise. To anticipate Chapter 2, I shall be advocating the abandonment of significance testing in favour of estimation. Power in its narrow definition will be viewed as an irrelevant concept. However, precision, whose relationship to estimation is analogous with the relationship of power to significance testing, is a highly important statistical concept. Research programmes which are low in power are correspondingly low in precision (see section 2.5).
(5) Poor replicability of research findings
It follows directly from the fact that low power implies sample statistic unreliability that the habitual employment of low-power significance tests results in common failure to replicate previous experimental results. This is utterly inimical to the scientific enterprise. It results in an increase in (quasi-) type 1 errors and, equally damaging, a failure to recognize that two low-power experimental findings, although superficially incompatible (i.e. one significant, the other not so), may be in reality in reasonable agreement. Tversky and Kahneman (1971, p. 107) demonstrated this latter point. They presented psychologists with the following question:
    Suppose one of your doctoral students has completed a difficult and time-consuming experiment on 40 animals. He has scored and analysed a large number of variables. His results are generally inconclusive, but one before-after comparison yields a highly significant t = 2.70, which is surprising and could be of major theoretical significance. Considering the importance of the result, its surprisal value, and the number of analyses that your student has performed — would you recommend that he replicate the study before publishing? If you recommend replication, how many animals would you urge him to run?

The psychologists to whom this question was put overwhelmingly recommended replication. The median suggested number of animals to be run was 20. This follow-up question was then put (Tversky and Kahneman, 1971, p. 107):

    Assume that your unhappy student has in fact repeated the initial study with 20 additional animals, and has obtained an insignificant result in the same direction, t = 1.24. What would you recommend now? Check one (the numbers in parentheses refer to the number of respondents who checked each answer):

    (a) He should pool the results and publish his conclusion as fact. (0)
    (b) He should report the results as a tentative finding. (26)
    (c) He should run another group of (median = 20) animals. (21)
    (d) He should try to find an explanation for the difference between the two groups. (30) [6]
As Tversky and Kahneman point out, responses (b) and (c) can be justified on some grounds but the most popular response, response (d), is indefensible. The point is that even if the first finding were an accurate reflection of the true size of effect (which may be a generous concession bearing in mind the surprising nature of the findings) the power of the proposed replication (for a two-tailed test, α = 0.05) is only 0.43. Even for a one-tailed test the power is only 0.57. Hence it is evident that the replication was about as successful as its power deserved. We need seek no 'explanation for the difference between the two groups' other than that it reflects the vagaries of random sampling. It is ironic that perhaps the prime purpose of subjecting data to tests of significance is to discourage the formulation of fanciful hypotheses when the more parsimonious explanation of sampling variation cannot be ruled out.
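These replication power figures can be recovered from the information in the question itself: with 40 animals, t = 2.70 corresponds to an effect of roughly 0.43 standard deviations, and a replication with 20 animals then has approximately the powers quoted. The sketch below (in Python, assuming scipy, and treating the before-after comparison as a one-sample t-test) illustrates the calculation; depending on the approximation used, the output may differ from the quoted values in the second decimal place.

    # Power of the recommended replication in the Tversky and Kahneman example.
    from scipy import stats

    d = 2.70 / 40 ** 0.5          # effect size implied by the original study, approx. 0.43
    n_rep = 20                    # animals in the proposed replication
    df = n_rep - 1
    ncp = d * n_rep ** 0.5        # non-centrality parameter, approx. 1.91

    # Two-tailed test at alpha = 0.05:
    t_crit_two = stats.t.ppf(0.975, df)
    power_two = stats.nct.sf(t_crit_two, df, ncp) + stats.nct.cdf(-t_crit_two, df, ncp)

    # One-tailed test at alpha = 0.05:
    t_crit_one = stats.t.ppf(0.95, df)
    power_one = stats.nct.sf(t_crit_one, df, ncp)

    print(round(power_two, 2), round(power_one, 2))   # close to 0.43 and 0.57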
Tversky and Kahneman, then, have indirectly replicated Cohen's finding that psychologists underestimate power (Cohen studied the power of actual research plans; Tversky and Kahneman studied the power of recommended research plans). Further, having recommended a low-power study it seems that many psychologists proceed to misinterpret the subsequent results. Examples in the literature of psychologists straining to account for the 'discrepancy' between significant and non-significant low-power studies are commonplace.

Summary
Statistical power should be explicitly considered in the planning and interpretation of any significance test [7]. Negligence of power considerations may have five deleterious consequences.

First, where sample size is necessarily small, power will be relatively low. It may make sense to compensate by adjusting α, the type 1 error. More generally, it is far from obvious that keeping α constant at an arbitrarily determined value, whilst allowing β, the type 2 error, to fluctuate from application to application, constitutes a rational inference procedure.

Second, in view of Cohen's evidence that many empirical investigations are low in power, it is likely that some important experimental effects go undetected. A related point is that given the obtaining of a non-significant result a mistaken inference may result. It may be due to the fact that the null hypothesis is true, or nearly so — or it may be due to a large effect being investigated with a low-power design.

Third, the literature may contain a spuriously high number of 'false positives'. Mere adherence to a conventionally set α level does not ensure that the proportion of reported significant results that actually reflect true, or nearly true, null hypotheses is only 5%.

Fourth, although it is frequently possible, and always desirable, to construct an interval estimate from a reported significance test, where power is low the resultant confidence interval will be unacceptably imprecise.

Fifth, when power is low it is difficult to replicate previously obtained experimental effects. Further, failure to do so is liable to lead to barren controversy and irrelevant attempts to 'explain' apparent discrepancies.
1.3 The Meaning of Associated Probability and Type 1 Error
Let us return to the imaginary experiment outlined in section 1.1. It will be remembered that the experimenter entertained the hypothesis that the level of violence portrayed on television affects the degree of aggression subsequently exhibited by child viewers. Twenty children were randomly assigned, ten each, to one of two conditions. The statistical analysis was an independent-means t-test where the dependent variable was the number of aggressive behaviours displayed during a fixed time period. In passing, we can note that the statistical power of this particular design against a medium effect size is 0.18, and against a large effect it is 0.39. Fortunately for the experimenter, however, and for
whatever reason, the result of the test was t = 2.55 with 18 d.f. This was significant at the p < 0.05 level of significance, the associated probability being p = 0.02. In this section we will examine more carefully the meaning of these two probability statements.

The central point regarding probability statements in so-called 'classical statistics' is that probability is defined as a limiting value of a relative frequency of occurrence of certain event classes. The frequentist position has been propounded by Venn (1888) and discussed in detail by von Mises (1957); essential to it is the notion of 'the long run'. The problem in defining associated probability or the probability of committing a type 1 error is in each case to identify the reference class of the long run. That the reference class appropriate to a situation is not always obvious has been noted by Barnard (1947) (see Chapter 6), and I shall have more to say on the particular problem of identifying the population towards which statistical inference is made in Chapter 7.

For our immediate purposes it will suffice to regard the reference class for a typical analysis of data to be the set of indefinite (usually imaginary) repetitions of the experiment which gave the result in question, conditional upon the null hypothesis being true. It is important to realize that all probability statements are conditional and it is incumbent upon the researcher to communicate to his readership, as far as possible, the remaining conditions under which his obtained associated probability is valid. This is usually achieved in the Method section of a report where characteristics of the subject pool, apparatus, procedure, etc. are detailed. Other technical conditions are either ensured or assumed; examples are random sampling, normal distribution of errors, homogeneity of variance (where appropriate), etc.
Associated Probability
The associated probability of a certain test result, then, is the long-run proportion of a certain class of events relative to the totality of events, given that certain 'states of the world' obtain. The particular class of events varies from statistical test to statistical test, but it is usually the obtaining of a derived test statistic at least as large or at least as extreme as that obtained in the experiment in question. Whilst there is no way to rank order the importance of the 'givens' referred to above, the one that is most pertinent to this discussion is that the null hypothesis should be true. We can define associated probability, somewhat informally, as the long-run relative frequency with which data 'at least this extreme' would be obtained given that the null hypothesis is true. This we can express as: associated probability of datum = P(D_e | H_0 true) [8].

Apart from the informality of its exposition no serious statistician would question the above definition; nothing has been written which is in the least controversial. Equally uncontroversially we can describe an alternative view of associated probability which is categorically mistaken: 'Associated probability is the probability that the null hypothesis is true given that certain data have been obtained.' We can express
this erroneous view of associated probability as P(H_0 true | D_e). This is an attempt to make a probability statement about a proposed state of the world (i.e. a hypothesis) given the observation of certain data. But under a relative frequency view of probability 'the probability of the truth of an hypothesis' is meaningless. The sense of the long run is lost. 'Our probability theory has nothing to do with questions such as: "Is there a probability of Germany being at some time in the future involved in a war with Liberia?"' (von Mises, 1957, p. 9).

Further, we can demonstrate informally that even if this view of associated probability were meaningful, it would be false. That is, even if it were meaningful to talk of the probability of the truth of the null hypothesis (and for some who do not hold to the relative frequency view of probability it is meaningful) the associated probability would not yield the correct value. The point is simply illustrated. Suppose, for example, we wish to examine the view that the coin used by an unusually successful cricket captain is biased. The coin is tossed six times and the null hypothesis, that the coin is fair, is subjected to a significance test. Suppose the coin lands 'heads' three times and 'tails' three times; the probability of obtaining data as extreme as this is clearly 1. It is equally clearly absurd to conclude that in the light of this 'experiment' the coin is certain to be fair!

A second example will clarify the true relationship between P(D_e | H_0 true) and P(H_0 true | D_e). It is occasionally reasonable to give a relative frequency interpretation of the probability of a hypothesis. For example, suppose the Royal Mint were known to manufacture exactly equal proportions of fair and biased coins and that from a vast number of freshly minted coins one is chosen at random. We can perhaps regard this coin as having a probability of being fair of 1/2 [9]. Fair coins, of course, will produce a long-run relative frequency of 'heads' of 0.5. Suppose a biased coin produces a long-run relative frequency of 'heads' of 0.8. Now suppose that the selected coin is tossed six times and lands 'tails' on every occasion. The relative frequency with which a fair coin will yield results as extreme as six tails is, by the binomial expansion, 1/32 = 0.031. Hence a significance test of the null hypothesis (the coin is fair) would lead to its rejection at the 0.05 level of significance. Of course, such an analysis and such an interpretation is out of order. The datum, strange as it is, offers powerful evidence in favour of the null hypothesis. The true posterior probability of the null hypothesis is derived by applying Bayes's theorem:
P(Hi/D) = P(D/Hi)P(Hi) / Σj P(D/Hj)P(Hj)

where P(Hi) is the a priori probability of hypothesis i. In our example the posterior probability that we have a fair coin given that six ‘tails’ have been obtained is

P(Ho True/D) = (0.0156 × 0.5) / [(0.0156 × 0.5) + (0.000064 × 0.5)]
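The arithmetic can be checked directly. The following is a minimal sketch in Python (mine, not part of the original text); the only inputs are the figures quoted in the example above.

```python
# Posterior probability that the Royal Mint coin is fair, given six tails,
# using the values from the example: P(heads) = 0.5 for a fair coin, 0.8 for
# a biased coin, and equal prior probabilities of 0.5.
p_data_fair = 0.5 ** 6              # = 0.0156
p_data_biased = 0.2 ** 6            # = 0.000064
prior_fair = prior_biased = 0.5

posterior_fair = (p_data_fair * prior_fair) / (
    p_data_fair * prior_fair + p_data_biased * prior_biased)
print(round(posterior_fair, 3))     # ~0.996: the datum strongly favours the null
```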
Now the purpose of this example is not to suggest that this sort of situation typically arises in the social sciences, nor even to suggest that if it did researchers would analyse the data wrongly; it is simply to point out that where it is at least reasonable to talk in terms of the probability of a hypothesis in the light of data obtained that probability is not the same as what we have described as the associated probability. The example also illustrates that if one wished to have a formal or informal indication of the ‘probability’ that the null hypothesis (or any other) is true given some datum, one would require to know not only the plausibility of that datum given the truth of the hypothesis but also its plausibility given the truth of the competitors of that hypothesis. We shall have more to say about the ‘likelihood ratio’ in Part II. That associated probability is not the same as the probability of the null hypothesis given the datum will be acknowledged by any statistician and has been noted by some psychologists (Wilson, 1961; Bakan, 1967), but textbooks rarely make the distinction explicit and some are downright misleading and mistaken. Consider, for example, the ambiguity of: ‘when we get a result which is very unlikely to have arisen by chance we say that the result is statistically significant’ (Moroney, 1951, p. 218), or the fallaciousness of: ‘A hypothesis is rejected if the probability that it is a true statement is very low — lower than some predetermined probability .. . the level of significance’ (Roscoe, 1969, p. 145). Again, the editors of a collection of readings on significance tests assert ‘.. . Thus, any difference in the groups on a particular variable in a given assignment will have some calculable probability of being due to errors in the assignment procedure ...’ (Morrison and Henkel, 1970, pp. 195-196). If the confusion as to the meaning of associated probability is at all widespread two possible explanations can be suggested. First, although when presented as ‘the probability of obtaining data given the truth of a hypothesis’ and ‘the probability of the truth of a hypothesis given obtained data’, the two interpretations are quite readily distinguishable — in many cases the distinction is very much more blurred. Compare for example (a) ‘The probability of obtaining this result by chance is 0.05’, with (b) ‘The probability that this result was obtained by chance is 0.05’. Second, much confusion may have arisen from the terminology popularized by Fisher (1935). He referred to the relative frequency with which data would occur given the truth of the null hypothesis (i.e. what we have called ‘associated probability’) as the ‘likelihood of the null hypothesis’. Now while it may be possible for statisticians to avoid being misled by this term we can perhaps sympathize with the social scientist who identifies ‘probability’ with the term ‘likelihood’. So with regard to the hypothetical findings on television programmes and aggression, from t = 2.55, 18 d.f., p = 0.02, a researcher may conclude that if there is no difference between the mean aggressiveness of children exposed to
the violent programme and those exposed to the control programme, then in an indefinitely large number of replications of his experiment he would obtain samples differing in mean aggressiveness by as great or greater a margin as he obtained on this occasion just 2% of the time. He may not conclude that ‘the probability that there is no difference between the mean aggressiveness of children exposed to the different programmes is 0.02’. Nor, obviously, could he conclude that ‘the probability that there is a mean difference is 0.98’.

A second misconception as to the meaning of associated probability is that it can be taken as a measure of the stability, reliability, or replicability of a particular empirical finding or statistical decision. There is no truth in the supposition that the associated probability, p, or more often its complement, (1 - p), has anything but a very indirect bearing on the replicability of an experimental finding. It might be tempting, for instance, to suppose that if our psychologist were to repeat his experiment the probability that the second experiment would yield a result significant at the 0.05 level is 0.98 (i.e. 1 - 0.02) or perhaps 0.95 (i.e. 1 - 0.05); it is neither. The particular probability of a successful replication is, in fact, ~0.67. This value is derived by assuming the variance of scores in the two experiments to be equal and computing the power of the second test relative to the effect size found in the first experiment. While it is true to say that the value (1 - p) derived from a first experiment is, other things being equal, monotonically related to the power of an exact replication, the relationship is not linear and certainly (1 - p) is not equal to the required power.

We have seen that the power of a replication of an independent-means t-test design when the first experiment has an associated probability of 0.02 is approximately 0.67 (this value holds independently of sample size provided that the sample sizes employed in the replication equal those of the original study). Suppose then, psychologist B, suspecting that A’s results were an artefact of the particular violent film shown and did not represent a generalizable conclusion about violence on television, decided to perform an exact replication of A’s study, merely substituting a different violent programme. Suppose B’s results were in the same direction as A’s but were not significant. It would be folly surely for B to assert that A’s findings were indeed artefactual.

There is clearly no support whatever for the view that associated probability, or its complement (1 - p), is a direct index of the reliability of a particular experimental result. It is therefore surprising to find this fallacy perpetuated in the literature. For example, English, concerned by the confusion between statistical significance and practical importance, suggested that we
substitute ‘statistical stability’ for ‘statistical significance’. The term ‘stability’ points to the essential idea that the value in question will not change randomly. It thus has a more specific connotation than ‘significance’ ... If the new term is adopted, we shall have fewer term papers in which the student proudly avers that some minuscule difference must be taken seriously merely because it can — probably! as he usually forgets to add — be replicated 97 times in a hundred. (English, 1954, p. 158)
This error was compounded in the influential Annual Review of Psychology by Jones (1955, p. 406) who thought that the problem referred to by English ‘might be relieved partially by the change in terminology proposed’. More recently, Morrison and Henkel (1969) under the heading ‘Errors of the analytic nature of inferences based on significance tests’ write:

the interpretation of a ‘significant’ finding as a ‘precise’ or ‘reliable’ estimate of the parameter can be quite misleading. A correlation coefficient of 0.34 that is ‘significant’ at the 0.05 level under the typical null hypothesis does not necessarily mean that the statistic itself is a reliable or precise estimate of the parameter value (it is your best estimate, and may be a good estimate if your sample is of the right design and large enough). A subsequent correlation of 0.12 or correlations that average 0.15 on repeated samples may also fully vindicate the decision to reject the null hypothesis. What is reliable then, is your decision to reject the null hypothesis, not necessarily the statistic that led to that decision.

The first part of this quotation is entirely unobjectionable but the last sentence is simply false. It is not easy to speculate as to the origin of such a patently false interpretation of associated probability. Perhaps the confusion was fuelled by Fisher’s often-quoted statement: ‘... we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result’ (Fisher, 1948, p. 14). The statement contains no falsehood but is, perhaps, not very enlightening; it certainly seems confusable with what is a false statement: ‘We may say that a statistically significant result will rarely fail to be experimentally demonstrable.’

To summarize: our psychologist may not assert that his finding (t = 2.55, 18 d.f., p = 0.02) implies that a replication would have a probability of being significant (two-tailed at the 0.05 level) equal to 0.98, nor equal to 0.95. This probability can, however, be computed from the available data but is perhaps not intuitively to be guessed. In this example the probability of a successful replication is about 0.67. It is of interest to note that in order for the probability of a successful replication to be 0.95 the associated probability of the first result would have needed to have been about 0.0001!
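The ~0.67 figure can be reproduced by treating the observed effect as the true effect and computing the power of an exact replication. Below is a rough sketch in Python using scipy (neither the code nor the library is part of the original text); it assumes, as the worked example does, ten observations per group.

```python
from scipy import stats

# Power of an exact replication of the two-sample t-test (n = 10 per group),
# taking the observed t = 2.55 (p ~ 0.02, 18 d.f.) as the true noncentrality.
n_per_group = 10
df = 2 * n_per_group - 2                     # 18 degrees of freedom
t_obs = 2.55                                 # result of the first experiment
t_crit = stats.t.ppf(0.975, df)              # two-tailed critical value, alpha = 0.05

power = (stats.nct.sf(t_crit, df, t_obs) +   # replication lands above +t_crit
         stats.nct.cdf(-t_crit, df, t_obs))  # or below -t_crit (negligible here)
print(round(power, 2))                       # ~0.67, as stated in the text
```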
The meaning of the probability of a type 1 error

It is possible to distinguish between the probability of a type 1 error and associated probability, although in the majority of cases they take the same numerical value. (This is because for those statistics whose distributions vary according to the degrees of freedom at hand, in order to save space most texts print only critical values with conventional associated probabilities. This is usually the case for t, F, and χ2 statistics, for example.) To the average consumer the type 1 error also serves the purpose of being an upper bound upon
the associated probability. The misinterpretations of associated probability detailed above are also, therefore, erroneous if applied to the type 1 error. There is, however, an additional meaning of the probability of a type 1 error, together, inevitably perhaps, with a misinterpretation.

Once again the appropriate way to view the probability of committing a type 1 error is as a limiting value of a relative frequency of event classes. To be specific, if the type 1 error of an experiment is α, this means that if an indefinitely large number of experiments were performed, and each time the null hypothesis were evaluated according to the rule ‘if associated probability < α reject null hypothesis, otherwise accept null hypothesis’, then of all the occasions on which the null hypothesis is true (100 × α)% would result in the rejection (obviously wrongly) of the null hypothesis. Once more, any definition which in effect inverts the conditionality inherent in the above statement is false. Thus it is true to say that given that the null hypothesis is true, the probability that it will be rejected is α. It is false to say that given the null hypothesis has been rejected the probability that it is true is α. Nor, as we have seen earlier, is it true that (100 × α)% of all decisions to reject the null hypothesis are erroneous. Statements which embody these misconceptions are not hard to find:
If we always reject a hypothesis when the outcome of the experiment has probability of 0.05 or less, and if we consistently apply this standard, then we shall incorrectly reject 5 per cent of the hypotheses we test. (Edwards, 1970, p. 23)

Here Edwards has omitted the word ‘null’ before ‘hypotheses’; the effect is to alter and invalidate the statement.

... if he is restricted to a total n of 80 cases, with this plan he must reject the null hypothesis with 17% α risk. Formally no problem exists; an eventual positive conclusion must carry a one-in-six risk of being spurious. (Cohen, 1965, p. 99)
If the probability associated with the occurrence under the null hypothesis of a particular value in the sampling distribution is very small, we may explain the actual occurrence of that value in two ways: first we may explain it by deciding that the null hypothesis is false, or second, we may explain it by deciding that a rare and unlikely event has occurred. Occasionally, of course, the second may be the correct one. In fact, the probability that the second explanation is the correct one is given by α, for rejecting Ho when in fact it is true is the type 1 error. (Siegel, 1956, p. 14)
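The inversion can also be seen numerically. The following simulation sketch is mine, not the book’s; the 10% base rate of true null hypotheses and the effect size for false nulls are purely illustrative assumptions. It shows that the proportion of true nulls that get rejected is α, while the proportion of rejections that involve a true null is something else entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: 10% of tested null hypotheses are true, and false
# nulls correspond to a mean shift of 0.8 standard deviations with n = 20.
n_experiments, alpha, n = 100_000, 0.05, 20
null_true = rng.random(n_experiments) < 0.10
shift = np.where(null_true, 0.0, 0.8)

# Simple one-sample z-test approximation of each experiment's test statistic.
z = rng.normal(loc=shift * np.sqrt(n), scale=1.0)
reject = np.abs(z) > 1.96

print(round(reject[null_true].mean(), 3))   # ~0.05: P(reject | H0 true) = alpha
print(round(null_true[reject].mean(), 3))   # ~0.006 here: P(H0 true | reject) != alpha
```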
Summary

The underlying philosophy of significance tests assumes a relative frequency definition of probability. In order to interpret associated probability, therefore, we have to identify an event class of hypothetically repeatable sample results. Associated probability refers (usually) to the tail area of sample results as extreme, or more so, as the obtained sample statistic; it relates to the relative frequency of sample results given the truth of the null hypothesis.
A statement of the form ‘the probability of the truth of the null hypothesis given that this sample statistic has been obtained is x’ is termed a statement of inverse probability. Such statements, although frequently encountered in statistics texts, are meaningless under a relative frequency conception of probability. Even in contrived examples where it is arguably meaningful to talk of the probability of a hypothesis given sample results that probability is not given by the associated probability. Nor is associated probability an index of the stability or replicability of an empirical finding. This requires a power computation.

Likewise the probability of a type 1 error refers to the relative frequency of sample statistics that will lead to (erroneous) rejection of the null hypothesis given that the null hypothesis is true. As before, statements referring to the probability of the null hypothesis being true given the decision to reject it, are statements of inverse probability and are therefore meaningless under the relative frequency definition of probability that is inherent in significance tests.
Notes

[1] Strictly speaking there are exceptions to this rule; if, for example, the statistical hypothesis were that the population range were R1-R2 then a sample observation outside this interval would logically negate the hypothesis. Such a situation is extremely rare.
[2] Of course one of the other assumptions underlying the use of the test might have been violated. It may be, for example, that the populations from which the samples are drawn are markedly non-normal. As indicated earlier, this type of ‘technical’ misuse of the significance test will be largely ignored, not least because many of the tests have shown themselves to be remarkably robust in such respects.
[3] I have reviewed the issue of the legitimacy of directional tests of significance in Oakes (1979).
[4] The computation of the power of a significance test is usually fairly simple; most of the results that follow, however, have been drawn from tables published by Cohen (1969).
[5] I shall later argue that the proportion of cases in which the point null hypothesis is actually true is virtually nil. However, the principle advanced here also applies to situations where the null hypothesis is ‘quasi-true’, that is, when the null hypothesis is relatively diffuse and covers those cases where the size of the effect is not great enough to be of theoretical or applied interest. Such occasions are probably by no means rare.
[6] The use of the word ‘unhappy’ in Tversky and Kahneman’s question is unfortunate. Their results could conceivably be due to a demand characteristic.
[7] This is a flat repudiation of one aspect of Fisher’s account of significance tests. I shall argue in Chapter 6 that Fisher’s denunciation of the Neyman-Pearson concept of power is indefensible.
[8] Where the suffix ‘e’ denotes that the event class referred to is not simply the very datum obtained but includes the observation of all statistics ‘more extreme’. Note the suffix is absent in the formula attributed to Bayes.
[9] Strict relative frequentists would not allow this either, refusing to attach a probability to a single event, but it suits the present purpose.
Chapter 2 A Critique of Significance Tests
In this chapter we shall examine some of the arguments that have been presented which suggest a reassessment or even abandonment of significance testing. There is a fairly substantial body of literature on this issue and an attempt will be made to organize and evaluate the various themes. In particular we shall attempt to avoid a trap which seems to have bedevilled much of the published discussion. There has been a tendency for writers to argue from grand theoretical positions, and opponents who may not share what seem to be essential assumptions have appeared to underestimate the force of certain criticisms. For example, we shall see that many of those who criticize the use of significance tests adopt an explicitly Bayesian view of probability. Classical statisticians (by which phrase is meant those statisticians — and social scientists — who espouse a frequentist interpretation of probability) are typically loath to entertain the notion of subjective or personal probability and are therefore in danger of ignoring the merits of some arguments which do not, in fact, depend for their cogency upon an acceptance of all the more controversial Bayesian tenets.

So in this chapter we shall attempt to evaluate the contribution of significance testing without prejudice, as far as possible, to the question of whether subjective probabilities are admissible. This task completed, Part II will be devoted to those criticisms of classical techniques in general which depend for their force upon differing interpretations of fundamental issues in statistical inference. Chief among these, of course, will be the Bayesian challenge. In brief, this chapter will be restricted to an examination of those issues in statistical inference which differentiate significance tests from the major relative frequency-based alternatives, namely point and interval estimation.
2.1 Decision versus Inference [1]
The first criticism of significance tests that we shall examine is that they represent a calculus for decisions for action, whereas the appropriate role of
evidence in most situations in which a researcher finds himself is to offer rational grounds for a change in the degree of support to which a scientific hypothesis is entitled. This point of view is expressed most forcibly in the psychological literature by Rozeboom (1960). He illustrated his argument with reference to a hypothetical experiment conducted by a certain ‘Igor Hopewell’, but for convenience we shall recast the issues in the context of the hypothetical study of violence on television to which past reference has been made. The design, it will be remembered, was a two-sample independent-means t-test with 10 observations per group. The null hypothesis was that violence on television has no effect on subsequent levels of aggressive behaviour (i.e. Ho: μx = μy). The alternative was simply H1: μx ≠ μy. The critical value (α = 0.05) for this test is t = ±2.10 (18 d.f.).

Rozeboom points out that in this situation all possible values of t between -2.10 and +2.10 have the same interpretative significance, namely, that (μx - μy) = 0, and that all other values of t imply or signify that (μx - μy) ≠ 0. Now, our psychologist entertains the substantive hypothesis that violence portrayed on television increases the level of subsequent aggressive behaviour, and Rozeboom objects that for the purposes of assessing the relative merits of the alternate and null hypotheses, a value of t = 1.9, say, should surely have more in common with a t value of 2.2 than with a value of -1.9. This point has been made with reference to a two-tailed test, but the criticism loses none of its force if applied to one-tailed tests. In similar vein Eysenck (1960) asserted that two hypothetical results with associated probabilities equal to 0.056 and 0.048 were in ‘excellent agreement’ and that to differentiate them on the grounds of their statistical significance is nonsensical. I shall shortly suggest that the degree of agreement between two results should neither be assessed by their similarity of significance nor by their similarity of associated probability. This caveat aside, the position taken by Rozeboom and Eysenck has much to commend it.

Now suppose our psychologist, in line with accepted procedure, adopted as his type 1 error α = 0.01 prior to data collection. The implied critical value of t is ±2.9. If the obtained statistic were to be 2.4 he would be obliged by the model to accept the null hypothesis. It might occur to him that had he been more modest (or less conservative) and set α = 0.05 then he would instead be concluding in favour of the alternative hypothesis. As Rozeboom (1960, p. 420) says, ‘surely the degree to which a datum corroborates or impugns a proposition should be independent of the datum-assessor’s personal temerity.’ These problems arise, Rozeboom (1960, p. 420) believes, because

the null hypothesis significance test treats acceptance or rejection of a hypothesis as though these were decisions one makes on the basis of the experimental data — i.e. that we elect to adopt one belief, rather than another, as a result of an experimental outcome. But the primary aim of a scientific experiment is not to precipitate decisions, but to make an appropriate adjustment in the degree to which one accepts, or believes, the hypothesis or hypotheses being tested. And even if the purpose of the experiment were to reach a decision, it could not be a
decision to accept or reject the hypothesis, for decisions are voluntary commitments to action — i.e. are motor sets — whereas acceptance or rejection of a hypothesis is a cognitive state, which may provide the basis for rational decisions but is not itself arrived at by such a decision.
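The arbitrariness that Rozeboom complains of is easy to make concrete. The short sketch below (mine, using scipy; not part of the original argument) shows that the same t = 2.4 with 18 d.f. falls inside the rejection region at α = 0.05 but outside it at α = 0.01, while the associated probability itself is unaffected by the analyst’s choice.

```python
from scipy import stats

# Two-tailed critical values for the worked example (18 degrees of freedom).
df, t_obs = 18, 2.4
for alpha in (0.05, 0.01):
    t_crit = stats.t.ppf(1 - alpha / 2, df)      # ~2.10 and ~2.88
    print(alpha, round(t_crit, 2), abs(t_obs) > t_crit)

# The associated probability does not depend on the preset alpha.
print(round(2 * stats.t.sf(t_obs, df), 3))       # ~0.027
```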
Thus proponents of the significance test are confronted by a major difficulty; namely, what does one do when one does not believe the result of a significance test? If one acts (whatever that may imply) according to the outcome of the test one is making the significance test shoulder the entire burden of induction — what Bakan scathingly terms ‘automaticity of induction’. He subsequently asserts, ‘when we reach a point where our statistical procedures are substitutes instead of aids to thought, and we are led to absurdities, then we must return to common sense’ (Bakan, 1967, p. 29). If, on the other hand, one allows one’s surprise or expectations to colour one’s judgement or decision, what of the much-vaunted objectivity of the significance test? This and related arguments will be analysed more fully in Part II, where the Bayesian position is assessed.

In so far, then, as a significance test is a procedure for making decisions about hypotheses rather than for altering belief in them in the light of data, Rozeboom considers that it cannot properly be regarded as an inferential procedure at all. Hogben (1957), without by any means acceding to Rozeboom’s explicitly Bayesian outlook, nevertheless reaches the same conclusion. Significance testing in the Neyman-Pearson tradition is fundamentally committed to laying down rules of behaviour before the data are collected (what Hogben calls ‘the forward look’) as opposed to inferring parameter values in the light of data (‘the backward look’). It will be appreciated that all statements of inverse probability are, in effect, ‘backward looks’. Barnett (1973, p. 157) makes the same point, drawing the distinction between ‘initial precision’ and ‘final precision’ (after Savage, 1961). Of ‘initial precision’ he says, ‘the procedure of using the sample mean (or some other measure) to estimate μ, could be assessed in terms of how well we expect it to behave ... before we take our data’. The alternative concept of ‘final precision’ has ‘no formal place on the classical approach’.

However, seen only as a decision procedure, the null hypothesis significance test is, in Rozeboom’s view, ‘woefully inadequate’. In order to make an effective decision the utilities of various alternatives must be considered and this is not done explicitly in significance testing. This much was, however, acknowledged by Neyman and Pearson (1933b, p. 293), who with regard to costs noted: ‘From the point of view of mathematical theory all that we can do is to show how the risk of the errors may be controlled and minimized. The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator.’ Here, quite apart from freely allowing subjectivity into the statistical domain, Neyman and Pearson were clearly advocating a flexible approach to the setting of significance levels. It must be admitted, then, that the typical practice of taking decisions with regard to a rigid,
conventional type 1 error criterion, in neglect of power considerations, is indeed questionable.

Although Rozeboom is surely justified in claiming that for the majority of published work a decision is not appropriate and that a science is a systematization of (probable) knowledge not an accumulation of decisions, it is fair to point out that some decisions have to be made — with regard to whether or not one should publish, or should discontinue a line of research, etc. It is debatable, however, whether statistical significance constitutes a rational decision rule in these contexts. It is easy to find examples in the literature of the inappropriate use of significance tests in the making of research decisions (as opposed to hypothesis testing).

It might seem that the above criticisms of the decision aspect of significance tests do not really apply to social science practice because rather than act on the basis of whether or not a statistic reaches significance, most researchers compute the associated probability for their finding and assess the plausibility of the null hypothesis accordingly. We have seen that Eysenck recommends such a practice as do a number of authors of statistical texts (Mather, 1942; Lacey, 1953; Underwood et al., 1954). Used in this way the significance test is clearly an altogether more inferential tool. In my view most researchers do indeed act in the above manner — or, rather, they do not use preset α levels. But Bakan (1967, p. 14) strongly objects to ‘taking the p value as a measure of significance ... The practice of “looking up” the p value for t ... rather than looking up the t for a given p value, violates the inference model.’ Rozeboom is not alone, however, in finding such strictures unacceptable:
To turn to the case of using accept-reject rules for the evaluation of data, ... for the purpose of getting suggestions from data, it seems clear that it is not possible to choose an α beforehand. To do so, for example, at 0.05 leads to all the doubts that most scientists feel. One is led to the untenable position that one’s conclusion is of one nature if a statistic t, say, is 2.30 and one of a radically different nature if t equals 2.31. No scientist will buy this unless he has been brainwashed and it is unfortunate that one has to accept as fact that many scientists have been brainwashed. (Kempthorne, 1971, p. 490)

Whereas the ‘accept-reject rules’ may demand that t be looked up for a given p value there is nothing in the classical theory of statistical inference to prevent a researcher computing the associated probability of his data and drawing appropriate inferences. This view is encapsulated by Barnett (1973, p. 160):

If the test is believed to be purely inferential in function it is difficult to see the relevance of costs, or of a prior choice of a significance level. On this viewpoint the responsibility for acting on the result of the test falls on a second party — the statistician infers, the user acts on the joint basis of the inference and what he knows about costs and consequences. In this respect (if the function of the test is solely inferential) it would seem that it is only the critical level, not the significance
level, which matters since this most accurately represents the inferential importance of the test. No significance level needs to be assigned; costs and consequences do not seem to have even the tangential relevance to such an assignment that they possess in a decision-making situation.

There is, therefore, some truth in the contention that researchers produce an unsatisfactory concoction of the concepts of Neyman-Pearson theory of the rules of inductive behaviour together with the practice and interpretation of a roughly inferential process. But to argue as Bakan (1967, p. 13) does that ‘p is not a measure’, or that a p value of 0.05 is not ‘less significant than a p value of 0.01’, is unwarranted.

Bakan goes on to suggest that there is a study in the literature ‘which points up sharply the lack of understanding on the part of psychologists of the meaning of the test of significance’. The study is by Rosenthal and Gaito (1963, p. 33) who asked 19 academic psychologists ‘to rate their degree of confidence in a variety of p levels: once assuming an n of 10, and then assuming an n of 100’. The dependent variable, ‘degree of belief’, took values between 5 (extreme confidence or belief) and 0 (complete absence of confidence or belief). Twenty-eight observations were taken from each subject; one for each of the two sample sizes across 14 associated probability levels. The reader will surely sympathize with the subjects who were required to make sense of these instructions! However, in Bakan’s opinion, ‘what is shocking is that these psychologists indicated substantially greater confidence or belief in results associated with the larger sample size for the same p value ... how could a group of psychologists be so wrong?’ (Bakan, 1967, p. 15).

Interestingly, Rosenthal and Gaito share their subjects’ views! Bakan (1967, p. 15) states that ‘one can be more confident with a small n than with a large n’, whereas in Rosenthal and Gaito’s paper (1963, p. 37) we read, ‘objectively speaking one would expect greater belief attached to the larger sample size at every p level’. But, to the extent that the question is meaningful, both the above answers to it are wrong. Bakan’s (1967, p. 15) reasoning is this:
the probability of rejecting the null hypothesis for any given deviation from null and p value increases as a function of the number of observations. The rejection of the null hypothesis when the number of cases is small speaks for a more dramatic effect in the population: and if the p value is the same, the probability of committing a type 1 error remains the same. Thus one can be more confident with a small n than with a large n. The argument contains a number of difficulties:
(1) The reference to the type 1 error is confused. ... An estimator is consistent if it converges to the parameter value as n → ∞ in probability, and is strongly consistent if it converges as n → ∞ with probability 1. Consistency is almost universally regarded as an essential property of a reasonable estimator. There are no inferential grounds whatsoever for preferring a small sample. Of course, in many cases sample size is limited by investigators because the information contributed exhibits diminishing returns relative to the cost of sampling. Such concerns are central to the decision theory approach, but are not of an essentially inferential nature and will not be further examined here.
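Consistency, the property appealed to above, is easy to illustrate by simulation. The following sketch is not from the book; the Bernoulli parameter and the sample sizes are arbitrary choices. It shows the estimation error of a sample proportion shrinking as n grows, which is the sense in which a larger sample can never be an inferential handicap.

```python
import numpy as np

rng = np.random.default_rng(1)

# Mean absolute error of the sample proportion as an estimate of p = 0.5,
# across 1000 replicate samples at each sample size.
for n in (10, 100, 10_000):
    estimates = rng.binomial(n, 0.5, size=1000) / n
    print(n, round(np.abs(estimates - 0.5).mean(), 4))   # error shrinks as n grows
```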
Summary

The significance test has been criticized, notably by Rozeboom, for being inappropriate to basic research because it is a decision procedure, whereas an inferential process is most typically required. This distinction is important and is widely acknowledged by statisticians. Viewed solely as a decision procedure one must surely agree with Rozeboom that significance tests are, for the vast bulk of social research, inappropriate. However, stripped of the significance criterion, the same computations yield merely the associated probability and many writers argue that this is the useful and inferential outcome of the significance test, although much research appears to fall between the two stools. One of the most influential critics of the significance test, however, has argued that the practice of assessing an obtained associated probability ‘violates the inference model’ — the accept-reject rules of statistical prudence. But Neyman and Pearson have no more monopoly of the uses to which a significance test may be put than did Fisher. It has been argued in this section that the associated probability of a significance test is indeed amenable to an inferential interpretation; its exact inferential import is certainly overestimated by researchers and I shall be arguing that by itself it is of minimal use, but Bakan has clouded the discussion by claiming that associated probability is not a ‘measure’. Much worse, Bakan
has used his argument to offer comfort to researchers who already have a penchant for employing inadequate sample sizes.
2.2 The Asymmetry of Significance Tests
We saw earlier that one of Rozeboom’s objections to the practice of performing significance tests was that the outcome of the test, acceptance or rejection of the null hypothesis, is at least partly determined by the ‘datum-assessor’s personal temerity’. A second objection made by Rozeboom was that the outcome of a significance test, in terms of which of two, say, scientific theories gains support, is partly determined by which of the two theories is identified with the null hypothesis.

To illustrate, suppose our psychologist studying the effect of violence on television upon subsequent aggressive behaviour finds a mean increase of 15% in frequency of aggression in those exposed to televised violence compared with the control group. It may be that this is not sufficient to reject a null hypothesis of no difference at conventional levels of significance — perhaps the associated probability is calculated to be 0.1. After recovering from his disappointment at failing to gain support for his theory that violence on television has detrimental effects on the nation’s youth (a disappointment that, on inspection of the power of his design, he might have anticipated), it might occur to him to calculate the increase in aggression that his theory actually predicted. Of course, it is rarely possible with psychological theories to do this exactly, but with ‘a few simplifying assumptions no more questionable than is par for this sort of course’ the predicted increase may perhaps have been 20%. Now taking this value as defining the null hypothesis the obtained data may be very plausible, yielding perhaps an associated probability of 0.75. So, if the ‘no effect’ theory is identified with the null hypothesis the result will be non-significant, the null hypothesis will be accepted, and the ‘no effect’ theory will gain support. If, on the other hand, the ‘harmful effect’ theory is identified with the null hypothesis the result will be non-significant, the null hypothesis will be accepted, and the ‘harmful effect’ theory will gain support.
31 witticism lays himself open to the riposte that to argue in favour of the null hypothesis, on the basis of data ascribed a p value of even 0.75 by that hypothesis when the power of the test has not been even hinted at, is hardly more astute. Second, Rozeboom implies that which of two theories gets associated with the null hypothesis is a function of the academic climate of the time: ‘and is it not strange that what constitutes a valid statistical argument should be dependent upon the majority opinion about behaviour theory?’ (Rozeboom, 1960 p. 419). In fact, the stated policy of both the Fisherian and NeymanPearson schools is that the experimenter’s theory^should be identified~wirtrttre aiternaTe~KvDOthesis — irrespective of majority opinion — and this is indeed the dominant practice of researchers. Thus Rozeboom is correct in asserting that the significance test is asymmetric, but wrong in implying that the nature of this asymmetry is typically capricious; it is conventional. Finally, we might note in passing that a psychologist, who wished neither to test the ‘no effect’ theory nor to test the ‘harmful effect’ theory, but who wished merely to assess the effect of the violence of a particular television pro gramme upon subsequent behaviour, would conclude that the data, such as there was of it, suggested a 15% increase in aggressive behaviour but that this estimate was subject to a specifiable degree of sampling variation. And is that not the most helpful way to interpret the results? Following Rozeboom’s article there was a flurry of papers debating whether a new theory need necessarily be associated with the alternative hypothesis (Grant, 1962; Binder, 1963; Wilson and Miller, 1964; Edwards, 1965; Wilson et al', 1967). Grant (1962) criticized what he viewed as a growing tendency for psychologists to identify a new theory with the null hypothesis and hence to attempt to gain support for the theory from a failure to attain a significant result. We shall see later that the orthodox Fisherian statistician would be sympathetic to Grant’s cause because_Fisher^vehemently denied that a non-significant result implied acceptance of the null hypothesis: Jhe_iiul£__ hypothesis is never proved or established. JautUs-possibly disproved in the course of experimentation’ (Fisher, 1948, p. 16) That is, for the Fisherian, the significance~test deads" either to the rejection of the null hypothesis or the suspension of judgement. Grant pointed out that the identification of the experimental theory with the null hypothesis has the effect of rewarding the experimenter for designing an insensitive study. Binder (1963) replied, quite properly, that this was merely a particular instance of the general category of bad experimentation. Calling identification of the new theory with the null hypothesis ‘acceptance-support’ (a-s), and identification with the alternative hypothesis ‘rejection-support’ (r-s), Binder argued persuasively that the two processes were very comparable and that r-s had no convincing priority. Interestingly, Binder (1963, p. 110) begins his paper by noting that;
one can be led astray unless he recognizes that when one tests a point prediction
32 he usually knows before the first sample element is drawn that his empirical hypothesis is not precisely true. Consider testing the hypothesis that two groups differ in means by some specified amount. We might test the hypothesis that the difference in means is 0, or perhaps 12, or perhaps even 122.5. But in each case we are certain that the difference is not precisely 0.0000... ad inf., or 12.0000 . . ., or 125.50000 ... ad inf.
Hence Binder suggests that the proper way to carry out a-s research is to specify a ‘zone of indifference’ — in other words a diffuse null hypothesis — and to ensure that the probability of discovering an effect that matters is respectably high. That is, Binder suggests that an a-s strategy requires the computation of power. This point has been made in section 1.2 and will not be laboured here. What is curious, however, is that Binder, concerned lest a theory which ap proximates to the truth should be rejected despite its heuristic value, recom mends procedures to limit power. In this respect Grant and Binder are in agreement:
After arguing against a-s on the basis of the dangers of tests which tend toward leniency, Grant points out that the procedure may be equally objectionable when the test is too stringent. He illustrates the latter by an example of a theory which is useful though far from perfect in its predictions. This particular point is perfect ly in accord with my arguments since it demonstrates the parallelism of a-s and r-s. First one does not usually want an experiment that is too stringent: in a-s because it may not be desirable to reject a useful, though inaccurate theory: in r-s because one may accept an extremely poor and practically useless theory. Second, one does not want an experiment that is too lenient or insensitive: in a-s because one may accept an extremely poor and practically useless theory: in r-s because it may not be desirable to reject a useful, though inaccurate theory (Binder, 1963, p. HD These are remarkable sentiments. Binder is apparently advising a psychologist who suspects he may be entertaining an ‘extremely poor and practically useless theory’ to alleviate the danger under r-s by choosing a small sample! Likewise, a psychologist entertaining a ‘useful, though inaccurate theory’ should limit his sample size under a-s lest the Great Statistical Arbiter should force him (against his better judgement?) to abandon it. Toigpeat, fefrjjiepurpose of statistical inference, the larger the sample the 'better. The scenario Binder presents is absurd. The larger~the~sample size theT rnore stable the estimate of effect size; the better the information, the sounder the basis from which to make a decision (if a decision is necessary). It is ironic that Bakan, a severe critic of significance tests, and Binder, a pillar of orthodoxy, are united in the view that for sound inference samples can be too large. It will be evident by now that there are strong grounds for preferring estima tion procedures to hypothesis testing, but it may be remarked that significance tests accompanied by operating characteristic curves (power curves) yield substantially the same information as confidence intervals. Natrella (1960, p.
21), for example, states that from the operating characteristic curve, ‘although we cannot deduce the exact width of the confidence interval, we can infer the order of magnitude’. Natrella’s paper is an exposition of the relationship between confidence intervals and tests of significance, and is an attempt to indicate why many statisticians prefer to use the former. In the event Natrella’s case is made rather more strongly than she can have intended:
Suppose that we have measured 100 items, have performed a two-sided t-test (does the average μ1 differ from μ0?) and have obtained a significant result. Look at the curve for n = 100 in figure 2, which plots the probability of accepting Ho (the null hypothesis) against d = |(μ1 - μ0)/σ|. From the curve we see that, when d is larger than 0.4, the probability of accepting the null hypothesis is practically zero. Since our significance test did reject the null hypothesis, we may reasonably assume that our d = |(μ1 - μ0)/σ| is larger than 0.4, and thus perhaps infer a bound for the true value of |μ1 - μ0|, in other words, some ‘confidence interval’ for μ1. (Natrella, 1960, p. 21)

The only construction that I can put upon this is that Natrella is suggesting that d = |(μ1 - μ0)/σ| = 0.4 is a lower bound for |(μ1 - μ0)/σ|. This is surely false: indeed, if the test were just significant at the 0.05 level (the level against which the power curve was computed) the obtained value of d would be 0.198! The upper bound for d would be 0.396. Now if a statistician can make a mistake like this and have it published in the American Statistician we can hardly be sanguine as to the prospects for the psychologist who is already loath to compute the power of his significance tests.

Obviously it is possible to interpret such data properly. For example, suppose we are only interested in detecting a deviation of a population mean from a theoretically determined value if it is greater than σ/3 in either direction. With α set at 0.05 and sample size 100 we can compute the power of our study to be 0.91. Now if the result turns out non-significant we can be at least 90% confident that the effect size is no greater than σ/3 [4]. This, it cannot be denied, is useful information — but much the same results from the computation of the 90% confidence interval with the added advantage that the best estimate of the effect size is considered explicitly in its construction. Indeed failure to consider the best point estimate led to Natrella’s mistake. However, if the result were significant we could not deduce from the power analysis that we can be at least 90% sure that the effect size is greater than σ/3. Interval estimation avoids such pitfalls.

Much confusion results from the determination of researchers to submit their data to null hypothesis decision procedures. The consequence (or, perhaps, the cause) is an obsession with the wrong question; rather than asking whether an effect exists they should ask ‘What is the effect size?’ It may be objected that to ask whether an effect exists at all logically precedes estimation of its size. This is the view of Binder (1963) and Winch and Campbell (1969), but the case is not compelling. First, it is obvious that a zero effect is an effect size like any other — its estimation is certainly not precluded by a failure to
conduct a significance test. Second, and more important, the ‘logical’ order leads to absurdities in the case of non-significant results. Thus if a violent television programme produces on average a 15% increase in aggressive behaviour, which, by virtue of inadequate sample size is non-significant at conventional levels, the question, ‘Is there an effect?’ is answered in the negative. The subsequent question, ‘How large is the effect?’ (is it admissible to ask it?) is answered ‘15%’. Interval estimation, I suggest, would alleviate such difficulties and perhaps we should not have to read solemn discussions as to the extent to which we should cloud our vision lest we should not like what we see.

We continue our consideration of the asymmetry of significance tests with an examination of two numerical examples: the first is a hypothetical problem posed by Edwards (1965); the second, a re-analysis by Pitz (1968) of data reported by Hershberger (1967). Edwards (1965, p. 401) examines this scenario:

the boiling point of statistic acid is known to be exactly 150 °C. You, an organic chemist, have attempted to synthesize statistic acid; in front of you is a beaker full of foul-smelling glop, and you would like to know whether or not it is indeed statistic acid. If it is not, it may be any of a large number of related compounds with boiling points diffusely (for the example that means uniformly) distributed over the region from 130 °C to 170 °C. By one of those happy accidents so common in statistical examples, your thermometer is known to be unbiased and to produce normally distributed errors with a standard deviation of 1 °C. So you measure the boiling point of the glop, once.
Edwards’s interest lies in the inference which might be drawn from an observation (relatively) close to 150 °C. The example, of course, justifies the use of the classical critical ratio test with a standard deviation of 1 °C.

Suppose that the glop really is statistic acid. What is the probability that the reading will be 151.96 °C or higher? Since 1.96 is the 0.05 level on a two-tailed critical ratio test, but we are here considering only the upper tail, that probability is 0.025. Similarly the probability that the reading will be 152.58 °C or greater is 0.005. So the probability that the reading will fall between 151.96 °C and 152.58 °C, if the glop really is statistic acid, is 0.025 - 0.005 = 0.02. What is the probability that the reading will fall in that interval if the glop is not statistic acid? The size of the interval is 0.62 °C. If the glop is not statistic acid, the boiling points of the other compounds that it might be instead are uniformly distributed over a 40 °C region. So the probability of any interval within that region is simply the width of the interval divided by the width of the region, 0.62/40 = 0.0155. So if the compound is statistic acid, the probability of a reading between 151.96 °C and 152.58 °C is 0.02, while if it is not statistic acid that probability is only 0.0155. Clearly the occurrence of a reading in the region, especially a reading near its lower end, would favour the null hypothesis, since a reading in that region is more likely if the null hypothesis is true than if it is false. And yet, any such reading would lead to a rejection of the null hypothesis at the 0.05 level by the critical ratio test (Edwards, 1965, p. 410).
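Edwards’s figures are easy to verify. Here is a minimal sketch in Python using scipy (the code and the library choice are mine, not the book’s); all numerical inputs come from the passage above.

```python
from scipy import stats

sigma = 1.0                      # thermometer error standard deviation
lo, hi = 151.96, 152.58          # the interval Edwards considers

# Probability of a reading in the interval if the glop is statistic acid
# (boiling point exactly 150 degrees C, normally distributed measurement error).
p_null = (stats.norm.sf(lo, loc=150, scale=sigma)
          - stats.norm.sf(hi, loc=150, scale=sigma))

# Probability of a reading in the interval under the diffuse alternative
# (boiling point uniform over the 40-degree region from 130 to 170).
p_alt = (hi - lo) / 40.0

print(round(p_null, 3), round(p_alt, 4))   # ~0.02 and ~0.0155
```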
Edwards uses this superficially plausible argument to support his contention that not only are null hypothesis significance tests asymmetrical in their operation but that ‘classical procedures for statistical inference are always violently
biased against the null hypothesis.’ This point of view was first expounded in a much more mathematically elegant form by Lindley (1957). The argument was presented by Lindley in the form of a paradox: if H is a simple hypothesis and x the result of an experiment, the following two phenomena can occur simultaneously: (1) a significance test for H reveals that x is significant at, say, the 5% level; (2) the posterior probability of H, given x, is, for quite small prior probabilities of H, as high as 95%. We shall pay closer attention to this paradox in Chapter 6 where the whole Bayesian approach will be evaluated. For the moment it is important to note that Edwards, in his attempt to demonstrate the ubiquitous bias of significance tests, quotes Lindley’s theorem as his justification.

To be fair to Edwards he prefaced his discussion of the statistic acid problem by observing that his treatment was ‘extremely crude’. His oversimplification, however, met with a forceful rejoinder from Wilson et al. (1967) and in the resultant controversy some important issues were clouded. Edwards remarked that the choice of a 40 °C range for the alternative hypothesis affected the numerical details of his problem but did not affect the general principle. However, Wilson et al. found fault with the arbitrary nature of the interval in which the datum was seen to have fallen. There certainly seem to be no compelling reasons to select the interval with width 0.62 °C. An improvement suggested by Wilson et al. is to compute the probability of obtaining an observation as close or closer to the value predicted by the null hypothesis as the datum actually occurring [5]. In the statistic acid example the probability of obtaining a reading as far from 150 °C as 152 °C is 0.05 (roughly), if the null hypothesis is true. The probability of obtaining a reading in the 148-152 °C range if the alternative hypothesis is true is 4/40 = 0.1. Arguing in this way Wilson et al. assert that the rejection of the null hypothesis is appropriate.

Accepting, as we must, that Wilson et al.’s treatment of the statistic acid problem is better than that of Edwards, we should nevertheless examine his general contention that it is possible to obtain data which render the null hypothesis more plausible than before the data were in — and yet lead to a rejection of that null hypothesis. For even with Wilson et al.’s method of analysis such an eventuality is possible. Thus, had the range of the alternative hypothesis been some quantity, R, greater than 80, then the probability of obtaining a datum in the 4 °C interval specified above, if the alternative hypothesis is true, is 4/R (i.e. less than 0.05). Wilson et al. feel that situations in which the alternative hypothesis may reasonably span 80 standard deviations do not occur frequently enough to justify concern. A number of comments may be made in addition:

(1) It is not necessarily unreasonable to reject a hypothesis which has gained credibility in the light of data — if the initial credibility were low and not greatly increased subsequently. Such a situation might not be untypical given a very diffuse alternative hypothesis.
(2) Point (1) above is predicated, as is Edwards’s problem itself, upon a decision being necessary. We have already suggested that in the vast majority of situations a decision is not required [6]. These arguments will not be repeated here, but we can at least note that if our organic chemist has to decide whether or not he has statistic acid it would be folly for him to decide in the affirmative. The foul-smelling glop is more likely to be parametric acid, to name but one alternative, since this boils at 152 °C, the observation obtained.

(3) It is evident that if the ‘statistic acid hypothesis’ is rejected at the 0.05 level of significance then a 95% confidence interval for the true boiling point of the liquid will not include 150 °C, the boiling point of statistic acid. Nor, however, would a 95% credible interval — the Bayesian equivalent — and it is a Bayesian analysis which in general Edwards recommends to his readers.
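Wilson et al.’s version of the comparison, and the point at which it flips, can be sketched numerically. The code below is mine, not the book’s; it simply re-expresses the 4/40 and 4/R figures above, alongside the passage’s rounded value of 0.05 for the null tail probability.

```python
from scipy import stats

# Probability, under the null, of a reading at least 2 degrees from 150.
p_null = 2 * stats.norm.sf(2.0)          # ~0.0455, the text's "0.05 (roughly)"

# Probability of landing in the 148-152 range under uniform alternatives of
# increasing width R; once 4/R drops below ~0.05 the datum favours the null.
for R in (40, 80, 160):
    print(R, round(p_null, 3), round(4.0 / R, 3))
```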
In this respect Edwards’s citing of Lindley’s statistical paradox is misleading. Lindley was quite clear as to the conditions under which his paradox obtained: ‘we first note that the phenomenon would persist with almost any prior probability distribution that had a concentration on the null value and no concentrations elsewhere ... it is, however, essential that the concentration on the null value exists’ (Lindley, 1957, p. 188). Edwards, in his problem, specified a completely uniform prior (so far as may be judged). If, of course, our organic chemist has to make some decision in the light of an observation of 152 °C when he had strong a priori reason to suspect he had synthesized statistic acid, whole new issues are opened up which must wait until Chapter 6 for discussion.

If, then, Edwards sought to show that by its asymmetric handling of competing hypotheses the significance test could give rise to inappropriate decisions, it must be concluded that the example he chose was unfortunate and his analysis misleading. A more impressive illustration of this issue is given by Pitz (1968). Pitz’s example, drawn from the psychological literature, is a re-analysis of data reported by Hershberger (1967) purporting to disconfirm a theory proposed by Day and Power (1965). Subjects were required to identify the direction of rotation of a stimulus based only upon the horizontal movement of components of the stimulus. Day and Power predicted that subjects would be unable to identify the direction of rotation but, of Hershberger’s 48 subjects, 32 perceived the direction of rotation as that being simulated. Hershberger rejected the null hypothesis (identified with Day and Power’s theory) at the 0.05 level of significance by use of a two-tailed binomial test, and concluded that subjects can perceive direction of rotation in these circumstances.

We may paraphrase Pitz’s approach as follows: under the binomial expansion the probability of observing exactly r successes in n independent trials where the probability of any individual success is p is nCr p^r (1 - p)^(n - r).
Under the null hypothesis subjects can do no better than guess the direction of rotation — hence p = 1/2. The likelihood of Hershberger’s finding under the null hypothesis is then

48C32 × 0.5^32 × 0.5^16    (1)
Suppose now that Hershberger is correct and that subjects can perceive direction of rotation. Let the probability that a subject is successful be p* (p* > 1/2). Clearly, Hershberger could not envisage p* = 1 or only correct reports by all subjects would fail to refute his theory. The likelihood of the obtained result given Hershberger’s theory is
nCr p*^r (1 - p*)^(n - r) = 48C32 p*^32 (1 - p*)^16    (2)
We may now equate likelihoods (1) and (2) to find that value of p* for which the obtained results lend equal support to both Hershberger’s and Day and Power’s theories:

p*^32 (1 - p*)^16 = 0.5^48,

whence p* ≈ 0.8.
This analysis clearly demonstrates that in rejecting (with Hershberger) Day and Power’s view that subjects are unable to perceive direction of rotary motion we are making a strong assumption about the alternative hypothesis. If it had been envisaged that, were subjects able to perceive direction, the probability of success would be greater than 0.8, then the results speak for the null hypothesis rather than against it. We should note that Pitz, like Edwards, is a Bayesian and thus prefers to work with likelihood ratios rather than tail area probabilities. While this distinction is important it is not crucial to the argument presented here. That is, it is being suggested that it is the asymmetry of significance tests which leads to pitfalls in inference, since data are evaluated in the light of one hypothesis only and without regard to the possible alternative values a parameter may take. One way to avoid this asymmetry is to compute the likelihood distribution, another is simply to compute a confidence interval which, as Rozeboom (1960, p. 427) puts it, treats all hypotheses with ‘glacial impartiality’. So, the 95% confidence interval for p, the probability of success, given Hershberger’s data is approximated by (0.53 < p < 0.80). Armed with this interval an investigator is unlikely to conclude in favour of a theory which predicts a value of p greater than 0.8.
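Both of Pitz’s numbers — the break-even value of p* and the approximate confidence interval — can be recovered with a few lines of code. The sketch below is mine, using numpy and scipy (the bracketing interval for the root search is an arbitrary choice), and works on the log scale to avoid underflow.

```python
import numpy as np
from scipy import optimize

n, r = 48, 32   # Hershberger's data: 32 of 48 subjects reported the simulated direction

# Value of p* at which the data support the 'can perceive' theory and the
# guessing hypothesis (p = 1/2) equally: p*^32 (1 - p*)^16 = 0.5^48.
def log_support_ratio(p):
    return r * np.log(p) + (n - r) * np.log(1 - p) - n * np.log(0.5)

p_star = optimize.brentq(log_support_ratio, 0.70, 0.99)
print(round(p_star, 2))                      # ~0.81, the text's "about 0.8"

# Approximate 95% confidence interval for p (normal approximation).
p_hat = r / n
half_width = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
print(round(p_hat - half_width, 2), round(p_hat + half_width, 2))   # ~(0.53, 0.80)
```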
Summary

Significance tests treat the null and alternative hypotheses asymmetrically. Edwards attempted to demonstrate that significance tests are always violently biased against the null hypothesis, but his numerical example was flawed. However, as Pitz has shown, when we reject a null hypothesis we may often be making an implicit, and perhaps unwarranted, inference as to the true parameter value. These arguments are predicated upon there being no prior
concentration of (Bayesian) probability upon the null hypothesis. The 'glacial impartiality' of confidence intervals circumvents these difficulties. Pitz's reanalysis brings into sharp relief issues which the Edwards-Wilson debate left unclear. The significance test does not consider possible values of the unknown parameter other than that value which is to be 'nullified'. This entails two further dangers; firstly that the corroborative power of the significance test may be lower than perhaps researchers imagine, and secondly that while the unknown parameter may not be equal to the nullified value it may nevertheless be unsuspectedly close to the nullified value. These issues will be dealt with in sections 2.3 and 2.4 respectively.
2.3 The Corroborative Power of Significance Tests
I have hinted in the previous section that, on those occasions where there may exist grounds for strong a priori belief in the truth of the null hypothesis, the traditional significance test may be accused of bias against the null hypothesis and inappropriate inferences may result. The nature and role of prior belief in statistical inference is highly contentious and we shall examine the issues in Chapter 6. In the meantime we note that Edwards's arguments vis-a-vis the statistic acid problem were invalidated to a marked extent because he did not postulate a degree of prior belief in the null hypothesis and for Lindley's paradox to operate 'it is essential that the concentration (of prior probability) on the null value exists'. It is an undisputed fact that significance tests, whether performed in the Fisherian or Neyman-Pearson traditions, make no assumptions whatsoever about the prior probability (or plausibility) of the null hypothesis. It is clear, then, that apart from the asymmetrical manner in which it is treated by the significance test, the null hypothesis is regarded by classical statisticians as a point hypothesis like any other. That is to say, that it is, like any other point hypothesis, almost always false. But as Edwards says (1965, p. 402): 'If a hypothesis is preposterous to start with, no amount of bias against it can be too great. On the other hand, if it is preposterous to start with, why test it?' This section is devoted to an analysis of the ramifications of this pointed question. All that follows hinges upon the proposition that in typical research applications the null hypothesis has an a priori probability of almost zero, or as frequentists might prefer, the null hypothesis in typical research applications is almost always false. It might, however, seem necessary to support this assertion. Consider then:

If the null hypothesis is not rejected, it is usually because the N is too small. If enough data are gathered, the hypothesis will generally be rejected. If rejection of the null hypothesis were the real intention in psychological experiments, there usually would be no need to gather data. (Nunnally, 1960, p. 643)

It seems to me, however, that most significance tests used in practice should really
be estimation problems. We do not wish to know whether some treatment has any effect — it almost certainly has — but only what is the magnitude of this effect, if appreciable, or whether it is positive or negative. (Smith, 1962, p. 61)

In many experiments it seems obvious that the different treatments must have produced some difference, however small, in effect. Thus the hypothesis that there is no difference is unrealistic: the real problem is to obtain estimates of the sizes of the differences. (Cochran and Cox, 1957, p. 5)

Before turning to the corroborative implications of the proposition we may note that consideration of the a priori falsehood of the null hypothesis renders agonizing over significance levels almost totally irrelevant. Remembering that setting α = 0.05 means that a type 1 error, the rejection of a true null hypothesis, will occur in 5% of the occasions on which we test true null hypotheses — it is clear that we are talking about 5% of (almost) nothing. Type 1 errors never occur! Perhaps the epistemological implications of testing null hypotheses known a priori to be false have been most forcefully indicated by Meehl (1967). He pointed out that if the null hypothesis is never true then failure to obtain a significant result must be attributed solely to a lack of power. It follows that as a study is improved by exercising more control, running more subjects, etc. the probability of rejecting the null hypothesis approaches 1. Now, as the experimenter's theory is habitually identified with the alternative hypothesis, the probability that the experimental theory will be 'corroborated' if it yields a non-directional alternative hypothesis approaches 1 as power approaches 1. At this point we have departed slightly from Meehl's argument, since he deals only with directional alternative hypotheses because 'in recent years it has been more explicitly recognized that what is of theoretical interest is not the mere presence of a difference but rather the presence of a difference in a certain direction'. However, whether one performs a two-tailed test and then notes if the effect is in the predicted direction, or one performs a one-tailed test because an effect in a certain direction is expected, Meehl's contention is unaltered: 'the effect of increased precision, whether achieved by improved instrumentation and control, greater sensitivity in the logical structure of the experiment, or increasing the number of observations, is to yield a probability approaching 1/2 of corroborating our [directional] substantive theory by a significance test, even if the theory is totally without merit' (Meehl, 1967, p. 111). I have added the word 'directional' because where a theory does not predict a difference in a specified direction the probability of its being corroborated approaches 1. For example, we may entertain the theory that the relative positions of celestial bodies at the time of birth of an individual have a determining effect upon subsequent personality development. I would have difficulty inventing a theory that would be more 'without merit' than this. It does, however, admit a non-directional prediction: that Leos, for example, will differ along a variety of personality dimensions from Pisceans. It is held by those of us who
believe in the a priori falsehood of null hypotheses that so long as a large enough sample is taken this 'celestial influence' theory will be 'corroborated' at any required level of significance. Most psychologists would, one presumes, prefer to account for obtained differences in non-astrological terms. A number of alternative explanations come to mind: stress experienced by the pregnant mother may affect personality development in the child and stress may vary systematically with the months of the year by being correlated with the weather or the fiscal year; infant health and mortality will vary systematically with time of year, etc. Now if a large sample test revealed that Leos were significantly less sociable than Pisceans, with an associated probability of less than 0.001, it would not shake one's disbelief in the celestial influence theory, yet it is not uncommon to read of psychologists' barely concealed delight at the obtaining of very small associated probabilities ('the results were massively significant'). A small associated probability bears upon the plausibility of the (already implausible) point null hypothesis. It bears very loosely upon the size of an obtained effect. Taken by itself it has no bearing whatsoever upon the plausibility of the proposed explanatory substantive theory. In the example given the celestial influence theory cannot be truly corroborated by a significance test because, as Meehl and others point out, if power is high enough a significant result is ensured. Certain corroboration is no corroboration at all. It may be objected that the celestial influence theory is utterly unrepresentative of theorization in social science. But every time a researcher tests whether white noise affects the learning of a skilled task, or whether riskiness of decision varies with size of group, or whether Conservative voters differ from Labour voters in ethnocentricity, he is effectively testing a theory like that of celestial influence. It should be stressed that this is not a clarion call for directional tests of significance; if a significance test is unavoidable a directional theory should be tested non-directionally. A directional theory, however, is to be preferred to a non-directional theory. Indeed, a non-directional theory is no theory at all. At this point we return to Meehl's argument. Whereas a non-directional theory is, with sufficient power, certain to be 'corroborated', a directional theory without any merit at all has a 0.5 probability of being corroborated as power tends to unity. As Meehl says, we should not be very impressed with a meteorological theory that successfully predicts that it will rain on 17 April — given that we know that on average 50% of April days are wet! This, then, is the degree of critical examination to which we submit our theories when we attempt to corroborate them by use of standard significance tests. Meehl goes on to contrast this procedure with that of the physical sciences wherein from the substantive theory is derived a relatively sharp, often point, prediction. It is this that is tested by observation. It follows that in physics and the related disciplines the parent theory is subjected to ever more critical examination as measurement techniques, in their broadest sense, improve. That is, as power increases the 'observational hurdle' that the theory must
clear becomes greater. The opposite is the case in the social and behavioural sciences. The reason is clear; in the physical sciences the substantive theory is associated with the null hypothesis and to the extent that it defies rejection it commands respect. In psychology and the social sciences the substantive theory is associated with the alternative hypothesis and is corroborated as the null hypothesis is rejected. In this sense the observational hurdle which the theory must clear is lowered as power or experimental precision is increased. This is the great weakness of identifying a theory with the alternative hypothesis; to defend such a practice as Binder (1963) and Wilson et al. (1967) do, on grounds of 'scientific caution', is a nonsense. It follows from the argument set out above that an experimental finding, if in the direction predicted by a substantive theory, is not very diagnostic; it does not enable us to determine clearly whether the theory has merit or no merit. For, as we have seen, whereas the obtained finding may be derived deductively from a theory with merit, it will also occur with 0.5 probability if the theory has no merit. In contrast, an unexpected finding is highly diagnostic; it speaks strongly against the substantive theory because if the theory had merit the probability of obtaining such a result approaches zero as power increases. Meehl suggests, however, that this marked asymmetry of diagnostic power is not reflected in the research literature:

The writing of behavioural scientists often reads as though they assumed — what is hard to believe anyone would explicitly assert if challenged — that successful and unsuccessful predictions are practically on all fours in arguing for and against a substantive theory. Many experimental articles in the behavioural sciences, and, even more strangely, review articles which purport to survey the current status of a particular theory in the light of all available evidence, treat the confirming instances and the disconfirming instances with equal methodological respect, as if one could, so to speak, 'count noses', so that if a theory has somewhat more confirming than disconfirming instances, it is in pretty good shape evidentially. Since we know that this is already grossly incorrect on purely formal grounds, it is a mistake a fortiori when the so-called 'confirming instances' have themselves a prior probability, as argued above, somewhere in the neighbourhood of 1/2, quite apart from any theoretical consideration. (Meehl, 1967, p. 112)
Rather than treat a disconfirming instance as a crisis for the substantive theory the researcher tends, Meehl suggests, to blame some auxiliary assumption, many of which are readily to hand. This ad hoc explanation becomes the focus of the next experiment in what may appear to be 'an integrated research programme'. For example, perhaps the group decision did not shift to risk although responsibility was manifestly diffused, because of the very formal layout of the room in which the decision was taken. Does formality of setting affect riskiness of group decision? The reader will recognise this 'theoretical issue' as a close relation of the celestial influence theory. The answer to the
question is 'yes' and if a simple yes/no answer is required no follow-up experiment is necessary. In this way a great number of 'assumptions' and 'factors' can be investigated without ever seriously testing the central theory of, say, diffusion of responsibility. Such a researcher's 'true position is that of a potent-but-sterile intellectual rake, who leaves in his merry path a long train of ravished maidens but no viable scientific offspring' (Meehl, 1967, p. 114). Before we go on to consider related issues concerning the corroborative power of significance tests it will be useful to examine briefly some of the implications of Meehl's contribution. First, it has been argued that it is fruitless to seek corroboration for a non-directional substantive theory because as power tends to unity then 'corroboration' becomes certain. It may be objected that in typical social research power nowhere approaches unity and therefore a relatively strong inference may be made; namely that if a non-significant result is obtained, while we should not conclude that the null hypothesis is true, we may at least infer that the experimental effect, though present, is small in magnitude. This, it may be argued, is a justification for testing non-directional theories. However, as remarked earlier and detailed in the next section, the significance test addresses itself to the question 'whether?' and not to the (better) question 'how much?' It is exceedingly dangerous to infer from a relatively high associated probability that effect size must be small. We shall see in Chapter 3 that non-significant results may be much more replicable than psychologists may imagine. That effect size may be much larger than imagined may be viewed as another side of the same coin. We shall also see that a significant result may speak for a very much smaller effect than is intuited by psychologists. If the end result of statistical analysis is to be to infer that an effect size is small, large, or medium-sized then the significance test is not the tool for the job. It is simply not addressed to such issues. Second, it will be appreciated that because traditional alternative hypotheses have a high probability (either 1/2 or 1) of being true a priori, the parent substantive theory which gives rise to them need not be very sophisticated. For example, it is not hard to find explanations for differences in personality between Leos and Pisceans — two or three were given earlier. This ease of 'theoretical' formulation is extremely dangerous because it hampers the development of explanatory systems which anywhere approach the complexity of the individuals and organizations we study. Indeed, the crucial distinction between a statistical hypothesis and a substantive theory often breaks down. To perform a significance test a substantive theory is not needed at all; the result, it seems to me at least, is that many 'substantive theories' are statistical hypotheses (with high a priori plausibility). It will be remembered that in this chapter we are attempting to review criticisms which distinguish tests of significance from their classical statistics counterparts, point and interval estimation. If researchers were routinely to report their findings in terms of appropriate confidence intervals stripped
of misleading assertions as to their significance, the onus to establish the relevance of such an interval to some theoretical formulation would fall more explicitly upon the writer considering submission and upon the editor considering publication. It may be argued, indeed this is Meehl's own position, that psychology, and perhaps the other social sciences, are not ready for relatively precise statistical predictions, that a move towards greater quantification in our theoretical systems would be premature. This surely misses the point. If after 40 years of significance testing we are not ready to identify our theories with the null hypothesis, what grounds exist for supposing that a further 40 years' significance testing will do the trick? Of course, precise predictions are only a distant prospect — precisely because we have not been building up repositories of reliable estimates of effect sizes, but have been speciously accepting and rejecting theories that are barely worthy of the term. The vicious circle must be broken by routine presentation of results in point and/or interval estimation terms. To the extent that such results can be used to throw light upon tolerably interesting theories, well and good. If such theoretical speculation does indeed seem premature, no matter; we must get down to the task of accumulating useful, accurate information. If the social sciences are to progress they must be rooted in sound measurement [7]. Meehl's important paper highlighted the problem of corroboration in psychology compared with that of the physical sciences and much of the blame was placed upon the structure and process of the significance test. In a complementary contribution Lykken (1968) illustrated this point with reference to an article from the literature, raised some new issues, and made some constructive suggestions for alleviating the epistemological tangle which besets much of psychological research. Lykken referred to an article by Sapolsky (1964) that propounded the following substantive theory: some psychiatric patients entertain an unconscious belief in the 'cloacal theory of birth' which involves the notions of oral impregnation and anal parturition. Such patients should be inclined to manifest eating disorders: compulsive eating in the case of those who wish to get pregnant, and anorexia in those who do not. Such patients should also be inclined to see cloacal animals, such as frogs, on the Rorschach. Thus frog responders on the Rorschach test could be expected to show more eating disorders than patients not giving frog responses. In the event, of 31 frog responders examined by Sapolsky 19 had apparently manifested eating disorders, whereas the same was true of only 5 of the 31 control patients. From this information we may compute χ² with 1 d.f.; it is 11.49 which has an associated probability less than 0.001. It might seem from the above that the theory advanced by Sapolsky had received support, not to say strong support, but of course we know a priori that frog responders will differ from non-responders in their eating disorders. Sapolsky had managed only to predict the direction of the difference successfully. Successful prediction has a probability of occurrence, in the absence
of any explanatory theory, of 0.5, so at best Sapolsky's theory gains minimal support. However, there are a multitude of alternative explanations, all surely more plausible than Sapolsky's. As Lykken notes, the association between frog responding and eating disorders may be accounted for by the fact that 'both symptoms are immature or regressive in character: the frog, with its disproportionately large mouth and voice may well constitute a common orality totem and hence be associated with problems in the oral sphere; squeamish people might tend to see frogs and to have eating problems and so on' (Lykken, 1968, p. 152). In contrast, Sapolsky's theory that belief in oral impregnation will imply frog responding merely because the frog has a cloaca is hard to swallow (so to speak) for very few people know what a cloaca is, or that a frog has one — those that do are presumably likely to know that the frog's eggs are both fertilized and hatched externally so neither oral impregnation nor anal birth are in any way involved! What alarmed Lykken was that he found that the initial incredulity with which he and his colleagues treated Sapolsky's theory was undiminished by contemplation of the statistical significance of the result. Lykken rightly questioned whether it is healthy for a discipline to violate what he termed an implicit rule: 'when a prediction or hypothesis derived from a theory is confirmed by experiment, a non-trivial increment in one's confidence in that theory should result, especially when one's prior confidence is low' (Lykken, 1968, p. 152). Yet he concludes that his implicit rule is violated 'not only in a few exceptional instances but as it is routinely applied to the majority of experimental reports in the psychological literature' (Lykken, 1968, p. 152). Lykken therefore briefly discusses the addition of a possible codicil: 'this rule may be ignored whenever one considers the theory in question to be overly improbable or whenever one can think of alternative explanations for the experimental results'. But he regards such an amendment as unreasonable, 'ESP for example could never have become scientifically respectable if the first exception were allowed, and one consequence of the second would be that the importance attached to one's findings would always be inversely related to the ingenuity of one's readers' (Lykken, 1968, p. 152). Despite my own enduring scepticism, Lykken's comment regarding ESP is reasonable. But there is nothing wrong with the importance of an empirical finding being inversely related to the ingenuity of the academic community. It is a necessary and desirable feature of the growth of scientific knowledge that the apparent importance of an experimental result may be undermined at any time by the postulation, by anyone, of an alternative explanation. Having, like Meehl, demonstrated that the significance of an experimental result has no bearing upon the extent to which a substantive theory is corroborated, Lykken goes on to show that effect size is not straightforwardly related either. Certainly it does not follow from the observation of a larger effect that the theory involved is thereby more strongly corroborated. Again, one assumes that no one would explicitly assert the contrary, but it is hard to
reconcile this with avowal of 'strong support' that is often claimed after large effects are obtained [8]. Indeed it is possible for an effect size to be so large as to seriously embarrass a theory, and just such an eventuality is revealed by Lykken's analysis of Sapolsky's results:

Sapolsky found that 16 per cent of his control sample showed eating disorders; let us take this value as the base rate for this symptom among patients who do not hold the cloacal theory of birth. Perhaps we can assume that all patients who do hold this theory will give frog responses but surely not all of these will show eating disorders (any more than will all patients who believe in vaginal conception be inclined to show coital or urinary disturbances); it seems a reasonable assumption that no more than 50 per cent of the believers in oral conception will therefore manifest eating problems. Similarly, we can hardly suppose that the frog response always implies an unconscious belief in the cloacal theory; surely this response can come to be emitted now and then for other reasons. Even with the greatest sympathy for Sapolsky's point of view, we could hardly expect more than, say, 50 per cent of frog responders to believe in oral impregnation. Therefore, we might reasonably predict 16 of 100 non-responders would show eating disorders in a test of this theory, 50 of 100 frog responders would hold the cloacal theory and half of these show eating disorders, while 16 per cent or 8 of the remaining 50 frog responders will show eating problems too, giving a total of 33 eating disorders among the 100 frog responders. Such a finding would produce a significant chi-square but the actual degree of relationship as indexed by the phi coefficient would be only about 0.20 [9]. In other words, if one considers the supplementary assumptions which would be required to make a theory compatible with the actual results obtained, it becomes apparent that the finding of a really strong association may actually embarrass the theory rather than support it (e.g. Sapolsky's finding of 61 per cent eating disorders among his frog responders is significantly larger (p < 0.01) than the 33 per cent generously estimated by the reasoning above). (Lykken, 1968, p. 154)
This argument of Lykken's is interesting because it demonstrates the principle that departures from an expected effect size, irrespective of statistical significance, may argue powerfully against the tenability of a proposed substantive theory. In this respect Lykken's discussion of Sapolsky's results nicely complements Pitz's discussion of Hershberger's findings discussed above. In addition, we see that researchers may be able to make somewhat more specific predictions than the mere postulating of a difference in a proposed direction. Perhaps a move towards more closely defined predictions is not as premature as many would suggest. On the other hand, the detail of Lykken's analysis requires closer inspection. The problem with interpreting Sapolsky's results is that whereas his theory essentially postulates a relationship between the holding of the cloacal theory of birth and incidence of eating disorders, his data make use of an intermediate variable, namely the frog response. The distinction becomes crucial in the first sentence of the above passage. Lykken identified 16% as the base rate of eating disorders for those 'who do not hold the cloacal theory of birth' — but the way this quantity is derived means that it is strictly the base
rate for non-frog responders. To equate the two is to infer that Sapolsky was postulating that the only reason frog responders have eating disorders is that they are believers in the theory of cloacal birth. That is, Lykken is inferring that as well as affirming his own theory Sapolsky was implicitly, at least, denying others (such as that squeamish people will tend both to see frogs and to have eating disorders). Now clearly Sapolsky's theory and the wide variety of other contenders are not logically mutually exclusive, so it has to be accepted that Sapolsky's findings are only embarrassing to his theory if he was attempting to distinguish his theory from the others. Indeed such an attempt, given Sapolsky's design, is doomed from the start — which is really Lykken's central point [10]. Lykken suggests that the way to attain a respectable degree of corroboration for a substantive theory is to attempt multiple corroboration, the derivation and testing of a number of separate, quasi-independent predictions.

Since the prior probability of such a multiple corroboration may be on the order of 0.5^n, where n is the number of independent predictions experimentally confirmed, a theory of any useful degree of predictive richness should in principle allow for sufficient empirical confirmation through multiple corroboration to compel the respect of the most critical reader or editor. (Lykken, 1968, pp. 154-155)
This is evidently sound advice although one must record one's reservations as to the implication that this way lies the method of quantifying degree of corroboration. If the associated probability is not an index of corroboration neither, surely, is 0.5^n. It is exceedingly easy to collect confirming instances of a false theory — even quasi-independent ones. Theories are corroborated by withstanding threats from competing theories; it is fruitless to seek corroboration for a theory in the light of evidence without considering the plausibility of that evidence given alternative theoretical explanations [11]. Before leaving Lykken's paper it will be useful to consider one more aspect of Sapolsky's article to which Lykken draws attention. He suggests that traditional wisdom allows us to state:
Theory aside, Sapolsky has at least demonstrated an empirical fact, namely, that frog responders have more eating disturbances than patients in general ... this conclusion means, of course, that in the light of Sapolsky's highly significant findings we should be willing to give very generous odds that any other competent investigator (at another hospital, administering the Rorschach in his own way, and determining the presence of eating problems in whatever manner seems reasonable and convenient for him) will also find a substantial positive relationship between these two variables. Let us be more specific. Given Sapolsky's fourfold table showing 19 out of 31 frog responders to have eating disorders (61 percent), it can be shown by chi-square that we should have 99 percent confidence that the true population value lies between 13/31 and 25/31. With 99 percent confidence that the population value is at least 13 in 31, we should have 0.99 × 0.99 = 98 percent confidence that a new sample from that population should produce at least 6 eating disorders among each 31 frog responders, assuming that 5 of each 31 nonresponders show eating
problems also as Sapolsky reported. That is, we should be willing to bet $98 against only $2 that a replication of this experiment will show at least as many eating disorders among frog responders as among nonresponders. The reader may decide for himself whether his faith in the 'empirical fact' demonstrated by this experiment can meet the test of this gambler's challenge. (Lykken, 1968, p. 155. Copyright 1968 by the American Psychological Association. Reprinted by permission of the publisher and author)

There are a number of interesting points arising from this passage. First, Lykken's statistical reasoning leaves a great deal to be desired; apart from an undesirable mixing of one- and two-tailed confidence intervals (not too serious since the areas concerned are small) his derivation of a 98% confidence interval by the multiplication of two quasi-probabilities is completely unwarranted. Did it not strike him as curious that if we may have 99% confidence that the population parameter is at least 13/31 we may only have 98% confidence in obtaining a sample value greater than 6/31? The correct degree of confidence approximates to 0.99997. However, to answer the question Lykken set himself, namely how much someone should be willing to bet that a replication of the experiment will yield at least as many eating disorders among frog responders as among non-responders, we should really compute a confidence interval for the difference between proportions — Lykken's analysis makes the unjustified assumption that the 'base rate', 5/31, is not also subject to sampling fluctuation. When this is done we find the degree of confidence implied by the data is of the order of 0.9998. Hence appropriate statistical analysis reveals that Lykken's case is even stronger than he imagined; the gambler's challenge is of a different order from $98 against $2. We should apparently be prepared to stake $9998 against $2! Numerical details apart, Lykken has raised two important problems. The first concerns the domain to which statistical inference may be applied. Can we really apply the awesome odds given above to a different hospital, a different Rorschach tester, a different operational definition of 'eating disorder'? What is the population to which statistical inference is proper, in the Sapolsky example? These and related issues as they affect statistical inference in general will be examined in Chapter 7. The second and distinct problem is this: suppose one can relatively uncontroversially determine the population to which inference is appropriate. Further suppose that one computes a 95% confidence interval for the value of a particular population parameter. Now what does one do (or think) if one simply does not have 95% confidence in the 95% confidence interval? This is the point of Lykken's comment, although he fails to develop it or to offer constructive suggestions. We may further ask: if one fails to have 95% confidence in a 95% confidence interval is one merely being irrational? If not, what is the status of a confidence interval if it neither reflects subjective conviction, on the one hand, nor can be properly viewed as a probability statement, on the other? These questions are at the very heart of the disputes between the major
schools of thought in statistical inference. In Part II I shall attempt to analyse the nature of these disputes and to discuss their relevance to the problem of statistical analysis in the social and behavioural sciences.

Summary

It has been seen in section 2.1 that the structure of the significance test, especially as performed in the Neyman-Pearson tradition, commits one to a decision-theoretic viewpoint. It was argued that in the vast majority of instances a decision is not required and that in those cases where a decision is appropriate the significance test is usually not appropriate — taking, as it does, no account of prior probabilities, costs, and utilities, etc. To the extent that the significance test has an inferential as opposed to a decision function, this purpose is achieved through the associated probability — not the significance level. However, the associated probability bears only upon one particular hypothesis, namely the (usually point) null hypothesis. The implications of this asymmetry for inference were examined in section 2.2. Now we may go further; the inferential import of the significance test bears upon a hypothesis which is known a priori, or in the absence of theory, to have a probability of 0 or 0.5, according to the directionality of the statistical alternative hypothesis. In this section I have attempted to demonstrate that the epistemological implications of this argument are extremely serious. The associated probability obtained in an experiment has no bearing whatsoever upon the degree to which the substantive theory is corroborated. Indeed, corroboration is so small as to render a single experiment, designed and analysed in the traditional manner, almost worthless as a true test of a substantive theory. We have distinguished perhaps four serious dangers emanating from the routine application of significance tests, although on examination they are interrelated:
(1) Null hypotheses rejected at highly significant levels lend a spurious sense of corroboration to the substantive theory.
(2) The ease with which significance tests may be set up does nothing to discourage the formulation of hopelessly simplistic theoretical formulations nor even the confusion of a statistical hypothesis with a substantive theory.
(3) The emphasis placed by significance tests upon the truth or falsehood of the null hypothesis rather than upon the size of an effect or the nature of a relationship encourages a naive view of the interaction of social and behavioural variables [12]. It also encourages what one might term a manipulative mentality. The researcher who wonders 'would it make any difference if I varied this factor?' is in danger of becoming, in Meehl's colourful phrase, 'a potent-but-sterile intellectual rake'.
(4) The outcome of a significance test, the associated probability, whether or
not it is compared with some previously determined critical level, may frequently lead one to reject the already implausible point null hypothesis. It does not, however, encourage speculation as to which (or which area) of the vast expanse of alternative hypotheses is favoured by the data nor consideration of the possible merits of alternative theoretical explanations to the perhaps mundane one that motivated the analysis in the first place.

To the extent that these issues are aired in the literature it is not uncommon to find writers suggesting that the use of significance tests is justified by the immaturity of the social and behavioural sciences. The burden of this section has been, to put it a little simplistically, to reverse the direction of this causal relationship. I am suggesting that the manifestly primitive level of much theorising is due, in no small degree but not of course exclusively, to the exceedingly crude requirements of the significance test and the absolutely marginal corroboration emanating therefrom.
2.4 Significance Tests and Effect Size
We have seen in Chapter 1 that the major inferential output of the significance test, the associated probability, may not, as many may think, be interpreted as the probability that the null hypothesis is false. Nor may it legitimately be used as an index of the replicability of a particular experimental finding. In the last section we saw that the associated probability lends only spurious corroboration to the experimenter's substantive theory; that corroboration increases as the alternative hypothesis is made more specific, not as an implausible null hypothesis is rejected with more confidence. We turn now to the last serious abuse of tests of significance. That abuse is to attempt to infer the size of effect established by the rejection of the null hypothesis from inspection of the associated probability (or the significance level). The term 'size of effect' is used here in a broad sense to mean either the difference between two population means, or the variance of several population means, or the strength of association in a multivariate population between variables, etc. Like some of the previous abuses outlined, the confusion of statistical significance with what Levy (1967) termed 'substantive significance' is usually implicit rather than explicit; nevertheless, just as it is easy to find articles in which highly significant results are claimed to 'powerfully support' a particular theory (the corroboration mistake), it is equally easy to find articles in which authors make inferences as to the size of an effect on the basis of a significance test. An example will serve to illustrate the point. 'Children in expected reward conditions solved ... puzzles ... somewhat (p < 0.10) faster than those in unexpected reward conditions' (Lepper and Greene, 1976). What is objectionable about this quotation is not that the authors had the temerity to reject the null hypothesis with unstable data but that the quantitative term 'somewhat' is illuminated with reference to an associated probability.
A further example was found by Duggan and Dean (1968) in a major (unidentified) sociology journal. In the study to which Duggan and Dean refer, the authors had compiled a three by four frequency table and computed the value of χ² in a standard test for independence. The obtained value was 14 which for 6 degrees of freedom is significant at the 0.05 level. But what of the degree of dependency? Using the value χ² = 14 and its significance as their only guide the authors concluded that were it not for the fact that variables X and Y were clearly separate measures 'we should suspect the relationship to be tautological'. However, armed with the raw data Duggan and Dean were able to compute Goodman and Kruskal's gamma which is a measure of relationship suited to frequency data on two ordinal variables. The obtained value was -0.30, which as Duggan and Dean remark suggests a relationship which is far from tautological. Later in the same article (that is the one Duggan and Dean analysed) the authors present another table, this time three by three, in which χ² = 6.2, which is non-significant, the associated probability being somewhat less than 0.2. In this case the majority of social scientists would accept the null hypothesis of independence between the two variables. Here again Goodman and Kruskal's gamma was computed and the obtained value was 0.22, not markedly different in size from the relationship deemed almost tautological. There are two related morals to draw from Duggan and Dean's analysis. First, unless one is skilled, the value of a test statistic and/or its associated probability is a very unreliable guide as to the size of effect present. Second, this problem relates not only to absolute values but also to relative values. That is, the fact that two experiments may yield sharply varying test statistics is no guarantee that the underlying strengths-of-effect are likewise variant. Equally, similar test statistics and associated probabilities may conceal widely differing strengths of relationship. It is in this respect that Eysenck's (1960) exhortation to psychologists to abandon significance levels in favour of associated probabilities because 'two experiments giving p values of 0.048 and 0.056 are in excellent agreement although one is significant, while the other is not', is a disservice. Indeed, the extent to which strength of relationship is unpredictable given only the significance of a χ² statistic was clearly demonstrated by Duggan and Dean who performed a survey of major sociological journals between the years 1955 and 1965. Choosing only articles where three by three tables with both variables ordinal were analysed, they computed gamma, the strength of relationship. Their results are reproduced in Table 2.4.1. Inspection of Table 2.4.1 suggests that 'while chi square, properly used, may be sensitive to the dependence of variables, after dependence is shown the usefulness of this statistic is exhausted' (Duggan and Dean, 1968, p. 46). The striking conclusion from this remark is that if we accept that the null hypothesis is almost always false, that two variables are almost always dependent, then the chi-square statistic is 'exhausted' before we begin.
Table 2.4.1 A comparison of level of significance and strength of relationship (n = 45 articles) (Reproduced with permission from Duggan and Dean, 1968)

                              Strength of relationship (gamma)
Level of        0.00-   0.10-   0.20-   0.30-   0.40-   0.50-   0.60-   0.70-   0.80-
significance    0.09    0.19    0.29    0.39    0.49    0.59    0.69    0.79    0.89
0.001             3       2       2       8       —       1       —       2       3
0.01              1       —       6       —       2       —       —       —       —
0.05              4       3       —       —       —       —       —       —       —
0.10              1       —       1       —       —       —       —       —       —
0.20              1       1       1       —       —       —       —       —       —
0.30+             —       3       —       —       —       —       —       —       —
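Because the raw tables behind Duggan and Dean's survey are not reproduced in the text, the following sketch (mine, assuming NumPy and SciPy) uses an invented 3 × 3 ordinal table simply to show how χ² and Goodman and Kruskal's gamma are computed and how they answer different questions.

```python
# Hypothetical 3x3 ordinal-by-ordinal table (NOT the data Duggan and Dean analysed),
# used only to contrast chi-square with Goodman and Kruskal's gamma.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 20, 10],
                  [20, 25, 20],
                  [10, 20, 30]])

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, d.f. = {dof}, p = {p:.4f}")

def goodman_kruskal_gamma(t):
    """Gamma = (C - D)/(C + D), where C and D count concordant and discordant pairs."""
    t = np.asarray(t)
    C = D = 0
    rows, cols = t.shape
    for i in range(rows):
        for j in range(cols):
            C += t[i, j] * t[i+1:, j+1:].sum()   # cells below and to the right
            D += t[i, j] * t[i+1:, :j].sum()     # cells below and to the left
    return (C - D) / (C + D)

print(f"gamma = {goodman_kruskal_gamma(table):.2f}")
```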
The same goes, of course, for t, F, Mann-Whitney U, and a great many other test statistics. There are two major impediments to inferring effect size from a test statistic (or, alternatively, from an associated probability). First, test statistics are functions both of effect size and of sample size. Hence a statistically significant effect may result from the testing of a large effect with a small sample or from the testing of a small effect with a large sample. Second, for a given sample size, the relationship between effect size and associated probability is not linear but typically sharply asymptotic. Thus for trivial increases in effect size, statistical significance can increase dramatically. It is, however, possible to compute more useful statistics than those designed merely to throw light upon the tenability of the highly implausible null hypothesis.
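The first impediment is easy to see numerically; the sketch below (mine, not the author's, assuming SciPy is available) computes the two-tailed p-value implied by one and the same standardized difference of 0.5 at several sample sizes.

```python
# Illustration (not from the text): the same standardized difference of means
# yields very different p-values depending only on sample size.
import numpy as np
from scipy import stats

d = 0.5                       # standardized difference between two means
for n in (10, 30, 100, 400):  # subjects per group
    t = d * np.sqrt(n / 2)    # t = d * sqrt(n/2) for two equal groups
    df = 2 * n - 2
    p = 2 * stats.t.sf(t, df)
    print(f"n per group = {n:4d}   t = {t:5.2f}   two-tailed p = {p:.5f}")
```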
Some alternative measures of effect size
Once more, for simplicity, we take as a point of departure the t-test for independent samples. Suppose 10 right-handed and 10 left-handed subjects are given a skilled task where stimulus-response incompatibility is high. The dependent variable is number of mistakes made. The data may be:

Right-handers (X)    8   6   8  10  10  11  13  14  14  16
Left-handers (Y)     4   5   4   6   7   7   8   9   9  11

We have seen that the test statistic, t, is computed thus:

    t = (x̄ - ȳ) / s_{x̄-ȳ}

where

    s_{x̄-ȳ} = √{ [(Σx² + Σy²) / ((n_x - 1) + (n_y - 1))] × (1/n_x + 1/n_y) }

(Σx² and Σy² being sums of squared deviations about the respective sample means).
Here

    s_{x̄-ȳ} = √{ [(92 + 48)/18] × (1/10 + 1/10) } = 1.247

and

    t = (11 - 7)/1.247 = 3.21
With 18 d.f., and performing a two-tailed test with α set at 0.01, the critical value of t is 2.55. Our data therefore permit rejection of the null hypothesis of no difference between right- and left-handers at the 0.01 level of significance. This is all that we may validly conclude.
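For readers who want to check the arithmetic, a brief sketch (mine, assuming SciPy; the data are the hypothetical mistake counts above):

```python
# Reproducing the independent-samples t-test on the hypothetical handedness data.
from scipy import stats

right = [8, 6, 8, 10, 10, 11, 13, 14, 14, 16]   # mistakes, right-handers (X)
left  = [4, 5, 4, 6, 7, 7, 8, 9, 9, 11]         # mistakes, left-handers (Y)

t, p = stats.ttest_ind(right, left)             # pooled-variance t-test
print(f"t = {t:.2f} on {len(right) + len(left) - 2} d.f., two-tailed p = {p:.4f}")
# t = 3.21 on 18 d.f., p ~ 0.005, so the null hypothesis is rejected at the 0.01 level
```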
(i) The Confidence Interval
Alternatively, we may construct a 99% confidence interval for the difference between the population means:

    0.99 CI: (x̄ - ȳ) ± (t_0.01 × s_{x̄-ȳ})

Here it is 4 ± (2.55 × 1.247). Hence

    C(0.820 < (μ_x - μ_y) < 7.180) = 0.99

The exact interpretation of this statement is problematic, but a discussion of the issues involved must be delayed until Part II. For now it will suffice to state that we have computed a plausible range of values for the unknown population parameter. Our best estimate is that left-handers make four fewer mistakes than right-handers, and the confidence interval informs us that the true difference may plausibly be as little as one mistake or as much as seven mistakes. The significance test does not yield this information. The significance test relates to what the population parameter is not; the confidence interval gives a plausible range for what the parameter is. If, for some reason, we have a particular interest in a specific hypothesized value of the population parameter, say (μ_x - μ_y) = 0, we can assess its plausibility by inspecting the confidence interval. In this example zero does not lie within the interval and we may therefore conclude that it is not a plausible value for the population parameter. Thus we may, if we wish, use a confidence interval to conduct a significance test. The confidence interval contains all the information provided by a significance test — and more.
(ii) The Standardized Difference between Two Means, (X̄ - Ȳ)/s
This measure allows an estimate to be made of the number of standard deviations that separate the population means in question. It will be remembered that Cohen (1962) regarded a difference of one standard deviation as a large effect and one-quarter standard deviation as a small effect. The measure may simply be derived from the formula for a t statistic:

    t = (X̄ - Ȳ) / [s × √(1/n₁ + 1/n₂)]

Hence

    (X̄ - Ȳ)/s = t × √(1/n₁ + 1/n₂)

Where n₁ = n₂ = n, we have

    (X̄ - Ȳ)/s = t × √(2/n)
From our hypothetical data we can compute (X̄ - Ȳ)/s. It is 1.43. This in Cohen's terminology is a large effect; it would have to be so if such a small sample were to produce a highly significant test statistic.
(iii) The Proportion Misclassified Statistic
The data recording the number of mistakes made by left- and right-handed individuals were samples drawn from two populations. The populations may be as depicted in Figure 2.4.1.

Figure 2.4.1. Number of mistakes made by left- and right-handed people

Another way in which effect size may be described is in terms of the degree of overlap of two or more populations. Specifically, suppose we were given the scores produced by subjects on the skilled task and we were required to classify each subject as either being left-handed or right-handed. In our example the decision rule must clearly be to classify everyone scoring less than nine as left-handed and everyone scoring more than nine as right-handed. (For those scoring exactly nine we may as well toss a coin.) Such a rule will minimize the proportion of subjects misclassified. It is evident that misclassification decreases as effect size increases. In the two-sample case the frequency with which misclassification occurs is approximated by the one-tailed area beyond the normal deviate [13]

    z = (1/2) × t × √[(n₁ + n₂)/(n₁n₂)]

For our hypothetical data the proportion misclassified would therefore be approximately 24%. Thus although the difference between left- and right-handers is highly significant, as reflected by the traditional test of significance, any attempt to use performance on the task to predict handedness would lead to almost a quarter of the subjects being misclassified. Perhaps the most obvious application of the proportion of misclassification arises in selection and diagnosis. Suppose a personality inventory which purports to measure mental instability yields significantly higher values when administered to people who, in experienced professional judgement, should be hospitalized than to those who are considered suitable for out-patient treatment. If the time of the expert is valuable can diagnosis be performed by the test? If so, what proportion of misclassification may we expect? [14] A second possible use of the proportion misclassified measure is as a manipulation check. For example, Greene (1976) used a factorial design in which 40 field-independent and 40 field-dependent female participants in a weight reduction clinic were individually counselled by a male interviewer; these interviews were systematically varied in terms of: (a) the verbal feedback (accepting vs. neutral) offered for clients' self-disclosing statements; (b) the physical proximity of the interviewer to the client (personal vs. social distance); and (c) the interviewer (one of two). Various behavioural and self-report measures of affect were taken in what was a 2 × 2 × 2 × 2 fixed-effects analysis of variance design. Greene begins his results section thus: 'To examine the effectiveness of the evaluative feedback manipulation, an analysis of variance was performed on clients' ratings of a questionnaire item concerning how accepting the interviewer seemed. As anticipated, the interviewer was felt to be more accepting to clients in the positive feedback condition, F(1, 64) = 4.86, p < 0.05' (Greene, 1976, p. 572). Now from this information we can estimate the proportion of subjects who would be wrongly assigned to either the 'accepting' or the 'neutral' group, if one had only the rating of how accepting the interviewer was regarded on which to base one's judgement. The matter is complicated a little by the absence of data bearing upon the other main effects of the analysis of variance, but we may be fairly confident that a conservative estimate of the proportion so misclassified would be 40%! From the results of other significance tests and a table of means that Greene supplies it is possible to reconstruct the analysis of variance for one of the dependent variables, 'behavioural distancing'. It transpires that the main effect for feedback condition is not significant and accounts for about 1.8% of the total variance. Now it seems to me that the way the reader would interpret this result (and indeed the way Greene probably interpreted it) would alter dramatically if he knew the true extent to which the feedback manipulation had worked. Let us be clear about the lesson to be learned. First, in manipulation checks, as with so many instances, the significance test is being asked to do a task to which it is not suited. Second, one is not asserting that a 40% misclassification rate is unacceptable; it may or may not be according to taste and circumstance.
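A sketch of the computation (mine, not the original text's; SciPy assumed) for both the handedness example and Greene's manipulation check:

```python
# Estimating the proportion misclassified from a test statistic (a sketch,
# using the normal-deviate approximation described in the text).
import numpy as np
from scipy.stats import norm

def prop_misclassified(t, n1, n2):
    """One-tailed normal area beyond z = (1/2) * t * sqrt((n1 + n2)/(n1 * n2))."""
    z = 0.5 * t * np.sqrt((n1 + n2) / (n1 * n2))
    return norm.sf(z)

# Handedness example: t = 3.21 with 10 subjects per group.
print(f"handedness data: {prop_misclassified(3.21, 10, 10):.0%}")   # ~24%

# Greene (1976) manipulation check: F(1, 64) = 4.86, so t = sqrt(4.86),
# with 40 clients in each feedback condition.
print(f"Greene's manipulation check: {prop_misclassified(np.sqrt(4.86), 40, 40):.0%}")  # ~40%
```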
What is important is that where a manipulation check is performed the correct statistic should be used (the proportion misclassified is by no means obligatory). Above all, significance tests, giving a spurious sense of security, should be avoided.
(iv) The Proportion of Variance accounted for, r²
An alternative measure of the strength of the association between independent and dependent variables may be obtained by calculating the correlation between them. In the two-sample case we introduce a binary code for the independent variable. In the handedness example the data become [15]:
Handedness X    0  0  0  0  0  0  0  0  0   0  1  1  1   1   1   1   1   1   1   1
Mistakes Y      4  5  4  6  7  7  8  9  9  11  8  6  8  10  10  11  13  14  14  16

The correlation

    r = Σxy / √(Σx² × Σy²)

between our variables so defined is 0.603. If we square this value we obtain r² = 0.364. Here, r² is the proportion of variance in one variable that is statistically explained by variance in the other variable. In this example we may conclude, for example, that 36.4% of the variance in number of mistakes made is explained by handedness; 63.6% of the variance remains unexplained. Fortunately we do not require the raw data to compute r²; it may be retrieved from a published t statistic and the degrees of freedom:
    r² = t² / (t² + d.f.)

For the numerical example,

    r² = (3.207)² / ((3.207)² + 18) = 0.364
(v) The Proportion of Variance accounted for, ω²
The measure of proportion of variance accounted for that is favoured by many writers is ω² (Hays, 1963; Fleiss, 1969; Vaughan and Corballis, 1969; Dodd and Schultz, 1972; Smith, 1982). The derivation of this statistic will be dealt with shortly; suffice it to say that for an independent means t-test:

    ω² = (t² - 1) / (t² + d.f. + 1)

which for our numerical example takes the value 0.317. Unlike test statistics, such as t, F, χ², etc., the above statistics, for a given
experimental design, stabilize as sample size increases. In the case of the confidence interval, of course, it is the point estimate of the difference between means that stabilizes; the interval itself reduces as sample size increases. The other four are descriptive statistics. In principle, interval estimates can be formed for the respective population parameter, but in practice the problems are sometimes considerable. Table 2.4.2 gives values of these latter four effect sizes for varying sample sizes and for associated probabilities corresponding to the two familiar levels of significance. All values are computed assuming equal sample size, the numbers appearing in the 'sample size' column being the size of each sample. So, for example, if a t-test were performed on data obtained from two samples of size 15 and the result were just significant at the 0.01 level, then the obtained value of t would be 2.76 and the measurements of effect size indicate that: (i) the population means are estimated as being one standard deviation apart (a large effect); (ii) any attempt to classify subjects into treatment conditions on the basis of their scores on the dependent variable would result in 31% misclassification; (iii) variation in treatment conditions is accounting for about one-fifth of the total variance in the dependent variable. Some of the more interesting features of Table 2.4.2 should be noted. For example it may seem that if a 'large effect' nevertheless admits 31%
Table 2.4.2 The relationships between sample size, associated probability, and four measures of effect size

Sample size      P      t (two-tailed)   (X̄ - Ȳ)/s   Prop. misclass.     r²        ω²
    10         0.05         2.10            0.94          0.32          0.20      0.15
    10         0.01         2.88            1.29          0.26          0.32      0.27
    15         0.05         2.05            0.75          0.36          0.13      0.096
    15         0.01         2.76            1.01          0.31          0.21      0.18
    20         0.05         2.02            0.64          0.37          0.10      0.072
    20         0.01         2.70            0.85          0.33          0.16      0.14
    25         0.05         2.01            0.57          0.39          0.078     0.057
    25         0.01         2.68            0.76          0.35          0.13      0.11
    30         0.05         2.00            0.52          0.40          0.065     0.048
    30         0.01         2.66            0.69          0.37          0.11      0.092
    40         0.05         1.99            0.44          0.41          0.048     0.036
    40         0.01         2.64            0.59          0.38          0.082     0.069
    50         0.05         1.98            0.40          0.42          0.038     0.028
    50         0.01         2.63            0.53          0.40          0.066     0.056
   100         0.05         1.97            0.28          0.44          0.019     0.014
   100         0.01         2.60            0.37          0.43          0.033     0.028
   200         0.05         1.96            0.20          0.46          0.0096    0.0071
   200         0.01         2.59            0.26          0.45          0.017     0.014
   250         0.05         1.96            0.18          0.47          0.0077    0.0057
   250         0.01         2.59            0.23          0.45          0.013     0.011
misclassification then perhaps the 40% misclassification implied by Greene's manipulation check is quite respectable. But whereas 31% implies that about one-fifth of total variance is accounted for, at other points in the table 40% misclassification implies that the proportion of variance accounted for is of the order of one-twentieth — a qualitative difference. Again, the absolute values of the proportions of variance accounted for are, to the untrained eye at least, surprisingly low. A result significant at the 0.01 level from samples of size 30 is accounting for only one-tenth of total variance; were one to be rash enough to hope for, say, 30% variance accounted for one would need a test statistic with a two-tailed associated probability of approximately 0.00000015. The associated probability required to account for a mere 20% of variance would be close to 0.00004. Next it should be noted that while a result at the 0.01 level of significance is five times as significant as one at the 0.05 level in the sense that the latter is five times more likely to be observed if the null hypothesis is true — this factor of five is nowhere to be found among measures of effect size. With regard to r², for example, a result at the 0.01 level is on average accounting for about 1.7 times as much variance as one at the 0.05 level. Lastly, it is interesting that although the absolute differences between the r² and ω² estimates are small, ω² is consistently 20-25% lower than r².
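The whole of Table 2.4.2 can be regenerated from the formulas given in this section; the sketch below (mine, SciPy assumed) reproduces the row for samples of size 15 at the 0.01 level.

```python
# Regenerating one row of Table 2.4.2 from the effect-size formulas in this section.
import numpy as np
from scipy import stats

def effect_sizes(n, alpha):
    """Four effect-size measures implied by a just-significant two-tailed t-test
    with n subjects in each of two groups."""
    df = 2 * n - 2
    t = stats.t.ppf(1 - alpha / 2, df)        # critical t, two-tailed
    d = t * np.sqrt(2 / n)                    # standardized difference (X bar - Y bar)/s
    misclass = stats.norm.sf(d / 2)           # proportion misclassified
    r2 = t**2 / (t**2 + df)                   # proportion of variance, r squared
    w2 = (t**2 - 1) / (t**2 + df + 1)         # proportion of variance, omega squared
    return t, d, misclass, r2, w2

t, d, m, r2, w2 = effect_sizes(15, 0.01)
print(f"t = {t:.2f}, (X-Y)/s = {d:.2f}, misclassified = {m:.2f}, r2 = {r2:.2f}, w2 = {w2:.2f}")
# t = 2.76, (X-Y)/s = 1.01, misclassified = 0.31, r2 = 0.21, w2 = 0.18
```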
The Derivation and Interpretation of ω²
Consider a simple one-way fixed-effect analysis of variance design with n = 10 observations on each of the J = 4 levels of factor A. The summary analysis of variance table may be as depicted in Table 2.4.3. A result such as that given in Table 2.4.3 would reject the null hypothesis at beyond the 0.05 level of significance. The expected mean square for treatment effects is σ² + nθ_A², where

θ_A² = Σ_j α_j² / (J − 1)
Since the variance of treatment effects in a fixed-factor design is more appropriately conceived as a sum of squares divided by the number of such effects (as opposed to degrees of freedom), we may define the variance attributable to factor A as

σ_A² = Σ_j α_j² / J

Table 2.4.3

Source         SS     d.f.    MS     F     E(MS)
A              120      3     40     4     σ² + nθ_A²
Within cell    360     36     10
Total          480     39
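By way of a numerical check, the ω² estimate implied by Table 2.4.3 can be computed with the usual estimator, ω̂² = (SS_A − (J − 1)MS_within)/(SS_total + MS_within). The short sketch below is illustrative and does not reproduce the algebra of the derivation itself.

# Illustrative check (not the author's algebra): the usual omega-squared estimator
# applied to the entries of Table 2.4.3.
SS_A, SS_total = 120, 480
df_A, MS_within = 3, 10

omega_sq = (SS_A - df_A * MS_within) / (SS_total + MS_within)
print(round(omega_sq, 3))   # 0.184: factor A accounts for roughly 18% of the total variance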
P(H1/D) = P(D∩H1) / [P(D∩H1) + P(D∩H2)]          from (3)          (5)

A reapplication of (4) gives

P(H1/D) = [P(H1) × P(D/H1)] / {[P(H1) × P(D/H1)] + [P(H2) × P(D/H2)]}          (6)

or, more generally,

P(Hi/D) = [P(Hi) × P(D/Hi)] / Σ_j [P(Hj) × P(D/Hj)]          (7)
Here, (5) and (7) are alternate forms of Bayes's theorem. As an exercise in mathematical formalism, Bayes's theorem is demonstrably a trivial deduction from the probability axioms. Controversy arises over the range of problems to which it may properly be applied. If we wish to give a numerical answer to the question, 'what is the probability that we have the standard pack, H1, given that we have drawn a black card?', every term in (6) above must be evaluated. We may, for example, satisfy ourselves that the black card was drawn randomly from whatever pack we have. Then it might be reasonable to conclude that P(D/H1) and P(D/H2) are 1/2 and 1/3 respectively. Perhaps the pack we were given was determined by the toss of a fair coin. We might conclude that P(H1) = P(H2) = 1/2. The required probability would therefore be
P(standard pack/black card) = (1/2 × 1/2) / [(1/2 × 1/2) + (1/2 × 1/3)] = 3/5

All this seems uncontroversial, but we shall see that under the relative frequency conception of probability the application is not without its difficulties. In other contexts, however, the use of Bayes's theorem is far more contentious. In our example D is the drawing of a black card and H is the label of a pack of cards. But let D stand for 'data' and H for 'hypothesis' and the epistemological implications are evident. If the terms in (6) and (7) can be evaluated, we have a way of arguing from the probability of data given some hypothesis to a statement giving the probability of a hypothesis given the observation of data. In this context P(H/D) is termed an inverse probability. The attempted justification of statements of inverse probability has been the focus of heated debate for more than 100 years. Nevertheless, this section should conclude with a reminder that the calculus of probability is not in dispute. With only occasional and, as yet, esoteric exceptions, all accounts of probability are consistent with the axioms outlined above [1]. What are at issue are the meaning and scope of probability statements, and it is to a review of standard positions on these problems that we now turn.
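As an illustration (the code is mine, not part of the original argument), the card-pack calculation is a direct application of equation (6):

# Illustrative code (mine): the card-pack posterior via equation (6).
def posterior_h1(prior_h1, p_d_h1, p_d_h2):
    num = prior_h1 * p_d_h1
    return num / (num + (1 - prior_h1) * p_d_h2)

# Standard pack: P(black) = 1/2; doctored pack: P(black) = 1/3; prior 1/2 each.
print(posterior_h1(0.5, 1/2, 1/3))   # 0.6, i.e. 3/5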
4.2 The Classical Interpretation of Probability
Probability was first studied systematically in the context of games of chance [2]. The probability of an event came to be regarded as the ratio of the number of outcomes favourable to that event to the total number of outcomes. It seemed natural that the probability of producing an even number with a single throw of the die was N_even/N_total = 3/6 = 1/2. Of course, the nature of the elemental outcomes has to be further constrained. The probability of throwing a six cannot be construed as the ratio of the event 'throw six' to the total 'throw six' + 'throw not six'. If the desired probability of 1/6 is to be attained the possible outcomes must be equipossible. The problem then becomes to define 'equipossible' without resort to probability (and hence circularity). Bayes's solution was to assert that equal probabilities be assigned to events if we have no grounds (empirical, logical, or symmetrical) for doing otherwise. This proposition, dubbed the principle of insufficient reason by Laplace and renamed the principle of indifference by Keynes (1921), is the 'most notorious principle in the whole history of probability theory' (Kyburg 1974, p. 31). The principle allowed its adherents to conjure probabilistic knowledge from the void of ignorance. In particular, Laplace in 1774 addressed the problem of assessing the probability that a certain trial will be a success given that p successes and q failures have been observed. The prior probability of a success is totally unknown, but the principle of insufficient reason implies a rectangular distribution for it in the interval (0, 1). The required probability of a success on the (p + q + 1)th trial is, by applying Bayes's theorem,

∫₀¹ x^(p+1) (1 − x)^q dx / ∫₀¹ x^p (1 − x)^q dx = (p + 1) / (p + q + 2)
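A brief numerical check of this result (the code and the small worked values are illustrative additions): the two integrals are Beta functions, their ratio reduces to (p + 1)/(p + q + 2), and the same expression yields the figure Laplace reports for the sunrise example discussed below.

# Illustrative check (mine) of the rule of succession. The numerator and
# denominator are Beta functions, and their ratio is (p + 1)/(p + q + 2).
from fractions import Fraction
from scipy.special import beta          # B(a, b) = integral of x^(a-1) (1-x)^(b-1) dx on (0, 1)

def succession(p, q):
    return Fraction(p + 1, p + q + 2)

p, q = 3, 2
print(beta(p + 2, q + 1) / beta(p + 1, q + 1))   # 0.5714..., the same as
print(succession(p, q))                          # Fraction(4, 7)

print(succession(1_826_213, 0))   # Fraction(1826214, 1826215), the sunrise figure quoted below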
Hume’s problem of induction seemed to have been dissolved. In his enthus iasm for the new technique Laplace computed solutions to some age-old prob lems. For example he calculated the probability that the sun will rise tomorrow given that it had risen every day for the previous 5000 years. His answer was 1,826,214/1,826,215. But the classical conception of probability cannot withstand more than a casual examination. Suppose we are handed a pack of cards from which five cards, all of the same suit, have been removed. What is the probability of drawing a spade? We know the pack has been tampered with and surely feel this should affect our answer. But the principle of indifference asserts that we have no grounds for expecting one suit rather than another so the probability must remain 1/4. And if this is so the long-run expected frequency of spades is certain to be very close to 1/4 but that cannot be the case. Likewise, it will be remembered that in our illustrative example of section 4.1 the probability that we had the standard pack given that we had drawn a black card was computed to be 3/5. What were the five equipossible outcomes, three of which were favourable to the standard pack hypothesis? A third damaging criticism of the classical account is that by definition it can only give
rise to rational numbers (i.e. numbers that can be expressed as the ratio of two integers), but there are probabilities in number theory, geometry, and physics that are irrational. For instance, the probability that two integers selected at random are prime to each other is 6/π². The principle of insufficient reason has met even fiercer opposition (it will be remembered that without it, or something similar, the classical definition of probability slips into circularity). A final paradox is due to Keynes (1921). Suppose we have a mixture of wine and water such that we know that the ratio of one to the other is less than 3 — but we know nothing more. That is, 1/3 < wine/water < 3. By the principle of insufficient reason the ratio wine/water has a uniform probability density in the interval (1/3, 3). Hence
p(wine/water ≤ 2) = (2 − 1/3) / (3 − 1/3) = 5/8

Similarly,

p(water/wine ≥ 1/2) = (3 − 1/2) / (3 − 1/3) = 15/16
Now the events (wine/water ≤ 2) and (water/wine ≥ 1/2) are clearly equivalent, but they have been assigned differing probabilities. Numerous authors have made another point; namely, that where ignorance of a variable is represented by a rectangular probability distribution, a nonlinear transformation of the unknown variable (of which we should, logically, be equally ignorant) does not yield a rectangular distribution. Bertrand (1889) produced equally alarming paradoxes. Perhaps under the weight of these criticisms, but more likely due to the seductive nature of its successors, the classical conception of probability is no longer taken seriously by statisticians or philosophers of science. (Jaynes, 1976, gives a spirited defence of Laplace, however.)
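A small simulation makes the inconsistency vivid (the simulation is an illustrative addition, not Keynes's):

# Illustrative simulation (mine, not Keynes's): a uniform distribution over
# wine/water is not a uniform distribution over water/wine.
import random

N = 1_000_000
ratios = [random.uniform(1/3, 3) for _ in range(N)]   # wine/water uniform on (1/3, 3)

p1 = sum(r <= 2 for r in ratios) / N        # P(wine/water <= 2)
p2 = sum(1/r >= 0.5 for r in ratios) / N    # P(water/wine >= 1/2): the very same event
print(p1, p2)   # both about 0.625 (= 5/8), not the 15/16 given by indifference over water/wine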
4.3 The Relative Frequency Theory of Probability
We have seen that the principle of insufficient reason allowed statements of aleatory probability to be transformed, via Bayes's theorem, into statements of epistemic probability. This requires the positing of the probability of hypotheses a priori — before the data are gathered. In most applications of what became known as the rule of succession the subjective nature of such ascriptions of probabilities a priori was evident. The procedure was certain to attract criticism. The most vehement opponent of the classical account was Venn (1888), who attacked the principle of insufficient reason with a degree of savagery that was later to typify behaviourists' criticisms of introspectionism. His dismissal of probabilities a priori profoundly influenced Fisher, but his conception of probability as the limiting value of a relative frequency was most fully developed by von Mises.
Von Mises (1928) begins his account of probability theory by stating that it is a branch of empirical science and not of logic [3]. The empirical domain of probability is outlined thus: 'just as the subject matter of geometry is the study of space phenomena, so probability theory deals with mass phenomena and repetitive events' (p. vii of 1951 Preface). By mass phenomena von Mises had in mind such aggregates as are summarized by social statistics, and by 'repetitive events' he intended successive tosses of coins and the like. These aggregates, the proper domain of probability theory, were called 'collectives', and von Mises explicitly denied that probabilities could justifiably be assigned to other propositions. 'Our probability theory has nothing to do with questions such as: "is there a probability of Germany being at some time in the future involved in a war with Liberia?"'

It is evident that von Mises' doctrine placed probability firmly in the aleatory tradition and that by denying epistemic applications of the term he greatly reduced the scope of probability as a jargon term relative to the variety of its uses in everyday speech. Of course, there are many examples of scientific terms having more restricted meanings than their everyday counterparts. The precedent cited by von Mises is 'work'. We speak of 'working in the library', but for the physicist even the labourer holding a bag of cement is not working until he raises the bag — for in its jargon sense work is force × distance. Furthermore, it should be appreciated that in reducing the applicability of the probability calculus von Mises was attempting to avoid the enormities of Laplace's more enthusiastic followers. (Todhunter, 1865, records that Condorcet had calculated the probability that the duration of the reigns of the seven kings of Rome was 257 years. His answer was 0.000792.)

Probabilities, then, apply only to collectives, and collectives are aggregates which have two empirical properties. First, 'the relative frequencies of certain attributes become more and more stable as the number of observations is increased' (von Mises, 1928, p. 12). This amounts to asserting that for a collective of elements, some of which possess attribute A, the relative frequency, nA/n, with which in n observations attribute A is observed, has a limiting value as n tends to infinity. This limiting value is defined as the meaning of P(A), the probability of A. A second requirement is that the elements possessing attribute A should be randomly distributed in the long run of observations. Hence it should be impossible to specify numerically a subset of the infinite collective for which the limiting frequency of A does not exist, or takes some value other than P(A).

It is easy to show that for the idealized notion of a collective the axioms of probability are satisfied by von Mises' relative frequencies, but when we attempt to apply the notion to empirical collectives difficulties arise. Firstly, and fundamentally, a probability, in the frequentist conception, is a property of a collective and not of an element of that collective. Hence we may speak of a fair coin producing 'heads' half the time in the long run. We may even speak of the probability of a toss of a coin producing 'heads'. But we cannot speak of the fourth toss, nor of any particular toss. Similarly, we
cannot speak of the probability of what we might call a 'composite single event'. Thus the probability that the first 50 tosses of a fair coin will produce 20 'heads' is undefined.

We may also, in any application of the relative frequency definition of probability, find difficulty in specifying the appropriate collective. Suppose we wished to ascertain the probability that a given individual will die within the next 12 months. We note immediately that such single-case evaluations are explicitly avoided by von Mises — yet mortality statistics comprise just the sort of collective to which he addressed himself. Let us assume, then, that like almost any other practitioner we ignore the philosophical embarrassment and proceed to calculate the relative frequency with which individuals die within 12-month periods. Having found the value we may think to ourselves that males and females have differing life expectancies and that we should also take into account our individual's present age. It may seem that our revised probabilities are more accurate or better in some other way. But the frequentist theory of probability gives us no grounds for preferring one value over another. Further, we may think it wise to take more evidence into account. Our individual's occupation and health record might also be used to 'refine' the collective. In such circumstances the computed probability varies spasmodically; in the limit the collective contains one element, our individual, and the probability statement is totally uninformative. This objection to the arbitrary nature of the collective is usually referred to as the problem of specifying the reference class.

Despite these difficulties the frequentist view attracted, and still attracts, a great many adherents. Philosophically, its credentials were unimpeachable among behaviourist social scientists. Von Mises' positivism was apparent in his rejection of subjectivism, his operational definition of probability, and his grounding of the term in empirical aspects of the real world. The relative frequency conception of probability is the foundation of the Neyman-Pearson school of statistical inference and, as we shall see in Chapter 6, the distinguishing features of the frequentist view entail corresponding structural weaknesses in the statistical edifices built upon them.
4.4 Fiducial Probability
In a brief review of theories of probability with special reference to their relevance to statistical practice, mention must be made of fiducial probability, the brainchild of Fisher. Ronald Fisher was undeniably a 'scientist's statistician'. He recognized that the scientist desires of a statistical analysis that it produce grounds for rational belief in a proposition — in other words, that the inference be epistemic — but also that the procedure should be devoid of subjective elements. He sought, therefore, an argument whereby objective probabilities could be transformed to give an epistemic outcome without resorting, as Laplace had, to the positing of knowledge a priori. Referring, for example, to the estimation of an unknown
mean and standard deviation he writes (Fisher, 1935, pp. 195-196):

It is the circumstance that statistics sufficient for the estimation of these two quantities are obtained merely from the sum and the sum of squares of the observations, that gives a peculiar simplicity to problems for which the theory of errors is appropriate. This simplicity appears in an alternative form of statement, which is legitimate in these cases, namely, statements of the probability that the unknown parameters, such as μ and σ, should lie within specified limits. Such statements are termed statements of fiducial probability, to distinguish them from the statements of inverse probability, by which mathematicians formerly attempted to express the results of inductive inference. Statements of inverse probability have a different logical content from statements of fiducial probability, in spite of their similarity of form, and require for their truth the postulation of knowledge beyond that obtained by direct observations.
The struggle to reconcile the aleatory and epistemic aspects of probability is apparent everywhere in Fisher's work. Like Venn before him, Fisher conceived of his statistical theory as a contribution to, and a reflection of, human inductive reasoning. 'That such a process of induction existed and was possible to normal minds, has been understood for centuries; it is only with the recent development of statistical science that an analytic account can now be given, about as satisfying and complete, at least, as that given traditionally of the deductive processes' (Fisher, 1955, p. 74). But as to the exact 'logical content' of Fisher's fiducial probability he is nowhere explicit. It has even been suggested that 'it would be unfair to criticize "Fisher's interpretation of probability", because he never actually gives one' (Kyburg, 1974, p. 74).
4.5 Logical Probability
Unlike Fisher, Bayesians are happy to embrace the concept of inverse probability. In order to do so they abandon the frequentist definition and instead take probability to be a measure of belief. Some Bayesians regard this belief as subjective or personal, but the historically antecedent approach is to conceive of probability as an objective measure of rational belief in a proposition. The first systematic account of this logical view of probability was given by Keynes (1921) and, as we shall see, it contrasted markedly with the frequentist approach that was being developed at about the same time. Whereas von Mises was at pains to locate probability theory alongside other empirical disciplines, Keynes from the outset regarded probability as a branch of logic. His account was strictly epistemic; on his view probability was the degree of rational belief in a proposition warranted by some body of evidence. In logic we may begin with premisses:
All men are mortal
Socrates is a man
and rationally conclude, with certainty:

Socrates is mortal
But we may easily invent examples where the conclusion seems to be rendered probable by the premisses but not certain. Thus:

Most English people eat meat
Most meat eaters are warm blooded
Some English people are warm blooded
Here the conclusion is not certainly entailed by the premisses but is 'partially entailed'. The degree to which a conclusion, h, is warranted by the evidence, e, is denoted by P(h/e), and thus to Keynes probability is the logic of partial entailment, of which the traditional logical calculus is a special case (i.e. if e entails h, then P(h/e) = 1; if e entails not-h, then P(h/e) = 0). It should be immediately apparent that probability, on this view, is a logical relation between propositions rather than a feature of certain real or imaginary collectives. Belief is always warranted by evidence, and hence our belief in a proposition (i.e. its probability) is conditional upon our current knowledge. If we appear to make unconditional probability statements it is simply because the background knowledge is implicitly assumed. A further point should be emphasized. The logical theory of probability is strictly normative. If the probability of a proposition can be computed, then to the extent that any individual's personal degree of belief differs from the objectively warranted degree of belief that individual is simply wrong.

So much for the theory, but how do Keynes's ideas work out in practice? Unfortunately most readers have found him disappointingly reticent. If we ask how the degree of warranted belief is to be ascertained we learn that (i) some probabilities are not amenable to quantification, (ii) differing probabilities are not necessarily comparable, and (iii) frequently we know the degree of warranted belief by 'direct acquaintance', and yet, somewhat contradictorily, 'some have greater powers of intuition than others' (Keynes, 1921, p. 133). Not surprisingly, it proved impossible to construct a set of statistical techniques from Keynes's cautious theoretical framework.

Jeffreys (1961) gave an altogether bolder account of the theory of logical probability. As Ramsey (1931) had done before him, Jeffreys axiomatized rational belief, but unlike his predecessor he developed a great number of Bayesian statistical techniques whereby initial or prior belief could be transformed by evidence into posterior belief. As with the vast majority of Bayesian analyses, in Jeffreys' examples the 'evidence' is data of the standard random-sampling type amenable to the usual relative frequency definition of probability. But Jeffreys is emphatic that while relative frequencies may constitute grounds for a degree of belief, the meaning of probability is not exhausted by the notion of relative frequency.
The essence of the present theory is that no probability, direct, prior, or posterior, is simply a frequency. ... Even if the prior probability is based on a known frequency, as it is in some cases, a reasonable degree of belief is needed before any use can be made of it. It is not identical with the frequency. (Jeffreys, 1961, pp. 401-402)
Of course, the controversy arises chiefly when the prior probability is not based on a known frequency, for, as Jeffreys acknowledges, whenever a probability of a hypothesis is based on relative frequency data, we may justifiably ask what the probability of the hypothesis was before the data were collected. In the end we are faced, as was Laplace, with the problem of specifying reasonable belief in the context of total ignorance — and like Laplace, Jeffreys invokes the principle of indifference. In doing so Jeffreys encountered difficulties additional to those outlined in section 4.2.

Suppose, for example, that we are totally ignorant of a parameter which may in principle take any value from −∞ to +∞. If prior belief is distributed uniformly over this infinite range it follows that the prior probability that the parameter takes any particular point value is zero. Hence the posterior probability of any point value must also be zero. But this means that Bayesian significance tests cannot be meaningfully carried out. There must be some prior concentration of belief on the null value to get the significance test off the ground. Jeffreys' ingenious solution was to propose that if we have no inkling as to the value of a parameter we have no basis for knowing whether or not it is needed to explain the data. Hence by the principle of indifference half our uncertainty should be concentrated on the null value and the rest uniformly distributed over the possible range of the parameter. Similarly, in order that any particular statistical model may be critically assessed, he asserts that the posited parameter values must have non-zero prior probability merely by virtue of the fact that the model has been proposed. 'Any clearly stated law has a positive prior probability until there is definite evidence against it. This is the fundamental statement of the simplicity postulate' (Jeffreys, 1961, p. 129).

Whatever the pragmatic merits of methodological ploys needed to start the inference process, the logical status of the principle of indifference and the simplicity postulate remains questionable. As a result, most modern Bayesians have given up the attempt to formalize objectively warranted degrees of belief. They prefer instead to identify probability with personal uncertainty.
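The flavour of Jeffreys' device can be conveyed with a small illustrative calculation. The binomial model and the data below are hypothetical, chosen only to show the mechanics of placing prior mass 1/2 on the null value.

# Illustrative calculation (model and data hypothetical): prior mass 1/2 on the
# null value of a binomial parameter, the rest spread uniformly over (0, 1).
from scipy.stats import binom
from scipy.integrate import quad

n, k, null_p = 20, 14, 0.5

like_null = binom.pmf(k, n, null_p)                       # likelihood at the null value
like_alt, _ = quad(lambda p: binom.pmf(k, n, p), 0, 1)    # likelihood averaged over the uniform prior

post_null = 0.5 * like_null / (0.5 * like_null + 0.5 * like_alt)
print(round(post_null, 2))   # about 0.44: the data leave appreciable belief on the null value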
4.6 Personal Probability
The personalist account of subjective probability has been defined thus: subjective probability refers to the opinion of a person as reflected by his real or potential behaviour. This person is idealized; unlike you and me he never makes mistakes ... or makes such a combination of bets that he is sure to lose no matter
what happens ... the probability is basically a probability measure in the usual sense of modern mathematics. It is a function Pr that assigns a real number to each of a reasonably large class of events A, B, ... including a universal event S, in such a way that if A and B have nothing in common,

Pr(A or B) = Pr(A) + Pr(B)
Pr(A) ≥ 0
Pr(S) = 1
The extra-mathematical thing, the thing of crucial importance, is that Pr is entirely determined in a certain way by potential behaviour. Specifically, Pr(A) is such that

Pr(A)/Pr(not A) = Pr(A)/(1 − Pr(A))
is the odds that thou wouldst barely be willing to offer for A against not A. (Reproduced with permission from Savage et al. 1962, pp. 11-12) [4].
So defined, personal probability is applicable to any situation about which man may have an opinion. In particular, it applies to scientific theories and statistical hypotheses. Subjective probabilities are revised by adjusting prior opinions in the light of data to form posterior probabilities. The statistical technique involved is the repetitive application of Bayes's theorem, and hence personalists are most commonly referred to as Bayesians.

Although there are some Bayesian statisticians who remain adherents of the logical theory of probability described in the previous section, they are in a minority. Henceforth, therefore, the terms 'personal probability' and 'subjective probability' will be used interchangeably, and where I refer to the Bayesian view of probability it will be in the sense defined by Savage above. Detailed critical evaluation of Bayesian inference will be presented in Chapter 6, but we should remark here upon a couple of issues.

As implied by Savage, according to the Bayesian, all probabilities are subjective probabilities, and that is to say that they are opinions, reflections of uncertainty. Probabilities reside in the mind of the individual and not in the external world. A Bayesian may properly express his probability that a tossed coin will land heads, but not his uncertainty as to the probability that the coin will land heads. For that would amount to his making a statement of his uncertainty as to his uncertainty. De Finetti (1974) and other Bayesians regard such a regress as unacceptable. Hence for the Bayesian there are no 'true' or 'objective' probabilities. This feature of Bayesian inference has led some to criticize it on the grounds that it is incompatible with a realist view of science: 'Combining a Bayesian theory of inference with scientific realism will not be an easy task' (Giere, 1976, p. 65) [5].

Although no Bayesian would criticize my personal probability regarding any uncertain proposition on the grounds that it deviated from the 'true' probability, nevertheless Bayesian inference is prescriptive rather than descriptive. Wherein lies the prescription? The answer lies in the way I revise my probability in the light of data. Suppose, for example, I have two urns before me, one (urn A) containing two
red balls, the other (urn B) containing one red and one black ball. I choose one of the urns. My personal (prior) probability that I have urn A may be 1/2. I now draw a ball at random from the urn and find that it is red. What ought I now to believe? By Bayes’s theorem my posterior probability that I have urn A is P(A/red) =
P(A/red) = P(red/A)P(A) / [P(red/A)P(A) + P(red/B)P(B)] = (1 × 1/2) / [(1 × 1/2) + (1/2 × 1/2)] = 2/3
Now, for reasons best known to myself, my prior probability that I had urn A may have been 1/4 rather than 1/2. No Bayesian will quibble with this so long as I revise my opinion appropriately on drawing a red ball:

P(A/red) = (1 × 1/4) / [(1 × 1/4) + (1/2 × 3/4)] = 2/5
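To illustrate both the coherent revision just described and the convergence noted below (the repeated draws, taken with replacement, are an illustrative addition):

# Illustrative code (the repeated draws, with replacement, are my addition): two
# analysts with different priors for urn A update coherently on successive red balls.
def update(prior_a, reds):
    for _ in range(reds):
        prior_a = (1.0 * prior_a) / (1.0 * prior_a + 0.5 * (1 - prior_a))
    return prior_a

for reds in (1, 5, 10):
    print(reds, round(update(0.5, reds), 4), round(update(0.25, reds), 4))
# After one red ball: 0.6667 versus 0.4; after ten, both posteriors exceed 0.99.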
It will be remembered that subjective probability is defined in terms of potential betting behaviour. It can be shown that if I do not revise my prior probability, whatever that may be, in accordance with Bayes's theorem, a gambler will be able so to arrange bets against me that he is certain to make a profit. He can make a 'Dutch Book' against me. If probabilities are revised according to Bayes's theorem they are said to be 'coherent'. This, then, is the discipline required of its practitioners by Bayesian inference: that probability judgements should be coherent, not accurate. What saves Bayesian inference from total subjectivity is the happy fact that as more data become available the posterior probabilities of individuals with initially disparate prior probabilities will converge. This is the only sense in which Bayesian probabilities are 'objective'; so long as sufficient data are at hand there will be intersubjective agreement as to the appropriate posterior probability [5].

Summary

I have not attempted an exhaustive account of philosophical approaches to the concept of probability. The theories described in this chapter all share an adherence to the mathematical calculus of probability statements and all have attendant philosophical difficulties. The schools of statistical inference to be described in Chapter 6 each take a relatively clear-cut stance on the issue of probability. From the Neyman-Pearson viewpoint probability is strictly objective and is defined in terms of relative frequencies. This involves problems in specifying the reference class and renders probability statements in the single case inadmissible. Fisher's account of probability is likewise objective. He was a harsh critic of subjective probability, but the logic of his attempt to make statements of fiducial probability about parameters, without
resort to prior probabilities, has eluded subsequent writers. Modern Bayesians have, for the most part, abandoned the classical and logical theories of probability and espouse the personal probability interpretation of de Finetti (1974) and Savage (1961). Likelihood Inference involves an objective interpretation of probability. Edwards (1972) espouses relative frequency, although a propensity approach is also appropriate [6].
Notes

[1] Of the alternatives, Shafer's (1976a, b) non-additive probability calculus is perhaps the most interesting.
[2] It is not my intention to give a detailed exposition of the history of the concept of probability. Todhunter (1865) is the standard reference but it is not very rewarding. Hacking (1975) traces the development of probabilistic ideas up to the mid-seventeenth century. One of the best accounts is to be found in Acree (1978). His unpublished doctoral thesis came to my attention rather late in the preparation of this book. Theories of Statistical Inference in Psychological Research: A Historico-Critical Study is a superb piece of scholarship. It deserves a wider audience.
[3] Somewhat paradoxically, Venn (1888) had regarded his work as a contribution to philosophy (as its title The Logic of Chance implies).
[4] 'Thou' refers to the idealized person earlier mentioned.
[5] According to Giere (1976, p. 64) 'it is clear that the view of science that emerges is instrumentalistic. Theories themselves are not objects of belief, but only formal devices for organizing events and their associated degrees of belief. There is no need to regard the supposed objects of theories as "real", and to do so invites the charge of being unempirical or "metaphysical"'. The operational definition of probability in terms of potential betting behaviour, and the apparent fascination of some Bayesians with mathematical formalisms and resultant negligence of real-world problems, render this criticism understandable, but in my view it is too strong. I do not think that the Bayesian's denial of objectively warranted degrees of belief and of 'physical probabilities' necessarily implies an instrumental approach to scientific theories and concepts.
[6] More recently attempts have been made to fill the gap between objective theories of probability that admit no statements about the single case and subjective theories that deny the existence of objective probabilities 'in the world'. Of these, the 'propensity theory' is the most interesting. The best account is given by Mellor (1971). I do not discuss it here because it does not imply a fully fledged theory of statistical inference. It is probably most applicable to Likelihood Inference (see Hacking, 1965), but the author of the most fully articulated version of that approach, Edwards (1972), insists on a relative frequency interpretation of probability. An Objective Theory of Probability (Gillies, 1973) is also worthy of study.
Chapter 5 Further Points of Difference
In Chapter 4 we reviewed some philosophical approaches to the concept of probability and mentioned the respective positions taken by the major schools of statistical inference. We turn now to some other issues that distinguish the schools one from another.
5.1 The Nature and Location of Statistical Inference
By the 1920s dissatisfaction with the concept of inverse probability was widespread and statisticians were seeking an alternative way in which to make probability statements about unknown population parameters without recourse to a priori knowledge. Fisher (1930) introduced his fiducial argument (see section 4.4), but despite its initial popularity and wide influence (Fisher, 1932, 1935) its lack of a coherent logical basis was apparent to Neyman and Pearson (if to few others). Both Neyman (1929) and Pearson (1925) had flirted with the concept of inverse probability, but by 1933, under the influence of von Mises' frequency account of probability, they had developed a radically different approach:

Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. Here, for example, would be such a 'rule of behaviour': to decide whether a hypothesis H, of a given type, be rejected or not, calculate a specified character, X, of the observed facts; if X > X₀ reject H, if X < X₀ accept H. Such a rule tells us nothing as to whether in a particular case H is true when X < X₀ or false when X > X₀. But it may often be proved that if we behave according to such a rule, then in the long run we shall reject H when it is true not more, say, than once in a hundred times, and in addition we may have evidence that we shall reject H sufficiently often when it is false (Neyman and Pearson, 1933, p. 291).
These remarks encapsulate the Neyman-Pearson approach. They abandoned the attempt to make direct inferences about hypotheses. Instead they limited their probability statements to where they can be made uncontroversially — the sample space.
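The 'rule of behaviour' quoted above is easily simulated. The sketch below is an illustrative addition, using the 0.05 level rather than 'once in a hundred times', and shows the two long-run error rates that such a rule controls.

# Illustrative simulation (mine), at the 0.05 level rather than 'once in a hundred times':
# the long-run behaviour of a fixed rejection rule for a one-sample t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 25, 20_000

def rejection_rate(true_mean):
    count = 0
    for _ in range(reps):
        x = rng.normal(true_mean, 1.0, n)
        if stats.ttest_1samp(x, 0.0).pvalue < 0.05:
            count += 1
    return count / reps

print(rejection_rate(0.0))   # about 0.05: the long-run rate of rejecting a true hypothesis
print(rejection_rate(0.5))   # well above 0.05: the long-run rate of rejecting a false one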
If random sampling can be ensured (or assumed), then for a given population parameter and an assumed statistical model the distribution of sample outcomes can be computed. The task Neyman and Pearson set themselves was to partition the sample space of possible outcomes in such a way that in the long run decisions as to particular parameters would be in some sense optimal. This entailed the development of minimum variance estimators and the like. In a further departure from Fisher's prescription they identified two forms of error: the wrong rejection of the null hypothesis (the familiar type 1 error) and, in addition, the mistaken acceptance of the null hypothesis (the type 2 error). Having identified these two forms of mistaken inference, a number of strategies are available. That chosen by Neyman and Pearson was to develop tests with a specified type 1 error and, within this constraint, to minimize the type 2 error. This led to the search for so-called Uniformly Most Powerful Tests.

As we have seen, the espousal of a relative frequency interpretation of probability, and the emphasis upon the sample space, prohibit statements as to the likely truth of an individual statistical hypothesis. The same is true of the confidence interval. A Neyman-Pearson 95% confidence interval is a statement about the interval, not the parameter. Of all intervals constructed according to the Neyman-Pearson rule, 95% will cover the true population parameter; but we can say nothing about this interval. Any confidence (in the layman's sense of the term) that we may have in a particular interval is undefined probabilistically.

From the perspective of the other three schools of inference this is not good enough. In particular, Bayesians and the proponents of Likelihood Inference take a radically different approach. At the heart of these two methods is the idea of likelihood, a concept popularized by Fisher and intermittently advocated by him. We therefore turn now to an examination of this concept.

Central to the Neyman-Pearson account is the view that the probability of obtaining a sample result X given parameter value θ, denoted P(X/θ), is a function of X with θ regarded as a constant. Likelihood inverts this conceptualization. We define the likelihood of hypothesis θ, given the sample result X, denoted by L(θ/X), as being proportional to P(X/θ), the constant of proportionality being arbitrary. Here the obtained sample result is regarded as the constant, rival values of the parameter, θ, the variable. The underlying idea is that if we wish to assess the relative merits of rival hypotheses once the data have been collected, the hypothesis under which the sample result was most probable is regarded as being the best supported by the experimental outcome. It has the greatest likelihood. Hence likelihoods are intrinsically comparative. The likelihood ratio of two hypotheses on some datum is the ratio of the likelihoods on that datum. Hence in applications of the idea of likelihood the constant of proportionality is no embarrassment; it cancels out. Comparisons of many hypotheses may conveniently be achieved by expressing the likelihood of each hypothesis as a ratio of the likelihood for the
hypothesis and the maximum likelihood. The arbitrary constant is adjusted so that this maximum likelihood takes the value 1. Thus conceived, the likelihood of a hypothesis is equal to its likelihood ratio. When these likelihoods are plotted, a likelihood curve is formed of the likelihood function. The hypotheses may be discrete or continuous in nature.

It must be clear that likelihoods are not probabilities: they violate the axioms of probability, they do not sum to one, and no special meaning attaches to any part of the area under a likelihood curve. What, then, is their attraction? The proponents of Bayesian and Likelihood Inference assert that likelihood ratios constitute rational grounds for preferring one hypothesis to another in the light of data. The Law of Likelihood states:

Within the framework of a statistical model, a particular set of data supports one statistical hypothesis better than another if the likelihood of the first hypothesis, on the data, exceeds the likelihood of the second hypothesis.

Furthermore, there is nothing else to be learned from the sample data, because the Likelihood Principle asserts:

Within the framework of a statistical model, all the information which the data provide concerning the relative merits of two hypotheses is contained in the likelihood ratio of those hypotheses on the data.

The centrality of the Likelihood Principle to Bayesian and Likelihood Inference implies two major contrasts with the Neyman-Pearson school. First, inference is direct to the parameter space; there is no indirect inference by 'analogy' from assumed typicality of an outcome of a procedure. 'The second criticism is much more substantial. The use of significance tests based on the sampling distribution of a statistic t(x) is in direct violation of the Likelihood Principle.... Similar objections can be raised to confidence intervals derived from such tests' (Lindley, 1965, pp. 68-69). Classical techniques violate the Likelihood Principle because they allow inferences to be drawn on the basis of tail area probabilities from a sampling distribution — rather than the ordinate. That is, 'information' from sample statistics not observed is treated as relevant to the inference — 'a hypothesis which may be true may be rejected because it has not predicted observable results which have not occurred. This seems a remarkable procedure' (Jeffreys, 1961, p. 385).

Thus far, in deriving the likelihood function, the Bayesian and Likelihood Inference schools are in agreement; they differ, however, in the use to which it is put. The proponents of Likelihood Inference (Barnard, 1949; Barnard et al., 1962; Birnbaum, 1962, 1968; Edwards, 1972) regard the likelihood function as the only form of relevant information — not only from the sample but in total. Likelihood itself is taken as a measure of support for a hypothesis, no probability being thought appropriate. Bayesians, however, acknowledge the likelihood function as the sole source of sample information but combine it (multiplicatively) with the probability distribution which represents prior opinion to form the Bayesian posterior distribution. Thus
Bayesians make thoroughgoing (subjective) probability inferences to the parameter space.

Fisher's position with regard to these issues is rather hard to define, although he played a central role in the implied controversies. Indeed, he invented the concept of likelihood, and we may judge his attitude towards the Neyman-Pearson inference to the sample space (Fisher, 1959, p. 101):

To regard the test as one of a series is artificial; the examples given have shown how far this unrealistic attitude is capable of deflecting attention from the vital matter of the weight of the evidence actually supplied by the observations on some possible theoretical view, to, what is really irrelevant, the frequency of events in an endless series of repeated trials which will never take place.
However, some of Fisher's significance tests violated the Likelihood Principle: 'at no time does Fisher state why one is allowed to add the clause "or a greater value" so as to form a region of rejection' (Hacking, 1965, p. 82), and we have already referred to the difficulties associated with his fiducial argument.

This section may be summarized succinctly. Fisherian inference allows statements of fiducial probability on the parameter space. Neyman-Pearson inference allows statements of relative frequency probability on the sample space. Likelihood Inference allows statements of likelihood on the parameter space. Bayesian inference allows statements of subjective probability on the parameter space.
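To fix ideas, a likelihood function and a likelihood ratio can be computed for a simple case. The binomial data below are hypothetical and the code is an illustrative addition, not part of the original text.

# Illustrative code (hypothetical binomial data): a likelihood function scaled so
# that the maximum likelihood is 1, and a likelihood ratio for two rival values.
import numpy as np
from scipy.stats import binom

n, k = 12, 9                                   # 9 successes in 12 trials
theta = np.linspace(0.01, 0.99, 99)            # rival values of the parameter
likelihood = binom.pmf(k, n, theta)
scaled = likelihood / likelihood.max()         # each hypothesis's likelihood relative to the maximum

print(binom.pmf(k, n, 0.75) / binom.pmf(k, n, 0.5))   # about 4.8 in favour of theta = 0.75 over 0.5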
5.2 Inference versus Decision; Initial versus Final Precision
We saw in the previous section that Neyman and Pearson limited their probability statements to the sample space, thus avoiding the morass of Fisher's fiducial argument and the discredited notion of inverse probability. The logical structure of their approach is hard to fault; the probabilistic statements they make are true deductively (conditional upon random sampling, the appropriate statistical model, etc.). On the other hand, it is not clear if any inference is involved. Indeed, by 1938 Neyman had eschewed the concept, preferring 'inductive behaviour', and speaking of hypothesis testing he asserted that 'to decide to "affirm" does not mean to "know" nor even to "believe"' (Neyman, 1938, p. 56). These sentiments, in the prevailing positivist climate, were not unwelcome. Hence Neyman and Pearson viewed the statistician's task to be to devise optimum decision rules, not to inform belief. In this they were fundamentally opposed by Fisher and latterly by Bayesians and likelihood theorists.

The differential emphasis between decisions and inferences reflects a further distinction that may be drawn between the schools of inference. Neyman and Pearson's concentration upon the sample space of sample outcomes entails the formulation of the decision rule before the data are collected. It is at this point that decisions as to significance level, sample size, etc. must be taken. This
done, the 'inference' follows automatically. The other schools emphasize the evaluation of rival hypotheses in the light of the obtained data. The distinction was recognized by Hacking (1965, p. 99): 'Thus we infer that the Neyman-Pearson theory is essentially the theory of before-trial betting. It is not necessarily the theory for evaluating hypotheses after trials on the set-up in question have been made, and the results discovered. In all the cases studied so far the theory of likelihood tests does, however, seem appropriate to the latter situation' [1].

The distinction between before- and after-trial betting alluded to above has been variously termed 'the forward and backward look' (Hogben, 1957) and 'initial and final precision' (Barnett, 1973). I will follow Barnett's terminology, and we note that in so far as Fisherian, Bayesian, and Likelihood Inference all emphasize likelihood they are seeking final precision.
5.3 The Role of Prior Information
The schools differ in the manner in which they employ prior information. As we have seen, the Neyman-Pearson school requires the use of judgement prior to data collection. Pearson (1962, pp. 395-396) explained the thinking behind this approach thus:

We were certainly aware that inferences must make use of prior information and that decisions must take account of utilities, but after some considerable thought and discussion round these matters we came to the conclusion, rightly or wrongly, that it was rarely possible to give sure numerical values to these entities, that our line of approach must proceed otherwise. Thus we came down on the side of using only probability measures which could be related to relative frequency. Of necessity, as it seemed to us, we left in our mathematical model a gap for the exercise of a more intuitive process of personal judgement in such matters ... as the choice of the most likely class of admissible hypotheses, the appropriate significance level, the magnitude of worthwhile effects and the balance of utilities.
But once these judgements have been made, and the data collected, the logic of the analysis must be followed remorselessly. To do otherwise, to test at the 0.05 level and then announce a result significant at the 0.1 level, for example, is to violate the model. Fisher's opposition to prior probabilities was at least as fierce as that of Neyman and Pearson, but equally he eschewed rules of inductive behaviour laid down before data collection. Hence, although he was responsible for the near-universal adoption of the 0.05 and 0.01 significance levels, his own approach to them was more cavalier. He regarded them as arbitrary conventional cut-off points that could be waived if circumstances dictated. That is, he regarded the decision whether to accept or reject the null hypothesis as residing with the investigator rather than being taken for him in the Neyman-Pearson manner. The choice of the rejection region could be made after the data had been analysed, and in making this judgement the investigator should
be influenced by the associated probability — and whatever other considerations seem appropriate. The attitude of likelihood theorists towards the role of prior information is essentially similar to that of Fisher; namely, that the outcome of the statistical analysis should be strictly objective but that the interpretation of that finding should be informally mediated by matters of prior plausibility, utility, and the like. They differ, of course, in the nature of the objective statistical outcome, Fisher emphasizing associated probabilities and fiducial intervals, likelihood theorists computing the likelihood function.

Bayesian inference is unique among the schools considered here in the formal manner in which it incorporates prior information. The justification is disarmingly simple: 'in so far as we want to arrive at opinions on the basis of data, it seems inescapable that we should use, together with the data, the opinions that we had before it was gathered' (Savage et al. 1962, p. 13). For, as we have seen, according to the Bayesian all probabilities are opinions, and the role of data is to revise existing opinion by the successive application of Bayes's theorem. This view requires that prior knowledge or opinion must be quantified — whether it is 'readily quantifiable' or not. Naturally the fact that much opinion is not amenable to a frequentist interpretation does not concern the Bayesian, as he rejects the frequentist theory of probability. Bayesians take as axiomatic the proposition that if one has an opinion it can be specified in quantitative terms. More surprisingly, and controversially, they assert that if one does not have an opinion that can be quantified too!

Bayesians differ, then, in their explicit consideration of frankly subjective issues. This is the source of the major criticisms from those who espouse the more objective methods. The Bayesian response is to point out that no school of statistical inference claims an approach free of subjective judgements (a fact social scientists would do well to ponder), but that uncertainty should be dealt with rigorously rather than intuitively and informally. It is for this reason that Savage makes the superficially surprising claim that 'it helps to emphasize at the outset that the role of subjective probability in statistics is, in a sense, to make statistics less subjective' (Savage et al. 1962, p. 9).
5.4 The Relevance of the Stopping Rule
Imagine psychologist A chose to study the efficacy of a drug treatment for schizophrenics by difference in prognostic ratings in a test-treatment-retest design. Now if he finds it difficult to obtain suitable patients he may decide to do a dependent-means t-test after only 20 subjects have been run. If he finds the test statistic is not quite significant at the 0.05 level he may decide to run a further 10 subjects and perform a further t-test on all 30. Psychologist A cannot now perform a t-test in the Neyman-Pearson tradition at the 0.05 level of significance (under the reasonable assumption that had he reached significance with the first 20 he would have stopped and rejected the null hypothesis). The reason is that the first test had an associated type 1 error of
0.05, and to this must be added the probability of going on and wrongly rejecting the null hypothesis. In the limit, as n gets large, the type 1 error for this procedure approaches 0.05 + (0.95 × 0.05) = 0.0975.

We have, then, specified a possible abuse of Neyman-Pearson significance testing. What is of interest, however, is the evident dependence of the inference not only upon A's data but also upon the manner in which it was collected. To highlight the issue, suppose psychologist B ran 30 subjects outright, in a different hospital, and performed his t-test. It is possible that A and B may obtain the same data and yet be forced to differing conclusions. We may even imagine rather awkward half-way positions in which, say, A merely glanced at the means of his two dependent samples without actually computing the t statistic. Those who espouse Bayesian and Likelihood Inference regard it as ludicrous that the furtive behaviour and intentions of a scientist should influence the evidential import of his data. After all, according to the Likelihood Principle all the support for a hypothesis is contained within the likelihood function. So it is that Neyman-Pearson procedures differ from those that adhere to the Likelihood Principle in being sensitive to the stopping rule by which data are collected. Both sides regard the other's position on this issue as a fundamental weakness: 'I learned the stopping rule principle from Professor Barnard, in conversation in the summer of 1952. Frankly, I then thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that some people resist an idea so patently right!' (Savage et al. 1962, p. 76).
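Psychologist A's procedure is easily simulated. The sketch below is an illustrative addition; it treats the dependent-means test as a one-sample t-test on difference scores, with the null hypothesis true.

# Illustrative simulation (mine) of psychologist A's procedure, with the dependent-means
# test treated as a one-sample t-test on difference scores and the null hypothesis true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
reps, rejections = 20_000, 0

for _ in range(reps):
    x = rng.normal(0.0, 1.0, 30)                          # no treatment effect
    if stats.ttest_1samp(x[:20], 0.0).pvalue < 0.05:
        rejections += 1                                   # stopped and rejected after 20 subjects
    elif stats.ttest_1samp(x, 0.0).pvalue < 0.05:
        rejections += 1                                   # ran 10 more and then rejected

print(rejections / reps)   # noticeably above 0.05, though below the limiting value of 0.0975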
Table 5.4.1  A comparative guide to the major schools of statistical inference

School            Interpretation    Emphasis on     Emphasis on    Use of prior   Initial or        Location and nature of inference              Sensitivity to
                  of probability    decision or     testing or     information    final precision                                                 sampling procedure
                                    inference       estimation
Neyman-Pearson    Frequentist       Decision        Both           Informal       Initial           Probability statement on sample domain        Sensitive
Fisherian         Fiducial          Inference       Both           Informal       Final             Probability statement on parameter domain     Variable
Bayesian          Subjective        Inference       Estimation     Formal         Final             Probability statement on parameter domain     Insensitive
Likelihood        Frequentist       Inference       Estimation     Informal       Final             Likelihood statement on parameter domain      Insensitive
The Neyman-Pearson theory requires cognizance to be taken of the rule under which data were collected because it is only then that the sample statistic can be placed in the appropriate reference class (sample space) of possible outcomes. Since likelihood theory and Bayesian inference have no place for sample statistics that might have obtained, but did not, such considerations are irrelevant to the inference. Fisher's position on this issue is unclear; although he endorsed the Likelihood Principle he occasionally violated it.

Table 5.4.1 is presented as a summary statement of the questions examined in the preceding chapters. It enables the reader to identify at a glance the commonalities and differences between the major schools of statistical inference. The price of such a stark document is, of course, that it oversimplifies extremely complex issues. In particular it makes no allowance for differences of opinion within an individual approach — this is unrealistic. Furthermore, the entries represent my own synthesis of the respective positions of the four schools and are therefore unlikely to meet with universal approval.
Notes

[1] Hacking (1965, p. 96) illuminates the distinction nicely, if a little unfairly:
Suppose you are convinced on the probabilities of different horses winning a race. Then you will choose among bookmakers he who gives you what, in your eyes, are the most favourable odds. But after the race, seeing that the unexpected did in fact happen, and the lame horse won, you may wish you had wagered with another bookmaker. Now in racing we must bet before the race. But in testing hypotheses, the whole point of testing is usually to evaluate the hypothesis after a trial on the set-up has been made. And although size and power may be the criteria for before-trial betting, they need not be the criteria for after-trial evaluation.
Chapter 6 An Evaluation of the Major Schools of Statistical Inference
6.1 Fisherian Inference
Prior to Fisher’s enormously influential contribution, statistical inference had primarily involved large sample estimation of population parameters by the method of inverse probability. It is true that some 20 years earlier Gosset, under the pseudonym ‘Student’, had succeeded in deriving the t distribution, but ‘small sample statistics’ only became fashionable with the publication of Fisher’s texts (1932, 1935). The range of significance tests presented made small sample statistical inference possible, provided that the question ‘how much?’ was replaced by ‘whether?’ — that is, provided estimation gave way to significance. I have argued at length that this was a retrograde development, but we turn now to a more detailed examination of Fisherian inference. Suppose, to take a very simple example, that our interest lies in the number of problems, of the Progressive Matrices type, that can be solved by a specified group of schoolchildren. Further suppose that our knowledge of performance on this task is such that we already have a stable measure of score variance and that our interest, therefore, lies solely in the mean. Let the data collected from 20 subjects be: X\ 13, 15, 16, 18, 20, 20, 22, 24, 25, 25, 26, 27, 27, 27, 28, 30, 32, 34, 34, 37 (