Selected Papers of Frederick Mosteller (Springer Series in Statistics) 0387202714, 9780387202716


English Pages 670 [651] Year 2006


Springer Series in Statistics Advisors: P. Bickel, P. Diggle, S. Fienberg, U. Gather, I. Olkin, S. Zeger

Springer Series in Statistics

Alho/Spencer: Statistical Demography and Forecasting.
Andersen/Borgan/Gill/Keiding: Statistical Models Based on Counting Processes.
Atkinson/Riani: Robust Diagnostic Regression Analysis.
Atkinson/Riani/Cerioli: Exploring Multivariate Data with the Forward Search.
Berger: Statistical Decision Theory and Bayesian Analysis, 2nd edition.
Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications, 2nd edition.
Brockwell/Davis: Time Series: Theory and Methods, 2nd edition.
Bucklew: Introduction to Rare Event Simulation.
Cappé/Moulines/Rydén: Inference in Hidden Markov Models.
Chan/Tong: Chaos: A Statistical Perspective.
Chen/Shao/Ibrahim: Monte Carlo Methods in Bayesian Computation.
Coles: An Introduction to Statistical Modeling of Extreme Values.
Devroye/Lugosi: Combinatorial Methods in Density Estimation.
Efromovich: Nonparametric Curve Estimation: Methods, Theory, and Applications.
Eggermont/LaRiccia: Maximum Penalized Likelihood Estimation, Volume I: Density Estimation.
Fahrmeir/Tutz: Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd edition.
Fan/Yao: Nonlinear Time Series: Nonparametric and Parametric Methods.
Ferraty/Vieu: Nonparametric Functional Data Analysis: Models, Theory, Applications, and Implementation.
Fienberg/Hoaglin: Selected Papers of Frederick Mosteller.
Frühwirth-Schnatter: Finite Mixture and Markov Switching Models.
Ghosh/Ramamoorthi: Bayesian Nonparametrics.
Glaz/Naus/Wallenstein: Scan Statistics.
Good: Permutation Tests: Parametric and Bootstrap Tests of Hypotheses, 3rd edition.
Gouriéroux: ARCH Models and Financial Applications.
Gu: Smoothing Spline ANOVA Models.
Györfi/Kohler/Krzyżak/Walk: A Distribution-Free Theory of Nonparametric Regression.
Haberman: Advanced Statistics, Volume I: Description of Populations.
Hall: The Bootstrap and Edgeworth Expansion.
Härdle: Smoothing Techniques: With Implementation in S.
Harrell: Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis.
Hart: Nonparametric Smoothing and Lack-of-Fit Tests.
Hastie/Tibshirani/Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
Hedayat/Sloane/Stufken: Orthogonal Arrays: Theory and Applications.
Heyde: Quasi-Likelihood and its Application: A General Approach to Optimal Parameter Estimation.
Huet/Bouvier/Poursat/Jolivet: Statistical Tools for Nonlinear Regression: A Practical Guide with S-PLUS and R Examples, 2nd edition.
Ibrahim/Chen/Sinha: Bayesian Survival Analysis.
Jolliffe: Principal Component Analysis, 2nd edition.
Knottnerus: Sample Survey Theory: Some Pythagorean Perspectives.
Kolen/Brennan: Test Equating: Methods and Practices.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume I.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume II.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume III.
Küchler/Sørensen: Exponential Families of Stochastic Processes.
Kutoyants: Statistical Inference for Ergodic Diffusion Processes.
Lahiri: Resampling Methods for Dependent Data.
Le Cam: Asymptotic Methods in Statistical Decision Theory.
Le Cam/Yang: Asymptotics in Statistics: Some Basic Concepts, 2nd edition.
Liu: Monte Carlo Strategies in Scientific Computing.
Longford: Models for Uncertainty in Educational Testing.
(continued after p. 660)

Stephen E. Fienberg David C. Hoaglin Editors

Selected Papers of Frederick Mosteller

Stephen E. Fienberg Department of Statistics Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA 15213 [email protected]

David C. Hoaglin Abt Associates Inc. 55 Wheeler Street Cambridge, MA 02138 [email protected]

Library of Congress Control Number: 2006926459 ISBN-10: 0-387-20271-4 ISBN-13: 978-0387-20271-6 Printed on acid-free paper. © 2006 Springer Science + Business Media, LLC All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science + Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed in the United States of America. 9 8 7 6 5 4 3 2 1 springer.com

(MVH)

Preface

Both of us have had the good fortune to work closely with Frederick Mosteller and learn from him throughout our professional careers. We have been inspired by his creative approach to statistics and its applications and we, along with countless others, have benefited from reading his papers and books. Through this volume we hope to share the variety and depth of Mosteller’s writings with a new generation of researchers, who can build upon his insights and efforts.

In many ways this volume of selected papers can be viewed as a companion to an earlier volume that we assembled with William Kruskal and Judith Tanur, A Statistical Model: Frederick Mosteller’s Contributions to Statistics, Science, and Public Policy, Springer-Verlag (1990), and Fred’s forthcoming autobiography, which will also be published by Springer-Verlag.

This volume contains a reasonably complete bibliography of Mosteller’s papers, books and other writings. Several of these, which Fred co-authored with his friend and long-time collaborator John Tukey, are included in the multi-volume Collected Works of John W. Tukey (CWJWT). Conveniently, the bibliography in CWJWT tells whether a paper appears in one of those volumes. CWJWT includes papers P10, P70, P133, and P142 from the Mosteller bibliography. We say that the bibliography is “reasonably” complete because we know that Fred continues to work on projects and to collaborate on revisions of books.

We chose the papers for this volume to give a broad perspective on Fred’s work, ranging from statistical theory through applications in a variety of domains, and reflecting his long and varied career. We iterated with Fred over a period of a couple of years until we had a collection that captured the nature of his contributions and also fit within a single volume. Each paper reflects one or more aspects of Fred’s approach to statistical research and its application, and the papers often reflect his general philosophy on how one should go about doing good science.

In preparing the papers for this volume, we scanned the original, converted the resulting image to text, and transformed the text into input for LaTeX. By careful proofreading we have tried to ensure that each paper is faithful to the original. We have, however, corrected typographic errors in the original papers when we encountered them (not often) and occasionally made other straightforward changes. Also, we have restructured some tables, because of incompatibilities in page size and layout.

We are deeply indebted to a number of others who contributed to the preparation of the volume, including Cleo Youtz, Marjorie Olson, and Jessa Piaia at Harvard University, and especially to those who actually prepared parts of the LaTeX document over the 15 years we have been slowly working on this volume: Valerie Baddon (at York University), Howard Fienberg, Valerie Lenhart, John Clark, Heather Wainer, and Heidi Sestrich (all at Carnegie Mellon). Heather in particular has valiantly worked with us to ensure that the format and content were indeed correct and as consistent as feasible, given the diversity of the original publications and their format styles, and she helped immeasurably in the reproduction of the figures from the original papers.

Stephen E. Fienberg
David C. Hoaglin

December 24, 2005

Contents

Preface

Frederick Mosteller—A Brief Biography
    Stephen E. Fienberg

Bibliography

1. Unbiased Estimates for Certain Binomial Sampling Problems with Applications
    M.A. Girshick, Frederick Mosteller, L.J. Savage
2. On Some Useful “Inefficient” Statistics
    Frederick Mosteller
3. A k-Sample Slippage Test for an Extreme Population
    Frederick Mosteller
4. The Uses and Usefulness of Binomial Probability Paper
    Frederick Mosteller, John W. Tukey
5. The Education of a Scientific Generalist
    Hendrik Bode, Frederick Mosteller, John W. Tukey, Charles Winsor
6. Remarks on the Method of Paired Comparisons: I. The Least Squares Solution Assuming Equal Standard Deviations and Equal Correlations
    Frederick Mosteller
7. Remarks on the Method of Paired Comparisons: II. The Effect of an Aberrant Standard Deviation When Equal Standard Deviations and Equal Correlations Are Assumed
    Frederick Mosteller
8. Remarks on the Method of Paired Comparisons: III. A Test of Significance for Paired Comparisons when Equal Standard Deviations and Equal Correlations Are Assumed
    Frederick Mosteller
9. An Experimental Measurement of Utility
    Frederick Mosteller, Philip Nogee
10. A Mathematical Model for Simple Learning
    Robert R. Bush, Frederick Mosteller
11. A Model for Stimulus Generalization and Discrimination
    Robert R. Bush, Frederick Mosteller
12. The World Series Competition
    Frederick Mosteller
13. Principles of Sampling
    William G. Cochran, Frederick Mosteller, John W. Tukey
14. Stochastic Models for the Learning Process
    Frederick Mosteller
15. Factorial 1/2: A Simple Graphical Treatment
    Frederick Mosteller, D.E. Richmond
16. A Comparison of Eight Models
    Robert R. Bush, Frederick Mosteller
17. Optimal Length of Play for a Binomial Game
    Frederick Mosteller
18. Tables of the Freeman-Tukey Transformations for the Binomial and Poisson Distributions
    Frederick Mosteller, Cleo Youtz
19. Understanding the Birthday Problem
    Frederick Mosteller
20. Recognizing the Maximum of a Sequence
    John P. Gilbert, Frederick Mosteller
21. The Distribution of Sums of Rounded Percentages
    Frederick Mosteller, Cleo Youtz, Douglas Zahn
22. The Expected Coverage to the Left of the ith Order Statistic for Arbitrary Distributions
    Barry H. Margolin, Frederick Mosteller
23. Bias and Runs in Dice Throwing and Recording: A Few Million Throws
    Gudmund R. Iversen, Willard H. Longcor, Frederick Mosteller, John P. Gilbert, Cleo Youtz
24. An Empirical Study of the Distribution of Primes and Litters of Primes
    Frederick Mosteller
25. A Conversation About Collins
    William B. Fairley, Frederick Mosteller
26. Statistics and Ethics in Surgery and Anesthesia
    John P. Gilbert, Bucknam McPeek, Frederick Mosteller
27. Experimentation and Innovations
    Frederick Mosteller
28. New Statistical Methods in Public Policy. Part I: Experimentation
    Frederick Mosteller, Gale Mosteller
29. Classroom and Platform Performance
    Frederick Mosteller
30. The Clinician’s Responsibility for Helping to Improve the Treatment of Tomorrow’s Patients
    Bucknam McPeek, John P. Gilbert, Frederick Mosteller
31. Innovation and Evaluation
    Frederick Mosteller
32. Combination of Results of Stated Precision: I. The Optimistic Case
    Frederick Mosteller, John W. Tukey
33. Combination of Results of Stated Precision: II. A More Realistic Case
    Frederick Mosteller, John W. Tukey
34. Allocating Loss of Precision in the Sample Mean to Wrong Weights and Redundancy in Sampling with Replacement from a Finite Population
    J.L. Hodges, Jr., Frederick Mosteller, Cleo Youtz
35. Reporting Clinical Trials in General Surgical Journals
    John D. Emerson, Bucknam McPeek, Frederick Mosteller
36. Compensating for Radiation-Related Cancers by Probability of Causation or Assigned Shares
    Frederick Mosteller
37. Methods for Studying Coincidences
    Persi Diaconis, Frederick Mosteller
38. A Modified Random-Effect Procedure for Combining Risk Difference in Sets of 2 × 2 Tables from Clinical Trials
    John D. Emerson, David C. Hoaglin, Frederick Mosteller
39. The Case for Smaller Classes and for Evaluating What Works in the Schoolroom
    Frederick Mosteller
40. Frederick Mosteller and John W. Tukey: A Conversation
    Moderated by Francis J. Anscombe

Frederick Mosteller—A Brief Biography Stephen E. Fienberg Carnegie Mellon University

Frederick Mosteller celebrated his 89th birthday this year (2005) but remains professionally active and involved in several research projects. For an extensive review of many of his contributions, see the special volume A Statistical Model [5], prepared in his honor. John Tukey’s biography of Fred (as he is known to his friends, collaborators, colleagues, and students) in that volume is especially noteworthy.

Fred was born Charles Frederick Mosteller in Clarksburg, West Virginia, on December 24, 1916.¹ The family later moved to the Pittsburgh area, where Fred attended Schenley High School and later Carnegie Institute of Technology (now Carnegie Mellon University). In college, he was interested in mathematics and, in particular, in combinatoric problems. This inclination led him to the statistician Edwin C. Olds, who in turn steered Fred into the field of statistics. Fred completed his Sc.M. degree at Carnegie Tech in 1939 and then enrolled at Princeton University to work on a Ph.D. with Samuel Wilks. In addition to participating with Wilks and others in a wartime research group (see the discussion in [18]), Fred assisted Wilks in his role as editor of the Annals of Mathematical Statistics. Fred received the ASA Samuel S. Wilks Award in 1986. At Princeton Fred also began his lifelong interaction and collaboration with John Tukey, described in part in [3] and in their joint interview in Statistical Science [14].

¹ This is a birth-date he shares with my mother, and a birthday (same day but different year) he shares with David Wallace, a collaborator of Fred’s. One of Fred’s favorite probability examples has been the well-known birthday problem; see [8]. This event also links to another of Fred’s preoccupations, the statistical analysis of coincidences [4].

Fred met his wife, Virginia (1917–2000), when he was a college freshman (she rode the same streetcar that Fred took from Wilkinsburg to the campus each day), and they were married in 1941, just as the statistical war research effort was gearing up. Fred accepted a position in the Department of Social Relations at Harvard University in 1946, and he remained on the faculty of the university in various positions (see below) for the rest of his career. Fred and Virginia moved to their home in Belmont after the birth of their son Bill in 1947. Their daughter Gale was born in 1953.

Fred became Professor of Mathematical Statistics at Harvard in 1951 and led the effort to create the Department of Statistics, which opened in 1957. He served as the department’s first chair, from 1957 to 1969.

I remember my first meeting with Fred in the fall of 1964 as I was entering graduate school. Fred took me to lunch at the Harvard Faculty Club, where he insisted that I try the horse steak. As I was busy chewing, he shared with me some handwritten notes from Tukey on assessing probability assessors, a problem that was to be my first project as one of Fred’s research assistants. Although we wrote up the results of that effort as a joint technical report, I confess it was not until a collaboration on the topic with Morrie DeGroot, some 18 years later, that I came to understand and appreciate Fred’s and John’s insights into the problem.

As I was completing my Ph.D. under Fred’s supervision in 1968, he organized a group of us to write a book built around the recent developments in categorical data analysis, especially linked to the use of log-linear models. This project ultimately produced Discrete Multivariate Analysis: Theory and Practice [1]. He was the guiding light behind the project and our constant editor and sometimes contributor, but in typical fashion he insisted that only Yvonne Bishop, Paul Holland, and I be listed as “authors.” Ultimately, he agreed to let us acknowledge his efforts by listing him as a “collaborator” on the title page.

During his years in the Statistics Department, Fred formally supervised 17 Ph.D. dissertations, but he served on the committees of countless others in Statistics, Social Relations, and other parts of Harvard. Over the years he was always available to offer comments on works in progress and unpublished manuscripts, and wise students took advantage of his generosity.

Fred later served as chair of two other departments, Biostatistics and Health Policy and Management, both in the School of Public Health, and he also taught courses in the Harvard Law School and the John F. Kennedy School of Government. On retirement in 1987, Fred maintained his office in the Department of Statistics, where he continued with his usual array of multidisciplinary projects, almost as if nothing had changed. At the end of 2003, he dismantled his office and relocated to the Washington, D.C. area.

Fred’s bibliography is astounding; it contains 65 books, nearly 350 papers in books and journals, 41 miscellaneous publications, and 26 reviews.


And, not surprisingly, many of these were coauthored or coedited with more than 200 other individuals. My personal favorite is Fred’s classic 1964 book with David Wallace, Inference and Disputed Authorship: The Federalist [15], which was republished in 1984 in expanded form [16]. The intriguing analyses of The Federalist Papers presented by Mosteller and Wallace include one of the first major uses of Bayesian methods, and they provide an early exposition of Laplace’s method for approximating distributions.

A partial list of Fred’s varied methodological research interests includes publications on inefficient statistics, sampling, Bayesian methods, paired comparisons, the jackknife, statistics in sports, contingency table analysis, exploratory data analysis, randomized experiments, robustness, and meta-analysis and research synthesis. Depending on how one measures collaboration, John Tukey is Fred’s most extensive collaborator; but others prominent on the list include Thomas Chalmers, John Gilbert, David Hoaglin, Bucknam McPeek, and Fred’s longtime research assistant, Cleo Youtz. Other collaborators of note include John Bailar, Bill Cochran, Persi Diaconis, Milton Friedman, Bill Kruskal, Pat Moynihan, Jimmie Savage, Judy Tanur, Allen Wallis, Sam Wilks, Charlie Winsor, and Gale Mosteller (his daughter). And I am pleased to be included on the list.

Beginning in the 1950s, Fred helped lead an effort to bring probability and statistics to American high schools. He was instrumental in producing teacher’s manuals, and this involvement led to one of the early elementary statistics texts, Probability with Statistical Applications [13]. Fred used a version of this book as the text for his pioneering 1961 televised course on NBC’s Continental Classroom, which introduced him to students across the nation, young and old. Many textbooks on varied topics followed.
When the American Statistical Association set up a joint committee with the National Council of Teachers of Mathematics (NCTM) in the 1960s to change the statistical content of the secondary school mathematics curriculum, Fred led the effort once again. He helped to organize, and goaded others into contributing to, the preparation of the ASA-NCTM Committee’s early products, including the 4-volume collection Statistics by Example [9, 10, 11, 12] and Statistics: A Guide to the Unknown [17], which has now appeared in multiple forms and multiple editions.

Fred has always been an organizer, and this talent was recognized by many different societies and other organizations that came to Fred for help with projects, as well as to fill leadership positions. Among the societies he has led as president are (in approximately chronological order): the Psychometric Society, the American Statistical Association, the Institute of Mathematical Statistics, the American Association for the Advancement of Science, and the International Statistical Institute. In the 1960s, he served as Chairman of the Board of Directors of the Social Science Research Council and later as Vice-chair of the President’s Commission on Federal Statistics, which led to the creation of the Committee on National Statistics at the National Research Council (NRC). At the NRC and elsewhere he has served on so many statistical and interdisciplinary committees and task forces that one observer [6, p. 87] was led to remark:

Applied mathematicians are of course essential in most assessments, for help in designing tests, auditing calculations, and assisting in the drawing of conclusions. Much of this is routine craftwork. But one unusual, crucial role should be recognized. It is my guess that statisticians Frederick Mosteller (Harvard) and John Tukey (Princeton) have served on or assisted more technical committees than anybody else alive . . . . It is not just their ability to manipulate numbers that keeps these experts in demand, but sensibility in thinking through questions of macro-experimental design: how inquiries should be cast, what evidence and logic are applicable, how discrimination can be increased, how uncertainties and sensitivities should be probed, what inferences are allowable from evidence. Mosteller and Tukey outlined this role in an article in 1949 in which they called for education of “scientific generalists” who would master “science, not sciences” [2].

The products of many of these activities are mentioned in the bibliography that follows. Recognition of his accomplishments has come in many forms. Fred has received honorary degrees from the University of Chicago (1973), Carnegie Mellon University (1974), Yale University (1981), Wesleyan University (1983), and Harvard University (1993).
He is also an honorary fellow of the Royal Statistical Society, an honorary member of the International Statistical Institute, and an elected member/fellow of the American Academy of Arts and Sciences, the American Philosophical Society, the Institute of Medicine, and the National Academy of Sciences. He has served as the Committee of Presidents of Statistical Societies’ R.A. Fisher Lecturer, and has been honored in numerous other ways. By the age of 89, most people have long since retired and turned to pastoral pursuits. Yet, though Fred has been officially retired for eighteen years, he remains remarkably active. Nonetheless I recall being at least partially surprised when, perusing the table of contents of a 2002 issue of Statistics in Medicine a few years ago, I came across a paper by Lincoln Moses, John Buehler, and guess who [7], on one of Fred’s longstanding research interests, meta-analysis; this was a reminder that we needed to continually update the bibliography that follows until this volume was sent to press!


Fred will remain a role model for statisticians and other scientists whom he has mentored, taught, and otherwise influenced over the years. It is our hope that reprinting a selection of his contributions to statistics in a single volume will help his work similarly to influence new generations of researchers.

References

1. Y.M.M. Bishop, S.E. Fienberg, and P.W. Holland (with contributions by R.J. Light and F. Mosteller). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, MA, 1975.
2. H. Bode, F. Mosteller, J.W. Tukey, and C. Winsor. The education of a scientific generalist. Science, 109:553–558, 1949. [Paper 5 in this volume]
3. D.R. Brillinger. John W. Tukey: The life and professional contributions. Annals of Statistics, 30:1535–1575, 2002.
4. P. Diaconis and F. Mosteller. Methods for studying coincidences. Journal of the American Statistical Association, 84:853–861, 1989. [Paper 37 in this volume]
5. S.E. Fienberg, D.C. Hoaglin, W.H. Kruskal, and J.M. Tanur, editors. A Statistical Model: Frederick Mosteller’s Contributions to Statistics, Science, and Public Policy. Springer-Verlag, New York, 1990.
6. W.W. Lowrance. Modern Science and Human Values. Oxford University Press, New York, 1985.
7. L.E. Moses, F. Mosteller, and J.H. Buehler. Comparing results of large clinical trials to those of meta-analysis. Statistics in Medicine, 21:793–800, 2002.
8. F. Mosteller. Understanding the birthday problem. The Mathematics Teacher, 55:322–325, 1962. [Paper 19 in this volume]
9. F. Mosteller, W.H. Kruskal, R.F. Link, R.S. Pieters, and G.R. Rising, editors. Statistics by Example: Exploring Data. Addison-Wesley, Reading, MA, 1973.
10. F. Mosteller, W.H. Kruskal, R.F. Link, R.S. Pieters, and G.R. Rising, editors. Statistics by Example: Weighing Chances. Addison-Wesley, Reading, MA, 1973.
11. F. Mosteller, W.H. Kruskal, R.F. Link, R.S. Pieters, and G.R. Rising, editors. Statistics by Example: Detecting Patterns. Addison-Wesley, Reading, MA, 1973.
12. F. Mosteller, W.H. Kruskal, R.F. Link, R.S. Pieters, and G.R. Rising, editors. Statistics by Example: Finding Models. Addison-Wesley, Reading, MA, 1973.
13. F. Mosteller, R.E.K. Rourke, and G.B. Thomas, Jr. Probability with Statistical Applications. Addison-Wesley, Reading, MA, 1961. Second Edition, 1970.
14. Frederick Mosteller and John W. Tukey: A Conversation. Moderated by Francis J. Anscombe. Statistical Science, 3:136–144, 1988. [Paper 40 in this volume]
15. F. Mosteller and D.L. Wallace. Inference and Disputed Authorship: The Federalist. Addison-Wesley, Reading, MA, 1964.
16. F. Mosteller and D.L. Wallace. Applied Bayesian and Classical Inference: The Case of the Federalist Papers. Springer-Verlag, New York, 1984.
17. J.M. Tanur, F. Mosteller, W.H. Kruskal, R.F. Link, R.S. Pieters, and G.R. Rising, editors. Statistics: A Guide to the Unknown. Holden-Day, San Francisco, 1972.
18. W.A. Wallis. The Statistical Research Group, 1942–1945 (with discussion). Journal of the American Statistical Association, 75:320–335, 1980.

Bibliography

This bibliography of Frederick Mosteller’s writings generally follows the four categories that Fred has used in his personal bibliography: books, papers, miscellaneous, and reviews. Ordinarily the publications in each category appear in chronological order. We deviate from that ordering to bring together closely related books. We list Fred’s name as it appeared on the publication, usually “Frederick Mosteller.” For works with other authors, we list the names in the order that they appeared on the publication. For publications by a committee or panel, we list the members. If, in addition to being one of the editors of a book, Fred is the author or coauthor of a chapter, we list that chapter under the book but not among the papers. We give any additional information that we have, such as new editions, different printings, and translations. We realize that our information about translations is not complete, nor do we always know of reprinting of papers or chapters.

BOOKS B1 Hadley Cantril and Research Associates in the Office of Public Opinion Research, Princeton University. Gauging Public Opinion. Princeton: Princeton University Press, 1944. Second Printing, 1947. • Chapter IV. Frederick Mosteller and Hadley Cantril. “The use and value of a battery of questions.” pp. 66–73. • Chapter VII. Frederick Mosteller. “The reliability of interviewers’ ratings.” pp. 98–106. • Chapter VIII. William Salstrom with Daniel Katz, Donald Rugg, Frederick Mosteller, and Frederick Williams. “Interviewer bias and rapport.” pp. 107–118. • Chapter XIV. Frederick Williams and Frederick Mosteller. “Education and economic status as determinants of opinion.” pp. 195–208.

8

Bibliography

• Appendix II. Frederick Mosteller. “Correcting for interviewer bias.” pp. 286–288. • Appendix III. Frederick Mosteller. “Sampling and breakdowns: technical notes.” pp. 288–296. • Appendix IV. Frederick Mosteller. “Charts indicating confidence limits and critical differences between percentages.” pp. 297–301. B2 H.A. Freeman, Milton Friedman, Frederick Mosteller, and W. Allen Wallis, editors. Sampling Inspection: Principles, Procedures, and Tables for Single, Double, and Sequential Sampling in Acceptance Inspection and Quality Control Based on Percent Defective. New York: McGraw-Hill, 1948. • Chapter 6. Frederick Mosteller and David H. Schwartz. “Use of sampling inspection for quality control.” pp. 55–68. • Chapter 13. Frederick Mosteller and David H. Schwartz. “Application of the standard procedure to control sampling.” pp. 135–136. B3 Frederick Mosteller, Herbert Hyman, Philip J. McCarthy, Eli S. Marks, and David B. Truman, with the collaboration of Leonard W. Doob, Duncan MacRae, Jr., Frederick F. Stephan, Samuel A. Stouffer, and S.S. Wilks. The Pre-election Polls of 1948: Report to the Committee on Analysis of Pre-election Polls and Forecasts. New York: Social Science Research Council, Bulletin 60, 1949. • Chapter V. Prepared by Frederick Mosteller. “Measuring the error.” pp. 54–80. B4 William G. Cochran, Frederick Mosteller, and John W. Tukey, with the assistance of W.O. Jenkins. Statistical Problems of the Kinsey Report on Sexual Behavior in the Human Male: A Report of the American Statistical Association Committee to Advise the National Research Council Committee for Research in Problems of Sex. Washington, D.C.: The American Statistical Association, 1954. Two parts of this book were published earlier—papers P31 and P32. B5 Robert R. Bush and Frederick Mosteller. Stochastic Models for Learning. New York: Wiley, 1955. B6 A group of the Commission on Mathematics of the College Entrance Examination Board. 
Introductory Probability and Statistical Inference for Secondary Schools: An Experimental Course. Preliminary edition. New York: College Entrance Examination Board, 1957. Members of the group: Edwin C. Douglas, Frederick Mosteller, Richard S. Pieters, Donald E. Richmond, Robert E.K. Rourke, George B. Thomas, Jr., and Samuel S. Wilks. These names were inadvertently omitted from this edition.


A revised edition of this volume was published in 1959. B7 A group of the Commission on Mathematics of the College Entrance Examination Board. Teachers’ Notes and Answer Guide: Supplementary material for the revised preliminary edition of “Introductory Probability and Statistical Inference.” New York: College Entrance Examination Board, 1959. Members of the group: Frederick Mosteller, Richard S. Pieters, Robert E.K. Rourke, George B. Thomas, Jr., and Samuel S. Wilks. B7 and the revised edition of B6 were translated by Professor Marta C. Valincq into Spanish, and published by Comision de Educacion Estadistica del Instituto Interamericano de Estadistica Rosario (Rep. Argentina), 1961. B8 Commission on Mathematics. Program for College Preparatory Mathematics. Report of the Commission. New York: College Entrance Examination Board, 1959. Members of the Commission on Mathematics: Albert W. Tucker, Chairman, Carl B. Allendoerfer, Edwin C. Douglas, Howard F. Fehr, Martha Hildebrandt, Albert E. Meder, Jr., Morris Meister, Frederick Mosteller, Eugene P. Northrop, Ernest R. Ranucci, Robert E.K. Rourke, George B. Thomas, Jr., Henry Van Engen, and Samuel S. Wilks. B9 Frederick Mosteller, Robert E.K. Rourke, and George B. Thomas, Jr. Probability with Statistical Applications. Reading, MA: Addison-Wesley, 1961. Second printing, 1965. B10 Frederick Mosteller, Robert E.K. Rourke, and George B. Thomas, Jr. Probability with Statistical Applications, Second Edition. Reading, MA: Addison-Wesley, 1970. Extensive revision of B9. World Student Series Edition, Third Printing, 1973, of B10. Published for the Open University in England. B11 Frederick Mosteller, Robert E.K. Rourke, and George B. Thomas, Jr. Probability and Statistics. Reading, MA: Addison-Wesley, 1961. Official textbook for Continental Classroom. Derived from B9 by abridgment and slight rewriting. Also translated into Turkish. B12 Frederick Mosteller, Robert E.K. Rourke, and George B. Thomas, Jr. Probability: A First Course. 
Reading, MA: Addison-Wesley, 1961. Derived from B9 by abridgment and slight rewriting. Also translated into Russian. And a Bulgarian translation into Russian. Sofia, Bulgaria: J. Stoyanov, 1975.


B13 Frederick Mosteller, Robert E.K. Rourke, and George B. Thomas, Jr. Probability: A First Course, Second Edition. Reading, MA: Addison-Wesley, 1970. B14 Frederick Mosteller, Robert E.K. Rourke, and George B. Thomas, Jr. Teacher’s Manual for Probability: A First Course. Reading, MA: Addison-Wesley, 1961. B15 Frederick Mosteller, Robert E.K. Rourke, and George B. Thomas, Jr. Instructor’s Manual to Accompany Probability with Statistical Applications, Second Edition, and Probability: A First Course, Second Edition. Reading, MA: Addison-Wesley, 1970. B16 Frederick Mosteller, Keewhan Choi, and Joseph Sedransk. A Catalogue Survey of College Mathematics Courses. Mathematical Association of America, Committee on the Undergraduate Program in Mathematics, Report Number 4, December 1961. B17 Frederick Mosteller and David L. Wallace. Inference and Disputed Authorship: The Federalist. Reading, MA: Addison-Wesley, 1964. B18 Frederick Mosteller and David L. Wallace. Applied Bayesian and Classical Inference: The Case of the Federalist Papers. New York: Springer-Verlag, 1984. Second Edition of B17. A new chapter dealing with authorship work published from about 1969 to 1983 was added in the second edition. A new, lengthy Analytic Table of Contents replaced the original Table of Contents. B19 Panel on Mathematics for the Biological, Management, and Social Sciences. Committee on the Undergraduate Program in Mathematics, Mathematical Association of America. Tentative Recommendations for the Undergraduate Mathematics Program of Students in the Biological, Management and Social Sciences. Berkeley, CA: CUPM Central Office, 1964. Panel members: John G. Kemeny, Chairman, Joseph Berger, Robert R. Bush, David Gale, Samuel Goldberg, Harold Kuhn, Frederick Mosteller, Theodor D. Sterling, Gerald L. Thompson, Robert M. Thrall, A.W. Tucker, and Geoffrey S. Watson. B20 Frederick Mosteller. Fifty Challenging Problems in Probability with Solutions. Reading, MA: Addison-Wesley, 1965. 
Also translated into Russian, 1975. A second edition of the Russian translation was published in 1985. For this edition, Frederick Mosteller provided a new problem, “Distribution of prime divisors.”


B21 Frederick Mosteller. Fifty Challenging Problems in Probability with Solutions. New York: Dover, 1987. Reissue of B20. B22 Panel on Undergraduate Education in Mathematics of the Committee on Support of Research in the Mathematical Sciences of the National Research Council. The Mathematical Sciences: Undergraduate Education. Washington, D.C.: National Academy of Sciences, 1968. Panel members: John G. Kemeny, Grace E. Bates, D.E. Christie, Llayron Clarkson, George Handelman, Frederick Mosteller, Henry Pollak, Hartley Rogers, John Toll, Robert Wisner, and Truman A. Botts. B23 John P. Bunker, William H. Forrest, Jr., Frederick Mosteller, and Leroy D. Vandam, editors. The National Halothane Study: A Study of the Possible Association between Halothane Anesthesia and Postoperative Hepatic Necrosis. Report of the Subcommittee on the National Halothane Study, of the Committee on Anesthesia, Division of Medical Sciences, National Academy of Sciences–National Research Council. National Institutes of Health, National Institute of General Medical Sciences. Washington, D.C.: U.S. Government Printing Office, 1969. • Part IV, Chapter 1. Byron W. Brown, Jr., Frederick Mosteller, Lincoln E. Moses, and W. Morven Gentleman. “Introduction to the study of death rates.” pp. 183–187. • Part IV, Appendix 3 to Chapter 2. Frederick Mosteller. “Estimation of death rates.” pp. 234–235. • Part IV, Chapter 3. Yvonne M.M. Bishop and Frederick Mosteller. “Smoothed contingency-table analysis.” pp. 237–272. • Part IV, Chapter 8. Lincoln E. Moses and Frederick Mosteller. “Afterword for the study of death rates.” pp. 395–408. B24 The President’s Commission on Federal Statistics. Federal Statistics: Report of the President’s Commission, Vol. I. Washington, D.C.: U.S. Government Printing Office, 1971. Members of the Commission: W. Allen Wallis, Chairman, Frederick Mosteller, Vice-Chairman, Ansley J. Coale, Paul M. Densen, Solomon Fabricant, W. Braddock Hickman, William Kruskal, Robert D. 
Fisher, Stanley Lebergott, Richard M. Scammon, William H. Shaw, James A. Suffridge, John W. Tukey, and Frank D. Stella. The Staff: Daniel B. Rathbun, Executive Director, Paul Feldman, Deputy Executive Director, and Norman V. Breckner, Assistant Director. B25 The President’s Commission on Federal Statistics. Federal Statistics: Report of the President’s Commission, Vol. II. Washington, D.C.: U.S. Government Printing Office, 1971.


• Chapter 6. Richard J. Light, Frederick Mosteller, and Herbert S. Winokur, Jr. “Using controlled field studies to improve public policy.” pp. 367–398. B26–B31 Judith M. Tanur and members (Frederick Mosteller, Chairman, William H. Kruskal, Richard F. Link, Richard S. Pieters, and Gerald R. Rising) of the Joint Committee on the Curriculum in Statistics and Probability of the American Statistical Association and the National Council of Teachers of Mathematics, editors. These comprise essays by many authors, and the same essay may appear in more than one of the books. B26 Statistics: A Guide to the Unknown. San Francisco: Holden-Day, 1972. B27 Second edition of B26. San Francisco: Holden-Day, 1978. With Erich L. Lehmann, Special Editor. Re-issued: Monterey, CA: Wadsworth & Brooks/Cole, 1985. B28 Third edition of B26. Judith M. Tanur, Frederick Mosteller, William H. Kruskal, Erich L. Lehmann, Richard F. Link, Richard S. Pieters, and Gerald R. Rising, editors. Pacific Grove, CA: Wadsworth & Brooks/Cole, 1989. Translated into Spanish: La Estadística: Una guía de lo desconocido. Madrid: Alianza Editorial, 1992. B29 Statistics: A Guide to Business and Economics. San Francisco: Holden-Day, 1976. With E.L. Lehmann, Special Editor. B30 Statistics: A Guide to the Biological and Health Sciences. San Francisco: Holden-Day, 1977. With E.L. Lehmann, Special Editor. B31 Statistics: A Guide to Political and Social Issues. San Francisco: Holden-Day, 1977. With E.L. Lehmann, Special Editor. • Frederick Mosteller. “Foreword.” pp. viii–x. Also in second edition, pp. ix–xi; third edition, pp. ix–x; and in Statistics: A Guide to Business and Economics, pp. viii–x. • Lincoln E. Moses and Frederick Mosteller. “Safety of anesthetics.” pp. 14–22. Also in second edition, pp. 16–25; third edition, pp. 15–24; and in Statistics: A Guide to the Biological and Health Sciences, pp. 101–110. • Frederick Mosteller and David L. Wallace. “Deciding authorship.” pp. 164–175. Also in second edition, pp. 207–219; third edition, pp. 115–125; and in Statistics: A Guide to Political and Social Issues, pp. 78–90.


• John P. Gilbert, Bucknam McPeek, and Frederick Mosteller. “How frequently do innovations succeed in surgery and anesthesia?” In second edition, pp. 45–58. Also in Statistics: A Guide to the Biological and Health Sciences, pp. 51–64. • John P. Gilbert, Richard J. Light, and Frederick Mosteller. “How well do social innovations work?” In second edition, pp. 125–138. Also in Statistics: A Guide to Political and Social Issues, pp. 47–60. B32 Frederick Mosteller and Daniel P. Moynihan, editors. On Equality of Educational Opportunity: Papers deriving from the Harvard University Faculty Seminar on the Coleman Report. New York: Random House, 1972. • Chapter 1. Frederick Mosteller and Daniel P. Moynihan. “A pathbreaking report.” pp. 3–66. • Chapter 8. John P. Gilbert and Frederick Mosteller. “The urgent need for experimentation.” pp. 371–383. B33 Frederick Mosteller and Robert E.K. Rourke. Sturdy Statistics: Nonparametrics and Order Statistics. Reading, MA: Addison-Wesley, 1973. B34 Frederick Mosteller and Robert E.K. Rourke. Solutions Manual for Sturdy Statistics: Nonparametrics and Order Statistics. Reading, MA: Addison-Wesley, 1973. B35–B38 Frederick Mosteller, William H. Kruskal, Richard F. Link, Richard S. Pieters, and Gerald R. Rising, editors. The Joint Committee on the Curriculum in Statistics and Probability of the American Statistical Association and the National Council of Teachers of Mathematics. Reading, MA: Addison-Wesley, 1973. Parts of these books were translated into Japanese, 1979. B35 Statistics by Example: Exploring Data. (with the assistance of Martha Zelinka) • Set 2. Frederick Mosteller. “Fractions on closing stock market prices.” pp. 9–13. • Set 7. Frederick Mosteller. “Collegiate football scores.” pp. 61–74. • Set 8. Frederick Mosteller. “Ratings of typewriters.” pp. 75–78. B36 Statistics by Example: Weighing Chances. (with the assistance of Roger Carlson and Martha Zelinka) • Set 7. Frederick Mosteller. “Stock market fractions.” pp. 67–69. 
• Set 10. Frederick Mosteller. “Ratings of typewriters.” pp. 81–94. • Set 11. Frederick Mosteller. “Collegiate football scores.” pp. 95–111. • Set 12. Frederick Mosteller. “Periodicities and moving averages.” pp. 113–119.


B37 Statistics by Example: Detecting Patterns. (with the assistance of Roger Carlson and Martha Zelinka) • Set 9. Frederick Mosteller. “Transformations for linearity.” pp. 99–107. B38 Statistics by Example: Finding Models. (with the assistance of Roger Carlson and Martha Zelinka) • Set 4. Frederick Mosteller. “Escape-avoidance experiment.” pp. 35–39. B39 Panel on Weather and Climate Modification, Committee on Atmospheric Sciences, National Research Council. Weather & Climate Modification: Problems and Progress. Washington, D.C.: National Academy of Sciences, 1973. Members of the panel: Thomas F. Malone, Chairman, Louis J. Battan, Julian H. Bigelow, Peter V. Hobbs, James E. McDonald, Frederick Mosteller, Helmut K. Weickmann, and E.J. Workman. B40 Yvonne M.M. Bishop, Stephen E. Fienberg, and Paul W. Holland, with the collaboration of Richard J. Light and Frederick Mosteller. Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press, 1975. Paperback edition, 1977. B41 John P. Bunker, Benjamin A. Barnes, and Frederick Mosteller, editors. Costs, Risks, and Benefits of Surgery. New York: Oxford University Press, 1977. • Chapter 9. John P. Gilbert, Bucknam McPeek, and Frederick Mosteller. “Progress in surgery and anesthesia: Benefits and risks of innovative therapy.” pp. 124–169. • Chapter 10. Bucknam McPeek, John P. Gilbert, and Frederick Mosteller. “The end result: Quality of life.” pp. 170–175. • Chapter 23. John P. Bunker, Benjamin A. Barnes, Frederick Mosteller, John P. Gilbert, Bucknam McPeek, and Richard Jay Zeckhauser. “Summary, conclusions, and recommendations.” pp. 387–394. B42 William B. Fairley and Frederick Mosteller, editors. Statistics and Public Policy. Reading, MA: Addison-Wesley, 1977. • Frederick Mosteller. “Assessing unknown numbers: Order of magnitude estimation.” pp. 163–184. • John P. Gilbert, Richard J. Light, and Frederick Mosteller. “Assessing social innovations: An empirical base for policy.” pp. 185–241. • William B. 
Fairley and Frederick Mosteller. “A conversation about Collins.” pp. 369–379. B43 Frederick Mosteller and John W. Tukey. Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison-Wesley, 1977. Also translated into Russian, 2 volumes. Moscow: Statistika Publishers, 1983.


B44 Committee for a Planning Study for an Ongoing Study of Costs of Environment-Related Health Effects, Institute of Medicine. Costs of Environment-Related Health Effects: A Plan for Continuing Study. Washington, D.C.: National Academy Press, 1981. Members of the Committee: Kenneth J. Arrow, Chairman, Theodore Cooper, Ralph C. d’Arge, Philip J. Landrigan, Alexander Leaf, Joshua Lederberg, Paul A. Marks, Frederick Mosteller, Evelyn F. Murphy, Robert F. Murray, Don K. Price, Frederick C. Robbins, Anne A. Scitovsky, Irving J. Selikoff, Herman A. Tyroler, Arthur C. Upton, and Richard Zeckhauser. B45 David C. Hoaglin, Richard J. Light, Bucknam McPeek, Frederick Mosteller, and Michael A. Stoto. Data for Decisions: Information Strategies for Policymakers. Cambridge, MA: Abt Books, 1982. Paperback: Lanham, MD: University Press of America, 1984. B46 The National Science Board Commission on Precollege Education in Mathematics, Science, and Technology. Educating Americans for the 21st Century: A Plan of Action for Improving Mathematics, Science and Technology Education for All American Elementary and Secondary Students so that Their Achievement is the Best in the World by 1995. Washington, D.C.: National Science Foundation, 1983. Members of the Commission: William T. Coleman, Jr., Co-Chair, Cecily Cannan Selby, Co-Chair, Lew Allen, Jr., Victoria Bergin, George Burnet, Jr., William H. Cosby, Jr., Daniel J. Evans, Patricia Albjerg Graham, Robert E. Larson, Gerald D. Laubach, Katherine P. Layton, Ruth B. Love, Arturo Madrid II, Frederick Mosteller, M. Joan Parent, Robert W. Parry, Benjamin F. Payton, Joseph E. Rowe, Herbert A. Simon, and John B. Slaughter. • A Report to the American People and the National Science Board. • Source Materials. B47 David C. Hoaglin, Frederick Mosteller, and John W. Tukey, editors. Understanding Robust and Exploratory Data Analysis. New York: Wiley, 1983. Translated into Portuguese: Analise Exploratoria de Dados. Tecnicas Robustas. 
Lisbon: Novas Tecnologias, 1992. Also translated into Chinese (by Zhonglian Chen and Deyuan Guo), Beijing: China Statistics Publishing House, 1998. Wiley Classics Library Edition, 2000. • Chapter 9. David C. Hoaglin, Frederick Mosteller, and John W. Tukey. “Introduction to more refined estimators.” pp. 283–296. B48 Frederick Mosteller, Stephen E. Fienberg, and Robert E.K. Rourke. Beginning Statistics with Data Analysis. Reading, MA: Addison-Wesley, 1983.


B49 Joseph A. Ingelfinger, Frederick Mosteller, Lawrence A. Thibodeau, and James H. Ware. Biostatistics in Clinical Medicine. New York: Macmillan, 1983. Italian edition: Biostatistica in Medicina, translated by Ettore Marubini. Milano: Raffaello Cortina Editore, 1986. B50 Second edition of B49, 1987. The second edition has two new chapters—one on life tables, the other on multiple regression. B51 Lincoln E. Moses and Frederick Mosteller, editors. Planning and Analysis of Observational Studies, by William G. Cochran. New York: Wiley, 1983. At the time of his death, William G. Cochran left an almost completed manuscript on observational studies. Lincoln E. Moses and Frederick Mosteller edited and organized the manuscript, and Planning and Analysis of Observational Studies is the result. B52 Oversight Committee on Radioepidemiologic Tables, Board on Radiation Effects Research, Commission on Life Sciences, National Research Council. Washington, D.C.: National Academy Press, 1984. Members of the Committee: Frederick Mosteller, Chairman, Jacob I. Fabrikant, R.J. Michael Fry, Stephen W. Lagakos, Anthony B. Miller, Eugene L. Saenger, David Schottenfeld, Elizabeth L. Scott, John R. Van Ryzin, and Edward W. Webster; Stephen L. Brown, Staff Officer, Norman Grossblatt, Editor. • Assigned Share for Radiation as a Cause of Cancer: Review of Assumptions and Methods for Radioepidemiologic Tables. Interim Report. • Assigned Share for Radiation as a Cause of Cancer: Review of Radioepidemiologic Tables Assigning Probabilities of Causation. Final Report. B53 David C. Hoaglin, Frederick Mosteller, and John W. Tukey, editors. Exploring Data Tables, Trends, and Shapes. New York: Wiley, 1985. • Chapter 5. Frederick Mosteller and Anita Parunak. “Identifying extreme cells in a sizable contingency table: probabilistic and exploratory approaches.” pp. 189–224. • Chapter 6. Frederick Mosteller, Andrew F. Siegel, Edward Trapido, and Cleo Youtz. “Fitting straight lines by eye.” pp. 225–239. 
B54 Committee for Evaluating Medical Technologies in Clinical Use, Division of Health Promotion and Disease Prevention, Institute of Medicine. Assessing Medical Technologies. Washington, D.C.: National Academy Press, 1985.


Members of the Committee: Frederick Mosteller, Chairman, H. David Banta, Stuart Bondurant, Morris F. Collen, Joanne E. Finley, Barbara J. McNeil, Lawrence C. Morris, Jr., Lincoln E. Moses, Seymour Perry, Dorothy P. Rice, Herman A. Tyroler, and Donald A. Young. Study Staff: Enriqueta C. Bond, Barbara Filner, Caren Carney, Clifford Goodman, Linda DePugh, Naomi Hudson, and Wallace K. Waterfall. B55 John C. Bailar III and Frederick Mosteller, editors. Medical Uses of Statistics. Waltham, MA: NEJM Books, 1986. • Introduction. John C. Bailar III and Frederick Mosteller. pp. xxi–xxvi. • Chapter 8. James H. Ware, Frederick Mosteller, and Joseph A. Ingelfinger. “P values.” pp. 149–169. • Chapter 13. Rebecca DerSimonian, L. Joseph Charette, Bucknam McPeek, and Frederick Mosteller. “Reporting on methods in clinical trials.” pp. 272–288. • Chapter 15. Frederick Mosteller. “Writing about numbers.” pp. 305–321. Italian edition: L’Uso della Statistica in Medicina. Roma: Il Pensiero Scientifico Editore, 1988. Translated by Giovanni Apolone, Antonio Nicolucci, Raldano Fossati, Fabio Parazzini, Walter Torri, Roberto Grilli. B56 John C. Bailar III and Frederick Mosteller, editors. Medical Uses of Statistics, second edition. Boston: NEJM Books, 1992. • John C. Bailar III and Frederick Mosteller. “Introduction.” pp. xxiii–xxvii. • Chapter 10. James H. Ware, Frederick Mosteller, Fernando Delgado, Christl Donnelly, and Joseph A. Ingelfinger. “P values.” pp. 181–200. • Chapter 16. John C. Bailar III and Frederick Mosteller. “Guidelines for statistical reporting in articles for medical journals: Amplifications and explanations.” pp. 313–331. • Chapter 17. Rebecca DerSimonian, L. Joseph Charette, Bucknam McPeek, and Frederick Mosteller. “Reporting on methods in clinical trials.” pp. 333–347. • Chapter 20. Frederick Mosteller. “Writing about numbers.” pp. 375–389. • Chapter 21. John C. Bailar III and Frederick Mosteller. “Medical technology assessment.” pp. 393–411. • Chapter 22. 
Katherine Taylor Halvorsen, Elisabeth Burdick, Graham A. Colditz, Howard S. Frazier, and Frederick Mosteller. “Combining results from independent investigations: Meta-analysis in clinical research.” pp. 413–426.


B57 Frederick Mosteller and Jennifer Falotico-Taylor, editors. Quality of Life and Technology Assessment, monograph of the Council on Health Care Technology, Institute of Medicine. Washington, D.C.: National Academy Press, 1989. • Jennifer Falotico-Taylor, Mark McClellan, and Frederick Mosteller. “The use of quality-of-life measures in technology assessment” (including “Twelve applications of quality-of-life measures to technology assessment”). pp. 7–44. • Jennifer Falotico-Taylor and Frederick Mosteller. “Applications of quality-of-life measures and areas for cooperative research.” pp. 116–118. B58 David C. Hoaglin, Frederick Mosteller, and John W. Tukey, editors. Fundamentals of Exploratory Analysis of Variance. New York: Wiley, 1991. • Chapter 1. John W. Tukey, Frederick Mosteller, and David C. Hoaglin. “Concepts and examples in analysis of variance.” pp. 1–23. • Chapter 2. Frederick Mosteller and John W. Tukey. “Purposes of analyzing data that come in a form inviting us to apply tools from the analysis of variance.” pp. 24–39. • Chapter 3. Frederick Mosteller and David C. Hoaglin. “Preliminary examination of data.” pp. 40–49. • Chapter 7. Frederick Mosteller, Anita Parunak, and John W. Tukey. “Mean squares, F tests, and estimates of variance.” pp. 146–164. • Chapter 9. Constance Brown and Frederick Mosteller. “Components of variance.” pp. 193–251. • Chapter 10. Thomas Blackwell, Constance Brown, and Frederick Mosteller. “Which denominator?” pp. 252–294. • Chapter 11. John W. Tukey, Frederick Mosteller, and Cleo Youtz. “Assessing changes.” pp. 295–335. B59 Committee to Review the Adverse Consequences of Pertussis and Rubella Vaccines, Division of Health Promotion and Disease Prevention, Institute of Medicine, Christopher P. Howson, Cynthia J. Howe, and Harvey V. Fineberg, editors. Adverse Effects of Pertussis and Rubella Vaccines. Washington, D.C.: National Academy Press, 1991. B60 Thomas D. Cook, Harris Cooper, David S. Cordray, Heidi Hartmann, Larry V. 
Hedges, Richard J. Light, Thomas A. Louis, and Frederick Mosteller. Meta-Analysis for Explanation: A Casebook. New York: Russell Sage Foundation, 1992. • Richard J. Light and Frederick Mosteller. “Annotated bibliography of meta-analytic books and journal issues.” pp. xi–xiv. B61 Kenneth S. Warren and Frederick Mosteller, editors. Doing More Good Than Harm: The Evaluation of Health Care Interventions. New York:


Annals of the New York Academy of Sciences, 703, New York Academy of Sciences, 1993. • Frederick Mosteller. “Some evaluation needs.” pp. 12–17. B62 Howard S. Frazier and Frederick Mosteller, editors. Medicine Worth Paying For: Assessing Medical Innovations. Cambridge, MA: Harvard University Press, 1995. • Chapter 1. Howard S. Frazier and Frederick Mosteller. “The nature of the inquiry.” pp. 3–8. • Chapter 2. Frederick Mosteller and Howard S. Frazier. “Evaluating medical technologies.” pp. 9–35. • Chapter 11. Georgianna Marks, Frederick Mosteller, Marie-A. McPherson, and Grace Wyshak. “The contributions of lenses to visual health.” pp. 157–172. • Chapter 17. Frederick Mosteller and Howard S. Frazier. “Improving the health care system.” pp. 245–259. • Chapter 18. Frederick Mosteller and Howard S. Frazier. “Innovation-specific improvements.” pp. 260–274. • Chapter 19. Howard S. Frazier and Frederick Mosteller. “Recommendations for change.” pp. 275–280. B63 Committee for Guidance on Setting and Enforcing Speed Limits, Transportation Research Board, National Research Council. Managing Speed: Review of Current Practice for Setting and Enforcing Speed Limits, Special Report 254. Washington, D.C.: National Academy Press, 1998. B64 Committee on Equivalency and Linkage of Educational Tests, Board on Testing and Assessment, Commission on Behavioral and Social Sciences and Education, National Research Council, Michael J. Feuer, Paul W. Holland, Bert F. Green, Meryl W. Bertenthal, and F. Cadelle Hemphill, editors. Uncommon Measures: Equivalence and Linkage Among Educational Tests. Washington, D.C.: National Academy Press, 1999. B65 Frederick Mosteller and Robert Boruch, editors. Evidence Matters: Randomized Trials in Education Research. Washington, D.C.: Brookings Institution Press, 2002.

PAPERS P1 Frederick Mosteller. “Note on an application of runs to quality control charts.” Annals of Mathematical Statistics, 12 (1941), pp. 228–232. P2 Frederick Mosteller and Philip J. McCarthy. “Estimating population proportions.” Public Opinion Quarterly, 6 (1942), pp. 452–458.


P3 Louis H. Bean, Frederick Mosteller, and Frederick Williams. “Nationalities and 1944.” Public Opinion Quarterly, 8 (1944), pp. 368–375. P4 M.A. Girshick, Frederick Mosteller, and L.J. Savage. “Unbiased estimates for certain binomial sampling problems with applications.” Annals of Mathematical Statistics, 17 (1946), pp. 13–23. Reprinted in The Writings of Leonard Jimmie Savage—A Memorial Selection, edited by a committee. Washington, D.C.: The American Statistical Association and The Institute of Mathematical Statistics, 1981. pp. 96–106. P5 Frederick Mosteller. “On some useful ‘inefficient’ statistics.” Annals of Mathematical Statistics, 17 (1946), pp. 377–408. P6 Cecil Hastings, Jr., Frederick Mosteller, John W. Tukey, and Charles P. Winsor. “Low moments for small samples: A comparative study of order statistics.” Annals of Mathematical Statistics, 18 (1947), pp. 413–426. P7 Frederick Mosteller. “A k-sample slippage test for an extreme population.” Annals of Mathematical Statistics, 19 (1948), pp. 58–65. P8 Frederick Mosteller. “On pooling data.” Journal of the American Statistical Association, 43 (1948), pp. 231–242. P9 Frederick Mosteller and John W. Tukey. “The uses and usefulness of binomial probability paper.” Journal of the American Statistical Association, 44 (1949), pp. 174–212. P10 Hendrik Bode, Frederick Mosteller, John W. Tukey, and Charles Winsor. “The education of a scientific generalist.” Science, 109 (June 3, 1949), pp. 553–558. P11 Frederick Mosteller and John W. Tukey. “Practical applications of new theory, a review.” • “Part I: Location and scale: tables.” Industrial Quality Control, 6, No. 2 (1949), pp. 5–8. • “Part II: Counted data—graphical methods.” No. 3 (1949), pp. 5–7. • “Part III: Analytical techniques.” No. 4 (1950), pp. 5–8. • “Part IV: Gathering information.” No. 5 (1950), pp. 5–7. P12 Frederick Mosteller. Article on “Statistics” in Collier’s Encyclopedia, circa 1949, pp. 191–195. P13 Frederick Mosteller and John W. Tukey. 
“Significance levels for a k-sample slippage test.” Annals of Mathematical Statistics, 21 (1950), pp. 120–123.


P14 J.S. Bruner, L. Postman, and F. Mosteller. “A note on the measurement of reversals of perspective.” Psychometrika, 15 (1950), pp. 63–72. P15 Arthur S. Keats, Henry K. Beecher, and Frederick Mosteller. “Measurement of pathological pain in distinction to experimental pain.” Journal of Applied Physiology, 3 (1950), pp. 35–44. P16 Frederick Mosteller. “Remarks on the method of paired comparisons: I. The least squares solution assuming equal standard deviations and equal correlations.” Psychometrika, 16 (1951), pp. 3–9. Reprinted in Readings in Mathematical Psychology, I, edited by R. Duncan Luce, Robert R. Bush, and Eugene Galanter. New York: Wiley, 1963. pp. 152–158. P17 Frederick Mosteller. “Remarks on the method of paired comparisons: II. The effect of an aberrant standard deviation when equal standard deviations and equal correlations are assumed.” Psychometrika, 16 (1951), pp. 203–206. P18 Frederick Mosteller. “Remarks on the method of paired comparisons: III. A test of significance for paired comparisons when equal standard deviations and equal correlations are assumed.” Psychometrika, 16 (1951), pp. 207–218. P19 Robert E. Goodnow, Henry K. Beecher, Mary A.B. Brazier, Frederick Mosteller, and Renato Tagiuri. “Physiological performance following a hypnotic dose of a barbiturate.” Journal of Pharmacology and Experimental Therapeutics, 102 (1951), pp. 55–61. P20 Frederick Mosteller. “Theoretical backgrounds of the statistical methods: underlying probability model used in making a statistical inference.” Industrial and Engineering Chemistry, 43 (1951), pp. 1295–1297. P21 Frederick Mosteller. “Mathematical models for behavior theory: a brief report on an interuniversity summer research seminar.” Social Science Research Council Items, 5 (September 1951), pp. 32–33. P22 Frederick Mosteller and Philip Nogee. “An experimental measurement of utility.” The Journal of Political Economy, 59 (1951), pp. 371–404. 
Reprinted in the Bobbs-Merrill Reprint Series in the Social Sciences. P23 Robert R. Bush and Frederick Mosteller. “A mathematical model for simple learning.” Psychological Review, 58 (1951), pp. 313–323.


Reprinted in Readings in Mathematical Psychology, I, edited by R. Duncan Luce, Robert R. Bush, and Eugene Galanter. New York: Wiley, 1963. pp. 278–288. Reprinted in the Bobbs-Merrill Reprint Series in the Social Sciences. P24 Robert R. Bush and Frederick Mosteller. “A model for stimulus generalization and discrimination.” Psychological Review, 58 (1951), pp. 413–423. Reprinted in Readings in Mathematical Psychology, I, edited by R. Duncan Luce, Robert R. Bush, and Eugene Galanter. New York: Wiley, 1963. pp. 289–299. P25 Frederick Mosteller. “Clinical studies of analgesic drugs: II. Some statistical problems in measuring the subjective response to drugs.” Biometrics, 8 (1952), pp. 220–226. P26 Frederick Mosteller. “The World Series competition.” Journal of the American Statistical Association, 47 (1952), pp. 355–380. Translated into French by F.M.-Alfred, É.C. “Utilisation de quelques techniques statistiques au service des parieurs. Présentation et traduction de l’article ‘World Series Competition’ de Fréderic Mosteller.” Hermès, Bulletin de la Faculté de Commerce de l’Université Laval, 1953, #7, pp. 50–64; #8, pp. 29–37. P27 Frederick Mosteller. “Statistical theory and research design.” Annual Review of Psychology, 4 (1953), pp. 407–434. P28 Robert R. Bush and Frederick Mosteller. “A stochastic model with applications to learning.” Annals of Mathematical Statistics, 24 (1953), pp. 559–585. P29 Henry K. Beecher, Arthur S. Keats, Frederick Mosteller, and Louis Lasagna. “The effectiveness of oral analgesics (morphine, codeine, acetylsalicylic acid) and the problem of placebo ‘reactors’ and ‘non-reactors.’ ” Journal of Pharmacology and Experimental Therapeutics, 109 (1953), pp. 393–400. P30 Frederick Mosteller. “Comments on ‘Models for Learning Theory.’ ” Symposium on Psychology of Learning Basic to Military Training Problems, Panel on Training and Training Devices, Committee on Human Resources Research and Development Board, May 7–8, 1953, pp. 39–42. 
P31 William G. Cochran, Frederick Mosteller, and John W. Tukey. “Statistical problems of the Kinsey Report.” Journal of the American Statistical Association, 48 (1953), pp. 673–716.


P32 William G. Cochran, Frederick Mosteller, and John W. Tukey. “Principles of Sampling.” Journal of the American Statistical Association, 49 (1954), pp. 13–35. Translated into Spanish: “Fundamentos de muestreo.” Estadística, Journal of the Inter-American Statistical Institute, June 1956, Vol. XIV, No. 51, pp. 235–258. P33 Louis Lasagna, Frederick Mosteller, John M. von Felsinger, and Henry K. Beecher. “A study of the placebo response.” American Journal of Medicine, 16 (1954), pp. 770–779. P34 Frederick Mosteller and Robert R. Bush. “Selected quantitative techniques.” Chapter 8 in Handbook of Social Psychology, edited by Gardner Lindzey. Cambridge, MA: Addison-Wesley, 1954. pp. 289–334. Reprinted in paperback, circa 1969, together with Chapter 9, “Attitude measurement” by Bert F. Green. It was decided to publish these two chapters as a separate book because they were not reprinted in The Handbook of Social Psychology, Second Edition, Vol. I, 1968, Vols. II–V, 1969, edited by Gardner Lindzey and Elliot Aronson. P35 Robert R. Bush, Frederick Mosteller, and Gerald L. Thompson. “A formal structure for multiple-choice situations.” Chapter VIII in Decision Processes, edited by R.M. Thrall, C.H. Coombs, and R.L. Davis. New York: Wiley, 1954. pp. 99–126. P36 Frederick Mosteller. Introduction, IV, “Applications.” In Tables of Cumulative Binomial Probability Distribution, by the Staff of Harvard Computation Laboratory. The Annals of the Computation Laboratory of Harvard University, XXXV. Cambridge, MA: Harvard University Press, 1955. pp. xxxiv–lxi. P37 Frederick Mosteller. “Stochastic learning models.” In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume V: Econometrics, Industrial Research, and Psychometry, edited by Jerzy Neyman. Berkeley: University of California Press, 1956. pp. 151–167. P38 Frederick Mosteller. 
“Statistical problems and their solution.” Chapter VI in “The measurement of pain, prototype for the quantitative study of subjective responses,” by Henry K. Beecher. Pharmacological Reviews, 9 (1957), pp. 103–114. P39 Frederick Mosteller. “Stochastic models for the learning process.” Proceedings of the American Philosophical Society, 102 (1958), pp. 53–59.

P40 Frederick Mosteller and D.E. Richmond. “Factorial 1/2: A simple graphical treatment.” American Mathematical Monthly, 65 (1958), pp. 735–742.
P41 Frederick Mosteller. “The mystery of the missing corpus.” Psychometrika, 23 (1958), pp. 279–289.
P42 Frederick Mosteller. “Statistical problems and their solution.” Chapter 4 in Measurement of Subjective Responses, Quantitative Effects of Drugs by Henry K. Beecher. New York: Oxford University Press, 1959. pp. 73–91.
P43 In Studies in Mathematical Learning Theory, edited by Robert R. Bush and William K. Estes. Stanford Mathematical Studies in the Social Sciences, III. Stanford: Stanford University Press, 1959.
• Chapter 12. Maurice Tatsuoka and Frederick Mosteller. “A commuting-operator model.” pp. 228–247.
• Chapter 15. Robert R. Bush and Frederick Mosteller. “A comparison of eight models.” pp. 293–307.
P44 Frederick Mosteller and Maurice Tatsuoka. “Ultimate choice between two attractive goals: Predictions from a model.” Psychometrika, 25 (1960), pp. 1–17. Reprinted in Readings in Mathematical Psychology, I, edited by R. Duncan Luce, Robert R. Bush, and Eugene Galanter. New York: Wiley, 1963. pp. 498–514.
P45 Richard Cohn, Frederick Mosteller, John W. Pratt, and Maurice Tatsuoka. “Maximizing the probability that adjacent order statistics of samples from several populations form overlapping intervals.” Annals of Mathematical Statistics, 31 (1960), pp. 1095–1104.
P46 Frederick Mosteller. “Optimal length of play for a binomial game.” The Mathematics Teacher, 54 (1961), pp. 411–412.
P47 Frederick Mosteller and Cleo Youtz. “Tables of the Freeman-Tukey transformations for the binomial and Poisson distributions.” Biometrika, 48 (1961), pp. 433–440.
P48 Gordon W. Allport, Paul H. Buck, Frederick Mosteller, and Talcott Parsons, Chairman. “Samuel Andrew Stouffer,” Memorial Minute adopted by the Faculty of Arts and Sciences, Harvard University. Harvard University Gazette, April 29, 1961, pp. 197–198.
P49 Frederick Mosteller. “Understanding the birthday problem.” The Mathematics Teacher, 55 (1962), pp. 322–325.

P50 Frederick Mosteller and David L. Wallace. “Notes on an authorship problem.” In Proceedings of a Harvard Symposium on Digital Computers and Their Applications, 3–6 April 1961. Cambridge, MA: Harvard University Press, 1962. pp. 163–197.
P51 Reports on Continental Classroom’s TV course in probability and statistics.
• Frederick Mosteller. “Continental Classroom’s TV course in probability and statistics.” The American Statistician, 16, No. 5 (December 1962), pp. 20–25.
• Frederick Mosteller. “The U.S. Continental Classroom’s TV course in probability and statistics.” Quality, 7, No. 2 (1963), pp. 36–39. Abbreviated version.
• Frederick Mosteller. “Continental Classroom’s TV course in probability and statistics.” The Mathematics Teacher, 56 (1963), pp. 407–413.
• Frederick Mosteller. “Continental Classroom’s television course in probability and statistics.” Review of the International Statistical Institute, 31 (1963), pp. 153–162.
P52 Frederick Mosteller and David L. Wallace. “Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist papers.” Journal of the American Statistical Association, 58 (1963), pp. 275–309.
P53 Frederick Mosteller. “Samuel S. Wilks: Statesman of Statistics.” The American Statistician, 18, No. 2 (April 1964), pp. 11–17.
P54 Frederick Mosteller. “Contributions of secondary school mathematics to social science.” In Proceedings of the UICSM Conference on the Role of Applications in a Secondary School Mathematics Curriculum, edited by Dorothy Friedman. Urbana, Illinois: University of Illinois Committee on School Mathematics, 1964. pp. 85–111.
P55 Frederick Mosteller. “John Davis Williams, 1909–1964, In Memoriam.” The Memorial Service, Santa Monica Civic Auditorium, Santa Monica, CA, December 6, 1964, pp. 3–13.
P56 Arthur P. Dempster and Frederick Mosteller. “A model for the weighting of scores.” Appendix C in Staff Leadership in Public Schools: A Sociological Inquiry, edited by Neal Gross and Robert E. Herriott. New York: Wiley, 1965. pp. 202–216.
P57 Frederick Mosteller. “His writings in applied statistics,” pp. 944–953 in “Samuel S. Wilks,” by Frederick F. Stephan, John W. Tukey, Frederick Mosteller, Alex M. Mood, Morris H. Hansen, Leslie E. Simon, and W.J. Dixon. Journal of the American Statistical Association, 60 (1965), pp. 938–966.
P58 John P. Gilbert and Frederick Mosteller. “Recognizing the maximum of a sequence.” Journal of the American Statistical Association, 61 (1966), pp. 35–73.
P59 Gene Smith, Lawrence D. Egbert, Robert A. Markowitz, Frederick Mosteller, and Henry K. Beecher. “An experimental pain method sensitive to morphine in man: The submaximum effort tourniquet technique.” Journal of Pharmacology and Experimental Therapeutics, 154 (1966), pp. 324–332.
P60 Subcommittee on the National Halothane Study of the Committee on Anesthesia, National Academy of Sciences—National Research Council. “Summary of the National Halothane Study: Possible association between halothane anesthesia and postoperative hepatic necrosis.” Journal of the American Medical Association, 197 (1966), pp. 775–788.
P61 Frederick Mosteller. Contribution to “Tribute to Walter Shewhart.” Industrial Quality Control, 24, No. 2 (1967), p. 117.
P62 Conrad Taeuber, Frederick Mosteller, and Paul Webbink. “Social Science Research Council Committee on Statistical Training.” The American Statistician, 21, No. 5 (December 1967), pp. 10–11. Reprinted from Social Science Research Council Items, 21, December 1967, pp. 49–51.
P63 Frederick Mosteller. “What has happened to probability in the high school?” The Mathematics Teacher, 60 (1967), pp. 824–831.
P64 Frederick Mosteller, Cleo Youtz, and Douglas Zahn. “The distribution of sums of rounded percentages.” Demography, 4 (1967), pp. 850–858.
P65 Frederick Mosteller. “Statistical comparisons of anesthetics: The National Halothane Study.” Bulletin of the International Statistical Institute, Proceedings of the 36th Session, Sydney, 1967, XLII, Book 1, pp. 428–438.
P66 Lincoln E. Moses and Frederick Mosteller. “Institutional differences in postoperative death rates: Commentary on some of the findings of the National Halothane Study.” Journal of the American Medical Association, 203 (1968), pp. 492–494.
P67 Frederick Mosteller. “Association and estimation in contingency tables.” Journal of the American Statistical Association, 63 (1968), pp. 1–28.

P68 Frederick Mosteller. “Errors: Nonsampling errors.” In International Encyclopedia of the Social Sciences, Vol. 5, edited by David L. Sills. New York: Macmillan and Free Press, 1968, pp. 113–132. A slightly expanded version appeared in International Encyclopedia of Statistics, edited by William H. Kruskal and Judith M. Tanur. New York: The Free Press, 1978, pp. 208–229.
P69 Frederick Mosteller. “S.S. Wilks.” In International Encyclopedia of the Social Sciences, Vol. 16, edited by David L. Sills. New York: Macmillan and Free Press, 1968. pp. 550–553.
P70 Frederick Mosteller and John W. Tukey. “Data analysis, including statistics.” Chapter 10 in Vol. 2 of The Handbook of Social Psychology, Second edition, edited by Gardner Lindzey and Elliot Aronson. Reading, MA: Addison-Wesley, 1968. pp. 80–203.
P71 Barry H. Margolin and Frederick Mosteller. “The expected coverage to the left of the ith order statistic for arbitrary distributions.” Annals of Mathematical Statistics, 40 (1969), pp. 644–647.
P72 Frederick Mosteller. “Progress report of the Joint Committee of the American Statistical Association and the National Council of Teachers of Mathematics.” The Mathematics Teacher, 63 (1970), pp. 199–208.
P73 Frederick Mosteller. “Progress report of the Joint Committee of the American Statistical Association and the National Council of Teachers of Mathematics.” The American Statistician, 24, No. 3 (June 1970), pp. 8–12.
P74 Frederick Mosteller. Discussion of “Statistical aspects of rain stimulation—problems and prospects,” by Jeanne L. Lovasich, Jerzy Neyman, Elizabeth L. Scott, and Jerome A. Smith. Review of the International Statistical Institute, 38 (1970), pp. 169–170.
P75 Frederick Mosteller. “Collegiate football scores, U.S.A.” Journal of the American Statistical Association, 65 (1970), pp. 35–48. An abridged version appeared in Optimal Strategies in Sports, edited by Shaul P. Ladany and Robert E. Machol. Amsterdam: North-Holland/American Elsevier, 1977, pp. 97–105.
P76 Frederick Mosteller. “The mathematical sciences at work with the social sciences: Learning with irregular rewards.” In “Mathematical sciences and social sciences: Excerpts from the Report of a Panel of the Behavioral and Social Sciences Survey,” selected by William H. Kruskal. Social Science Research Council Items, 24, September 1970, pp. 25–28. Also in The American Statistician, 25, No. 1 (February 1971), pp. 27–30.

P77 Frederick Mosteller with the aid of Frank Restle. “The mathematical sciences at work with the social sciences: Learning with irregular rewards.” Chapter 1 in Mathematical Sciences and Social Sciences, edited by William H. Kruskal. Englewood Cliffs, NJ: Prentice-Hall, 1970. pp. 5–19.
P78 Frederick Mosteller with the extensive aid of Margaret Martin and Conrad Taeuber. “The profession of social statistician.” Chapter 3 in Mathematical Sciences and Social Sciences, edited by William H. Kruskal. Englewood Cliffs, NJ: Prentice-Hall, 1970. pp. 35–47.
P79 Robert R. Bush and Frederick Mosteller. “Mathematical or stochastic models for learning.” Chapter 14 in Psychology of Learning: Systems, Models, and Theories, by William S. Sahakian. Chicago: Markham Publishing Company, 1970. pp. 280–294.
P80 Gudmund R. Iversen, Willard H. Longcor, Frederick Mosteller, John P. Gilbert, and Cleo Youtz. “Bias and runs in dice throwing and recording: a few million throws.” Psychometrika, 36 (1971), pp. 1–19.
P81 Frederick Mosteller. “Some considerations on the role of probability and statistics in the school mathematics programs of the 1970’s.” In Report of a Conference on Responsibilities for School Mathematics in the 70’s by School Mathematics Study Group, copyright by The Board of Trustees of the Leland Stanford Junior University, 1971. pp. 87–93.
P82 Frederick Mosteller. “The jackknife.” Review of the International Statistical Institute, 39 (1971), pp. 363–368.
P83 Francis J. Anscombe, David H. Blackwell, and Frederick Mosteller (Chairman). “Report of the Evaluation Committee on the University of Chicago Department of Statistics.” The American Statistician, 25, No. 3 (June 1971), pp. 17–24.
P84 Frederick Mosteller. “The Joint American Statistical Association—National Council of Teachers of Mathematics Committee on the Curriculum in Statistics and Probability.” Review of the International Statistical Institute, 39 (1971), pp. 340–342.
P85 Frederick Mosteller. “A data-analytic look at Goldbach counts.” Statistica Neerlandica, 26 (1972), pp. 227–242.
P86 Frederick Mosteller. “An empirical study of the distribution of primes and litters of primes.” Paper 15 in Statistical Papers in Honor of George W. Snedecor, edited by T.A. Bancroft with the assistance of Susan Alice Brown. Ames, IA: The Iowa State University Press, 1972. pp. 245–257.
P87 William Fairley and Frederick Mosteller. “Trial of an adversary hearing: Public policy in weather modification.” International Journal of Mathematical Education in Science and Technology, 3 (1972), pp. 375–383.
P88 Frederick Mosteller. “Chairman’s Introduction.” In Statistics at the School Level, edited by Lennart Råde. Stockholm, Sweden: Almqvist and Wiksell International. New York: Wiley. 1973. pp. 23–37.
P89 Frederick Mosteller and David C. Hoaglin. “Statistics.” In the Encyclopaedia Britannica, Fifteenth edition, 17 (1974), pp. 615–624.
P90 Frederick Mosteller. “The role of the Social Science Research Council in the advance of mathematics in the social sciences.” Social Science Research Council Items, 28, June 1974, pp. 17–24.
P91 William B. Fairley and Frederick Mosteller. “A conversation about Collins.” The University of Chicago Law Review, 41, No. 2 (Winter 1974), pp. 242–253.
P92 Frederick Mosteller. “Robert R. Bush. Early Career.” Journal of Mathematical Psychology, 11 (1974), pp. 163–178.
P93 C.F. Mosteller, E.B. Newman, B.F. Skinner, and R.J. Herrnstein, Chairman. “Stanley Smith Stevens,” Memorial Minute adopted by the Faculty of Arts and Sciences, Harvard University, April 9, 1974. Harvard University Gazette, June 13, 1974.
P94 John P. Gilbert, Richard J. Light, and Frederick Mosteller. “Assessing social innovations: An empirical base for policy.” Chapter 2 in Evaluation and Experiment: Some Critical Issues in Assessing Social Programs, edited by Carl A. Bennett and Arthur A. Lumsdaine. New York: Academic Press, 1975. pp. 39–193.
P95 John P. Gilbert, Richard J. Light, and Frederick Mosteller. “Assessing social innovations: An empirical base for policy.” Chapter 1 in Benefit–Cost and Policy Analysis 1974, an Aldine Annual on forecasting, decision-making, and evaluation. Edited by Richard Zeckhauser, Arnold C. Harberger, Robert H. Haveman, Laurence E. Lynn, Jr., William A. Niskanen, and Alan Williams. Chicago: Aldine Publishing Company, 1975. pp. 3–65.
P96 Frederick Mosteller. “Comment by Frederick Mosteller.” On David K. Cohen’s “The Value of Social Experiments.” In Planned Variation in Education: Should We Give Up or Try Harder?, edited by Alice M. Rivlin and P. Michael Timpane. Washington, D.C.: The Brookings Institution, 1975. pp. 169–172.
P97 Frederick Mosteller, John P. Gilbert, and Bucknam McPeek. “Measuring the quality of life.” In Surgery in the United States: A Summary Report of the Study on Surgical Services for the United States, sponsored jointly by The American College of Surgeons and The American Surgical Association, Volume III. Chicago: The College, 1976. pp. 2283–2299.
P98 John P. Gilbert, Frederick Mosteller, and John W. Tukey. “Steady social progress requires quantitative evaluation to be searching.” In The Evaluation of Social Programs, edited by Clark C. Abt. Beverly Hills, CA: Sage Publications, 1976. pp. 295–312.
P99 Persi Diaconis, Frederick Mosteller, and Hironari Onishi. “Second-order terms for the variances and covariances of the number of prime factors—including the square free case.” Journal of Number Theory, 9 (1977), pp. 187–202.
P100 Committee on Community Reactions to the Concorde, Assembly of Behavioral and Social Sciences, National Research Council. “Community reactions to the Concorde: An assessment of the trial period at Dulles Airport.” Washington, D.C.: National Academy of Sciences, 1977. Committee members: Angus Campbell, Chairman, William Baumol, Robert F. Boruch, James A. Davis, Elizabeth A. Deakin, Kenneth M. Eldred, Henning E. von Gierke, Amos H. Hawley, C. Frederick Mosteller, and H. Wayne Rudmose.
P101 John P. Gilbert, Bucknam McPeek, and Frederick Mosteller. “Statistics and ethics in surgery and anesthesia.” Science, 198 (1977), pp. 684–689. Reprinted in Solutions to Ethical and Legal Problems in Social Research, edited by Robert F. Boruch and Joe S. Cecil. New York: Academic, 1983. pp. 65–82.
P102 Frederick Mosteller. “Experimentation and innovations.” Bulletin of the International Statistical Institute, Proceedings of the 41st Session, New Delhi, 1977, XLVII, Book 1, pp. 559–572.
P103 Frederick Mosteller (with others). “Calendar Year 1977 Report,” National Advisory Council on Equality of Educational Opportunity, March 31, 1978. (Includes “Report of the Task Force on Evaluation,” pp. 15–24, and “Final Report of the Task Force on Evaluation,” pp. 25–38, by Jacquelyne J. Jackson, Haruko Morita, and Frederick Mosteller.)

P104 Oliver Cope, John Hedley-Whyte, Richard J. Kitz, C. Frederick Mosteller, Henning Pontoppidan, William H. Sweet, Leroy D. Vandam, and Myron B. Laver, Chairman. “Henry K. Beecher,” Memorial Minute adopted by the Faculty of Medicine, Harvard University, June 1, 1977. Harvard University Gazette, January 13, 1978. p. 9.
P105 Frederick Mosteller. “Dilemmas in the concept of unnecessary surgery.” Journal of Surgical Research, 25 (1978), pp. 185–192.
P106 Frederick Mosteller. “A resistant analysis of 1971 and 1972 professional football.” In Sports, Games, and Play: Social and Psychological Viewpoints, edited by Jeffrey H. Goldstein. Hillsdale, NJ: Lawrence Erlbaum Associates, 1979. pp. 371–399.
P107 Frederick Mosteller. Comment on “Field experimentation in weather modification,” by Roscoe R. Braham, Jr. Journal of the American Statistical Association, 74 (1979), pp. 88–90.
P108 Frederick Mosteller. “Problems of omissions in communications.” Clinical Pharmacology and Therapeutics, 25, No. 5, Part 2 (1979), pp. 761–764.
P109 Frederick Mosteller and Gale Mosteller. “New statistical methods in public policy. Part I: Experimentation.” Journal of Contemporary Business, 8, No. 3 (1979), pp. 79–92.
P110 Frederick Mosteller and Gale Mosteller. “New statistical methods in public policy. Part II: Exploratory data analysis.” Journal of Contemporary Business, 8, No. 3 (1979), pp. 93–115.
P111 William H. Kruskal and Frederick Mosteller. “Representative sampling, I: Non-scientific literature.” International Statistical Review, 47 (1979), pp. 13–24.
P112 William H. Kruskal and Frederick Mosteller. “Representative sampling, II: Scientific literature, excluding statistics.” International Statistical Review, 47 (1979), pp. 111–127.
P113 William H. Kruskal and Frederick Mosteller. “Representative sampling, III: The current statistical literature.” International Statistical Review, 47 (1979), pp. 245–265.
P114 William H. Kruskal and Frederick Mosteller. “Representative sampling, IV: The history of the concept in statistics, 1895–1939.” International Statistical Review, 48 (1980), pp. 169–195.

P115 Frederick Mosteller. “Classroom and platform performance.” The American Statistician, 34 (1980), pp. 11–17. Reprinted in Amstat News, September 2002, pp. 15–20.
P116 Frederick Mosteller, Bucknam McPeek, and John P. Gilbert. “The clinician’s responsibility for the decision process.” CAHP News Letter (Center for the Analysis of Health Practices, Harvard School of Public Health), Winter 1980, pp. 2–4.
P117 Bucknam McPeek, John P. Gilbert, and Frederick Mosteller. “The clinician’s responsibility for helping to improve the treatment of tomorrow’s patients.” New England Journal of Medicine, 302 (1980), pp. 630–631.
P118 Clark C. Abt and Frederick Mosteller. “Presentation and acceptance of the Lazarsfeld Prize to Frederick Mosteller.” In Problems in American Social Policy Research, edited by Clark C. Abt. Cambridge, MA: Abt Books, 1980. pp. 273–276.
P119 Frederick Mosteller, John P. Gilbert, and Bucknam McPeek. “Reporting standards and research strategies for controlled trials: Agenda for the editor.” Controlled Clinical Trials, 1 (1980), pp. 37–58.
P120 Frederick Mosteller. “Clinical trials methodology: Hypotheses, designs, and criteria for success or failure.” In Medical Advances through Clinical Trials: A Symposium on Design and Ethics of Human Experimentation, May 31 and June 1, 1979. Edmonton, Alberta, Canada, edited by John B. Dossetor, 1980. pp. 12–26.
P121 Stephen Lagakos and Frederick Mosteller. “A case study of statistics in the regulatory process: The FD&C Red No. 40 experiments.” Journal of the National Cancer Institute, 66 (1981), pp. 197–212.
P122 Frederick Mosteller. “Innovation and evaluation.” Science, 211 (27 February 1981), pp. 881–886. Presidential address, annual meeting of the American Association for the Advancement of Science in Toronto, Ontario, Canada, 6 January 1981.
P123 Frederick Mosteller. “Leonard Jimmie Savage memorial service tribute.” Yale University, March 18, 1972. Published in The Writings of Leonard Jimmie Savage—A Memorial Selection, edited by a committee. Washington, D.C.: The American Statistical Association and The Institute of Mathematical Statistics, 1981, pp. 25–28.

P124 Arthur P. Dempster and Frederick Mosteller. “In memoriam: William Gemmell Cochran, 1909–1980.” The American Statistician, 35 (1981), p. 38.
P125 Frederick Mosteller. “Evaluation: Requirements for scientific proof.” Chapter 8 in Coping with the Biomedical Literature: A Primer for the Scientist and the Clinician, edited by Kenneth S. Warren. New York: Praeger, 1981. pp. 103–121.
P126 Frederick Mosteller, Andrew F. Siegel, Edward Trapido, and Cleo Youtz. “Eye fitting straight lines.” The American Statistician, 35 (1981), pp. 150–152.
P127 Frederick Mosteller. “Foreword.” Milbank Memorial Fund Quarterly/Health and Society, 59, No. 3 (1981), pp. 297–307. This introduces and discusses a special issue devoted to medical experimentation and social policy.
P128 William Kruskal and Frederick Mosteller. “Ideas of representative sampling.” In New Directions for Methodology of Social and Behavioral Science: Problems with Language Imprecision, edited by Donald W. Fiske. San Francisco: Jossey-Bass, 1981. pp. 3–24.
P129 Thomas A. Louis, Frederick Mosteller, and Bucknam McPeek. “Timely topics in statistical methods for clinical trials.” Annual Review of Biophysics and Bioengineering, 11 (1982), pp. 81–104.
P130 Bucknam McPeek, Cornelia McPeek, and Frederick Mosteller. “In memoriam: John Parker Gilbert (1926–1980).” The American Statistician, 36 (1982), p. 37.
P131 Frederick Mosteller. “Foreword.” In Contributions to Statistics: William G. Cochran, compiled by Betty I.M. Cochran. New York: Wiley, 1982. pp. vii–xiii.
P132 Rebecca DerSimonian, L. Joseph Charette, Bucknam McPeek, and Frederick Mosteller. “Reporting on methods in clinical trials.” New England Journal of Medicine, 306 (1982), pp. 1332–1337.
P133 Frederick Mosteller and John W. Tukey. “Combination of results of stated precision: I. The optimistic case.” Utilitas Mathematica, 21A (1982), pp. 155–179.
P134 Arthur Dempster, Margaret Drolette, Myron Fiering, Nathan Keyfitz, David D. Rutstein, and Frederick Mosteller, Chairman. “William Gemmell Cochran,” Memorial Minute adopted by the Faculty of Arts and Sciences, Harvard University, November 9, 1982. Harvard University Gazette, December 3, 1982, p. 4.
P135 Frederick Mosteller. “The role of statistics in medical research.” Chapter 1 in Statistics in Medical Research: Methods and Issues, with Applications in Cancer Research, edited by Valerie Miké and Kenneth E. Stanley. New York: Wiley, 1983. pp. 3–20.
P136 J.L. Hodges, Jr., Frederick Mosteller, and Cleo Youtz. “Allocating loss of precision in the sample mean to wrong weights and redundancy in sampling with replacement from a finite population.” In A Festschrift for Erich L. Lehmann in Honor of His Sixty-Fifth Birthday, edited by Peter J. Bickel, Kjell A. Doksum, and J.L. Hodges, Jr. Belmont, CA: Wadsworth, 1983. pp. 239–248.
P137 Frederick Mosteller. “The changing role of the statistician: Getting into the mainstream with policy decisions.” Section Newsletter, Statistics, American Public Health Association, February 1983. Talk given at luncheon in honor of Joel Kleinman, the Spiegelman Award winner, American Public Health Association Annual Meeting, Montreal, Quebec, Canada, November 16, 1982.
P138 Frederick Mosteller. Comment on “Ethical Guidelines for Statistical Practice: Historical Perspective, Report of the ASA Ad Hoc Committee on Professional Ethics, and Discussion.” The American Statistician, 37 (1983), pp. 10–11.
P139 Frederick Mosteller, John P. Gilbert, and Bucknam McPeek. “Controversies in design and analysis of clinical trials.” In Clinical Trials: Issues and Approaches, edited by Stanley H. Shapiro and Thomas A. Louis. New York: Marcel Dekker, 1983. pp. 13–64.
P140 John D. Emerson, Bucknam McPeek, and Frederick Mosteller. “Reporting clinical trials in general surgical journals.” Surgery, 95 (1984), pp. 572–579.
P141 Frederick Mosteller. “Biography of John W. Tukey.” In The Collected Works of John W. Tukey, Volume I, Time Series: 1949–1964, edited by David R. Brillinger. Belmont, CA: Wadsworth Advanced Books and Software, 1984. pp. xv–xvii.
P142 Frederick Mosteller and John W. Tukey. “Combination of results of stated precision: II. A more realistic case.” In W.G. Cochran’s Impact on Statistics, edited by Poduri S.R.S. Rao and Joseph Sedransk. New York: Wiley, 1984. pp. 223–252.
P143 Frederick Mosteller. “Selection of papers by quality of design, analysis, and reporting.” Chapter 6 in Selectivity in Information Systems: Survival of the Fittest, edited by Kenneth S. Warren. New York: Praeger, 1985. pp. 98–116.
P144 Thomas A. Louis, Harvey V. Fineberg, and Frederick Mosteller. “Findings for public health from meta-analysis.” Annual Review of Public Health, 6 (1985), pp. 1–20.
P145 Frederick Mosteller and Milton C. Weinstein. “Toward evaluating the cost-effectiveness of medical and social experiments.” Chapter 6 in Social Experimentation, edited by Jerry A. Hausman and David A. Wise. Chicago: University of Chicago Press, 1985. pp. 221–249.
P146 Frederick Mosteller and David C. Hoaglin. “Description or prediction?” 1984 Proceedings of the Business and Economic Statistics Section. Washington, D.C.: American Statistical Association, 1985. pp. 11–15.
P147 Judy O’Young, Bucknam McPeek, and Frederick Mosteller. “The clinician’s role in developing measures for quality of life in cardiovascular disease.” Quality of Life and Cardiovascular Care, 1 (1985), pp. 290–296.
P148 B. McPeek, M. Gasko, and F. Mosteller. “Measuring outcome from anesthesia and operation.” Theoretical Surgery, 1 (1986), pp. 2–9.
P149 Augustine Kong, G. Octo Barnett, Frederick Mosteller, and Cleo Youtz. “How medical professionals evaluate expressions of probability.” New England Journal of Medicine, 315 (1986), pp. 740–744. Response to comments: New England Journal of Medicine, 316 (1987), p. 551. [Comments, pp. 549–551.]
P150 Stephen W. Lagakos and Frederick Mosteller. “Assigned shares in compensation for radiation-related cancers.” Risk Analysis, 6 (1986), pp. 345–357. Response to comments, pp. 377–380. [Comments, pp. 363–375.]
P151 M.F. McKneally, D.S. Mulder, A. Nachemson, F. Mosteller, and B. McPeek. “Facilitating scholarship: Creating the atmosphere, setting, and teamwork for research.” Chapter 5 in Principles and Practice of Research: Strategies for Surgical Investigators, edited by Hans Troidl, Walter O. Spitzer, Bucknam McPeek, David S. Mulder, and Martin F. McKneally. New York: Springer-Verlag, 1986. pp. 36–42.

P152 Frederick Mosteller. “A statistical study of the writing styles of the authors of The Federalist papers.” Proceedings of the American Philosophical Society, 131 (1987), pp. 132–140.
P153 Morris Hansen and Frederick Mosteller. “William Gemmell Cochran, July 15, 1909—March 29, 1980.” In Biographical Memoirs, 56. Washington, D.C.: National Academy Press, 1987. pp. 60–89.
P154 Kathryn Lasch, Alesia Maltz, Frederick Mosteller, and Tor Tosteson. “A protocol approach to assessing medical technologies.” International Journal of Technology Assessment in Health Care, 3 (1987), pp. 103–122.
P155 Frederick Mosteller. “Implications of measures of quality of life for policy development.” Journal of Chronic Diseases, 40 (1987), pp. 645–650.
P156 Frederick Mosteller. “Assessing quality of institutional care.” American Journal of Public Health, 77 (1987), pp. 1155–1156.
P157 Frederick Mosteller. “Compensating for radiation-related cancers by probability of causation or assigned shares.” Bulletin of the International Statistical Institute, Proceedings of the 46th Session, Tokyo, 1987, LII, Book 4, pp. 571–577.
P158 W. H. Kruskal and F. Mosteller. “Representative sampling.” In Encyclopedia of Statistical Sciences, 8, edited by Samuel Kotz and Norman L. Johnson. New York: Wiley, 1988. pp. 77–81.
P159 John C. Bailar III and Frederick Mosteller. “Guidelines for statistical reporting in articles for medical journals: Amplifications and explanations,” Annals of Internal Medicine, 108 (1988), pp. 266–273.
P160 Frederick Mosteller. “Broadening the scope of statistics and statistical education.” The American Statistician, 42 (1988), pp. 93–99.
P161 “Frederick Mosteller and John W. Tukey: A conversation.” Moderated by Francis J. Anscombe. Statistical Science, 3 (1988), pp. 136–144.
P162 Graham A. Colditz, James N. Miller, and Frederick Mosteller. “The effect of study design on gain in evaluations of new treatments in medicine and surgery.” Drug Information Journal, 22 (1988), pp. 343–352.
P163 Graham A. Colditz, James N. Miller, and Frederick Mosteller. “Measuring gain in the evaluation of medical technology.” International Journal of Technology Assessment in Health Care, 4 (1988), pp. 637–642.

P164 Nan Laird and Frederick Mosteller. Discussion of “Publication bias: A problem in interpreting medical data,” by Colin B. Begg and Jesse A. Berlin. Journal of the Royal Statistical Society, Series A, 151 (1988), p. 456. P165 Frederick Mosteller. “Growth and advances in statistics.” Response to “Should mathematicians teach statistics?” by David S. Moore. College Mathematics Journal, 19 (1988), pp. 15–16. P166 Frederick Mosteller. “‘The muddiest point in the lecture’ as a feedback device.” On Teaching and Learning: The Journal of the Harvard-Danforth Center, 3 (April 1989), pp. 10–21. Also in Medianen 1992 (newsletter of the Education Committee of the Swedish Statistical Association), pp. 1–12. P167 Frederick Mosteller and Elisabeth Burdick. “Current issues in health care technology assessment.” International Journal of Technology Assessment in Health Care, 5 (1989), pp. 123–136. P168 The LORAN Commission. “The LORAN Commission: A summary report.” In Harvard Community Health Plan, 1988 Annual Report, pp. 3–6, 9–14, 17–22, 25–30. (Published in 1989.) Members of the Commission: David Banta, Robert Cushman, Douglas Fraser, Robert Freeman, Betty Friedan, Benjamin Kaplan, Frederick Mosteller, David Nathan, Albert Rees, Hays Rockwell, Robert Sproull, Marshall Wolf, and John Paris. P169 Frederick Mosteller, John E. Ware, Jr., and Sol Levine. “Finale panel: Comments on the conference on advances in health status assessment.” Medical Care, 27, No. 3 Supplement (1989), pp. S282–S294. P170 Graham A. Colditz, James N. Miller, and Frederick Mosteller. “How study design affects outcomes in comparisons of therapy. I: Medical.” Statistics in Medicine, 8 (1989), pp. 441–454. P171 James N. Miller, Graham A. Colditz, and Frederick Mosteller. “How study design affects outcomes in comparisons of therapy. II: Surgical.” Statistics in Medicine, 8 (1989), pp. 455–466. P172 Robert Timothy Reagan, Frederick Mosteller, and Cleo Youtz. 
“Quantitative meanings of verbal probability expressions.” Journal of Applied Psychology, 74 (1989), pp. 433–442. P173 Bucknam McPeek, Frederick Mosteller, and Martin McKneally. “Randomized clinical trials in surgery.” International Journal of Technology Assess-

38

Bibliography

ment in Health Care, 5 (1989), pp. 317–332. P174 Persi Diaconis and Frederick Mosteller. “Methods for studying coincidences.” Journal of the American Statistical Association, 84 (1989), pp. 853–861. 1987 R.A. Fisher Memorial Lecture. P175 Frederick Mosteller and Cleo Youtz. “Quantifying probabilistic expressions” and “Rejoinder” to Comments. Statistical Science, 5 (1990), pp. 2–12, 32–34. P176 Frederick Mosteller. “Improving research methodology: An overview.” Summary in Lee Secrest, Edward Perrin, and John Bunker, editors, Research Methodology: Strengthening Causal Interpretations of Nonexperimental Data, Agency for Health Care Policy and Research Conference Proceedings, U.S. Department of Health and Human Services, May 1990, pp. 221–230. P177 Nan M. Laird and Frederick Mosteller. “Some statistical methods for combining experimental results.” International Journal of Technology Assessment in Health Care, 6 (1990), pp. 5–30. P178 Frederick Mosteller. “Summing up.” Chapter 16 in The Future of Meta– Analysis, Kenneth W. Wachter and Miron L. Straf, editors. Proceedings of a workshop convened by the Committee on National Statistics, National Research Council, October 1986, Hedgesville, West Virginia. New York: Russell Sage Foundation, 1990. pp. 185–190. P179 J.D. Emerson, E. Burdick, D.C. Hoaglin, F. Mosteller, and T.C. Chalmers. “An empirical study of the possible relation of treatment differences to quality scores in controlled randomized clinical trials.” Controlled Clinical Trials, 11 (1990), pp. 339–352. P180 Frederick Mosteller and fellow members of National Cancer Institute Extramural Committee to Assess Measures of Progress Against Cancer. “Special report: Measurement of progress against cancer.” Journal of the National Cancer Institute, 82, No. 10 (1990), pp. 825–835. P181 H. Troidl et al., editors. Principles and Practice of Research: Strategies for Surgical Investigators, second edition. New York: Springer-Verlag, 1991. • Chapter I5. A.S. Wechsler, M.F. 
McKneally, C.M. Balch, F. Mosteller, B. McPeek, and D.S. Mulder. “Strengthening the research environment.” pp. 31–40. • Chapter II11. R.E. Pollack, C.M. Balch, J. Roth, B. McPeek, and F. Mosteller. “Formulating an initial research plan.” pp. 88–90.

• Chapter II14. B. McPeek, F. Mosteller, M.F. McKneally, and E.A.M. Neugebauer. “Experimental methods: Clinical trials.” pp. 114–125. P182 In Elisabeth Burdick, Marie A. McPherson, and Frederick Mosteller, guest editors. “Special section: The contribution of medical registries to technology assessment.” International Journal of Technology Assessment in Health Care, 7, No. 2 (Spring 1991). • Alexia Antczak-Bouckoms, Elisabeth Burdick, Sidney Klawansky, and Frederick Mosteller. “Introduction: Using medical registries and data sets for technology assessment.” pp. 123–128. • Grace Wyshak, Elisabeth Burdick, and Frederick Mosteller. “Technology assessment in the Connecticut Tumor Registry.” pp. 129–133. • Sidney Klawansky, Alexia Antczak-Bouckoms, Judith Barr, Elisabeth Burdick, Mark S. Roberts, Grace Wyshak, and Frederick Mosteller. “Using medical registries and data sets for technology assessment: An overview of seven case studies.” pp. 194–199. P183 Frederick Mosteller. “The contributions of firms: A fresh movement in medicine.” Medical Care, 29, No. 7, Supplement (July 1991), pp. JS3–JS4. P184 Frederick Mosteller. “Comment” on Jessica Utts, “Replication and meta-analysis in parapsychology.” Statistical Science, 6, No. 4 (1991), pp. 395–396. P185 Frederick Mosteller. “From the President of the ISI.” ISI Newsletter, 16, No. 1 (January 1992), p. 2. P186 J.D. Bentkover, R.H. Sheshinski, J. Hedley-Whyte, C.A. Warfield, and F. Mosteller. “Lower back pain: Laminectomies, spinal fusions, demographics, and socioeconomics.” International Journal of Technology Assessment in Health Care, 8, No. 2 (1992), pp. 309–317. P187 Elliott M. Antman, Joseph Lau, Bruce Kupelnick, Frederick Mosteller, and Thomas C. Chalmers. “A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts: treatments for myocardial infarction.” Journal of the American Medical Association, 268, No. 2 (July 8, 1992), pp. 240–248. P188 Robert D.
Morris, Anne-Marie Audet, Italo F. Angelillo, Thomas C. Chalmers, and Frederick Mosteller. “Chlorination, chlorination by-products, and cancer: A meta-analysis.” American Journal of Public Health, 82, No. 7 (July 1992), pp. 955–963. P189 Joseph Lau, Elliott M. Antman, Jeanette Jimenez-Silva, Bruce Kupelnick, Frederick Mosteller, and Thomas C. Chalmers. “Cumulative meta-analysis

of therapeutic trials for myocardial infarction.” New England Journal of Medicine, 327, No. 4 (July 23, 1992), pp. 248–254. P190 Frederick Mosteller and Thomas C. Chalmers. “Some progress and problems in meta-analysis of clinical trials.” Statistical Science, 7, No. 2 (1992), pp. 227–236. P191 Frederick Mosteller. “Message from the President.” ISI Newsletter, 16, No. 3 (October 1992), p. 2. P192 Harris Cooper, Kristina M. DeNeve, and Frederick Mosteller. “Predicting professional sports game outcomes from intermediate game scores.” Chance, 5, No. 3–4 (1992), pp. 18–22. P193 Harris Cooper and Frederick Mosteller. “The fourth quarter of the Super Bowl.” National Academy of Sciences, Office of News and Public Information, January 22, 1993. P194 Frederick Mosteller. “Message from the President.” ISI Newsletter, 17, No. 1 (January 1993), p. 2. P195 Wafaie W. Fawzi, Thomas C. Chalmers, M. Guillermo Herrera, and Frederick Mosteller. “Vitamin A supplementation and child mortality: A meta-analysis.” Journal of the American Medical Association, 269, No. 7 (February 17, 1993), pp. 898–903. P196 Miriam E. Adams, Alexia Antczak-Bouckoms, Howard S. Frazier, Joseph Lau, Thomas C. Chalmers, and Frederick Mosteller. “Assessing the effectiveness of ambulatory cardiac monitoring for specific clinical indications, introduction.” International Journal of Technology Assessment in Health Care, 9, No. 1 (1993), pp. 97–101. P197 Frederick Mosteller and Cleo Youtz. “Professional golf scores are Poisson on the final tournament days.” 1992 Proceedings of the Section on Statistics in Sports. Alexandria, VA: American Statistical Association (1992), pp. 39–51. P198 Debra R. Milamed, Carol A. Warfield, John Hedley-Whyte, and Frederick Mosteller. “Laminectomy and the treatment of lower-back pain in Massachusetts.” International Journal of Technology Assessment in Health Care, 9, No. 3 (1993), pp. 426–439. P199 Frederick Mosteller and Cleo Youtz. “Where eagles fly.” Chance, 6, No. 2 (1993), pp.
37–42.

P200 Frederick Mosteller. “Message from the President.” ISI Newsletter, 17, No. 2 (May 1993), p. 2. P201 Jane C. Ballantyne, Daniel B. Carr, Thomas C. Chalmers, Keith B. G. Dear, Italo F. Angelillo, and Frederick Mosteller. “Postoperative patient-controlled analgesia: Meta-analyses of initial randomized control trials.” Journal of Clinical Anesthesiology, 5 (May/June 1993), pp. 182–193. P202 Dennis L. Kasper, Ramzi S. Cotran, Frederick Mosteller, Peter V. Tishler, Stephen Zinner, Sir Kenneth Stuart, Frank E. Speizer, Michelene Matthews–Roth, James O. Taylor, Thomas H. Weller, Jeffrey Parsonnet, and Richard Platt. Edward H. Kass memorial minute adopted by the Faculty of Medicine, Harvard University, May 26, 1993. Harvard Gazette (July 9, 1993), pp. 11–12. P203 Frederick Mosteller and Howard S. Frazier. “Improving the contributions of technology assessment to the health care system of the U.S.A.” Journal of the Italian Statistical Society, 1, No. 3 (1992), pp. 297–310. P204 Frederick Mosteller. “The prospect of data-based medicine in the light of ECPC.” The Milbank Quarterly, 71, No. 3 (1993), pp. 523–532. P205 Judith A. Hall, Linda Tickle-Degnen, Robert Rosenthal, and Frederick Mosteller. “Hypotheses and problems in research synthesis.” Chapter 2 in The Handbook of Research Synthesis, Harris Cooper and Larry V. Hedges, editors. New York: Russell Sage Foundation, 1994. pp. 17–28. P206 Sidney Klawansky, Catherine Berkey, Nirav Shah, Frederick Mosteller, and Thomas C. Chalmers. “Survival from localized breast cancer: Variability across trials and registries.” International Journal of Technology Assessment in Health Care, 9, No. 4 (1993), pp. 539–553. P207 Graham A. Colditz, Timothy F. Brewer, Catherine S. Berkey, Mary E. Wilson, Elisabeth Burdick, Harvey V. Fineberg, and Frederick Mosteller. “Efficacy of BCG vaccine in the prevention of tuberculosis: meta-analysis of the published literature.” Journal of the American Medical Association, 271, No. 9 (March 2, 1994), pp.
698–702. P208 Les Irwig, Anna N. A. Tosteson, Constantine Gatsonis, Joseph Lau, Graham Colditz, Thomas C. Chalmers, and Frederick Mosteller. “Guidelines for meta-analyses evaluating diagnostic tests.” Annals of Internal Medicine, 120, No. 8 (April 15, 1994), pp. 667–676. P209 John P. Bunker, Howard S. Frazier, and Frederick Mosteller. “Improving health: Measuring effects of medical care.” The Milbank Quarterly, 72,

No. 2 (1994), pp. 225–258. P210 Frederick Mosteller. “Introduction.” In “The Moriguti report on the role of statisticians: A discussion.” International Statistical Institute Occasional Paper Series #4, Voorburg, Netherlands, 1994. p. 1. P211 John D. Emerson, David C. Hoaglin, and Frederick Mosteller. “A modified random-effect procedure for combining risk difference in sets of 2 × 2 tables from clinical trials.” Journal of the Italian Statistical Society, 2, No. 3 (1993), pp. 269–290. P212 C.S. Berkey, D.C. Hoaglin, F. Mosteller, and G.A. Colditz. “A random-effects regression model for meta-analysis.” Statistics in Medicine, 14 (1995), pp. 395–411. P213 Vic Hasselblad, Frederick Mosteller, Benjamin Littenberg, Thomas C. Chalmers, Maria G.M. Hunink, Judith A. Turner, Sally C. Morton, Paula Diehr, John B. Wong, and Neil R. Powe. “A survey of current problems in meta-analysis: Discussion from the Agency for Health Care Policy and Research Inter-PORT Work Group on Literature Review/Meta-Analysis.” Medical Care, 33, No. 2 (1995), pp. 202–220. P214 C.S. Berkey, A. Antczak-Bouckoms, D.C. Hoaglin, F. Mosteller, and B.L. Pihlstrom. “Multiple-outcomes meta-analysis of treatments for periodontal disease.” Journal of Dental Research, 74, No. 4 (April 1995), pp. 1030–1039. P215 Graham A. Colditz, Catherine S. Berkey, Frederick Mosteller, Timothy F. Brewer, Mary E. Wilson, Elisabeth Burdick, and Harvey V. Fineberg. “The efficacy of Bacillus Calmette-Guérin vaccination of newborns and infants in the prevention of tuberculosis: Meta-analyses of the published literature.” Pediatrics, 96, No. 1 (July 1995), pp. 29–35. P216 Graham A. Colditz, Elisabeth Burdick, and Frederick Mosteller. “Heterogeneity in meta-analysis of data from epidemiologic studies: A commentary.” American Journal of Epidemiology, 142, No. 4 (August 15, 1995), pp. 371–382. P217 Elon Eisenberg, Catherine S. Berkey, Daniel B. Carr, Frederick Mosteller, and Thomas C. Chalmers.
“Efficacy and safety of nonsteroidal antiinflammatory drugs for cancer pain: A meta-analysis.” Journal of Clinical Oncology, 12, No. 12 (December 1994), pp. 2756–2765. P218 Frederick Mosteller. “The Tennessee study of class size in the early school grades.” The Future of Children, 5, No. 2 (Summer/Fall 1995), pp. 113–

127. P219 John P. Bunker, Howard S. Frazier, and Frederick Mosteller. “The role of medical care in determining health: Creating an inventory of benefits.” Chapter 10 in Society and Health, Benjamin C. Amick III, Sol Levine, Alvin R. Tarlov, and Diana Chapman Walsh, editors. New York: Oxford University Press, 1995. pp. 305–341. P220 Frederick Mosteller. “Expansion of notes of the vaccine group.” Clinical Research and Regulatory Affairs, Special Issue on the First National Meeting on Research Synthesis: Applications to Drug Regulatory Policy and Health Care Policy, 13, No. 1 (1996), pp. 51–56. P221 Frederick Mosteller. “Editorial: The promise of risk-based allocation trials in assessing new treatments.” American Journal of Public Health, 86, No. 5 (May 1996), pp. 622–623. P222 Frederick Mosteller and Graham A. Colditz. “Understanding research synthesis (meta-analysis).” Annual Review of Public Health, 17 (1996), pp. 1–23. P223 Frederick Mosteller and fellow members of the National Institutes of Health Technology Assessment Panel on Integration of Behavioral and Relaxation Approaches into the Treatment of Chronic Pain and Insomnia. “Integration of behavioral and relaxation approaches into the treatment of chronic pain and insomnia.” Journal of the American Medical Association, 276, No. 4 (July 24/31, 1996), pp. 313–318. P224 John D. Emerson, David C. Hoaglin, and Frederick Mosteller. “Simple robust procedures for combining risk differences in sets of 2 × 2 tables.” Statistics in Medicine, 15, No. 14 (1996), pp. 1465–1488. P225 Frederick Mosteller. “Discussant comments on ‘So what? The implications of new analytic methods for designing NCES surveys,’ by Robert F. Boruch and George Terhanian.” From Data to Information. New Directions for the National Center for Education Statistics, National Center for Education Statistics Conference Proceedings, U.S. Department of Education, NCES 96-901, August 1996, pp. 4-116–4-118. P226 Catherine S. 
Berkey, Frederick Mosteller, Joseph Lau, and Elliott M. Antman. “Uncertainty of the time of first significance in random effects cumulative meta-analysis.” Controlled Clinical Trials, 17, No. 5 (1996), pp. 357–371.

P227 Frederick Mosteller, Richard J. Light, and Jason A. Sachs. “Sustained inquiry in education. Lessons from skill grouping and class size.” Harvard Educational Review, 66, No. 4 (Winter 1996), pp. 797–842. P228 Bernard Rosner, Frederick Mosteller, and Cleo Youtz. “Modeling pitcher performance and the distribution of runs per inning in Major League Baseball.” The American Statistician, 50, No. 4 (1996), pp. 352–360. P229 Jane C. Ballantyne, Daniel B. Carr, Catherine S. Berkey, Thomas C. Chalmers, and Frederick Mosteller. “Comparative efficacy of epidural, subarachnoid, and intracerebroventricular opioids in patients with pain due to cancer.” Regional Anesthesia, 21, No. 6 (1996), pp. 542–556. P230 Edward J. Miech, Bill Nave, and Frederick Mosteller. “On CALL: A review of computer-assisted language learning in U.S. colleges and universities.” Educational Media and Technology Yearbook 1997, Robert Maribe Branch and Barbara B. Minor, editors. Englewood, CO: Libraries Unlimited, Inc., 22 (1997), pp. 61–84. The complete report of this study with five appendices is available in ERIC as ED 394 525. P231 Lincoln E. Moses and Frederick Mosteller. “Experimentation: Just do it!” Chapter 12 in Statistics and Public Policy (Festschrift in honor of I. Richard Savage), Bruce D. Spencer, editor. New York: Oxford University Press, 1997. pp. 212–232. P232 Frederick Mosteller. “Project report: The Tennessee study of class size in the early school grades.” Bulletin of the American Academy of Arts and Sciences, L, No. 7 (1997), pp. 14–25. P233 Frederick Mosteller. “Lessons from sports statistics.” 1996 Proceedings of the Section on Statistics in Sports. Alexandria, VA: American Statistical Association, 1996. pp. 1–8. P234 Frederick Mosteller. “Smaller classes do make a difference in the early grades.” The Harvard Education Letter (July/August 1997), pp. 5–7. P235 Frederick Mosteller. “Lessons from sports statistics.” The American Statistician, 51, No. 4 (1997), pp. 305–310. 
Reprinted in Anthology of Statistics in Sports (ASA-SAIM Series on Statistics and Applied Probability, Vol. 16), edited by Jim Albert, Jay Bennett, and James J. Cochran. Philadelphia: Society for Industrial and Applied Mathematics, 2005. pp. 245–250.

P236 In H. Troidl, M.F. McKneally, D.S. Mulder, A.S. Wechsler, B. McPeek, and W.O. Spitzer, editors. Surgical Research: Basic Principles and Clinical Practice, third edition. New York: Springer-Verlag, 1998. • Chapter 26. M.F. McKneally, B. McPeek, F. Mosteller, and E.A.M. Neugebauer. “Clinical trials.” pp. 197–209. • Chapter 36. F. Mosteller and B. Rosner. “Introductory biostatistics texts: Selected annotated bibliography.” pp. 321–327. • Chapter 40. R.E. Pollock, C.M. Balch, J. Roth, B. McPeek, and F. Mosteller. “Formulating an initial research plan.” pp. 363–365. P237 John D. Emerson and Frederick Mosteller. “Interactive multimedia in college teaching. Part I: A ten-year review of reviews.” In Educational Media and Technology Yearbook 1998, Robert Maribe Branch and Mary Ann Fitzgerald, editors. Englewood, CO: Libraries Unlimited, Inc. 23 (1998), pp. 43–58. P238 John D. Emerson and Frederick Mosteller. “Interactive multimedia in college teaching. Part II: Lessons from research in the sciences.” In Educational Media and Technology Yearbook 1998, Robert Maribe Branch and Mary Ann Fitzgerald, editors. Englewood, CO: Libraries Unlimited, Inc. 23 (1998), pp. 59–75. P239 Lincoln E. Moses and Frederick Mosteller. “Frontiers of biostatistics.” Encyclopedia of Biostatistics, Peter Armitage and Theodore Colton, editors. Chichester, England: Wiley, 1998. pp. 1590–1597. P240 Frederick Mosteller. “The Tennessee study of class size in the early school grades.” In The Practice of Data Analysis: Essays in Honor of John W. Tukey, D.R. Brillinger, L.T. Fernholz, and S. Morgenthaler, editors. Princeton, NJ: Princeton University Press, 1997. pp. 261–277. P241 J.C. Ballantyne, D.B. Carr, S. DeFerranti, T. Suarez, J. Lau, T.C. Chalmers, I.F. Angelillo, and F. Mosteller. “The comparative effects of postoperative analgesic therapies on pulmonary outcome: cumulative meta-analysis of randomized controlled trials.” Anesthesia & Analgesia, 86, No. 3 (1998), pp. 598–612. P242 C.S. 
Berkey, D.C. Hoaglin, A. Antczak-Bouckoms, F. Mosteller, and G.A. Colditz. “Meta-analysis of multiple outcomes by regression with random effects.” Statistics in Medicine, 17 (1998), pp. 2537–2550. P243 Frederick Mosteller, Richard J. Light, and Jason A. Sachs. “Sustained inquiry in education: Lessons from skill grouping and class size.” In Cool Thinking on Hot Topics: A Research Guide for Educators. Cambridge,

MA: Harvard Educational Review, 1998. pp. 67–113. P244 Graham A. Colditz, Catherine S. Berkey, and Frederick Mosteller. “Meta-analysis of randomized trials.” Chapter 6 in Charles H. Hennekens, Julie E. Buring, JoAnn E. Manson, and Paul M. Ridker, editors. Clinical Trials in Cardiovascular Disease: A Companion to Braunwald’s Heart Disease. Philadelphia: W.B. Saunders Company, 1999. pp. 43–51. P245 Frederick Mosteller. “The case for smaller classes and for evaluating what works in the schoolroom.” Harvard Magazine, 101, No. 5 (1999), pp. 34–35. P246 Frederick Mosteller. “How does class size relate to achievement in schools?” Chapter 6 in Earning and Learning, Susan E. Mayer and Paul E. Peterson, editors. Washington, D.C.: Brookings Institute Press and Russell Sage Foundation, 1999. pp. 117–129. P247 Jennifer Taylor and Frederick Mosteller. “Runaways: A review of the literature.” Boston, MA: American Academy of Arts and Sciences, 1999. P248 John D. Emerson and Frederick Mosteller. “Development programs for college faculty: Preparing for the twenty-first century.” In R.M. Branch and M.A. Fitzgerald, editors. Educational Media and Technology Yearbook, Englewood, CO: Libraries Unlimited, Inc., 25, 2000. pp. 26–42. P249 John D. Emerson, Frederick Mosteller, and Cleo Youtz. “Students can help improve college teaching: A review for an agenda for the statistics profession.” In C.R. Rao and G. Szekely, editors. Statistics for the 21st Century. New York: Marcel Dekker, 2000. pp. 145–172. P250 Bill Nave, Edward J. Miech, and Frederick Mosteller. “A lapse in standards: Linking standards-based reform with student achievement.” Phi Delta Kappan, 82, No. 2 (2000), pp. 128–133. P251 Bill Nave, Edward J. Miech, and Frederick Mosteller. “The role of field trials in evaluating school practices: A rare design.” In Daniel L. Stufflebeam, George F. Madaus, Thomas Kellaghan, editors. Evaluation Models: Viewpoints on Educational and Human Services Evaluation, second edition.
Boston, MA: Kluwer Academic Publishers, 2000. pp. 145–161. P252 Edward J. Miech, Bill Nave, and Frederick Mosteller. “Large-scale professional development for schoolteachers: Cases from Pittsburgh, New York City, and the National School Reform Faculty.” Chapter 7 in Richard J. Light, editor. Evaluation Findings That Surprise, New Directions in Eval-

uation, No. 90. San Francisco: Jossey-Bass, 2001. pp. 83–99. P253 Lincoln E. Moses, Frederick Mosteller, and John H. Buehler. “Comparing results of large clinical trials to those of meta-analyses.” Statistics in Medicine, 21, No. 6 (2002), pp. 793–800. P254 John D. Emerson, Lisa Boes, and Frederick Mosteller. “Critical thinking in college students: Critical issues in empirical research.” In Educational Media and Technology Yearbook, M.A. Fitzgerald, M. Orey, and R.M. Branch, editors. Englewood, CO: Libraries Unlimited, Inc., 27, 2002. pp. 52–71. P255 Mosteller, F., Nave, B., and Miech, E.J. “Why we need a structured abstract in education research.” Educational Researcher, 33, no. 1 (2004), 29–34. P256 John D. Emerson and Frederick Mosteller. “Cooperative learning in schools and colleges: I. Teamwork in college mathematics.” Educational Media and Technology Yearbook, M. Orey, M.A. Fitzgerald, and R.M. Branch, editors. Westport, CT: Libraries Unlimited, Inc., 29, 2004. pp. 132–147. P257 John D. Emerson and Frederick Mosteller. “Cooperative learning in schools and colleges: II. A review of reviews.” Educational Media and Technology Yearbook, M. Orey, M.A. Fitzgerald, and R.M. Branch, editors. Westport, CT: Libraries Unlimited, Inc., 29, 2004. pp. 148–162. P258 Edward J. Miech, Bill Nave, and Frederick Mosteller. “The 20,000 article problem: How a structured abstract could help practitioners sort out educational research.” Phi Delta Kappan, 86, No. 5 (2005), pp. 396–400. P259 Frederick Mosteller. “Biographical Memoirs: John W. Tukey.” Proceedings of the American Philosophical Society, 149, No. 4 (2005), pp. 626–630.

MISCELLANEOUS M1–M6 Articles on magic. M1 Frederick Mosteller. “Encore.” Part I. In My Best, edited by J.G. Thompson, Jr. Philadelphia: Charles H. Hopkins & Co., 1945. pp. 103–104. Related to M3. M2–M6 Appeared in The Phoenix, a two–sheet biweekly publication. Issues 1 through 73 were edited by Walter Gibson and Bruce Elliott; later

issues were edited by Bruce Elliott. All issues were published by Louis Tannen. Issues have now been bound into sets of 50 (1–50, 50–100, etc.) and distributed by Louis Tannen Inc., New York, NY. M2 “Bravo.” Issue 49, December 3, 1943, pp. 200–201. M3 “Encore.” Issue 58, April 14, 1944, pp. 236–237. M4 “The Back Room.” Issue 87, p. 355. M5 “Ambiguous.” Issue 117, January 10, 1947, p. 470. M6 “Thesis.” Issue 118, January 24, 1947, p. 475. M7 and M8 are in “Letters from Readers.” In The Bridge World, edited by Ely Culbertson et al. M7 Frederick Mosteller. “Anti Zankl.” 17, No. 8, May 1946, pp. 2–3. M8 Frederick Mosteller. “Eight points.” 17, No. 12, September 1946, p. 2. M9 Frederick Mosteller. Contribution to Standard Sampling Procedures. Material Inspection Service, U.S.N. Administration Manual, 1945. M10 Frederick Mosteller and John W. Tukey. Binomial probability graph paper (No. 32,298). Norwood, MA: Codex Book Company, 1946. M11 Frederick Mosteller. “School finances questioned.” Letter in Carnegie Alumnus, 33, No. 1, September 1947, pp. 1–2. M12 Frederick Mosteller. Editor of “Questions and Answers.” In The American Statistician, October 1947–December 1951. Questions 1–30. M13 Frederick Mosteller. “Can you be a successful gambler?” TV Guide, May 13–19, 1961, pp. 6–7. M14 Frederick Mosteller. “Textbook supplements.” In Gottfried E. Noether, Guide to Probability and Statistics, especially prepared for Continental Classroom. Reading, MA: Addison-Wesley, 1961. pp. 43–51. M15 Frederick Mosteller. “Foreword.” In Probability and Statistics—An Introduction Through Experiments, by Edmund C. Berkeley. New York: Science

Materials Center, 1961. pp. v–vii. M16 Frederick Mosteller, contributor. Goals for School Mathematics. The Report of the Cambridge Conference on School Mathematics. Educational Services Incorporated. Boston: Houghton Mifflin, 1963. M17 Frederick Mosteller. “Foreword.” In Math and After Math, by Robert Hooke and Douglas Shaffer. New York: Walker and Company, 1965. pp. ix–xii. M18 Frederick Mosteller. “Age of achievement.” Letter in The Wall Street Journal, July 6, 1967. M19 Frederick Mosteller. “The President Reports: Three Major ASA Actions.” The American Statistician, 21 (4), 1967, pp. 2–4. M20 Coeditor of Reports of the National Assessment of Educational Progress. A Project of the Education Commission of the States. Report 2. Citizenship: National Results. November 1970. Washington, D.C.: Superintendent of Documents, U.S. Government Printing Office. Report 3. 1969–1970 Writing: National Results. November 1970. Washington, D.C.: Superintendent of Documents, U.S. Government Printing Office. Report 2–1. Citizenship: National Results. November 1970. Education Commission of the States, Denver, Colorado and Ann Arbor, Michigan. Report 4. 1969–1970 Science: Group Results for Sex, Region, and Size of Community. April 1971. Washington, D.C.: Superintendent of Documents, U.S. Government Printing Office. Report 5. 1969–1970 Writing: Group Results for Sex, Region, and Size of Community (Preliminary Report). April 1971. Education Commission of the States, Denver, Colorado and Ann Arbor, Michigan. Report 7. 1969–1970 Science: Group and Balanced Group Results for Color, Parental Education, Size and Type of Community and Balanced Group Results for Region of the Country, Sex. December 1971. Education Commission of the States, Denver, Colorado.

Report 8. National Results—Writing Mechanics. February 1972. Washington, D.C.: Superintendent of Documents, U.S. Government Printing Office. Report 9. Citizenship: 1969–1970 Assessment: Group Results for Parental Education, Color, Size and Type of Community. May 1972. Education Commission of the States, Denver, Colorado. Report 02-GIY. Reading and Literature: General Information Yearbook. May 1972. Education Commission of the States, Denver, Colorado. M21 Frederick Mosteller. “Introduction.” In Structural and Statistical Problems for a Class of Stochastic Processes, The First Samuel Stanley Wilks Lecture at Princeton University, March 17, 1970, by Harald Cramér. Princeton, NJ: Princeton University Press, 1971. pp. 1–2. M22 Frederick Mosteller. A page on subject matter at secondary level. In Developments in Mathematical Education, Proceedings of the Second International Congress on Mathematical Education, edited by A.G. Howson. Cambridge: The University Press, 1973. p. 27. M23 Frederick Mosteller. “Foreword.” In Social Statistics in Use, by Philip M. Hauser. New York: Russell Sage Foundation, 1975. pp. vii–viii. M24 Frederick Mosteller. “Report of the President.” In “Officers’ Reports, 1975.” The Institute of Mathematical Statistics Bulletin, 4 (1975), pp. 207–208. M25 Frederick Mosteller. Research Resources Evaluation Panel, coordinated by Bolt, Beranek, and Newman. Assuring the Resources for Biomedical Research: an Evaluation of the Scientific Mission of the Division of Research Resources, National Institutes of Health. October 1976. M26 Frederick Mosteller. “Swine Flu: Quantifying the ‘Possibility.’ ” Letter in Science, 192 (25 June 1976), pp. 1286 and 1288. M27 Frederick Mosteller. “Who Said It?” Letter in Royal Statistical Society News & Notes, 5, No. 2, October 1978, p. 5. M28 Frederick Mosteller. Testimony by Frederick Mosteller. In “Kenneth Prewitt, Frederick Mosteller, and Herbert A.
Simon testify at National Science Foundation Hearings.” Social Science Research Council Items, 34 (March 1980), pp. 4–5. All testimonies pp. 1–7.

M29 Frederick Mosteller. “Regulation of social research.” Editorial in Science, 208 (13 June 1980), p. 1219. M30 Frederick Mosteller. “The next 100 years of Science.” In “Science Centennial 3 July 1980 to 4 July 1980.” Centennial Issue of Science, edited by Philip H. Abelson and Ruth Kulstad. 209 (4 July 1980), pp. 21–23. M31 Frederick Mosteller. “Social programs.” Transaction/Social Science and Modern Society, 17, No. 6 (September/October 1980), pp. 10–12. M32 Frederick Mosteller. “Taking science out of social science.” Editorial in Science, 212 (17 April 1981), p. 291. M33 Frederick Mosteller. “Improving the precision of clinical trials.” Editorial in American Journal of Public Health, 72 (May 1982), p. 430. M34 Frederick Mosteller. “The imperfect science of victim compensation.” Washington, D.C.: The Washington Times, June 4, 1985. p. 12A. M35 Frederick Mosteller. Abstract of talk on “Assigned Shares: Probability of radiation as a cause of cancer.” Final Report, ASA Conference on Radiation and Health, Coolfont V. Washington, D.C.: American Statistical Association, 1985. pp. 21–22. M36 In Data: A Collection of Problems from Many Fields for the Student and Research Worker, edited by D.F. Andrews and A.M. Herzberg. New York: Springer-Verlag, 1985. • S.W. Lagakos and F. Mosteller. “Time to death and type of death in mice receiving various doses of Red Dye No. 40.” pp. 239–243. • F. Mosteller and D.L. Wallace. “Disputed authorship: The Federalist Papers.” pp. 423–425. M37 Frederick Mosteller. “Foreword.” In New Developments in Statistics for Psychology and the Social Sciences, edited by A.D. Lovie, London and New York: The British Psychological Society and Methuen, 1986. pp. vii– ix. M38 Barbara J. Culliton and Frederick Mosteller. “How big is ‘big,’ how rare is ‘rare?’ Inquiring minds want to know.” The Newsletter of the National Association of Science Writers, 34, No. 3 (September 1986), p. 13. M39 Frederick Mosteller. 
“Foreword.” In News & Numbers: A Guide to Reporting Statistical Claims and Controversies in Health and Related Fields, by Victor Cohn. Ames, IA: Iowa State University Press, 1989. pp. ix–x.

M40 Frederick Mosteller. “A word of welcome.” Ad Hoc, 49th Session of the ISI, August 25, 1993, pp. 1–2. M41 Frederick Mosteller, Contributor to The Milbank Memorial Fund at 90. New York: Milbank Memorial Fund, 1995. pp. 73–74.

REVIEWS R1 Guide for Quality Control and Control Chart Method of Analyzing Data. American War Standards, Z1.1–1941 and Z1.2–1941. New York: American Standards Association, 1941, and Control Chart Method of Controlling Quality During Production. American War Standards, Z1.3-1942. New York: American Standards Association, 1942. Frederick Mosteller. Journal of the American Statistical Association, 40 (1945), pp. 379–380. R2 Ledyard R. Tucker. “Maximum validity of a test with equivalent items.” Psychometrika, 11 (1946), pp. 1–13. F. Mosteller. Mathematical Reviews, 7 (1946), pp. 463–464. R3 Frederick E. Croxton and Dudley J. Cowden. “Tables to facilitate computation of sampling limits of s, and fiducial limits of sigma.” Industrial Quality Control, 3 (July 1946), pp. 18–21. Frederick Mosteller. Mathematical Tables and Other Aids to Computation, II, No. 18, April 1947. National Research Council. p. 258. R4 Hans Zeisel (Introduction by Paul F. Lazarsfeld). Say It with Figures. New York: Harper and Brothers, 1947. Frederick Mosteller. Public Opinion Quarterly, 11 (Fall 1947), pp. 468– 469. R5 Quinn McNemar. “Opinion-attitude methodology.” Psychological Bulletin, 43, No. 4 (July 1946), pp. 289–374. Washington, D.C.: American Psychological Association. Frederick Mosteller. Journal of the American Statistical Association, 42 (1947), pp. 192–195. R6 Paul G. Hoel. Introduction to Mathematical Statistics. New York: Wiley, 1947.

Frederick Mosteller. The Journal of Business of the University of Chicago, 20 (1947), pp. 176–177. R7 George W. Snedecor. Statistical Methods. Ames, IA: The Iowa State College Press, 1946. Frederick Mosteller. Annals of Mathematical Statistics, 19 (1948), pp. 124–126. R8 Abraham Wald. Sequential Analysis. New York: Wiley, 1947. Frederick Mosteller. Journal of Applied Mechanics, 15 (1948), pp. 89–90. R9 Palmer O. Johnson. Statistical Methods in Research. New York: Prentice-Hall, 1949. Frederick Mosteller. Journal of the American Statistical Association, 44 (1949), pp. 570–572. R10 Norbert Wiener. Cybernetics, or Control and Communication in the Animal and the Machine. New York: Wiley, 1948. Frederick Mosteller. Journal of Abnormal and Social Psychology, 44 (1949), pp. 558–560. R11 N. Rashevsky. Mathematical Theory of Human Relations: An Approach to a Mathematical Biology of Social Phenomena. Mathematical Biophysics Monograph Series No. 2. Bloomington, IN: Principia Press, 1948. Frederick Mosteller. Journal of the American Statistical Association, 44 (1949), pp. 150–155. R12 S.S. Wilks. Elementary Statistical Analysis. Princeton, NJ: Princeton University Press, 1948. Frederick Mosteller. Psychometrika, 15 (1950), pp. 73–76. R13 Frank Yates. Sampling Methods for Censuses and Surveys. London: Charles Griffin and Company, 1949. Frederick Mosteller. The Review of Economics and Statistics, 32 (1950), pp. 267–268. R14 W.E. Deming. Some Theory of Sampling. New York: Wiley, 1950.

Frederick Mosteller. Psychological Bulletin, 48 (1951), pp. 454–455. R15 Gamma Globulin in the Prophylaxis of Poliomyelitis: An evaluation of the efficacy of gamma globulin in the prophylaxis of paralytic poliomyelitis as used in the United States 1953. Public Health Monograph No. 20. Report of the National Advisory Committee for the Evaluation of Gamma Globulin in the Prophylaxis of Poliomyelitis, Public Health Publication No. 358, U.S. Department of Health, Education, and Welfare. Washington, D.C.: Superintendent of Documents, U.S. Government Printing Office, 1954. Frederick Mosteller. Journal of the American Statistical Association, 49 (1954), pp. 926–927. R16 Paul F. Lazarsfeld, editor. Mathematical Thinking in the Social Sciences. Glencoe, IL: The Free Press, 1954. Frederick Mosteller. American Anthropologist, 58 (1956), pp. 736–739. R17 John Cohen and Mark Hansel. Risk and Gambling: The Study of Subjective Probability. London: Longmans, Green, and Co., 1956. Frederick Mosteller. Econometrica, 27 (1959), pp. 505–506. R18 Herbert A. Simon. Models of Man: Social and Rational. Mathematical Essays on Rational Human Behavior in a Social Setting. New York: Wiley, 1957. Frederick Mosteller. American Sociological Review, 24 (1959), pp. 409–413. Contains some original data illustrating the changing values of a parameter of a stochastic model (the probability of a new word) as the number of words in the text so far increases. R19 Raoul Naroll. Data Quality Control—A New Research Technique. Prolegomena to a Cross–Cultural Study of Culture Stress. New York: The Free Press, 1962. Frederick Mosteller and E.A. Hammel. Journal of the American Statistical Association, 58 (1963), pp. 835–836. R20 Herbert Solomon, editor. Studies in Item Analysis and Prediction. Stanford Mathematical Studies in the Social Sciences, VI. Stanford, CA: Stanford University Press, 1961.


Frederick Mosteller. Journal of the American Statistical Association, 58 (1963), pp. 1180–1181.
R21 L. Råde, editor. The Teaching of Probability and Statistics. Stockholm: Almqvist & Wiksell, 1970. Proceedings of the first CSMP International Conference co-sponsored by Southern Illinois University and Central Midwestern Regional Educational Laboratory. F. Mosteller. Review of the International Statistical Institute, 39, No. 3 (1971), pp. 407–408.
R22 Frederick Mosteller, Gale Mosteller, and Keith A. Soper. “Knowledge beyond achievement. A seventies perspective on school effects—a review symposium on The Enduring Effects of Education, by Herbert Hyman, Charles Wright, and John Reed.” School Review, 84 (1976), pp. 265–283.
R23 Anthony C. Atkinson and Stephen E. Fienberg, editors. A Celebration of Statistics (The ISI Centenary Volume). New York: Springer-Verlag, 1985. Frederick Mosteller. Journal of the American Statistical Association, 81 (1986), pp. 1118–1119.
R24 Hans Zeisel. Say it with Figures, Sixth Edition. New York: Harper & Row, 1985. Frederick Mosteller. Public Opinion Quarterly, 52 (1988), pp. 274–275.
R25 W.S. Peters. Counting for Something: Statistical Principles and Personalities. New York: Springer-Verlag, 1987. Frederick Mosteller. Metrika, 36 (1989), pp. 61–62.
R26 A. Hawkins, editor. Training Teachers to Teach Statistics, Proceedings of the International Statistical Institute Round Table Conference, Budapest, Hungary, 23–27 July 1988. Voorburg, The Netherlands: International Statistical Institute, 1990. Frederick Mosteller. Short Book Reviews, International Statistical Institute, 11, No. 1 (April 1991), pp. 2–3.

Reprinted from The Annals of Mathematical Statistics (1946), 17, pp. 13–23

1. Unbiased Estimates for Certain Binomial Sampling Problems with Applications

M.A. Girshick, Frederick Mosteller, and L.J. Savage
U.S. Department of Agriculture; Statistical Research Group, Princeton University; and Statistical Research Group, Columbia University

1. Introduction. The purpose of this paper is to present some theorems with applications concerning unbiased estimation of the parameter p (fraction defective) for samples drawn from a binomial distribution. The estimate constructed is applicable to samples whose items are drawn and classified one at a time until the number of defectives i, and the number of nondefectives j, simultaneously agree with one of a set of preassigned number pairs. When this agreement takes place, the sampling operation ceases and an unbiased estimate of the proportion p of defectives in the population may be made. Some examples of this kind of sampling are ordinary single sampling in which n items are observed and classified as defective or nondefective; curtailed single sampling where it is desired to cease sampling as soon as the decision regarding the lot being inspected can be made, that is as soon as the number of defectives or nondefectives attain one of a fixed pair of preassigned values; double, multiple, and sequential sampling. In the cases of double and multiple sampling the subsamples may be curtailed when a decision is reached, while for sequential sampling the process may be truncated, i.e. an upper bound may be set on the amount of sampling to be done. In section 3 expressions are given for the unique unbiased estimates of p for single, curtailed single, curtailed double, and sequential sampling. One or two of the illustrative examples of section 3 may be of interest because their rather bizarre results suggest that some estimate other than an unbiased estimate may be preferable; but the discussion of estimates other than unbiased ones is outside the scope of this paper. 

This paper was originally written by Mosteller and Savage. A communication from M.A. Girshick revealed that he had independently discovered for the sequential probability ratio test the estimate pˆ(α) given here and demonstrated its uniqueness. For purposes of publication it seemed appropriate to present the results in a single paper.


2. The estimate of $\hat{p}$. For the purposes of the present paper the word point will refer only to points in the xy-plane with nonnegative integral coordinates. We shall need the following nomenclature. A region R is a set of points containing (0, 0). The point $(x_2, y_2)$ is immediately beyond $(x_1, y_1)$ if either $x_2 = x_1 + 1$, $y_2 = y_1$ or $x_2 = x_1$, $y_2 = y_1 + 1$. A path in R from the point $\alpha_0$ to the point $\alpha_n$ is a finite sequence of points $\alpha_0, \alpha_1, \cdots, \alpha_n$ such that $\alpha_i$ $(i > 0)$ is immediately beyond $\alpha_{i-1}$ and $\alpha_j \in R$ with the possible exception of $\alpha_n$. A boundary point, that is, an element of the boundary B of R, is a point not in R which is the last point $\alpha_n$ of a path from the origin. Accessible points are the points in R which can be reached by paths from the origin, while inaccessible points are the points which cannot be reached by any path from the origin. All points are thus divided into three mutually exclusive categories: accessible, inaccessible, and boundary points. The index of a point is the sum of its coordinates, and the index of a region is the least upper bound of the indices of its accessible points. A finite region is a region for which the indices of the accessible points are less than some number n. In particular a region containing only a finite number of points is finite.

Paths may be thought of as arising by a random process such that a path reaching $\alpha_i = (x, y)$, $\alpha_i \in R$, will be extended to $\alpha_{i+1} = (x, y + 1)$ with probability p or to $\alpha'_{i+1} = (x + 1, y)$ with probability $q = 1 - p$. We exclude p = 0, 1 unless these values are specifically mentioned. When a path is extended to a boundary point of R the process ceases. It is clear from the definitions that for a finite region R, paths from the origin cannot include more points than n + 2 where n is the index of the region. This means that a path from the origin cannot escape from a finite region and that the probability that it strikes some boundary point is unity.
It is clear that each path from the origin to a boundary point or an accessible point has probability $p^y q^x$, if the point has coordinates (x, y). We will need the following statements which are immediate consequences of the discussion above:

A. The probability of a boundary point or an accessible point being included in a path from the origin is $P(\alpha) = k(\alpha) p^y q^x$, where $k(\alpha)$ is the number of paths from the origin to the point. We shall call $P(\alpha)$ the probability of the point.

B. For a finite region $\sum_{\alpha \in B} P(\alpha) = 1$, i.e. the sum of the probabilities of the boundary points is unity.

Any region for which $\sum_{\alpha \in B} P(\alpha) = 1$ will be called a closed region.

Of course, all finite regions are closed; but it is convenient to have a condition such as that supplied by the following theorem guaranteeing the closure of some infinite regions as well.
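The definitions above can be made concrete with a small computation. The following is a minimal sketch, not part of the original paper: the function names `path_counts` and `point_probability`, the cap `max_index`, and the example region are all illustrative assumptions. It counts the paths $k(\alpha)$ level by level (one index at a time) and checks statement B for one small finite region using exact rational arithmetic.

```python
from fractions import Fraction


def path_counts(in_region, start=(0, 0), max_index=50):
    """Count paths from `start` to every point reachable through the region.

    `in_region(x, y)` says whether (x, y) belongs to R. A step goes to
    (x + 1, y) (a nondefective) or to (x, y + 1) (a defective). Paths are
    extended only from points inside R, so the first point outside R that
    a path reaches is a boundary point and the path stops there.
    Returns (accessible, boundary): dicts mapping points to path counts.
    """
    accessible, boundary = {}, {}
    frontier = {start: 1}  # points of the current index with their path counts
    for _ in range(max_index + 1):
        next_frontier = {}
        for (x, y), k in frontier.items():
            if in_region(x, y):
                accessible[(x, y)] = k
                for point in ((x + 1, y), (x, y + 1)):
                    next_frontier[point] = next_frontier.get(point, 0) + k
            else:
                boundary[(x, y)] = k  # the path stops here
        frontier = next_frontier
        if not frontier:
            return accessible, boundary
    raise ValueError("region not exhausted; it may be infinite")


def point_probability(k, x, y, p):
    """P(alpha) = k(alpha) * p**y * q**x for alpha = (x, y)."""
    return k * p**y * (1 - p) ** x


# Hypothetical example region: all points with x < 3 and y < 2.
accessible, boundary = path_counts(lambda x, y: x < 3 and y < 2)
p = Fraction(1, 3)
total = sum(point_probability(k, x, y, p) for (x, y), k in boundary.items())
print(sorted(boundary.items()))
print(total)  # statement B says this sum is exactly 1 for a finite region
```

Exact fractions are used so that the closure check is an equality rather than a floating-point approximation. The level-by-level scan terminates only for finite regions, which is why the sketch raises an error once the assumed cap is reached.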


Theorem 1.¹ A sufficient condition that a region R be closed is that $\liminf_{n \to \infty} A(n)/\sqrt{n} = 0$, where $A(n)$ is the number of accessible points of index n.

Proof. We consider the ascending sequence of finite regions $R_n$, each consisting of the points of R whose indices are less than n. The boundary $B_n$ of $R_n$ can be written as the set theoretic union $K_n \cup A_n$, where $K_n$ is $B_n \cap B$, and $A_n$ are the accessible points of R of index n. If $\alpha \in B_n$ and $P_n(\alpha)$ is the probability of $\alpha$ with respect to $R_n$, it is easily seen that for $\alpha \in K_n$, $P_n(\alpha) = P(\alpha)$. Since every point of B is ultimately contained in the ascending sequence $K_n$,

$$\sum_{\alpha \in B} P(\alpha) = \lim_{n \to \infty} \sum_{\alpha \in K_n} P(\alpha) = \lim_{n \to \infty} \sum_{\alpha \in K_n} P_n(\alpha) \le 1,$$

the inequality being a consequence of statement B. But $\sum_{\alpha \in A_n} P_n(\alpha)$ is monotonically decreasing because $\sum_{\alpha \in K_n} P_n(\alpha)$ is monotonically increasing with n while $\sum_{\alpha \in B_n} P_n(\alpha) = 1$, from statement B. If we can show $\lim_{n \to \infty} \sum_{\alpha \in A_n} P_n(\alpha) = 0$ under the condition of the theorem,

the proof is complete. For any point $\alpha \in A_n$, $P_n(\alpha) = k_n(\alpha) p^y q^{n-y}$ which for fixed p is $O(1/\sqrt{n})$. The sum over $A_n$ is $O(A(n)/\sqrt{n})$ and therefore since the hypothesis of the theorem implies that $A(n)/\sqrt{n}$ attains arbitrarily small values for arbitrarily large values of n, the sum in question decreases monotonically to zero.

Corollary. If the number of accessible points of R of index n is bounded, the region is closed.

That the condition given in Theorem 1 is not a necessary condition may be seen by examining the region R consisting of all points except points of the form (2x + 1, 2y + 1) and (3, 0) and (0, 3).

Theorem 2. If R is closed and R contains S, S is closed.

Proof. The proof is essentially similar to that of Theorem 1.

Any reasonable estimate of p will be a function defined on the boundary points, because the boundary points constitute, so to speak, a sufficient statistic for p. That is, the probability of any path from (0, 0) given the boundary point $\alpha$ at which it terminates is independent of p, and is in fact $1/k(\alpha)$. We shall construct an unbiased estimate of p for closed regions R, that is a function $\hat{p}(\alpha)$, $\alpha \in B$, such that $\sum_{\alpha \in B} \hat{p}(\alpha) P(\alpha) = p$ (absolutely convergent).²

Construction. Let $k^*(\alpha)$ be the number of paths in R from the point (0, 1) to the boundary point $\alpha$, and let $\hat{p}(\alpha) = k^*(\alpha)/k(\alpha)$. We remark that the definitions imply $k^*((0, 1)) = 1$, when (0, 1) is a boundary point.

¹ If it is desired to admit p = 0, 1, the existence of boundary points (x, 0) or (0, y) respectively must be postulated.
² Even if such a sum were p for a region which was not closed, we would not call the estimate an unbiased estimate.


Theorem 3. For any closed region R, $\hat{p}(\alpha)$ is an unbiased estimate of p.

Proof:

$$\sum_{\alpha \in B} \hat{p}(\alpha) P(\alpha) = \sum_{\alpha \in B} \frac{k^*(\alpha)}{k(\alpha)}\, k(\alpha) p^y q^x = \sum_{\alpha \in B} k^*(\alpha) p^y q^x.$$

If (0, 1) is a boundary point, then $k^*((0, 1)) = 1$ and $k^*(\alpha) = 0$, $\alpha \ne (0, 1)$, in which case the sum in question consists of the single term p. If (0, 1) is not a boundary point, consider the region $R'$ obtained by deleting (0, 1) from R, and $k'(\alpha)$, the number of paths in $R'$ from the origin to the boundary point $\alpha$ of R. Then

$$k^*(\alpha) = k(\alpha) - k'(\alpha),$$

$$\sum_{\alpha \in B} k^*(\alpha) p^y q^x = \sum_{\alpha \in B} k(\alpha) p^y q^x - \sum_{\alpha \in B} k'(\alpha) p^y q^x = 1 - \sum_{\alpha \in B} k'(\alpha) p^y q^x.$$

Now $R'$ is closed (Theorem 2); except for (0, 1) every boundary point of $R'$ is easily seen to be a boundary point of R; and $k'(\alpha)$ vanishes except for the boundary points of $R'$. Therefore

$$p + \sum_{\alpha \in B} k'(\alpha) p^y q^x = 1,$$

and the proof is complete.

It is clear from the construction that $0 \le \hat{p}(\alpha) \le 1$; this is rather satisfying, since an estimate of p outside of these bounds would be received with some misgivings. Theorem 3 may be generalized to yield unbiased estimates of linear combinations of functions of the form $p^t q^u$ provided the points (u, t) are not inaccessible points. We need only let the point (u, t) play the role of (0, 1). Even though the point (u, t) is inaccessible it may be possible to represent $p^t q^u$ as a polynomial, none of whose terms correspond to inaccessible points.

It is clear from Theorem 1 that $\hat{p}(\alpha)$ is an unbiased estimate of p for the usual sequential binomial tests, but the computation may be quite heavy. It should be noted that the coordinate system used here differs slightly from the coordinate system customarily used in sequential analysis. The custom is to let the x coordinate represent the number of items inspected, whereas we use it to represent the number of nondefectives; this is the only difference between the coordinates. We understand that in applications the customary procedure seems preferable, but we find the present coordinates more convenient for the purposes of this article.
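The construction $\hat{p}(\alpha) = k^*(\alpha)/k(\alpha)$ is easy to carry out mechanically for a finite region. The sketch below is an illustration only: the helper names `boundary_counts` and `p_hat`, the cap `max_index`, and the choice of example (a fixed sample of four items, i.e. the region x + y < 4) are assumptions made here, not the authors' notation. It computes both path counts, forms the ratio, and confirms the unbiasedness asserted by Theorem 3 for one rational value of p.

```python
from fractions import Fraction


def boundary_counts(in_region, start, max_index=60):
    """Number of paths from `start` to each boundary point it can reach.

    Steps go right (nondefective) or up (defective) and are taken only
    from points inside the region. If `start` itself lies outside the
    region, it is returned as the single boundary point with count 1.
    """
    boundary, frontier = {}, {start: 1}
    for _ in range(max_index + 1):
        next_frontier = {}
        for (x, y), k in frontier.items():
            if in_region(x, y):
                for point in ((x + 1, y), (x, y + 1)):
                    next_frontier[point] = next_frontier.get(point, 0) + k
            else:
                boundary[(x, y)] = k
        frontier = next_frontier
        if not frontier:
            return boundary
    raise ValueError("region not exhausted; it may be infinite")


def p_hat(in_region):
    """Return ({alpha: k*(alpha)/k(alpha)}, {alpha: k(alpha)}) on the boundary."""
    k = boundary_counts(in_region, (0, 0))
    k_star = boundary_counts(in_region, (0, 1))
    estimate = {a: Fraction(k_star.get(a, 0), k[a]) for a in k}
    return estimate, k


# Hypothetical example: stop after exactly 4 observations.
estimate, k = p_hat(lambda x, y: x + y < 4)
for (x, y), value in sorted(estimate.items()):
    print((x, y), value)

# Check sum over the boundary of p_hat * P(alpha) == p exactly.
p = Fraction(2, 7)
expectation = sum(
    estimate[(x, y)] * k[(x, y)] * p**y * (1 - p) ** x for (x, y) in k
)
print(expectation == p)
```

For this particular region every boundary point has index 4, and the ratio reduces to the number of defectives divided by 4. The same two functions can be pointed at any other finite region by changing the predicate.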


In general $\hat{p}$ is not the only unbiased estimate of p. A necessary condition for uniqueness is that the region be simple, that is that all the points between any two accessible points on the line x + y = n be accessible points. In other words no accessible points of index n shall be separated on the line x + y = n by inaccessible points or boundary points.

Theorem 4. A necessary condition that the estimate $\hat{p}$ be the unique unbiased estimate for the closed region R is that R be simple.

Proof. For a region that is not simple we shall construct a function $m(\alpha)$ not identically zero, such that

$$\sum_{\alpha \in B} m(\alpha) P(\alpha) = 0. \tag{1}$$

But $\hat{p}(\alpha) + m(\alpha)$ will be an unbiased estimate of p different from $\hat{p}$. Suppose we have a closed region R which is not simple. We consider the lowest index n where the accessible points are separated. There will be at least one uninterrupted sequence of points between some pair of accessible points that are not accessible points. It is easy to see that all the points of this uninterrupted sequence are boundary points of R. Let this sequence be the points $\alpha_i = (x_0 - i, y_0 + i)$, $i = 0, 1, \cdots, t$, $x_0 + y_0 = n$. To begin the construction of $m(\alpha)$ let $m(\alpha_j) = (-1)^j / k(\alpha_j)$, $0 \le j \le t$. The coordinates of the point $\alpha'$ above the top point of the sequence are $(x_0 - t, y_0 + t + 1)$, and the number of paths from $\alpha'$ to any point $\alpha$ on the boundary is $l'(\alpha)$, where if $\alpha'$ is a boundary point the number of paths $l'(\alpha') = 1$; similarly $\alpha'' = (x_0 + 1, y_0)$ and $l''(\alpha)$ is the number of paths from $\alpha''$ to the boundary point $\alpha$ with the same convention if $\alpha''$ is a boundary point. To complete the construction of $m(\alpha)$, let $m(\alpha) = -[l''(\alpha) + (-1)^t l'(\alpha)]/k(\alpha)$ for boundary points not members of the sequence under consideration. Before proceeding to check equation (1), we show that

$$\sum_{\alpha \in B} l''(\alpha) p^y q^x = p^{y_0} q^{x_0 + 1}; \qquad \sum_{\alpha \in B} l'(\alpha) p^y q^x = p^{y_0 + t + 1} q^{x_0 - t}. \tag{2}$$

Because of symmetry we need only carry out the demonstration for the first sum. If $\alpha''$ is a boundary point $l''(\alpha'') = 1$, and for all other points $\alpha$, $l''(\alpha) = 0$, and the sum is the single term $p^{y_0} q^{x_0 + 1}$. If $\alpha''$ is not a boundary point consider the region obtained by deleting $\alpha''$ from R and the corresponding $k'(\alpha)$, the number of paths from (0, 0) to the boundary points of the new closed region $R'$. Every boundary point of $R'$ except $\alpha''$ is a boundary point of R. Let us extend the definition of $k'(\alpha)$ to the whole boundary of R by defining $k'(\alpha) = 0$ for $\alpha$ not in the boundary $B'$ of $R'$. Then it is easy to see that $k(\alpha) = k'(\alpha'') l''(\alpha) + k'(\alpha)$. Now


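$$1 = \sum_{\alpha \in B} k(\alpha) p^y q^x = k'(\alpha'') \sum_{\alpha \in B} l''(\alpha) p^y q^x + \sum_{\alpha \in B} k'(\alpha) p^y q^x = k'(\alpha'') \sum_{\alpha \in B} l''(\alpha) p^y q^x + 1 - k'(\alpha'') p^{y_0} q^{x_0 + 1},$$

establishing equation (2). We now check that $m(\alpha)$ satisfies equation (1):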

$$\sum_{\alpha \in B} m(\alpha) k(\alpha) p^y q^x = \sum_{j=0}^{t} (-1)^j p^{y_0 + j} q^{x_0 - j} - \sum_{\alpha \in B} l''(\alpha) p^y q^x - (-1)^t \sum_{\alpha \in B} l'(\alpha) p^y q^x$$

$$= \sum_{j=0}^{t} (-1)^j p^{y_0 + j} q^{x_0 - j} - p^{y_0} q^{x_0 + 1} - (-1)^t p^{y_0 + t + 1} q^{x_0 - t}$$

$$= p^{y_0} q^{x_0 - t} \left( \sum_{j=0}^{t} (-1)^j p^j q^{t - j} - q^{t + 1} - (-1)^t p^{t + 1} \right) = 0.$$

Theorem 5. A necessary condition that $\hat{p}(\alpha)$ be a unique unbiased estimate of p for the closed region R is that there be no closed region $R'$ whose boundary is a proper subset of the boundary of R.

Proof. Again supposing that the condition is not satisfied we shall construct a function $m(\alpha)$ not identically zero such that equation (1) is satisfied. Let $k'(\alpha)$ be the number of paths in $R'$ to $\alpha$ in $B'$ of $R'$, understanding, of course, that $k'(\alpha) = 0$ if $\alpha$ is not in $B'$ of $R'$. Consider $m(\alpha) = 1 - k'(\alpha)/k(\alpha)$; $m(\alpha)$ is not identically zero because $k'(\alpha)$ vanishes for at least one $\alpha$, but $k(\alpha)$ does not. From the closure of R and $R'$ it is obvious that $m(\alpha)$ satisfies equation (1).

Two simple examples will suffice to show that neither simplicity nor the condition of Theorem 5 is alone sufficient to insure the uniqueness of $\hat{p}$. The region consisting of the points whose coordinates are given in the following configuration and whose boundary points are

$$\begin{array}{lllll}
\text{x} \\
(0,3) & \text{x} \\
(0,2) & \text{x} \\
(0,1) & (1,1) & \text{x} & \text{x} \\
(0,0) & (1,0) & (2,0) & (3,0) & \text{x}
\end{array}$$

indicated by the x’s satisfies the condition of Theorem 5 but is not simple. On the other hand the region consisting of all points for which y < 3, except for the two points (1, 0), (1, 1) is simple but does not satisfy the conditions


of Theorem 5, because the region consisting of all points except (1, 0) with y < 3 can play the role of $R'$. The authors are unable to decide whether the two conditions together guarantee the uniqueness of $\hat{p}$ as an unbiased estimate of p, and supply the following sufficient condition which is adequate for many practical purposes.

Theorem 6. A sufficient condition that a closed region have $\hat{p}(\alpha)$ a unique unbiased estimate of p is that the region be simple and that there exist g, h ($0 < g, h \le 1$) such that for all boundary points $|gx - hy| < M$.

Proof. If there were an unbiased estimate of p different from $\hat{p}$, subtracting it from $\hat{p}$ would yield an equation of the form (sum absolutely convergent):

$$\sum_{\alpha \in B} m(\alpha) p^y q^x = 0, \tag{3}$$

where $m(\alpha)$ is not identically zero. But this will be shown to be impossible. If $m(\alpha)$ were not identically zero, there would be an $\alpha_0$ such that $m(\alpha_0) \ne 0$ and 1) $m(\alpha) = 0$ for all boundary points of index less than that of $\alpha_0$, and 2) one of the coordinates of $\alpha_0$ is less than the corresponding coordinate of any other boundary point for which $m(\alpha) \ne 0$. This follows easily from the simplicity requirement which implies that the boundary points of index n are broken into two sets a) those whose y coordinates are less than the y coordinates of the accessible points of index n, and b) those whose x coordinates are less than the x coordinates of the accessible points of index n.³ Since the situations a) and b) are symmetrical we suppose without loss of generality that $\alpha_0$ is a boundary point whose y coordinate is less than that of any other boundary point with $m(\alpha) \ne 0$. Equation (3) may be written

$$m(\alpha_0) p^{y_0} q^{x_0} + p^{y_0 + 1} \sum_{\alpha \in B,\ \alpha \ne \alpha_0} m(\alpha) p^{y - y_0 - 1} q^x = 0, \tag{4}$$

where the exponents appearing in the sum are nonnegative. But it will be shown that for sufficiently small p

$$q^{x_0} |m(\alpha_0)| > p \left| \sum_{\alpha \in B,\ \alpha \ne \alpha_0} m(\alpha) p^{y - y_0 - 1} q^x \right|, \tag{5}$$

which contradicts equation (4). Now

$$\left| \sum m(\alpha) p^{y - y_0 - 1} q^x \right| \le \sum |m(\alpha)| p^{y - y_0 - 1} q^x \tag{6}$$

$$\le \sum |m(\alpha)| p^{y - y_0 - 1} q^{x - (h y_0 + h + M + gx - hy)/g} = q^{-M/g} \sum |m(\alpha)| \left(p q^{h/g}\right)^{y - y_0 - 1} \le q^{-[h(y_0 + 1) + 2M]/g} \sum |m(\alpha)| p^{y - y_0 - 1} q^x,$$

where all the summations range over the values indicated in (5). The summation indicated in (5) is thus seen to be dominated by a convergent power series in $p q^{h/g}$.

³ It will be seen as the proof proceeds that if there are no boundary points to which alternative a) applies, the restriction g > 0 may be removed and replaced by g ≥ 0; similarly if there are no boundary points to which b) applies the condition h > 0 may be replaced by h ≥ 0.

Thus Theorem 6 shows that $\hat{p}$ is a unique estimate for the sequential binomial tests.

Theorem 7. A necessary and sufficient condition that $\hat{p}$ be the unique unbiased estimate of p for a closed finite region R is that R be simple.

Proof. The proof follows immediately from Theorems 4 and 6.

3. Applications and illustrative examples

A. Single sampling. In single sampling a random sample of n items is drawn from a lot containing items each of which is either defective or nondefective. It is customary to estimate p, the proportion defective, by the unbiased estimate i/n, where i is the number of defectives observed. The boundary of the region defined by a single sampling plan consists

of all points of index n. Now $k((n - i, i)) = \binom{n}{i}$ and $k^*((n - i, i)) = \binom{n-1}{i-1}$. Consequently the unique unbiased estimate of p is

$$\hat{p}((n - i, i)) = \binom{n-1}{i-1} \Big/ \binom{n}{i} = i/n,$$

the result above. It may be of interest to note that an unbiased estimate of the variance pq/n of the proportion $\hat{p}$ is

$$\frac{1}{n} \binom{n-2}{i-1} \Big/ \binom{n}{i} = \frac{i(n - i)}{n^2 (n - 1)}, \quad (n > 1);$$

this estimate is obtained by the method suggested immediately following Theorem 3.

B. Curtailed single sampling. In single sampling schemes, there is usually given a rejection number c as well as the sample size n. If c or more defectives are found in the sample the lot is rejected, but if less than c defectives are found in the sample the lot is accepted. It is customary to inspect all the items in the sample even if the final decision to accept or reject the lot is known before the completion of the inspection of the sample. One reason sometimes mentioned for this procedure is that an unbiased estimate for p is not known when the inspection is halted as soon as a decision is reached. We provide the unbiased estimate in the following paragraph.

In curtailed single sampling the boundary points when rejecting are (x, c), c + x ≤ n, when accepting (n − c + 1, y), y ≤ c − 1. The region is a rectangular array and obviously simple. The unique unbiased estimate along the horizontal line corresponding to rejection with c > 1 therefore is

$$\hat{p}((x, c)) = \binom{c - 2 + x}{c - 2} \Big/ \binom{c + x - 1}{c - 1} = \frac{c - 1}{c + x - 1},$$

or in words, one less than the number of defectives observed divided by one less than the number of observations. The unique unbiased estimate along the vertical line corresponding to acceptance for c > 1 is

$$\hat{p}((n - c + 1, y)) = \binom{n - c + y - 1}{n - c} \Big/ \binom{n - c + y}{n - c} = \frac{y}{n - c + y},$$

that is, the number of defectives observed divided by one less than the number of observations.

We reserved the case c = 1 because it is rather illuminating. The construction of Theorem 3 works as usual, and we note that $\hat{p}((0, 1)) = 1$, $\hat{p}((n, 0)) = 0$ as we might expect, but $\hat{p}((x, 1)) = 0$, 0 < x < n. It is somewhat startling to find that the only unbiased estimate of p for curtailed single sampling with c = 1 provides zero estimates unless a defective is observed on the first item. We remark that the variance of this estimate is pq. In other words, curtailed single sampling with c = 1 is no better for estimation purposes than a sample of size one when the unbiased estimate $\hat{p}$ is used.

A limiting case of curtailed sampling when n is unbounded has been considered by Haldane⁴ as a useful technique in connection with estimates of the frequency of occurrence of rare events. The region would not be closed unless p = 0 were excluded. In our nomenclature there is a “rejection number” c (c > 1), and we continue sampling and inspecting until c defectives have been observed. The unbiased estimate⁵ is (c − 1)/(j − 1), where j is the total number of observations, and of course this is the estimate given by Haldane.

⁴ J.B.S. Haldane, Nature, Vol. 155 (1945), No. 3924.
⁵ For the uniqueness, see footnote 4.

C. A general curtailed double sampling plan. The following example will illustrate the sort of calculations involved in computing $\hat{p}$ for multiple and sequential plans. A sample of size $n_1$ is drawn and items are inspected until 1) $r_1$ ($1 < r_1 \le n_1$) defectives are found, or 2) $n_1 - a + 1$ ($a \ge 0$) nondefectives are found, or 3) the sample is exhausted with neither of these events occurring. If case 3) arises, a second sample of size $n_2$ is drawn and inspection proceeds until a grand total of $r_2$ ($r_1 \le r_2 \le n_1 + n_2$) defectives is found or $n_1 + n_2 - r_2 + 1$ nondefectives are found. In this scheme we call $r_1$ and $r_2$ rejection numbers and a an acceptance number. The unique unbiased estimate $\hat{p}$ is as follows:

$$\hat{p}((j, r_1)) = \frac{r_1 - 1}{r_1 + j - 1}, \qquad j = 0, 1, \cdots, n_1 - r_1; \tag{a}$$

$$\hat{p}((n_1 - a + 1, i)) = \frac{i}{n_1 - a + i}, \qquad i = 0, 1, \cdots, a; \tag{b}$$


$$\hat{p}((x, r_2)) = \frac{\sum \binom{x_0 + y_0 - 1}{x_0} \binom{x - x_0 + r_2 - y_0 - 1}{r_2 - y_0 - 1}}{\sum \binom{x_0 + y_0}{x_0} \binom{x - x_0 + r_2 - y_0 - 1}{r_2 - y_0 - 1}}, \qquad n_1 - r_1 < x \le n_1 + n_2; \tag{c}$$

$$\hat{p}((n_1 + n_2 - r_2 + 1, y)) = \frac{\sum \binom{x_0 + y_0 - 1}{x_0} \binom{n_1 + n_2 - r_2 + y - y_0 - x_0}{y - y_0}}{\sum \binom{x_0 + y_0}{x_0} \binom{n_1 + n_2 - r_2 + y - y_0 - x_0}{y - y_0}}, \qquad a < y \le n_1 + n_2; \tag{d}$$

where the summations extend from $y_0 = a + 1$ to $y_0 = r_1 - 1$, and $x_0 + y_0 = n_1$. In the above equations (a) and (b) are the estimates corresponding to rejection and acceptance on the basis of the first sample, while (c) and (d) correspond to rejection and acceptance when a second sample has been drawn.

Rather than use the sums indicated in (c) and (d), some may find it preferable to make the estimation entirely on the basis of the first sample. If there is no curtailing, the procedure of estimation is equivalent to single sampling, and the estimate is again $i/n_1$ as mentioned in paragraph A above. If the first sample is curtailed and the estimate is made on the basis of the results of the first sample only, the unique unbiased estimate is given by formula (a) when rejecting, by formula (b) when accepting, and by $i/n_1$ when a second sample is to be drawn. It will be noted that (a) and (b) are identical with the expressions derived in paragraph B over the range of values for which they are valid.

D. The sequential probability ratio test. Using the nomenclature of sequential analysis,⁶ the criterion for a decision is given by two parallel straight lines in the dn-plane

$$d_1 = h_1 + sn \ \text{(lower line)}, \qquad d_2 = h_2 + sn \ \text{(upper line)}, \tag{7}$$

where d is the number of defectives and n is the number of observations. The acceptance and rejection numbers for any n are given by $a_n$ and $r_n$, respectively, where $a_n$ is the largest positive integer less than or equal to $d_1$, and $r_n$ is the smallest integer greater than or equal to $d_2$. We let $k_a(n)$ be the number of paths from the origin which end in a decision to accept on the nth observation; $k_r(n)$ is similarly defined when rejection occurs on the nth observation. We also require an auxiliary sequential test with acceptance and rejection numbers $a'_{n-1} = a_n - 1$, $r'_{n-1} = r_n - 1$ (which is equivalent to replacing $h_1$ and $h_2$ by $h_1 + 1 - s$ and $h_2 - 1 + s$ in the equations (7)), and with $k'_a(n)$ and $k'_r(n)$ the number of paths from the origin which lead to acceptance or rejection on the nth observation for the new test. A graphical comparison of the two plans shows that: The unique unbiased estimate of p is $\hat{p}(n) = k'_a(n - 1)/k_a(n)$ when the original test leads to a decision to accept, and $\hat{p}(n) = k'_r(n - 1)/k_r(n)$ when the original test leads to a decision to reject on the nth observation.

⁶ See, for example, Sequential Analysis of Statistical Data: Applications, Section 2, Columbia University Press, 1945.

E. Regions with narrow throats. Let us consider the case of a closed region which has only one accessible point of index n, n > 0 (n being the lowest index not zero at which this phenomenon occurs). The number of paths from the origin to this accessible point $\alpha'$ we will denote m, while the number of paths from $\alpha'$ to $\alpha$, boundary points of index greater than n, will be denoted $l(\alpha)$. Then the total number of paths to $\alpha$ from the origin is $m\,l(\alpha)$. We use the construction preceding Theorem 3 to get $\hat{p}(\alpha)$. The number of paths from (0, 1) to $\alpha$ is similarly $m^* l(\alpha)$, so for such points $\hat{p}(\alpha) = m^*/m$. In other words, if a closed region has a narrow throat such as that described, $\hat{p}(\alpha)$ for $\alpha$ of index higher than that of the accessible point $\alpha'$ are independent of the shape of the region beyond the line x + y = n, and in fact they are all identical. The curtailed single sample with c = 1 is a particular case of a region with a narrow throat.

4. Estimation based on data from several experiments. In the previous discussion we have been concerned with estimation based on the result of a single experiment. Various kinds of acceptance sampling plans have been suggested as examples of the possible experiments. Acceptance sampling is one of many activities where data toward the estimation of p are often accumulated in a series of experiments. It has been pointed out by John Tukey that when information is available from several experiments the estimate $\hat{p}$ will no longer be the unique unbiased estimate of p.
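Before turning to several experiments, the single-experiment results of section 3 for curtailed single sampling can be checked numerically. The sketch below is illustrative only: the helper names `boundary_counts` and `curtailed_estimates`, the cap `max_index`, and the particular plans (n = 6, c = 3 and n = 5, c = 1) are assumptions chosen here. It compares the path-count ratio with the closed forms $(c-1)/(c+x-1)$ on rejection and $y/(n-c+y)$ on acceptance, and then looks at the case c = 1, which section 3E identifies as a region with a narrow throat.

```python
from fractions import Fraction


def boundary_counts(in_region, start, max_index=200):
    """Paths from `start` to each boundary point, stepping right or up inside the region."""
    boundary, frontier = {}, {start: 1}
    for _ in range(max_index + 1):
        next_frontier = {}
        for (x, y), k in frontier.items():
            if in_region(x, y):
                for point in ((x + 1, y), (x, y + 1)):
                    next_frontier[point] = next_frontier.get(point, 0) + k
            else:
                boundary[(x, y)] = k
        frontier = next_frontier
        if not frontier:
            return boundary
    raise ValueError("region not exhausted; it may be infinite")


def curtailed_estimates(n, c):
    """p-hat = k*/k on the boundary of the curtailed single sampling plan (n, c).

    Sampling continues while fewer than c defectives (y) and fewer than
    n - c + 1 nondefectives (x) have been seen.
    """
    def in_region(x, y):
        return y < c and x < n - c + 1

    k = boundary_counts(in_region, (0, 0))
    k_star = boundary_counts(in_region, (0, 1))
    return {a: Fraction(k_star.get(a, 0), k[a]) for a in k}, k


# Compare with the closed forms for a plan with c > 1.
n, c = 6, 3
est, _ = curtailed_estimates(n, c)
reject_ok = all(est[(x, c)] == Fraction(c - 1, c + x - 1) for x in range(n - c + 1))
accept_ok = all(est[(n - c + 1, y)] == Fraction(y, n - c + y) for y in range(c))
print(reject_ok, accept_ok)

# The case c = 1: only a defective on the first item gives a nonzero estimate.
est1, k1 = curtailed_estimates(5, 1)
print(sorted(est1.items()))
p = Fraction(1, 4)
second_moment = sum(
    e**2 * k1[(x, y)] * p**y * (1 - p) ** x for (x, y), e in est1.items()
)
variance = second_moment - p**2
print(variance == p * (1 - p))
```

The last comparison reflects the remark that, with c = 1, the variance of the estimate is pq whatever the value of n.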
Little has been done on this problem of combining information from several experiments, but to illustrate the point, we will discuss a very simple example in terms of acceptance sampling. Let us suppose that two large lots of the same size are inspected according to the following curtailed single sampling plan: if a defective occurs at the first or second observation, sampling is stopped and the lot is rejected; if the first two items inspected are nondefective, we accept the lot. The total number of defective and of nondefective items in the two samples form a sufficient statistic for p. In a single application of the sampling plan the boundary points with their probabilities are (0, 1), $p$; (1, 1), $pq$; (2, 0), $q^2$. From this information we can generate the possible totals of defectives and of nondefectives which may arise when samples are drawn from two lots, with their probabilities by expanding

$$(p + pq + q^2)^2 = p^2 + p^2 q^2 + q^4 + 2p^2 q + 2pq^2 + 2pq^3, \tag{8}$$

where a term on the right of the form $m p^y q^x$ is the probability that in two samples there will be x nondefectives and y defectives altogether. On the basis of the observed number pair (x, y), which may be regarded as a possible terminal point $\alpha$ for the two experiments performed successively, we wish to form an unbiased estimate $e((x, y)) = e(\alpha)$. For the estimate e to be unbiased the condition $\sum e(\alpha) P(\alpha) = p$ must be satisfied, where in the present example the $P(\alpha)$ are the six terms on the right of equation (8), and the $e(\alpha)$ are the estimates with which the six probabilities are associated. In the example under consideration the condition for unbiasedness will be satisfied if and only if $e((0, 2)) = 1$, $e((4, 0)) = 0$, $e((1, 2)) = \tfrac{1}{2}$, $e((2, 1)) = [1 - e((2, 2))]/2$, $e((3, 1)) = e((2, 2))/2$. Consequently a one parameter family of unbiased estimates is available. Unfortunately the popular condition that the variance be a minimum depends on the true value of p; in fact the variance is minimized just when $e((2, 2)) = 1/(2 + p)$. So an unbiased estimate of uniformly minimum variance does not exist. In practical applications to acceptance sampling one might meet this difficulty by choosing a value of p near zero for such a minimization scheme. However it is clear that the last word has yet to be said about how best to estimate p when one is faced with the results of several experiments.

5. Conclusion. We would like to call attention to a few problems raised by but not solved in this paper: 1) find a necessary and sufficient condition that $\hat{p}$ be the unique unbiased estimate for p; 2) suggest criteria for selecting one unbiased estimate when more than one is possible; 3) evaluate the variance of $\hat{p}$. In this connection, in a forthcoming paper by M.A. Girshick, it will be shown for certain regions, for example for those of the sequential probability ratio test, that the variance of $\hat{p}(\alpha)$ satisfies $\sigma^2_{\hat{p}} \ge pq/E(x + y)$, where $E(x + y)$ is the expected number of observations required to reach a boundary point.
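The two-lot example of section 4 can be verified by direct computation. The following sketch is an illustration, with function names (`terminal_probabilities`, `estimate`, `mean_and_variance`) and the free parameter name `c` for $e((2, 2))$ introduced here as assumptions. It encodes the six terminal probabilities from equation (8) and the one-parameter family of estimates, checks unbiasedness exactly, and compares the variance at $c = 1/(2 + p)$ with the variance at nearby values.

```python
from fractions import Fraction


def terminal_probabilities(p):
    """The six terms of (p + pq + q^2)^2, keyed by (nondefectives, defectives)."""
    q = 1 - p
    return {
        (0, 2): p**2,
        (2, 2): p**2 * q**2,
        (4, 0): q**4,
        (1, 2): 2 * p**2 * q,
        (2, 1): 2 * p * q**2,
        (3, 1): 2 * p * q**3,
    }


def estimate(c):
    """Unbiased family indexed by the free value c = e((2, 2))."""
    half = Fraction(1, 2)
    return {
        (0, 2): Fraction(1),
        (2, 2): c,
        (4, 0): Fraction(0),
        (1, 2): half,
        (2, 1): (1 - c) * half,
        (3, 1): c * half,
    }


def mean_and_variance(c, p):
    """Exact mean and variance of the estimate with parameter c at true value p."""
    probs = terminal_probabilities(p)
    e = estimate(c)
    mean = sum(e[a] * probs[a] for a in probs)
    variance = sum(e[a] ** 2 * probs[a] for a in probs) - mean**2
    return mean, variance


p = Fraction(1, 10)
best_c = 1 / (2 + p)
for c in (Fraction(0), best_c, Fraction(1)):
    mean, variance = mean_and_variance(c, p)
    print(c, mean == p, variance)
```

Because the variance is a quadratic in `c` whose minimizing value moves with `p`, repeating the loop for a different `p` shifts the best choice, which is the point made in the text about the absence of a uniformly minimum variance unbiased estimate.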

Reprinted from The Annals of Mathematical Statistics (1946), 17, pp. 377–408

2. On Some Useful “Inefficient” Statistics Frederick Mosteller Princeton University

Summary. Several statistical techniques are proposed for economically analyzing large masses of data by means of punched-card equipment; most of these techniques require only a counting sorter. The methods proposed are designed especially for situations where data are inexpensive compared to the cost of analysis by means of statistically “efficient” or “most powerful” procedures. The principal technique is the use of functions of order statistics, which we call systematic statistics. It is demonstrated that certain order statistics are asymptotically jointly distributed according to the normal multivariate law. For large samples drawn from normally distributed variables we describe and give the efficiencies of rapid methods: i) for estimating the mean by using 1, 2, · · · , 10 suitably chosen order statistics; (cf. p. 86) ii) for estimating the standard deviation by using 2, 4, or 8 suitably chosen order statistics; (cf. p. 89) iii) for estimating the correlation coefficient whether other parameters of the normal bivariate distribution are known or not (three sorting and three counting operations are involved) (cf. p. 94). The efficiencies of procedures ii) and iii) are compared with the efficiencies of other estimates which do not involve sums of squares or products. 1. Introduction. The purpose of this paper is to contribute some results concerning the use of order statistics in the statistical analysis of large masses of data. The present results deal particularly with estimation when normally distributed variables are present. Solutions to all problems considered have been especially designed for use with punched-card equipment although for most of the results a counting sorter is adequate. Until recently mathematical statisticians have spent a great deal of effort developing “efficient statistics” and “most powerful tests.” This concentration of effort has often led to neglect of questions of economy. Indeed some may


have confused the meaning of technical statistical terms “efficient” and “efficiency” with the layman’s concept of their meaning. No matter how much energetic activity is put into analysis and computation, it seems reasonable to inquire whether the output of information is comparable in value to the input measured in dollars, man-hours, or otherwise. Alternatively we may inquire whether comparable results could have been obtained by smaller expenditures. In some fields where statistics is widely used, the collection of large masses of data is inexpensive compared to the cost of analysis. Often the value of the statistical information gleaned from the sample decreases rapidly as the time between collection of data and action on their interpretation increases. Under these conditions, it is important to have quick, inexpensive methods for analyzing data, because economy demands militate against the use of lengthy, costly (even if more precise) statistical methods. A good example of a practical alternative is given by the control chart method in the field of industrial quality control. The sample range rather than the sample standard deviation is used almost invariably in spite of its larger variance. One reason is that, after brief training, persons with slight arithmetical knowledge can compute the range quickly and accurately, while the more complicated formula for the sample standard deviation would create a permanent stumbling block. Largely as a result of simplifying and routinizing statistical methods, industry now handles large masses of data on production adequately and profitably. Although the sample standard deviation can give a statistically more efficient estimate of the population standard deviation, if collection of data is inexpensive compared to cost of analysis and users can compute a dozen ranges to one standard deviation, it is easy to see that economy lies with the less efficient statistic. 
It should not be thought that inefficient statistics are being recommended for all situations. There are many cases where observations are very expensive, and obtaining a few more would entail great delay. Examples of this situation arise in agricultural experiments, where it often takes a season to get a set of observations, and where each observation is very expensive. In such cases the experimenters want to squeeze every drop of information out of their data. In these situations inefficient statistics would be uneconomical, and are not recommended. A situation that often arises is that data are acquired in the natural course of administration of an organization. These data are filed away until the accumulation becomes mountainous. From time to time questions arise which can be answered by reference to the accumulated information. How much of these data will be used in the construction of say, estimates of parameters, depends on the precision desired for the answer. It will however often be less expensive to get the desired precision by increasing the sample size by dipping deeper into the stock of data in the files, and using crude techniques of analysis, than to attain the required precision by restricting the sample size to the minimum necessary for use with “efficient” statistics.


It will often happen in other fields such as educational testing that it is less expensive to gather enough data to make the analysis by crude methods sufficiently precise, than to use the minimum sample sizes required by more refined methods. In some cases, as a result of the type of operation being carried out, sample sizes are more than adequate for the purposes of estimation and testing significance. The experimenters have little interest in milking the last drop of information out of their data. Under these circumstances statistical workers would be glad to forsake the usual methods of analysis for rapid, inexpensive techniques that would offer adequate information, but for many problems such techniques are not available. In the present paper several such techniques will be developed. For the most part we shall consider statistical methods which are applicable to estimating parameters. In a later paper we intend to consider some useful “inefficient” tests of significance.

2. Order statistics. If a sample On = x1, x2, · · · , xn of size n is drawn from a continuous probability density function f(x), we may rearrange and renumber the observations within the sample so that

$$x_1 < x_2 < \cdots < x_n \tag{1}$$

(the occurrence of equalities is not considered because continuity implies zero probability for such events). The xi’s are sometimes called order statistics. On occasion we write x(i) rather than xi. Throughout this paper the use of primes on subscripted x’s indicates that the observations are taken without regard to order, while unprimed subscripted x’s indicate that the observations are order statistics satisfying (1). Similarly x(ni) will represent the ni th order statistic, while x′(ni) would represent the ni th observation, if the observations were numbered in some random order. The notation here is essentially the opposite of usual usage, in which attention is called to the order statistics by the device of primes or the introduction of a new letter. The present reversal of usage seems justified by the viewpoint of the article: that in the problems under consideration the use of order statistics is the natural procedure. An example of a useful order statistic is the median; when n = 2m + 1 (m = 0, 1, · · · ), xm+1 is called the median and may be useful to estimate the population median, i.e. u defined by

$$\int_{-\infty}^{u} f(t)\,dt = \tfrac{1}{2}.$$

In the case of symmetric distributions, the population mean coincides with u and xm+1 will be an unbiased estimate of it as well. When n = 2m (m = 1, 2, · · · ), the median is often defined as ½(xm + xm+1). The median so defined is an unbiased estimate of the population median in the case of symmetric distributions; however for most asymmetric distributions ½(xm + xm+1) will only be unbiased asymptotically, that is in the limit as n increases without bound. For another definition of the sample median see Jackson [8, 1921]. When x is distributed according to the normal distribution

$$N(x, a, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(1/2\sigma^2)(x-a)^2},$$

the variance of the median is well known to tend to πσ²/2n as n increases. It is doubtful whether we can accurately credit anyone with the introduction of the median. However for some of the results in the theory of order statistics it is easier to give credit. In this section we will restrict the discussion to the order statistics themselves, as opposed to the general class of statistics, such as the range (xn − x1), which are derived from order statistics. We shall call the general class of statistics which are derived from order statistics, and use the value ordering (1) in their construction, systematic statistics.

The large sample distribution of extreme values (examples xr, xn−s+1 for r, s fixed and n → ∞) has been considered by Tippett [17, 1925] in connection with the range of samples drawn from normal populations; by Fisher and Tippett [3, 1928] in an attempt to close the gap between the limiting form of the distribution and results tabled by Tippett [17]; by Gumbel [5, 1934] (and in many other papers; a large bibliography is available in [6, Gumbel 1939]), who dealt with the more general case r ≥ 1, while the others mentioned considered the special case of r = 1; and by Smirnoff, who considers the general case of xr in [15, 1935] and also [16] the limiting form of the joint distribution of xr, xs, for r and s fixed as n → ∞. In the present paper we shall not usually be concerned with the distribution of extreme values, but shall rather be considering the limiting form of the joint distribution of x(n1), x(n2), · · · , x(nk), satisfying

Condition 1.
$$\lim_{n\to\infty} \frac{n_i}{n} = \lambda_i; \qquad i = 1, 2, \cdots, k; \qquad \lambda_1 < \lambda_2 < \cdots < \lambda_k.$$

In other words the proportion of observations less than or equal to x(ni) tends to a fixed proportion which is bounded away from 0 and 1 as n increases. K. Pearson [13, 1920] supplies the information necessary to obtain the limiting distribution of x(n1), and limiting joint distribution of x(n1), x(n2).
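The limiting variance πσ²/2n quoted above for the median of a normal sample can be checked roughly by simulation. The following is only a minimal sketch: the function name, the sample size n = 101, the number of repetitions, and the seed are illustrative assumptions and do not come from the paper.

```python
import math
import random
import statistics


def median_variance_ratio(n=101, reps=2000, sigma=1.0, seed=1):
    """Simulated variance of the sample median divided by pi*sigma^2/(2n).

    n, reps and seed are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    medians = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        medians.append(statistics.median(sample))
    simulated = statistics.pvariance(medians)
    asymptotic = math.pi * sigma ** 2 / (2 * n)
    return simulated / asymptotic


ratio = median_variance_ratio()
print(f"simulated / asymptotic variance of the median: {ratio:.3f}")

# The mean of the same sample has variance sigma^2/n, so the large-sample
# efficiency of the median relative to the mean is (sigma^2/n)/(pi*sigma^2/2n).
print(f"large-sample efficiency of the median: {2 / math.pi:.3f}")
```

A ratio near 1 is what the asymptotic statement leads one to expect for a sample of this size; the exact figure depends on the seed and the number of repetitions.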
Smirnoff gives more rigorous derivations of the limiting form of the marginal distribution of the x(ni) [15, 1935] and the limiting form of the joint distribution of x(ni) and x(nj) [16] under rather general conditions. Kendall [10, 1943, pp. 211-14] gives a demonstration leading to the limiting form of the joint distribution.

Since we will be concerned with statements about the asymptotic properties of the distributions of certain statistics, it may be useful to include a short discussion of their implications both practical and theoretical. If we have a statistic θ̂(On) based on a sample On: x′1, x′2, · · · , x′n drawn from a population with cumulative distribution function F(x), it often happens that the function (θ̂ − θ)/σn = yn, where σn is a function of n, is such that

$$\lim_{n\to\infty} P(y_n < t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-\frac{1}{2}x^2}\,dx. \tag{A}$$

When this condition (A) is satisfied we often say: θ̂ is asymptotically normally distributed with mean θ and variance σn². We will not be in error if we use the statement in italics provided we interpret it as synonymous with (A). However there are some pitfalls which must be avoided. In the first place condition (A) may be true even if the distribution function of yn, or of θ̂, has no moments even of fractional orders for any n. Consequently we do not imply by the italicized statement that $\lim_{n\to\infty} E[\hat\theta(O_n)] = \theta$, nor that $\lim_{n\to\infty} \{E(\hat\theta^2) - [E(\hat\theta)]^2\} = \sigma_n^2$, for, as mentioned, these expressions need not exist for (A) to be true. Indeed we shall demonstrate that Condition (A) is satisfied for certain statistics even if their distribution functions are as momentless as the startling distributions constructed by Brown and Tukey [1, 1946]. Of course it may be the case that all moments of the distribution of θ̂ exist and converge as n → ∞ to the moments of a normal distribution with mean θ and variance σn². Since this implies (A), but not conversely, this is a stronger convergence condition than (A). (See for example J.H. Curtiss [2, 1942].) However the important implication of (A) is that for sufficiently large n each percentage point of the distribution of θ̂ will be as close as we please to the value which we would compute from a normal distribution with mean θ and variance σn², independent of whether the distribution of θ̂ has these moments or not.

Similarly if we have several statistics θ̂1, θ̂2, · · · , θ̂k, each depending upon the sample On: x1, x2, · · · , xn, we shall say that the θ̂i are asymptotically jointly normally distributed with means θi, variances σi²(n), and covariances ρij σi σj, when

$$\lim_{n\to\infty} P(y_1 < t_1, y_2 < t_2, \cdots, y_k < t_k) = K \int_{-\infty}^{t_1} \int_{-\infty}^{t_2} \cdots \int_{-\infty}^{t_k} e^{-\frac{1}{2}Q^2}\, dx_1\, dx_2 \cdots dx_k, \tag{B}$$

where yi = (θ̂i − θi)/σi, Q² is the quadratic form associated with a set of k jointly normally distributed variables with variances unity and covariances ρij, and K is a normalizing constant. Once again the statistics θ̂i may not have moments or product moments; the point that interests us is that the probability that the point with coordinates (θ̂1, θ̂2, · · · , θ̂k) falls in a certain region in a k-dimensional space can be given as accurately as we please for sufficiently large samples by the right side of (B). Since the practicing statistician is very often really interested in the probability that a point will fall in a particular region, rather than in the variance or standard deviation of the distribution itself, the concepts of asymptotic normality given in (A) and (B) will usually not have unfortunate consequences. For example, the practicing statistician will usually be grateful that the sample size can be made sufficiently large that the probability of a statistic falling into a certain small interval can be made as near unity as he pleases, and will not usually be concerned with the fact that, say, the variance of the statistic may be unbounded.


Of course, a very real question may arise: how large must n be so that the probability of a statistic falling within a particular interval can be sufficiently closely approximated by the asymptotic formulas? If in any particular case the sample size must be ridiculously large, asymptotic theory loses much of its practical value. However for statistics of the type we shall usually discuss, computation has indicated that in many cases the asymptotic theory holds very well for quite small samples. For the demonstration of the joint asymptotic normality of several order statistics we shall use the following two lemmas.

Lemma 1. If a random variable θ̂(On) is asymptotically normally distributed converging stochastically to θ, and has asymptotic variance σ²(n) → 0 as n → ∞, where n is the size of the sample On: x1, x2, · · · , xn, drawn from the probability density function h(x), and g(θ̂) is a single-valued function with a nonvanishing continuous derivative g′(θ̂) in the neighborhood of θ̂ = θ, then g(θ̂) is asymptotically normally distributed converging stochastically to g(θ) with asymptotic variance σn²[g′(θ)]².

Proof. By the conditions of the lemma [...]

[...] −0.98) will occur in 80% of the cases, (iii) a value further from the center (that is, |t| > 0.98) will occur in 40% of the cases. The two-sided statement is apparently simple to make in the case of t, since the distribution concerned is symmetrical and continuous. In asymmetrical cases, we will follow the path of least resistance and calculate the two-sided significance level as twice the (smaller) one-sided significance level, just as 40% = 2(20%). (Some statisticians might disagree.) The choice of one among the three significance levels in a practical situation will depend on the alternatives considered for the situation against which the evidence is being assessed.
The discrete case

In the binomial situation, and in the many other cases where the result obtained proceeds in definite steps, the situation is a little more complex. Consider the case of a sample of 4 from a 2-1 split. The probabilities of the various possible outcomes are given below to 3 decimal places.

Outcome:                        (0,4)   (1,3)   (2,2)   (3,1)   (4,0)
Frequency:                       .012    .099    .296    .395    .198
Cumulated from left to right:    .012    .111    .407    .802   1.000
Cumulated from right to left:   1.000    .988    .889    .593    .198
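The entries of this table follow from the binomial formula and can be reproduced in a few lines. This is a minimal sketch; the function name `binomial_table` and the convention that `p_first` is the chance of falling in the first category (2/3 for the 2-1 split) are assumptions made for the illustration.

```python
from math import comb


def binomial_table(n, p_first):
    """Probabilities of the paired counts (k, n - k) for k = 0..n, together
    with their cumulations from the left and from the right."""
    probs = [comb(n, k) * p_first ** k * (1 - p_first) ** (n - k)
             for k in range(n + 1)]
    left, total = [], 0.0
    for pr in probs:                 # cumulated from left to right
        total += pr
        left.append(total)
    right, total = [], 0.0
    for pr in reversed(probs):       # cumulated from right to left
        total += pr
        right.append(total)
    right.reverse()
    return probs, left, right


probs, left, right = binomial_table(4, 2 / 3)
for label, row in (("Frequency", probs),
                   ("Cumulated from left to right", left),
                   ("Cumulated from right to left", right)):
    print(f"{label:30s}", "  ".join(f"{value:.3f}" for value in row))
```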

What significance level shall we assign to (1, 3) in this case? The conventional answers are that it lies at: (i) the lower 11.1% point, (ii) the upper 98.8% point, (iii) the two-sided 22.2% point. These statements mean that, if the situation were binomial with a 2-1 split, (i) one of the outcomes (0, 4) or (1, 3) occurs in 11.1% of all cases, (ii) one of the outcomes (1, 3), (2, 2), (3, 1), (4, 0) occurs in 98.8% of all cases, (iii) it is reasonable to act as though an outcome deviating from expectation as much as or more than (1, 3) in one direction or the other occurs in 22.2% of all cases. An alternative approach, and one which supplies more information, is to attach to such a result as (1, 3) not a single significance level, but a significance zone. Thus the lower significance zone for (1, 3) is “11.1% to 1.2%” and the two-sided significance zone is 22.2% to 2.4% (where we have adopted the convention of doubling in passing from one-sided to two-sided significance zones). The statement “(1, 3) is at the lower (11.1%, 1.2%) zone” means that, if the simple binomial situation with split 2-1 holds, then (0, 4), which is the

Frederick Mosteller and John W. Tukey 118

Figure 2. Observed and theoretical proportions (Ex. 1), confidence limits for a proportion (Ex. 3), the F test (Ex. 6), designing a binomial experiment (Ex. 8), designing a single sampling plan (Ex. 9), and tolerance limits for a second sample (Ex. 12).

4 Binomial Probability Paper


only outcome further from expectation than (1, 3) in the same direction, will occur in 1.2% of all cases, while (0, 4) and (1, 3) together will occur in 11.1% of all cases.

Interpretation

The working statistician, we believe, would almost always react differently to the statement “the outcome is at the lower (12%, 0.5%) zone” than to the statement “the outcome is at the lower (12%, 11.5%) zone.” The (12%, 0.5%) zone indicates that we are not sure that the strength of the evidence has reached the conventional 5% point, but it is possible that this is the case. The (12%, 11.5%) zone indicates that the strength of the evidence has certainly not reached the conventional 5% point. This distinction is absent from the customary procedure of assessing just a significance level. In some cases, auxiliary experimental data could be used to interpolate between the ends of the significance zone.

As we shall see, binomial probability paper can be used to obtain both the customary significance level and the significance zone. If we are to determine the two percentages needed to fix a significance zone, we shall need to make two measurements on the paper, so it is not surprising that we will want to plot more than a single point to represent a paired count.
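The two ends of a significance zone, and the two measurements on the paper that approximate them, can be sketched numerically. The sketch rests on assumptions not spelled out in this excerpt: that the paper's scales are square-root scales, so that a count (a, b) sits at (√a, √b) and one standard deviation is about half a unit of distance; that a paired count (a, b) is drawn as the triangle (a, b), (a + 1, b), (a, b + 1); and that the millimetre scale is the one implied by the text's "13.1 mm (2.58σ)". All function names are hypothetical.

```python
from math import comb, erfc, sqrt

MM_PER_SIGMA = 13.1 / 2.58  # about 5.08 mm per sigma, from "13.1 mm (2.58 sigma)"


def exact_lower_zone(a, b, p_first):
    """Exact lower significance zone of the paired count (a, b):
    P(first count <= a) and P(first count < a)."""
    n = a + b
    pmf = [comb(n, k) * p_first ** k * (1 - p_first) ** (n - k)
           for k in range(n + 1)]
    beyond = sum(pmf[:a])            # outcomes further from expectation
    return beyond + pmf[a], beyond


def paper_distances(a, b, p_first):
    """Short and long distances (in sigma units) from the split to the
    triangle (a, b), (a + 1, b), (a, b + 1), on assumed square-root scales."""
    cx, cy = sqrt(p_first), sqrt(1 - p_first)     # unit vector along the split
    signed = [2 * (sqrt(x) * cy - sqrt(y) * cx)
              for x, y in ((a, b), (a + 1, b), (a, b + 1))]
    if min(signed) < 0 < max(signed):
        return 0.0, max(abs(s) for s in signed)   # split crosses the triangle
    return min(abs(s) for s in signed), max(abs(s) for s in signed)


def upper_tail(z):
    """One-sided normal tail area beyond z standard deviations."""
    return 0.5 * erfc(z / sqrt(2))


upto, beyond = exact_lower_zone(1, 3, 2 / 3)
short, long_ = paper_distances(1, 3, 2 / 3)
print(f"exact lower zone for (1, 3): {upto:.3f} to {beyond:.3f}")
print(f"short {short:.2f} sigma ({short * MM_PER_SIGMA:.1f} mm), "
      f"long {long_:.2f} sigma ({long_ * MM_PER_SIGMA:.1f} mm)")
print(f"normal approximation to the zone: "
      f"{upper_tail(short):.3f} to {upper_tail(long_):.3f}")
```

For the count (1, 3) under the 2-1 split the exact zone is the 11.1% to 1.2% quoted in the text; the normal approximation from the two distances comes out slightly larger at both ends, which is plausible for a sample as small as 4.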

PART II. PLOTTING ONE OBSERVED QUANTITY

A SINGLE BINOMIAL EXAMPLE

Given a sample, sorted into two categories, the numbers in the categories form a paired count, which is plotted as a triangle. If a probability or a population proportion, p, is assigned for the second category, then this is plotted as the q–p split, where q = 1 − p. The approximate significance level corresponds to the short distance from the split to the triangle, and may be obtained from Table 2 (in Part VI). If it is desired to use the significance zone, then both short and long distances should be measured and the probabilities obtained from Table 2. These general principles will suffice to attack the problems posed by the first four examples.

Comparing an observed proportion with a theoretical proportion

Example 1 (See Figure 2). Fisher and Mather [11] have described a genetical experiment in which the individuals in 32 litters of 8 mice were observed for the characteristic of straight versus wavy hair. Under the conditions of the experiment, the Mendelian theory predicted that half the mice would have straight hair. It was observed that n1 = 139 had straight hair, n − n1 = 117 had wavy hair. Could such a discrepancy from the 128-128 split have arisen by chance? The observed paired count is plotted as the triangle (139, 117), (139, 118), (140, 117) and is just distinguishable from a point. The theoretical proportion is plotted as the 128-128 split (the 50% line). The short distance is about 6.7 mm. = 1.3σ. Since deviations from simple Mendelian genetics might reasonably occur in either direction, the two-sided significance level is appropriate, and from Table 2 this is found to be about 20%. Thus this test would not lead us to reject simple Mendelian genetics. Since the two-sided 5% distance is 10 millimeters, which is worth remembering, Table 2 would not have been needed in routine testing, since the result “not significant at the 5% point” would have been enough. We may observe the experimental percentage of straight-haired mice by drawing a vertical line to the horizontal axis from the intersection of the degrees quarter-circle with the line from the observed point to the origin. This gives about 54.5 per cent as compared with an actual value of 54.3 per cent. If the significance zone were desired, then both the short distance of 6.7 mm. and the long distance of 7.5 mm. would have been measured. The corresponding two-sided significance interval is (20%, 14%).

Critique of Example 1. The experimental conditions seem to have been exactly suited for a binomial test. The accuracy of the graphical method is clearly adequate.

The sign test

The classical case of the sign test is the comparison of two materials or treatments, in pairs, where the observations in each pair are comparable except for the materials or treatments being tested. The sign test is a special case of the comparison of theoretical and observed proportions, where the theoretical proportion is always 50%; it has been thoroughly discussed in this Journal by Dixon and Mood [4, 1946].

Example 2 (See Figure 3). Dixon and Mood cite the yields of two lines of hybrid corn where 6, 8, 2, 4, 3, 3, and 2 pairs of plots were available from 7 experiments. In 7 out of the 28 pairs line A yielded higher. If a significance zone is wanted, then both the short distance of 12.9 mm. and the long distance of 14.7 mm. are measured. The resulting two-sided significance zone is (1.2%, 0.3%).

Comment on Example 2.
The tables of Dixon and Mood correctly state that (7, 21) is not at the 1% level. Some statisticians, however, in part because of the general and imprecise considerations which lead to the choice of 1%, may use the extra information in the significance zone and decide that they would rather work at the 1.3% level than the 0.4% level (the precise values are 1.254% and 0.372%). They would then treat (7, 21) as significant at a level of approximately 1%.

Extension of the sign test

As presented above, the sign test applies to the hypothesis of equality in paired experiments. It can be easily extended to cover (i) the hypothesis of a constant additive difference, (ii) the hypothesis of a constant percentage difference, or (iii) the hypothesis of a certain population median by constructing dummy observations. Suppose that five experiments have produced the following numbers:

Set            1    2    3    4    5
Condition A   57   53   49   56   51
Condition B   26   31   24   28   31

There is clearly no need to test for equality. A test of the hypothesis that condition A runs 30 units above condition B may be made by adding 30 to each observation under condition B, which yields 2 positive and 3 negative differences. A test of the hypothesis that condition A gives numbers double those of condition B may be made by doubling the condition B numbers, which yields 2 positive, 1 zero, and 2 negative differences (the corresponding paired count is (2, 2)!). A test of the hypothesis that the median number in condition A is 57 may be made by replacing the results in condition B by an imaginary experiment always giving 57, which yields 1 positive, 1 zero, and 3 negative differences (the corresponding paired count is (1, 3)!).

Confidence limits for population proportions

If we have a sample divided into two categories, we may wish to set confidence limits on the percentage of the population in each category. This is accomplished by plotting the paired count which was observed and then constructing two splits whose short distances from this triangle correspond to the two-sided level (of confidence) required. Thus if 95% confidence is required, short distances of 10 mm. will be used. The coordinates of the intersections of the two splits with the quarter-circle give the confidence limits for percentages.

Example 3 (See Figure 2). A random sample of 500 farms from a certain region yields the information that 143 of the farmers own outright the farms they work. Set 95 per cent confidence limits on the per cent of farms in this region wholly owned by the farmers. We plot the triangle (357, 143), (358, 143), (357, 144), which can, in this case, be approximated by a point. We draw arcs of circles of radius 10 mm. about its extreme vertices, and draw the tangent splits. We get as our estimate of the percentage of farmers who wholly own their farms 28.5 (compared to 28.6 computed) and as our 95% confidence limits, 25 per cent and 33 per cent.

Critique of Example 3.
If the sample is truly random the method is sound. If stratified sampling is correctly done, these confidence limits will be unnecessarily large by an amount which depends on the effectiveness of the strata chosen, for the particular question at hand. The increase in efficiency by stratification is often so small that these limits are a wise choice.

Confidence limits for the population median

By combining the ideas of the sign test and the last example, we may obtain confidence limits for the population median. If x0 is the population median, and is used to divide samples into paired counts, then certain extreme paired counts will occur with probability at most 1 − α, where α is the desired confidence. By determining these unlikely splits, and referring to the observations, it is possible to set confidence limits for the population median. While strict confidence requires taking blocks of paired counts with at most a certain probability, interpolation can usually be carried out with safety.

Example 4 (See Figure 3). The differences between reaction-times of an individual and a control group were 6, 5, 3, 2, 1, −1, −2, −3, −5, −12, −13, −13, −15, −28, when expressed on a logarithmic scale (actually in terms of 100 times the log to the base 10). Within what limits do we have 90% confidence that the median of the population lies? Drawing the 50-50 split, and parallel lines at ±8.4 mm., we find that the triangles for (10, 4) and (4, 10), where 4 + 10 = 10 + 4 = 14 (the number of cases), are cut by these parallel lines about ¾ of the way from their vertices nearest the 50-50 split. We interpolate ¾ of the way from 1 to 2 (these are the fourth and fifth values from the top) and ¾ of the way from −12 to −13 (these are the fourth and fifth values from the bottom) to obtain approximate 90% confidence limits of −12¾ and 1¾ for the median of the population of differences from which the 14 observed differences were a random sample.

Critique of Example 4. The interpolation is based on the fact that the use of 1.001 (say) for a cutting point would give the paired count (4, 10), as would any cut up to 1.999, while 2.001 would give (3, 11). If we take account of the grouping and rounding process we should widen these interpolated values by ½ the grouping interval. Thus if the differences were rounded from more decimals to the nearest tenth, and were actually 6.0, 5.0, 3.0, · · · , −28.0, then we should use −12¾ − 1/20 = −12.8 and 1¾ + 1/20 = 1.8.
If, as might more reasonably have been the case, they had been rounded to integers, we should use −12¾ − ½ = −13.25 and 1¾ + ½ = 2.25.

COMPARISON AND ANALYSIS OF VARIANCE

Comparison

The comparison of two variances calculated for samples from normal populations leads alternatively to Fisher’s z or Snedecor’s F. As explained in Part V, the distribution of these quantities is mathematically related to the binomial distribution. Thus we may use binomial probability paper to make a significance test of the equality of the variances of two normal populations, the only approximation being the approximation of binomial probability paper to the binomial distribution. The test is made by drawing the line or split through the point whose coordinates are the observed sums of squares of deviations from means, and plotting the point whose coordinates are half the numbers of degrees of freedom.

Example 5 (See Figure 3). In volume 1 of Biometrika, Fawcett [8, 1902, p. 442] gives the sample variance of the lengths of 141 male Egyptian skulls as 34.2740 and the variance of the lengths of 187 female skulls as 30.5756. Are these significantly different? The sums of squares of deviations are 4832.6 and 5717.6. The degrees of freedom are 140 and 186. Drawing the split through (48.3, 57.2) and plotting the point (70 = 140/2, 93 = 186/2), we find a distance of 3.7 mm, which is near a (two-sided) significance level of 45%. Thus there is no evidence of a difference in variance.

Critique of Example 5. The dangers in such a test lie far more in the possible lack of normality of the populations of practical experience than in the approximations of binomial probability paper. Such comparisons of variances from independent samples based on normality should almost always be taken with a grain of salt, particularly when so many degrees of freedom are involved. This procedure is quite convenient when both degrees of freedom are greater than 24 (this is outside the range of short tables of F or z) and will be quite accurate if both degrees of freedom are at least 2. When one of the degrees of freedom is 1, accuracy will be improved by plotting ½ seven-tenths of the way from zero to one, because √.5 = .7. The use of binomial probability paper has the feature, interesting to some, of producing estimates of intermediate levels of significance. If it is desired to check the tabled values of levels of F against the results given by binomial probability paper, one plots the split (n1 F, n2) and measures the distance to the point (n1/2, n2/2), remembering that the tabled values provide a one-sided test.

Analysis

The same process can, of course, be applied to the analysis of variance, in principle a special case of the comparison of variances, but in practice a whole realm of its own. The sum of squares column determines the line and the df column the point.

Example 6 (See Figure 2). How can the analysis of variance in Example 7 below be tested graphically? Draw the line through (257.15, 73.16) and plot the point (1, 5).
The perpendicular distance of 15 mm is very highly significant (beyond the 0.1 per cent point).

Critique of Example 6. The difficulties with normality, mentioned in the last critique, are usually reduced enough to be neglected in an analysis of variance situation. Only in case this graphical method yields borderline significance is accuracy an excuse for using an F or z table, though convenience may often be a reason.

The angular transformation

The analysis of variance of counted data is frequently facilitated [3, Bartlett 1947 and references cited therein] by making the angular transformation. If the data involve small numbers, the accuracy of graphical transformation will suffice. The observation is plotted, a line through the origin produced to the quarter-circle, and the corresponding angle read off.
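The graphical variance-ratio test of Examples 5 and 6 can be imitated numerically. The sketch below assumes, as an illustration only, that the paper's scales are square-root scales with one standard deviation equal to half a unit of distance, and it takes the millimetre scale from the text's "13.1 mm (2.58σ)"; the function names are hypothetical. The split is drawn through the two sums of squares and the distance is measured to the point whose coordinates are half the degrees of freedom, as the text describes.

```python
from math import erfc, sqrt

MM_PER_SIGMA = 13.1 / 2.58  # about 5.08 mm per sigma, from "13.1 mm (2.58 sigma)"


def variance_ratio_distance(ss1, ss2, df1, df2):
    """Distance in sigma units from the point (df1/2, df2/2) to the split
    through (ss1, ss2), on assumed square-root scales."""
    norm = sqrt(ss1 + ss2)
    cx, cy = sqrt(ss1) / norm, sqrt(ss2) / norm   # unit vector along the split
    u, v = sqrt(df1 / 2), sqrt(df2 / 2)           # plotted position of the point
    return 2 * abs(u * cy - v * cx)


def two_sided_level(z):
    """Two-sided normal significance level for a deviation of z sigma."""
    return erfc(z / sqrt(2))


# Example 5: skull-length sums of squares 4832.6 and 5717.6 on 140 and 186 df.
z5 = variance_ratio_distance(4832.6, 5717.6, 140, 186)
print(f"Example 5: {z5 * MM_PER_SIGMA:.1f} mm, "
      f"two-sided level about {100 * two_sided_level(z5):.0f}%")

# Example 6: sizes (257.15 on 2 df) against interaction (73.16 on 10 df).
z6 = variance_ratio_distance(257.15, 73.16, 2, 10)
print(f"Example 6: {z6 * MM_PER_SIGMA:.1f} mm")
```

Under these assumptions the two distances come out close to the 3.7 mm and 15 mm read from the paper in the text. Only the scale of the split matters, so the sums of squares may be entered directly or divided by 100 as in Example 5.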


Figure 3. The sign test (Ex. 2), confidence limits for the median (Ex. 4), the F test (Ex. 5), operating characteristic of the sign test (Ex. 10), and sample size for population tolerance limits (Ex. 11).

Example 7. W.E. Kappauf and W.N. Smith (personal communication) tested the performance of six observers in reading three types of dials. Sixty readings were made by each observer on each size. The errors, angles, and analysis of variance are shown below. (The reader will find it easy to check on the computation of the angles on his own piece of graph paper.)

               Errors in 60 trials           Corresponding angles
Observer    Size I   Size II   Size III    Size I   Size II   Size III
B0             7        4         4         20.2     15.0      15.0
B             15        6         3         30.2     18.6      13.1
K             10        7         5         24.1     20.2      16.9
R             28       20        16         43.2     35.6      31.2
S             11        8         9         25.2     21.6      23.0
T             25       17        15         40.0     32.2      30.2

Source         df    Sum of Squares    Mean Square
Observers       5        991.64           198
Sizes           2        257.15           129
Interaction    10         73.16             7.3
Binomial        ∞                      821/60 = 13.7*

* Variance of an angle obtained from binomially distributed data = 1/(4n) (radians)² = 821/n (degrees)².

The usual test for significance of the effect of size would be F = 129/7.3 = 17.8 on 2 and 10 degrees of freedom, which is highly significant. Critique of Example 7. A possibly more conservative test would be F = 129/13.7 = 9.4 on 2 and ∞ degrees of freedom, which is still very highly significant. Although 7.3 on 10 df is not significantly less than 13.7, there is reason to believe in this case that the error mean square in a large-scale repetition of this experiment might well be less than 13.7. For the analysis above assumes the errors distributed binomially, and the probable differences in difficulty of the various dials attempted might reduce this variance.
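The angular transformation and its variance of 821/n square degrees can be written out directly. This is a minimal sketch with hypothetical function names; it uses the angle arcsin √(x/n), so its values differ by a few tenths of a degree from the graphically read angles tabled in Example 7. The last function applies the same variance to the farm-ownership count of Example 3 (143 of 500), treating the triangle as a point, which the text allows in that case.

```python
from math import asin, degrees, pi, sin, sqrt


def angle_degrees(count, n):
    """Angular transform of a count out of n trials, in degrees."""
    return degrees(asin(sqrt(count / n)))


def angle_variance_degrees(n):
    """Variance of the angle: 1/(4n) radians^2, i.e. about 821/n degrees^2."""
    return (180 / pi) ** 2 / (4 * n)


def proportion_limits(count, n, z=1.96):
    """Approximate confidence limits for a proportion: move z standard
    deviations either way on the angular scale, then transform back."""
    centre = asin(sqrt(count / n))
    half_width = z / (2 * sqrt(n))
    low = max(0.0, centre - half_width)
    high = min(pi / 2, centre + half_width)
    return sin(low) ** 2, sin(high) ** 2


print(f"angle for 15 errors in 60 trials: {angle_degrees(15, 60):.1f} degrees")
print(f"binomial variance of an angle, n = 60: {angle_variance_degrees(60):.1f}")
low, high = proportion_limits(143, 500)
print(f"95% limits for 143 of 500: {100 * low:.1f}% to {100 * high:.1f}%")
```

The limits computed this way fall near the 25 per cent and 33 per cent obtained graphically in Example 3.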

PART III. APPLICATIONS TO DESIGN binomial design Since binomial probability paper allows us to approximately judge the significance of a paired count, it must also let us plan binomial experiments to have desired properties. Sample size necessary to resolve two given percentages When designing a test to discriminate between two theoretical percentages, the experimenter often wishes to know that any result will give significant evidence against at least one of the two theories. The procedure is best described by an example. Example 8 (See Figure 2). A geneticist wishes to test whether a certain character appears in one-half or one-quarter of the progeny of a certain mating.


He requires significance at the (two-sided) 1 per cent level against at least one of these hypotheses and wishes to know the smallest sample size which will guarantee this. He draws the 50-50 and 25-75 splits and then parallel lines at a distance of 13.1 mm (2.58σ) which corresponds to the two-sided 1% level. These parallel lines intersect at (37, 63). This point separates the triangle (36, 63), (37, 63), (36, 64) from the triangle (37, 62), (38, 62), (37, 63). Thus the paired count (36, 63) is beyond the 1% level from the 50-50 split, while (37, 62) is beyond the 1% level from 25-75. Thus a sample size of 36 + 63 = 37 + 62 = 99 (= 37 + 63 − 1) will be enough. Critique of Example 8. The design of this experiment is very good as far as sample size and significance levels are concerned. Since such genetical ratios are usually well-behaved, the experimenter who uses a sample of at least 99 and protects the progeny against causes of differential mortality should obtain very good results. Our criticism should be directed against his less careful competitor who says that he will use the two-sided 1% level also, but will be satisfied with a sample size for which an observed 75-25 proportion will be at this level. Since the parallel line for 50-50 cuts the 25-75 split at (6, 18), he will use a sample size of only 5 + 18 = 6 + 17 = 23. This design probably does not meet his needs, for if he uses it over and over he will be well protected from falsely stating the ratio is not 50-50 (since he is using a two-sided 1% level) but he will miss one-half of the cases where the ratio is 25-75. Such designs emphasizing one risk are usually ill-chosen, and if such a choice is compelled by limited experimental resources the choice of significance level should be re-examined, looking at both types of risk. 
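The graphical design of Example 8 can be cross-checked against exact binomial tail sums. The sketch below is an illustration under stated assumptions, not the authors' procedure: it counts occurrences of the character, uses a one-sided tail of 0.005 on each side for the two-sided 1% level, and searches for the smallest sample size at which a single cutoff rejects one hypothesis or the other. The function names are hypothetical, and the exact search need not agree with the graphical answer to the last unit.

```python
from math import comb


def lower_tail(n, c, p):
    """P(X <= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(c + 1))


def resolves(n, p_low, p_high, tail):
    """True if some cutoff c rejects p_high when X <= c and p_low when X > c."""
    for c in range(n):
        rejects_high = lower_tail(n, c, p_high) <= tail
        rejects_low = 1.0 - lower_tail(n, c, p_low) <= tail
        if rejects_high and rejects_low:
            return True
    return False


def smallest_resolving_sample(p_low, p_high, tail, n_max=500):
    """Smallest n for which every outcome is significant against one hypothesis."""
    for n in range(2, n_max + 1):
        if resolves(n, p_low, p_high, tail):
            return n
    return None


n_exact = smallest_resolving_sample(0.25, 0.50, tail=0.005)
print("smallest sample size by exact tails:", n_exact)
```

The same search with a different `tail` shows how the required sample size responds to the significance level, which is the trade-off raised in the critique.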
Designing single sampling plans An essentially equivalent problem arising in industrial work as well as in certain kinds of experimental work is to design a sampling inspection plan which will distinguish between two kinds of quality, say product which is 100p1 per cent defective and that which is 100p2 per cent defective (p1 < p2 ). The plan desired is often described in terms of the operating characteristic curve, namely that large lots having 100p1 per cent defectives should be accepted 100(1 − α) per cent of the time, while lots having 100p2 per cent defectives should only be accepted 100β per cent of the time (1 − α > β). The process of building such a plan is described by the following: Example 9 (See Figure 2). It is desired to construct a plan which will accept product with 3 per cent defectives, 95 per cent of the time, while product 12 per cent defective is to be accepted only 10 per cent of the time. What sample size should be used, and how many defectives can be tolerated before the lot is rejected? We construct the 3% line (the 97-3 split) and the 12% line (the 88-12 split). If we wish to accept lots which are 3 per cent defective 95 per cent of the time, we must accept lots whose samples, as plotted on the paper, go as high as 1.64σ above the 3 per cent line, consequently we draw a parallel


line 1.64σ = 8.4 mm above the 3% line. Similarly in order to reject material which is 12 per cent defective 90 per cent of the time, we must reject lots whose samples come within 1.28σ of the 12% line, so we draw a parallel line 1.28σ = 6.5 mm below the 12% line. The intersection of the two construction lines is (61, 5) and we find that the sample size is 61 + 5 − 1 = 65, the acceptance number is 4, the rejection number is 5. Critique of Example 9. It is assumed here that the lot is much larger than the sample, say ten times as large. Notice that these significance levels (10% and 95%) were one-sided. Consultation of tables shows that this plan will accept 3%-defective lots 96.77% of the time and will accept 12%-defective lots 9.69% of the time, which is approximately the result requested. The operating characteristic of the sign test We often want to know how well the sign test will discriminate. Example 10 (See Figure 3). In Example 2 we considered 28 pairs of observations and decided to treat (7, 21) as significant. It is natural to inquire what population percentage of favorable pairs is needed to insure significance at this level 95% of the time. We must then find a split so that the triangle representing (7, 21) is 8.4 mm = 1.65σ away. This leads to the 12-88 split, and so the sign test with 28 pairs discriminates very well between 50% and 12% (or 88%). As in Examples 8 and 9, we can determine a sample size so that the sign test will have given discriminating power. tolerance limit design Sample sizes for population tolerance limits In industrial work it may be desirable to take the least sample from an unknown population, such that the range from the smallest value in the sample to the largest value in the sample will cover a given fraction a of the population with given confidence β. 
This may be shown to be equivalent to a binomial problem, namely: Find the least sample size from a q-p split (p = 1 − q) such that the second count will be at least 2 with confidence β. (This is, of course, a special case of Example 8.) If it is desired to use the rth from the bottom and the mth from the top to establish tolerance limits, replace 2 by r + m. Example 11 (See Figure 3). A manufacturer of ball bearings wishes to have 99.5% confidence that 90% of his ball bearings lie between the limits set by the largest and smallest of a sample of a chosen size. He draws the 10% line (10% = 100% − 90%!), a parallel line 2.58σ = 13.2 mm lower (for 99.5% confidence) and the horizontal line through 2. The intersection of the last two lines gives (71, 2) and the desired sample size is 74 = 72 + 2. Critique of Example 11. This example assumes that the successive ball bearings produced behave like a random sample—no manufacturer of any metal object, even ball bearings has reached so high a state of control. The practical interpretation of such a sample size is that it is a lower bound.
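Examples 9 and 11 both reduce to binomial tail sums, so the graphical answers can be compared with direct computation. The following sketch assumes a lot much larger than the sample (so that the binomial applies) and uses illustrative function names that do not appear in the paper. Exact sums of this kind may differ somewhat from graphical readings and from values quoted from tables.

```python
from math import comb


def prob_at_most(n, c, p):
    """P(X <= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(c + 1))


def acceptance_probability(sample_size, acceptance_number, fraction_defective):
    """Chance that a single sampling plan accepts a lot of the given quality."""
    return prob_at_most(sample_size, acceptance_number, fraction_defective)


def tolerance_sample_size(coverage, confidence, outside_needed=2):
    """Least n with P(at least `outside_needed` items fall outside) >= confidence.

    With outside_needed = 2 the limits are the smallest and largest sample
    values; use r + m for the r-th from the bottom and m-th from the top.
    """
    p_outside = 1.0 - coverage
    n = outside_needed
    while 1.0 - prob_at_most(n, outside_needed - 1, p_outside) < confidence:
        n += 1
    return n


# Example 9: sample of 65, accept on 4 or fewer defectives.
print("accept at 3% defective:", acceptance_probability(65, 4, 0.03))
print("accept at 12% defective:", acceptance_probability(65, 4, 0.12))

# Example 11: 90% coverage with 99.5% confidence.
print("tolerance sample size:", tolerance_sample_size(0.90, 0.995))
```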


Second sample tolerance limits When tolerance limits are desired for a second sample, rather than for the population, the problem is hypergeometric, and may only be approximated by a binomial problem. The chance that the range between the rth from the bottom and the mth from the top from a sample of n will omit N1 or less from a second sample of N , may be shown to be the same as that a sample of N1 + n1 (where n1 = r + m) from a finite population split N to n will contain n1 or more of the second sort. If N1 ≤ N/10, this can be approximated by the chance of a second count of n1 or more in a sample of N1 + n1 from a N -n split. Example 12 (See Figure 2). A manufacturer of precision resistors tests random samples of 1000 of each new type, and establishes the second from each end as working limits. He wishes to know the confidence with which he may expect (a) 99.5 or (b) 99 per cent of a batch of 50,000 similar resistors to fall within these working limits. He draws the 50,000-1,000 split and computes the chances of getting a second count of 4 or more in a sample of (a) 250 + 4, (b) 500+4. These are given by the distances from the split to (250, 5) and (500, 5), which are 0.0 mm ≈ 50.0% and 9.3 mm ≈ 3.4%. He has, therefore, 50.0% confidence that 99.5% will be within working limits and 96.6% confidence that 99% will be within working limits. In the light of this example, the reader can construct answers to the other problems of tolerance limits for a second sample. Critique of Example 12. If the manufacturer’s production line is nearly in control when it starts to produce a new type, the authors will be surprised. The procedure given will answer the manufacturer who “wishes limits to the confidence, PROVIDED the process were in perfect control from the start.” Also note that m and r have to be selected before the sample is examined, if the probabilities are to be accurate. 
analysis of variance design Operating characteristic of model II—anova An analysis of variance situation is Model II [6, Eisenhart 1947], when the effects are drawn from a normal population with variance $\sigma_1^2$, the errors being drawn from a normal population with variance $\sigma^2$. If the effects are the c column effects in a simple design with r rows and c columns, then the mean squares for columns and for error have the expectations $\sigma^2 + r\sigma_1^2$ and $\sigma^2$ and the degrees of freedom (c − 1) and (c − 1)(r − 1). In Model II, the mean squares are still distributed like multiples of chi-square. Thus the power of such an experiment can be easily determined as in the following example. Example 13 (See Figure 5). A random sample of 25 sailors are to have their balancing ability measured quantitatively on each of 7 days. The results will be submitted to the analysis of variance, and a 5% significance level used. How large must $\sigma_1^2/\sigma^2$ be, before the existence of differences between sailors will be detected 95% of the time?


Here c = 25, r = 7, and the degrees of freedom are 24 and 144, hence we plot the point (12, 72). A one-sided 5% level corresponds to 8.4 mm, so we draw a circle around (12, 72) with this radius. The tangent splits are 27.2-100 and 8.9-100, so that the critical variance ratio is found from
$$\frac{\sigma^2 + 7\sigma_1^2}{\sigma^2} = \frac{27.2}{8.9}$$
to be $\sigma_1^2/\sigma^2 = .29$. The basis of this construction is as follows: A ratio of sums of squares of 27.2 to 100 is needed for the chosen significance level. A ratio as small as $8.9(\sigma^2 + 7\sigma_1^2)$ to $100\sigma^2$ can be expected by chance 5% of the time if the population ratio is $\sigma^2 + 7\sigma_1^2$ to $\sigma^2$, hence to obtain 95% confidence of finding significance at 5%, we must have
$$\frac{8.9(\sigma^2 + 7\sigma_1^2)}{100\sigma^2} \ge \frac{27.2}{100}.$$
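The circle-and-tangent construction of Example 13 can be imitated numerically. The sketch below is an illustration, not the authors' computation: it assumes both axes of the paper are square-root scales with one standard deviation equal to half a unit of length (5.08 mm), and it takes 1.645 as the one-sided 5% deviate. Function names are hypothetical. Computed tangents need not match readings taken from a drawn figure, so the resulting variance ratio may differ from the graphical value in the second decimal.

```python
import math


def tangent_splits(half_df1, half_df2, radius_sigma):
    """Tangent splits to a circle around (half_df1, half_df2).

    Returns the two splits as first/second ratios (larger first).
    """
    # Plotted position on square-root scales.
    x, y = math.sqrt(half_df1), math.sqrt(half_df2)
    distance = math.hypot(x, y)
    centre_angle = math.atan2(y, x)
    # One sigma is half a unit on the square-root scale.
    half_width = math.asin((radius_sigma / 2.0) / distance)
    ratios = []
    for angle in (centre_angle - half_width, centre_angle + half_width):
        # A line at this angle is the split with first:second = cot^2(angle).
        ratios.append((math.cos(angle) / math.sin(angle)) ** 2)
    return ratios[0], ratios[1]


def critical_variance_ratio(c, r, deviate=1.645):
    """Smallest sigma_1^2 / sigma^2 detected with the stated confidence."""
    df_columns = c - 1
    df_error = (c - 1) * (r - 1)
    upper, lower = tangent_splits(df_columns / 2, df_error / 2, deviate)
    return (upper / lower - 1.0) / r


upper, lower = tangent_splits(12, 72, 1.645)
print("tangent splits per 100:", 100 * upper, 100 * lower)
print("critical variance ratio:", critical_variance_ratio(25, 7))
```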

PART IV—SEVERAL PAIRED COUNTS level and homogeneity combined Comparing several sets of data with a theoretical proportion A set of several paired counts may give evidence against a fixed theoretical proportion in two ways. If the observed proportions are too variable, they indicate lack of homogeneity, that the samples came from populations with different proportions. If the average observed proportion deviates too much from the theoretical, it indicates a change (or error) in level, that the theoretical proportion does not apply. It is well known that the correct way to test a homogeneous set of paired counts for agreement with a preassigned population proportion is to add them together, and then test the sum (as in Part II). Tests of homogeneity alone are the subject of the next section. Many delicate testing procedures involve first a test of homogeneity and then a test of level. Combined tests are mainly used to make quick and easy tests. Example 14. (See Figure 4). A production process has been producing an average of 15% defective pieces over a long period. After the introduction of a new batch of raw material, successive shifts produced the following paired counts of nondefectives and defectives: 1. (155, 20), 2. (164, 41), 3. (106, 12), 4. (41, 10). Is it reasonable to think that the production process is producing the same proportion of defectives as before? We plot the points and the 15% line. Since no middle distance from the line is as great as 2σ = 10 mm, and since the


figure 4 showing comparison of k proportions with theory (ex. 14, ex. 15), first step toward a stabilized p-chart (ex. 16), homogeneity of k proportions (ex. 17, ex. 18, ex. 19).


points are about equally spread about the line, there seems to be no reason to suppose the new batch of raw material has made a change in the percentage defective. If some points were found outside 2σ the resulting paired count (of points inside 2σ vs. points outside 2σ) would be compared with the 19-1 split by the method of Example 2. An alternative procedure is to combine the middle distances from the theoretical line by crab addition (discussed in Part I) and consider this a chi-square with as many degrees of freedom as there are paired counts. Example 15 (See Figure 4). Taking the same data as in Example 14, the crab sum, which can be obtained in less time than it takes to read the description, is a length, which when doubled may be read on the marginal scale as χ2 =9.0 which is between the upper 10% and upper 5% points for 4 degrees of freedom. Critique of Examples 14 and 15. Merely examining to see whether any points fall outside the 2σ limits does not squeeze the data dry, and is not very precise, but is very convenient in situations where a control chart would seem reasonable and proper. Both tests tend to detect either a change to a new level or excess variability due to changes in level from shift to shift. To check one of these alone, different procedures should be used. A change to a new level should be tested for by combining all the shifts, and thus comparing (466, 83) with 85-15 (which is far from significance). Changes in level from shift to shift should be tested for by the methods of the next examples (Ex. 17–19). The stabilized p-chart Where the lot size is constant in 100% inspected industrial production or the sample size is constant when sampling inspection is used, one of the standard quality control procedures is the p-chart, where the percentage of each lot or sample found to be defective is plotted against lot number or time. 
When lot size is not constant, and it is not feasible to break down the data into groups of different size, there is a need for a new technique. The use of groups of varying size is not recommended—it would be better to break up the lots into rational sub-lots of nearly uniform size—but where the quality control engineer cannot arrange for the better solution he may wish to use the following device which was suggested to us by Acheson Duncan [5]. (The classical device is to plot observed percentages, with broken horizontal lines for control limits. This makes a hard-to-read, messy diagram.) Example 16 (See Figures 4 and 6). The following data on adjustment irregularities of electrical apparatus appear in the ASTM Manual on Presentation of Data [1, 1940 Supplement B, p. 58]. The first number given is the sample size, the second, the number of defectives. (We shall not correct the sample size by subtracting the number of defectives, because in the one case where this might make a visible difference there are no defectives.) We divide the


figure 5 showing operating characteristic of anova ii (ex. 13), homogeneity of k small proportions (ex. 20), the four-fold table (ex. 21a, ex. 21b), a less obvious example (ex. 22).


sample size by 10, and plot the triangles in Figure 4 (to this scale, the triangles reduce to vertical segments).

ADJUSTMENT IRREGULARITIES, ELECTRICAL APPARATUS

Lot   Sample Size   Defectives        Lot   Sample Size   Defectives
 1        600            2             11       1550            7
 2       1300            2             12        950            2
 3       2000            1             13        950            5
 4       2500            1             14        950            2
 5       1550            5             15         35            0
 6       2000            2             16        330            3
 7       1550            0             17        200            0
 8        780            3             18        600            4
 9        260            0             19       1300            8
10       2000           15             20        780            4

From Figure 4 we measure the vertical deviations from the p line (which is assumed to be 0.27% based on past experience) and plot them on a regular control chart (Figure 6), being sure to keep the data in the order in which they originally appeared. (The use of tracing paper makes this process very easy.) In practice each new observation would first be plotted on binomial probability paper (perhaps at an enlarged scale) and then transferred. If the data are retained on the original probability paper, the advantage of examining the data for trends and runs would be lost. Critique of Example 16. The control chart in Figure 6 looks very different from that usually given. In the usual chart Lot 19 is shown as beyond the control limits on the high side, and Lots 4 and 7 are not detected as being possibly too defective-free because we find that there can be no lower control limit. This kind of plotting might be useful even when the samples are of constant size. tests of homogeneity The general case Given 5 or 50 or 500 paired counts which we wish to test for homogeneity, to test if it is reasonable that they have arisen from sampling a population with the same percentages in the two categories, the problem is the same, but the practical solution is different. In every case we plot the individual paired counts and draw the best fitting split, either by eye or through the sum of the paired counts. We shall discuss three methods here, namely: (1) graphical chi-square, (2) range, (3) counts in ±1σ and ±2σ strips.


figure 6 showing completed, stabilized p-chart at enlarged scale (with segments made vertical).

Each of these has advantages and disadvantages, and, to the best of our knowledge, they can be compared as follows:

            Number of Samples
Method      Feasible     Recommended       Advantages                             Disadvantages
χ²          2 to ∞       2 to 8 or 15      efficiency                             labor
range       2 to ?       2 to 20           ease and speed                         limited efficiency
counts      15 to ∞      15 or 20 to ∞     80% efficiency; relative simplicity

The range is only recommended for 20 paired counts or less since its use for larger k involves the delicate details of the normal distribution and since its efficiency is less than the counting method. To apply the χ2 -test, plot the paired counts and the split through their sum, and combine the middle distances by crab addition as explained in Part I. Example 17 (See Figure 4). The following classic data by C. Goring quoted by K. Pearson and by M.G. Kendall compare the number of alcoholics and


non-alcoholics among criminals according to crimes committed. The first number in each pair is the number of alcoholics:

Arson       (50, 43)
Rape        (88, 62)
Violence    (155, 110)
Stealing    (379, 300)
Coining     (18, 14)
Fraud       (63, 144)
Totals      (753, 673)

The graphical display shows very clearly that (1) the observations are discrepant, (2) the crime of fraud is the only one for which the proportion of alcoholism is really different. Graphical chi-square computation gives 30.2 on 5 degrees of freedom—highly significant. When criminals convicted for fraud are removed from consideration the remaining five groups are each less than one standard deviation away from the new fitted line (690-539 split). Stealing is slightly misplotted. Critique of Example 17. If the definition of “alcoholic” were sufficiently objective, and if the sample of convicted criminals represents a random sample of criminals, then the analysis seems sound. It cannot, of course, throw any appreciable light on the connection between alcoholism and crime in general, bearing only on the question “excluding fraud, do alcoholics tend to be convicted of some types of crime and non-alcoholics of others?” A quicker method of analysis, and one well suited for drawing lines by eye is to compute the range of the sample, that is the sum of the greatest middle distances to the right and left of the line. Then the range measured in millimeters or standard deviations can be compared with Table 4 (at end of paper) to discover whether the samples deviate enough among themselves to provide evidence that the observations did not arise from random sampling from a single proportion in the population. Example 18 (See Figure 4). In testing the effect of the X-chromosome inversion $B^{M1}$ on secondary non-disjunction, K.W. Cooper (personal communication) raised 21 cultures of Drosophila melanogaster, crossing v-In(1)$B^{M1}$ / $y^2 w^a v$/Y females with wild-type Canton-S males. The presence or absence of secondary non-disjunction can be detected in female progeny.
The 21 cultures gave the following results: (9,135), (7,115), (11,118), (13,89), (15,148), (8,91), (6,113), (11,104), (9,122), (10,90), (15,155), (14,138), (5,84), (11,128), (2,34), (4,73), (4,107), (9,107), (10,103), (11,115), (8,104) where the smaller class showed secondary non-disjunction. The split is drawn through the total of (192, 2273). The range is 16.5 mm which is far from significant—thus this test gives no evidence of heterogeneity. Critique of Example 18. Clearly an amount of heterogeneity large enough to affect the estimated standard error of the grand mean would almost surely have been detected. The test seems adequate for its purpose.


A more refined, but still simple test is obtained by drawing parallel lines, at ±5 mm and ±10 mm. In sets of 21 or more samples, we expect about 5% outside the outer lines and about 33% outside the inner lines. The weighted sum 12 (no. outside 10 mm) + 3 (no. between 5 mm and 10 mm) − 2 (no. inside 5 mm) is distributed with mean nearly zero and variance nearly 11.72k in case of k homogeneous samples. Approximate significance levels are 5.65√k and 8√k for the 5% and 1% points. Example 19 (See Figure 4). Returning to the data of Example 18 and drawing the 5 mm and 10 mm lines, we find (classifying borderline cases according to the center of the hypotenuse of the triangles):

            Expected    Found
Outside         1          0
Between         6          4
Inside         14         17
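The counting test is simple enough to state in a few lines. The sketch below only restates the weights and the approximate significance levels given in the text; the function names are illustrative, and the classification helper assumes that signed deviations from the split, in millimeters, have already been measured.

```python
import math


def classify(deviations_mm, inner=5.0, outer=10.0):
    """Count points outside the outer lines, between the lines, and inside the inner lines."""
    outside = sum(1 for d in deviations_mm if abs(d) > outer)
    between = sum(1 for d in deviations_mm if inner < abs(d) <= outer)
    inside = len(deviations_mm) - outside - between
    return outside, between, inside


def weighted_sum(outside, between, inside):
    """Weighted sum 12 * outside + 3 * between - 2 * inside."""
    return 12 * outside + 3 * between - 2 * inside


def significance_levels(k):
    """Approximate 5% and 1% points for k homogeneous samples."""
    root = math.sqrt(k)
    return 5.65 * root, 8.0 * root


# Counts found for the 21 cultures.
statistic = weighted_sum(outside=0, between=4, inside=17)
five_percent, one_percent = significance_levels(21)
print(statistic, five_percent, one_percent)
```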

The weighted sum is 12 − 34 = −22 which is far from the significance levels of 25.9 and 36.5. This more delicate test finds no evidence of heterogeneity in the per cent of secondary non-disjunction in Cooper’s cultures. Indeed, if anything the data are a little too homogeneous, though not enough to notice. These methods can also be applied to the case of unsymmetric counts, as in the following example. Example 20 (See Figure 5). In testing the effects of X-chromosome inversions on primary non-disjunction, K.W. Cooper (personal communication) crossed 847 males with females of eight different chromosomal sequences. Exceptional cases can be detected in both males and females. The observed counts for males were (2885, 13), (7172, 18), (4672, 13), (9162, 14), (1389, 2), (2961, 4), (2199, 2), (1195, 1). Does the rate of primary non-disjunction seem to be constant? The total count is (31635, 67), and the corresponding split together with the individual counts are plotted on Figure 5, where all horizontal coordinates have been divided by 50. The lines parallel to the total split are at ±5 mm and ±10 mm vertically. The range, measured vertically (since the horizontal coordinate has been reduced), is 19.3 mm which is not far from the 5% point of 21.8 mm. Critique of Example 20. The method in the critique of Example 18 should not be applied on so few points, but if it were applied in the critique the weighted sum would be 12(1) + 3(3) − 2(4) = 13. The value of 5.65√k is 16.0, but direct calculation of the 5% point yields 19. Thus the approximate 5% point is not too accurate for 8 points (16.0 is about the 10% point). Calculation of χ² by crab addition of the vertical deviations from the line to


the total yields 13.9, which is again not quite at the 5% level for 7 degrees of freedom. The four-fold table The four-fold table, where a sample is classified into two categories in each of two ways has received very much attention by both applied and theoretical statisticians. Different methods of analysis have been given, some of which assume that (1) the sample is a representative of samples in which only the total is fixed, or (2) the sample is a representative of samples in which one set of marginal totals are fixed, or (3) the sample is representative of samples in which all marginal totals are fixed. Many of the “control group versus experimental group” experiments so common in biology, medicine, psychology, and education fall under (2), since the numbers in the control and experimental groups are fixed. Such experiments can be approximately analyzed as a homogeneity test as in the last section. For the case of two paired counts, the chi-square and range methods are equivalent, and the range is simpler. Example 21 (See Figure 5). English et al. [7, 1940] took samples of 208 smokers and 208 non-smokers and investigated the incidence of coronary disease. They found (198, 10) and (206, 2), where coronaries are the second category. The range is 17 mm which is significant at the 5% level. They also took 187 cases with coronary disease, and 302 without, and investigated the incidence of smoking. They found (149, 38) and (187, 115), where smokers are the first category, which yields a range of 30.1 mm (Figure 5) which is horribly significant. Critique of Example 21. These last two samples can be united into a fourfold table, but, in view of the way in which the data were obtained, it would be incorrect to compare (187, 149) with (115, 38) by this method and to assume that two binomials were being compared. However, the range obtained in this way is 29.7 mm and it is possible that such inverted tests on binomial probability paper give approximately correct answers. 
                  Coronary disease
                  Yes      No
Smokers           149      187
Non-smokers        38      115
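The ranges quoted in Example 21 can be approximated without the paper. The sketch below is an illustration under stated assumptions: it treats each paired count as a single point on square-root scales, fits the split through the total, takes the signed perpendicular distance as 2(√(q·second) − √(p·first)) standard deviations with p the fitted proportion in the second category, and converts at 5.08 mm per standard deviation. Function names are hypothetical. Because the triangles are ignored, comparisons involving very small counts may read differently on the paper itself.

```python
import math

SIGMA_MM = 5.08  # one standard deviation on the paper (10.16 mm = 2 sigma)


def deviation_sigma(first, second, frac_second):
    """Signed perpendicular distance of (first, second) from the split, in sigma."""
    frac_first = 1.0 - frac_second
    return 2.0 * (math.sqrt(frac_first * second) - math.sqrt(frac_second * first))


def range_mm(paired_counts):
    """Range of the deviations from the split through the total, in millimeters."""
    total_first = sum(a for a, _ in paired_counts)
    total_second = sum(b for _, b in paired_counts)
    frac_second = total_second / (total_first + total_second)
    deviations = [deviation_sigma(a, b, frac_second) for a, b in paired_counts]
    return (max(deviations) - min(deviations)) * SIGMA_MM


print("smokers vs non-smokers:", range_mm([(198, 10), (206, 2)]))
print("coronary vs non-coronary:", range_mm([(149, 38), (187, 115)]))
print("inverted comparison:", range_mm([(187, 149), (115, 38)]))
```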


A less obvious example The ideas behind the sign test may be extended to give approximate tests in many situations of greater complexity. Such tests may be very useful, when used with the knowledge that they are quick, but often lack the sensitivity of more complex methods. Example 22 (See Figures 7 and 5). A routine bioassay had been in use for two years using a standard curve. Occasional checks on the standard had been made. The situation is shown in Figure 7, which raised two questions: (i) Does the curve agree with the recent points? (ii) If not, has something surely changed in two years, or may the difference be assigned to the combined sampling fluctuations in establishing and checking the standard curve? figure 7 basic bioassay data for example 22 (a less obvious example).

The first question is answered by the split test, for comparing (19, 5) with a 50-50 split yields (Figure 5) a separation of 15 mm which is very highly significant. The second question can be approximately answered as follows: the original least square fit to 18 points was probably more accurate than fitting a median to 18 points and less accurate than fitting a median to 36 points. A roughly fair test should come between a comparison of (19, 5) and (9, 9) and a comparison of (19, 5) with (18, 18). These give (Figure 5) ranges of 14.5 mm and 16.5 mm, which are both beyond the 5% level, indicating that the activity, or the fish, or the technique has probably changed.


Critique of Example 22. While these are not the most thorough tests which can be applied to this situation, anyone familiar with bioassay computation will appreciate their speed, simplicity, and clarity.

PART V—REFERENCE MATERIAL modification of the angular transformation The original angular transformation The angular transformation was introduced by R.A. Fisher in 1922 [9, p. 326] in a genetic situation where a certain proportion was varying by random fluctuation from generation to generation. In 1936, Bartlett [2, p. 74] proposed its use on experimental data as a means of stabilizing the variance when binomial data were subjected to the analysis of variance. Various authors have proposed its use for various purposes, a considerable number of references may be found in [13, 1947]. Bartlett’s modification In his 1936 paper, Bartlett also proposed an empirical modification to make the transformation more effective near p = 0 and p = 1. This was the device of transferring ½ of a unit from the larger count to the smaller count. Thus (3, 29) would become (3.5, 28.5). This proved to be helpful, but had the annoying feature that both (3, 4) and (4, 3) were converted to (3.5, 3.5) which did not seem appropriate. The smooth version The smooth way of obtaining the good effects of ±½ near the ends and ±0 in the middle is to add ½ to each cell, thus passing from (n − k, k) to (n − k + ½, k + ½). It is clear that for values of p near 0, ½, and 1 this will stabilize the variance very well. How well requires a numerical study, now in progress. Correction for continuity Most of the applications of binomial probability paper discussed above deal with tests of significance rather than with scoring paired counts. We must try, then, to assign nearly normal deviates, not to single paired counts, but to tails—to all (n − k, k) for which k ≥ r, for example. This is closely connected with the scoring problem, since a natural dividing line between (n − r + 1, r − 1) and (n − r, r) is (n − r + ½, r − ½), and in accordance with the last paragraph, this is to be scored as if it were (n − r + 1, r). Thus we expect to find that
$$\sqrt{n+1}\left[\sin^{-1}\sqrt{p} - \sin^{-1}\sqrt{\frac{r}{n+1}}\right]$$
is nearly the normal deviate associated with the probability that k ≥ r, where k is binomially distributed according to n and p.
Flattening


Since the angles involved are rather small, it is plausible to replace them by their sines. This is of course what has been done in the examples, where we have always measured distances perpendicular to the splits. A little trigonometry shows that the distance from (n − r + 1, r) to the p-q split is (in standard deviations)
$$2\left(\sqrt{p(n-r+1)} - \sqrt{qr}\right).$$
(To obtain distances in millimeters, replace 2 by 10.16 mm.) Accuracy The accuracy of the over-all approximation to $\Pr\{k \ge r \mid k \text{ binomial } (n, p)\}$ by
$$\Pr\left\{x \ge 2\left(\sqrt{p(n-r+1)} - \sqrt{qr}\right) \,\middle|\, x \text{ unit normal}\right\}$$
has been studied numerically, and a note giving details will be submitted to the Annals of Mathematical Statistics (by Murray F. Freeman and John W. Tukey). The general conclusion is that the approximation is extraordinarily good near the 1% to 5% points, and remarkably good in general. the incomplete beta and f distributions The binomial distribution is, as is well known, given by the expansion of
$$(q + p)^n, \qquad q = 1 - p,$$
where n is the number of cases, p the chance of a “success” and the term
$$\binom{n}{r} q^{n-r} p^{r}$$
in the expansion is the probability of exactly r successes. The probability of r or more successes is given by
$$S = \sum_{x=r}^{n} \frac{n!}{x!\,(n-x)!}\,(1-p)^{n-x} p^{x}.$$
Using the well-known device of differentiating both sides with respect to p and summing, we get
$$\frac{dS}{dp} = \frac{n!}{(r-1)!\,(n-r)!}\, p^{r-1}(1-p)^{n-r}.$$
Replacing p by t and integrating S from 0 to S, and t from 0 to p, we have the usual relation
$$\Pr(r \text{ or more successes}) = I_p(r, n-r+1) = \frac{n!}{(r-1)!\,(n-r)!}\int_0^p t^{r-1}(1-t)^{n-r}\,dt$$


where $I_p(m, n)$ is the incomplete Beta-function. Hence if binomial probability paper successfully represents the binomial distribution it also successfully approximates the incomplete Beta-function. Thus

$$I_p(r, n-r+1) \sim \Pr\left\{x \ge 2\left(\sqrt{p(n-r+1)} - \sqrt{qr}\right) \,\middle|\, x \text{ unit normal}\right\},$$

which seems, incidentally, to be a new analytic approximation to the incomplete Beta-function. Simplifying notation, we find that $I_p(m_1, m_2)$ corresponds to the distance from p-q split to the point $(m_1, m_2)$.

The ratio of two independent mean squares obtained from normal variates of the same variance is Snedecor’s F, which is related to Fisher’s z by $F = e^{2z}$. The ratio of the numerator sum of squares to the total sum of squares may be written in terms of F as

$$x = \frac{n_1 F}{n_2 + n_1 F},$$

and its distribution is given by $\Pr\{x < p\} = I_p(\tfrac{1}{2} n_1, \tfrac{1}{2} n_2)$. Hence, to the approximation of binomial probability paper, a ratio

$$\frac{s_1}{s_1 + s_2} = p$$

of sums of squares has a probability of arising from populations of equal variance which is given by the tail area corresponding to the deviation of the count $(\tfrac{1}{2} n_1, \tfrac{1}{2} n_2)$ from the line p-(1 − p), which is the same as the line $s_1$-$s_2$.
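The normal approximation to the binomial tail (and hence to the incomplete Beta-function) can be checked numerically. The Python sketch below compares the exact value of Pr{k ≥ r} with the unit-normal tail area at the perpendicular distance. The function names are ours, not the paper's. The orientation of the deviate is also an assumption: it is taken as positive when r lies above the count expected under p, that is, as $2(\sqrt{qr} - \sqrt{p(n-r+1)})$, so that a large r gives a small tail area. Its magnitude is the distance given in the text.

```python
import math

MM_PER_SIGMA = 5.08  # the text's "replace 2 by 10.16 mm" means 5.08 mm per sigma


def binomial_upper_tail(n: int, p: float, r: int) -> float:
    """Exact Pr{k >= r} for k binomial(n, p); this equals I_p(r, n - r + 1)."""
    q = 1.0 - p
    return sum(math.comb(n, x) * p**x * q ** (n - x) for x in range(r, n + 1))


def normal_upper_tail(z: float) -> float:
    """Pr{x >= z} for a unit normal x."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def paper_deviate(n: int, p: float, r: int) -> float:
    """Distance, in standard deviations, from the point (n - r + 1, r)
    to the p-q split. Sign convention (an assumption): positive when r
    is larger than expected under p."""
    q = 1.0 - p
    return 2.0 * (math.sqrt(q * r) - math.sqrt(p * (n - r + 1)))


def paper_approximation(n: int, p: float, r: int) -> float:
    """Normal-tail approximation to Pr{k >= r}."""
    return normal_upper_tail(paper_deviate(n, p, r))


if __name__ == "__main__":
    for n, p, r in [(10, 0.5, 8), (20, 0.3, 11), (50, 0.2, 17)]:
        z = paper_deviate(n, p, r)
        print(
            f"n={n} p={p} r={r}: exact={binomial_upper_tail(n, p, r):.4f} "
            f"approx={paper_approximation(n, p, r):.4f} "
            f"distance={z:.2f} sd = {z * MM_PER_SIGMA:.1f} mm"
        )
```

Running the script prints the exact and approximate tail areas side by side, together with the distance in standard deviations and in millimeters, so the quality of the approximation can be judged for any (n, p, r) of interest.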

PART VI—INDEX, OUTLINE AND TABLES Introduction Table 1 is not intended to replace the worked examples, but rather to serve as a key for the new reader and a reminder for the old. The short tables which follow are of standard distributions based on the normal distributions. Since millimeters are convenient units for use with binomial probability paper, they are given in both millimeters and in standard deviation units. For maximum accuracy, use a sharp pencil! (Regular thickness automatic pencils may serve for some routine work, but finer lead will give better results.) The figures have been drawn for clarity, not accuracy. Remember these methods are all approximations.

Table 1. Index and outline

Part II. Plotting one observed quantity

Example 1. Aim: Observed and theoretical proportions. Plotting required: 1 paired count; 1 split (theory). Remarks: Use short distance for significance level. Use both short and long for significance zone.
Example 2. Aim: Sign test. Plotting required: 1 split (50-50); 1 paired count. Remarks: Use short distance.
Example 3. Aim: Confidence limits for proportion. Plotting required: 1 paired count; 2 splits (at distance). Remarks: Use short distance.
Example 4. Aim: Confidence limits for median. Plotting required: 1 split (50-50); 2 paired counts (at distance).
Examples 5, 6. Aim: All F tests. Plotting required: 1 split (sums of squares); 1 point (½ degrees of freedom).
Example 7. Aim: Angular transformation. Plotting required: 1 paired count; 1 split (through middle).

Part III. Application to design

Example 8. Aim: Designing binomial experiment. Plotting required: 2 splits (theory); 2 parallel lines (at distance). Remarks: Distances correspond to one-sided significance levels at percentages to be controlled.
Example 9. Aim: Designing single sampling plan. Plotting required: 2 splits (theory); 2 parallel lines (at distance). Remarks: (AQL and RQL = LTPD).
Example 10. Aim: Operating characteristic of sign test. Plotting required: 1 paired count; 1 split (at distance). Remarks: Use short distance.
Example 11. Aim: Sample size for population tolerance limits. Plotting required: 1 split; 1 parallel line; 1 horizontal line. Remarks: Split-to-line distance desired confidence. Sum of counts-in determines horizontal.
Example 12. Aim: Tolerance limits for second sample. Plotting required: 1 split; 2 paired counts (touching). Remarks: Split through 1st and 2nd sample sizes; distance to common vertex = confidence.
Example 13. Aim: Operating characteristic of anova II. Plotting required: 1 point (½ degrees of freedom); 2 splits (at distance). Remarks: Compute from ratio of split ratios.


Part IV. Several paired counts

Example 14. Aim: k proportions and a theoretical proportion. Plotting required: k paired counts; 1 split (theory); 2 parallel lines (±10 mm). Remarks: Expect 1 in 20 outside by middle distance.
Example 15. Aim: k proportions and a theoretical proportion. Plotting required: k paired counts; 1 split (theory). Remarks: Combine middle distance by crab addition.
Example 16. Aim: Stabilized p-chart. Plotting required: k paired counts; split (assumed level). Remarks: Transfer to tracing paper as control chart.
Example 17. Aim: Homogeneity of k proportions (k small). Plotting required: k paired counts; 1 split (sum). Remarks: Combine middle distances by crab addition.
Example 18. Aim: Homogeneity of k proportions (k 20). Plotting required: k paired counts; 1 split (sum). Remarks: Range of middle distances.
Example 19. Aim: Homogeneity of k proportions (k large). Plotting required: k paired counts; 1 split (sum); 4 parallels (±5 mm, ±10 mm). Remarks: 12 outside + 3 between − 2 inside; 5%, 5.65√k; 1%, 8√k. Use middle distances.
Example 20. Aim: Homogeneity of k unsymmetrical proportions. Plotting required: As 17 or 19 with large count divided. Remarks: As 17 or 19 with distances in undivided direction.
Example 21. Aim: Four-fold table. Plotting required: 2 paired counts; 1 split (sum). Remarks: Range from middle distances.

Table 2. Millimeter table for normal deviate

Significance level          Normal deviate
one-sided    two-sided      millimeters    multiples of σ
50%          100%            0.0           0.0
40%          80%             1.3            .25
30%          60%             2.7            .52
20%          40%             4.3            .84
16.5%        33.2%           5.0            .97
10%          20%             6.5           1.28
5%           10%             8.4           1.65
2.5%         5%             10.0           1.96
1%           2%             11.8           2.33
0.5%         1%             13.1           2.58
0.1%         0.2%           15.7           3.09

Conversion relation          5.080         1.000
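The two deviate columns of Table 2 are linked by the conversion relation 5.080 mm = 1.000σ. The short Python sketch below applies that conversion and compares the result with the tabled millimeter entries. The function and variable names are ours, and the σ values are simply copied from the table. Agreement is only checked to within 0.1 mm, since both columns are rounded and the millimeter column may have been computed from less heavily rounded deviates.

```python
MM_PER_SIGMA = 5.080  # conversion relation given at the foot of Table 2


def sigma_to_mm(z: float) -> float:
    """Convert a normal deviate in multiples of sigma to millimeters on the paper."""
    return z * MM_PER_SIGMA


def mm_to_sigma(mm: float) -> float:
    """Convert a distance measured in millimeters to multiples of sigma."""
    return mm / MM_PER_SIGMA


# (millimeters, multiples of sigma) pairs as printed in Table 2
TABLE_2 = [
    (0.0, 0.0), (1.3, 0.25), (2.7, 0.52), (4.3, 0.84), (5.0, 0.97),
    (6.5, 1.28), (8.4, 1.65), (10.0, 1.96), (11.8, 2.33), (13.1, 2.58),
    (15.7, 3.09),
]

if __name__ == "__main__":
    for mm, z in TABLE_2:
        print(f"{z:5.2f} sigma -> {sigma_to_mm(z):6.2f} mm (table: {mm:4.1f} mm)")
```

In practice the second function is the one used at the drawing board: a distance read off the paper with a millimeter scale is divided by 5.080 to obtain the normal deviate.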


Table 3. Millimeter table for chi-square

               Undoubled millimeters        Multiples of σ²
Degrees of     at an upper significance level of
freedom        (50%)     5%      1%        (50%)     5%      1%
 1             (3.4)    10.0    13.1        (.5)     3.8     6.6
 2             (6.0)    12.4    15.4       (1.4)     6.0     9.2
 3             (7.8)    14.2    17.1       (2.4)     7.8    11.3
 4             (9.3)    15.6    18.5       (3.4)     9.5    13.3
 5            (10.6)    16.9    19.7       (4.4)    11.1    15.1
 6            (11.8)    18.0    20.8       (5.3)    12.6    16.8
10            (15.5)    21.8    24.4       (9.3)    18.3    21.2
15            (19.3)    25.4    28.1      (14.3)    25.0    30.6
30            (27.5)    33.6    36.2      (29.3)    43.8    50.9

Table 4. Millimeter table for normal ranges

               Millimeters                  Multiples of σ
Number of      at an upper significance level of
observations   (50%)     5%      1%        (50%)     5%      1%
 2             (4.8)    14.1    18.5      (0.95)    2.77    3.64
 3             (8.1)    16.9    20.8      (1.59)    3.34    4.10
 4            (10.0)    18.5    22.2      (1.97)    3.65    4.38
 5            (11.5)    19.6    23.2      (2.26)    3.87    4.59
 6            (12.6)    20.5    24.1      (2.47)    4.04    4.74
 7            (13.5)    21.2    24.8      (2.65)    4.18    4.87
 8            (14.2)    21.8    25.3      (2.79)    4.29    4.98
 9            (14.8)    22.3    25.8      (2.91)    4.39    5.07
10            (15.4)    22.7    26.2      (3.02)    4.48    5.15
15            (17.4)    24.2    27.6      (3.42)    4.79    5.44
30            (18.8)    25.4    28.6      (3.69)    5.01    5.64

References

[1] American Society for the Testing of Materials, “Manual on Presentation of Data,” Philadelphia (1940).
[2] M.S. Bartlett, “The square root transformation in analysis of variance,” Suppl. J. Roy. Stat. Soc., Vol. 3 (1936) pp. 68–78.
[3] M.S. Bartlett, “The use of transformations,” Biometrics, Vol. 3 (1947) pp. 39–52.
[4] W.J. Dixon and A.M. Mood, “The statistical sign test,” Journal of the American Statistical Association, Vol. 41 (1946) pp. 557–566.
[5] Acheson J. Duncan, “Detection of non-random variation when size of sample varies,” Industrial Quality Control, Vol. 4 (1947–48) No. 4, pp. 9–12.


[6] Churchill Eisenhart, “The assumptions underlying the analysis of variance,” Biometrics, Vol. 3 (1947) pp. 1–21.
[7] J.P. English, F.A. Willius, and J. Berkson, “Tobacco and coronary disease,” J. Am. Med. Assn., Vol. 115 (1940) pp. 1327–1328.
[8] C.O. Fawcett, et al., “A second study of the variation and correlation of the human skull, with special reference to the Naqada Crania,” Biometrika, Vol. 1 (1902) pp. 408–467.
[9] R.A. Fisher, “On the dominance ratio,” Proc. Roy. Soc. Edinburgh, Vol. 42 (1922) pp. 321–341.
[10] R.A. Fisher and K. Mather, “The inheritance of style length in Lythrum salicaria,” Annals of Eugenics, Vol. 12 (1943) pp. 1–23.
[11] R.A. Fisher and K. Mather, “A linkage test with mice,” Annals of Eugenics, Vol. 7 (1936) pp. 265–280.
[12] Frederick Mosteller and John W. Tukey (designers), Binomial Probability Paper, Codex Book Company, Norwood, Mass., 1946.
[13] Statistical Research Group, Columbia University, Selected techniques of statistical analysis, McGraw-Hill, 1947.

Reprinted from Science (1949), 109, pp. 553–558

5. The Education of a Scientific Generalist

Hendrik Bode, Frederick Mosteller, John W. Tukey, and Charles Winsor

Physical Research Department, Bell Telephone Laboratories; Department of Social Relations, Harvard University; Department of Mathematics, Princeton University; and Department of Biostatistics of the School of Hygiene and Public Health, The Johns Hopkins University

The complexities of modern science and modern society have created a need for scientific generalists, for men trained in many fields of science. To educate such men efficiently would require modified courses and new ones. However, a good beginning can be made now with courses which are available in many colleges and universities. One such program is set forth here. General Considerations The central problem. Scientific and technological advances have made the world we live in complex and hard to understand. We have today large scale division of labor, complex and indirect methods of production and distribution, large communities and large areas held together by common channels of transport and communication, and operation with small margins of safety, requiring close and delicate control. All these complex and delicate activities produce scientific and technological problems of great difficulty. Science itself shows the same growing complexity. We often hear that “one man can no longer cover a broad enough field” and that “there is too much narrow specialization.” And yet these complexities must be met—and resolved. At all levels, decisions must be made which involve consideration of more than a single field. These difficulties are most pressing in the borderline fields like physical chemistry, chemical physics, biophysics, biochemistry, high polymers, and the application of chemistry, physics, and mathematics to medicine. It is here that both the challenge of the problems and the difficulties arising from overspecialization are greatest. We need a simpler, more unified approach to scientific problems, we need men who can practice science—not a particular science—in a word, we need scientific generalists. Research teams, aggregations of men of diverse skills working on the various aspects of single problems, have been widely used and have accomplished much. Their use is certain to continue and expand. 
But the research team must have a leader to unify the group, whether he be director, coordinator, or
advisor. This leader must work as a scientific generalist, and we feel he would function better if trained with this in mind. There is a clear remedy for these complexities, both in education and in science, but its use will involve work and time. We must use the methods of description and model-construction which, in the individual sciences, have made the partial syntheses we call organic chemistry, sensory psychology, and cultural anthropology. We must use these methods on the sciences collectively so that, eventually, one can learn science, and not sciences. We suggest that an attempt to unify science may reasonably start from the following ideas: 1. All science began as part of “natural philosophy” and radiated outward. (Even in this modern day, it should be possible to recapture the universalist spirit of the early natural philosophers.) 2. Scientific method is common to all sciences. (The difficulty that almost all scientists have in defining scientific method does not lessen the importance of this fact.) 3. Almost every science is more easily taught by using some of the equipment of the others. (This is generally admitted to be true for mathematics in physics, less generally for chemistry and physics in biology, to name two examples.) 4. Statistics, as the doctrine of planning experiments and observations and of interpreting data, has a common relation to all sciences. 5. Unification will be more easily attained if the logical framework of the individual sciences can be identified and isolated from their factual content. The isolation of the logical framework of the sciences is a long overdue first step toward a synthesis of science. What is the logical framework of organic chemistry? Or equivalently, what are the characteristic ways in which a good organic chemist thinks and works? There is no place where a student can learn this directly, no place where it is set forth clearly, freed of as many detailed facts as possible. 
A second step would be the development of courses within individual sciences which really make use of the material and equipment from the other sciences. A synthesis unifying the sciences is, at best, a long and difficult task. It is a problem that will take time to solve and we do not have a facile answer to it. We shall consider ways in which potential scientific generalists might, as undergraduates, be given the kind of background that would enable them to develop into real scientific generalists. Ours is frankly only an interim proposal; it makes no real synthesis. Yet, since the student will go deeply into parts of many sciences, he will learn something of the habits of mind of the chemist, psychologist, and geologist. These habits, and not subject matter, are what distinguish the sciences—for how else can we distinguish the chemical physicist from the physical chemist, the mathematical biologist from the biomathematician!


Relation to Other Programs

Generalists and scientific generalists. By confining ourselves to scientific generalists, we do not intend to undervalue the need of true generalists, trained in all disciplines, science among them. Science should be able to look to such true generalists for considered judgment about the fields where the greatest speed of development is needed, and then to the scientific generalist for help in attaining that speed. Acting alone, scientists are not competent to plan the training of the true generalists, so that, in spite of the great need of such men, it would be inappropriate for us here to attempt to lay out a program for their training. Nonetheless, some aspects of the scientific generalist’s training bear on theirs.

Liberal education. We do not believe that a satisfactory synthesis of the modern world can be achieved without incorporating within it science and the scientific method, which have had a major share in shaping that world, and are still forcing the pace of change. First, however, we must develop some satisfactory synthesis of science by itself, else how can we hope for the greater synthesis of a liberal education, in which science plays so large a part?

Moreover, it seems clear that scientific methods of description and model-building are in many ways the most efficient intellectual techniques yet devised for covering a broad field quickly. Since any satisfactory program of liberal education involves covering an immense mass of material in a limited time, the most efficient techniques must be used. We lay stress in the generalist’s education upon these techniques, and in particular upon taking advantage of the efficiency to be gained by using in each science some of the equipment of other sciences. We feel that similar efficiency could be gained in history, philosophy, and the “nonscientific” fields by using these techniques. We are not historians or philosophers.
We can only say that we think historians and philosophers who are also scientific generalists seem most likely to begin this task. Finally, we believe that while the scientific generalist’s education is not intended primarily as a general liberal education, it may be a temporary approximation to one. The program we outline has room for enough nonscientific courses to meet the going minimum standards of a liberal education as well as the program for a scientific specialist does today. A true generalist would, of course, require a much broader program. In spite of this, a college graduate with an education based upon the generalist’s program might well be a better lawyer, businessman, or teacher of high school science than one with a classic liberal education or one with specialized training in a single field. The Scientific Generalist What is he? By a generalist—and we shall not bother to specify every time that we mean scientific generalist—we mean a man with training and a working sense in many fields of physical and biological science. His principal interests may be broad or sharply defined, but he is exceptional in his breadth


of appreciation. He may be working in pure science, or in the application of science to engineering, business, or industry. He may not be as good a physicist as a student or research man who has been trained principally in physics, or as good a geneticist as the biologist who is trained in genetics, or as good an economist or engineer or psychologist as a specialist in those fields. But he has learned enough of these fields, and of the central tools of mathematics and statistics, to bring to problems of almost any kind the ideas and broad tools of any combination of these many subjects that will speed and improve the work. We are interested here in the education of a generalist during his four undergraduate years. These four years will not complete his education, any more than four years complete the education of a chemist, a psychologist, or an economist. After these undergraduate years, the budding generalist will go on with further study, presumably in some special field. The aim of the four undergraduate years is to give him a broad foundation and to open his mind to a wide range of scientific fields. Eventually, perhaps, there will be graduate training as a generalist, but until that day comes, specialized graduate study must serve. Who needs him? First, any research group needs a generalist, whether it is an institutional group in a university or a foundation, or an industrial group working directly on industrial problems. If the problems are broad enough to require a group instead of a few isolated researchers, then there will be a place for a generalist. Many groups who now have generalists or near generalists do not recognize them as such, but think of them in terms of their specialties. In addition, any first-class administrator or policymaker in fields related to science must be a generalist to a considerable degree, if only to foresee external influences which might rock the boat. 
For example, the production of plastics has radically affected almost all enterprises based on the manufacture of small intricate parts. A good administrator needs a generalist’s background in appraising the possible effects of going research, estimating its time of fruition, and judging the date and intensity of its ultimate impact on his organization.

What would he do? Many illustrations of the contributions a generalist can make to a research group can be drawn from the recent war. The ability to isolate critical elements, to establish the essentials of the logical framework, to reduce the problem to a few critical issues, is essential in handling problems of military engineering and operations analysis; it is also the ability that the scientist uses every day in his own work. The scientists who were able to carry over and apply their methods of thought to other fields were able to assist largely in the solution of military problems. By working as generalists these scientists distinguished themselves from the specialists, who found themselves baffled and uncomfortable when confronted with unfamiliar and ill-defined issues.

The problems of social engineering and economic engineering are similar to those of military engineering in their need for immediate action, their confusing variety of aspects, the lack of definition of their issues, and, often, the lack of basic knowledge appropriate to their solution. It seems not too much to hope that scientific generalists, amateur and professional, can do much toward developing these fields and solving many pressing problems.

In an engineering group the generalist would naturally be concerned with systems problems. These problems arise whenever parts are made into a balanced whole, balanced so as to serve an end efficiently. It may be weights that are balanced, or sizes, or complexities of component pieces of mechanism, or expense, effort, or research time applied to different phases of the problem.

In the social sciences, the generalist would provide background in physical science and in scientific inference, and experience in the analysis of data and in use of mathematical methods and techniques, which together seem essential for that rapid development of social science which we all agree is now so urgently needed. The generalist would assist in the construction, interpretation, and modification of mathematical models. He would examine previous findings from a fresh viewpoint. The generalist would be able to assist in the design of experiments, still a fairly weak spot in most of the social sciences. Incidentally, he would assist in the development of statistics by disclosing unsolved problems.

In biological and medical science he would provide efficient interpretation of data and design of experiment, and, what is most important, would suggest physical explanations or mathematical models for known or conjectured facts. The need of such useful and stimulating people is probably better recognized in this field than in any other that we know.

And finally, a student equipped with the undergraduate training of a generalist would have an ideal start toward becoming a consulting statistician, who must work with others on problems in many fields of science and technology. With this foundation, a graduate training in mathematical and theoretical statistics would produce the best beginning of a consulting statistician that we can plan today.

What is his background? Most of the small number of generalists that any institution might train during a year are, we think, going to have the following characteristics: They will be planning, definitely or tentatively, to go on to graduate study in some field. They will have interest and skill in science, fairly general and unspecialized. They will have met and mastered competition in high school (or preparatory school) and will have the self-confidence to attempt an unusual and challenging course in college. They will want to learn, and will be prepared to work hard. After a broad introduction to science, and after they have learned what the various fields of science are really like, and how the practitioners of these fields think and work, they will be able to make an intelligent choice between 1) some single field of science, 2) some borderline field between two or three sciences, and 3) the profession of being a generalist.


Educating the Generalist Now

Principles. There are certain principles that must govern any plan for educating a generalist, whether an interim plan using present-day courses or the efficient plan of the distant future. Some of these principles are general and will offend no one; others are specific and will bother some administrators and some scientists. Let us list three:

1. The pregeneralist must study many fields of science deeply enough to understand their logical framework and the approaches of their practitioners.
2. With the possible exception of a few tool subjects, the interest of the pregeneralist in factual information is definitely secondary.
3. Skipping prerequisites is to be encouraged.

The idea of skipping prerequisites may seem strange and shocking, but the generalist as such will always be working on problems for which he has not had the normal training. He needs to learn to work well under these adverse circumstances during his college years if he is to prepare for his job realistically. Such flexibility will make administrative complications, and will produce temporary discomfort for many instructors.

Program. We now give an illustrative program, which is specific enough about individual courses to make these principles explicit:

Assignment of 40 Semester Courses

Biology ............... 4          Statistics ............ 4
Chemistry ............. 4 or 5     English ............... 2
Geology ............... 1          Industrial processes .. 1 *
Mathematics ........... 6          Judging ............... 1 *
Physics ............... 6          Surveying ............. 1
Psychology ............ 2          Distribution .......... 8 or 7

* If these special courses are not available, the time freed is to be applied to independent work or to distribution.

Summer Work

1st: Summer surveying course listed above.
2nd: Completion of the language requirement of a broad reading knowledge in two important scientific languages (if possible by foreign study this summer).
3rd: Work on a research or development project involving real engineering problems.

Independent Work

Junior Year: Topic involving economic considerations.
Senior Year: Topic involving at least two fields and preferably three.


This program is a definite overload, since it combines 40 semester courses with independent work. But the pregeneralist must be an unusual student, building on high precollege attainment. Any student who does not understand what he has studied will be useless as a generalist, and will have achieved a “smattering of ignorance.” Since most schools allow students to limp along with low grades, the pregeneralist’s standing must be a full grade higher than that required in other programs. Besides carrying out his formal program, the pregeneralist should meet regularly with sympathetic and active graduate students and faculty members, for lunch, perhaps, or to drink beer. The regular courses. We take first the bulk of the program, reserving the special courses in English, industrial processes, and judging for a separate discussion. The “distribution” courses will have real value for the generalist. History, philosophy, economics, the humanities, and the social sciences come here. Some educators might emphasize the need for a course in social change and cultural lag, others would put the emphasis elsewhere. We cannot plan these distribution courses now; their planning should involve all departments, and should probably not be too rigid. Next come the courses in various fields of science. While specific courses are stated, it is always permissible to replace any course by a more advanced course in the same field. In biology, where we include paleontology, paleobotany, physiological psychology, and the like, there are to be four courses. What shall they be? We do not feel prepared to make really detailed suggestions, and can only point out that it is essential that not all of the courses be at the elementary or near elementary level. It has been suggested by one of our friends (K.W. 
Cooper) that if four semesters of work in biology were to be given to such students, the most desirable plan might be: (1) cellular biology or general physiology (from the viewpoint of transfer of energy, intermediary metabolism, physiology), (2) comparative anatomy (from the developmental and truly comparative point of view), (3) the problems of genetics (persistence and change of hereditary patterns), and (4) evolution and the evolutionary record. If such a program were available at any institution, we would gladly recommend it as one good set of four courses for the training of a generalist. The four or five courses in chemistry we propose must give the future generalist a bird’s-eye view of a broad field. Hence, we advocate a large amount of skipping around: one or two semester courses in elementary chemistry, one semester in organic chemistry, one semester in physical chemistry, and one semester or none in an advanced elective. To the chemist, the idea of taking one semester of organic chemistry and then jumping to one semester of physical chemistry may seem strange, wild, and unwarranted. But the student generalist wishes to learn how a chemist thinks, how problems are approached, and in what general direction he may learn things once they seem necessary or useful in a particular problem. One semester of organic chemistry (whatever


classes of compounds are discussed) and one semester of physical chemistry (whatever branches are included) will do much to orient him and give him a basis on which he can build later as needed. He will know enough physics and mathematics to learn physical chemistry. Only one course is listed in the field of geology, since paleontology, paleobotany, and the like were included under biology. Just which course is to be taken can safely be left to the choice of the student. In mathematics there are standard courses available: four semester courses or less in elementary mathematics through calculus and differential equations, one semester in complex variables, and one or more semesters of advanced electives. These courses seem to need little comment. In physics (where we include astronomy) the situation is also simple, for the courses which should be taken are available at most institutions. Here we suggest: two semester courses in elementary physics, one or two semesters in electricity (circuits and waves), one semester in physical measurements, and one or two semesters of advanced electives. We feel that a knowledge of electrical circuits and the associated knowledge of the uses and behavior of electron tubes would be an essential tool for a generalist in almost all the fields that interested him. The other courses seem to us to require little explanation. Two courses are proposed in psychology: one semester course in experimental psychology, and one in systematic psychology. Experimental psychology will include laboratory work. Systematic psychology will be a study of psychological problems and theories, with some emphasis on methodology and classic experiments. It will provide a theoretical interpretation for the experimental course. Here the essentials seem to be (1) learning how the psychologist approaches his problems, and (2) acquiring an appreciation of the complications and difficulties which the human observer always introduces into any experiment or study. 
In statistics we propose four semester courses, which can probably be found at a few institutions where the training in statistics has been well organized and developed: one or two semester courses in elementary mathematical statistics, one semester in design and analysis of experiment with practical work, and one or two semesters in advanced mathematical statistics, including multivariate methods and the use of order statistics. Because statistics is one of the main tools for applying the quantitative method to any field of science, we feel that four semester courses are by no means too many. We feel that statistics is a distinct field, rather than a branch of mathematics under a new name. English has been allotted two semester courses in composition. These should be, primarily, courses in exposition. We should prefer to have one a freshman course in “How to Say What You Mean,” orally and in writing. The other should be a junior or senior course in writing technical reports, papers, and expository accounts, with emphasis on getting the results and essential spirit across to nontechnical readers.


The proposed inclusion of a course in surveying in this curriculum has led to considerable discussion, and it seems worth while to make very clear what values we hope might be obtained from it, particularly if it were taken just before the beginning of the freshman year. It would provide students with experience in physical measurements for which there are checks (the traverse must “close”), and give them opportunity to observe the customary processes of handling data. It is not essential that the student become a good surveyor, but it is very much to his advantage to learn to recognize a man who is good with tools, and good at measuring. The special courses. There are two courses that we think should be added to present offerings in order to improve the interim training of the generalist. The first could be given tomorrow at many institutions. The other, here somewhat obscurely entitled “judging,” would be very difficult for anyone to teach. The course in industrial and shop processes should summarize how things are actually made. It could be entirely descriptive, since it is to train generalists, not engineers, and therefore one semester would suffice. It should, for example, describe forging and milling, the function of a turret lathe, the kinds of heat treating used and their effects, what an industrial still looks like, and how it operates—in other words, teach him the unit operations of mechanical, electrical, and chemical engineering. The generalist needs some knowledge of how materials are manipulated. Somewhere in the curriculum, probably in the senior year, there should be a course in judging, guessing, and the scientific method. This course is needed not only by the generalist, but by many other scientists. We have no way now to encourage or require a man to bring his education and intelligence to bear on estimation and prediction problems for which he has inadequate information. As a result, scientists often become stuffy and narrow in their views. 
To meet this need, we propose a course in educated, intelligent guessing. It would be principally a laboratory course, in which essentially impossible problems were put to the student, who would be required to supply answers and estimates of their trustworthiness, on the basis of the inadequate data given plus what he knows about the world. The time limits for these answers would vary. Some problems would be done by individuals; others by groups. There would be discussion periods after solutions were submitted. The main difficulty with this course would be finding an instructor equipped to teach it. None of these four special courses is essential to the training of a generalist, but all of them would be extremely helpful. We believe their value to other students as well as to generalists would make them worth adding to the offerings of most colleges. Another generally useful course which we would place next in importance for the pregeneralist is one in “intellectual techniques.” This should supply the student with tools which would help him to ingest and digest a large amount of material in a limited time. It should include rapid and effective reading, quick ways of using a library, and what is known about efficient methods of



study and learning. Such a course should come in the freshman year if possible. Some of us feel that introductory psychology is a good facsimile, others feel that it does not concentrate enough on the tool aspect to meet the need. Installing the program. Before any new program can be introduced into an American college or university without objection it must be shown to fill certain needs. For the next decade colleges will ask: Does this program meet the distribution requirements? Does this program fit into the scheme of departmental concentration or majoring? Distribution requirements vary from institution to institution, and are in the process of changing at many institutions. We have compared the proposed program with the distribution requirements at Harvard and at Princeton, since we are most familiar with these universities. There the distribution requirements would easily be met. It seems reasonable to conclude that this program would be consistent with the distribution requirements of a substantial number of institutions. Next, there must be a home for the pregeneralist. Some department must be prepared to accept his widely distributed work as concentration in that department. Or a new interdepartmental program must be set up! Finding a home may be difficult. There will be difficulty with individual scientific departments wherever such a program is proposed; each such department has a natural desire to have these very good students concentrate in its department. Finally, there will be objections from administrators about any program requiring students to take courses without the usual prerequisites. These difficulties of installation, however, are merely details, and such a program can be started. The point of view we have tried to expound might be summarized thus: Science is complex; yet it must become manageable. It can be managed better with the help of a few scientists with training in many sciences. 
A few such scientific generalists can be trained tomorrow with the courses at hand. To make science more manageable, we must perform a new and difficult synthesis on a higher level of organization.

Reprinted from Psychometrika (1951), 16, pp. 3–9

6. Remarks on the Method of Paired Comparisons: I. The Least Squares Solution Assuming Equal Standard Deviations and Equal Correlations

Frederick Mosteller
Harvard University

Thurstone’s Case V of the method of paired comparisons assumes equal standard deviations of sensations corresponding to stimuli and zero correlations between pairs of stimuli sensations. It is shown that the assumption of zero correlations can be relaxed to an assumption of equal correlations between pairs with no change in method. Further the usual approach to the method of paired comparisons Case V is shown to lead to a least squares estimate of the stimulus positions on the sensation scale.

1. Introduction. The fundamental notions underlying Thurstone’s method of paired comparisons [4] are these: (1) There is a set of stimuli which can be located on a subjective continuum (a sensation scale, usually not having a measurable physical characteristic). (2) Each stimulus when presented to an individual gives rise to a sensation in the individual. (3) The distribution of sensations from a particular stimulus for a population of individuals is normal. (4) Stimuli are presented in pairs to an individual, thus giving rise to a sensation for each stimulus. The individual compares these sensations and reports which is greater. (5) It is possible for these paired sensations to be correlated. (6) Our task is to space the stimuli (the sensation means), except for a linear transformation. There are numerous variations of the basic materials used in the analysis— for example, we may not have n different individuals, but only one individual who makes all comparisons several times; or several individuals may make all comparisons several times; the individuals need not be people. 

This research was performed in the Laboratory of Social Relations under a grant made available to Harvard University by the RAND Corporation under the Department of the Air Force, Project RAND.


Furthermore, there are “cases” to be discussed: for example, shall we assume all the intercorrelations equal, or shall we assume them zero? Shall we assume the standard deviations of the sensation distributions equal or not? The case which has been discussed most fully is known as Thurstone’s Case V. Thurstone has assumed in this case that the standard deviations of the sensation distributions are equal and that the correlations between pairs of stimulus sensations are zero. We shall discuss a standard method of ordering the stimuli for this Case V. Case V has been employed quite frequently and seems to fit empirical data rather well in the sense of reproducing the original proportions of the paired comparison table. The assumption of equal standard deviations is a reasonable first approximation. We will not stick to the assumption of zero correlations, because this does not seem to be essential for Case V.

2. Ordering Stimuli with Error-Free Data. We assume there are a number of objects or stimuli, $O_1, O_2, \cdots, O_n$. These stimuli give rise to sensations which lie on a single sensation continuum $S$. If $X_i$ and $X_j$ are single sensations evoked in an individual $I$ by the $i$th and $j$th stimuli, then we assume $X_i$ and $X_j$ to be jointly normally distributed for the population of individuals with

$$\begin{aligned}
\text{mean of } X_i &= S_i & (i &= 1, 2, \cdots, n) \\
\text{variance of } X_i &= \sigma^2(X_i) = \sigma^2 & (i &= 1, 2, \cdots, n) \\
\text{correlation of } X_i \text{ and } X_j &= \rho_{ij} = \rho & (i, j &= 1, 2, \cdots, n).
\end{aligned} \tag{1}$$

The marginal distributions of the Xi ’s appear as in Figure 1. The figure indicates the possibility that X2 < X1 , even though S1 < S2 . In fact this has to happen part of the time if we are to build anything more than a rank-order scale.

Figure 1 The Marginal Distributions of the Sensations Produced by the Separate Stimuli in Thurstone’s Case V of the Method of Paired Comparisons.

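An individual $I$ compares $O_i$ and $O_j$ and reports whether $X_i > X_j$ or $X_i < X_j$ (no ties are allowed). The probability that $X_i$ is reported greater is

$$p_{ij} = P(X_i > X_j) = \frac{1}{\sqrt{2\pi}\,\sigma(d_{ij})} \int_0^{\infty} \exp\left\{ -\frac{[d_{ij} - (S_i - S_j)]^2}{2\sigma^2(d_{ij})} \right\} \, dd_{ij}, \tag{2}$$

where $d_{ij} = X_i - X_j$, and $\sigma^2(d_{ij}) = 2\sigma^2(1 - \rho)$. There will be no loss in generality in assigning the scale factor so that

$$2\sigma^2(1 - \rho) = 1. \tag{3}$$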

It is at this point that we depart slightly from Thurstone, who characterized Case V as having equal variances and zero correlations. However, his derivations only assume the correlations are zero explicitly (and artificially), but are carried through implicitly with equal correlations (not necessarily zero). Actually this is a great easing of conditions. We can readily imagine a set of attitudinal items on the same continuum correlated .34, .38, .42, i.e., nearly equal. But it is difficult to imagine them all correlated zero with one another. Past uses of this method have all benefited from the fact that items were not really assumed to be uncorrelated. It was only stated that the model assumed the items were uncorrelated, but the model was unable to take cognizance of the statement. Guttman [2] has noticed this independently.

With the scale factor chosen in equation (3), we can rewrite equation (2)

$$p_{ij} = \frac{1}{\sqrt{2\pi}} \int_{-(S_i - S_j)}^{\infty} e^{-\frac{1}{2} y^2} \, dy. \tag{4}$$

From (4), given any $p_{ij}$ we can solve for $-(S_i - S_j)$ by use of a normal table of areas. Then if we arbitrarily assign as a location parameter $S_1 = 0$, we can compute all other $S_i$. Thus given the $p_{ij}$ matrix we can find the $S_i$. The problem with fallible data is more complicated.

3. Paired Comparison Scaling with Fallible Data. When we have fallible data, we have $p'_{ij}$ which are estimates of the true $p_{ij}$. Analogous to equation (4) we have

$$p'_{ij} = \frac{1}{\sqrt{2\pi}} \int_{-D'_{ij}}^{\infty} e^{-\frac{1}{2} y^2} \, dy, \tag{5}$$

where the $D'_{ij}$ are estimates of $D_{ij} = S_i - S_j$. We merely look up the normal deviate corresponding to $p'_{ij}$ to get the matrix of $D'_{ij}$. We notice further that the $D'_{ij}$ need not be consistent in the sense that the $D_{ij}$ were; i.e.,


$$D_{ij} + D_{jk} = S_i - S_j + S_j - S_k = D_{ik}$$

does not hold for the $D'_{ij}$. We conceive the problem as follows: from the $D'_{ij}$ to construct a set of estimates of the $S_i$’s called $S'_i$, such that

$$\Sigma = \sum_{i,j} [D'_{ij} - (S'_i - S'_j)]^2 \text{ is to be a minimum.} \tag{6}$$

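It will help to indicate another form of solution for nonfallible data. One can set up the $S_i - S_j$ matrix:

MATRIX OF $S_i - S_j$

$$\begin{array}{c|ccccc}
 & 1 & 2 & 3 & \cdots & n \\ \hline
1 & S_1 - S_1 & S_1 - S_2 & S_1 - S_3 & \cdots & S_1 - S_n \\
2 & S_2 - S_1 & S_2 - S_2 & S_2 - S_3 & \cdots & S_2 - S_n \\
3 & S_3 - S_1 & S_3 - S_2 & S_3 - S_3 & \cdots & S_3 - S_n \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
n & S_n - S_1 & S_n - S_2 & S_n - S_3 & \cdots & S_n - S_n \\ \hline
\text{Totals} & \sum S_i - nS_1 & \sum S_i - nS_2 & \sum S_i - nS_3 & \cdots & \sum S_i - nS_n \\
\text{Means} & \bar{S} - S_1 & \bar{S} - S_2 & \bar{S} - S_3 & \cdots & \bar{S} - S_n
\end{array}$$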

Now by setting $S_1 = 0$, we get $S_2 = (\bar{S} - S_1) - (\bar{S} - S_2)$, $S_3 = (\bar{S} - S_1) - (\bar{S} - S_3)$, and so on. We will use this plan shortly for the $S'_i$.

If we wish to minimize expression (6) we take the partial derivative with respect to $S'_i$. Since $D'_{ij} = -D'_{ji}$ and $S'_i - S'_j = -(S'_j - S'_i)$ and $D'_{ii} = S'_i - S'_i = 0$, we need only concern ourselves with the sum of squares from above the main diagonal in the $D'_{ij} - (S'_i - S'_j)$ matrix, i.e., terms for which $i < j$. Differentiating with respect to $S'_i$ we get:

$$\frac{\partial(\Sigma/2)}{\partial S'_i} = 2\left[ \sum_{j=1}^{i-1} (D'_{ji} - S'_j + S'_i) - \sum_{j=i+1}^{n} (D'_{ij} - S'_i + S'_j) \right] \quad (i = 1, 2, \cdots, n). \tag{7}$$

Setting this partial derivative equal to zero, we have

$$S'_1 + S'_2 + \cdots + S'_{i-1} - (n-1)S'_i + S'_{i+1} + \cdots + S'_n = \sum_{j=1}^{i-1} D'_{ji} - \sum_{j=i+1}^{n} D'_{ij} \quad (i = 1, 2, \cdots, n), \tag{8}$$

but $D'_{ij} = -D'_{ji}$, and $D'_{ii} = 0$; this makes the right side of (8)

$$\sum_{j=1}^{i-1} D'_{ji} + D'_{ii} + \sum_{j=i+1}^{n} D'_{ji} = \sum_{j=1}^{n} D'_{ji}.$$

Thus (8) can be written

$$\sum_{j=1}^{n} S'_j - nS'_i = \sum_{j=1}^{n} D'_{ji} \quad (i = 1, 2, \cdots, n). \tag{9}$$

The determinant of the coefficients of the left side of (9) vanishes. This is to be expected because we have only chosen our scale and have not assigned a location parameter. There are various ways to assign this location parameter, for example, by setting $\bar{S}' = 0$ or by setting $S'_1 = 0$. We choose to set $S'_1 = 0$. This means we will measure distances from $S'_1$. Then we try the solution (10), which is suggested by the similarity of the left side of (9) to the total column in the matrix of $S_i - S_j$:

$$S'_i = \sum_{j=1}^{n} D'_{j1}/n - \sum_{j=1}^{n} D'_{ji}/n. \tag{10}$$

Notice that when $i = 1$, $S'_i = 0$ and that

$$\sum_{i=1}^{n} S'_i = \sum_{j=1}^{n} D'_{j1}$$

because

$$\sum_{i} \sum_{j} D'_{ji} = 0,$$

which happens because every term and its negative appear in this double sum. Therefore, substituting (10) in the left side of (9) we have

$$\sum_{j=1}^{n} D'_{j1} - n\left[ \sum_{j=1}^{n} D'_{j1}/n - \sum_{j=1}^{n} D'_{ji}/n \right] = \sum_{j=1}^{n} D'_{ji}, \tag{11}$$

which is an identity, and the equations are solved. Of course, any linear transformation of the solutions is equally satisfactory.

The point of this presentation is to provide a background for the theory of paired comparisons, to indicate that the assumption of zero correlations is unnecessary, and to show that the customary solution to paired comparisons is a least squares solution in the sense of condition (6). That this is a least squares solution seems not to be mentioned in the literature although it may have been known to Horst [3], since he worked closely along these lines.

This least squares solution is not entirely satisfactory because the $p'_{ij}$ tend to zero and unity when extreme stimuli are compared. This introduces unsatisfactorily large numbers in the $D'_{ij}$ table. This difficulty is usually met by


excluding all numbers beyond, say, 2.0 from the table. After a preliminary arrangement of columns so that the $S'_i$ will be in approximately proper order, the quantity

$$\sum (D'_{ij} - D'_{i,j+1})/k$$

is computed, where the summation is over the $k$ values of $i$ for which entries appear in both column $j$ and $j + 1$. Then differences between such means are taken as the scale separations (see for example Guilford’s discussion [1] of the method of paired comparisons). This method seems to give reasonable results. The computations for methods which take account of the differing variabilities of the $p'_{ij}$ and therefore of the $D'_{ij}$ seem to be unmercifully extensive.

It should also be remarked that this solution is not entirely a reasonable one because we really want to check our results against the original $p'_{ij}$. In other words, a more reasonable solution might be one such that once the $S'_i$ are computed we can estimate the $p'_{ij}$ by $p''_{ij}$, and minimize, say,

$$\sum (p'_{ij} - p''_{ij})^2$$

or perhaps

$$\sum \left( \arcsin \sqrt{p'_{ij}} - \arcsin \sqrt{p''_{ij}} \right)^2.$$

Such a thing can no doubt be done, but the results of the author’s attempts do not seem to differ enough from the results of the present method to be worth pursuing.

References

1. Guilford, J. P. Psychometric Methods. New York: McGraw-Hill Book Co., 1936, 227–8.
2. Guttman, L. An approach for quantifying paired comparisons and rank order. Annals of math. Stat., 1946, 17, 144–163.
3. Horst, P. A method for determining the absolute affective values of a series of stimulus situations. J. educ. Psychol., 1932, 23, 418–440.
4. Thurstone, L. L. Psychophysical analysis. Amer. J. Psychol., 1927, 38, 368–389.

Reprinted from Psychometrika (1951), 16, pp. 203–206

7. Remarks on the Method of Paired Comparisons: II. The Effect of an Aberrant Standard Deviation When Equal Standard Deviations and Equal Correlations Are Assumed

Frederick Mosteller
Harvard University

If customary methods of solution are used on the method of paired comparisons for Thurstone’s Case V (assuming equal standard deviations of sensations for each stimulus), when in fact one or more of the standard deviations is aberrant, all stimuli will be properly spaced except the one with the aberrant standard deviation. A formula is given to show the amount of error due to the aberrant stimulus.

1. Introduction. In a previous article¹ we showed that the ordinary solution to Thurstone’s Case V of the method of paired comparisons was a least-squares solution. It was also pointed out that for Case V it was not necessary to assume that all correlations between stimulus sensations were zero; it was sufficient to assume the correlations were equal. Thurstone’s Case V assumes that all standard deviations of stimulus sensations are equal. In this article we will investigate the effect of an aberrant standard deviation on the Case V solution. We will deal with error-free data.

2. The Problem of the Aberrant Standard Deviation. As in the previous article, we suppose the objects $O_1, O_2, \ldots, O_n$ to have sensation means $S_1, S_2, \ldots, S_n$. We shall also assume

$$\begin{aligned}
&\text{the standard deviation of } X_i = \sigma_i = \sigma, \quad (i = 1, 2, \ldots, n - 1); \\
&\text{the standard deviation of } X_n = \sigma_n; \text{ and} \\
&\text{the correlation } \rho_{ij} = \rho.
\end{aligned} \tag{1}$$

This research was performed in the Laboratory of Social Relations under a grant made available to Harvard University by the RAND Corporation under the Department of the Air Force, Project RAND.

¹ Mosteller, Frederick. Remarks on the Method of Paired Comparisons: I. The Least Squares Solution Assuming Equal Standard Deviations and Equal Correlations. Psychometrika, 1951, 16, 3–9.


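In other words, the standard deviations are all equal except the one associated with $S_n$, and the correlations are all equal. Then we may define the matrix of differences between means in original standard deviation units as

$$X_{ij} = \frac{S_i - S_j}{\sqrt{\sigma_i^2 + \sigma_j^2 - 2\rho\sigma_i\sigma_j}}, \quad (i, j = 1, 2, \ldots, n); \tag{2}$$

or

$$X_{ij} = \frac{S_i - S_j}{\sqrt{2\sigma^2(1 - \rho)}}, \quad (i, j \neq n), \tag{3}$$

$$X_{in} = \frac{S_i - S_n}{\sqrt{\sigma^2 + \sigma_n^2 - 2\rho\sigma\sigma_n}}. \tag{4}$$

There is no loss of generality and great gain in convenience if we define

$$2\sigma^2(1 - \rho) = 1. \tag{5}$$

Also

$$\sigma^2 + \sigma_n^2 - 2\rho\sigma\sigma_n = \sigma_d^2. \tag{6}$$

Now if we write our $X_{ij}$ matrix we have:

$X_{ij}$ MATRIX

$$\begin{array}{c|ccccc}
 & 1 & 2 & 3 & \cdots & n \\ \hline
1 & S_1 - S_1 & S_1 - S_2 & S_1 - S_3 & \cdots & (S_1 - S_n)/\sigma_d \\
2 & S_2 - S_1 & S_2 - S_2 & S_2 - S_3 & \cdots & (S_2 - S_n)/\sigma_d \\
3 & S_3 - S_1 & S_3 - S_2 & S_3 - S_3 & \cdots & (S_3 - S_n)/\sigma_d \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
n & (S_n - S_1)/\sigma_d & (S_n - S_2)/\sigma_d & (S_n - S_3)/\sigma_d & \cdots & (S_n - S_n)/\sigma_d
\end{array}$$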

We will work out the least-squares solution much as described in the earlier article as if $\sigma_d$ were unity. That is, we behave as if the standard deviations are equal, as we would if we were experimenters using Case V. This merely involves summing the columns and averaging. From this matrix the total for the $i$th column is

$$\begin{aligned}
S_i^* &= \sum_{j=1}^{n-1} S_j - (n-1)S_i + (S_n - S_i)/\sigma_d, & (i &= 1, 2, \ldots, n - 1); \\
S_n^* &= \left( \sum_{j=1}^{n} S_j - nS_n \right) \Big/ \sigma_d, & (i &= n).
\end{aligned} \tag{7}$$


The $S_i^*$ are essentially estimates of the least-squares solution when the standard deviations are in fact equal. We can perform linear transformations on them without changing the symbol. Because the result would only be good to a linear transformation, we are allowed to subtract $\sum_{j=1}^{n-1} S_j$ from all these results, and then we temporarily set $S_n = 0$. This gives

$$\begin{aligned}
S_i^* &= -S_i[n - 1 + (1/\sigma_d)], \quad (i = 1, 2, \cdots, n - 1); \\
S_n^* &= \sum_{j=1}^{n-1} S_j[(1/\sigma_d) - 1].
\end{aligned} \tag{8}$$

We may change the scale factor assumed in equation (5) by multiplying through by

$$\frac{-1}{(n - 1) + 1/\sigma_d},$$

and this at last gives

$$\begin{aligned}
S_i^* &= S_i \quad (i = 1, 2, \ldots, n - 1), \\
S_n^* &= [(1 - 1/\sigma_d)/(n - 1 + 1/\sigma_d)] \sum_{j=1}^{n-1} S_j;
\end{aligned} \tag{9}$$

or, since $S_n = 0$,

$$S_n^* = [n(1 - 1/\sigma_d)/(n - 1 + 1/\sigma_d)]\bar{S}, \quad (S_n = 0). \tag{10}$$
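The steps leading from the $X_{ij}$ matrix to equation (9) can be checked numerically. The following Python sketch is an illustration only; the function name `case_v_with_aberrant` and its argument layout (aberrant stimulus listed last, with its true mean taken as zero) are assumptions introduced here.

```python
def case_v_with_aberrant(S, sigma_d):
    """Apply the equal-variance Case V column-sum solution to error-free data
    in which the last stimulus has an aberrant standard deviation.

    S       -- true means, aberrant stimulus last, with S[-1] == 0 (assumed)
    sigma_d -- standard deviation of differences involving the aberrant
               stimulus, in units where 2*sigma^2*(1 - rho) = 1
    """
    n = len(S)

    def x(i, j):
        # Entry of the X_ij matrix: differences involving stimulus n are
        # divided by sigma_d, all others are left unchanged.
        d = S[i] - S[j]
        return d / sigma_d if (i == n - 1 or j == n - 1) else d

    # Column totals, equation (7).
    totals = [sum(x(j, i) for j in range(n)) for i in range(n)]
    # Subtract the sum of the first n-1 means, then rescale, equations (8)-(9).
    shift = sum(S[:-1])
    scale = -1.0 / ((n - 1) + 1.0 / sigma_d)
    return [(t - shift) * scale for t in totals]
```

For any input satisfying the stated assumptions, the first $n - 1$ returned values equal the true means, and the last equals $[(1 - 1/\sigma_d)/(n - 1 + 1/\sigma_d)] \sum_{j=1}^{n-1} S_j$, in agreement with equation (9).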

The gratifying part of this result is that all the $S_i$ are properly spaced relative to one another except $S_n$. In other words, changing one of the standard deviations affects only the position of the object with the aberrant sensation standard deviation. We note, of course, that when $\sigma_d^2 = 1$ [i.e., $= 2\sigma^2(1 - \rho)$], $S_n^* = 0$ as anticipated. We also note that when the grand mean $\bar{S}$ is small, that is when $S_n$ is centrally located with respect to the other stimuli means, the effect of an aberrant stimulus is small. Thus if we have reason to believe that some particular object has a much different sensation variability from the rest, the other objects should be so chosen that the aberrant one is near the center of the scale, or else it should be excluded. If we suppose $\sigma_d > 1$, and $n$ of reasonable size, we may approximate $S_n^*$ by $S_n^* \approx (1 - 1/\sigma_d)\bar{S}$.

3. Examples

(a) Suppose the values $S_1, S_2, \ldots, S_6$ are −4, −2, 0, 1, 2, 3 and $S_3$ has a standard deviation different from the rest. Application of equation (9) shows that the spacing will be correct.


(b) With the same $S$ values as in Example (a), suppose $S_1$ has $\sigma_d = 2$. Then $S_2, \ldots, S_6$ will be properly scaled with values 2, 4, 5, 7, 10 (we must add 4 to all values because we take $S_1 = 0$) and

$$S_n^* = \frac{6(1 - \tfrac{1}{2})}{5 + \tfrac{1}{2}} \cdot \frac{28}{6} = 2.55$$

instead of zero.

(c) With the same $S$ values as in Example (a), suppose $S_1$ has $\sigma_d = \tfrac{1}{2}$. Then $S_2, \ldots, S_6$ will again be properly scaled as in Example (b), but $S_n^* = -4$ instead of zero.

4. Generalization to Several Aberrant Standard Deviations. Although it will not be shown here, the generalization to several aberrant standard deviations is immediate. If we have a set of objects $O_1, O_2, \ldots, O_n$, with variances $\sigma^2, \sigma^2, \sigma^2, \ldots, \sigma_{n-k}^2, \sigma_{n-k+1}^2, \ldots, \sigma_n^2$, then the standard method of solving paired comparisons, Case V, will leave those stimuli with equal variances appropriately spaced. Of course, there need to be at least three stimuli with equal variances for this result to be interesting or useful. It follows that if we have two or more sets of stimuli such that the standard deviations within each set are equal, each set will itself be properly spaced, but the sets will not be spaced or positioned correctly relative to one another. It is conceivable that in a practical situation a different method could be used for some of the measurements, so that we could get an estimate of the relative sizes of the sigmas and that this information could be useful in practice. Thurstone has already noticed that small changes in the sigmas do not affect the solution much.

Manuscript received 8/22/50.

Reprinted from Psychometrika (1951), 16, pp. 207–218

8. Remarks on the Method of Paired Comparisons: III. A Test of Significance for Paired Comparisons when Equal Standard Deviations and Equal Correlations Are Assumed

Frederick Mosteller
Harvard University

A test of goodness of fit is developed for Thurstone’s method of paired comparisons, Case V. The test involves the computation of $\chi^2 = n \sum (\theta'' - \theta')^2/821$, where $n$ is the number of observations per pair, and $\theta''$ and $\theta'$ are the angles obtained by applying the inverse sine transformation to the fitted and the observed proportions respectively. The number of degrees of freedom is $(k - 1)(k - 2)/2$.

1. Introduction

It would be useful in Thurstone’s method of paired comparisons to have a measure of the goodness of fit of the estimated proportions to the observed proportions. Ideally we might try to find estimates of the stimuli positions $S'_i$ such that we can reproduce the observed proportions $p'_{ij}$ as closely as possible in some sense. One kind of test might be based on

$$\chi^2 = \sum \frac{(p_{ij} - p''_{ij})^2}{\sigma_{ij}^2}$$

where $p''_{ij}$ is the estimate of $p_{ij}$ derived from the $S'_i$. But the true $p_{ij}$ are not known and would have to be replaced by the observed $p'_{ij}$. If one does replace the $p_{ij}$ by $p'_{ij}$ and $\sigma_{ij}$ by $\sigma'_{ij}$, then it is possible to fit the $S'_i$ by means of a minimum chi-square criterion. However, such a procedure calls for an iterative scheme and involves extremely tedious computations. An alternative method is suggested by the inverse sine transformation.

This research was performed in the Laboratory of Social Relations under a grant available to Harvard University by the RAND Corporation under the Department of the Air Force, Project RAND.


2. The model

It is assumed that we have a set of stimuli which, when presented to a subject, produce sensations. These sensations are assumed to be normally distributed, perhaps with different means. However the standard deviations of each distribution are assumed to be the same, and the correlations between pairs of stimuli sensations are assumed equal. Subjects are presented with pairs of stimuli and asked to state which member of each pair is greater with respect to some property attributed to all the stimuli (the property is the dimension of the scale we are trying to form). Our observations consist of the proportions of times stimulus $j$ is judged “greater than” stimulus $i$. We call these proportions $p'_{ij}$ to indicate that they are observations and not the true proportions $p_{ij}$.

From the observed proportions we compute normal deviates $X'_{ij}$ and proceed in the usual way [5] to estimate the stimulus positions, $S'_i$, on the sensation scale. Once the $S'_i$ are found we can retrace to get the fitted normal deviates $X''_{ij}$ and the fitted proportions $p''_{ij}$. Our problem is to provide a method for ascertaining how well the fitted $p''_{ij}$ agree with the observed $p'_{ij}$.

In such a test of significance involving goodness of fit, we are interested in knowing what the null hypothesis and the alternative hypothesis are. In the present case the null hypothesis is given by the model assumed above. However, the alternative hypothesis is quite general: merely that the null hypothesis is not correct. In particular, the null hypothesis assumes additivity so that if $D_{ij}$ is the distance from $S_i$ to $S_j$ and $D_{jk}$ is the distance from $S_j$ to $S_k$, we should find $D_{ik} = D_{ij} + D_{jk}$. If we do not have unidimensionality this additivity property will usually not hold. For example, consider the case of three stimuli with $S_1 < S_2 < S_3$. If the standard deviation of each distribution is the same, we might write

$$\begin{aligned}
D_{12} &= S_2 - S_1 \\
D_{13} &= S_3 - S_1 \\
D_{23} &= S_3 - S_2.
\end{aligned}$$
Since we can choose S1 = 0 and S2 = D12 , S3 from the second equation must be D13 . Finally D23 = D13 − D12 . Since each of our comparisons of stimuli is done independently it is not necessary that this relation hold either for the observations or for the theoretical values. Indeed the observed value of D23 could have conflicted with the assumption of additivity. Such a failure of additivity makes the fitting of the observed pij less likely, and on the average failure will increase the value of χ2 in our test.
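The statistic announced in the abstract, $\chi^2 = n \sum (\theta'' - \theta')^2/821$ with $(k - 1)(k - 2)/2$ degrees of freedom, can be sketched in Python as follows. This is a minimal illustration under assumptions made here: the function name `arcsine_chi_square` is hypothetical, the proportions are supplied as $k \times k$ matrices of which only the entries with $i < j$ are used, and every pair is assumed to have the same number of observations.

```python
import math


def arcsine_chi_square(p_obs, p_fit, n_per_pair):
    """Goodness-of-fit statistic based on the inverse sine transformation.

    p_obs, p_fit -- k x k matrices of observed and fitted proportions
                    (hypothetical layout; only entries with i < j are used)
    n_per_pair   -- number of observations per pair (assumed equal)
    Returns (chi_square, degrees_of_freedom).
    """
    k = len(p_obs)
    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            # Angles in degrees, theta = arcsin(sqrt(p)).
            theta_obs = math.degrees(math.asin(math.sqrt(p_obs[i][j])))
            theta_fit = math.degrees(math.asin(math.sqrt(p_fit[i][j])))
            total += (theta_fit - theta_obs) ** 2
    # The variance of theta in degrees is approximately 821 / n.
    chi_square = n_per_pair * total / 821.0
    degrees_of_freedom = (k - 1) * (k - 2) // 2
    return chi_square, degrees_of_freedom
```

The resulting value would then be referred to a chi-square table with the returned number of degrees of freedom; a large value indicates a failure of additivity or of the equal-variance assumption.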


It can also happen that the standard deviations of the various stimuli are not equal even though unidimensionality obtains. In this case our attempt to fit the data under the equal standard deviations assumption will sometimes fail, and this failure will be reflected, in general, in a failure of additivity and thus an increase in $\chi^2$.

3. The transformation

Like so many other good things in statistics, the inverse sine transformation was developed by R. A. Fisher (4). Further discussion by Bartlett (1,2), Eisenhart (8), and Mosteller and Tukey (7) may be of interest to those who wish to examine the literature. The facts essential to the present discussion are these: If we have an observed $p'$ arising from a binomial sample of size $n$ from a population with true proportion of successes $p$, then

$$\theta' = \arcsin \sqrt{p'} \tag{1}$$

is approximately normally distributed with variance

$$\sigma_\theta^2 = \frac{821}{n}, \tag{2}$$

nearly independent of the true $p$, when $\theta$ is measured in degrees. A table for making the transformation to angles has been computed by C. I. Bliss (3), and is readily available in G. W. Snedecor’s Statistical Methods (4th Edition), p. 450. Then if we define

$$\begin{aligned}
\theta'_{ij} &= \arcsin \sqrt{p'_{ij}} \\
\theta''_{ij} &= \arcsin \sqrt{p''_{ij}}
\end{aligned} \tag{3}$$

where $p'_{ij}$ are the observed proportions and $p''_{ij}$ are the proportions derived from fitting the $S'_i$, we can test goodness of fit by

$$\chi^2 = \sum_{i<j} \frac{(\theta'_{ij} - \theta''_{ij})^2}{821/n}.$$

… $S_2 > S_1$. Furthermore we will assume that these means are sufficiently close to one another that the approximation

$$p_{ij} = \frac{1}{\sqrt{2\pi}} \int_{-(S_j - S_i)}^{\infty} e^{-\frac{1}{2} x^2} \, dx \cong \frac{1}{2} + \frac{S_j - S_i}{\sqrt{2\pi}}, \tag{5}$$

will be adequate. For this case $p_{ij}$ will be nearly 1/2, so we will be able to use the approximation:

$$\sigma^2(p'_{ij}) \cong \frac{1}{4n} = \sigma^2. \tag{6}$$

Working with this case will have the further advantage that we will not need to use the inverse sine transformation but can work directly with

$$\chi^2 = \sum_{i<j} (p'_{ij} - p''_{ij})^2 \, \cdots$$

… ($\epsilon > 0$), then the probability it wins a three-game series is

$$\begin{aligned}
S(p, 3) &= p^3 + 3p^2(1 - p) = p^2(3 - 2p) \\
&= (\tfrac{1}{2} + \epsilon)^2 (2 - 2\epsilon) \\
&= \tfrac{1}{2} + \tfrac{3}{2}\epsilon - 2\epsilon^3,
\end{aligned}$$

and the increase in the probability of correctly choosing the better team as we go from the one-game to the three-game play-off is

$$S(p, 3) - S(p, 1) = \frac{\epsilon}{2} - 2\epsilon^3.$$

If $\epsilon = 0.01$ ($p = 0.51$), the gain in the probability $S$ is essentially 0.005 for a three-game as compared with a one-game series. This means that in 200 one-game play-offs, the better team could expect to win 102 play-offs, but in 200 three-game play-offs, the better team could expect to win 103 play-offs, a scarcely noticeable improvement. There are few data for comparing play-off series within the major leagues, because ties are rare. Further, one-game play-offs would not provide any information on the point at hand. We will not therefore be able to pursue this question of differences between play-off teams to any conclusion, and so we proceed directly to consideration of World Series, where data are more plentiful.

As a simple means of examining the power of a series for identifying the better team we provide Table 1. In Table


1 (suggested by K.A. Brownlee) we give the probability $S(p, n)$ that the better team wins an $n$-game series for $n = 1, 3, 5, 7, 9$, and for various probabilities $p$. We also give the probability that the better team wins, or ties, in even-sized series $n = 2, 4, 6, 8$.

Comparison of Major Leagues

One way to approach the question of equality of teams entering the World Series is to compare the two Leagues. Altogether there have been 44 seven-game Series from 1905 to 1951, and 4 nine-game Series (1903, 1919–1921, no Series in 1904), for a total of 275 games actually played. The American League has won 159 of these games, or 57.82 per cent. In any year $i$, the American League team is assumed to have had a probability $p_i$ of winning single games. Then over the 48 years of World Series games there would be an average probability, say $\bar{p} = \sum p_i/48$. If we adopt this view, then the average proportion of games won, 0.5782, is an estimate of $\bar{p}$.

Some might object to this estimate of $\bar{p}$ because the sampling is truncated (a seven-game Series is stopped when one team wins 4 games), or because we would have a better estimate if we averaged over the number of Series rather than the total number of games. These objections are both reasonable, but it happens to turn out that the use of an estimate suited to truncated sampling gives almost exactly the same numerical result. We introduce this estimate because we plan to base a later argument on it. An unbiased estimate appropriate to truncated sampling can be obtained in the following manner:¹ In a year when the American League wins, the estimate of $p$ is taken to be

$$\frac{c - 1}{c + x - 1},$$

where $c$ is the number of games it takes to win a Series and $x$ is the number won by the National League; when the American League loses, the estimate is

$$\frac{y}{c + y - 1},$$

where $y$ is the number won by the American League. In our data the value of $c$ is either 4 or 5 depending on whether a seven- or nine-game Series is used.
The summary table (Table 2) shows the Series outcomes and the estimates corresponding to these outcomes. In addition, the four estimates of American League $p_i$ for the nine-game Series are 4/7, 3/7, 4/6, 3/7. The average of the 48 estimates is 57.80 per cent, providing an estimate of $100\bar{p}$ that is scarcely different from the more naive per cent won. If we think of the team

¹ M.A. Girshick, F. Mosteller, and L.J. Savage, “Unbiased estimates of certain binomial sampling problems with applications,” Annals of Mathematical Statistics, 17 (1946), 20.

12 World Series Competition


TABLE 1
Probability S(p, n) that the better team wins an n-game series, when its probability of winning single games is p

  p      S(p, 1)   S(p, 3)   S(p, 5)   S(p, 7)   S(p, 9)
 0.50     0.500     0.500     0.500     0.500     0.500
 0.55     0.550     0.575     0.593     0.608     0.621
 0.60     0.600     0.648     0.683     0.710     0.733
 0.65     0.650     0.718     0.765     0.800     0.828
 0.70     0.700     0.784     0.837     0.874     0.901
 0.75     0.750     0.844     0.896     0.929     0.951
 0.80     0.800     0.896     0.942     0.967     0.980
 0.85     0.850     0.939     0.973     0.988     0.994
 0.90     0.900     0.972     0.991     0.997     0.999
 0.95     0.950     0.993     0.999     1.000     1.000
 1.00     1.000     1.000     1.000     1.000     1.000

Probability that the better team wins (W) or ties (T) a series of an even number of games, when its probability of winning single games is p

             n = 2            n = 4            n = 6            n = 8
   p       W      T         W      T         W      T         W      T
  0.50   0.250  0.500     0.312  0.375     0.344  0.312     0.363  0.273
  0.55   0.302  0.495     0.391  0.368     0.442  0.303     0.477  0.263
  0.60   0.360  0.480     0.475  0.346     0.544  0.276     0.594  0.232
  0.65   0.422  0.455     0.563  0.311     0.647  0.235     0.706  0.188
  0.70   0.490  0.420     0.652  0.265     0.744  0.185     0.806  0.136
  0.75   0.562  0.375     0.738  0.211     0.831  0.132     0.886  0.087
  0.80   0.640  0.320     0.819  0.154     0.901  0.082     0.944  0.046
  0.85   0.722  0.255     0.890  0.098     0.953  0.041     0.979  0.018
  0.90   0.810  0.180     0.948  0.049     0.984  0.015     0.995  0.005
  0.95   0.902  0.095     0.986  0.014     0.998  0.002     1.000  0.000
  1.00   1.000  0.000     1.000  0.000     1.000  0.000     1.000  0.000
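The entries of Table 1 and of the even-series table can be reproduced from the binomial distribution. The sketch below is illustrative only (the function names are hypothetical). It assumes all n games are played out, which gives the same Series-winning probability as stopping once the Series is decided.

```python
from math import comb

def series_win_prob(p, n):
    """S(p, n): probability of winning a majority of n games (n odd)
    for a team whose single-game probability is p."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

def even_series(p, n):
    """(W, T): win and tie probabilities when all n games (n even) are played."""
    half = n // 2
    tie = comb(n, half) * (p * (1 - p))**half
    win = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(half + 1, n + 1))
    return win, tie
```

For instance, `series_win_prob(0.65, 7)` corresponds to the S(0.65, 7) entry of Table 1, and `even_series(0.5, 4)` to the first row of the n = 4 columns.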

representing the American League as having a possibly different pi in each Series, then 57.80 per cent is, in a reasonable sense, an estimate of the average probability of winning single games over the years. Another question is whether the American League has done significantly better than the National League. We could check, merely on the basis of the number of Series won. The American League has won 31 of 48 Series, and


Frederick Mosteller

Table 2. Outcomes of the 44 seven-game Series∗

      Games Won          Estimate of p
    N.L.     A.L.          for A.L.        Frequency
     4        0              0                 3
     4        1              1/4               4
     4        2              2/5               1
     4        3              3/6               7
     3        4              3/6               4
     2        4              3/5              10
     1        4              3/4               9
     0        4              1                 6
                                              ——
                                       Total  44

∗ Data from The World Almanac 1952, New York World Telegram, Harry Hansen (ed.), p. 821. These data have also been checked in The Official Encyclopedia of Baseball, Jubilee Edition, Hy Turkin and S.C. Thompson (New York: A.S. Barnes Co., 1951).

under the null hypothesis of p = 1/2, the probability² of 31 or more successes is 0.0297, or a two-sided probability of about 0.06.

Although it has little to do with the main discussion, another frequently asked question is whether the American League has been improving through the years. To answer this question, at least partially, we break the 48 Series into four sets of 12, chronologically, in Table 3. There is a slight but not statistically significant trend in the data. Of course there is one notable trend by the New York Yankees (A.L.) suggested by Table 4. Just what sort of significance test should be applied to a team chosen on the basis of its notable record is an issue not at present settled by statistical theory, so we leave Table 4 without further analysis.

Table 3. Number of Series won by the American League in 12-year intervals

  Years                  1903–15∗   1916–27†   1928–39   1940–51   Total
  Series Won by A.L.        7          7          9         8      31 of 48

∗ No Series in 1904. † Includes National League victory in 1919, year of “Blacksox Scandal.”

² Tables of the Binomial Probability Distribution, National Bureau of Standards, Applied Mathematics Series 6 (U.S. Government Printing Office, 1950), p. 375.
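The tail probability taken from the binomial tables can be recomputed directly. This is a small sketch with a hypothetical function name; it sums the exact binomial probabilities for 31 or more successes in 48 trials at p = 1/2 and doubles the result for the two-sided figure.

```python
from math import comb

def upper_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

one_sided = upper_tail(48, 31)   # A.L. won 31 of 48 Series
two_sided = 2 * one_sided        # symmetric null distribution, so double the tail
```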


Table 4. Table of World Series won and lost by Yankees by years

  Years       Won    Lost    Totals
  1903–27       2      3        5
  1928–51      11      1       12
  Totals       13      4       17

Estimating the Probability That the Better Team Wins Single Games

Since the American League has done rather well through the years, we will reject the idea that the league champions are equally matched (p = 1/2) when they appear for the Series. How closely are they matched? Suppose each year we knew the single game probability $p_i$ for the better team; then the average of such p’s could be a measure of how well the teams were matched. We expect the average p to be greater than 0.5. Indeed in the present case our estimate for the average p for better teams should be higher than the 57.8 per cent we obtained for the American League, because we suspect that the American League team was in some years not the “better team.”

There should be no confusion here between the “winning team” and the “better team.” The “winning team” is the team that wins the Series. The “better team” is the team with the higher probability of winning single games, whether or not it actually wins the Series. We anticipate that the better team sometimes loses a Series, just as the league champion loses single games to the last-place team within the league during the season.

Model A: The better team has the same p each year.

The author has not discovered any good way of estimating the average p for better teams without making further unrealistic assumptions (the lack of reality may not be important, because the results may not be sensitive to the assumptions). What has been assumed in this section is that every year the better team has the same probability p > 1/2 of winning single games. Of course, we cannot identify the better team in any particular Series, but we may by arithmetic manipulation derive an estimate from our half-century of data. For this purpose, we will neglect the four nine-game Series, because they cause considerable arithmetic trouble. Our data and Model A can be summarized in Table 5.
The algebraic expressions in the right-hand column are not the usual terms in the expansion of the binomial $(p + q)^7$ because we are working with truncated single sampling. On each line the first algebraic term represents the probability that the better team wins the Series in the number of games appropriate to the line, while the second term similarly represents the probability that the poorer team wins the Series. The sum of these two terms represents the total probability that the Series is won in the pattern of games given in the first two columns. Thus in the third line $10p^4q^2$ is the probability that the better team wins the Series in exactly six games.

Table 5. Games won (seven-game Series only)

  Winner   Loser   Frequency   Theoretical Proportion
    4        0         9       $p^4 + q^4$
    4        1        13       $4p^4q + 4pq^4$
    4        2        11       $10p^4q^2 + 10p^2q^4$
    4        3        11       $20p^4q^3 + 20p^3q^4$
                      ——       ——
             Total    44       1

If we represent a win by the better team as a B and a win by the poorer team as a W , the following 10 ways for the better team to win in exactly 6 games exhaust the possibilities: BBBWWB BBWBWB BWBBWB WBBBWB BBWWBB

BWBWBB WBBWBB BWWBBB WBWBBB WWBBBB
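The count of ten orderings can be confirmed by brute-force enumeration. A minimal sketch follows, keeping the B/W letters used above; the function name is hypothetical.

```python
from itertools import product

def ways_to_win_in(games, wins_needed=4):
    """Orderings in which the better team (B) clinches in exactly `games` games.

    The last game must be a B, and B must have exactly `wins_needed` wins,
    so the poorer team (W) holds the remaining games - wins_needed wins.
    """
    losses = games - wins_needed
    if not 0 <= losses < wins_needed:
        return []                              # no Series can last this many games
    return ["".join(s) for s in product("BW", repeat=games)
            if s[-1] == "B" and s.count("B") == wins_needed]

six_game_orderings = ways_to_win_in(6)
```

Running the same function for 4, 5 and 7 games gives the other coefficients (1, 4 and 20) that appear in the theoretical proportions of Table 5.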

These 10 ways correspond to the coefficient 10 in $10p^4q^2$. The factor $p^4q^2$ arises from the fact that we must have exactly 4 wins by the better team, whose single game probability is p, and exactly 2 wins by the poorer team, whose single game probability is q = 1 − p. Similar computations account for the coefficients and powers of p and q corresponding to the other Series outcomes.

To estimate p for better teams we should not take merely the total number of games won by Series-winning teams and divide by the grand total of games. This will clearly overestimate p, as an example with p = 1/2 will show. In Table 6, based on hypothetical data for equally matched teams, it turns out that the Series-winning teams won 256 of a total of 372 games, or 68.8 per cent of the games; but an estimate of p = 0.688 would be rather far from the actual p = 0.500. On the other hand, in the actual Series results, Table 5, the per cent of games won by the Series-winning team is only 72.1 (176 of a total of 244), which seems rather close to 68.8, so perhaps the previous assumption that the teams are unevenly matched is not in line with the facts. To investigate the facts, we need an estimation process. Three estimates seem reasonable:

Table 6. Expected results for 64 Series, p = 1/2

     Games Won
  Winner   Loser   Theoretical Frequency
    4        0              8
    4        1             16
    4        2             20
    4        3             20
                           ——
                    Total  64
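The frequencies in Table 6, and the 68.8 per cent share of games that goes to Series winners even when the teams are evenly matched, follow from the theoretical proportions of Table 5. The sketch below uses hypothetical function names and treats the winner's share as the ratio of expected games won to expected games played.

```python
from math import comb

def loser_wins_distribution(p, c=4):
    """P(Series loser wins x games), x = 0..c-1, from the Table 5 proportions."""
    q = 1 - p
    return [comb(c - 1 + x, x) * (p**c * q**x + q**c * p**x) for x in range(c)]

def winner_share_of_games(p, c=4):
    """Expected wins of the Series winner (always c) over expected games played."""
    dist = loser_wins_distribution(p, c)
    expected_games = sum(prob * (c + x) for x, prob in enumerate(dist))
    return c / expected_games

expected_in_64 = [64 * prob for prob in loser_wins_distribution(0.5)]
share_when_even = winner_share_of_games(0.5)
```

Because the winner always takes four games, this share exceeds one half even at p = 1/2, which is why it cannot be used directly as an estimate of p.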

1) Use the theoretical distribution to obtain a formula for the expected number of games won by the losing team in a 7-game series. This average will be a function of p. Equate this theoretical average to the observed average and solve for p. This is the method of moments applied to the sample mean.³
2) Obtain the maximum likelihood estimate of p.
3) Obtain the minimum chi-square estimate of p.

Method 1) is by far the easiest computationally.

³ It was suggested by William Kruskal (personal communication) that other estimates based on the method of moments seem equally plausible. He suggests as an example the statistic

\[ \frac{\text{Number of games won by loser}}{\text{Number of games in Series}}. \]

Calculations similar to those in the text give the average value of this statistic as

\[ B = 4pq\left[\frac{1}{5} + \frac{7}{30}pq + \frac{10}{21}p^2q^2\right]. \]

When this is equated to its observed average (about 0.25), the estimate of p turns out slightly higher than 0.65.

Method 1. We wish to obtain the average number of games won by the Series-losing team in terms of p. We multiply the theoretical proportions from Table 5 by the number of games won by the loser and add. This operation gives

\[
\begin{aligned}
A &= \text{Average number of games won per Series by the Series-loser} \\
  &= 1(4p^4q + 4pq^4) + 2(10p^4q^2 + 10p^2q^4) + 3(20p^4q^3 + 20p^3q^4) \\
  &= 4pq[(p^3 + q^3) + 5pq(p^2 + q^2) + 15p^2q^2].
\end{aligned}
\tag{1}
\]

We note that

\[ p^3 + q^3 = p^3 + 3p^2q + 3pq^2 + q^3 - 3p^2q - 3pq^2 = 1 - 3pq \]

and

\[ p^2 + q^2 = p^2 + 2pq + q^2 - 2pq = 1 - 2pq. \]

Substituting these relations in (1) gives

\[
\begin{aligned}
A &= 4pq[1 - 3pq + 5pq(1 - 2pq) + 15p^2q^2] \\
  &= 4pq[1 + 2pq + 5p^2q^2].
\end{aligned}
\tag{2}
\]

The value of A attains its maximum when p = 1/2, as we would anticipate. In the 44 seven-game Series, the average number of games per Series won by the Series-loser was 1.5455. If we set this equal to A in equation (2) we can solve directly a cubic in pq and then a quadratic in p, or we might go directly to the 6th degree equation in p. Moderation and discretion suggest that we just substitute a few values of p in the expression, and see what values of p lead to outcomes close to the average wins of the Series-loser. We get

    p        A = Average Wins Expected by Loser
  0.5        1.8125
  0.6        1.6973
  0.6500     1.5596
  0.6667     1.5034
  0.7        1.3780

Linear interpolation gives the estimate of p as 0.6542. Of course, the uncertainty of the estimate makes the use of this many decimal places quite misleading. On the basis of the evidence thus far available, then, the teams entering the World Series seem to be matched at about 65–35 for single games.

Method 2. If P(0), P(1), P(2), P(3) are the probabilities that the Series-losing team wins 0, 1, 2, or 3 games respectively in a Series, then the maximum likelihood approach involves finding the value of p that maximizes

\[ [P(0)]^{9}\,[P(1)]^{13}\,[P(2)]^{11}\,[P(3)]^{11}. \]

The numbers 9, 13, 11 and 11 are the frequencies tabulated in Table 5 and the P(x) are given in algebraic form in the Theoretical Proportion column in Table 5. Although tedious, this maximization was done, and the estimate obtained was 0.6551, encouragingly close to that obtained from Method 1.

Method 3. Finally the chi-square to be minimized was

\[ \chi^2 = \frac{[9 - 44P(0)]^2}{44P(0)} + \frac{[13 - 44P(1)]^2}{44P(1)} + \frac{[11 - 44P(2)]^2}{44P(2)} + \frac{[11 - 44P(3)]^2}{44P(3)}, \]

where P(x), x = 0, 1, 2, 3 has the same definition as before. Here the terms 44P(x) are the “expected numbers” for the usual chi-square formula. The value of p minimizing this chi-square turned out to be 0.6551.
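The three estimates can be reproduced numerically from the Table 5 frequencies. The sketch below is not the procedure used in the text, which relied on substitution and linear interpolation. It swaps in bisection for Method 1 and a golden-section search for Methods 2 and 3, and it assumes each objective has a single optimum for p between 0.5 and 0.99. All function names are hypothetical.

```python
from math import comb, log

FREQ = [9, 13, 11, 11]   # Table 5: Series in which the loser won 0, 1, 2, 3 games
N = sum(FREQ)

def probs(p):
    """Theoretical proportions P(0)..P(3) from Table 5."""
    q = 1 - p
    return [comb(3 + x, x) * (p**4 * q**x + q**4 * p**x) for x in range(4)]

def mean_loser_wins(p):
    """A(p): expected games won by the Series loser."""
    return sum(x * prob for x, prob in enumerate(probs(p)))

def neg_log_likelihood(p):
    return -sum(f * log(prob) for f, prob in zip(FREQ, probs(p)))

def chi_square(p):
    return sum((f - N * prob)**2 / (N * prob) for f, prob in zip(FREQ, probs(p)))

def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) differ in sign."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def golden_min(f, lo, hi, tol=1e-8):
    """Minimiser of f on [lo, hi], assuming f is unimodal there."""
    r = (5 ** 0.5 - 1) / 2
    a, b = hi - r * (hi - lo), lo + r * (hi - lo)
    while hi - lo > tol:
        if f(a) < f(b):          # minimum lies in [lo, b]
            hi, b = b, a
            a = hi - r * (hi - lo)
        else:                    # minimum lies in [a, hi]
            lo, a = a, b
            b = lo + r * (hi - lo)
    return (lo + hi) / 2

observed_mean = sum(x * f for x, f in enumerate(FREQ)) / N          # 68/44
p_moments = bisect(lambda p: mean_loser_wins(p) - observed_mean, 0.5, 0.99)
p_mle = golden_min(neg_log_likelihood, 0.5, 0.99)
p_chi = golden_min(chi_square, 0.5, 0.99)
```

The exact root found by bisection need not match the interpolated 0.6542 to the last digit, since interpolation between tabulated points is itself an approximation.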


The following table summarizes the results for the three methods.

  Method                            Estimate
  Average Wins by Series Loser      0.6542
  Maximum Likelihood                0.6551
  Minimum Chi-square                0.6551

Presumably the close agreement between the minimum chi-square method and the maximum likelihood method is partly an accident of the particular empirical data, and partly owing to 44 being a fairly large number. For contingency problems equivalent to the present one, Cramér⁴ points out that “the modified chi-square minimum method is . . . identical with the maximum likelihood method.” However the “modified chi-square minimum method” neglects certain terms of the form

\[ \frac{[n_x - nP(x)]^2}{2n[P(x)]^2}, \qquad n = \text{total number}, \quad n_x = \text{observed number of } x\text{'s}, \]

when the partial derivatives appropriate to minimizing chi-square are set equal to zero. We would not in general expect such a term to vanish, but the closeness of agreement suggests to the author that in the future he will usually prefer the easier maximum likelihood to the more tedious minimum chi-square when totals even as small as 44 are involved.

Using the estimate obtained by the minimum chi-square method, the observed value of $\chi^2$ turns out to be 0.222, which for two degrees of freedom is fairly small, the probability of a larger value of chi-square being about 0.89, so the fit is rather good.

How Often Does the Better Team Win the Series?

Based on past experience then, a reasonable estimate of the average probability of winning single games for the better team is about 0.65 according to Model A. Using this value, we can compute the probability of the better team winning a seven-game World Series as about S(0.65, 7) = 0.80 (see Table 1), so the better team would win about four out of five Series.

If we push our assumptions to an extreme we might even estimate that the American League has had the better team about 75 per cent of the time. We can obtain this number by assuming that the American League had the better team a fraction of the time x, and recall that the American League won 31 of 48 Series. Then equating expected and observed proportions of Series won we have:

\[ 0.80x + 0.20(1 - x) = \frac{31}{48} = 0.646, \qquad 0.60x = 0.446, \qquad x = 0.743. \]

⁴ Harald Cramér, Mathematical Methods of Statistics (Princeton, N.J.: Princeton University Press, 1946), p. 426.
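The equation for x is linear, so it can be solved in closed form. In the sketch below the function name and argument names are hypothetical. The 0.80 is the Model A value S(0.65, 7), and the second call uses 0.65 and the 0.578 share of games won by the American League in the same way.

```python
def fraction_with_better_team(observed_rate, better_team_rate):
    """Solve better*x + (1 - better)*(1 - x) = observed for x.

    observed_rate    : proportion of Series (or games) the league actually won
    better_team_rate : probability that the better team wins a Series (or a game)
    """
    return (observed_rate - (1 - better_team_rate)) / (2 * better_team_rate - 1)

x_from_series = fraction_with_better_team(31 / 48, 0.80)
x_from_games = fraction_with_better_team(0.578, 0.65)
```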


Alternatively, we could use games won rather than Series won. Using our estimate of 0.65 as the single game probability for the better team, and recalling that the American League has won 0.578 of the games, we have:

\[ 0.65x + 0.35(1 - x) = 0.578, \qquad 0.30x = 0.228, \qquad x = 0.76. \]

These two methods both give estimates of about 75 per cent as the percentage of years in which the American League has had the better team. If the American League has had the better team 75 per cent of the time, or in about 0.75(48) = 36 Series, these 36 better teams could expect to win about 0.80 × 36 = 29 Series and lose about 36 − 29 = 7 Series. Similar computations suggest that the American League has had the poorer team 12 times, and that these poorer teams could expect to win two Series from their better National League opponents.

The discussion just given shows why we do not use the per cent of Series won (65) as an estimate of the per cent of times the American League has had the better team. The side more often having the better team will suffer most in actual play due to lack of discrimination of the 7-game Series. If one League always had the better team, it would still lose a good many Series unless its single game p were quite high.

It might be supposed that the estimate of 80 per cent for the probability that the better team would win the Series would depend sensitively on the Model A assumption of a constant value of p for the better team in every year. The reader might be willing to accept the idea that our estimate of 0.65 as an average for better teams is reasonable, but feel that since there is surely a distribution of p values for better teams from year to year, the average of the $S(p_i, 7)$ may not be close to S(p, 7). (It will be recalled that S(p, 7) is the probability that a team with a single-game probability of p will win a 7-game Series.)
For example, let $p_1 = 0.50$, $p_2 = 0.90$, and then the average p = (0.50 + 0.90)/2 = 0.70, while S(0.50, 7) = 0.50, S(0.90, 7) = 1.00, but S(0.70, 7) = 0.87 instead of 0.75 = ½(0.50 + 1.00). This example shows that

\[ \frac{1}{k}\sum_{i=1}^{k} S(p_i, n) \neq S(p, n), \qquad p = \frac{1}{k}\sum_{i=1}^{k} p_i. \]
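The gap between the average of the $S(p_i, 7)$ and S evaluated at the average p can be checked for the two-value example above. A minimal sketch, with hypothetical names, using the binomial form of S(p, n):

```python
from math import comb

def S(p, n=7):
    """Probability of winning a majority of n games (n odd) at single-game probability p."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

yearly_p = [0.50, 0.90]                                     # the two values in the example
mean_of_S = sum(S(p) for p in yearly_p) / len(yearly_p)     # average of S(p_i, 7)
S_of_mean = S(sum(yearly_p) / len(yearly_p))                # S at the average p
```

Here `mean_of_S` falls below `S_of_mean`, reflecting the curvature of S over the wide range 0.50 to 0.90.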

How important this objection is depends on the linearity of S(p, 7) over the dense part of the distribution of p’s from year to year. A graph of S(p, 7) reveals that S is approximately linear in p over the range from p = 0.50 to


p = 0.75. We would suppose that in World Series competition it would be relatively rare that single game p’s exceeded 0.75, and therefore we feel that the lack-of-linearity argument is not a strong one against this estimate of 0.80 as the average probability that the better team wins the Series.

An Alternative Method of Estimation

Model B. Fixed p’s within Series, but normally distributed p’s from year to year.

There is another way of estimating the number of times the American League has had the better team. We can take the view that each year the American League team has a true but unknown probability of winning single games $p_i$. For each of these yearly $p_i$ we have an unbiased estimate $\hat{p}_i$; the distribution of these estimates was given in Table 2. If we let the observed mean of the estimates be $\bar{p}$, we can compute the sum of squares of deviations of the estimates $\hat{p}_i$ from $\bar{p}$. This sum of squares can be partitioned into two parts. One part has to do with $\sigma^2(\hat{p}_i)$, the variation of $\hat{p}_i$ around its true value $p_i$, and the other with $\sigma^2(p_i)$, the variation of the true $p_i$ around their true mean p. Such a partition is standard practice in analysis of variance. We need to define

\[
\sigma^2(\hat{p}_i) = E(\hat{p}_i - p_i)^2, \quad i = 1, 2, \ldots, n, \qquad
\sigma^2(p_i) = \frac{\sum_{i=1}^{n} (p_i - p)^2}{n},
\tag{3}
\]

where E is the expected value operator. We need the expected value of $\sum (\hat{p}_i - \bar{p})^2$; it is

\[
E\left[\sum_{i=1}^{n} (\hat{p}_i - \bar{p})^2\right]
= E\left[\sum_{i=1}^{n} \hat{p}_i^2 - n\bar{p}^2\right]
= E\left[\sum_{i=1}^{n} \hat{p}_i^2 - \frac{\left(\sum \hat{p}_i\right)^2}{n}\right].
\tag{4}
\]

Recalling that

\[
E(\hat{p}_i^2) = \sigma^2(\hat{p}_i) + p_i^2, \quad i = 1, 2, \ldots, n, \qquad
E(\hat{p}_i \hat{p}_j) = p_i p_j, \quad i \neq j,
\tag{5}
\]

and using these results in (4) we have the well-known result expressed in words and symbols:

Total Sum of Squares = Within Years Sum of Squares + Between Years Sum of Squares

\[
E \sum_{i=1}^{n} (\hat{p}_i - \bar{p})^2 = \frac{n-1}{n} \sum_{i=1}^{n} \sigma^2(\hat{p}_i) + n\sigma^2(p_i).
\tag{6}
\]

The value of $\sum (\hat{p}_i - \bar{p})^2$ can be computed from Table 2. We would like to know $\sigma^2(p_i)$ as an aid in estimating the average p for better teams. To get an estimate of $\sigma^2(p_i)$ we will have to estimate $\sum \sigma^2(\hat{p}_i)$. By a procedure like that used in deriving the average number of games won by the Series-loser (equation (2)) we can show that

\[
\sigma^2(\hat{p}_i) = \frac{p_i q_i}{20}\left[5 - 3p_i q_i - 4p_i^2 q_i^2\right], \qquad q_i = 1 - p_i.
\tag{7}
\]

The derivation of (7) is lengthy, but is shown in the Appendix. Naturally this variance depends on the true $p_i$, but its value does not change rapidly in the neighborhood of $p_i = 1/2$. Therefore we propose to estimate this error variance by evaluating $\sigma^2(\hat{p}_i)$ at the average value of the $p_i$’s. For the 44 seven-game Series $\bar{p}$ is 0.583. Substituting this value of p in the formula for the variance (7) gives

\[ \sigma^2(\hat{p}_i) = 0.0490 \quad \text{(error variance)}. \]

We use this same value of $\sigma^2(\hat{p}_i)$ for all years. The total sum of squares $\sum (\hat{p}_i - \bar{p})^2$ for the 44 Series is 2.8674. The estimated between years sum of squares $\sum (p_i - p)^2$ is 2.8674 − 43(0.0490) = 0.7604. Dividing this by 44 gives an estimate of the variance of the true $p_i$’s from year to year as 0.0173, or an estimated standard deviation of p-values $\sigma(p_i) = 0.1315$. The departure of the observed average, 0.583, from 0.500 in standard deviation units is 0.63. If we assume that the distribution of true p’s is normal, we estimate the percentage of times the American League had the better team to be 74 per cent. This result is close to our previous estimate of 76 per cent.

It has been suggested by Howard L. Jones (personal communication) that we might obtain an improved estimate of $\sum \sigma^2(\hat{p}_i)/n$ by averaging the formula of equation (7) over the normal distribution with mean 0.583 and standard deviation 0.1315. When this was done the value 0.0455 was obtained instead of 0.0490. The new residual variance would be 2.8674 − 43(0.0455) = 0.9109. Dividing 0.9109 by 44 gives a corrected estimate of the variance of the true $p_i$’s as 0.0207, or an estimated standard deviation $\hat{\sigma}(p_i) = 0.1439$. This adjusted standard deviation can be used as before for estimating the proportion of $p_i$’s higher than 0.500. We have the departure from the mean in standard deviation units as (0.583 − 0.500)/0.1439, or 0.58 units, which corresponds to a proportion of 72 per cent on a normal distribution. Thus a better approximation for the proportion of times the American League has had the better team using Model B is 72 per cent. Again the result is not far from our Model A estimate of 76 per cent.
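The Model B arithmetic can be retraced from Table 2 and equation (7). The following sketch uses hypothetical names and makes two assumptions worth stating. It rounds the mean estimate to 0.583 before forming the sum of squares, which appears to be what reproduces the 2.8674 quoted in the text, and it takes the normal distribution function from `math.erf`. The adjusted error variance 0.0455 is simply passed in as the value reported in the text, not recomputed.

```python
from math import erf, sqrt

# (estimate, frequency) pairs for the 44 seven-game Series, copied from Table 2.
TABLE2 = [(0.0, 3), (0.25, 4), (0.4, 1), (0.5, 7),
          (0.5, 4), (0.6, 10), (0.75, 9), (1.0, 6)]

def error_variance(p):
    """Equation (7): sampling variance of the estimate in a seven-game Series."""
    pq = p * (1 - p)
    return pq / 20 * (5 - 3 * pq - 4 * pq**2)

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def model_b(err_var=None):
    """Return (p_bar, total SS, sd of true p_i, share of years with p_i > 1/2)."""
    n = sum(f for _, f in TABLE2)
    p_bar = round(sum(e * f for e, f in TABLE2) / n, 3)     # rounded to 0.583 as in the text
    total_ss = sum(f * e * e for e, f in TABLE2) - n * p_bar**2
    if err_var is None:
        err_var = error_variance(p_bar)                     # evaluate (7) at the mean
    between_var = (total_ss - (n - 1) * err_var) / n        # rearranged equation (6)
    sd = sqrt(between_var)
    return p_bar, total_ss, sd, normal_cdf((p_bar - 0.5) / sd)

first_pass = model_b()
adjusted = model_b(err_var=0.0455)    # Jones's averaged error variance, as quoted
```

Small differences from the printed figures in the last decimal place are to be expected, because the text rounds the error variance to 0.0490 before subtracting.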


If we make further use of this normality assumption, we can also estimate the average single-game probability for the better team. We break the assumed normal distribution of true $p_i$’s for the American League into two parts. One part is the truncated normal for which $p_i > 1/2$ (American League better), the other is the part for which $p_i < 1/2$ (National League better). Then we obtain the average $p_i$ for the American League when it is better, and for the National League when it is better, and weight each by the relative frequency it represents. The final result is an estimate of the average single-game p for the better team. The integration is shown in the Appendix. When $\sigma(p_i)$ is taken as 0.1315, the estimate is 0.626, but the improved estimate of $\sigma(p_i)$ as 0.1439 gives the final estimate as 0.634, which can be compared with our Model A estimate of 0.655.

Tests of the Binomial Assumptions

We have emphasized the binomial aspects of the model. The twin assumptions needed by a binomial model are that throughout a World Series a given team has a fixed chance to win each game, and that the chance is not influenced by the outcome of other games. It seems worthwhile to examine these assumptions a little more carefully, because any fan can readily think of good reasons why they might be invalid. Of course, strictly speaking, all such mathematical assumptions are invalid when we deal with data from the real world. The question of interest is the degree of invalidity and its consequences. Obvious ways that the assumptions might be invalid are:

1) A team might be expected to do better “at home” than it does “away,” and this would negate a constant probability because even the shortest Series may be played in two places. This possibility is strongly suggested both by intuition and by an examination of the results of regular season games in the major leagues. That it would hold for World Series games is not a foregone conclusion.
2) Winning a game might influence the chance of winning the next game, i.e., there may be serial correlation from game to game.

To examine the first of these issues, we collected the detailed results of four games in each Series. We chose four because that represents the least number played. Games were chosen as early as possible in each Series to provide two games played by each team in an “at-home” capacity. In a seven-game Series we ideally find the first two games played at the National League (American League) park, the next three at the American League (National League) park, and the last two in the National League (American League) park. When we observed this pattern, we used the first four games played. This ideal pattern was not actually used as often as one might suppose. Sometimes teams alternated parks after each game, when extensive travelling was not required. Sometimes both teams used the same park as did the New York Giants and the New York Yankees in the early days, or the St. Louis Cardinals and the St. Louis Browns; in such cases, we took the view that the home team was the second team to come to bat. In some Series there were ties that had to


be thrown out. And sometimes the first four games could not be used because three would be played at one park followed by some number at the other park. Our final rule was to collect for each Series the first two games played with the National League team as the “home team,” and the first two games played with the American League team as the home team.5 One Series (1922, N.Y. Giants vs. N.Y. Yankees) had to be omitted because in the four non-tied games, all won by the Giants, one team was “at home” three times. Thus we were left with 47 sets of four games. The plan of analysis is to compare the same team for two games “away” and two “at home.” We arbitrarily chose the first “away” team for the comparison. We counted the number won by that team in its first two (non-tied) away games and subtracted this from the number it won in its first two at-home games. This difference is taken as a measure of the improvement of a World Series team playing at home over playing away. If this difference is strongly positive or negative on the average, we would have to reject the notion that the chance of winning a single game is constant throughout the Series. For example, in 1949 Brooklyn (NL) was the first away team in the Series with New York (AL). It won one of its away games, and none of its at-home games, for an improvement score of minus one. The average improvement score for the 47 sets of four games was 2/47 = 0.042, and the standard error of this mean is approximately 0.14. So the improvement score is only about a third of a standard error from the null hypothesis score of zero improvement. Thus far, then, we have no good evidence for rejecting the constant probability assumption and it has been shown that the probability of winning a game is not influenced very much by the “at-home” or “away” status of teams. A possible rationale for explaining this at-home-away similarity observed in Series games and not observed in season games suggests itself. 
It may be that travelling fatigues the away team and thus tends to cut down proficiency. During the regular season, at-home teams remain stationary for long periods, and various opponent teams travel in to play them. In the Series, one team has to travel initially, but then both teams do equal amounts of travelling until the Series ends. If travelling is an important variable influencing outcomes of games, the Series tends to equalize this influence much more than regular season games. Another possibility is that many teams are tailored to the home park because half the games are played at home: for example, the Boston Red Sox (A.L.) have a short left-field fence, and therefore hire a good many strong left-field hitters. But it may well be that League champions represent teams that are not much affected by change of park.

Another way to say that trials are not independent is to say that they are correlated serially, and obviously this means p changes from game to game depending on outcomes of previous games. To test for serial correlation, we

⁵ Hy Turkin and S.C. Thompson, The Official Encyclopedia of Baseball, Jubilee Edition (New York: A.S. Barnes Co., 1951).


examined the results of the first four games regardless of where played. Each of the 48 sets of four games was broken into two sets of two games, the first set consisting of game 1 and game 2, and the second set consisting of game 3 and game 4. If there is serial correlation, we might find that winning a game improved the chance of winning the next game. To test this we scored the American League team in each set of two games and thus constructed the 2 × 2 table shown in Table 7.

Table 7. Performance of American League team in 96 sets of two games

                      Second game
  First game     Win     Lose     Total
  Win             32      24        56
  Lose            24      16        40
  Total           56      40        96
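As a rough numerical companion to Table 7, the usual chi-square statistic for independence in a 2 × 2 table can be computed from the four cell counts. This is a sketch with hypothetical function names. The statistic is an added illustration, not a figure reported in the text.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def expected_cell(row_total, col_total, n):
    """Expected count under independence: product of the margins over the total."""
    return row_total * col_total / n

statistic = chi_square_2x2(32, 24, 24, 16)     # Table 7 cell counts
win_win_expected = expected_cell(56, 56, 96)   # compare with the observed 32
```

With one degree of freedom, a statistic this far below 1 gives no indication of dependence between the first and second game of a pair.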

It will be noted that the rows of this table are almost proportional to one another. There is a slight question about the interpretation of this result. To work this out, we must explore the situation when we have independence between the games. Suppose that the American League team in any particular series of 2 games has a probability $p_i$ of winning each game. Then the expected values for the particular 2-game series would be shown by the following table:

                           Second game
  First game     Win               Lose              Total
  Win            $p_i^2$           $p_i(1 - p_i)$    $p_i$
  Lose           $p_i(1 - p_i)$    $(1 - p_i)^2$     $1 - p_i$
  Total          $p_i$             $1 - p_i$         1

The expected value for the total table for all the 2-game series would be represented by the following table in which we merely sum each entry of the previous table over the subscript i:

                           Second game
  First game     Win                    Lose                   Total
  Win            $\sum p_i^2$           $\sum p_i(1 - p_i)$    $\sum p_i$
  Lose           $\sum p_i(1 - p_i)$    $\sum (1 - p_i)^2$     $\sum (1 - p_i)$
  Total          $\sum p_i$             $\sum (1 - p_i)$       n

In the ordinary test for independence we estimate the Win-Win cell by multiplying the two Win margins together and dividing by the total number. In the present case we would have

\[ \left(\sum p_i\right)^2 / n \]

as this estimate. Clearly this is not identical to $\sum p_i^2$. But we will show that in our present problem it is very close. If each $p_i$ is represented as the sum of a grand mean plus a departure from that mean in the form

\[ p_i = p + e_i, \]

where $e_i$ is the departure, then $p_i^2 = p^2 + 2pe_i + e_i^2$ and, since the departures sum to zero, $\sum p_i^2 = np^2 + \sum e_i^2$. Dividing both sides by n gives

\[ \frac{\sum p_i^2}{n} = p^2 + \frac{\sum e_i^2}{n}, \]

and the second term on the right-hand side is approximately the variance of the true $p_i$’s. From our earlier work we have estimated the standard deviation of the $p_i$’s to be about 0.13 or 0.14, so the variance of the $p_i$’s would be estimated to be about 0.017 or 0.020. This would yield an expected discrepancy between $\sum p_i^2$ and $(\sum p_i)^2/n = np^2$ of about 96 × 0.017 or 96 × 0.020, which is roughly 1.6 or 2.0 units. So in our Table 7 the expected cell values should not be and are not very many cases away from the result predicted by the products of the margins. Therefore, approximate independence in the table of pairs of games is consonant with the notion of game-to-game independence.

To sum up: we have made tests of the reasonableness of the assumption of the constancy of p throughout the Series and of the independence of p from game to game, and we have found no reason to reject the hypothesis of binomiality, in spite of the fact that it disagrees with our intuition or with our knowledge of the facts of games within the regular season.

We have not, of course, completely tested binomiality. We have only checked two of the most obvious sources of disturbance that might be present. One could check on various additional conjectures for which data are available. But the final word on the assumptions would come from an analysis of replications of the games, and, of course, there are no replications of World Series games. The issue here is that though the average p does not change as


we go from at-home to away or from first to second game, it is still possible that p itself changes from game to game, though in no systematic way. Furthermore, the fact that the principal assumptions are reasonable when using a model in connection with World Series data may not help much if one wishes to use the model with other kinds of data. The methods used to investigate the agreement of the assumptions with the facts may have value in other cases, though, especially when detailed information is available about orders-of-test and other pertinent facts.

Odds Quoted for Series

Something a good many fans are concerned about is the before-Series chances of the contenders. At the suggestion of Harry V. Roberts we have gathered the odds quoted in advance of the Series for 36 years; these odds are published by betting commissioners or sometimes are the odds being used generally by the public for small bets. One way to look at these odds is to consider them a group-judgment of the subjective probabilities associated with the two teams as they enter the Series.

Information on betting odds was found in articles in the New York Times, and for all years represents information quoted on the morning of the opening day of the World Series. Naturally, information reported varies somewhat from year to year, and occasionally arbitrary procedures had to be adopted to get quantitative probabilities to associate with each team and to make the total probability unity. The betting fraternity may have been active in the early years, but reports for 1913, 1914, and 1915 were vague and seemed very hard to quantify. Therefore, these data start in 1916. We list the information for 1913, 1914, and 1915:

  New York Times, Oct. 7, 1913    “No odds in the betting”
  New York Times, Oct. 9, 1914    “Few backing Mac’s team” . . . “American League are favorites but adherents . . . not loudly proclaiming powers”
  New York Times, Oct. 8, 1915    “There is a little conservative betting . . . they are about evenly matched.”

Starting in 1916, however, we observed remarks giving actual odds like: “Boston favored 10 to 8” (N.Y. Times, Oct. 7, 1916); “Giant partisans . . . will wager all the cash Chicago fans want at even money but they will not offer odds” (N.Y. Times, Oct. 6, 1917); “ . . . oppressive silence among the fans who usually make wagers . . . no choice . . . a few Boston enthusiasts willing to wager 6 to 5 on the Red Sox but these are small bets” (N.Y. Times, Sept. 5, 1918). In some of the later years more detailed information is available giving the odds in both directions. In the computation, we have averaged the odds in cases where more than one set was given (like the 1918 quote above), so that if the big money was being wagered at even money and smaller bettors were willing to wager at 6 to 5, we have arbitrarily decided that the over-all odds prevailing were 5.5


to 5. When the odds are given in both directions, the probabilities associated with the two teams do not add up to unity, because a percentage has been deducted so that the betting commissioners profit no matter who wins. Since we are not immediately interested in the breadwinning activities of the betting commissioners, we have divided the remaining probability (1 − sum of probabilities for the two teams) in half and added equal amounts of probability to the estimate for each team.

A sample computation will clarify this procedure. In 1931, the October 1 N.Y. Times gives one to two against the Athletics (A.L.) and 8 to 5 against the Cardinals (N.L.). To the fan this says: one dollar will get you two dollars if you bet against the Athletics and the Athletics lose the Series; if you choose to bet against the Cardinals, you must wager $8 to win $5 if the Cardinals lose. These figures lead us to the following preliminary assessment of the total probability:

\[ \frac{1}{3} + \frac{8}{13} = \frac{13}{39} + \frac{24}{39} = \frac{37}{39}. \]

Since 37/39 is 2/39 less than unity, we arbitrarily add 1/39 to the fraction expressing the odds the bookmakers are giving for each team. Adding 1/39 to 24/39 gives a total probability for the Athletics’ winning of 25/39, or 0.64. Calculations like this were carried through for each year for the Series winner starting in 1916. Since the total probability is always unity, the probability for the loser is the complement of the winner’s probability. (1924 material may not be worth including because of a scandal on the eve of the Series. The best information obtainable was “betting odds were not decisive for either contender.” We arbitrarily decided to include this information and assessed it as a 50-50 situation.)

It turns out that the average subjective probability associated with the winner is 55.33 per cent. This leads us to the opinion that the betting fraternity has some ability to pick the winner.
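The 1931 sample computation above can be retraced with exact fractions. The following is a minimal sketch, not the authors' own worksheet; the function names `implied_probability` and `normalize_pair` are illustrative assumptions, and the only inputs are the two quotations given in the text.

```python
from fractions import Fraction

def implied_probability(stake, payout):
    """Probability implied by having to stake `stake` units to win `payout` units."""
    return Fraction(stake, stake + payout)

def normalize_pair(prob_a, prob_b):
    """Split the gap between the quoted total and unity equally between the two teams."""
    adjustment = (1 - prob_a - prob_b) / 2
    return prob_a + adjustment, prob_b + adjustment

# 8 to 5 against the Cardinals: stake 8 to win 5 if the Athletics take the Series.
athletics_raw = implied_probability(8, 5)   # 8/13 = 24/39
# 1 to 2 against the Athletics: stake 1 to win 2 if the Cardinals take the Series.
cardinals_raw = implied_probability(1, 2)   # 1/3 = 13/39

athletics, cardinals = normalize_pair(athletics_raw, cardinals_raw)
print(athletics, round(float(athletics), 2))
print(cardinals, athletics + cardinals)
```

Because the same adjustment is added to both teams, the two adjusted probabilities always sum to unity, which is why the loser's probability can be read off as the complement of the winner's.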
Furthermore, by looking at the average probability for 1916-33 and comparing it with the average probability for 1934-51, we see that bettors are getting better at picking the winner. The average probability for the winner from 1916-33 is 0.5028, compared to 0.6039 for 1934-51. Of course, many will argue that such a powerful team as the Yankees have had on many occasions since 1934 makes prediction easy. Much more pertinent than average probabilities—at least from the bettor’s point of view—is the number of times a probability greater than 1/2 was associated with the actual winner of the Series. These data are shown below, omitting years when the probability was 50-50:

                                                          Number of Winners
Probability greater than 0.5 published before Series             24
Probability less than 0.5 published before Series                 8
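One way to compare the 24-8 split tabulated above with an even split is an exact binomial tail probability. The text does not say which test the author used, so the sketch below is an assumption about method, and `upper_tail` is an illustrative name.

```python
from math import comb

def upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# 32 Series with a clear favorite; the favorite won 24 of them.
one_sided = upper_tail(24, 32)
two_sided = min(1.0, 2 * one_sided)   # symmetric null, so doubling is exact here
print(f"one-sided: {one_sided:.4f}, two-sided: {two_sided:.4f}")
```

Under these assumptions both tail probabilities fall well below 0.01.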

If this 24-8 split is tested against a 16-16 split (null hypothesis), the result is highly significant. Thus the favorite has won 75 per cent of the time. Recalling that we estimate that the better team wins only 80 per cent of the time, this


represents rather good choosing. If we let x represent the fraction of times the better team is picked as the favorite, then the fraction of times the favorite would win on the average (using our previous 0.8 as the probability that better teams win) is 0.8x + 0.2(1 − x). If we equate this to the observed fraction of the time that the favorite has won we find

\[ 0.8x + 0.2(1 - x) = 0.75, \qquad x = \frac{0.55}{0.60} = 0.917. \]

In other words, we estimate that the better team has been made the favorite 92 per cent of the time.

We have also computed the odds given before the Series for the American League teams, since in much of this paper attention has been directed to the American League. The average before-Series subjective probability associated with the American League team is 0.5802; for the same period (1916-51) the American League team has won 24 of 36 Series, or 66.67 per cent. We note a trend toward increasing favoritism toward the American League team when the available data are split in half. For 1916-33 the average probability given for the American League team is 0.5372 while for 1934-51 the corresponding figure is 0.6228.

The reader may wonder what the boundary for the probability is at the upper end: we can see that the favored team does not get odds better than 2/3 very often, and the lower limit is 1/2. For the 36-year period, the average value for the larger subjective probability is 0.6044. This means that the favorite gets odds of about 3 to 2 on the average.

These data on betting odds are presented for their own interest, rather than for any contribution they make to the problem of estimating the probability associated with the better team. The bettors seem to have ability to pick the winning team, and as time goes on the bettors are getting more confident of their judgment. The strategy (since 1935) seems to be to pick the American League team unless that team is the Browns, and if the Yankees are the American League team good odds are found as low as 1 to 2.
Finally, we have traced the financial successes and set-backs of dyed-in-the-wool American League bettors and dyed-in-the-wool National League bettors. It is assumed that each year $100 was wagered on, say, the American League team, at the odds prevailing, and then depending on the outcome of the Series, either a profit was made or a loss was taken. At the end of 36 years (1916-1951), a gambler betting only on the American League team would have been ahead $556. At the end of the same 36 years, another gambler betting only on the National League would have been behind $808.

summary

We have used two methods to estimate 1) the percentage of times the American League has had the better team, and 2) the average p-value for


better teams. The first method (Model A) made the unrealistic assumption that the better team had the same chance of winning every year, and the second method (Model B) made the unrealistic assumption that the p-values for American League teams were normally distributed from year to year. Both methods have the assumption that the probability of winning single games is constant within a Series, and that the outcomes of games within a Series are independent. Two checks on the binomial assumption showed no reason to reject it; better, they showed good agreement with it. For the percentage of times the American League has had the better team the two methods of estimating led to results of 76 and 72, or in round numbers 75 per cent. For the average single-game p-value for better teams, the estimates are 0.655 and 0.634, or in round numbers about 0.65. The two methods yield fairly close agreement.

appendix

Derivation of $\sigma^2(\hat{p})$

To derive the variance of $\hat{p}$ we write below all the outcomes (x, y), the estimates corresponding to these outcomes, the probabilities of these estimates, and then we compute the second raw moment of $\hat{p}$.

\[
\begin{array}{l|cccccccc}
\text{Outcomes} & (4,0) & (4,1) & (4,2) & (4,3) & (3,4) & (2,4) & (1,4) & (0,4) \\ \hline
\hat{p} & 1 & 3/4 & 3/5 & 3/6 & 3/6 & 2/5 & 1/4 & 0 \\
P(\hat{p}) & p^4 & 4p^4q & 10p^4q^2 & 20p^4q^3 & 20p^3q^4 & 10p^2q^4 & 4pq^4 & q^4
\end{array}
\]

\[ \sum \hat{p}^2 P(\hat{p}) = p^4 + \tfrac{9}{4}p^4q + \tfrac{18}{5}p^4q^2 + 5p^4q^3 + 5p^3q^4 + \tfrac{8}{5}p^2q^4 + \tfrac{1}{4}pq^4 = E(\hat{p}^2). \]

To get the variance $\sigma^2(\hat{p})$ we subtract $p^2$, but in the form

\[ p^2 = p^4 + 2p^4q + 3p^4q^2 + 4p^4q^3 + 4p^3q^4 + p^2q^4. \]

This gives

\[ \sigma^2(\hat{p}) = pq\left[\tfrac{1}{4}p^3 + \tfrac{3}{5}p^3q + p^3q^2 + p^2q^3 + \tfrac{3}{5}pq^3 + \tfrac{1}{4}q^3\right]. \]

Now the probabilities of the outcomes for a 5-game series have to add up to unity, so we write 1/4 in the form

\[ \tfrac{1}{4} = \tfrac{1}{4}\,(p^3 + 3p^3q + 6p^3q^2 + 6p^2q^3 + 3pq^3 + q^3). \]

This relation simplifies $\sigma^2(\hat{p})$ to

\[ \sigma^2(\hat{p}) = \frac{pq}{20}\left[5 - pq\,(3p^2 + 10p^2q + 10pq^2 + 3q^2)\right]. \]
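The algebra up to this point can be checked by brute force: enumerate the eight tabulated outcomes, compute the variance of the estimate directly, and compare it with the bracketed expression just obtained. The sketch below uses exact rational arithmetic; the function names are illustrative and not part of the paper.

```python
from fractions import Fraction

def outcome_table(p):
    """(estimate, probability) pairs for the eight outcomes of a 7-game series."""
    q = 1 - p
    return [
        (Fraction(1), p**4),                   # (4,0)
        (Fraction(3, 4), 4 * p**4 * q),        # (4,1)
        (Fraction(3, 5), 10 * p**4 * q**2),    # (4,2)
        (Fraction(3, 6), 20 * p**4 * q**3),    # (4,3)
        (Fraction(3, 6), 20 * p**3 * q**4),    # (3,4)
        (Fraction(2, 5), 10 * p**2 * q**4),    # (2,4)
        (Fraction(1, 4), 4 * p * q**4),        # (1,4)
        (Fraction(0), q**4),                   # (0,4)
    ]

def variance_by_enumeration(p):
    """Second raw moment minus the squared mean, summed over the table."""
    table = outcome_table(p)
    mean = sum(est * prob for est, prob in table)
    second_moment = sum(est**2 * prob for est, prob in table)
    return second_moment - mean**2

def variance_closed_form(p):
    """(pq/20)[5 - pq(3p^2 + 10p^2 q + 10p q^2 + 3q^2)]."""
    q = 1 - p
    return p * q / 20 * (5 - p * q * (3 * p**2 + 10 * p**2 * q + 10 * p * q**2 + 3 * q**2))

for p in (Fraction(1, 2), Fraction(13, 20), Fraction(4, 5)):
    assert variance_by_enumeration(p) == variance_closed_form(p)
    print(p, variance_closed_form(p))
```

Passing `Fraction` values keeps every step exact, so agreement between the two functions is an identity check rather than a floating-point coincidence.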


Similarly a 3-game series has outcomes whose probabilities add to unity, and we write 3 in the form

\[ 3 = 3\,(p^2 + 2p^2q + 2pq^2 + q^2). \]

This relation simplifies $\sigma^2(\hat{p})$ to

\[ \sigma^2(\hat{p}) = \frac{pq}{20}\left[5 - pq\,(3 + 4p^2q + 4pq^2)\right], \]

\[ \sigma^2(\hat{p}) = \frac{pq}{20}\left[5 - 3pq - 4p^2q^2\right]. \]

When p = 1/2, $\sigma^2(\hat{p})$ = 1/20. A non-truncated binomial variance with p = 1/2 would need n = 5 to yield pq/n = 1/20, so we might say that the effective number of observations for estimating p in a 7-game series is approximately 5. The effective number for estimating p is less than the average number of games, which is 5 13/16 when p = 1/2.

Means of Truncated Normal Distribution

If f(x) is a probability density function on the interval (−∞, ∞) and we wish to evaluate the mean value of x given that x > a, we can use the expression

\[ E(x \mid x > a) = \frac{\int_a^{\infty} x f(x)\,dx}{\int_a^{\infty} f(x)\,dx}. \]

In our problem we are interested in the average p for the better team. We have assumed that p is normally distributed for American League teams. The American League team is better when p > 1/2, the National League team is better when p < 1/2. We need to compute the mean p for the American League when it is better, and the mean p for the National League when it is better. These two means can then be weighted by their estimated frequency of occurrence to give a final weighted mean estimate of the average probability that the better team wins single games.

To obtain the truncated normal estimate for the average p for better teams, we first set up the normal distribution with mean $\bar{p}$ and standard deviation $\hat{\sigma}$. For printing convenience, we drop the bar and circumflex. When the American League has the better team, the mean p-value is

\[ \frac{\dfrac{1}{\sigma\sqrt{2\pi}} \displaystyle\int_{0.5}^{\infty} x\, e^{-\frac{1}{2}[(x-p)/\sigma]^2}\,dx}{P(\text{A.L. better})} = \frac{\sigma \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}[(0.5-p)/\sigma]^2} + p\,P(\text{A.L. better})}{P(\text{A.L. better})}. \tag{a} \]

When the National League has the better team the mean p is

\[ \frac{\dfrac{1}{\sigma\sqrt{2\pi}} \displaystyle\int_{0.5}^{\infty} x\, e^{-\frac{1}{2}[x-(1-p)]^2/\sigma^2}\,dx}{P(\text{N.L. better})} = \frac{\sigma \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}[(p-0.5)/\sigma]^2} + (1-p)\,P(\text{N.L. better})}{P(\text{N.L. better})}. \tag{b} \]

Weighting the contributions (a) and (b) by their probabilities of occurrence gives

\[ \text{Final estimate} = 2\sigma \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}[(p-0.5)/\sigma]^2} + p\,P(\text{A.L. better}) + (1-p)\,P(\text{N.L. better}). \]

With σ = 0.1315, p = 0.583, (p − 0.5)/σ = 0.63, P(A.L. better) = 0.74, we get

Estimate = 2(0.1315)(0.3271) + (0.583)(0.74) + 0.417(0.26) = 0.626,

as reported in the text. Using σ = 0.1439 from our second approximation, we have p = 0.583, (p − 0.5)/σ = 0.576, P(A.L. better) = 0.718, and the improved approximation is 0.634, as the average single-game probability for the better team.
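The final estimate above is easy to evaluate without normal tables. The sketch below is an illustration only: it computes the normal density and distribution function from the standard library, and the function names are assumptions rather than anything used in the paper.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def better_team_mean(p, sigma):
    """2*sigma*phi(z) + p*P(A.L. better) + (1 - p)*P(N.L. better), with z = (p - 0.5)/sigma."""
    z = (p - 0.5) / sigma
    prob_al_better = normal_cdf(z)
    prob_nl_better = 1 - prob_al_better
    return 2 * sigma * normal_pdf(z) + p * prob_al_better + (1 - p) * prob_nl_better

first = better_team_mean(0.583, 0.1315)
second = better_team_mean(0.583, 0.1439)
print(f"{first:.4f} {second:.4f}")
```

Because the code carries unrounded intermediate values while the text rounds (p − 0.5)/σ and P(A.L. better) before multiplying, the results agree with the reported 0.626 and 0.634 only to within about 0.001.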

Reprinted from Journal of the American Statistical Association (1954), 49, pp. 13–35

13. Principles of Sampling

William G. Cochran, Frederick Mosteller, and John W. Tukey
Johns Hopkins University; Harvard University; and Princeton University

i. samples and their analyses

1. Introduction

Whether by biologists, sociologists, engineers, or chemists, sampling is all too often taken far too lightly. In the early years of the present century it was not uncommon to measure the claws and carapaces of 1000 crabs, or to count the number of veins in each of 1000 leaves, and then to attach to the results the “probable error” which would have been appropriate had the 1000 crabs or the 1000 leaves been drawn at random from the population of interest. Such actions were unwarranted shotgun marriages between the quantitatively unsophisticated idea of sample as “what you get by grabbing a handful” and the mathematically precise notion of a “simple random sample.”

In the years between we have learned caution by bitter experience. We insist on some semblance of mechanical (dice, coins, random number tables, etc.) randomization before we treat a sample from an existent population as if it were random. We realize that if someone just “grabs a handful,” the individuals in the handful almost always resemble one another (on the average) more than do the members of a simple random sample. Even if the “grabs” are randomly spread around so that every individual has an equal chance of entering the sample, there are difficulties. Since the individuals of grab samples resemble one another more than do individuals of random samples, it follows (by a simple mathematical argument) that the means of grab samples resemble one another less than the means of random samples of the same size. From a grab sample, therefore, we tend to underestimate the variability in the population, although we should have to overestimate it in order to obtain valid estimates of variability of grab sample means by substituting such an estimate

This paper will constitute Appendix G of Cochran, Mosteller, and Tukey, Statistical Problems of the Kinsey Report, to be published by the American Statistical Association later this year as a monograph. The main body of this monograph was published in the Journal last December (Vol. 48 (1953), pp. 673–716).


into the formula for the variability of means of simple random samples. Thus using simple random sample formulas for grab sample means introduces a double bias, both parts of which lead to an unwarranted appearance of higher stability.

Returning to the crabs, we may suppose that the crabs in which we are interested are all the individuals of a wide-ranging species, spread along a few hundred miles of coast. It is obviously impractical to seek to take a simple random sample from the species; no one knows how to give each crab in the species an equal chance of being drawn into the sample (to say nothing of trying to make these chances independent). But this does not bar us from honestly assessing the likely range of fluctuation of the result. Much effort has been applied in recent years, particularly in sampling human populations, to the development of sampling plans which simultaneously (i) are economically feasible, (ii) give reasonably precise results, and (iii) show within themselves an honest measure of fluctuation of their results.

Any excuse for the dangerous practice of treating non-random samples as random ones is now entirely tenuous. Wider knowledge of the principles involved is needed if scientific investigations involving samples (and what such investigation does not?) are to be solidly based. Additional knowledge of techniques is not so vitally important, though it can lead to substantial economic gains.

A botanist who gathered 10 oak leaves from each of 100 oak trees might feel that he had a fine sample of 1000, and that, if 500 were infected with a certain species of parasites, he had shown that the percentage infection was close to 50%. If he had studied the binomial distribution he might calculate a standard error according to the usual formula for random samples, $p \pm \sqrt{pq/n}$, which in this case yields 50 ± 1.6% (since p = q = .5 and n = 1000). In doing this he would neglect three things:

(i) Probable selectivity in selecting trees (favoring large trees, perhaps?),
(ii) Probable selectivity in choosing leaves from a selected tree (favoring well-colored or, alternatively, visibly infected leaves perhaps), and
(iii) the necessary allowance, in the formula used to compute the standard error, for the fact that he has not selected his leaves individually at random, as the mathematical model for a simple random sample prescribes.

Most scientists are keenly aware of the analogs of (i) and (ii) in their own fields of work, at least as soon as they are pointed out to them. Far fewer seem to realize that, even if the trees were selected at random from the forest and the leaves were chosen at random from each selected tree, (iii) must still be considered. But if, as might indeed be the case, each tree were either wholly infected or wholly free of infection, then the 1000 leaves tell us no more than 100 leaves, one from each tree. (Each group of 10 leaves will be all infected or all free of infection.) In this case we should take n = 100 and find an infection rate of 50 ± 5%.
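The two standard errors quoted in the oak leaf example follow from the same formula with different values of n. A minimal sketch of the arithmetic, with an illustrative function name:

```python
import math

def binomial_standard_error(p, n):
    """sqrt(pq/n), the simple-random-sample standard error of a proportion."""
    return math.sqrt(p * (1 - p) / n)

# Treating the 1000 leaves as individually random draws.
se_leaves = binomial_standard_error(0.5, 1000)
# Treating each wholly infected or wholly clean tree as one observation.
se_trees = binomial_standard_error(0.5, 100)

print(f"n = 1000: 50 +/- {100 * se_leaves:.1f}%")
print(f"n = 100:  50 +/- {100 * se_trees:.1f}%")
print(f"ratio: {se_trees / se_leaves:.2f}")
```

In this extreme case the honest standard error is larger by a factor of the square root of 10, the number of leaves taken per tree.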


Such an extreme case of increased fluctuation due to sampling in clusters would be detected by almost all scientists, and is not a serious danger. But less extreme cases easily escape detection and may therefore be very dangerous. This is one example of the reasons why the principles of sampling need wider understanding.

We have just described an example of cluster sampling, where the individuals or sampling units are not drawn into the sample independently, but are drawn in clusters, and have tried to make it clear that “individually at random” formulas do not apply. It was not our intention to oppose, by this example, the use of cluster sampling, which is often desirable, but only to speak for proper analysis of its results.

2. Self-weighting probability samples

There are many ways to draw samples such that each individual or sampling unit in the population has an equal chance of appearing in the sample. Given such a sample, and desiring to estimate the population average of some characteristic, the appropriate procedure is to calculate the (unweighted) mean of all the individual values of that characteristic in the sample. Because weights are equal and require no obvious action, such a sample is self-weighting. Because the relative chances of different individuals entering the sample are known and compensated for (are, in this case, equal), it is a probability sample. (In fact, it would be enough if we knew somewhat less, as is explained in Section 5.) Such a sample need not be a simple random sample, such as one would obtain by numbering all the individuals in the population, and then using a table of random numbers to select the sample on the basis: one random number, one individual.

We illustrate this by giving various examples, some practical and others impractical. Consider the sample of oak leaves; it might in principle be drawn in the following way. First we list all trees in the forest of interest, recording for each tree its location and the number of leaves it bears. Then we draw a sample of 100 trees, arranging that the probability of a tree’s being selected is proportional to the number of leaves which it bears. Then on each selected tree we choose 10 leaves at random. It is easy to verify that each leaf in the forest has an equal chance of being selected. (This is a kind of two-stage sampling with probability proportional to size at the first stage.) We must emphasize that such terms as “select at random,” “choose at random,” and the like, always mean that some mechanical device, such as coins, cards, dice, or tables of random numbers, is used.

A more practical way to sample the oak leaves might be to list only the locations of the trees (in some parts of the country this could be done from a single aerial photograph), and then to draw 100 trees in such a way that each tree has an equal chance of being selected. The number of leaves on each tree is now counted and the sample of 1000 is prorated over the 100 trees in proportion to their numbers of leaves. It is again easy to verify that each leaf


has an equal chance of appearing in the sample. (This is a kind of two-stage sampling with probability proportional to size at the second stage.)

If the forest is large, and each tree has many leaves, either of these procedures would probably be impractical. A more practical method might involve a four-stage process in which:

(a) the forest is divided into small tracts,
(b) each tract is divided into trees,
(c) each tree is divided into recognizable parts, perhaps limbs, and
(d) each part is divided into leaves.

In drawing a sample, we would begin by drawing a number of tracts, then a number of trees in each tract, then a part or number of parts from each tree, then a number of leaves from each part. This can be done in many ways so that each leaf has an equal chance of appearing in the sample.

A different sort of self-weighting probability sample arises when we draw a sample of names from the Manhattan telephone directory, taking, say, every 17,387th name in alphabetic order starting with one of the first 17,387 names selected at random with equal probability. It is again easy to verify that every name in the book has an equal chance of appearing in the sample (this is a systematic sample with a random start, sometimes referred to as a systematic random sample).

As a final example of this sort, we may consider a national sample of 480 people divided among the 48 states. We cannot divide the 480 cases among the individual states in proportion to population very well, since Nevada would then receive about one-half of a case. If we group the small states into blocks, however, we can arrange for each state or block of states to be large enough so that on a pro rata basis it will have at least 10 cases. Then we can draw samples within each state or block of states in various ways. It is easy to verify that the chances of any two persons entering such a sample (assuming adequate randomness within each state or block of states) are approximately the same, where the approximation arises solely because a whole number of cases has to be assigned to each state or block of states. (This is a rudimentary sort of stratified sample.)

All of these examples were (at least approximately) self-weighting probability samples, and all yield honest estimates of population characteristics. Each one requires a different formula for assessing the stability of its results! Even if the population characteristic studied is a fraction, almost never will

\[ p \pm \sqrt{\frac{pq}{n}} \]

be a proper expression for “estimate ± standard error.” In every case, a proper formula will require more information from the sample than merely the overall percentage. (Thus, for instance, in the first oak leaf example, the variability from tree to tree of the number infested out of 10 would be needed.)
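The "easy to verify" claim for the first oak leaf design (trees drawn with probability proportional to leaf count, then 10 leaves per drawn tree) can be checked by direct arithmetic. The sketch below assumes, for simplicity, that the 100 trees are drawn independently with replacement and that every tree bears at least 10 leaves; the leaf counts are hypothetical and the function name is illustrative.

```python
from fractions import Fraction

def expected_selections_per_leaf(leaf_counts, trees_drawn=100, leaves_per_tree=10):
    """Expected number of times a single leaf on each tree enters the sample."""
    total_leaves = sum(leaf_counts)
    rates = []
    for count in leaf_counts:
        tree_prob = Fraction(count, total_leaves)      # chance this tree is taken on one draw
        leaf_prob = Fraction(leaves_per_tree, count)   # chance a given leaf is taken from it
        rates.append(trees_drawn * tree_prob * leaf_prob)
    return rates

# Hypothetical forest of four trees with very different sizes.
rates = expected_selections_per_leaf([200, 1500, 40000, 125000])
print(rates)
```

The tree's leaf count cancels between the two stages, so every leaf has the same rate, (sample size)/(total leaves in the forest), whatever the size of its tree.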


3. Representativeness

Another principle which ought not to need recalling is this: By sampling we can learn only about collective properties of populations, not about properties of individuals. We can study the average height, the percentage who wear hats, or the variability in weight of college juniors, or of University of Indiana juniors, or of the juniors belonging to a certain fraternity or club at a certain institution. The population we study may be small or large, but there must be a population, and what we are studying must be a population characteristic. By sampling, we cannot study individuals as particular entities with unique idiosyncrasies; we can study regularities (including typical variabilities as well as typical levels) in a population as exemplified by the individuals in the sample.

Let us return to the self-weighted national sample of 480. Notice that about half of the times that such a sample is drawn, there will be no one in it from Nevada, while almost never will there be anyone from Esmeralda County in that state. Local pride might argue that “this proves that the sample was unrepresentative,” but the correct position seems to be this:

(i) the particular persons in the sample are there by accident, and this is appropriate, so far as population characteristics are concerned,
(ii) the sampling plan is representative since each person in the U.S. had an equal chance of entering the sample, whether he came from Esmeralda County or Manhattan.

That which can be and should be representative is the sampling plan, which includes the manner in which the sample was drawn (essentially a specification of what other samples might have been drawn and what the relative chances of selection were for any two possible samples) and how it is to be analyzed.

However great their local pride, the citizens of Esmeralda County, Nevada, are entitled to representation in a national sampling plan only as individual members of the U.S. population. They are not entitled to representation as a group, or as particular individuals; only as individual members of the U.S. population. The same is true of the citizens of Nevada, who are represented in only half of the actual samples. The citizens of Nevada, as a group, are no more and no less entitled to representation than any other group of equal size in the U.S. whether geographical, racial, marital, criminal, selected at random, or selected from those not in a particular national sample. It is clear that many such groups fail to be represented in any particular sample, yet this is not a criticism of that sample.

Representation is not, and should not be, by groups. It is, and should be, by individuals as members of the sampled population. Representation is not, and should not be, in any particular sample. It is, and should be, in the sampling plan.

4. One method of assessing stability

Because representativeness is inherent in the sampling plan and not in the particular sample at hand, we can never make adequate use of sample results


without some measure of how well the results of this particular sample are likely to agree with the results of other samples which the same sampling plan might have provided. The ability to assess stability fairly is as important as the ability to represent the population fairly. Modern sampling plans concentrate on both. Such assessment must basically be in terms of sample results, since these are usually our most reliable source of information about the population. There is no reason, however, why assessment should depend only on the sample size and the overall (weighted) sample mean for the characteristic considered. These two suffice when measuring percentages with a simple random sample, but in almost all other cases the situation is more complex. It would be too bad if, every time such samples were used, the user had to consult a complicated table of alternative formulas, one for each plan, before calculating his standard errors. (These formulas do need to be considered whenever we are trying to do a really good job of maximum stability for minimum cost—considered very carefully in selecting one complex design in preference to another.) Fortunately, however, this complication can often be circumvented. One of the simplest ways is to build up the sample from a number of independent subsamples, each of which is self-sufficient, though small, and to tabulate the results of interest separately for each subsample. Then variation among separate results gives a simple and honest yardstick for the variability of the result or results obtained by throwing all the samples together. Such a sampling plan involves interpenetrating replicate subsamples. All of us can visualize interpenetrating replicate subsamples when the individuals or sampling units are drawn individually at random. Some examples in more complex cases may be helpful. In the first oak leaf example, we might select randomly, not one sample of 100 trees, but 10 subsamples of 10 trees each. 
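The yardstick described above, variation among separately tabulated subsample results, can be sketched in a few lines. The ten results below are hypothetical, the function name is illustrative, and the sketch assumes replicate subsamples of equal size, so that the combined result is the plain mean of the subsample results.

```python
import math
import statistics

def replicate_summary(subsample_results):
    """Combined estimate and its standard error from independent replicate results."""
    k = len(subsample_results)
    combined = statistics.mean(subsample_results)
    # Sample standard deviation (divisor k - 1) of the replicate results.
    spread = statistics.stdev(subsample_results)
    return combined, spread / math.sqrt(k)

# Hypothetical fraction infected in each of 10 independent replicate subsamples.
results = [0.52, 0.47, 0.61, 0.40, 0.55, 0.49, 0.58, 0.43, 0.50, 0.45]
estimate, std_error = replicate_summary(results)
print(f"estimate = {estimate:.3f}, standard error = {std_error:.3f}")
```

Nothing in the calculation depends on how each subsample was drawn internally; only the independence of the replicates is used.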
If we then pick 10 leaves at random from each tree, placing them in 10 bags, one for each subsample, and tabulate the results separately, bag by bag, we will have 10 interpenetrating replicate subsamples. Similarly, if we were to pick 10 subsamples out of the Manhattan phone book, with each subsample consisting of every 173,870th name (in alphabetic order) and with the 10 lead names of the 10 subsamples selected at random from the first 173,870 names, we would again have 10 interpenetrating replicate subsamples. We can always analyze 10 results from 10 independent interpenetrating replicate subsamples just as if they were 10 randomly selected individual measurements and proceed similarly with other numbers of replicate subsamples.

5. General probability samples

The types of sample described in the last section are not the only kinds from which we can confidently make inferences from the sample to the population of interest. Besides the trivial cases where the sample amounts to 90% or even 95% of the population, there is a broad class of cases, including those


of the last section as special cases. This is the class of probability samples, where:

(1) There is a population, the sampled population, from which the sample is drawn, and each element of which has some chance of entering the sample.
(2) For each pair of individuals or sampling units which are in the actual sample, the relative chances of their entering the sample are known. (This implies that the sample was selected by a process involving one or more steps of mechanical randomization.)
(3) In the analysis of the actual sample, these relative chances have been compensated for by using relative weights such that (relative chance) times (relative weight) equals a constant.
(4) For any two possible samples, the sum of the reciprocals of the relative weights of all the individuals in the sample is the same.

(Conditions (3) and (4) can be generalized still further.) In practice of course, we ask only that these four conditions shall hold with a sufficiently high degree of approximation.

We have made the sampling plan representative, not by giving each individual an equal chance to enter the sample and then weighting them equally, but by a more noticeable process of compensation, where those individuals very likely to enter the sample are weighted less, while those unlikely to enter are weighted more when they do appear. The net result is to give each individual an equal chance of affecting the (weighted) sample mean. Such general probability samples are just as honest and legitimate as the self-weighting probability samples. They often offer substantial advantages in terms of higher stability for lower cost.

We can alter our previous examples, so as to make them examples of general, and not of self-weighting, probability samples. Take first the oak leaf example. We might proceed as follows:

(1) locate all the trees in the forest of interest,
(2) select a sample of trees at random,
(3) for each sampled tree, choose 10 leaves at random and count (or estimate) the total number of leaves,
(4) form the weighted mean by summing the products (fraction of 10 leaves infested) times (number of leaves on the tree) and then divide by the total number of leaves on the 100 trees in the sample.

When we selected trees at random, each tree had an equal probability of selection. When we chose 10 leaves from a tree at random, the chance of getting a particular leaf was


\[ \frac{10}{(\text{number of leaves on the tree})}. \]

Thus the chance of selecting any one leaf was a constant multiple of this and was proportional to the reciprocal of the number of leaves of the tree. Hence the correct relative weight is proportional to the number of leaves on the tree, and it is simplest to take it as 1/10 of that number. After all, summing the products (fraction of 10 infected) times (leaves on tree) or (1/10) times (number out of 10 infected) times (leaves on tree) over all trees in the sample gives the same answer. One-tenth of this answer is given by summing (1/10) times (number out of 1 infected) times (leaves on tree) or

\[ (\text{number out of 1 infected}) \cdot \frac{(\text{leaves on tree})}{10}, \]

which shows that the weighted mean prescribed above is just what would have been obtained with relative weights of (number of leaves on tree)/10.

If in sampling the names in the Manhattan telephone directory, we desired to sample initial letters from P through Z more heavily, we might proceed as follows:

(1) Select one of the first 17,387 names at random with equal probability as the lead name.
(2) Take the lead name, and every 17,387th name in alphabetic order following it, into the sample.
(3) Take every name which begins with P, Q, R, S, . . . , Z and is the 103rd or 207th name after a name selected in step 2 of the sample.

Each name beginning with A, B, . . . , N, O has a chance of 1/17,387 of entering the sample. Each name beginning with P, Q, . . . , Y, Z has a chance of 3/17,387 of entering the sample (it enters if any one of three names among the first 17,387 is selected as the lead name). Thus the relative weight in the sample of a name beginning with A, B, . . . , N, O is 3 times that of a name beginning with P, Q, . . . , Y, Z. The weighted mean is found simply as:

\[ \frac{3\,(\text{sum for } A, B, \ldots, N, O\text{'s}) + (\text{sum for } P, Q, \ldots, Y, Z\text{'s})}{3\,(A, B, \ldots, N, O\text{'s in sample}) + (P, Q, \ldots, Y, Z\text{'s in sample})} \]

Finally we may wish to distribute our national sample of 480 with 10 in each state. The analysis exactly parallels the oak leaf case, and we have to form the sum of

13 Principles of Sampling


(mean for state sample) times (population of state) and then to divide by the population of the U.S.

6. Nature and properties of general probability samples

We can carry over the use of independent interpenetrating replicates to the general case without difficulty. We need only remember that the replicates must be independent. In the oak leaf example, the replicates must come from groups of independently selected trees. In the Manhattan telephone book example, the replicates must be based on independently chosen lead names; in the national sample, the replicates must have members in every state. In every case they must interpenetrate, and do this independently.

It is clear from discussion and examples that general probability samples are inferior to self-weighting probability samples in two ways, for both simplicity of exposition and ease of analysis are decreased! If it were not for compensating advantages, general probability samples would not be used. The main advantages are:

(1) better quality for less cost due to reduction in administrative costs or prelisting cost,
(2) better quality for less cost because of better allocation of effort over strata,
(3) greater possibility of making estimates for individual strata.

All three of these advantages can be illustrated on our examples. In the general oak leaf example, in contrast to the first oak leaf example in Section 2, there is no need to determine the size (number of leaves) of all trees. This is a clear cost reduction, whether in money or time.

Suppose that, in the Manhattan telephone book sample, one aim was an opinion study restricted to those of Polish descent. Such persons’ names tend to be concentrated in the second part of the alphabet, so that the general sample will bring out more persons of Polish descent and the interviewing effort will be better allocated.
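The weighting rule common to the general oak leaf, telephone directory, and state examples (relative weight proportional to the reciprocal of the chance of entering the sample) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `weighted_mean` and the five observations are hypothetical, and only the two selection chances come from the directory design described in Section 5.

```python
from fractions import Fraction

def weighted_mean(values, chances):
    """Weighted mean with relative weights proportional to 1 / (chance of selection)."""
    if len(values) != len(chances) or not values:
        raise ValueError("need equally long, non-empty sequences")
    weights = [1 / Fraction(c) for c in chances]
    total = sum(w * v for w, v in zip(weights, values))
    return total / sum(weights)

# Chances of entering the sample in the directory design.
CHANCE_A_TO_O = Fraction(1, 17387)
CHANCE_P_TO_Z = Fraction(3, 17387)

# Hypothetical observations (1 = has the attribute, 0 = does not).
a_to_o = [1, 0]        # two sampled names beginning with A..O
p_to_z = [1, 1, 0]     # three sampled names beginning with P..Z

values = a_to_o + p_to_z
chances = [CHANCE_A_TO_O] * len(a_to_o) + [CHANCE_P_TO_Z] * len(p_to_z)
general = weighted_mean(values, chances)

# The shortcut quoted for this design: weight 3 for A..O, weight 1 for P..Z.
shortcut = Fraction(3 * sum(a_to_o) + sum(p_to_z),
                    3 * len(a_to_o) + len(p_to_z))
```

With these made-up observations both routes give 5/9, since only the ratio of the weights matters and the common factor 17,387 cancels.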
In the case of the national sample of 480, the general sample, although probably giving a less stable national result, does permit (rather poor) state-by-state estimates where the self-weighting sample would skip Nevada about half the time. It is perhaps worth mentioning at this point that, if cost is proportional to the total number of individuals without regard to number of strata or the distribution of interviews among strata, the optimum allocation of interviews is proportional to the product (size of stratum) times (standard deviation within stratum). In particular, optimum allocation calls for sample strata not in proportion to population strata. If we weight appropriately, disproportionate samples will be better than proportionate ones—if we choose the disproportions wisely. In specifying the characteristics of a probability sampling at the beginning of this paper, we required that there be a sampled population, a population


William G. Cochran, Frederick Mosteller, and John W. Tukey

from which the sample comes and each member of which has a chance of entering the sample. We have not said whether or not this is exactly the same population as the population in which we are interested, the target population. In practice they are rarely the same, though the difference is frequently small. In human sampling, for example, some persons cannot be found and others refuse to answer. The issues involved in this difference between sampled population and target population are discussed at some length in Part II, and in chapter III-D of Appendix D in our complete report.

7. Stratification and adjustment

In many cases general probability samples can be thought of in terms of (1) a subdivision of the population into strata, (2) a self-weighting probability sample in each stratum, and (3) combination of the stratum sample means weighted by the size of the stratum. The general Manhattan telephone book sample can be so regarded. There are two strata, one made up of names beginning in A, B, · · · , N, O, and the other made up of names beginning in P, Q, · · · , Y, Z. Similarly the general national sample may be thought of as made up of 48 strata, one for each state.

This manner of looking at general probability samples is neat, often helpful, and makes the entire legitimacy of unequal weighting clear in many cases. But it is not general. For in the general oak leaf example, if there were any strata they would be whole trees or parts of trees. And not all trees were sampled. (Still every leaf was fairly represented by its equal chance of affecting the weighted sample mean.) We cannot treat this case as one of simple stratification. The stratified picture is helpful, but not basic. It must fail as soon as there are more potential strata than sample elements, or as soon as the number of elements entering the sample from a certain stratum is not a constant of the sampling plan. It usually fails sooner.
There is no substitute for the relative chances that different individuals or sampling units have of entering the sample. This is the basic thing to consider. There is another relation of stratification to probability sampling. When sizes of strata are known, there is a possibility of adjustment. Consider taking a simple random sample of 100 adults in a tribe where exactly 50% of the adults were known to be males and 50% females. Suppose the sample had 60 males and 40 females. If we followed the pure probability sampling philosophy so far expounded, we should take the equally weighted sample mean as our estimate of the population average. Yet if 59 of the 60 men had herded sheep at some time in their lives, and none of the 40 women, we should be unwise in estimating that 59% of the tribe had herded sheep at some time in their lives. The adjusted mean

$$0.50\left(\frac{59}{60}\right) + 0.50\left(\frac{0}{40}\right) = 49^{+}\,\%$$

is a far better indicator of what we have learned.

How can adjustment fail? Under some conditions the variability of the adjusted mean is enough greater than that of the unadjusted mean to offset the decrease in bias. It may be a hard choice between adjustment and nonadjustment. The last example was extreme, and the unwise choice would be made by few. But, again, less extreme cases exist, and the unwise choice, whether it be to adjust or not to adjust, may be made rather easily (and probably has been made many times). A quantitative rule is needed. One is given in chapter V-C of the complete report.

In the preceding example the relative sizes of the strata were known exactly. It turns out that inexact knowledge can be included in the computation without great increase in complexity. An example in Kinsey’s area is cited by one critic of the Kinsey report:

These weighted estimates do not, of course, reflect any population changes since 1940, which introduces some error into the statistics for the present total population. Moreover, on some of the very factors that Kinsey demonstrates to be correlated with sexual behavior, there are no Census data available. For example, religious membership is shown to be a factor affecting sexual behavior, but Census data are lacking and no weights are assigned. While the investigators interviewed members of various religious groups, there is no assurance that each group is proportionately represented, because of the lack of systematic sampling controls. Thus, the proportion of Jews in Kinsey’s sample would seem to be at least 13 per cent whereas their true proportion in the population is of the order of 4 per cent.¹

Do we know the percentage of Jews well enough to make an adjustment for it? If we can assess the stability of the “4%” figure, the procedure of Chapter V-C will answer this question.
Failing this technique, we could translate the question into more direct terms as follows: “In considering Kinsey’s results, do we want to have 13 per cent Jews or 4 per cent Jews in the sampled population?” and try to answer with the aid of general knowledge and intuition.

We have discussed the adjustment of a simple random sample. The same considerations apply to the possibility of adjusting any self-weighting or general probability sample. No new complications arise when adjustment is superposed on weighting. The presence of a complication might be suspected in the case where not all segments appear in the sample, and we attempt to use these segments as strata. Careful analysis shows the absence of the complication, as may be illustrated by carrying our example further.

¹ Hyman, H.H. and Sheatsley, P.B. “The Kinsey report and survey methodology,” International Journal of Opinion and Attitude Research, Vol. 2 (1948), 184–185.
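The arithmetic of the sheep-herding adjustment, for both the two-stratum version and the three-stratum extension taken up in the next paragraph, can be restated as a short computation. The sketch uses only the figures given in the text; the function name `adjusted_mean` is a hypothetical label, and the sketch does not attempt the Chapter V-C rule for deciding whether to adjust.

```python
from fractions import Fraction

def adjusted_mean(strata):
    """Combine stratum sample means using known population fractions.

    `strata` is a list of (population_fraction, stratum_mean) pairs;
    the population fractions must sum to 1.
    """
    if sum(f for f, _ in strata) != 1:
        raise ValueError("population fractions must sum to 1")
    return sum(f * m for f, m in strata)

# Sample from the text: 60 men (59 have herded sheep), 40 women (none have).
raw_mean = Fraction(59 + 0, 60 + 40)        # equally weighted sample mean

two_strata = adjusted_mean([
    (Fraction(1, 2), Fraction(59, 60)),     # males, known to be 50%
    (Fraction(1, 2), Fraction(0, 40)),      # females, known to be 50%
])

three_strata = adjusted_mean([
    (Fraction(499999, 1000000), Fraction(59, 60)),  # males
    (Fraction(499999, 1000000), Fraction(0, 40)),   # females
    (Fraction(2, 1000000), Fraction(1, 7)),         # indeterminate, expert's guess
])
```

The raw mean is 59%, the two-stratum adjusted mean is 59/120 (a little over 49%), and the three-stratum value differs from it by less than one part in a million.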


Suppose that the sheep-herding tribe in question contains a known, very small percentage of adults of indeterminate sex, and that none have appeared in our sample. To be sure, their existence affected, albeit slightly, the chances of males and females entering the sample, but it does not affect the thinking which urged us to take the adjusted mean. We still want to adjust, and have only the question “Adjust for what?” to answer. If the fraction of indeterminate sex is 0.000002, and the remainder are half males and half females, and if our anthropological expert feels that about 1 in 7 of the indeterminate ones has herded sheep, we have a choice between

$$0.499999\left(\frac{59}{60}\right) + 0.499999\left(\frac{0}{40}\right) + 0.000002\left(\frac{1}{7}\right)$$

which represents adjustment for three strata, one measured subjectively, and

$$0.500000\left(\frac{59}{60}\right) + 0.500000\left(\frac{0}{40}\right)$$

which represents adjustment for the two observed strata. Clearly, in this extreme example, the choice is immaterial. Clearly, also, the estimated accuracy of the anthropologist’s judgment must enter. We can again use the methods of Chapter V-C.

8. Upper semiprobability sampling

Let us be a little more realistic about our botanist and his sample of oak leaves. He might have an aerial photograph, and be willing to select 100 trees at random. But any ladder he takes into the field is likely to be short, and he may not be willing to trust himself in the very top of the tree with lineman’s climbing irons. So the sample of 10 leaves that he chooses from each selected tree will not be chosen at random. The lower leaves on the tree are more likely to be chosen than the highest ones. In the two-stage process of sampling, the first stage has been a probability sample, but the second has not (and may even be entirely unplanned!). These are the characteristic features of an upper semiprobability sample.
As a consequence, the sampled population agrees with the target population in certain large-scale characteristics, but not in small-scale ones and, usually, not in other large-scale characteristics. Thus, if in the oak leaf example we use the weights appropriate to different sizes of tree, as we should, the sampled population of leaves will (1) have the correct relative number of leaves for each tree, but (2) will have far too many lower leaves and far too few upper leaves. The large-scale characteristic of being on a particular tree is a matter of agreement between sampled and target populations. The large-scale characteristic of height in the tree (and many small-scale characteristics that the reader can


easily set up for himself), is a matter of serious disagreement between sampled and target populations. The sampled population differs from the target population within each segment, here a tree, although sampled population segments and target population segments are in exact proportion. If infestation varies between the bottoms and the tops of the trees, this type of sampling will be biased, and, while the inferences from sample to sampled population will be correct, they may be useless or misleading because of the great difference between sampled population and target population. Such dangers always exist with any kind of nonprobability sampling. Upper semiprobability sampling is no exception. By selecting the trees at random we have stultified biases due to probable selectivity between trees, and this is good. But we have done nothing about almost certain selectivity between leaves on a particular tree—this may be all right, or very bad. It would be nice to always have probability samples, and avoid these difficulties. But this may be impractical. (The conditions under which a nonprobability sample may reasonably be taken are discussed in Part II.) There is one point which needs to be stressed. The change from probability sampling within segments (in the example, within trees) to some other type of sampling, perhaps even unplanned sampling, shifts a large and sometimes difficult part of the inference from sample to target population—shifts it by moving the sampled population away from the target population toward the sample—shifts it from the shoulders of the statistician to the shoulders of the subject matter “expert.” Those who use upper semiprobability samples, or other nonprobability samples, take a heavier load on themselves thereby. Upper semiprobability samples may be either self-weighting or general. 
The “quota samples” of the opinion pollers, where interviewers are supposed to meet certain quotas by age, sex, and socioeconomic status, are rather crude forms of upper semiprobability samples, and are often self-weighting. Bias within segments arises, some contribution being due, for example, to the different availability of different 42 year old women of the middle class. The sampled population may contain sexes, ages, and socioeconomic classes in the right ratios, but retiring persons are under-represented (and hermits are almost entirely absent) in comparison with the target population. Election samples of opinion, although following the same quota pattern, will ordinarily only be self-weighting within states (if we ignore the “who will vote” problem). Predictions are desired for individual states. If Nevada had a mere 100 cases in a self-weighting sample, the total size of a national sample would have to be about 100,000. When national percentages are to be compiled, it would be foolish not to weight each state mean in accordance with the size of the state. No one would favor, we believe, weighting each state equally just because there may be (and probably are) biases within each state. Disproportionate samples and unequal weights are just as natural and wise a part of upper semiprobability sampling as they are of probability sampling.
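The Nevada figures can be checked with a little arithmetic. The sketch below infers Nevada's share of the population from the statement that 100 Nevada cases require a national sample of about 100,000; that share is an inferred value, not a census figure, and the function names are hypothetical. The chance of missing the state is computed on the simplifying assumption that the draws are independent.

```python
def required_total(cases_wanted, population_share):
    """Total self-weighting sample size needed to expect `cases_wanted`
    cases from a stratum holding `population_share` of the population."""
    return cases_wanted / population_share

def chance_of_no_cases(total_sample, population_share):
    """Chance that a stratum is missed entirely, assuming every one of the
    `total_sample` draws is independent (a simplifying assumption)."""
    return (1 - population_share) ** total_sample

# Share implied by the text's figures: 100 cases out of about 100,000.
NEVADA_SHARE = 100 / 100_000   # inferred from the text, not a census value

total_for_100 = required_total(100, NEVADA_SHARE)
expected_in_480 = 480 * NEVADA_SHARE
skip_chance = chance_of_no_cases(480, NEVADA_SHARE)
```

Under these assumptions a self-weighting national sample of 480 expects about half a Nevada case and misses the state roughly 6 times in 10, which is of the same order as the earlier remark that such a sample would skip Nevada about half the time.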


The difficulties of upper semiprobability sampling do not lie here; instead they lie in the secret and insidious biases due to selectivity within segments. Our sampling of names from the Manhattan telephone directory might conceivably be drawn by listing the numbers called by subscribers on a certain exchange during a certain time, and then taking into the sample names from each exchange in proportion to the names listed for the exchange. The result would be an upper semiprobability sample with substantial selectivity within the segments, which here are exchanges. The nature of this selectivity would depend on the time of day at which the listing was made. Whether all segments are represented in an upper semiprobability sample or not, the segments may be used as strata for adjustment. The situation is exactly similar to that for probability sampling. The only difficulty worthy of note is the difficulty of assessing the stability of the various segment means. Independent interpenetrating replicate subsamples can be used to estimate stabilities of over-all or segment means in upper semiprobability samples without difficulty, if we can obtain a reasonable facsimile of independence in taking the different subsamples. They provide, if really independent, respectable bases for inference from sample to sampled population. We still have a nonprobability sample, however, and there is no reason for the sampled population to agree with the target population. The problem is just reduced to “What was the sampled population?” What finally is the situation with regard to bias in an upper semiprobability sample? We shall have a weighted mean or an adjusted one. In either case, any bias originally contributed by selectivity between segments will have been substantially removed. But, in either case, the contribution to bias due to selectivity within segments will remain unchanged. This is an unknown and hence additionally dangerous, sort of bias. 
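The use of independent interpenetrating replicates to assess stability, mentioned above, can be sketched as follows. The sketch assumes k equally sized replicates whose (weighted or adjusted) means have already been computed; the function name and the four replicate means are hypothetical. As the text stresses, the resulting figure describes stability with respect to the sampled population only and says nothing about bias within segments.

```python
import math
import statistics

def replicate_summary(replicate_means):
    """Overall estimate and its standard error from k independent replicate means."""
    k = len(replicate_means)
    if k < 2:
        raise ValueError("need at least two independent replicates")
    overall = statistics.fmean(replicate_means)
    # The spread among replicate means, divided by sqrt(k), estimates the
    # stability (standard error) of the overall mean.
    standard_error = statistics.stdev(replicate_means) / math.sqrt(k)
    return overall, standard_error

# Hypothetical means from four independently drawn replicates.
overall, se = replicate_summary([0.42, 0.46, 0.40, 0.44])
```

If the replicates are not really independent, for instance because the same interviewer habits run through all of them, the standard error computed this way will tend to understate the true instability.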
The great danger in weighting or adjusting such samples is not so much that the weighting or adjusting may make the results worse (as it will from time to time) but rather that its use may cause the user to feel that his values are excellent because they are “weighted” or “adjusted” and hence to neglect possible or likely biases within segments. Like all other nonprobability sample results, weighted means from upper semiprobability samples should be presented and interpreted with caution.

9. Salvage of unplanned samples

What can we do for such samples? We can either try to improve the results of their analysis, or try to inquire how good they are anyway. We may try to improve either actual quality, or our belief in that quality. The first has to be by way of manner of weighting or adjustment, the second must involve checking sample characteristics against population characteristics.

Weighting is impossible, since we cannot construct a sampling plan and hence cannot estimate chances of entering the sample in any other manner than by observing the sample itself. So all that we can do under this head is


to adjust. We recall the salient points about adjustment, which are the same in a complete salvage operation as they are in any other situation: (1) The population is divided into segments. (2) Each individual in the sample can be uniquely assigned to a segment. (3) The population fraction is either known with inappreciable error or estimated with known stability. (4) The procedures of Chapter V-C of Appendix C of the complete report are applied to determine whether, or how much, to adjust. After adjustment, what is the situation as to bias? Even worse than with upper semiprobability sampling, because if we do not adjust, we cannot escape bias by turning to weighting. In summary (1) whether adjusted or not, the result contains all the effects of all the selectivity exercised within segments, while (2) if adjustment is refused by the methods of Chapter V-C, we face additional biases resulting from selectivity between segments of a magnitude comparable with the difference between unadjusted and adjusted mean. This is, to put it mildly, not a good situation. Clearly even more caution is needed in presenting and interpreting the results of a salvage operation on an unplanned sample than for any of the other types of sample discussed previously. (If it were not for the psychological danger that adjustment might be regarded as cure, the caution required for results based on the original, unadjusted, unplanned sample would, however, be considerably greater.) Having adjusted or not as seems best, what else can we do? Only something to make ourselves feel better about the sample. Some other characteristic than that under study can sometimes be compared in the adjusted sample and in the population. A large difference is evidence of substantial bias within segments. Good agreement is comforting, and strengthens the believability of the adjusted mean for the characteristic of interest. 
The amount of this strengthening depends very much on the a priori relation between the two characteristics. Some would say that an unplanned sample does not deserve adjustment, but the discussion in Part II indicates that if any sort of a summary is to be made, it might as well, in principle, be an adjusted mean.

II. Systematic Errors

In order to understand how systematic errors in sampling should be treated, it seems both necessary and desirable to fall back on the analogy with the treatment of systematic errors in measurement. No clear account of the situation for sampling seems to be available in the literature, although understanding of the issues is a prerequisite to the critical assessment of nonprobability samples. On the other hand, one of physical science’s greatest and more recurrent problems is the treatment of systematic errors.


10. The Presence of Systematic Errors

Almost any sort of inquiry that is general and not particular involves both sampling and measurement, whether its aim is to measure the heat conductivity of copper, the uranium content of a hill, the visual acuity of high school boys, the social significance of television or the sexual behavior of the (white) human (U.S.) male. Further, both the measurement and the sampling will be imperfect in almost every case. We can define away either imperfection in certain cases. But the resulting appearance of perfection is usually only an illusion.

We can define the thermal conductivity of a metal as the average value of the measurements made with a particular sort of apparatus, calibrated and operated in a specified way. If the average is properly specified, then there is no “systematic” error of measurement. Yet even the most operational of physicists would give up this definition when presented with a new type of apparatus, which standard physical theory demonstrated to be less susceptible to error.

We can relate the result of a sampling operation to “the result that would have been obtained if the same persons had applied the same methods to the whole population.” But we want to know about the population and not about what we would find by certain methods. In almost all cases, applying the method to the “whole” population would miss certain persons and units.

Recognizing the inevitability of (systematic) error in both measurement and sampling, what are we to do? Clearly, attempt to hold the combined effect of the systematic errors down to a reasonable value. What is reasonable? This must depend on the cost of further reduction and the value of accurate results. How do we know that our systematic errors have been reduced sufficiently? We don’t! (And neither does the physicist!)
We use all the subject-matter knowledge, information and semi-information that we have—we combine it with whatever internal evidence of consistency it seems worthwhile to arrange for the observations to provide. The result is not foolproof. We may learn new things and do better later, but who expects the last words on any subject? In 1905, a physicist measuring the thermal conductivity of copper would have faced, unknowingly, a very small systematic error due to the heating of his equipment and sample by the absorption of cosmic rays, then unknown to physics. In early 1946, an opinion poller, studying Japanese opinion as to who won the war, would have faced a very small systematic error due to the neglect of the 17 Japanese holdouts, who were discovered later north of Saipan. These cases are entirely parallel. Social, biological and physical scientists all need to remember that they have the same problems, the main difference being the decimal place in which they appear. If we admit the presence of systematic errors in essentially every case, what then distinguishes good inquiry from bad? Some reasonable criteria would seem to be:


(1) Reduction of exposure to systematic errors from either measurement or sampling to a level of unimportance, if possible and economically feasible, otherwise
(1+) Balancing the assignment of available resources to reduction in systematic or variable errors in either measurement or sampling reasonably well, in order to obtain a reasonable amount of information for the “money.”
(2) Careful consideration of possible sources of error and careful examination of the numerical results.
(3) Presentation of results and inferences in a manner which adequately points out both observed variability and conjectured exposure to systematic error.

In many situations it is easy, and relatively inexpensive, to reduce the systematic errors in sampling to practical unimportance. This is done by using a probability sampling plan, where the chance that any individual or other primary unit shall enter the sample is known, and allowed for, and where adequate randomness is ensured by some scheme of (mechanical) randomization. The systematic errors of such a sample are minimal, and frequently consist of such items as:

(a) failure of individuals or primary units to appear on the “list” from which selection has been made,
(b) persons perennially “not at home” or samples “lost,”
(c) refusals to answer or breakdowns in the measuring device.

These are the hard core of causes of systematic error in sampling. Fortunately, in many situations their effect is small; there a probability sample will remove almost all the systematic error due to sampling.

11. Should a probability sample be taken?

But this does not mean that it is always good policy to take probability samples. The inquirer may not be able to “afford” the cost in time or money for a probability sample.
The opinion pollers do not usually afford a probability sample (instead of designating individuals to be interviewed by a random, mechanical process, they allow their interviewers to select respondents to fill “quotas”) and many have criticized them for this. Yet the behavior of the few probability samples in the 1948 election (see pp. 110–112 of The Preelection Polls of 1948, Social Science Research Council Report No. 60) does not make it clear that the opinion pollers should spend their limited resources on probability samples for best results. (Shifts toward a probability sample have been promised, and seem likely to be wise.) The statement “he didn’t use a probability sample” is thus not a criticism which should end further discussion and doom the inquiry to the cellar. It is always necessary to ask two questions: (a) Could the inquirer afford a probability sample?


(b) Is the exposure to systematic error from a non-probability sample small enough to be borne? If the answer is “no” to both, then the inquiry should not be, or have been, made—just as would be the case with a physical inquiry if the systematic errors of all the forms of measurement which the physicist could afford were unbearably large. If the answer is “yes” to the first question and “no” to the second, then the failure to use a probability sample is very serious, indeed. If the answer is “yes” to both, then careful consideration of the economic balance is required—however it should be incumbent on the inquirer using a nonprobability sample to show why it gave more information per dollar or per year. (As statisticians, we feel that the onus is on the user of the nonprobability sample. Offhand we know of no expert group who would wish to lift it from his shoulders.) If the answer is “no” to the first question, and “yes” to the second, then the appropriate reaction would seem to be “lucky man.” Having admitted that the sampling, as well as the measurement, will have some systematic errors, how then do we do our best to make good inferences about the subject of inquiry? Sampling and measurement being on the same footing, we have only to copy, for the sampling area, the procedure which is well established and relatively well understood for measurement. This procedure runs about as follows: We admit the existence of systematic error—of a difference between the quantity measured (the measured quantity) and the quantity of interest (the target quantity). We ask the observations about the measured quantity. We ask our subject matter knowledge, intuition, and general information about the relation between the measured quantity and the target quantity. We can repeat this nearly verbatim for sampling: We admit the existence of systematic error—of a difference between the population sampled (the sampled population) and the population of interest (the target population). 
We ask the observations about the sampled population. We ask our subject matter knowledge, intuition, and general information about the relation between sampled population and target population. Notice that the measured quantity is not the raw readings, which usually define a different measured quantity, but rather the adjusted values resulting from all the standard corrections appropriate to the method of measurement. (Not the actual gas volume, but the gas volume at standard conditions!) Similarly, the result for the sampled population is not the raw mean of the observations, which usually defines a different sampled population, but rather the adjusted or weighted mean, all corrections, weightings and the like appropriate to the method of sampling having been applied. Weighting a sample appropriately is no more fudging the data than is correcting a gas volume for barometric pressure.


The third great virtue of probability sampling is the relative definiteness of the sampled population. It is usually possible to point the finger at most of the groups in the target population who have no chance to enter the sample, who therefore were not in the sampled population; and to point the finger at many of the groups whose chance of entering the sample was less than or more than the chance allotted to them in the computation, who therefore were fractionally or multiply represented in the sampled population. When a nonprobability sample is adjusted and weighted to the best of an expert’s ability, on the other hand, it may still be very difficult to say what the sampled population really is. (Selectivity within segments cannot be allowed for by weights or adjustments, but it arises to some extent in every nonprobability sample and alters the sampled population.)

12. The value and conditions of adjustment

Some would say that correcting, adjusting and weighting most nonprobability samples is a waste of time, since you do not know, when this process has been completed, to what sampled population the adjusted result refers. This is entirely equivalent to saying that it does not pay to adjust the result of a physical measurement for a known systematic error because there are, undoubtedly, other systematic errors and some of them are likely to be in the other direction. Let us inquire into good practice in the measurement situation, and see what guidance it gives us for the sampling situation.

When will the physicist adjust in principle for the known systematic error? When (i) he has the necessary information and (ii) the adjustment is likely to help. The necessary information includes a theory or empirical formula, and the necessary observations. Empirical formulas and observations are subject to fluctuations, so that adjustment will usually change the magnitude of fluctuations as well as altering the systematic error.
The adjustment is likely to help unless the supposed reduction of systematic error coincides with a substantial increase in fluctuations. If the known systematic error is so small as not to (1) affect the result by a meaningful amount, or (2) affect the result by an amount likely to be as large as, or a substantial fraction of, the unknown systematic errors, then the physicist will report either the adjusted or the unadjusted value. If he reports the unadjusted value, he should state that the adjustment has been examined, and is less than such-and-so. To do this, either he must have calculated the adjustment or he must have had generally applicable and strong evidence that it is small. In any event, his main care, which he will not always take, must be to warn the reader about the dangers of further systematic errors, perhaps, in some cases, even by saying bluntly that “the adjusted value isn’t much better than the raw value,” and then provide raw values for those who wish to adjust their own.


William G. Cochran, Frederick Mosteller, and John W. Tukey

If the physicist is aware of systematic errors of serious magnitude and has no basis for adjustment, his practice is to name the measured quantity something, like Brinnell hardness, Charpy impact strength, or if he is a chemist—iodine value, heavy metals as Pb, etc. By analogy, those who feel that the combination of recall and interview technique make Kinsey’s results subject to great systematic error might well define “KPM sexual behavior” as a standard term,2 and work with this. By analogy then, when should a nonprobability sample be adjusted in principle? (Most probability samples are made to be weighted anyway—this is part of the design and must be carried out.) When (i) we have the necessary information and (ii) when the adjustment is likely to help. The necessary information will usually consist of facts or estimates of the true fractions in the population of the various segments. When is the adjustment likely to help? This problem has usually been a ticklish point requiring technical knowledge and intuition. A quantitative solution is now given in Chapter V-C of Appendix C in the complete report. With this as a guide, it should be possible to make reasonable decisions about the helpfulness of adjustment. If the decision is to adjust, we should accept the sampled population corresponding to the adjusted mean, and calculate the adjustment. We then report the adjusted value, unless the adjustment is small, when we may report the unadjusted value with the statement that the adjustment alters it by less than such-and-so. Our main care, which we may not always take, must be to warn the reader about the dangers of further lack of representativeness, perhaps, in some cases, even by saying bluntly that “the adjusted mean isn’t much better than the raw mean, even if we took 20 pages to tell you how we did it and six months to do it,” and to provide raw means for those who wish to adjust their own. 
If we were prepared to report an unadjusted mean, we were clearly inviting inference to some sampled population. Adjustment will give us a sampled population that is usually nearer to the target population. Hence we should adjust. If we cannot adjust, and must present raw data which we feel badly needs adjustment, we may say that this is what we found in these cases—take ’em or leave ’em. Except from the point of view of protecting the reader from overbelief in the results, this would seem to be a counsel of despair. By analogy with the physicist, it seems better to introduce “KPM sexual behavior” and its analogs in such situations.

2 The letters KPM stand for Kinsey, Pomeroy and Martin, the authors of Sexual Behavior in the Human Male.

Reprinted from Proceedings of the American Philosophical Society (1958), 102, pp. 53–59

14. Stochastic Models for the Learning Process Frederick Mosteller Professor of Mathematical Statistics, Harvard University (Read April 26, 1957)

Introduction

Since 1949 psychologists have increasingly used mathematics in the study of the learning process. Probability theory has been the keynote in these applications, and thus there is considerable unity in the published work. The applications described here are typical of those made by W. K. Estes, C. J. Burke, and their students [2, 3, 11, 12, 13, 14, 15, 16, 17, 18, 19], by G. A. Miller, F. C. Frick, and W. J. McGill [25, 26], by F. Restle [27, 28], and by R. R. Bush and F. Mosteller [1, 4, 5, 6, 7, 8, 9, 10, 23]. The purpose of this paper is to give the flavor of these new developments through a few examples, rather than to survey the literature. Three experiments illustrate the use of the new mathematics to provide: (1) a summary description of the course of learning in an experiment; (2) a qualitative distinction between two theoretical positions; (3) a test of the correspondence between elements in one theory and those in the physical world. We also mention a new mathematical problem, interesting in its own right, that has arisen from these applications.

Probability and learning

Let us suppose that to learn a list of words you read through the list and then recite those words that you recall. With successive readings and recitations, the number recalled correctly increases, and, if the list is short, you ultimately learn all the words. Early writers on mathematical methods in the psychology of learning described the course of such learning by finding a mathematical curve whose shape is appropriate to this general improvement in recall. Such curves were invented by many authors. More generally, for a variety of experiments, mathematical functions were sought to describe how some measure of performance of a task increased with some measure of practice. Early favorites among the functions were the hyperbola, the exponential, and the arc cotangent. The main features of these curves are that they rise


monotonically with increased practice and that they tend to an asymptote or ceiling corresponding to the best possible performance. Let us focus attention on the monotonically increasing character of these curves. The course of learning, like that of true love, does not run quite so smoothly. You will find in learning a long list of words that on some trials you do not remember words that you recalled on earlier trials and, worse yet, that on some late trials you do not recall as many words as you did on an earlier trial. This erratic character of learning in turn suggests that a better description of the process might be achieved by the use of probabilistic, statistical, or random processes rather than by deterministic curves. In present-day mathematics, such descriptions are often called stochastic models. The word “stochastic” is now almost synonymous with “probabilistic” or “random,” but usually implies that a time variable is present. In the learning of lists, the successive readings and recitations are thought of as a sequence in time. The word “model” in applied mathematics has come to mean a mathematical description of certain aspects of a physical process rather than a small physical replica of the real thing. Thus one might speak of the Mendelian model for inheritance, or the binomial model for coin tosses.

Thurstone [31] seems to have developed the first of these stochastic models for learning in 1930, but he used it only as a vehicle to get a deterministic curve. Gulliksen and Wolfle [21] developed modified versions of such learning curves. These curves were designed to describe average performance as it depended upon the number of practice units. The stochastic models developed since 1949 are designed to describe the responses made by subjects in simple repetitive experiments. The subject of the experiment receives a stimulus, he makes one of a number of responses, and some outcome of this response occurs, perhaps reward or shock.
It is assumed that at the start of a trial each possible response has its own probability of occurring. It is assumed further that the event that occurs during the trial changes the probabilities of these responses for the next trial. The mathematical counterpart of the event is a mathematical operator which adjusts the probabilities in a predetermined manner. Usually an event consists of the response of the subject together with the outcome of the trial, such as reward or punishment. Thus from the point of view of the model, the learning process consists of the changing probabilities of the responses and the rules that change them. These changes are reflected in the data by the changing frequencies of the responses through time. Such a probability process generates mathematically an erratic sequence of “responses” like those that occur in the successive recalls of lists of words. These models have been used to describe experiments in reward training including partial reinforcement, rote learning, discrimination, spontaneous recovery, avoidance training, time and rate problems, and experimental extinction. We turn now to some specific experiments.

14 Stochastic Learning Models


An escape-avoidance experiment

One use of stochastic models is to provide a summary description of the learning process. For example, Solomon and Wynne [30], in an escape-avoidance experiment with dogs as subjects, placed the dog in one side of a symmetrical box divided by a moveable barrier. At the start of a trial, the barrier was raised leaving the dog free to jump to the other side across a shoulder-high fence. If the dog did not jump within 10 seconds, he received a shock and thereupon usually jumped to the other side. Such a trial is called an “escape.” If the dog jumped before the 10 seconds were up, he is said to have “avoided.” Thus the two responses are “escape” and “avoidance.” All normal dogs tested learned to avoid almost perfectly. The question arises, how does the probability of escape change as the trials continue? In this experiment the event changing the probability is assumed to be in perfect correspondence with the response of the animal.

Bush and Mosteller [7] assume that the probabilities of the responses, escape and avoidance, are p and 1 − p, respectively, at some given time in the course of the experiment. They assume further that on the next trial either response will reduce the probability of escape. The reduction takes the form of multiplication by a factor α1 for escape, α2 for avoidance. Both α1 and α2 are assumed to have values between 0 and 1. These α’s measure the slowness of learning, and in this experiment their values were estimated as α1 = 0.92, α2 = 0.80. Thus if the present probability of escape is p = 0.4, an escape on the next trial would reduce this probability of escape to α1 p = (0.92)(0.4) = 0.368, whereas an avoidance would reduce it to α2 p = (0.80)(0.4) = 0.32. In other words, an escape reduces the probability of escape by 8 per cent (100 − 92 = 8), and an avoidance reduces the probability of escape by 20 per cent. The values 8 and 20 per cent can be thought of as the speed of learning corresponding to the two events.
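The two multiplicative operators described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the text: the function name `update_escape_probability` and the "E"/"A" response codes are assumptions introduced here, not notation from Bush and Mosteller.

```python
ALPHA_ESCAPE = 0.92  # alpha_1 from the text: multiplier applied after an escape
ALPHA_AVOID = 0.80   # alpha_2 from the text: multiplier applied after an avoidance


def update_escape_probability(p, response):
    """Return the probability of escape on the next trial.

    `response` is "E" (escape) or "A" (avoidance); both events reduce p.
    The function name and the response codes are illustrative choices.
    """
    if response == "E":
        return ALPHA_ESCAPE * p
    if response == "A":
        return ALPHA_AVOID * p
    raise ValueError("response must be 'E' or 'A'")


p = 0.4
print(round(update_escape_probability(p, "E"), 3))  # probability after an escape
print(round(update_escape_probability(p, "A"), 3))  # probability after an avoidance
```

Starting from p = 0.4, the two calls reproduce the values 0.368 and 0.32 worked out in the text.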
The particular form these changes in probability are assumed to take in this experiment is a specialization of the more general operators used by Bush and Mosteller. Initially, the dogs almost never avoided (one avoidance in 300 pretraining trials); therefore the initial probability of escape could be taken as very close to unity. Using p = 1 as the initial probability, the model states that after a avoidances and b escapes the probability of escape is given by α1^b α2^a. Thus as soon as α1 and α2 are measured, one can estimate, for dogs with a given history of escapes and avoidances, the fraction that will escape on the next trial. And, of course, many other predictions can be made. In this experiment the values of the parameters themselves give special information. Of course, we note that the learning rate is faster for avoidance than for escape. But, more important, we can find out how many escape trials are required to change the probability by the same amount as one avoidance trial. Since (0.92)^2.7 is approximately 0.80, the answer is that 2.7 escape trials are roughly equivalent to one avoidance trial. It should, of course, be understood that this and other calculations are made within the framework of this model with its special assumptions. The validity of the interpretation depends on how well the whole model describes the process.
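The figure of 2.7 escape trials per avoidance trial can be checked by asking what power x of α1 equals α2. The short derivation below is a worked restatement of that step, using only the fitted values quoted in the text; the symbol x is introduced here for illustration.

```latex
\[
\alpha_1^{\,x} = \alpha_2
\quad\Longrightarrow\quad
x = \frac{\ln \alpha_2}{\ln \alpha_1}
  = \frac{\ln 0.80}{\ln 0.92}
  \approx \frac{-0.2231}{-0.0834}
  \approx 2.7 .
\]
```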

Fig. 1. Branching diagram for the first few trials of the Solomon-Wynne escape-avoidance experiment. The first number at each intersection is observed, the second is the corresponding computed number. The letter E stands for escape, A for avoidance.

Figure 1 is a branching diagram showing the results for 30 dogs for the first few trials of the Solomon-Wynne experiment. The number before the semicolon at each point of the branching diagram gives the number of dogs that have arrived at that position; the number after the semicolon is computed on the basis of the values of the parameters already given. Thus 30 dogs started the experiment and all escaped (E) on the first trial. Assuming that all dogs had the same α1 = 0.92, approximately, their new probability of E is 0.92. (The experimenters adjusted the shock and the height of the barrier for each dog separately in an effort to equate the situation from dog to dog.) The mathematical expectation of the number of dogs escaping on the next trial is (0.92)(30) = 27.6, and 27 dogs actually escaped; the other 3 avoided. After the third trial there are three kinds of dogs: those that escaped thrice, twice, or once. The only new kind of calculation is that for those escaping twice. These 5 dogs are composed of the 3 who avoided for the first time on the third trial and the 2 who escaped after a previous avoidance. The theoretical calculation is (27.6)(1 − (0.92)^2) + (2.4)(0.92)(0.80), or about 6.0, where 27.6 and 2.4 are the expected numbers of dogs escaping and avoiding on the second trial. (Thus the theoretical calculation was done entirely on the basis of the original parameter values and does not use the actual outcomes for the dogs on intermediate trials.) It will readily be noted that the observed and computed frequencies in figure 1 are quite close, closer, if anything, than ordinary sampling variation would lead one to expect.

Once the three numbers are given (the initial probability and the two learning parameters), we can compute statistics other than those given in figure 1 to see how well they fit the data. While some of these new statistics involve routine probability calculations, others are quite complicated, and it


is convenient to run a number of stat-dogs using random number tables to evaluate them. Given the values of the parameters and the method of changing the probabilities, we can carry out a mock experiment with the aid of a random number table. Thus our first stat-dog has probability 1 of escaping on the first trial. This escape reduces his probability of escaping to 0.92. We draw a two-digit random number from a table in which the 100 numbers 00, 01, ..., 99 are equally likely. If the number is less than 92 (i.e., 00, 01, ..., 91), we say the stat-dog escaped on the second trial, otherwise (92, 93, ..., 99) that he avoided. Now we adjust his probability appropriately and continue the process. Thus the stat-dog is carried through the “experiment,” and he produces a succession of E’s and A’s just as the real dogs do. By carrying many such stat-dogs through the process, we can generate artificial data that indicate the properties of the mathematical model. These artificial data can be compared with the real data. In this manner questions that are too complicated for theoretical calculation can still be answered. The technique is an application of the Monte Carlo method, which is widely used by physicists and statisticians.

Many such statistics are compared in table 1. Not all of these measures are of psychological interest; rather the table illustrates the point that the three basic parameters supply satisfactory answers to a wealth of questions about the sequences that occur in such an experiment. The standard deviations provided in table 1 could be used to test the difference of the means obtained in the experiment and the pseudoexperiment, but they are given for a different reason. The mathematical model is supposed to predict not only the mean value of the statistics listed in table 1 but also the distribution of values for many dogs.
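The stat-dog procedure lends itself to a short simulation. The sketch below is a hedged illustration, not the original computation: it replaces the two-digit random number table with Python's pseudo-random generator (drawing a uniform number and comparing it with the current probability), and the names `run_stat_dog`, `trials_before_first_avoidance`, and `summarize` are assumptions made for this example.

```python
import random

ALPHA_ESCAPE = 0.92  # alpha_1 from the text
ALPHA_AVOID = 0.80   # alpha_2 from the text


def run_stat_dog(n_trials, rng):
    """Carry one stat-dog through n_trials; return a string of E's and A's."""
    p_escape = 1.0  # initial probability of escape taken as 1, as in the text
    responses = []
    for _ in range(n_trials):
        # A uniform draw stands in for the two-digit random number table.
        if rng.random() < p_escape:
            responses.append("E")
            p_escape *= ALPHA_ESCAPE
        else:
            responses.append("A")
            p_escape *= ALPHA_AVOID
    return "".join(responses)


def trials_before_first_avoidance(seq):
    """Number of trials preceding the first 'A' (len(seq) if there is none)."""
    idx = seq.find("A")
    return len(seq) if idx == -1 else idx


def summarize(n_dogs=30, n_trials=25, seed=0):
    """Run n_dogs stat-dogs and return (sequences, two sample means)."""
    rng = random.Random(seed)
    dogs = [run_stat_dog(n_trials, rng) for _ in range(n_dogs)]
    mean_first = sum(trials_before_first_avoidance(d) for d in dogs) / n_dogs
    mean_shocks = sum(d.count("E") for d in dogs) / n_dogs  # each escape is a shock
    return dogs, mean_first, mean_shocks


dogs, mean_first, mean_shocks = summarize()
print(dogs[0])
print("mean trials before first avoidance:", round(mean_first, 2))
print("mean total shocks:", round(mean_shocks, 2))
```

Because the draws are random, the two means will differ from run to run and from the stat-dog column of table 1; changing `seed` gives a fresh pseudoexperiment.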
Rather than provide the whole distribution, we supply the standard deviations so that the variability of the stat-dogs can be compared with that of the real dogs. This supplementary comparison shows that the stat-dogs are slightly less variable, generally, than the real dogs. Table 1 illustrates an important difference between stochastic models and the earlier deterministic ones. After fitting the parameters, the deterministic models yield a “learning curve,” which in this experiment would estimate the fraction of escapes on each trial. The stochastic model can do the same. But the deterministic model had no answer to such questions as, what is the average length of the longest run of escapes, or what fraction of those who have avoided once and escaped twice will avoid next time, whereas the stochastic model does. Most acquisition curves have much the same shape, and it is a shape that a good many mathematical curves in common use can fit quite well. Consequently, a good fit to the learning curve cannot give much support to the theory leading to it. If, however, one can use the fitted parameters to forecast accurately a variety of other statistics which have not been fitted directly but are consequences of the mathematical process, the results are


rather more satisfying and informative. The statistics given in table 1 are of this type.

TABLE 1
Comparisons of the Stat-dog “Data” and the Solomon-Wynne Data for 30 Dogs, Each Run Through 25 Trials

                                                Stat-dogs          Real dogs
                                               Mean    S.D.*      Mean    S.D.*
Trials before first avoidance                  4.13    2.08       4.50    2.25
Trials before second avoidance                 6.20    2.06       6.47    2.62
Total shocks                                   7.60    2.27       7.80    2.52
Trial of last shock                           12.53    4.78      11.33    4.36
Alternations                                   5.87    2.11       5.47    2.72
Longest run of shocks                          4.33    1.89       4.73    2.03
Trials before first run of four avoidances     9.47    3.48       9.70    4.14

* To obtain standard deviation of the mean, divide by √30.
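The footnote's rule, and the remark that the standard deviations could be used to test differences between the two sets of means, can be made concrete as follows. This is a sketch under stated assumptions: it treats the stat-dog and real-dog groups as independent samples of 30 and forms a simple ratio of the difference to its standard error, which is one conventional choice and not a procedure given in the paper. The names `standard_error` and `z_values` are illustrative.

```python
import math

N_DOGS = 30  # dogs (real or simulated) behind each mean in table 1

# (statistic, stat-dog mean, stat-dog S.D., real-dog mean, real-dog S.D.)
TABLE_1 = [
    ("Trials before first avoidance", 4.13, 2.08, 4.50, 2.25),
    ("Trials before second avoidance", 6.20, 2.06, 6.47, 2.62),
    ("Total shocks", 7.60, 2.27, 7.80, 2.52),
    ("Trial of last shock", 12.53, 4.78, 11.33, 4.36),
    ("Alternations", 5.87, 2.11, 5.47, 2.72),
    ("Longest run of shocks", 4.33, 1.89, 4.73, 2.03),
    ("Trials before first run of four avoidances", 9.47, 3.48, 9.70, 4.14),
]


def standard_error(sd, n=N_DOGS):
    """Standard deviation of a mean of n observations: sd / sqrt(n)."""
    return sd / math.sqrt(n)


z_values = []
for name, m_stat, sd_stat, m_real, sd_real in TABLE_1:
    # Standard error of a difference of two independent means (an assumption).
    se_diff = math.hypot(standard_error(sd_stat), standard_error(sd_real))
    z = (m_real - m_stat) / se_diff
    z_values.append(z)
    print(f"{name:44s} difference {m_real - m_stat:+.2f}   ratio {z:+.2f}")
```

Under these assumptions every ratio is smaller than 2 in absolute value, which is in line with the close agreement between the stat-dogs and the real dogs described in the text.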

This example illustrates some of the kinds of calculations that are involved in stochastic models. It shows how a particular model fits in one experiment, and especially it displays the use of the model to describe the fine-grained structure of the data over and above the mean performance curves provided by earlier models. The close agreement between data and theory indicates that the model describes the data quite well. Thus, the model plus the parameter values summarizes the data rather completely. And, incidentally, the example illustrates the Monte Carlo method.

An experiment with paradise fish

In an experiment with paradise fish, by Bush and Wilson [10], the fish had two choices: to swim to the right-hand side or to the left-hand side of the far end of a tank after the starting gate was raised. One of these sides, the favorable side, gave the fish a reward, caviar, 75 per cent of the times he chose it; the other side gave the reward only 25 per cent of the times he chose it. Thus, from the point of view of the model there are four possible events: right-reward, right-nonreward, left-reward, left-nonreward. It would be generally thought that being rewarded on a given side would improve the probability that that side was chosen on the next trial. But about nonrewarded trials, the reasoning is not so clear. An extinction or information theory would suggest a reduction in the probability of going to an unrewarded side on the next trial, but a theory based on habit formation or secondary reinforcement would suggest that merely going to a side would make that side more likely to be chosen on the next trial.


The mathematical operators for the four events in this experiment would differ according to which of these formulations one adopted.1 Let us assume that the probability of choosing the right-hand side of the tank is p at some stage of the experiment. Then if the fish chooses the right-hand side and is rewarded, his probability of choosing the right-hand side on the following trial is increased. Bush and Mosteller assume that the new probability of choosing the right-hand side has the form α1 p + 1 − α1 . (See table 2.) As before, α1 is the learning parameter appropriate to this particular outcome, and α1 is between 0 and 1. If p = 0.4 and α1 = 0.8, the new probability is (0.8)(0.4) + 1 − 0.8 = 0.52. It is assumed that if the left side is chosen and rewarded, the new probability of turning right is smaller. Furthermore, from the symmetry of the experiment it is assumed that the rate of learning on the left is the same as that on the right. It turns out that α1 p makes the proper reduction in the probability of turning right. (The algebraic asymmetry between α1 p and α1 p+1−α1 comes from the fact that we discuss the problem from the point of view of the probability of turning right, rather than from the point of view of the effect on the probability of the side just chosen.) When we consider non-reinforcement on the right-hand side, the model for a theory of extinction suggests a reduction in the probability of choosing the right-hand side on the next trial (α2 p). A theory of habit formation or secondary reinforcement suggests that an increase, no doubt smaller than that for reward, in the probability will occur (α2 p + 1 − α2 ). These possibilities are listed in table 2. These two models make quite different forecasts about the long-run behavior of the animals. The reinforcement-extinction model implies that the animals never stabilize on a side. 
The reason is that if a very high probability of choosing the right-hand side is achieved, nonrewards there will reduce the probability, and ultimately the animal will switch to the left. Nonrewards are sure to occur because the fish is rewarded on only 75 per cent of the choices of the favorable side. A similar argument shows that he cannot stabilize on the unfavorable side according to this model. On the other hand, the habit-formation model implies that the animal will stabilize on one side or the other, but, surprisingly enough, states that some animals will stabilize on the favorable side and some on the unfavorable side. Again the idea is simple. Whether rewarded or not, going to a side increases its probability. Once a high probability is achieved for a side, the animal is very likely to go there, and going there makes it the more probable that he will go there again. (A somewhat technical argument is required to prove that all organisms are ultimately absorbed by one side or the other.)

1 They would depend, too, on whose mathematical analysis one adopted. In this paper we present some of the spirit of the applications of probability theory to experiments in learning. A much more technical discussion would be required to compare different mathematical theories for the same experiment. Here we choose one form for the operators and use this same form to describe two different psychological positions. It is not implied that this is the only mathematical form that could be used.

TABLE 2
Operators for Choice Experiments

Operators for reinforcement-extinction model (p = Prob(right)):

                       Left                Right
Reinforcement          α1 p                α1 p + 1 − α1
Non-reinforcement      α2 p + 1 − α2       α2 p

Operators for habit-formation model:

                       Left                Right
Reinforcement          α1 p                α1 p + 1 − α1
Non-reinforcement      α2 p                α2 p + 1 − α2

In the paradise fish experiment, there were two conditions: (1) with opaque divider between the two goal boxes, (2) with transparent divider. The opaque divider prevented the fish from seeing the goal box he did not choose, but the transparent one permitted the fish to see the food placed in the other goal box if that one was to be rewarded in the given trial. There was behavioral evidence that the fish was usually aware of the food in the other goal box when he could see but not obtain it. A tabulation was made of the number of trials on which each fish turned to the favorable side in the last 49 of its total of 140 trials. This tabulation is shown in table 3. Clearly, most of the fish go nearly all the time to one side. Ten have almost all their late trials on the favorable side, 4 have almost all on the unfavorable side. And generally, the clustering is at the extremes. This result is in better agreement with the habit-formation model than with the reinforcement-extinction model, because the latter predicts that the fish will cluster around the value of 37 trials to the favorable side, contrary to the data which cluster at one end or the other of the distribution. The stat-fish figures of table 3 are the results of Monte Carlo runs using the habit-formation model. The parameters (α1 , α2 ) have been fitted to those of the fish run with transparent divider. No stat-fish were run to compare with the opaque divider group. The treatment of this experiment illustrates the use of stochastic models to describe different theoretical positions and to make qualitative distinctions between them.
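The contrast between the two sets of operators in table 2 can be explored with a small Monte Carlo sketch. Several things here are assumptions made only for illustration: the paper does not report the fitted values of α1 and α2, so both are set to a hypothetical 0.8; the right-hand side is taken as the favorable side; every stat-fish starts at p = 0.5; and the function names are invented for this example.

```python
import random

P_REWARD_FAVORABLE = 0.75    # reward probability on the favorable (right) side
P_REWARD_UNFAVORABLE = 0.25  # reward probability on the other side


def step(p_right, a1, a2, model, rng):
    """One trial under table 2; returns (new p_right, True if the fish went right)."""
    went_right = rng.random() < p_right
    p_reward = P_REWARD_FAVORABLE if went_right else P_REWARD_UNFAVORABLE
    rewarded = rng.random() < p_reward
    if rewarded:
        alpha, toward_right = a1, went_right
    elif model == "habit":
        alpha, toward_right = a2, went_right       # nonreward still strengthens the side chosen
    else:  # "extinction"
        alpha, toward_right = a2, not went_right   # nonreward weakens the side chosen
    new_p = alpha * p_right + (1 - alpha) if toward_right else alpha * p_right
    return new_p, went_right


def favorable_in_last_49(a1, a2, model, rng, n_trials=140, p0=0.5):
    """Count choices of the favorable side in the last 49 of n_trials trials."""
    p = p0
    choices = []
    for _ in range(n_trials):
        p, went_right = step(p, a1, a2, model, rng)
        choices.append(went_right)
    return sum(choices[-49:])


def simulate_group(model, n_fish, a1=0.8, a2=0.8, seed=0):
    """Run n_fish stat-fish; a1 and a2 are hypothetical, not fitted, values."""
    rng = random.Random(seed)
    return [favorable_in_last_49(a1, a2, model, rng) for _ in range(n_fish)]


def extreme_fraction(counts):
    """Share of fish in the end bins of table 3 (0-4 or 45-49 favorable choices)."""
    return sum(c <= 4 or c >= 45 for c in counts) / len(counts)


for model in ("habit", "extinction"):
    counts = simulate_group(model, n_fish=200, seed=1)
    print(model, "fraction in end bins:", round(extreme_fraction(counts), 2))
```

With these illustrative settings the habit-formation stat-fish pile up in the end bins while the reinforcement-extinction stat-fish stay in the interior, which is the qualitative distinction the text draws from table 3.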


TABLE 3
Paradise Fish Experiment
Cell entry is number of fish with trials to the favorable side indicated in first column (last 49 of 140 trials)

Trials to          Real fish:             Stat-fish     Real fish:
favorable side     transparent divider                  opaque divider
0–4                      4                    4               1
5–9                      1                    2               0
10–14                    2                    0               0
15–19                    0                    0               1
20–24                    0                    0               2
25–29                    0                    1               0
30–34                    1                    0               2
35–39                    2                    2               3
40–44                    2                    3               7
45–49                   10                   10              11
Total fish              22                   22              27

A compound stimulus experiment

In deriving a form for the mathematical operators used to change the probabilities, Estes [11] assumes that the environment is composed of elementary elements. Each of these elements is assumed to be conditioned to or associated with one and only one of the responses. The subject is assumed to take a sample of the elements. His probability of giving a particular response is assumed to be identical with the fraction of elements in the sample that are conditioned to that response. Such a theory can be used to derive operators like those used by Bush and Mosteller [7].

Among the experiments performed to provide a test of Estes’ theory of conditioned elements is that of Schoeffler [29]. A group of 24 lights was randomly divided into three groups of 8 each. Subjects were taught to move a lever to the left when one set of 8 lights was flashed, to move it to the right when another set of 8 flashed (hereafter we call these the left and right sets, respectively). The third set of 8 lights was used only for testing purposes (we call these the neutral set). After subjects learned to respond perfectly to these two sets of lights, a composite set was flashed composed of some lights from two or all three sets: the left, right, and neutral sets. After each test trial, the subject was re-trained to discriminate between the original left and right sets of lights. The theory would suggest that the fraction of times the subject would respond “left” to the compound stimulus would be


    (l + ½ n) / (r + l + n),

where
l = number of lights in the test stimulus previously conditioned to left,
r = number of lights in the test stimulus previously conditioned to right,
n = number of lights in the test stimulus not previously presented in a training series.

In constructing this formula, it is a convenience to suppose that a light in the compound stimulus is an elementary element. However, a light might correspond to many elements, so we suppose that the lights each correspond to the same number of elements; that is, each light has the same weight in the formula. There is a tacit assumption that, for lights not presented in the training trials, half the elements are conditioned to left, half to right. There is a further tacit assumption that the lights in the stimulus set contributed all the elements in the environment (in some later experiments, Estes and Burke [18] have questioned this assumption). The results are shown in table 4. The predicted and observed fractions agree quite well except possibly for the pattern shown in the second line.

TABLE 4
Schoeffler’s Compound Stimulus Experiment

Test     l = number of   r = number of   n = number of    Theoretical fraction   Observed fraction*
trial    left lights     right lights    neutral lights   “left” responses       “left” responses
1             8               8               0                  .50                   .54
2             8               4               0                  .67                   .79
3             8               2               0                  .80                   .81
4             4               2               0                  .67                   .63
5             8               4               8                  .60                   .62
6             8               2               8                  .67                   .67
7             4               2               8                  .57                   .54
8             8               0               8                  .75                   .73
9             8               8               8                  .50                   .54

* Each based on 180 responses, one by each of 180 subjects.
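The theoretical column of table 4 can be recomputed directly from the formula, with each neutral light counted as half conditioned to left. The sketch below does this and lists the gap between prediction and observation for each test trial; the function name `predicted_left_fraction` and the tuple layout of `TABLE_4` are choices made for this illustration.

```python
def predicted_left_fraction(l, r, n):
    """Predicted fraction of 'left' responses: (l + n/2) / (r + l + n)."""
    return (l + 0.5 * n) / (r + l + n)


# (l, r, n, theoretical fraction, observed fraction) for test trials 1-9 of table 4
TABLE_4 = [
    (8, 8, 0, 0.50, 0.54),
    (8, 4, 0, 0.67, 0.79),
    (8, 2, 0, 0.80, 0.81),
    (4, 2, 0, 0.67, 0.63),
    (8, 4, 8, 0.60, 0.62),
    (8, 2, 8, 0.67, 0.67),
    (4, 2, 8, 0.57, 0.54),
    (8, 0, 8, 0.75, 0.73),
    (8, 8, 8, 0.50, 0.54),
]

for trial, (l, r, n, theory, observed) in enumerate(TABLE_4, start=1):
    pred = predicted_left_fraction(l, r, n)
    print(f"trial {trial}: predicted {pred:.2f}, table {theory:.2f}, "
          f"observed {observed:.2f}, gap {observed - pred:+.2f}")
```

The recomputed predictions agree with the tabled theoretical fractions to two decimals, and the largest gap falls on the second test trial, the pattern singled out in the text.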

It is not intended here to claim that other theories might not account for these same results, but rather to display an experiment suggested by the theory and to indicate the degree to which the results were in agreement with it.

A mathematical problem

If the habit-formation model is appropriate for the paradise fish experiment, what fraction of the fish will stabilize on the favorable side? This question is typical of a number of mathematical problems that have arisen in the mathematical study of learning. For a mathematical description of this problem we take a slightly simpler question. Suppose that there are two responses A1 and A2, and that the outcome of both is pleasant though not necessarily equally so. Suppose that the initial probabilities of the responses A1 and A2 are p0 and 1 − p0, respectively. If p is the probability of A1 on some trial and A1 is performed, the new probability of A1 is α1 p + 1 − α1, but if A2 is performed, the new probability of A1 is α2 p. It can be shown that an organism obeying this description will in the long run stop giving one of the responses and respond only with the other (with probability one). Now, given p0, α1, α2, what is the probability that the organism stops giving A2’s, that is, is absorbed by A1? Let us call this probability f(p0, α1, α2). After one trial the organism has a new probability α1 p0 + 1 − α1 (if A1 occurs) or α2 p0 (if A2 occurs) with probabilities p0 and 1 − p0, respectively. Thus, if his first trial is A1, his new probability of absorption by A1 is f(α1 p0 + 1 − α1, α1, α2), but if the first trial is A2, the new probability of absorption by A1 is f(α2 p0, α1, α2). Weighting these by their respective probabilities, we find the functional equation

    f(p0, α1, α2) = p0 f(α1 p0 + 1 − α1, α1, α2) + (1 − p0) f(α2 p0, α1, α2).

This functional equation and related ones have been studied by Bellman and Shapiro [22] and by Karlin [24]. Bellman showed that a limiting continuous solution exists, that it is analytic and unique, and described other properties. Shapiro studied methods of solving the functional equation and gave the rate of convergence of iterative solutions.
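One simple way to get a feel for the functional equation is to unroll it a fixed number of times and stop the recursion with the guess f(p) = p, which is correct at the barriers p = 0 and p = 1. The sketch below does exactly that. It is an illustrative iteration chosen for this note, not necessarily any of the methods studied by Bellman, Shapiro, or Karlin, and the name `absorption_estimate` and the `depth` parameter are assumptions.

```python
def absorption_estimate(p0, a1, a2, depth):
    """Approximate f(p0, a1, a2) by unrolling the functional equation `depth` times.

    The recursion is stopped with the guess f(p) = p, so the result is the
    expected probability of A1 after `depth` trials.
    """
    if depth == 0:
        return p0
    after_a1 = a1 * p0 + 1 - a1  # new probability of A1 if A1 occurs
    after_a2 = a2 * p0           # new probability of A1 if A2 occurs
    return (p0 * absorption_estimate(after_a1, a1, a2, depth - 1)
            + (1 - p0) * absorption_estimate(after_a2, a1, a2, depth - 1))


# Equal learning parameters: the estimate stays at p0 for every depth.
print(absorption_estimate(0.3, 0.8, 0.8, depth=10))

# Faster learning for A1 (smaller alpha_1): the estimate rises above p0.
print(absorption_estimate(0.5, 0.7, 0.9, depth=12))
```

When α1 = α2 = α the unrolled value equals p0 at every depth, because the expected new probability p(αp + 1 − α) + (1 − p)αp simplifies to p. The work grows as 2^depth, so this direct recursion is only practical for small depths.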
Karlin developed a different approach to the study of such random walk problems, and studied the case of two reflecting barriers as well as the present one of two absorbing barriers, and other extensions of this problem.

Summary

In summary, the erratic nature of the learning process suggests that a mathematical description might better be probabilistic than deterministic. The manner in which stochastic models describe the fine-grained structure of the learning process has been illustrated by the data of the escape-avoidance experiment. The use of these models to describe alternative psychological theories, and thus suggest tests of them, was illustrated by the paradise fish experiment. Similarly, the Schoeffler experiment was a test of Estes’ theory of conditioned stimuli. It was also noted that work in learning had created new mathematical problems of interest in their own right. In closing, I should mention that this is a quite young field and that we have a long way to go before we can explain even a good fraction of the many well-established principles already known to the psychologist. Progress can best be made if we are willing to consider and then destroy a great many of our mathematical efforts.


References

1. Brush, F. R., R. R. Bush, W. O. Jenkins, et al. 1952. Stimulus generalization after extinction and punishment: an experimental study of displacement. Jour. Abnor. and Soc. Psych. 47: 633–640.
2. Burke, C. J., and W. K. Estes. 1957. A component model for stimulus variables in discrimination learning. Psychometrika 22: 133–145.
3. Burke, C. J., W. K. Estes, and S. Hellyer. 1954. Rate of verbal conditioning in relation to stimulus variability. Jour. Exp. Psych. 48: 153–161.
4. Bush, R. R., and F. Mosteller. 1951. A mathematical model for simple learning. Psych. Review 58: 313–323.
5. Bush, R. R., and F. Mosteller. 1951. A model for stimulus generalization and discrimination. Psych. Review 58: 413–423.
6. Bush, R. R., and F. Mosteller. 1953. A stochastic model with applications to learning. Annals of Math. Stat. 24: 559–585.
7. Bush, R. R., and F. Mosteller. 1955. Stochastic models for learning. New York, John Wiley & Sons.
8. Bush, R. R., F. Mosteller, and G. L. Thompson. 1954. A formal structure for multiple-choice situations. In Decision Processes (edited by R.M. Thrall, C.H. Coombs, and R.L. Davis), 99–126. New York, John Wiley & Sons.
9. Bush, R. R., and J. W. M. Whiting. 1953. On the theory of psychoanalytic displacement. Jour. Abnor. and Soc. Psych. 48: 261–272.
10. Bush, R. R., and T. R. Wilson. 1956. Two-choice behavior of paradise fish. Jour. Exp. Psych. 51: 315–322.
11. Estes, W. K. 1950. Toward a statistical theory of learning. Psych. Review 57: 94–107.
12. Estes, W. K. 1950. Effects of competing reactions on the conditioning curve for bar pressing. Jour. Exp. Psych. 40: 200–205.
13. Estes, W. K. 1954. Individual behavior in uncertain situations: an interpretation in terms of statistical association theory. In Decision Processes (edited by R.M. Thrall, C.H. Coombs, and R.L. Davis), 127–137. New York, John Wiley & Sons.
14. Estes, W. K. 1955. Statistical theory of spontaneous recovery and regression. Psych. Review 62: 145–154.
15. Estes, W. K. 1955. Statistical theory of distributional phenomena in learning. Psych. Review 62: 369–377.
16. Estes, W. K. 1957. Theory of learning with constant, variable, or contingent probabilities of reinforcement. Psychometrika 22: 113–132.
17. Estes, W. K., and C. J. Burke. 1953. A theory of stimulus variability in learning. Psych. Review 60: 276–286.
18. Estes, W. K., and C. J. Burke. 1955. Application of a statistical model to simple discrimination learning in human subjects. Jour. Exp. Psych. 50: 81–88.
19. Estes, W. K., and J. H. Straughan. 1954. Analysis of a verbal conditioning situation in terms of statistical learning theory. Jour. Exp. Psych. 47: 225–234.
20. Gulliksen, H. 1934. A rational equation of the learning curve based on Thorndike’s law of effect. Jour. General Psych. 11: 395–434.
21. Gulliksen, H., and D. L. Wolfle. 1938. A theory of learning and transfer: I, II. Psychometrika 3: 127–149 and 225–251.
22. Harris, T. E., R. Bellman, and H. N. Shapiro. 1953. Studies in functional equations occurring in decision processes. The RAND Corporation, P-382.
23. Hays, D. G., and R. R. Bush. 1954. A study of group action. Amer. Sociological Rev. 19: 693–701.
24. Karlin, S. 1953. Some random walks arising in learning models I. Pacific Jour. Math. 3: 725–756.
25. Miller, G. A., and F. C. Frick. 1949. Statistical behavioristics and sequences of responses. Psych. Review 56: 311–324.
26. Miller, G. A., and W. J. McGill. 1952. A statistical description of verbal learning. Psychometrika 17: 369–396.
27. Restle, F. 1955. A theory of discrimination learning. Psych. Review 62: 11–19.
28. Restle, F. 1957. Theory of selective learning with probable reinforcements. Psych. Review 64: 182–191.
29. Schoeffler, M. S. 1954. Probability of response to compounds of discriminated stimuli. Jour. Exp. Psych. 48: 323–329.
30. Solomon, R. L., and L. C. Wynne. 1953. Traumatic avoidance learning: acquisition in normal dogs. Psych. Monographs 67 (354).
31. Thurstone, L. L. 1930. The learning function. Jour. General Psych. 3: 469–494.

Reprinted from American Mathematical Monthly (1958), 65, pp. 735–742

15. Factorial ½: A Simple Graphical Treatment

Frederick Mosteller and D.E. Richmond
Harvard University and Williams College

The great secret of successful interpolation or extrapolation is the choice of a good scale for plotting the function or the independent variable. Since the straight line is the only curve that provides easy and reliable interpolation, a good scale is one that tends to produce a straight line. We use this principle twice to obtain an approximate numerical definition for $\tfrac12!$.

Elementary students, after having heard about $n!$ for integers, often wish to know its value when $n$ is not an integer. Our discussion is keyed to the definition and evaluation of $\tfrac12!$, and more generally of $(n+\tfrac12)!$, where $n$ is an integer. While the discussion could be extended beyond the half-integers, its merit lies in its intuitive appeal; to pursue this extension would be excessive.

We begin by plotting $f(n) = n!$ against $n$ for a few values of $n$ (see Figure 1). The request to define $(n+\tfrac12)!$ is equivalent to asking how to interpolate certain values on a curve connecting the points plotted in Figure 1 for integral values of $n$. One feels that a complete set of interpolated points should create a smooth curve, but words like “smooth” are vague, except when points are collinear. Then the natural step is to put in the extra points by linear interpolation. Our trouble is that the function $f(n)$ rises so fast that we cannot get a good “purchase” on it.

If a function increases too fast for comfort, it is natural to take its logarithm, so we plot $\log_2 n!$ against $n$ (see Figure 2). The curve passed by the eye through these new points seems to be getting straighter as $n$ increases. We check this by looking at the slopes of the chords between $n$ and $n+1$ and between $n+1$ and $n+2$; namely:

$$\log(n+2)! - \log(n+1)! = \log(n+2), \qquad \log(n+1)! - \log n! = \log(n+1).$$

We observe that as $n$ gets large, the difference in the slopes of adjacent chords

This research was facilitated by a grant from the Ford Foundation and by the Laboratory of Social Relations of Harvard University


Fig. 1. n! plotted against n(= 0, 1, 2, 3, 4, 5, 6).

$$\log(n+2) - \log(n+1) = \log\left(1 + \frac{1}{n+1}\right)$$

becomes very small. Therefore, as $n$ gets larger, the polygonal curve formed by the chords is getting straighter and straighter. This straightness means that we can evaluate $\log(n+\tfrac12)!$ by going far enough out on the curve, so that linear interpolation will do the trick. So for large integral $n$ we would have approximately

$$\log\left(n+\tfrac12\right)! = \tfrac12\left[\log n! + \log(n+1)!\right]$$

or, the equivalent approximation

$$\left(n+\tfrac12\right)! = \sqrt{n!\,(n+1)!} = n!\,\sqrt{n+1}. \tag{1}$$


Fig. 2. Plot of log2 n! against n(=0, 1, · · · , 9).

Since the fundamental relation

$$(n+1)! = (n+1)\,n! \tag{2}$$

serves to define $n!$ for integral values of $n \ge 0$ (given $1! = 1$), it is natural to try to retain it for nonintegral values of $n$. We therefore write


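$$
\begin{aligned}
\tfrac32! &= \tfrac32\cdot\tfrac12!,\\
\tfrac52! &= \tfrac52\cdot\tfrac32! = \tfrac52\cdot\tfrac32\cdot\tfrac12!,\\
&\;\;\vdots\\
\left(n+\tfrac12\right)! &= \frac{(2n+1)(2n-1)\cdots 3}{2^{n}}\cdot\tfrac12!.
\end{aligned}
\tag{3}
$$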

By inserting the factors $2n, 2n-2, \cdots, 2$ in the numerator and denominator of the right-hand side of (3) we can rewrite it as

$$\left(n+\tfrac12\right)! = \frac{(2n+1)!}{2^{2n}\,n!}\cdot\tfrac12!.$$

Then we have

$$\tfrac12! = \frac{2^{2n}\,n!}{(2n+1)!}\left(n+\tfrac12\right)!. \tag{4}$$

We obtain a sequence of approximations $v_0, v_1, v_2, \cdots$ to the desired $\tfrac12!$ by using (1) as an approximation for $(n+\tfrac12)!$ in (4). Then

$$v_n = \frac{2^{2n}(n!)^2\sqrt{n+1}}{(2n+1)!}.$$

It is pleasant to work entirely with integers. Accordingly, we shall compute $v_n^2$ instead of $v_n$:

$$v_n^2 = \frac{(n+1)(n!)^4\,2^{4n}}{[(2n+1)!]^2}. \tag{5}$$
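Equation (5) involves only integers, so the first few $v_n^2$ can be evaluated exactly. The following is a minimal sketch using Python's `fractions` module; the function name `v_squared` is an illustrative choice, not something from the paper.

```python
from fractions import Fraction
from math import factorial


def v_squared(n: int) -> Fraction:
    """Exact value of v_n^2 from equation (5)."""
    numerator = (n + 1) * factorial(n) ** 4 * 2 ** (4 * n)
    denominator = factorial(2 * n + 1) ** 2
    return Fraction(numerator, denominator)


# Print the exact fraction and its decimal value for small n.
for n in range(5):
    value = v_squared(n)
    print(n, value, float(value))
```

Working with exact fractions avoids any rounding question when successive terms are compared.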

Each member of this sequence is positive. Since

$$\frac{v_n^2}{v_{n-1}^2} = 1 - \frac{1}{(2n+1)^2}, \qquad n \ge 1, \tag{6}$$

the sequence decreases monotonically. Hence $v_n^2$ approaches a limit which will be defined as $(\tfrac12!)^2$. Equation (6) is equivalent to

$$v_{n-1}^2 = \left[1 + \frac{1}{4n(n+1)}\right] v_n^2, \tag{7}$$

which will be useful later.

We wish to evaluate this limit graphically. We need to adopt a scale on the horizontal axis so that as $n$ goes from 1 to $\infty$, the graph stays on the paper. The classical device is to plot against $1/n$ or more generally $1/(n+k)$ ($k$ a positive constant), because then as $n$ goes from 1 to $\infty$, the function is plotted for points on the interval from $1/(1+k)$ to 0. If by good luck we can


choose $k$ so that the resulting graph is approximated by a straight line, we can find the required limit of $v_n^2$ as $n \to \infty$, by finding where this straight line crosses the vertical axis. The first three points $(1/(1+k), v_1^2)$, $(1/(2+k), v_2^2)$, $(1/(3+k), v_3^2)$ can be lined up exactly by choosing $k = 23/25 = 0.92$. This value of $k$ is near enough to unity to suggest plotting $v_n^2$ against $1/(n+1)$ (see Figure 3). (Had $n = 0, 1, 2$ been used for the first three points rather than $n = 1, 2,$ and $3$, $k$ would have been 16/17 or approximately 0.94, even closer to unity. By not plotting the point (1, 1) corresponding to $n = 0$ in Figure 3, we have been able to enlarge the scale on both axes.) The result is a satisfactory resemblance to a straight line, especially on the left. Extrapolating to $1/(n+1) = 0$ by eye gives the approximation 0.785 for $[(\tfrac12)!]^2$. The straightness is sufficiently encouraging to justify a more careful extrapolation to 0.

Fig. 3. Plot of $v_n^2$ against $1/(n+1)$.

To facilitate the work, we tabulate $v_n^2$ against $1/(n+1)$ for certain convenient values of $n$ (see Table 1). The last column contains extrapolations to $v_\infty^2$ which will now be explained.

Table 1. Estimates of $(\tfrac12!)^2 = v_\infty^2$

      n    1/(n+1)    First approximation    Second approximation
                      v_n^2                  b_n = v_n^2 [1 − 1/(4(n+1))]
      1    .5         .88888 88889           .77777 77778
      2    .33···     .85333 33333           .78222 22222
      3    .25        .83591 83673           .78367 34694
      4    .20        .82559 83875           .78431 84681
      9    .10        .80527 22492           .78514 04430
     19    .05        .79527 62214           .78533 52686
     24    .04        .79329 10176           .78535 81074
     49    .02        .78933 49223           .78538 82477
     99    .01        .78736 41070           .78539 56968
    199    .0050      .78638 05239           .78539 75483
    399    .0025      .78588 91906           .78539 80098
                                     π/4 =   .78539 81634

The slope of the line joining the point

$$P_n = \left(\frac{1}{n+1},\, v_n^2\right) \quad\text{to}\quad P_{n-1} = \left(\frac{1}{n},\, v_{n-1}^2\right)$$

is

$$m_n = \frac{v_{n-1}^2 - v_n^2}{\dfrac{1}{n} - \dfrac{1}{n+1}}.$$

Using equation (7),

$$m_n = n(n+1)\left[v_{n-1}^2 - v_n^2\right] = \frac{v_n^2}{4},$$

a remarkably simple result. The line through $P_n$ with slope $m_n$ intersects the $v_n^2$ axis at

$$b_n = v_n^2\left[1 - \frac{1}{4(n+1)}\right].$$
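The two columns of Table 1 can be regenerated from equation (5) and the intercept formula for $b_n$. This is a sketch under the assumption that exact rational arithmetic followed by conversion to floating point is accurate enough for ten printed decimals; the helper names are illustrative.

```python
from fractions import Fraction
from math import factorial, pi


def v_squared(n: int) -> Fraction:
    """Exact v_n^2 from equation (5)."""
    return Fraction((n + 1) * factorial(n) ** 4 * 2 ** (4 * n),
                    factorial(2 * n + 1) ** 2)


def b(n: int) -> Fraction:
    """Intercept of the chord through P_(n-1) and P_n on the v^2 axis."""
    return v_squared(n) * (1 - Fraction(1, 4 * (n + 1)))


# Same values of n as in Table 1.
for n in (1, 2, 3, 4, 9, 19, 24, 49, 99, 199, 399):
    print(f"{n:4d}  {float(v_squared(n)):.10f}  {float(b(n)):.10f}")
print("pi/4 =", pi / 4)
```

The extrapolated values approach π/4 from below much faster than the raw $v_n^2$ approach it from above.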

The results for $b_n$ have been listed in Table 1 as estimates of $v_\infty^2 = (\tfrac12!)^2$. As we go down the table we note that the value is stabilizing near 0.785398, which is $\pi/4$ to six decimal places. The correct result $(\tfrac12)! = \sqrt{\pi}/2$ is therefore suggested. This result is no accident, since there is, in fact, a close connection between our work and Wallis’ formula

$$\frac{\pi}{2} = \frac21\cdot\frac23\cdot\frac43\cdot\frac45\cdot\frac65\cdot\frac67\cdots$$


which may be written [2]

$$\frac{\pi}{2} = \lim_{n\to\infty}\frac{2^{4n}(n!)^4}{[(2n)!]^2\,(2n+1)} = \lim_{n\to\infty}\frac{2n+1}{n+1}\,v_n^2,$$

so that $\lim_{n\to\infty} v_n^2 = \pi/4$. The interest in this note lies in the elementary methods which have been used to evaluate $(\tfrac12)!$, rather than in the result itself.

The remainder of this note is concerned with improvements of the results in Table 1 which may be obtained by plotting against $1/(n+k)$ where $k \ne 1$. If we parallel the previous discussion, we obtain for the slope of the line from $P_n = (1/(n+k),\, v_n^2)$ to $P_{n-1} = (1/(n+k-1),\, v_{n-1}^2)$, the value

$$m_n = \frac{v_n^2}{4}\cdot\frac{(n+k-1)(n+k)}{n(n+1)}, \tag{8}$$

and for the intercept on the $v_n^2$ axis

$$b_n = v_n^2\left[1 - \frac{1}{4(n+1)} + \frac{1-k}{4(n+1)n}\right]. \tag{9}$$
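As a check on equation (9), the intercept can be computed directly from the two plotted points and compared with the closed form. The sketch below uses exact fractions; the function names and the sample value k = 9/10 are illustrative assumptions.

```python
from fractions import Fraction
from math import factorial


def v_squared(n: int) -> Fraction:
    """Exact v_n^2 from equation (5)."""
    return Fraction((n + 1) * factorial(n) ** 4 * 2 ** (4 * n),
                    factorial(2 * n + 1) ** 2)


def intercept_direct(n: int, k) -> Fraction:
    """Intercept of the chord through P_(n-1) and P_n when plotting against 1/(n+k)."""
    k = Fraction(k)
    x1, y1 = 1 / (n + k), v_squared(n)
    x0, y0 = 1 / (n + k - 1), v_squared(n - 1)
    slope = (y0 - y1) / (x0 - x1)
    return y1 - slope * x1


def intercept_formula(n: int, k) -> Fraction:
    """Closed form of the same intercept, equation (9)."""
    k = Fraction(k)
    return v_squared(n) * (1 - Fraction(1, 4 * (n + 1)) + (1 - k) / (4 * (n + 1) * n))


k = Fraction(9, 10)  # an arbitrary sample value, not one used in the paper
for n in (1, 2, 3, 10):
    print(n, float(intercept_direct(n, k)), float(intercept_formula(n, k)))
```

With k = 1 the extra term vanishes and the formula reduces to the intercept used for Table 1.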

What is the best k to use for a given n? As long as the graph is concave upward to the left of Pn , bn will be too small. Since bn increases as k decreases, we shall obtain the best value of bn among graphs of this type by choosing the smallest k compatible with the requirement of upward concavity. The graph is concave upward to the immediate left of Pn if the ratio R = mn /mn+1 ≥ 1, that is, using (8), if

$$R = \frac{\left(n+\tfrac32\right)^2 (n+k-1)}{n(n+1)(n+k+1)} \ge 1,$$

which implies that $k \ge (7n+9)/(8n+9)$. Note that if this condition is satisfied for a given $n$, it is satisfied for all larger values of $n$ and hence for the whole graph to the left of $P_n$. To obtain the best value of $v_\infty^2$ for a fixed $n$ among graphs with upward concavity, we should therefore use $k_n = (7n+9)/(8n+9)$. Results for some values of $n$ are given in Table 2.

As we know, even this most favorable $k$ gives too small a result for $v_\infty^2$. It is easy to obtain a result which is known to be too large by choosing $k \le 7/8$, since for all such values of $k$ the graph is concave downward to the left of $P_n$. Among these values, $k = 7/8$ gives the smallest $b_n$ and hence the best upper bound for $v_\infty^2$ among graphs of this type. Values of $b_n$ for $k = 7/8$ have been added to Table 2.

Table 2. Bounds for $(\tfrac12!)^2 = v_\infty^2$ Obtained from the Best Values of $k$

$$b_n = v_n^2\left[1 - \frac{1}{4(n+1)} + \frac{1-k}{4(n+1)n}\right]$$

      n    Lower bounds              Upper bounds
           k = (7n+9)/(8n+9)         k = 7/8
      1    .78431 37255              .79166 66667
      2    .78506 66667              .78666 66667
      3    .78525 66481              .78585 03401
      4    .78532 52954              .78560 84656
      9    .78538 89838              .78542 00514
     19    .78539 70137              .78540 06696
     24    .78539 75746              .78539 94246
     49    .78539 80898              .78539 83157
     99    .78539 81542              .78539 81821
    199    .78539 81622              .78539 81657
    399    .78539 816325             .78539 81637
                        π/4 = .78539 816340

Unfortunately, a choice of $k$ between 7/8 and $k_n$ corresponds to a graph which changes its concavity to the left of $P_n$ so that no further improvement can be made by these elementary methods since we have no way to determine whether the result is too large or too small. Nevertheless, the bounds obtained are rather good.

In both tables the entries are believed to be accurate to the number of places indicated. The tables of logarithms and of logarithms of factorials used in these calculations were [1, 3]. The authors wish to express their appreciation to Mrs. Cleo Youtz for carrying out the calculations.

References

1. F.-J. Duarte, Nouvelles tables de log n! à 33 décimales. Geneva and Paris, 1927.
2. P. Franklin, Methods of Advanced Calculus. New York, 1944, p. 262.
3. A. J. Thompson, Logarithmetica Britannica, vols. I and II. Cambridge University Press (published in nine parts at irregular intervals, 1924–1952).

Reprinted from Studies in Mathematical Learning Theory (1959), R.R. Bush and W.K. Estes, eds. Stanford: Stanford University Press, pp. 293–307

16. A Comparison of Eight Models

Robert R. Bush and Frederick Mosteller
University of Pennsylvania, Harvard University

Introduction

In the testing of a scientific model or theory, one rarely has a general measure of goodness-of-fit, a universal yardstick by which one accepts or rejects the model. Indeed, science does not and should not work this way; a theory is kept until a better one is found. One way that science does work is by comparing two or more theories to determine their relative merits in handling relevant data. In this paper we present a comparison of eight models for learning by using each to analyze the data from the same experiment.¹

A primary goal of any learning model is to predict correctly the learning curve, that is, proportions of correct responses versus trials. Almost any sensible model with two or three free parameters, however, can closely fit the curve, and so other criteria must be invoked when one is comparing several models. A criterion that has been used in recent years is the extent to which a model can reproduce the fine-grain structure of the response sequences. Many properties can be and have been invented for this purpose. Fourteen such properties are used in this paper.

A summary index of how well one model fits the fine-grain detail of data compared with another model is the likelihood ratio. There are three objections to this measure, however. First, for many models it is very difficult to compute. Second, its use obscures the particular strengths and weaknesses of a model and so fails to suggest why the model is inadequate. Third, it may be especially sensitive to uninteresting differences between the model and the experiment. Therefore we do not use likelihood ratios in this paper.

A satisfactory prediction of the sequential properties of learning data from a single experiment is by no means a final test of a model. Numerous other criteria (and some more demanding) can be specified. For example, a model

Support for this research was received from the Ford Foundation, the National Science Foundation (grant NSF-G2258), and the Laboratory of Social Relations, Harvard University.
¹ The same data are used in testing still another model in Chapter 18.


with specific numerical parameter values should be invariant to changes in independent variables that explicitly enter in the model. Such requirements we do not investigate in this paper. Our analyses are restricted to the problem of predicting sequential details. We believe that this is a sensible second step once the learning curve has been handled.

The particular data used for comparing the eight models were obtained by Solomon and Wynne from an experiment on the avoidance training of dogs [8]. On each of 25 trials, a dog could avoid an intense electric shock by jumping over a barrier within ten seconds after the occurrence of a conditioned stimulus. The basic data are sequences of shocks (S) and avoidances (A) for 30 dogs. In an earlier work we analyzed these data with a two-operator linear model. This is one of the eight models being compared in this paper, and so the results are summarized below.

For most learning models of the type to be discussed, one can derive formulas for the expected values and variances of several sequential statistics. Such formulas are of great value in the application of models to data. It often turns out, however, that for a particular model certain explicit formulas are very difficult if not impossible to obtain. Therefore, for the purpose of this paper, we resort to Monte Carlo computations with each model. This allows a direct comparison of the models on each of the fourteen statistics chosen. One exception to this procedure occurs in our treatment of the Markov model. For some properties of some models, we include explicit formulas even though they are not used in the final comparisons.

To standardize notations, we use $p_n$ to represent the probability of avoidance (A) on trial $n$ and $q_n = 1 - p_n$ to denote the probability of shock (S). Trials are numbered $n = 1, 2, \cdots$.

The Two-Operator Linear Model

The model previously applied [1] to the Solomon-Wynne data asserts that

$$q_{n+1} = \begin{cases} \alpha_2 q_n & \text{if S occurs on trial } n,\\ \alpha_1 q_n & \text{if A occurs on trial } n, \end{cases}$$

where $0 \le \alpha_1, \alpha_2 \le 1$, and $q_1 = 1.00$. Several procedures for estimating $\alpha_1$ and $\alpha_2$ from the data were used, but the final estimates were $\hat\alpha_1 = 0.80$ and $\hat\alpha_2 = 0.92$. With these parameter values we ran 30 stat-dogs and calculated the values of the fourteen statistics listed in Table 1.
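The Monte Carlo runs ("stat-dogs") for the two-operator model can be imitated in a few lines. This is only a sketch of the idea, assuming a standard pseudo-random generator; it is not the authors' original procedure, and the function name and seed are illustrative.

```python
import random


def simulate_dog(alpha1=0.80, alpha2=0.92, trials=25, rng=random):
    """Simulate one stat-dog; returns a list of 'S' (shock) and 'A' (avoidance)."""
    q = 1.0  # probability of shock on trial 1
    responses = []
    for _ in range(trials):
        shocked = rng.random() < q
        responses.append("S" if shocked else "A")
        # Apply the operator that matches the outcome of this trial.
        q *= alpha2 if shocked else alpha1
    return responses


rng = random.Random(0)  # fixed seed so that a run can be repeated
dogs = [simulate_dog(rng=rng) for _ in range(30)]
mean_shocks = sum(dog.count("S") for dog in dogs) / len(dogs)
print("mean shocks per stat-dog:", mean_shocks)
```

Any sequential statistic of interest can then be computed from the simulated response lists in the same way as from the real data.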


A Hullian Model²

Clark Hull does not explicitly consider the problem of avoidance training in his Principles of Behavior, but he does give a general theory of acquisition [2]. He asserts that the increment in habit strength occurring on each trial is a constant proportion of the “potential habit strength yet unformed.” He further says (in one place at least) that an index of habit strength is “per cent of correct reaction evocation.” This suggests that if $p_n$ is the probability of avoidance on trial $n$, then

$$p_{n+1} = p_n + (1-\alpha)(1-p_n),$$

where $(1-\alpha)$ is the constant of proportionality. In terms of $q_n$, the probability of shock, this transition law is

$$q_{n+1} = \alpha q_n.$$

Note that this rule is equivalent to the basic assumption of the two-operator linear model when $\alpha_1 = \alpha_2 = \alpha$. In this Hullian model, the expectation of $T$, the total number of shocks in 25 trials, is

$$E(T) = \sum_{n=1}^{25} q_n = \sum_{n=1}^{25} \alpha^{n-1} q_1 = q_1\,\frac{1-\alpha^{25}}{1-\alpha}.$$

As before, we take $q_1 = 1$. From the data we get the average number of shocks per dog to be $\bar T = 7.80$. By equating $\bar T$ and $E(T)$, we get $\hat\alpha = 0.88$. With this value of $\alpha$, 30 stat-dogs were run and statistics of the sequences were computed. The results are shown in Table 1.

² To each of the remaining models we attach the name of the man judged by us to be most closely associated with it. This does not imply that any of these men has explicitly proposed a model for the Solomon-Wynne experiment, nor that our interpretations would be acceptable to them. At best we are simplifying if not wrenching their notions, but it may help the reader if we use these names as labels.

A Hullian Model with Individual Differences

Inspection of Table 1 shows that the Hullian stat-dogs are less variable than the real dogs and so one might suspect that the Hullian model would apply to individuals, but that different dogs would have different values of the parameter $\alpha$. To construct a model that allows for such individual differences we need to make some assumptions about the distribution of $\alpha$, and we need to estimate parameters of this distribution from the data. Values of $\alpha$ are restricted to the unit interval. A reasonable, convenient, and well-known probability density function on the unit interval is the beta-distribution [5],

$$f(\alpha) = \frac{(r+s+1)!}{r!\,s!}\,\alpha^{r}(1-\alpha)^{s},$$

where the parameters $r$ and $s$ must both be greater than $-1$. The mean and variance of this distribution are

$$E(\alpha) = \frac{r+1}{r+s+2}, \qquad \operatorname{var}(\alpha) = \frac{(r+1)(s+1)}{(r+s+3)(r+s+2)^2}.$$
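Returning to the single-parameter Hullian model: the estimate of α comes from equating the observed mean of 7.80 shocks to $E(T) = (1-\alpha^{25})/(1-\alpha)$, which has no closed-form solution for α. A minimal sketch of one way to solve it numerically is given here, assuming simple bisection (the paper does not say how the equation was solved); the function names and bracket are illustrative.

```python
def expected_shocks(alpha: float, trials: int = 25, q1: float = 1.0) -> float:
    """E(T) for the Hullian model: q1 * (1 - alpha**trials) / (1 - alpha)."""
    return q1 * (1 - alpha ** trials) / (1 - alpha)


def solve_alpha(target: float, lo: float = 0.5, hi: float = 0.99, tol: float = 1e-10) -> float:
    """Bisection on alpha; E(T) is increasing in alpha on (0, 1)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_shocks(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


alpha_hat = solve_alpha(7.80)
print(round(alpha_hat, 4), round(expected_shocks(alpha_hat), 4))
```

Rounded to two decimals, the root agrees with the value 0.88 quoted in the text.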

We shall assume that the values of $\alpha$ have this distribution and estimate $r$ and $s$ from the Solomon-Wynne data. In order to estimate $r$ and $s$, we compute the mean and variance of the total number of shocks in terms of $r$ and $s$, set these expected values equal to the observed values, and thereby solve for the desired estimates. A particular value of $\alpha$ determines the parameter of a binomial distribution for a single trial, and so we need expectations over these binomials as well as expectations over the assumed beta-distribution. Subscripts $b$ and $\beta$ will be used on the expectation operator $E$ to indicate whether the binomial or the beta-distribution, respectively, is involved. The total number of shocks received by a dog in an infinite number of trials is represented by a random variable $T$. We then see that for fixed $\alpha$,

$$E_b(T) = \sum_{n=1}^{\infty} q_n = \sum_{n=1}^{\infty} \alpha^{n-1} = \frac{1}{1-\alpha},$$

and the variance of $T$ is

$$\operatorname{var}(T) = \sum_{n=1}^{\infty} q_n(1-q_n) = \sum_{n=1}^{\infty} \alpha^{n-1}\left(1-\alpha^{n-1}\right) = \frac{1}{1-\alpha} - \frac{1}{1-\alpha^2}.$$

Therefore

$$E_b(T^2) = \frac{1}{1-\alpha} - \frac{1}{1-\alpha^2} + \frac{1}{(1-\alpha)^2}.$$

We now need expectations over the beta-distributions (we assume they exist):

$$
\begin{aligned}
E_\beta E_b(T) &= E_\beta\!\left[\frac{1}{1-\alpha}\right] = \int_0^1 \frac{1}{1-\alpha}\,f(\alpha)\,d\alpha = \int_0^1 \frac{(r+s+1)!}{r!\,s!}\,\alpha^{r}(1-\alpha)^{s-1}\,d\alpha\\
&= \frac{r+s+1}{s}\int_0^1 \frac{(r+s)!}{r!\,(s-1)!}\,\alpha^{r}(1-\alpha)^{s-1}\,d\alpha.
\end{aligned}
$$

The last integral is unity because the integrand is a beta density function with parameters $r$ and $s-1$. Thus, dropping the subscripts, we have

$$E(T) = \frac{r+s+1}{s}.$$

By a similar process we can easily show that

$$E_\beta\!\left[\frac{1}{(1-\alpha)^2}\right] = \frac{(r+s)(r+s+1)}{(s-1)s}.$$


To compute $E_\beta E_b(T^2)$, we need also

$$E_\beta\!\left[\frac{1}{1-\alpha^2}\right] = \int_0^1 \frac{(r+s+1)!}{r!\,s!}\,\alpha^{r}\,\frac{(1-\alpha)^{s-1}}{1+\alpha}\,d\alpha.$$

This integral causes some minor difficulty but we can expand $1/(1+\alpha)$ into a series as follows:

$$\frac{1}{1+\alpha} = \frac{1}{2-(1-\alpha)} = \frac12\cdot\frac{1}{1-(1-\alpha)/2} = \frac12\left[1 + \frac{1-\alpha}{2} + \frac{(1-\alpha)^2}{4} + \frac{(1-\alpha)^3}{8} + \cdots\right].$$

The integration can be carried out term by term to give

$$E_\beta\!\left[\frac{1}{1-\alpha^2}\right] = \frac{r+s+1}{2s} + \frac14 + \frac{s+1}{8(r+s+2)} + \frac{(s+1)(s+2)}{16(r+s+2)(r+s+3)} + \cdots.$$

This result enables us to write

$$E(T^2) = \frac{(r+s)(r+s+1)}{(s-1)s} + \frac{r+1}{2s} + \frac14 - \delta,$$

where

$$\delta = \frac{s+1}{8(r+s+2)}\left[1 + \frac{s+2}{2(r+s+3)} + \cdots\right].$$

The variance of $T$ is then

$$\operatorname{var}(T) = \frac{(r+1)(r+s+1)}{s^2(s-1)} + \frac{r+1}{2s} + \frac14 - \delta.$$

This equation and our previous expression for $E(T)$ are the desired estimation equations. To simplify computations, we let

$$\mu = E(T) = \frac{r+s+1}{s}, \qquad \sigma^2 = \operatorname{var}(T).$$

We then can solve for $s$ to get

$$s = 1 + \frac{\mu(\mu-1)}{\sigma^2 - \tfrac12(\mu-1) - \tfrac14 + \delta}, \tag{1}$$

with

$$\delta = \frac{s+1}{8(s\mu+1)}\left[1 + \frac{s+2}{2(s\mu+2)} + \cdots\right].$$

Having estimated $s$ from Equation 1, we can estimate $r$ from the equation

$$r = s(\mu-1) - 1.$$
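Because δ depends on s, Equation 1 has to be solved iteratively. The sketch below assumes a plain fixed-point iteration and evaluates the δ series term by term using $E[(1-\alpha)^k] = \prod_{m=1}^{k}(s+m)/(s\mu+m)$, which follows from $r+s+1 = s\mu$. The inputs 7.80 and 6.58 are the sample mean and variance the authors report for the Solomon-Wynne data; the function names and the iteration scheme are illustrative, not the authors' own computation.

```python
def delta(s: float, mu: float, terms: int = 60) -> float:
    """Series for delta: sum over k >= 1 of E[(1 - alpha)^k] / 2**(k + 2)."""
    total, moment = 0.0, 1.0
    for k in range(1, terms + 1):
        moment *= (s + k) / (s * mu + k)  # running value of E[(1 - alpha)^k]
        total += moment / 2 ** (k + 2)
    return total


def estimate_r_s(mu: float, var: float, iterations: int = 50):
    """Fixed-point iteration on Equation 1, then r = s * (mu - 1) - 1."""
    s = 2.0  # arbitrary starting value
    for _ in range(iterations):
        s = 1 + mu * (mu - 1) / (var - 0.5 * (mu - 1) - 0.25 + delta(s, mu))
    r = s * (mu - 1) - 1
    return r, s


r_hat, s_hat = estimate_r_s(mu=7.80, var=6.58)
print(round(r_hat, 2), round(s_hat, 2))
```

The iteration settles quickly because δ is small and changes only slowly with s.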


The Solomon-Wynne data give $\hat\mu = 7.80$, $\hat\sigma^2 = 6.58$, which in turn yield the estimates $\hat s = 19$, $\hat r = 128$. In terms of these values, we get

$$E(\alpha) = 129/149 = 0.866, \qquad \operatorname{var}(\alpha) = \frac{(129)(20)}{(150)(149)^2} = 0.000775, \qquad \sigma(\alpha) = 0.0278.$$

In order to use available tables, we choose $s = 19$, $r = 130$, which give $E(\alpha) = 0.868$, $\sigma(\alpha) = 0.0275$. From tables [6], the cumulative of the beta-distribution $F(\alpha)$ with $r = 130$, $s = 19$, and the first differences $\Delta F(\alpha)$ were determined as shown below.

    α       F(α)     ΔF(α) (increase from preceding row)
    .76     .000
    .78     .003     .003
    .80     .013     .010
    .82     .051     .038
    .84     .158     .107
    .86     .372     .214
    .88     .657     .285
    .90     .887     .230
    .92     .983     .096
    .94     .999     .016
    .96    1.000     .001

To approximate this distribution with 30 animals, the following frequency table was constructed.

    α       Number
    .81     1
    .83     3
    .85     6
    .87     9
    .89     7
    .91     3
    .93     1
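The mean and standard deviation of this 30-animal frequency table can be computed directly. The sketch assumes the population form of the standard deviation (divisor 30); the paper does not state which divisor was used.

```python
from math import sqrt

# Frequency table of alpha values assigned to the 30 stat-dogs.
freq = {0.81: 1, 0.83: 3, 0.85: 6, 0.87: 9, 0.89: 7, 0.91: 3, 0.93: 1}

n = sum(freq.values())
mean = sum(alpha * count for alpha, count in freq.items()) / n
variance = sum(count * (alpha - mean) ** 2 for alpha, count in freq.items()) / n
sd = sqrt(variance)
print(n, round(mean, 3), round(sd, 4))
```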

The mean is 0.871 and the standard deviation is 0.0271. These values were considered sufficiently close to the desired values of 0.866 and 0.0278. With the above set of $\alpha$’s, 30 stat-dogs were run and they yielded the statistics given for them in Table 1.

An Early Thurstone Model

In 1917, Thurstone proposed a hyperbolic law of learning [9],

$$y = \frac{a(n+c)}{n+c+b},$$

where $y$ is the number of successes per unit time, $n$ is the number of trials, and $a$, $b$, and $c$ are constants. If $y$ is to be the probability of avoidance, which is zero when $n = 1$ and approaches unity as $n \to \infty$, Thurstone’s equation becomes

$$p_n = \frac{n-1}{n-1+b},$$

and the probability of shock is

$$q_n = \frac{b}{n-1+b}.$$

The expected number of shocks during the first $N$ trials is

$$E(T) = \sum_{n=1}^{N} q_n = \sum_{n=1}^{N} \frac{b}{n-1+b}.$$

The sum can be approximated by the integral

$$\int_{1/2}^{N+(1/2)} \frac{b\,dn}{n-1+b} = b\log\frac{b+N-(1/2)}{b-(1/2)}.$$
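The quality of the integral approximation is easy to inspect numerically. The sketch below evaluates both the direct sum and the integral for N = 25 and b = 3.5, the value of b adopted in the text; the function names are illustrative.

```python
from math import log


def shock_probability(n: int, b: float) -> float:
    """q_n = b / (n - 1 + b) for the early Thurstone model."""
    return b / (n - 1 + b)


def expected_shocks_sum(b: float, trials: int = 25) -> float:
    """Direct evaluation of E(T) as a finite sum."""
    return sum(shock_probability(n, b) for n in range(1, trials + 1))


def expected_shocks_integral(b: float, trials: int = 25) -> float:
    """Integral approximation: b * log((b + N - 1/2) / (b - 1/2))."""
    return b * log((b + trials - 0.5) / (b - 0.5))


b = 3.5
print(round(expected_shocks_sum(b), 3), round(expected_shocks_integral(b), 3))
```

The two evaluations differ only in the second decimal place, which is why the integral suffices for estimating b.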

For $N = 25$ and $\bar T = 7.80$, we get $\hat b = 3.5$. Furthermore, a direct evaluation of the sum with $b = 3.5$ gives the correct value of 7.80. With this value of $b$, the values of $q_n$ were computed and 30 stat-dogs run. The results are given in Table 1.

A Late Thurstone Model

One of the first stochastic learning models with differential effects for success and failure was Thurstone’s urn scheme [10]. His idea was that an urn containing black and white balls represented the probability of the responses in a two-choice situation. Thus if a randomly drawn ball is white, response 1 occurs; if black, response 2 occurs. The contents of the urn can be altered by the effects of events. For example, if event $i$ occurred, $a_i$ white and $b_i$ black balls could be added to the urn. While the $a$’s and $b$’s can take negative values, some arrangement must be made to assure that the number of balls of any color is never negative, and that the urn always has at least one ball in it. In the model that follows, we shall add only white balls.³

³ If the number of white balls added were proportional to the number already in the urn, Thurstone’s model would be essentially equivalent to the beta model discussed in Chapter 18 of this volume.

Suppose that immediately preceding some trial $n$, the contents of the urn are $(a_n, b_n)$ in white and black balls. Then the probability of shock on trial $n$ is

$$q_n = \frac{b_n}{a_n + b_n}.$$

If a white ball is drawn, an avoidance occurs. The white ball is replaced and $c_1$ whites added to the urn. If a black ball occurs, then shock occurs, and the black ball and $c_2$ whites are added to the urn. Thus both shock and avoidance improve the probability of an avoidance if we assume the $c$’s are positive. In the dog data, the initial probability of an avoidance has been taken to be zero. Thus initially we can take $q_1 = 1$, i.e., $b_1 = 1$, $a_1 = 0$. We shall not actually work with the concept of balls but with continuous parameters. This does no violence to Thurstone’s original idea. He obviously used the balls for intuitive appeal. Then in general on trial $n$, the probability of shock is

$$q_{ij} = \frac{1}{1 + ic_2 + jc_1} \qquad (i + j = n - 1), \tag{2}$$

where $i$ is the number of previous shocks and $j$ is the number of previous avoidances.

Several estimation procedures have been considered for obtaining values of $c_1$ and $c_2$. Some of these are rather tedious. We settled upon a rather inexpensive method. We note first that $\hat q_{ij} = x_{ij}/n_{ij}$, where $n_{ij}$ is the number of dogs with $i$ previous shocks and $j$ previous avoidances, and where $x_{ij}$ is the number of those dogs shocked on trial $n = i + j + 1$. Then we could rewrite by analogy with Equation 2

$$x_{ij}(1 + ic_2 + jc_1) \mathrel{\hat{=}} n_{ij},$$

where $\hat{=}$ means “estimates” or “estimated by.” There are two parameters to estimate, $c_1$ and $c_2$, and so we wish to obtain a pair of simultaneous equations. These can be obtained by summing over two different sets of $i, j$ pairs. Suppose these sets are called $A$ and $B$; then we get the equations

$$
\begin{aligned}
c_2\sum_{A} i\,x_{ij} + c_1\sum_{A} j\,x_{ij} &= \sum_{A}(n_{ij} - x_{ij}),\\
c_2\sum_{B} i\,x_{ij} + c_1\sum_{B} j\,x_{ij} &= \sum_{B}(n_{ij} - x_{ij}).
\end{aligned}
$$
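Equation 2 is all that is needed to generate stat-dogs for the urn model. The following is a minimal sketch, assuming a standard pseudo-random generator; the parameter values 0.446 and 0.111 are the final estimates quoted later in this section, and the function names and seed are illustrative.

```python
import random


def shock_probability(i: int, j: int, c1: float, c2: float) -> float:
    """Equation 2: q_ij after i previous shocks and j previous avoidances."""
    return 1.0 / (1.0 + i * c2 + j * c1)


def simulate_dog(c1: float, c2: float, trials: int = 25, rng=random):
    """Simulate one stat-dog under the urn model."""
    i = j = 0
    responses = []
    for _ in range(trials):
        if rng.random() < shock_probability(i, j, c1, c2):
            responses.append("S")
            i += 1  # a shock adds c2 to the white-ball total
        else:
            responses.append("A")
            j += 1  # an avoidance adds c1 to the white-ball total
    return responses


rng = random.Random(0)  # fixed seed so that a run can be repeated
dogs = [simulate_dog(0.446, 0.111, rng=rng) for _ in range(30)]
print(sum(dog.count("S") for dog in dogs) / len(dogs))
```

Tabulating the simulated dogs by (i, j) gives counts of the same form as the $n_{ij}$ and $x_{ij}$ used in the estimation equations.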

The coefficients of the $c$’s suggest how we might pick the sets $A$ and $B$. Generally speaking, in one of the equations we wish the coefficient of $c_1$ to be large and that of $c_2$ to be small, and in the other equation we want this relation reversed. We might therefore pick for our set $A$ those $i, j$ pairs such that $i > j$, for $B$ those $i, j$ pairs for which $j \ge i$. While we are not thereby guaranteed that the result is satisfactory, because the $x_{ij}$ have their contribution to make, we can hope that the general effect is about right. The values we got from such an approach gave us a preliminary estimate. (We actually used $i > j + 1$ for $A$, and $i \le j + 1$ for $B$.) It seemed wise to try to improve this estimate by forming weights for the various cells. We decided to weight the equations of the form $x_{ij}(1 + ic_2 + jc_1) = n_{ij}$ reciprocally by their variances. The variance of the left side is

$$(1 + ic_2 + jc_1)^2\, n_{ij}\,\frac{1}{1 + ic_2 + jc_1}\cdot\frac{ic_2 + jc_1}{1 + ic_2 + jc_1} = n_{ij}(ic_2 + jc_1).$$

Dividing both sides by this result and then multiplying through by $n_{ij}$ again gives

$$x_{ij}\,\frac{1 + ic_2 + jc_1}{ic_2 + jc_1} \mathrel{\hat{=}} \frac{n_{ij}}{ic_2 + jc_1}.$$

The initial estimates of $c_1$ and $c_2$ gave $c_1$ as about three times $c_2$, so the weights finally used were $1/(i + 3j)$. The same summations were performed, and estimates $\hat c_1 = 0.446$, $\hat c_2 = 0.111$ were obtained. Using these values for the $c$’s, we then ran 30 stat-dogs and computed the several statistics.

A Markov Model

A simple model for a learning situation is a two-state Markov chain. Such models were discussed by G.A. Miller [4]. Recall that S stands for shock and A for avoidance. The two conditional probabilities, Pr{S|S} and Pr{A|A}, are assumed constant. Pr{X|X} means the probability of X occurring on a trial given that X occurred on the previous trial. We need the initial probability of shock $q_1(S)$, but from the data we see that the appropriate value is 1.00. Furthermore, Solomon and Wynne report that for many trials beyond the 25th, no dogs were shocked. Thus we take Pr{A|A} = 1.00. This leaves us with one parameter, $a$ = Pr{S|S}, to be estimated from the data. It is easily shown that the expected total number of shocks is

$$E(T) = \frac{1}{1-a}.$$

Using $\bar T = 7.8$, we get $\hat a = 0.872$. Further, it can be shown that

$$\operatorname{var}(T) = \frac{a}{(1-a)^2}.$$
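Under these assumptions the first trial is always a shock and each further shock follows with probability $a$, so $T$ is geometric with $\Pr\{T = k\} = a^{k-1}(1-a)$ for $k \ge 1$, which gives the two moments just stated. A small sketch of the arithmetic follows; the function name is illustrative.

```python
from math import sqrt


def markov_moments(a: float):
    """Mean and variance of the total number of shocks when Pr{S|S} = a and Pr{A|A} = 1."""
    mean = 1 / (1 - a)
    variance = a / (1 - a) ** 2
    return mean, variance


a_hat = 1 - 1 / 7.8  # solves 1 / (1 - a) = 7.8
mean, variance = markov_moments(0.872)  # the rounded estimate used in the text
print(round(a_hat, 3), round(mean, 3), round(variance, 1), round(sqrt(variance), 2))
```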

For $a = 0.872$, we have var$(T) = 53.1$, $\sigma(T) = 7.30$. Because Pr{A|A} = 1, as soon as an avoidance occurs, the model predicts that the dogs will continue avoiding forever. Thus, all of the statistics listed in Table 1 can be computed from theory without running stat-dogs. These theoretical results are shown in Table 1.

A Restle Model

F. Restle has described a model for discrimination experiments [7], and a modification of his model might be appropriate for avoidance training. Assume with Restle that the stimulus situation contains $r$ relevant cues and $i$ irrelevant cues. A relevant cue may or may not be conditioned to a response (avoiding shock, in our problem) on trial $n$; the probability that any single relevant cue is conditioned on trial $n$, according to Restle, is

$$c_n = 1 - (1-\theta)^{n-1}.$$

Irrelevant cues become “adapted,” i.e., cease to exist for the subject. The probability that on trial $n$ any irrelevant cue is adapted is

$$a_n = 1 - (1-\theta)^{n-1}.$$

Another of Restle’s main assumptions is that $\theta = r/(r+i)$. In writing down an expression for the response probability on trial $n$, we depart from Restle’s model. We assume that only conditioned cues contribute toward avoidance. Thus, we take

$$p_n = \frac{r c_n}{r + i(1 - a_n)}.$$

From the previous equations, one gets

$$p_n = 1 - \frac{(1-\theta)^{n-1}}{\theta + (1-\theta)^{n}}.$$

The probability of shock is

$$q_n = \frac{(1-\theta)^{n-1}}{\theta + (1-\theta)^{n}}.$$

The expected total number of shocks is

$$E(T) = \sum_{n=1}^{\infty} q_n = \sum_{n=1}^{\infty} \frac{(1-\theta)^{n-1}}{\theta + (1-\theta)^{n}}.$$
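For a trial value of θ, both the sum and an integral approximation to it (the integral of $q_n$ over $n$ from 1/2 to ∞, which has a closed form) can be evaluated numerically. The sketch below uses θ = 0.23, the value adopted in the text. Since the horizon used for the quoted sum is not stated here, the sum is shown both for the 25 experimental trials and for a long horizon; the function names are illustrative.

```python
from math import log, sqrt


def shock_probability(n: int, theta: float) -> float:
    """q_n = (1 - theta)**(n - 1) / (theta + (1 - theta)**n)."""
    return (1 - theta) ** (n - 1) / (theta + (1 - theta) ** n)


def shocks_sum(theta: float, trials: int) -> float:
    """Direct sum of q_n over the first `trials` trials."""
    return sum(shock_probability(n, theta) for n in range(1, trials + 1))


def shocks_integral(theta: float) -> float:
    """Closed form of the integral of q_n over n from 1/2 to infinity."""
    return (log(theta) - log(theta + sqrt(1 - theta))) / ((1 - theta) * log(1 - theta))


theta = 0.23
print(round(shocks_sum(theta, 25), 3),
      round(shocks_sum(theta, 500), 3),
      round(shocks_integral(theta), 3))
```

All three numbers lie within a few hundredths of the observed mean of 7.80, which is the basis for accepting this value of θ.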

By replacing the sum with an integral from 1/2 to ∞, we get the approximation √ log θ − log(θ + 1 − θ) . E(T ) = (1 − θ) log(1 − θ) The right-hand side of this equation is 7.82 when θ = 0.23. The sum is 7.78 when θ = 0.23. Thus, we take θˆ = 0.23. With this value of θ, the values of qn were computed and 30 stat-dogs run. The results are shown in Table 1. A Krechevsky Model For ten years or more, beginning about 1930, there was a controversy between the “continuity” and “noncontinuity” theorists [3]. The latter school (Lashley, Krechevsky) argued that an initial presolution period was followed by sudden learning; the continuity theorists (Hull, Spence) contended that learning occurred from the beginning. One possible formalization of the noncontinuity position is described in this section. Define a random variable  1 if A on trial n by ith animal xin = , 0 if S on trial n by ith animal Then pin = Pr{xin = 1}, and qin = Pr{xin = 0}. Assume that the ith animal is in some “state” S0 at the start of the experiment and at the start of some trial Ni changes to another state S1 and then remains in S1 for the remainder of the experiment. We shall speak of the change of state on trial Ni as the occurrence of an event E. We then postulate that the probability of avoidance obeys the following law:

$$p_{in} = \begin{cases} p & \text{if } i\text{th animal is in state } S_0 \text{ on trial } n, \\ 1 & \text{if } i\text{th animal is in state } S_1 \text{ on trial } n. \end{cases}$$

It is next assumed that event E (change from $S_0$ to $S_1$) occurs with some fixed probability on every trial that the animal is in state $S_0$:

$$\Pr\{S_1 \text{ on } n \mid S_0 \text{ on } n-1\} = \beta.$$

It then follows that the random variable $N_i$ (the trial number of insight or complete learning) has a negative binomial distribution given by

$$\Pr\{N_i = j\} = \beta(1-\beta)^{j-1} \qquad (j = 1, 2, \cdots).$$

These axioms specify a stochastic process which has some simple properties. A group learning curve is obtained by plotting the proportion of animals that have a success on trial n versus n. The ordinate is

$$\bar{x}_n = \frac{1}{I}\sum_{i=1}^{I} x_{in},$$

where I is the number of animals in the group. The theoretical curve is obtained from the expected value of $x_{in}$, which we now compute. From the axioms of the model

$$\Pr\{x_{in} = 1 \mid n < N_i\} = p, \qquad \Pr\{x_{in} = 1 \mid n \ge N_i\} = 1,$$

and

$$\Pr\{N_i \le n\} = \sum_{j=1}^{n} \beta(1-\beta)^{j-1} = 1 - (1-\beta)^{n}, \qquad \Pr\{N_i > n\} = (1-\beta)^{n}.$$

It then follows that

$$E(x_{in}) = 1 - (1-p)(1-\beta)^{n}.$$

Thus, a theoretical group learning curve can be computed in terms of the two model parameters, p and β. It is interesting to note that the learning function just derived can also be obtained from a “Hullian-type” model. Of course, this model makes very different predictions about other aspects of the data.

Denote the total number of shocks made by the ith animal by $T_i$. On trials before event E occurs, $T_i$ has a binomial distribution

$$\Pr\{T_i = k \mid N_i = j+1\} = \binom{j}{k} q^{k} p^{j-k} \qquad (k \le j).$$

We have already seen that $\Pr\{N_i = j+1\} = \beta(1-\beta)^{j}$, and so

$$\Pr\{T_i = k\} = \sum_{j=k}^{\infty} \binom{j}{k} q^{k} p^{j-k} \beta(1-\beta)^{j}.$$

The summation can be carried out to yield

$$\Pr\{T_i = k\} = \gamma(1-\gamma)^{k} \qquad (k = 0, 1, \cdots),$$

where

$$\gamma = \frac{\beta}{1 - p(1-\beta)}.$$

Thus $T_i$ has a negative binomial distribution with parameter γ. It is well known that the expected value and variance of $T_i$ are

$$E(T_i) = \frac{1-\gamma}{\gamma} = q\,\frac{1-\beta}{\beta}, \qquad \operatorname{var}(T_i) = \frac{1-\gamma}{\gamma^{2}} = q\,\frac{1-\beta}{\beta^{2}}\,[1 - (1-q)(1-\beta)]. \tag{3}$$
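The step "the summation can be carried out" is easy to confirm numerically. The sketch below compares the direct sum over j with the closed form γ(1 − γ)^k and evaluates the mean in Equation 3. The function names and the truncation of the infinite sum at j = 400 are assumptions of this sketch, and the parameter values in the usage lines are only illustrative.

```python
from math import comb


def gamma_parameter(p, beta):
    """gamma = beta / (1 - p (1 - beta))."""
    return beta / (1 - p * (1 - beta))


def total_shocks_pmf_by_summation(k, p, beta, max_j=400):
    """Pr{T = k} as the sum over j >= k of C(j, k) q^k p^(j-k) beta (1 - beta)^j, truncated at max_j."""
    q = 1 - p
    return sum(
        comb(j, k) * q**k * p ** (j - k) * beta * (1 - beta) ** j
        for j in range(k, max_j + 1)
    )


def total_shocks_pmf_closed_form(k, p, beta):
    """Pr{T = k} = gamma (1 - gamma)^k."""
    g = gamma_parameter(p, beta)
    return g * (1 - g) ** k


def krechevsky_expected_shocks(p, beta):
    """E(T) = q (1 - beta) / beta, the first part of Equation 3."""
    return (1 - p) * (1 - beta) / beta


if __name__ == "__main__":
    p, beta = 0.3, 0.1  # illustrative values only
    for k in range(5):
        print(k, total_shocks_pmf_by_summation(k, p, beta),
              total_shocks_pmf_closed_form(k, p, beta))
    print("E(T):", krechevsky_expected_shocks(p, beta))
```

Because the terms of the sum decay like (p(1 − β))^j, a few hundred terms are far more than needed for parameter values of the size used in this paper.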

Denote by $L_i$ the trial number of the last shock by the ith animal. We must have $L_i < N_i$ and so we can define a new random variable, $Y_i = N_i - L_i - 1$, which corresponds to the number of trials after trial $L_i$ and before trial $N_i$. If one thinks of the trials occurring in reverse order, it is readily seen that $Y_i$ has a negative binomial distribution

$$\Pr\{Y_i = h\} = q p^{h} \qquad (h = 0, 1, 2, \cdots).$$

We already know that $\Pr\{N_i = j+1+h\} = \beta(1-\beta)^{j+h}$, and so the joint distribution of $Y_i$ and $N_i$ is

$$\Pr\{N_i = j+1+h,\; Y_i = h\} = \beta(1-\beta)^{j+h} q p^{h}.$$

From this we can get the distribution of $L_i$ because

$$\Pr\{L_i = j\} = \sum_{h=0}^{\infty} \Pr\{N_i = j+1+h,\; Y_i = h\} = \sum_{h=0}^{\infty} \beta(1-\beta)^{j+h} q p^{h}.$$

Performing the summation gives

$$\Pr\{L_i = j\} = \frac{q\beta(1-\beta)^{j}}{1 - p(1-\beta)}.$$

This function is not a normalized density function; in fact

$$\sum_{j=1}^{\infty} \Pr\{L_i = j\} = 1 - \frac{\beta}{1 - p(1-\beta)}.$$

This is so because $L_i$ may not have a value; it may happen that no shocks occur. We already know that

$$\Pr\{T_i = 0\} = \gamma = \frac{\beta}{1 - p(1-\beta)}.$$

Therefore, it is more convenient to deal with the conditional probabilities

$$\Pr\{L_i = j \mid T_i \ne 0\} = \frac{\Pr\{L_i = j\}}{\Pr\{T_i \ne 0\}}.$$

We obtain

$$\Pr\{L_i = j \mid T_i \ne 0\} = \beta(1-\beta)^{j-1},$$

which is precisely the distribution of $N_i$. The mean and variance are

$$E(L_i \mid T_i \ne 0) = \frac{1-\beta}{\beta}, \qquad \operatorname{var}(L_i \mid T_i \ne 0) = \frac{1-\beta}{\beta^{2}}. \tag{4}$$

These theoretical values can be compared with empirical values obtained from the values of $L_i$ for animals that obtain at least one shock.

Equations 3 and 4 were used to estimate the two parameters, p and β. Equating expectations and observed means for the Solomon-Wynne data, we have

$$\hat{q}(1-\hat\beta)/\hat\beta = 7.80, \qquad (1-\hat\beta)/\hat\beta = 11.33.$$

Solving for the estimates, we have $\hat\beta = 0.081$, $\hat{p} = 0.312$. These values were used in 30 Monte Carlo computations and the obtained values of the fourteen statistics are given in Table 1.

The process defined by the model is binomial on trials before event E occurs. Thus the distribution of alternations is essentially that for a binomial. In j binomial trials with probabilities p and q of success and failure, respectively, the expected number of runs is well known to be

$$2(j-1)pq + 1 \qquad (j = 1, 2, \cdots).$$

The number of alternations is one less than the number of runs. Avoidance necessarily occurs on trial $N_i$, and if trial $N_i - 1$ is a shock trial, one more alternation occurs. Thus, if $A_i$ is the number of alternations by the ith animal,

$$E(A_i \mid N_i = j \ne 1) = 2(j-2)pq + q.$$

We know that $E(N_i \mid N_i \ne 1) = (1/\beta) + 1$. Thus

$$E(A_i) = 2\,\frac{1-\beta}{\beta}\,pq + q.$$
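The moment estimates quoted above for the Krechevsky model follow from two lines of algebra: Equation 4 gives β from the mean trial of the last shock, and the ratio of Equation 3 to Equation 4 gives q. The Python sketch below carries this out; the function names are hypothetical, and `expected_alternations` simply evaluates the formula for E(A_i) as printed in the text.

```python
def krechevsky_moment_estimates(mean_total_shocks, mean_last_shock_trial):
    """Solve q(1 - b)/b = mean_total_shocks and (1 - b)/b = mean_last_shock_trial for (p, b)."""
    # Equation 4: (1 - beta) / beta = m_L, so beta = 1 / (1 + m_L).
    beta_hat = 1 / (1 + mean_last_shock_trial)
    # Equation 3 divided by Equation 4: q = m_T / m_L.
    q_hat = mean_total_shocks / mean_last_shock_trial
    return 1 - q_hat, beta_hat


def expected_alternations(p, beta):
    """E(A) = 2 ((1 - beta) / beta) p q + q, as printed in the text."""
    q = 1 - p
    return 2 * (1 - beta) / beta * p * q + q


if __name__ == "__main__":
    # Observed means for the Solomon-Wynne data, as quoted in the text.
    p_hat, beta_hat = krechevsky_moment_estimates(7.80, 11.33)
    print("p_hat =", round(p_hat, 3), "beta_hat =", round(beta_hat, 3))
    print("E(A) at the estimates:", expected_alternations(p_hat, beta_hat))
```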

Discussion

A study of Table 1 reveals a great deal about the relative weaknesses of the eight models. It appears that the Markov model is the least satisfactory; the first and second avoidances occur too late, the last shock occurs too soon, and there are far too few alternations. This is no surprise in view of the design of the model: all learning occurs on a single random trial. For a similar reason, the Krechevsky model is quite unsatisfactory. With this model, the first and second avoidances occur too soon, the last shock occurs too soon, and there are too few alternations. Therefore, we can conclude that neither of these “discontinuity” models is adequate.

TABLE 1 Comparisons of the Eight Models with the Dog Data

| Statistic | | Dog data | Two-operator linear | Hullian | Hullian with individual differences | Early Thurstone | Late Thurstone | Markov | Restle | Krechevsky |
|---|---|---|---|---|---|---|---|---|---|---|
| Trials before first avoidance | Mn | 4.50 | 4.13 | 3.17 | 3.57 | 3.13 | 4.10 | 7.80 | 4.40 | 1.77 |
| | SD | 2.25 | 2.08 | 1.79 | 1.81 | 1.17 | 2.91 | 7.30 | 1.73 | 2.11 |
| Trials before second avoidance | Mn | 6.47 | 6.20 | 5.03 | 5.33 | 4.87 | 6.53 | 8.80 | 6.37 | 4.03 |
| | SD | 2.62 | 2.06 | 1.82 | 1.71 | 1.28 | 2.74 | 7.30 | 1.83 | 2.52 |
| Total number of shocks | Mn | 7.80 | 7.60 | 7.57 | 7.50 | 8.50 | 8.67 | 7.80 | 7.73 | 6.07 |
| | SD | 2.52 | 2.27 | 1.73 | 1.62 | 2.23 | 2.80 | 7.30 | 1.76 | 5.37 |
| Trials before last shock | Mn | 11.33 | 12.53 | 17.57 | 15.80 | 19.97 | 18.97 | 6.80 | 12.97 | 7.73 |
| | SD | 4.36 | 4.78 | 4.09 | 5.36 | 4.70 | 5.11 | 7.30 | 4.60 | 7.46 |
| Number of alternations | Mn | 5.47 | 5.87 | 7.40 | 7.10 | 9.17 | 7.37 | 1.00 | 5.87 | 3.87 |
| | SD | 2.72 | 2.11 | 2.04 | 2.94 | 3.28 | 2.62 | 0.00 | 2.56 | 3.34 |
| Length of longest run of shocks | Mn | 4.73 | 4.33 | 3.53 | 3.70 | 3.40 | 4.44 | 7.80 | 4.47 | 3.43 |
| | SD | 2.03 | 1.89 | 1.57 | 1.69 | 1.04 | 2.73 | 7.30 | 1.59 | 2.45 |
| Trials before first run of 4 avoidances | Mn | 9.70 | 9.47 | 7.83 | 10.03 | 10.13 | 9.57 | 7.80 | 10.03 | 7.50 |
| | SD | 4.14 | 3.48 | 3.00 | 4.20 | 5.78 | 3.59 | 7.30 | 3.34 | 6.96 |

The Hullian model has several weaknesses: the first and second avoidances occur too soon, the last shock occurs too late, and there are too many alternations. All of the SD’s are too small. Similar weaknesses are found in the Hullian model with individual differences, although it is a decided improvement over the simpler Hullian model. Like the Hullian models, the early Thurstone model predicts that avoidances occur too soon, that the last shock occurs too late, and that there are too many alternations. It is evident that a satisfactory model must have some mechanism for delaying the early avoidances and at the same time speeding


up the occurrence of the last shock. The late Thurstone model accomplishes the former but fails at the latter; in this model learning does not occur rapidly enough near the end.

The two models that seem most satisfactory are the two-operator linear model and the Restle model. Both models predict about the correct means on all seven properties. However, both predict SD’s that are somewhat small on all properties except the number of trials before the last shock. The two-operator model and the Restle model are both sufficiently close that the data of Table 1 do not allow one to choose between them. Thus, we ask if there exists a statistic which is more sensitive to the differences between those models. One major difference between them is that the Restle model assumes that all organisms have a single value of $q_n$ on trial n, whereas the two-operator model generates a distribution of $q_n$ for n > 1. We now investigate one consequence of this difference.

The estimates of $\alpha_1$ and $\alpha_2$ in the two-operator model show that shock has a smaller effect than avoidance. Thus, those dogs that receive five shocks on the first five trials, for example, have a higher probability of shock on the sixth trial than any dog that receives fewer than five shocks on the first five trials. Therefore, the total number of shocks received after the fifth trial should be greater for those dogs that are shocked on each of the first five trials. The Solomon-Wynne data show that 13 dogs obtained shock on all of the first five trials and that these dogs obtained a mean of 4.77 shocks thereafter; the other 17 dogs obtained a mean of 2.70 shocks on trials beyond the fifth. This difference is significant, as a Mann-Whitney test shows (P < .01). The Restle model, like all models that have a single $q_n$ for all organisms on trial n, predicts that no such difference will occur.
Conclusion

From the discussion in the previous section it is clear that among the models presented the two-operator model described the data best. A critic might point out that the game played in this paper is easily rigged by the authors to favor their own model. Though one reply to such a critic would be that the authors have not suppressed any models, that reply would be entirely out of the spirit of this paper. We freely admit that several of the models can be considerably improved, though at some expense in mathematical and statistical research. (Parameter estimation is the main difficulty.) What we are trying to do here is to present a sequence of models each of which has some relation to past psychological thinking about learning, and to discover just where our versions of these models are weak in describing one experiment. A person with a special bent toward one of the psychological theories represented here can take our formulation as a first approximation and see wherein it was weak. Such a step may suggest to him directions in which the model needs improvement and may indicate those aspects of the more general psychological theory that have been especially neglected in our formulation.


The work in this paper might be regarded as mathematical experimentation, as distinguished from laboratory experimentation. The former is needed to get some feeling for the variety of models there can be, and to prevent premature acceptance of any one model just because it has been worked on and others have not. The authors feel that in the past too little attention has been given to the ability of various models to reproduce the fine structure of a set of data, and they present this paper as an illustration of one kind of study that is needed.

References

1. Bush, R.R., and Mosteller, F. Stochastic models for learning. New York: Wiley, 1955.
2. Hull, C.L. Principles of behavior. New York: Appleton-Century-Crofts, 1943. Chap. 8.
3. Krechevsky, I. A study of the continuity of the problem-solving process. Psychol. Rev., 1938, 45, 107–33.
4. Miller, G.A. Finite Markov processes in psychology. Psychometrika, 1952, 17, 149–67.
5. Mood, A.M. Introduction to the theory of statistics. New York: McGraw-Hill, 1950. p. 115.
6. Pearson, K. Tables of the incomplete beta function. London: Cambridge University Press, 1932.
7. Restle, F. A theory of discrimination learning. Psychol. Rev., 1955, 62, 11–19.
8. Solomon, R.L., and Wynne, L.C. Traumatic avoidance learning: acquisition in normal dogs. Psychol. Monogr., 1953, 67, No. 4 (whole No. 354).
9. Thurstone, L.L. The learning curve equation. Psychol. Monogr., 1919, 26, No. 3 (whole No. 114).
10. Thurstone, L.L. The learning function. J. gen. Psychol., 1930, 3, 469–91.

Reprinted from The Mathematics Teacher (1961), 54, pp. 411-412

17. Optimal Length of Play for a Binomial Game

Frederick Mosteller
Harvard University, Cambridge, Massachusetts

In a game where A plays against B, how many trials should A choose to play to maximize his chance of success?

A game between players A and B consists of N (= 2n) independent trials. One point is awarded on each trial: to player A with probability p < 1/2, or to player B with probability q (= 1 − p). If player A wins more than half the points, he wins a prize. How shall he choose N to maximize his chance of success?¹ Player A knows the value of p.

At first blush, most people notice that the game is unfair, and therefore as N increases, the expected value of the difference (A’s points − B’s points) grows more and more negative. They conclude that A should play as little as he can and still win, that is, two trials. Had an odd number of trials been allowed, this reasoning based on expected values would have been correct, and A should play only one trial (and this note would not have been written). But with an even number of trials there are two effects at work: (1) the bias in favor of B, and, opposing that at first, (2) the redistribution of the probability in the middle term of the binomial distribution (the probability of a tie) as the number of trials increases.

Consider, for a moment, a fair game (p = 1/2). Then the larger N, the larger A’s chance to win, because as 2n increases, the probability of a tie tends to zero, and in the limit A’s chance to win is 1/2. For N = 2, 4, 6, his probabilities are 1/4, 5/16, 22/64. Continuity suggests that for p slightly less than 1/2, A should play a large but finite number of games. But if p is small, N = 2 should be optimal for A. It turns out that for p < 1/3, N = 2 is optimal.

¹ P.G. Fox originally alluded to a result which gives rise to this game in “A Primer for Chumps,” which appeared in the Saturday Evening Post, November 21, 1959, and discussed the idea further in private correspondence arising from that article in a note entitled “A Curiosity in the Binomial Expansion—and a Lesson in Logic.” I am indebted to Clayton Rawson and John Scarne for alerting me to Fox’s paper.


An approximation

Let us use the usual normal approximation to estimate the optimal value of N. Let the random variable X be A’s number of points in N = 2n trials. We wish to maximize P(X ≥ n + 1). Let

$$z = \frac{(n+1) - 2np - \tfrac{1}{2}}{\sqrt{2npq}},$$

where the 1/2 is the usual continuity correction. Then if Z is a standard normal random variable (zero mean and unit variance), P(Z > z) approximates P(X > n). The smaller z, the larger P(Z > z), so we wish to find the value of n that minimizes z for a fixed p < 1/2. Standard calculus methods yield for the value of N that minimizes z

$$N = 2n = \frac{1}{1 - 2p}.$$

For example, if p = 0.49, N = 50, and, wonder of wonders, this value of N is not just approximate, but is exactly the one that maximizes P(X ≥ n + 1). Can such good fortune continue? Consider p = 1/3; then the estimated value of N is 3. It turns out that the surrounding even values of N, 2 and 4, give identical probabilities of a win for A, namely 1/9; and this 1/9 is the optimal probability that A can achieve! These results, and others like them, suggest that the nearest even integer to 1/(1 − 2p) is the optimal value of N, unless 1/(1 − 2p) is an odd integer, and that then both neighboring integers are optimal.

To assist in the proof of this conjecture, we let $P_{2n}$ be the probability that X ≥ n + 1 in a game of 2n trials:

$$P_{2n} = \sum_{x=n+1}^{2n} \binom{2n}{x} p^{x} q^{2n-x}.$$

In a game of 2n + 2 trials P(X ≥ n + 2) is

$$P_{2n+2} = \sum_{x=n+2}^{2n+2} \binom{2n+2}{x} p^{x} q^{2n+2-x}.$$

A game composed of 2n + 2 trials can be regarded as having been created by adding two trials to a game of 2n trials. Unless player A has won either n or n + 1 times in the 2n game, his status as a winner or loser cannot differ in the 2n + 2 game from that in the 2n game. Except for these two possibilities, $P_{2n+2}$ would be identical with $P_{2n}$. These exceptions are (1) that having n + 1 successes in the first 2n trials, A loses the next two, thus reducing his probability of winning in the 2n + 2 game by

$$q^{2}\binom{2n}{n+1} p^{n+1} q^{n-1};$$

or (2) that having won n trials in the 2n game he wins the next two, increasing his probability by

$$p^{2}\binom{2n}{n} p^{n} q^{n}.$$

If N = 2n is the optimal value, then both $P_{N-2} \le P_N$ and $P_N \ge P_{N+2}$ must hold. The results of the previous paragraph imply that these inequalities are equivalent to

$$q^{2}\binom{2n-2}{n} p^{n} q^{n-2} \le p^{2}\binom{2n-2}{n-1} p^{n-1} q^{n-1}; \qquad q^{2}\binom{2n}{n+1} p^{n+1} q^{n-1} \ge p^{2}\binom{2n}{n} p^{n} q^{n} \tag{1}$$

or, after some simplifications (we exclude the trivial case p = 0),

$$(n-1)q \le np; \qquad nq \ge (n+1)p. \tag{2}$$

These inequalities yield, after a little algebra, the condition

$$\frac{1}{1-2p} - 1 \le 2n \le \frac{1}{1-2p} + 1. \tag{3}$$

Thus unless 1/(1 − 2p) is an odd integer, N is uniquely determined as the nearest even integer to 1/(1 − 2p). When 1/(1 − 2p) is an odd integer, both adjacent even integers give the same optimal probability. And we can incidentally prove that when 1/(1 − 2p) = 2n + 1, $P_{2n} = P_{2n+2}$.

Generalization

I.R. Savage suggested to the author that the game be generalized. Again with the probability p (< 1/2) of winning a single point, suppose that to win the game of N trials, one must have r more points than one’s opponent. Again we want to find the value of N (unrestricted) that maximizes the probability of winning. The normal approximation suggests that the optimum value of N ≈ (r − 1)/(1 − 2p). Suitably interpreted, this approximation is satisfactory as an exact solution. The methods used above in the exact solution for r = 2 and N an even integer can be extended. It turns out after some algebra that for odd values of r, the optimum value of N is the positive odd integer nearest (r − 1)/(1 − 2p), while for r an even integer, the optimum value of N is the positive even integer nearest (r − 1)/(1 − 2p).

The distinction between odd and even values of r can matter substantially. For example, let r = 3, p = 0.4; then direct application of the approximate formula gives N = 10, with probability 0.05476 of winning by at least 3 points, which for an even value of N means by at least 4 points. On the other hand, for N = 9 or 11, the probability of winning by at least 3 points (a win by exactly 3 points is now achievable) is 0.09935. Furthermore, the value N = 10 is not the best even value of N to choose, because then winning by 3 implies winning by 4, and the optimum N for r = 4 is N = 14 (or 16), with probability of winning 0.05832.
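The exact results in this note can be checked by brute force. The following Python sketch computes the probability that A finishes at least r points ahead in N trials, using exact rational arithmetic, and lists the lengths of play that attain the maximum over a finite set of candidate values. The function names, the default search range, and the use of `fractions.Fraction` are choices made for this sketch rather than anything taken from the note.

```python
from fractions import Fraction
from math import comb


def win_probability(num_trials, p, margin=2):
    """P(A's points - B's points >= margin) in num_trials independent trials."""
    q = 1 - p
    # X - (N - X) >= margin is the same as X >= ceil((N + margin) / 2).
    needed = (num_trials + margin + 1) // 2
    return sum(
        comb(num_trials, x) * p**x * q ** (num_trials - x)
        for x in range(needed, num_trials + 1)
    )


def best_lengths(p, margin=2, candidates=range(2, 121, 2)):
    """Return every candidate N that attains the largest win probability."""
    probabilities = {n: win_probability(n, p, margin) for n in candidates}
    best = max(probabilities.values())
    return [n for n, value in probabilities.items() if value == best]


if __name__ == "__main__":
    print(best_lengths(Fraction(49, 100)))   # the original game, p = 0.49
    print(best_lengths(Fraction(1, 3)))      # 1/(1 - 2p) is the odd integer 3
    print(best_lengths(Fraction(2, 5), margin=3, candidates=range(1, 60, 2)))
    print(float(win_probability(10, Fraction(2, 5), margin=3)))
```

Passing p as a `Fraction` keeps ties exact, which matters when 1/(1 − 2p) is an odd integer and two neighboring values of N are equally good; with floating-point p the two probabilities could differ in the last digit.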

Reprinted from Biometrika (1961), 48, pp. 433–440

18. Tables of the Freeman-Tukey Transformations for the Binomial and Poisson Distributions

Frederick Mosteller and Cleo Youtz
Harvard University

Summary

We present a table of the Freeman-Tukey variance stabilizing arc-sine transformation for the binomial distribution together with properties of the transformation. Entries in the table are

$$\theta = \frac{1}{2}\left[\arcsin\sqrt{\frac{x}{n+1}} + \arcsin\sqrt{\frac{x+1}{n+1}}\right],$$

where n is the sample size and x is the number of successes observed in a binomial experiment. Values of θ are given in degrees, to two decimal places, for n = 1[1]50 and x = 0[1]n.

In addition, for completeness, we give a table of the corresponding square-root transformation to two decimal places for use with Poisson counts. The observed count is x (x = 0[1]50) and the transformed values are

$$g = \sqrt{x} + \sqrt{x+1};$$

the squares of the transformed values are also given for use in analysis of variance computations.

1. Introduction

Transformations are often used in the analysis of data to improve linearity of regression, to improve normality of distribution, and to stabilize approximately the variance, when it might otherwise depend strongly upon a parameter. Freeman & Tukey (1950) introduced the transformations tabled here to stabilize the variance of binomial and of Poisson counts. (A good account of variance-stabilizing transformations is given by Eisenhart (1947).) The table may also be helpful in using methods developed by Gupta & Sobel (1958) to select a subset of populations better than a standard.
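For readers who want values outside the tabled range, the two transformations defined in the Summary are straightforward to compute directly. The sketch below is one way to do so in Python; the function names are hypothetical, and the arc-sine transform is returned in degrees to match the tables.

```python
import math


def freeman_tukey_arcsine(x, n):
    """Averaged angular transformation of x successes in n trials, in degrees."""
    first = math.asin(math.sqrt(x / (n + 1)))
    second = math.asin(math.sqrt((x + 1) / (n + 1)))
    return math.degrees(0.5 * (first + second))


def freeman_tukey_sqrt(x):
    """Square-root transformation g = sqrt(x) + sqrt(x + 1) for a Poisson count x."""
    return math.sqrt(x) + math.sqrt(x + 1)


if __name__ == "__main__":
    for x in range(5):
        print(x, round(freeman_tukey_arcsine(x, 4), 2), round(freeman_tukey_sqrt(x), 2))
```

The arc-sine values for x and n − x always add to 90 degrees, which is why a table need only list about half of the entries for each n.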

This work was facilitated by a grant from The Ford Foundation and by the Laboratory of Social Relations, Harvard University.


In using the Freeman-Tukey transformations, we found need for tables to speed the work, and they are presented here. The Freeman-Tukey transformation for the binomial number of successes x observed in n independent trials is the averaged angular transformation

$$\theta = \frac{1}{2}\left[\arcsin\sqrt{\frac{x}{n+1}} + \arcsin\sqrt{\frac{x+1}{n+1}}\right]. \tag{1}$$

Table 5 gives values of θ in degrees to two decimals for n = 1[1]50 and x = 0[1]n. When θ is measured in degrees, it has variance $\sigma_\theta^2$ tending to the asymptotic variance

$$\sigma_{\theta\infty}^2 = \frac{821}{n + \tfrac{1}{2}} \quad \text{(squared degrees)}, \tag{2}$$

for a substantial range of p if n is not too small. We tabled the function in degrees rather than radians because we wanted to be able to follow from our table into the arc-sine table by Bliss (1946) for n’s larger than 50. We did not consider making the correspondence with the arc-sine table by Stevens (1953) because we had already computed our table before we were aware of his.

For Poisson counts, the corresponding transformation is

$$g = \sqrt{x} + \sqrt{x+1}, \tag{3}$$

with variance approximately 1, provided the Poisson parameter exceeds 1. Table 6 gives values of g and of $g^2$ for x = 0[1]50. We found that the table of $g^2$ saved a little time in desk calculations.

2. Some properties of the transformations

The arc-sine transformation. For selected numbers of trials, n, Table 1 shows, approximately, the value of the probability of success on a single trial p (and of 1 − p) at which the maximum of $\sigma_\theta^2$ occurs, the values of $\sigma_\theta^2$ at the maximum and of $\sigma_{\theta\infty}^2$, and the ratio $\sigma_\theta^2/\sigma_{\theta\infty}^2$. When $\sigma_\theta^2$ is plotted against p, as n increases, the positions of the maximum variance move toward the extremes, p = 0 and p = 1. The value of p that maximizes the variance is approximately a linear function of 1/n for large n, see Fig. 1. This is explained below in the discussion of the square-root transformation for the Poisson.

Fig. 2 shows $\sigma_\theta^2/\sigma_{\theta\infty}^2$ plotted against p for n = 1, 3, 5, 10. At n = 5, the curve is beginning to assume its characteristic shape, though the ears (the shapes near the left and right maxima) are quite flat as yet. For n = 5, the nearly flat portion in the interval 0.3 ≤ p ≤ 0.7 appears to have a relative minimum at p = 1/2. One might suppose from the plots of variances of other arc-sine transformations, such as given by Eisenhart (1947) and from those of Fig. 2, that a relative minimum for the variance would occur at p = 1/2 for large values

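Table 1. Value of p that maximizes $\sigma_\theta^2$ for the arc-sine transformation, maximum $\sigma_\theta^2$, $\sigma_{\theta\infty}^2$, and maximum $\sigma_\theta^2/\sigma_{\theta\infty}^2$

| Sample size n | Maximizing value of p | Maximizing value of 1 − p | Max $\sigma_\theta^2$ | $\sigma_{\theta\infty}^2 = 821/(n + \tfrac12)$ | Max $\sigma_\theta^2/\sigma_{\theta\infty}^2$ |
|---|---|---|---|---|---|
| 1 | 0.5 | - | 506.25 | 547.33 | 0.925 |
| 2 | 0.5 | - | 374.50 | 328.40 | 1.140 |
| 3 | 0.5 | - | 267.19 | 234.57 | 1.139 |
| 5 | 0.344 | 0.656 | 157.42 | 149.27 | 1.055 |
| 10 | 0.181 | 0.819 | 82.08 | 78.19 | 1.050 |
| 20 | 0.097 | 0.903 | 42.20 | 40.05 | 1.054 |
| 30 | 0.066 | 0.934 | 28.40 | 26.92 | 1.055 |
| 50 | 0.040 | 0.960 | 17.17 | 16.26 | 1.056 |
| Poisson | | | | | 1.061 |

Fig. 1. Values of p (< 1/2) for maximum $\sigma_\theta^2$ plotted against 1/n, n ≥ 5.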

of n. Our impression from calculations is that the Freeman-Tukey transformation has a relative maximum at p = 1/2 for n = 20 and n = 50, where this question was investigated, though the curve of variances is quite flat for a long interval about p = 1/2. Table 2 gives values of $\sigma_\theta^2$ at p = 1/2 for various values of n.

Table 3 shows both $\sigma_\theta^2$ and $\sigma_{\theta\infty}^2$ for n = 50 and for various values of p. The values of p are chosen to illustrate the behaviour to the left and right of the high maximum near p = 0.040. There appears to be a relative minimum

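Fig. 2. Plot of $\sigma_\theta^2/\sigma_{\theta\infty}^2$ against p for n = 1, 3, 5, 10.

Table 2. Values of $\sigma_\theta^2$ and of $\sigma_\theta^2/\sigma_{\theta\infty}^2$ for p = 1/2

| Sample size n | $\sigma_\theta^2$ | $\sigma_\theta^2/\sigma_{\theta\infty}^2$ |
|---|---|---|
| 1 | 506.25 | 0.925 |
| 2 | 374.50 | 1.140 |
| 3 | 267.19 | 1.139 |
| 4 | 198.98 | 1.091 |
| 5 | 156.02 | 1.045 |
| 10 | 76.05 | 0.973 |
| 20 | 39.20 | 0.979 |
| 30 | 26.51 | 0.985 |
| 50 | 16.10 | 0.990 |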

between p = 0.15 and p = 0.20. In the interval 0.07 ≤ p ≤ 0.93, $\sigma_\theta^2$ is within 2% of $\sigma_{\theta\infty}^2$. In the interval 0.02 ≤ p ≤ 0.98, $\sigma_\theta^2$ is within 6% of $\sigma_{\theta\infty}^2$.

An alternative arc-sine transformation for the binomial distribution is

Table 3. $\sigma_\theta^2$ and $\sigma_\theta^2/\sigma_{\theta\infty}^2$ for n = 50 for various values of p

| p | $\sigma_\theta^2$ | $\sigma_\theta^2/\sigma_{\theta\infty}^2$ |
|---|---|---|
| 0.01 | 10.94 | 0.673 |
| 0.02 | 15.34 | 0.943 |
| 0.03 | 16.85 | 1.037 |
| 0.039 | 17.17 | 1.056 |
| 0.040 | 17.17 | 1.056 |
| 0.041 | 17.17 | 1.056 |
| 0.05 | 17.05 | 1.049 |
| 0.06 | 16.81 | 1.034 |
| 0.07 | 16.57 | 1.019 |
| 0.08 | 16.38 | 1.008 |
| 0.09 | 16.24 | 0.999 |
| 0.10 | 16.15 | 0.994 |
| 0.11 | 16.09 | 0.990 |
| 0.12 | 16.06 | 0.988 |
| 0.13 | 16.04 | 0.986 |
| 0.14 | 16.03 | 0.986 |
| 0.15 | 16.02 | 0.985 |
| 0.16 | 16.02 | 0.985 |
| 0.20 | 16.04 | 0.986 |
| 0.30 | 16.07 | 0.989 |
| 0.40 | 16.09 | 0.990 |
| 0.50 | 16.10 | 0.990 |

$$\theta' = \begin{cases} \arcsin\sqrt{\dfrac{1}{4n}} & (x = 0), \\ \arcsin\sqrt{\dfrac{x}{n}} & (1 \le x \le n-1), \\ 90^\circ - \arcsin\sqrt{\dfrac{1}{4n}} & (x = n). \end{cases}$$

For n = 50, Fig. 3 relates $\sigma_\theta^2/\sigma_{\theta\infty}^2$ to p and $\sigma_{\theta'}^2/\sigma_{\theta'\infty}^2$ to p (where $\sigma_{\theta'\infty}^2 = 821/n$). Only the left half of the curve is plotted (0 ≤ p ≤ 0.5). The variance of the Freeman-Tukey transformation is flatter over a longer region, as of course it is intended to be. Table 4 gives some numerical values of the ratio $\sigma_{\theta'}^2/\sigma_{\theta'\infty}^2$ for n = 50.

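Table 4. $\sigma_{\theta'}^2$ and $\sigma_{\theta'}^2/\sigma_{\theta'\infty}^2$ for n = 50 for various values of p

| p | $\sigma_{\theta'}^2$ | $\sigma_{\theta'}^2/\sigma_{\theta'\infty}^2$ |
|---|---|---|
| 0.1 | 17.91 | 1.091 |
| 0.2 | 17.11 | 1.042 |
| 0.3 | 16.87 | 1.027 |
| 0.4 | 16.78 | 1.022 |
| 0.5 | 16.76 | 1.020 |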

In Fig. 3, the reader will observe that over the long flat region the variance ratio curve for θ falls below 1. If one wished a better match between $\sigma_\theta^2$ and $\sigma_{\theta\infty}^2$ over this interval, he might use $\sigma_{\theta\infty}^2 = 821/(n+1)$ rather than $821/(n+\tfrac12)$ for 10 ≤ n ≤ 50. If he does, then a larger percentage error is committed for p’s in the neighbourhood of the ears.

The square-root transformation. Table 6 gives the square-root transformation for the Poisson distribution. If λ is the parameter (mean or expectation) of the Poisson, then for λ = 0, $\sigma_g^2 = 0$, and as λ increases, $\sigma_g^2$ increases to a maximum of about 1.061 in the neighbourhood of λ = 2.1. Thereafter $\sigma_g^2$

Fig. 3. Graph relating $\sigma_\theta^2/\sigma_{\theta\infty}^2$ to p and relating $\sigma_{\theta'}^2/\sigma_{\theta'\infty}^2$ to p for n = 50.

decreases, apparently never falling below 1. A graph of $\sigma_g^2$ is given by Freeman and Tukey (1950). For λ ≥ 1, $\sigma_g^2$ is within 6% of its asymptotic value 1.

When a binomial p is small and n is large, the binomial distribution is approximated by the Poisson with parameter λ = np. Furthermore, when θ is measured in radians, for small values of p, we have $\arcsin\sqrt{p} \approx \sqrt{p}$. Thus the square-root transformation for the Poisson and the arc-sine transformation are essentially equivalent in distribution provided that p and (x + 1)/(n + 1) are small.

We can expect therefore, even for large values of n, that the ears of plots of the variance against p for the arc-sine transformation do not vanish, but that the maximum value of $\sigma_\theta^2/\sigma_{\theta\infty}^2$ exceeds 1 by about 6%, and that this maximum occurs near the value of p where np = 2.1. For n = 50, we found the maximizing value of p to be 0.040 (Table 1), as compared with the Poisson estimate 0.042. This explains the approximately linear relation shown in Fig. 1 between the maximizing value of p and 1/n.

The behaviour of the ears (their persistence) with increasing values of n is reminiscent of Gibbs’s phenomenon when a Fourier series is fitted to a discontinuous function. Perhaps the analogy is not far-fetched since the transformation is designed to fit a continuous function with ordinates $y(p) = \sigma_\theta^2/\sigma_{\theta\infty}^2$ and end-points (0,0) and (1,0) to the function f(p) = 1 (0 < p < 1).
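The variance figures discussed in this section can be recomputed by summing over the binomial or Poisson distribution directly. The Python sketch below does this under stated assumptions: the function names are hypothetical, the Poisson sum is truncated at a count of 200, and the constant 821 is taken from Equation 2 as printed.

```python
import math


def ft_arcsine_degrees(x, n):
    """Freeman-Tukey averaged angular transformation, in degrees."""
    return math.degrees(0.5 * (math.asin(math.sqrt(x / (n + 1)))
                               + math.asin(math.sqrt((x + 1) / (n + 1)))))


def ft_arcsine_variance(n, p):
    """Exact variance of the transformed value when x is binomial(n, p)."""
    weights = [math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]
    values = [ft_arcsine_degrees(x, n) for x in range(n + 1)]
    mean = sum(w * v for w, v in zip(weights, values))
    return sum(w * (v - mean) ** 2 for w, v in zip(weights, values))


def asymptotic_variance(n):
    """821 / (n + 1/2), in squared degrees."""
    return 821 / (n + 0.5)


def ft_sqrt_variance(lam, max_count=200):
    """Variance of g = sqrt(x) + sqrt(x + 1) when x is Poisson(lam), truncated at max_count."""
    prob = math.exp(-lam)  # Pr{x = 0}
    mean = second = 0.0
    for x in range(max_count + 1):
        g = math.sqrt(x) + math.sqrt(x + 1)
        mean += prob * g
        second += prob * g * g
        prob *= lam / (x + 1)  # step to Pr{x + 1}
    return second - mean**2


if __name__ == "__main__":
    for n in (1, 2, 5, 50):
        v = ft_arcsine_variance(n, 0.5)
        print(n, round(v, 2), round(v / asymptotic_variance(n), 3))
    print("Poisson, lambda = 2.1:", round(ft_sqrt_variance(2.1), 3))
```

Scanning `ft_arcsine_variance(n, p)` over a grid of p values is one way to locate the ears described in the text, and scanning `ft_sqrt_variance` over λ does the same for the Poisson case.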


Table 5. Table of Freeman-Tukey arc-sine transformation for binomial proportions*

The table entries are

$$\theta = \frac{1}{2}\left[\arcsin\sqrt{\frac{x}{n+1}} + \arcsin\sqrt{\frac{x+1}{n+1}}\right],$$

where x = number of successes observed, n = sample size.

| x/n | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.50 | 17.63 | 15.00 | 13.28 | 12.05 | 11.10 | 10.35 | 9.74 | 9.22 | 8.77 |
| 1 | 67.50 | 45.00 | 37.50 | 32.90 | 29.68 | 27.26 | 25.35 | 23.80 | 22.50 | 21.39 |
| 2 | - | 72.37 | 52.50 | 45.00 | 40.13 | 36.60 | 33.88 | 31.69 | 29.89 | 28.36 |
| 3 | - | - | 75.00 | 57.10 | 49.87 | 45.00 | 41.38 | 38.54 | 36.22 | 34.28 |
| 4 | - | - | - | 76.72 | 60.32 | 53.40 | 48.62 | 45.00 | 42.12 | 39.74 |
| 5 | - | - | - | - | 77.95 | 62.74 | 56.12 | 51.46 | 47.88 | 45.00 |

| x/n | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.39 | 8.05 | 7.75 | 7.48 | 7.24 | 7.02 | 6.82 | 6.63 | 6.46 | 6.30 |
| 1 | 20.44 | 19.60 | 18.85 | 18.19 | 17.59 | 17.05 | 16.55 | 16.10 | 15.68 | 15.29 |
| 2 | 27.05 | 25.90 | 24.89 | 23.99 | 23.18 | 22.45 | 21.78 | 21.17 | 20.61 | 20.09 |
| 3 | 32.63 | 31.20 | 29.94 | 28.83 | 27.83 | 26.93 | 26.11 | 25.36 | 24.68 | 24.04 |
| 4 | 37.73 | 36.01 | 34.51 | 33.18 | 31.99 | 30.93 | 29.97 | 29.09 | 28.28 | 27.54 |
| 5 | 42.60 | 40.56 | 38.80 | 37.25 | 35.87 | 34.64 | 33.54 | 32.53 | 31.61 | 30.76 |
| 6 | 47.40 | 45.00 | 42.95 | 41.16 | 39.59 | 38.18 | 36.92 | 35.78 | 34.74 | 33.79 |
| 7 | 52.27 | 49.44 | 47.05 | 45.00 | 43.20 | 41.62 | 40.20 | 38.91 | 37.75 | 36.69 |
| 8 | 57.37 | 53.99 | 51.20 | 48.84 | 46.80 | 45.00 | 43.41 | 41.97 | 40.68 | 39.50 |
| 9 | 62.95 | 58.80 | 55.49 | 52.75 | 50.41 | 48.38 | 46.59 | 45.00 | 43.57 | 42.26 |
| 10 | 69.56 | 64.10 | 60.06 | 56.82 | 54.13 | 51.82 | 49.80 | 48.03 | 46.43 | 45.00 |

| x/n | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.15 | 6.02 | 5.89 | 5.77 | 5.65 | 5.55 | 5.45 | 5.35 | 5.26 | 5.17 |
| 1 | 14.93 | 14.59 | 14.28 | 13.98 | 13.71 | 13.44 | 13.20 | 12.96 | 12.74 | 12.53 |
| 2 | 19.61 | 19.16 | 18.74 | 18.35 | 17.98 | 17.63 | 17.30 | 16.99 | 16.70 | 16.42 |
| 3 | 23.46 | 22.91 | 22.40 | 21.92 | 21.48 | 21.05 | 20.66 | 20.28 | 19.93 | 19.59 |
| 4 | 26.86 | 26.22 | 25.63 | 25.07 | 24.55 | 24.06 | 23.60 | 23.17 | 22.76 | 22.37 |
| 5 | 29.98 | 29.25 | 28.58 | 27.95 | 27.36 | 26.81 | 26.29 | 25.79 | 25.33 | 24.89 |
| 6 | 32.91 | 32.10 | 31.34 | 30.64 | 29.98 | 29.37 | 28.79 | 28.24 | 27.72 | 27.24 |
| 7 | 35.71 | 34.81 | 33.98 | 33.20 | 32.47 | 31.79 | 31.16 | 30.55 | 29.99 | 29.45 |
| 8 | 38.42 | 37.43 | 36.51 | 35.66 | 34.86 | 34.12 | 33.42 | 32.77 | 32.15 | 31.57 |
| 9 | 41.08 | 39.99 | 38.98 | 38.05 | 37.18 | 36.38 | 35.62 | 34.91 | 34.24 | 33.61 |
| 10 | 43.70 | 42.50 | 41.41 | 40.39 | 39.45 | 38.58 | 37.76 | 36.99 | 36.27 | 35.58 |
| 11 | 46.30 | 45.00 | 43.80 | 42.70 | 41.68 | 40.74 | 39.85 | 39.03 | 38.25 | 37.52 |
| 12 | 48.92 | 47.50 | 46.20 | 45.00 | 43.90 | 42.87 | 41.92 | 41.03 | 40.20 | 39.42 |
| 13 | 51.58 | 50.01 | 48.59 | 47.30 | 46.10 | 45.00 | 43.98 | 43.02 | 42.13 | 41.29 |
| 14 | 54.29 | 52.57 | 51.02 | 49.61 | 48.32 | 47.13 | 46.02 | 45.00 | 44.04 | 43.15 |
| 15 | 57.09 | 55.19 | 53.49 | 51.95 | 50.55 | 49.26 | 48.08 | 46.98 | 45.96 | 45.00 |

* For an entry x not in the table (n ≤ 50), take the complement with respect to 90° of the entry for n − x, e.g. for n = 16, x = 11, take 90 − 34.64 = 55.36. The entries printed in italics are, in fact, complements of entries printed higher in that column.

Table 5 (cont.)

| x/n | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.09 | 5.01 | 4.94 | 4.87 | 4.80 | 4.73 | 4.67 | 4.61 | 4.55 | 4.49 |
| 1 | 12.33 | 12.14 | 11.96 | 11.78 | 11.61 | 11.45 | 11.30 | 11.15 | 11.01 | 10.87 |
| 2 | 16.15 | 15.90 | 15.66 | 15.43 | 15.21 | 14.99 | 14.79 | 14.60 | 14.41 | 14.23 |
| 3 | 19.27 | 18.96 | 18.67 | 18.39 | 18.12 | 17.87 | 17.63 | 17.39 | 17.16 | 16.95 |
| 4 | 21.99 | 21.64 | 21.30 | 20.98 | 20.68 | 20.38 | 20.10 | 19.83 | 19.57 | 19.32 |
| 5 | 24.47 | 24.07 | 23.69 | 23.33 | 22.99 | 22.66 | 22.34 | 22.04 | 21.75 | 21.47 |
| 6 | 26.77 | 26.33 | 25.91 | 25.51 | 25.13 | 24.76 | 24.41 | 24.08 | 23.76 | 23.45 |
| 7 | 28.94 | 28.46 | 28.00 | 27.56 | 27.15 | 26.75 | 26.36 | 26.00 | 25.65 | 25.31 |
| 8 | 31.01 | 30.49 | 29.99 | 29.52 | 29.06 | 28.63 | 28.22 | 27.82 | 27.44 | 27.08 |
| 9 | 33.01 | 32.44 | 31.90 | 31.39 | 30.90 | 30.44 | 29.99 | 29.57 | 29.16 | 28.77 |
| 10 | 34.94 | 34.33 | 33.75 | 33.20 | 32.68 | 32.18 | 31.71 | 31.25 | 30.81 | 30.40 |
| 11 | 36.83 | 36.18 | 35.56 | 34.97 | 34.41 | 33.88 | 33.37 | 32.88 | 32.42 | 31.97 |
| 12 | 38.68 | 37.98 | 37.32 | 36.70 | 36.10 | 35.53 | 34.99 | 34.48 | 33.98 | 33.51 |
| 13 | 40.50 | 39.76 | 39.06 | 38.39 | 37.76 | 37.16 | 36.58 | 36.04 | 35.51 | 35.01 |
| 14 | 42.31 | 41.52 | 40.77 | 40.06 | 39.39 | 38.75 | 38.15 | 37.57 | 37.02 | 36.49 |
| 15 | 44.10 | 43.26 | 42.47 | 41.72 | 41.01 | 40.33 | 39.69 | 39.08 | 38.50 | 37.94 |
| 16 | 45.90 | 45.00 | 44.16 | 43.36 | 42.61 | 41.90 | 41.22 | 40.57 | 39.96 | 39.37 |
| 17 | 47.69 | 46.74 | 45.84 | 45.00 | 44.20 | 43.45 | 42.74 | 42.06 | 41.41 | 40.79 |
| 18 | 49.50 | 48.48 | 47.53 | 46.64 | 45.80 | 45.00 | 44.25 | 43.53 | 42.85 | 42.20 |
| 19 | 51.32 | 50.24 | 49.23 | 48.28 | 47.39 | 46.55 | 45.75 | 45.00 | 44.28 | 43.60 |
| 20 | 53.17 | 52.02 | 50.94 | 49.94 | 48.99 | 48.10 | 47.26 | 46.47 | 45.72 | 45.00 |

18 Tables of the Freeman-Tukey transformations

Table 5 (cont.)

x/n     41     42     43     44     45     46     47     48     49     50

  0   4.44   4.39   4.34   4.29   4.24   4.19   4.15   4.11   4.07   4.02
  1  10.74  10.61  10.49  10.37  10.26  10.15  10.04   9.93   9.83   9.74
  2  14.05  13.89  13.72  13.57  13.42  13.27  13.13  12.99  12.86  12.73
  3  16.74  16.54  16.34  16.15  15.97  15.80  15.63  15.46  15.30  15.15
  4  19.08  18.85  18.62  18.41  18.20  18.00  17.80  17.62  17.43  17.26

  5  21.20  20.94  20.69  20.44  20.21  19.99  19.77  19.56  19.35  19.15
  6  23.15  22.86  22.59  22.32  22.07  21.82  21.58  21.35  21.12  20.90
  7  24.99  24.67  24.37  24.08  23.80  23.53  23.27  23.02  22.78  22.54
  8  26.73  26.39  26.06  25.75  25.45  25.16  24.88  24.60  24.34  24.09
  9  28.39  28.03  27.68  27.35  27.02  26.71  26.41  26.12  25.83  25.56

 10  29.99  29.61  29.24  28.88  28.53  28.20  27.88  27.57  27.27  26.98
 11  31.55  31.14  30.74  30.36  29.99  29.64  29.30  28.97  28.65  28.34
 12  33.06  32.62  32.20  31.80  31.41  31.04  30.68  30.33  30.00  29.67
 13  34.53  34.07  33.63  33.21  32.80  32.40  32.02  31.66  31.30  30.96
 14  35.98  35.50  35.03  34.58  34.15  33.74  33.34  32.95  32.58  32.22

 15  37.41  36.90  36.41  35.93  35.48  35.05  34.63  34.22  33.83  33.45
 16  38.81  38.27  37.76  37.26  36.79  36.33  35.89  35.47  35.06  34.66
 17  40.20  39.64  39.10  38.58  38.08  37.60  37.14  36.70  36.27  35.86
 18  41.58  40.99  40.42  39.88  39.36  38.86  38.37  37.91  37.46  37.03
 19  42.95  42.33  41.74  41.17  40.62  40.10  39.60  39.11  38.64  38.19

 20  44.32  43.67  43.04  42.45  41.88  41.33  40.81  40.30  39.81  39.34
 21  45.68  45.00  44.35  43.73  43.13  42.56  42.01  41.48  40.98  40.49
 22  47.05  46.33  45.65  45.00  44.38  43.78  43.21  42.66  42.13  41.62
 23  48.42  47.67  46.96  46.27  45.62  45.00  44.40  43.83  43.28  42.75
 24  49.80  49.01  48.26  47.55  46.87  46.22  45.60  45.00  44.43  43.88

 25  51.19  50.36  49.58  48.83  48.12  47.44  46.79  46.17  45.57  45.00
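The entries of Table 5 and the complement rule in its footnote can be checked numerically. The sketch below assumes the usual Freeman-Tukey angular form, the average of arcsin √(x/(n+1)) and arcsin √((x+1)/(n+1)) expressed in degrees; that defining formula is not restated in this part of the chapter, so it is an assumption here, and the function names are illustrative only.

```python
import math

def freeman_tukey_angle(x, n):
    """Assumed definition: mean of two arcsine angles, in degrees."""
    a = math.asin(math.sqrt(x / (n + 1)))
    b = math.asin(math.sqrt((x + 1) / (n + 1)))
    return math.degrees(0.5 * (a + b))

def entry_via_complement(x, n):
    """Footnote rule: 90 degrees minus the tabled entry for n - x."""
    return 90.0 - freeman_tukey_angle(n - x, n)

# First entry of the n = 31 column and the footnote's example (n = 16, x = 11).
print(f"{freeman_tukey_angle(0, 31):.2f}")
print(f"{entry_via_complement(11, 16):.2f}")
```

Under this assumed definition the complement rule is exact, because arcsin √p + arcsin √(1 − p) = 90°, which is why only part of each column needs to be printed.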

Table 6. Table of values for the Freeman-Tukey square-root transformation∗

The observed count is x and the transformed values are $\sqrt{x} + \sqrt{x+1}$. The squares of the transformed values are also given for use in analysis of variance computations.

  x   √x + √(x+1)   (√x + √(x+1))²      x   √x + √(x+1)   (√x + √(x+1))²

  0       1.00           1.0000         26      10.30         106.0900
  1       2.41           5.8081         27      10.49         110.0404
  2       3.15           9.9225         28      10.68         114.0624
  3       3.73          13.9129         29      10.86         117.9396
  4       4.24          17.9776         30      11.04         121.8816
  5       4.69          21.9961         31      11.22         125.8884
  6       5.10          26.0100         32      11.40         129.9600
  7       5.47          29.9209         33      11.58         134.0964
  8       5.83          33.9889         34      11.75         138.0625
  9       6.16          37.9456         35      11.92         142.0864
 10       6.48          41.9904         36      12.08         145.9264
 11       6.78          45.9684         37      12.25         150.0625
 12       7.07          49.9849         38      12.41         154.0081
 13       7.35          54.0225         39      12.57         158.0049
 14       7.61          57.9121         40      12.73         162.0529
 15       7.87          61.9369         41      12.88         165.8944
 16       8.12          65.9344         42      13.04         170.0416
 17       8.37          70.0569         43      13.19         173.9761
 18       8.60          73.9600         44      13.34         177.9556
 19       8.83          77.9689         45      13.49         181.9801
 20       9.05          81.9025         46      13.64         186.0496
 21       9.27          85.9329         47      13.78         189.8884
 22       9.49          90.0601         48      13.93         194.0449
 23       9.69          93.8961         49      14.07         197.9649
 24       9.90          98.0100         50      14.21         201.9241
 25      10.10         102.0100

∗ This table was originally published in Gardner Lindzey (editor), Handbook of Social Psychology, vol. 1, Addison-Wesley, 1954, p. 327, in a chapter by Frederick Mosteller and Robert R. Bush entitled ‘Selected Quantitative Techniques’ and is reproduced here with the permission of the Addison-Wesley Publishing Company, Inc.
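A few rows of Table 6 can be regenerated with a short sketch. It assumes, as the printed values suggest, that the third column is the square of the transformed value after rounding to two decimals; the function name is illustrative.

```python
import math

def ft_sqrt(x):
    """Freeman-Tukey square-root transform of a count x."""
    return math.sqrt(x) + math.sqrt(x + 1)

for x in (0, 1, 4, 25, 50):
    t = round(ft_sqrt(x), 2)   # transformed value as tabled
    print(x, f"{t:.2f}", f"{t * t:.4f}")
```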

References and tables used in calculations

Bliss, C. I. (1946). Angles corresponding to percentages, Angle = Arc sin √(percentage), from Statistical Methods, 4th ed., pp. 449–450, by G. W. Snedecor. Ames, Iowa: The Iowa State College Press.

Eisenhart, C. (1947). Inverse sine transformation of proportions, from Selected Techniques of Statistical Analysis, by Statistical Research Group, Columbia University, with C. Eisenhart, M. W. Hastay, and W. A. Wallis (eds.), pp. 395–416. New York: McGraw-Hill Book Co., Inc.

Freeman, M. F. & Tukey, J. W. (1950). Transformations related to the angular and the square root. Ann. Math. Statist. 21, 607–611.

Gupta, S. S. & Sobel, M. (1958). On selecting a subset which contains all populations better than a standard. Ann. Math. Statist. 29, 235–244.

Molina, E. C. (1942). Poisson’s Exponential Binomial Limit. Princeton, New Jersey: D. van Nostrand Company, Inc.

Mosteller, F. & Bush, R. R. (1954). Selected quantitative techniques, from Handbook of Social Psychology 1, 289–334, edited by G. Lindzey. Reading, Mass.: Addison-Wesley Publishing Co., Inc.

National Bureau of Standards (1945). Table of Arc Sin x. New York: Columbia University Press.

National Bureau of Standards (1949). Tables of the Binomial Probability Distribution. Applied Mathematics Series 6. Washington, D.C.: U.S. Government Printing Office.

Stevens, W. L. (1953). Tables of the angular transformation. Biometrika 40, 70–73.

U.S. Department of Commerce (1942). Natural Sines and Cosines to Eight Decimal Places. Washington, D.C.: U.S. Government Printing Office.

U.S. Department of Commerce (1952). Tables of the Cumulative Binomial Probabilities. ORDP 20-1.

Reprinted from The Mathematics Teacher (1962), 55, pp. 322–325

19. Understanding the Birthday Problem

Frederick Mosteller
Harvard University, Cambridge, Massachusetts

The classical birthday problem: What is the least number of people required to assure that the probability that two or more of them have the same birthday exceeds 1/2? The usual simplifications are that February 29 is ignored as a possible birthday and that the other 365 days are regarded as equally likely birth dates. The problem and its answer, 23 people, are well known.¹

Many are surprised at this answer, because 23 is a small fraction of 365. Instead, they feel that a number such as 183 (≈ ½(365)) would be more reasonable. Their surprise stems in part from a misunderstanding of the statement of the problem. The problem they have in mind is the birthmate problem: What is the least number of people I should question to make the probability that I find one or more birthmates exceed 1/2? This problem is known, but not well known,¹ and its answer is 253. The answer exceeds 183 because the sampling of birth dates is done with replacement, rather than without. If we sampled from 365 people with different birth dates, 183 is the correct answer to the birthmate problem.

In comparing the birthday and birthmate problems, one observes that for r people in the birthday problem, there are r(r−1)/2 pairs or opportunities for like birthdays; whereas, if n people are questioned in the birthmate problem, there are only n opportunities for me to find one or more birthmates. It seems reasonable that, approximately, if one is to have about the same probability of success (finding matching birthdays) in the two problems, then n ≈ r(r−1)/2, because then one has the same number of opportunities in both problems. For example, if the probability of success is to be at least 1/2, then if r = 23, n = 23(22)/2 = 253, which happens to be exactly the correct answer. On the other hand, the probabilities of success in the two problems are not identical. For the birthday problem with r = 23, P(success) ≈ 0.5073; for the birthmate problem with n = 253, P(success) ≈ 0.5005.

¹ W. Feller, Probability Theory and Its Applications (New York: John Wiley & Sons, Inc., 1950 [first edition]), I, pp. 31–32, 112.
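The two answers, 23 and 253, and the two probabilities just quoted can be checked directly. The sketch below is a minimal check assuming 365 equally likely days; it uses the product-and-complement argument for the birthday problem and the independent-draws argument for the birthmate problem, and the function names are illustrative only.

```python
def p_birthday(r, N=365):
    """P(at least one shared birthday among r people)."""
    p_none = 1.0
    for i in range(r):
        p_none *= (N - i) / N      # person i + 1 avoids the i earlier birthdays
    return 1.0 - p_none

def p_birthmate(n, N=365):
    """P(at least one of n people shares my birthday)."""
    return 1.0 - (1.0 - 1.0 / N) ** n

def smallest_exceeding(prob, target):
    """Smallest count k with prob(k) > target."""
    k = 1
    while prob(k) <= target:
        k += 1
    return k

print(smallest_exceeding(p_birthday, 0.5), round(p_birthday(23), 4))     # 23 0.5073
print(smallest_exceeding(p_birthmate, 0.5), round(p_birthmate(253), 4))  # 253 0.5005
```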


Let us generalize the two problems. Let N be the number of equally likely days (or categories), r the number of individuals in the birthday problem, and n the number examined in the birthmate problem. Then for the birthday problem it is easiest to compute first the probability of no like birthdays, and to obtain the probability of at least one like pair by taking the complement. There are N ways for the first person to have a birthday, N − 1 for the next so that he does not match the first, N − 2 for the third so that he matches neither of the first two, and so on down to N − r + 1 for the rth person. Then applying the multiplication principle, the number of ways for no matching birthdays is

$$N(N-1)\cdots(N-r+1), \qquad r \text{ factors}. \tag{1}$$

To get the probability of no matching birthdays we also need the number of ways r people can have birthdays without restriction. There are N ways for each person. Then the multiplication principle says that the total number of different ways the birthdays can be assigned to r people is

$$N^r. \tag{2}$$

The number in expression (1) divided by that in expression (2) is the probability of no like birthdays, because we assume that all birthdays and therefore all ways of assigning birthdays are equally likely. The complement of this ratio is the probability, $P_R$, of at least one pair of like birthdays in the classical birthday problem. Thus

$$P_R = P(\text{at least 1 matching pair}) = 1 - N(N-1)\cdots(N-r+1)/N^r. \tag{3}$$

For the birthmate problem the probability that a randomly chosen person does not have my birthday is (N − 1)/N = 1 − 1/N. If n people are independently drawn, then by the multiplication principle for independent events, the probability that none has my birthday is $(1 - 1/N)^n$. The probability, $P_S$, that at least one has my birthday is

$$P_S = 1 - \left(1 - \frac{1}{N}\right)^n. \tag{4}$$

We wish to equate $P_S$ to $P_R$ approximately, and thus approximate n as a function of r. If $P_R = P_S$, then

$$\frac{N(N-1)(N-2)\cdots(N-r+1)}{N^r} = \left(1 - \frac{1}{N}\right)^n. \tag{5}$$

Dividing each factor on the left by an N from $N^r$, we get

$$1\left(1 - \frac{1}{N}\right)\left(1 - \frac{2}{N}\right)\cdots\left(1 - \frac{r-1}{N}\right) = \left(1 - \frac{1}{N}\right)^n. \tag{6}$$

If we multiply out the left-hand side of Equation (6) to terms of order 1/N, and expand the right-hand side to two terms, we get

$$1 - \left(\frac{1}{N} + \frac{2}{N} + \cdots + \frac{r-1}{N}\right) + \cdots = 1 - \frac{n}{N} + \cdots. \tag{7}$$

Eliminating the 1’s shows that, approximately, we want

$$n = 1 + 2 + \cdots + (r-1) = (r-1)r/2. \tag{8}$$

This result justifies in part our earlier statement about the need for equivalent opportunities. The approach just given is satisfactory if n is small compared to N, but the following logarithmic approach leads to a more refined relation between n and r if $P_R \approx P_S$. Recall that if |x| < 1, then

$$\log_e(1-x) = -\left(x + \frac{x^2}{2} + \frac{x^3}{3} + \cdots\right). \tag{9}$$

Taking the logarithm of both sides of Equation (6) gives us

$$\log\left(1 - \frac{1}{N}\right) + \log\left(1 - \frac{2}{N}\right) + \cdots + \log\left(1 - \frac{r-1}{N}\right) = n\log\left(1 - \frac{1}{N}\right). \tag{10}$$

We use Equation (9) to expand each logarithm in Equation (10) to get (after multiplying both sides of the equation by −1) Equation (11). Each line on the left side of Equation (11) corresponds to one logarithmic expansion in the left of Equation (10).

$$\left.\begin{aligned}
&\frac{1}{N} + \frac{1}{2N^2} + \frac{1}{3N^3} + \cdots \\
{}+{} &\frac{2}{N} + \frac{2^2}{2N^2} + \frac{2^3}{3N^3} + \cdots \\
{}+{} &\frac{3}{N} + \frac{3^2}{2N^2} + \frac{3^3}{3N^3} + \cdots \\
{}+{} &\cdots \\
{}+{} &\frac{r-1}{N} + \frac{(r-1)^2}{2N^2} + \frac{(r-1)^3}{3N^3} + \cdots
\end{aligned}\right\} = n\left(\frac{1}{N} + \frac{1}{2N^2} + \frac{1}{3N^3} + \cdots\right). \tag{11}$$

Now we sum the left side of Equation (11) column by column, using the usual formulas for the sum of the integers, the sum of the squares of the integers and so on to get

$$\frac{r(r-1)}{2N} + \frac{r(r-1)(2r-1)}{6(2N^2)} + \frac{r^2(r-1)^2}{4(3N^3)} + \cdots = n\left(\frac{1}{N} + \frac{1}{2N^2} + \frac{1}{3N^3} + \cdots\right). \tag{12}$$

To get a refined estimate of n in terms of r, we divide both sides of Equation (12) by

$$\frac{1}{N} + \frac{1}{2N^2} + \frac{1}{3N^3} + \cdots,$$

and by ordinary but arduous long division we find the quotient

$$\frac{r(r-1)}{2} + \frac{r(r-1)(r-2)}{6N} + \frac{r^2(r-1)(r-2)}{12N^2} + \cdots.$$

Thus factoring r(r − 1)/2 out of the quotient gives

$$n \approx \frac{r(r-1)}{2}\left[1 + \frac{r-2}{3N} + \frac{r(r-2)}{6N^2}\right]. \tag{13}$$

It is amusing to note that the coefficient of 1/N is the number of triple birthdays, but I have not found an intuitive rationale for this. Thus for N = 365, r = 23, to equate probabilities for the two problems, n ≈ 258. And for these numbers $P_R$ ≈ 0.5073 and $P_S$ ≈ 0.5073.

Table for the Birthday Problems: N = 365

   (1)      (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)
Target P     n      P_S      r      P_R     n_1     P_1     n*     n_13

   .01        4   .01091     4    .01636      6   .01633      6       6
   .05       19   .05079     7    .05623     21   .05598     21      21
   .10       39   .10147    10    .11695     45   .11614     45      45
   .20       82   .20146    14    .22310     91   .22093     92      92
   .30      131   .30190    17    .31501    136   .31141    138     138
   .40      187   .40132    20    .41144    190   .40623    193     193
   .50      253   .50048    23    .50730    253   .50048    258     258
   .60      334   .60001    27    .62686    351   .61824    359     359
   .70      439   .70013    30    .70632    435   .69682    447     447
   .80      587   .80020    35    .81438    595   .80454    614     614
   .90      840   .90019    41    .90315    820   .89456    851     851
   .95     1092   .95001    47    .95477   1081   .94848   1128    1128
   .99     1679   .99001    57    .99012   1596   .98746   1683    1682

In the Table for the Birthday Problems, we give in column 1 a target probability. Column 2 gives for the birthmate problem the smallest value of n that provides a probability of success larger than the target probability. Column 3 gives the corresponding probability of success, $P_S$. Column 4 gives the smallest value of r that produces a probability of success for the birthday problem greater than the target probability shown in column 1. The actual probability of success, $P_R$, is shown in column 5. The 6th and 7th columns show $n_1 = r(r-1)/2$, for the r given in column 4, together with the probability of success, $P_1$, in the birthmate problem, with $n_1$ people. Column 8 gives $n^*$, the value of n for the birthmate problem that makes the probability of success in the birthmate problem nearest to the probability of success, $P_R$, in the birthday problem. Column 9 gives $n_{13}$, the value of n obtained from the approximation of Equation (13). Over the range of target probabilities considered in the table, n, $n_1$, $n^*$, and $n_{13}$ are rather close. Also the probabilities $P_R$ and $P_1$ are close, though $P_R > P_1$.
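Columns 6, 8, and 9 of the table can be recomputed for any r with a few lines. The sketch below assumes N = 365 by default; the helper names are illustrative, and the search for n∗ simply compares the two integers on either side of the real-valued solution of $P_S = P_R$.

```python
import math

def p_birthday(r, N=365):
    p_none = 1.0
    for i in range(r):
        p_none *= (N - i) / N
    return 1.0 - p_none

def p_birthmate(n, N=365):
    return 1.0 - (1.0 - 1.0 / N) ** n

def n_first_order(r):
    """Column 6: n1 = r(r - 1)/2, from Equation (8)."""
    return r * (r - 1) // 2

def n_refined(r, N=365):
    """Column 9 before rounding: the approximation of Equation (13)."""
    base = r * (r - 1) / 2
    return base * (1 + (r - 2) / (3 * N) + r * (r - 2) / (6 * N * N))

def n_nearest(r, N=365):
    """Column 8: the integer n whose P_S is nearest to P_R for this r."""
    target = p_birthday(r, N)
    exact = math.log(1 - target) / math.log(1 - 1 / N)
    candidates = (math.floor(exact), math.ceil(exact))
    return min(candidates, key=lambda n: abs(p_birthmate(n, N) - target))

for r in (4, 23, 57):
    print(r, n_first_order(r), n_nearest(r), round(n_refined(r)))
# 4 6 6 6
# 23 253 258 258
# 57 1596 1683 1682
```

The last row shows the only place in the table where the refined approximation and the exact matching value differ, and then by one.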

Reprinted from Journal of the American Statistical Association (1966), 61, pp. 35–73

20. Recognizing the Maximum of a Sequence

John P. Gilbert and Frederick Mosteller
Center for Advanced Study in the Behavioral Sciences and University of North Carolina and Harvard University

The classical dowry, secretary, or beauty contest problem is extended in several directions. In trying to find sequentially the maximum of a random sequence of fixed length, the chooser can have one or several choices (section 2), no information about the distribution of the values (section 2), or at the other extreme, full information about the distribution and the value of the observation itself (section 3). He can have an opponent who alters the properties of the sequence (section 4). The payoff function may be 0 or 1 (sections 2–4), or it may be the value of the observation itself as in certain investment problems (section 5). Both theoretical and numerical results are given for optimum and near optimum play.

This work was facilitated by the Center for Advanced Study in the Behavioral Sciences and by grants to Harvard University from the National Science Foundation NSF G-13040 and GS-341. It was also aided by the National Science Foundation through their grant NSF GP-683 to the Harvard Computing Center and by the Faculty-Aide Program administered under the Faculty of Arts and Sciences by the Student Employment Office of Harvard University.

1. SUMMARY AND INTRODUCTION

We analyze strategies for playing variations on the game variously called the beauty contest problem, the secretary problem, Googol, and the dowry problem.

Beauty contest: Suppose a boy is to have a date with his choice of one of n unseen and unknown girls, and suppose he wishes to choose the prettiest. The girls are presented for him to see one at a time in a random order, and he must choose or reject a girl when she appears. Once he chooses, he sees the rest, and he is disappointed if his date is not the prettiest. How can he maximize his probability of choosing the prettiest of the lot?

Dowry problem: Each of n slips of paper has a different dowry written on it, and a lucky bachelor has one chance to choose the girl with the largest dowry. Slips are drawn at random, and he must choose or reject each slip when it is drawn without replacement. If he chooses the slip with the largest dowry, he gets girl and dowry, otherwise nothing.

The two problems are intended to be mathematically equivalent. In neither does the chooser know anything about the distribution of the characteristic he is attending to; no memory or measurement difficulties are intended.

Our efforts to discover the originator of this problem have been unsuccessful, though one of the authors (F.M.) was told the problem in 1955 by Andrew Gleason [7], who claimed to have heard it from another. Fox and Marnie posed a mathematically equivalent problem under the name “Googol” in Martin Gardner’s February, 1960, column of the Scientific American [6], and a solution was outlined by Moser and Pounder [12] in the March, 1960, issue. Lindley [11] discusses the problem and a variant where the payoff is the rank of the slip chosen rather than just a win or a loss. Chow, Moriguti, Robbins, and Samuels [4] also treated the rank problem, finding that as the number of slips grows large, the optimum strategy gives expected rank 3.869. In 1963, Bissinger and Siegel [2] proposed the original problem for 1000 slips. A solution was given by Bosch and others [3].

Problems of optimal stopping are of course sequential decision problems and a basic reference is Arrow, Blackwell, Girshick [1]. Problems of optimal stopping have delicate theoretical points that require more rigor than we have always supplied here to establish the existence and optimality of a strategy; see for example Chow and Robbins [5]. We are indebted to Gus Haggstrom who has extended the Arrow, Blackwell, Girshick theory in ways that would tighten the arguments about the existence of a solution in some of the games presented here. We hope he will publish these results.
When variants of this problem are considered, some have kinship with practical problems, such as investment procedures and atomic bomb inspection programs, though the versions we give are oversimplified for such applications. In the games presented here the player wishes to choose the most desirable of several objects but relevant information is presented sequentially, for one object at a time. The player must choose or reject an object on first acquiring information about it, and he always knows the number of his choices. The details vary from one form of the problem to another. We now summarize the problems treated and give our results.

Dowry problem with r choices. From an urn containing n tags, each with a different number, tags are drawn one by one at random without replacement. When a tag is drawn, the player is told the rank of its number among those drawn thus far, and he must choose or reject it before another drawing. He gets r choices and wins if the largest number is one of these. For r = 1, the optimum strategy has the form: reject s − 1 tags and choose the first candidate thereafter, if any (throughout this paper, we refer to an observation which is the largest one so far as a candidate). For a given n, formula (2a–4) is the criterion for optimum s; Table 2 gives optimum s and P(win) for n = 1(1)50(10)100, 1000; and the well-known asymptotic (n → ∞) optimum strategy gives s = n/e, and P(win) = 1/e, where e = 2.718···. For two choices, the optimum strategy picks the first candidate after a given starting point. The second choice is only used to pick a candidate which occurs after a second starting point. Table 3 gives optimum starting points and P(win) for n = 1(1)50(10)100; the optimum asymptotic (n → ∞) strategy passes $n/e^{3/2}$ tags before using a first choice, passes n/e before using a second choice, and has P(win) = 0.591. Table 4 gives asymptotic results for the r-choice problem with r = 1(1)8. A simplified strategy passes a first set of numbers and chooses thereafter as many as r candidates as soon as they appear. The optimum asymptotic single starting point is $n/e^{a^*}$, where $a^* = \sqrt[r]{r!} = (r!)^{1/r}$, and asymptotically

$$P(\text{win}) = e^{-a^*}\sum_{i=1}^{r} (a^*)^i/i!.$$
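The asymptotic result for the simplified r-choice strategy is easy to tabulate. The sketch below reads the summary formula as a∗ = (r!)^(1/r) and P(win) = e^(−a∗) Σ (a∗)^i / i!, summed over i = 1, ..., r; that reading is an assumption about the printed formula, and the function name is illustrative. For r = 1 it reduces to the classical 1/e.

```python
import math

def simplified_r_choice(r):
    """Return (a_star, asymptotic P(win)) for the single-starting-point strategy."""
    a_star = math.factorial(r) ** (1.0 / r)
    p_win = math.exp(-a_star) * sum(
        a_star ** i / math.factorial(i) for i in range(1, r + 1)
    )
    return a_star, p_win

for r in range(1, 9):
    a_star, p_win = simplified_r_choice(r)
    print(r, round(a_star, 4), round(p_win, 4))
```

For r = 2 the value falls a little below the 0.591 quoted for the optimum two-choice strategy, as it should for a simplified strategy with only one starting point.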

With one choice in the dowry problem, but a win given if the player chooses either the largest or second-largest number, the optimum strategy passes a set of draws, chooses the highest ranked so far in the next set of draws, and chooses either the highest or second highest ranked in the final set. Formula (2d–2) gives the exact P (win) for any strategy of the optimum form; Table 6 gives exact optimum strategies and P (win) for selected n ≤ 1000; the optimum asymptotic strategy passes 0.347n draws, chooses the highest rank so far until 2n/3 draws, and thereafter chooses either the highest or second highest, thereby achieving P (win) ≈ 0.574. In the classical game, we have an extreme case where the chooser knows nothing about the distribution. To find what room there is for statistical inference, we consider in section 3 the other extreme, the full-information problem, where the chooser knows exactly the distribution of the characteristic that is being assessed and is told the size of the measurement. Full-information problem. The numbers on the tags are drawn from a distribution known to the player, and the numbers themselves are revealed to the player one by one. Otherwise the problem is the same as the dowry problem and again the player wishes to choose the largest among the n numbers. The problem for any known distribution is readily transformed to that for the standard uniform distribution on (0,1), from which latter point of view we discuss it. For the one-choice problem, we treat strategies determined by a series of decision numbers di , where i is the number of the draw. The player chooses the number at draw i if and only if it is a candidate and larger than di . For di monotonically decreasing in i, formula (3c–1) gives a basis for computing P (win). 
Exact formulas (3b–1) can be used to obtain optimum di; Table 7 gives optimum di (actually bi where bi+1 = dn−i) for the first 51 trials; approximate optimum values are given by dn−i = 1/(1 + c/i), c = 0.8043···. The asymptotic P(win) ≈ 0.5802. The simple strategy with d1 = d2 = ··· = dt = d, dt+1 = dt+2 = ··· = dn = 0 yields asymptotically optimum P(win) ≈ 0.5639, with t ≈ 0.6489n and d ≈ 1/(1 + 1.17/n). We also investigate the r-choice problem using the simple strategy with di = d, i = 1, 2, ···, n; Table 9 shows results for r = 1(1)10. With 5 choices P(win) > 0.99.

Two-person game. Under the conditions of the one-choice dowry problem a player tries to find the maximum which has been placed in the sequence by an opponent in advance of any knowledge of the sequence and without otherwise disturbing its randomness. We find the minimax solution for this game, and find

$$P(\text{find maximum}) = 1\Big/\left(1 + \sum_{i=1}^{n-1} 1/i\right).$$

We also solve a two-choice version of this game, where P(find maximum) turns out to be approximately doubled. These games represent oversimplified atomic bomb inspection programs where the player hunting the maximum is the detective nation and the placer of the maximum is the cheating nation.

In section 5 we consider problems in which the payoff is the value of the number chosen and the player wishes to maximize its expected value. We study the uniform, normal, half-normal, exponential and inverse powers distributions with 1, 2, and 3 choices. The general theory for r choices is indicated, where the payoff for r choices is the sum of the choices. Kaufman [10] has considered problems of this sort. Earlier Moser [13] and Guttman [14] considered one-choice problems.

2. THE DOWRY PROBLEM

Let us describe the dowry problem more carefully. An urn contains n tags, identical except that each has a different number printed on it. The tags are to be successively drawn at random without replacement (the n! permutations are equally likely). Knowing the number of tags, n, the player may choose just one of the tags, his object being to choose the one with the largest number. The player’s behavior is restricted because after each tag is drawn he must either choose it, thus ending the game, or permanently reject it. The problem is to find the strategy that maximizes the probability of winning and to evaluate that probability.

To emphasize that the player has no access to prior information, we could specify that the player is told after each tag is drawn only the rank of its number among those numbers drawn so far. An important implication of this model is that the maximum value occurs at draw k + 1 with probability 1/n independently of the player’s information up to and including the kth draw (1 ≤ k ≤ n − 1), because what the player is told about the first k draws depends only upon the ordering of these k tags. To fix ideas, let us look at the case n = 4.
Let us give the name “candidate” to any draw whose value exceeds those of all earlier draws. We consider strategies that pass the first s draws and choose the first candidate thereafter, if any. All permutations of the ranks of the observations are equally likely. We rank observations from least to greatest, so that 4 is the rank of the largest. Table 1 shows the 24 permutations. Those marked ∗ lead to a win by the strategy that passes the first number and chooses the first candidate thereafter, if any. From now on we shall drop the expression “if any” in speaking of strategies of this form; “if any” will be understood. Those marked † lead to a win by the strategy that passes the first two numbers and then chooses the first later candidate. Since the first number is always a candidate, we find 6 ways to win by choosing the first number. The table shows the 11 ways to win by passing the first number, the 10 ways by passing the first two numbers, and, of course, 6 ways by passing the first three. Among these four strategies, the best passes the first draw and chooses the first candidate thereafter; and its probability of winning is 11/24. That strategy improves considerably upon the strategy of a random choice, which gives 1/4 as the probability of winning.

TABLE 1. WINS FOR THE DOWRY PROBLEM WHEN n = 4

    1234       2134     ∗†3124       4123
   †1243     ∗†2143     ∗†3142       4132
   †1324      †2314     ∗†3214       4213
   †1342      †2341     ∗†3241       4231
   ∗1423      ∗2413      ∗3412       4312
   ∗1432      ∗2431      ∗3421       4321

∗ Wins by passing first number. † Wins by passing first two numbers.
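The counts read off Table 1 can be confirmed by brute force. The sketch below enumerates all permutations of the ranks and counts the wins for the strategy that passes a given number of draws and then takes the first candidate; the function name and its `passed` parameter are illustrative.

```python
from itertools import permutations

def wins_when_passing(passed, n=4):
    """Count permutations won by passing `passed` draws, then taking the first candidate."""
    wins = 0
    for perm in permutations(range(1, n + 1)):
        best_seen = max(perm[:passed], default=0)
        for value in perm[passed:]:
            if value > best_seen:      # first draw that beats everything before it
                if value == n:         # it wins only if it is the overall maximum
                    wins += 1
                break
    return wins

print([wins_when_passing(k) for k in range(4)])   # [6, 11, 10, 6]
```

Dividing by 4! = 24 gives the probabilities discussed in the text, with 11/24 the best of the four.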

2a. THE DOWRY PROBLEM WITH ONE CHOICE: EXACT AND ASYMPTOTIC SOLUTIONS

Let us find the form of the optimum strategy for the dowry problem with n tags. Exposition is simplified if we think of ourselves as the player. We should choose the ith draw if the probability of winning with it exceeds the probability of winning with the best strategy available if we continue. That is, we should choose the ith draw if

$$P(\text{win with draw } i) > P(\text{win with best strategy from } i+1 \text{ on}). \tag{2a-1}$$

If the ith draw is not the largest so far, it is a sure loser, and so we need only consider choosing candidates. We now observe that the probability of winning with a candidate at draw i is a strictly increasing function of i, since it is the probability of the maximum’s being among the first i draws, namely i/n. The right-hand side of inequality (2a–1) is a decreasing function of i, since we can always get to a later point in the sequence and then use whatever strategy is available; being young in the game loses no strategies that can be employed later. Consequently the form of the optimum strategy is to pass, say, the first s − 1 draws and to choose the first candidate thereafter.

Let us compute the probability of winning with strategies of the optimum form. The probability that the maximum tag is at draw k is 1/n. The probability that the maximum of the first k − 1 draws appears in the first s − 1 is (s − 1)/(k − 1). The product (s − 1)/[n(k − 1)] gives the probability of a win at draw k, s ≤ k ≤ n. Summing gives the required probability π(s, n) of picking the true maximum of n when we pass the first s − 1 draws and choose the first later candidate:

$$\pi(s,n) = \frac{1}{n}\sum_{k=s}^{n}\frac{s-1}{k-1} = \frac{s-1}{n}\sum_{k=s-1}^{n-1}\frac{1}{k}, \qquad 1 < s \le n. \tag{2a-2}$$

Since the first draw is always a candidate, π(1, n) = 1/n. Note that for n = 4, s = 2, π(2, 4) = 11/24, as we got in our example. The optimum value of s, say s∗, is the smallest s for which inequality (2a–1) holds:

$$\frac{s}{n} > \pi(s+1,n) = \frac{s}{n}\left(\frac{1}{s} + \frac{1}{s+1} + \cdots + \frac{1}{n-1}\right), \tag{2a-3}$$

or equivalently, that s for which

$$\frac{1}{s} + \frac{1}{s+1} + \cdots + \frac{1}{n-1} < 1 < \frac{1}{s-1} + \frac{1}{s} + \cdots + \frac{1}{n-1}. \tag{2a-4}$$

[…]

$$\cdots \int_s^n dx/x + \tfrac{1}{2}(1/s - 1/n). \tag{2a-6}$$

The first inequality implies s < (n − 1/2)/e + 3/2, the second that

$$e > (n/s)\,e^{\frac{1}{2}(1/s - 1/n)} > (n/s)\left[1 + \tfrac{1}{2}(1/s - 1/n)\right] > (n/s)\left[1 + \tfrac{1}{2}\left(1/[(n - 1/2)/e + 3/2] - 1/n\right)\right]. \tag{2a-7}$$

Consequently we have

$$\frac{n - 1/2}{e} + \frac{1}{2} - \frac{3e - 1}{2(2n + 3e - 1)} \le s^* \le \frac{n - 1/2}{e} + \frac{3}{2}, \tag{2a-8}$$

which pins s∗ down to an interval of length

$$1 + \frac{1.79}{n + 1.79}.$$

The greatest integer contained in (n − 1/2)/e + 3/2 is the correct value of s∗ up to n = 100, except for n = 97, when it gives 37 instead of 36. The loss in P(win) of using s∗ + 1 instead of s∗ never exceeds 1/n(s − 1) ≈ e/n². We are indebted to John Pratt for these bounds, which are tighter than we originally had.

TABLE 2. OPTIMUM STARTING DRAW s∗ FOR THE DOWRY PROBLEM, TOGETHER WITH THE OPTIMUM PROBABILITY OF WINNING, TRUNCATED AT FIVE DECIMALS, FOR n = 1(1)50(10)100, 1000, ∞

   n    s∗   π(s∗, n)       n    s∗   π(s∗, n)       n    s∗   π(s∗, n)

   1    1    1.00000       21     9    .38281       41    16    .37572
   2   1,2    .50000       22     9    .38272       42    16    .37548
   3    2     .50000       23     9    .38189       43    17    .37526
   4    2     .45833       24    10    .38116       44    17    .37518
   5    3     .43333       25    10    .38091       45    17    .37493
   6    3     .42777       26    10    .38011       46    18    .37482
   7    3     .41428       27    11    .37979       47    18    .37470
   8    4     .40982       28    11    .37946       48    18    .37443
   9    4     .40595       29    11    .37869       49    19    .37441
  10    4     .39869       30    12    .37865       50    19    .37427

  11    5     .39841       31    12    .37826       60    23    .37320
  12    5     .39551       32    13    .37776       70    27    .37239
  13    6     .39226       33    13    .37767       80    30    .37185
  14    6     .39171       34    13    .37726       90    34    .37142
  15    6     .38940       35    14    .37699      100    38    .37104
  16    7     .38808       36    14    .37684     1000   369    .36819
  17    7     .38731       37    14    .37641        ∞   n/e    .36787
  18    7     .38540       38    15    .37632
  19    8     .38503       39    15    .37612
  20    8     .38420       40    16    .37574
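Entries of Table 2 can be recomputed from formula (2a–2) with exact rational arithmetic. The sketch below is one straightforward way to do it, not the authors' own computation: it evaluates π(s, n) for every s and keeps the smallest s attaining the maximum. The function names are illustrative.

```python
from fractions import Fraction

def win_probability(s, n):
    """pi(s, n): pass the first s - 1 draws, then take the first candidate."""
    if s == 1:
        return Fraction(1, n)          # the first draw is always a candidate
    return Fraction(s - 1, n) * sum(Fraction(1, k) for k in range(s - 1, n))

def optimum_start(n):
    """Smallest s maximizing pi(s, n), together with that probability."""
    best = max(range(1, n + 1), key=lambda s: win_probability(s, n))
    return best, win_probability(best, n)

print(optimum_start(4))                # (2, Fraction(11, 24))
s_star, p = optimum_start(100)
print(s_star, int(float(p) * 10**5))   # 38 37104
```

Because `max` returns the first maximizer, the tie at n = 2 (where s = 1 and s = 2 both give 1/2) is reported as s = 1.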


Asymptotic solution. For large n, we can approximate

$$\sum_{i=1}^{n} 1/i$$

by C + log n, where C is Euler’s constant. Using this approximation in formula (2a–2), we get

$$\pi(s,n) \approx \frac{s-1}{n}\log\frac{n-1}{s-1} \approx \frac{s}{n}\log\frac{n}{s}. \tag{2a-9}$$

Similarly, approximating the left- or right-hand sum in inequality (2a–4) shows us that log(n/s) ≈ 1, and so s ≈ n/e. Substituting these results into the final formula of equation (2a–9) gives us the well-known asymptotic result

$$\lim_{n\to\infty}\pi(s,n) = 1/e \approx 0.368, \qquad s \approx n/e. \tag{2a-10}$$

To sum up, the optimum asymptotic strategy is to pass the fraction 1/e of the tags and choose the first candidate thereafter, and then the probability of winning is asymptotically 1/e. For large n and arbitrary starting points, Figure 1 shows the probability of winning given by (2a–9).

Figure 1. Asymptotic probability of winning with different starting points.

2b. THE DOWRY PROBLEM WITH TWO CHOICES: EXACT AND ASYMPTOTIC THEORIES

In this section we allow the player to have two choices, retaining the other conditions of the dowry problem. If either of his choices is the tag with the largest number, the player wins. As before, we wish to find the strategy with maximum probability of winning and the value of that probability. The asymptotic theory of this section may be regarded as a warm-up for the large-n dowry problem with r choices.

An argument similar to that used in section 2a shows that the optimum strategy belongs to the class of strategies indexed by a pair of starting numbers (r, s), r < s. The first choice is to be used on the first candidate starting with draw r, and once the first choice is used, the second choice is to be used on the first candidate starting with draw s. Once the first choice is made, the problem is reduced to the one-choice problem of section 2a, and the second choice will be used on the first candidate starting with s∗, as defined by inequalities (2a–4) or given in Table 2.

To compute the probability of winning with strategy (r, s), the ways of winning are broken into three mutually exclusive events:

a. win with first choice (never use second choice),
b. win with second choice and no choice is used before s,
c. win with second choice and the first choice is used in one of the positions r, r + 1, ···, s − 1.

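By extensions of the methods used to derive equation (2a–2), it is easy to show that

$$\begin{aligned}
P(a) &= \frac{r-1}{n}\left(\frac{1}{r-1} + \frac{1}{r} + \cdots + \frac{1}{n-1}\right), & r &> 1,\\
P(a) &= 1/n, & r &= 1,\\
P(b) &= \frac{r-1}{n}\sum_{v=s+1}^{n}\sum_{u=s}^{v-1}\frac{1}{(u-1)(v-1)}, & s &> r \ge 1,\\
P(c) &= \frac{s-r}{n}\sum_{v=s}^{n}\frac{1}{v-1}, & s &> r \ge 1.
\end{aligned} \tag{2b-1}$$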

π(r, s, n) = P [win with (r, s)] = P (a) + P (b) + P (c). In our example for n = 4, both strategies (1, 2) and (2, 3) produce 17/24 as the probability of winning. Table 3 shows, for n = 1(1)50(10)100, optimum strategies and their probabilities of winning. Asymptotic solution. Assume n, r, and s are large. When we apply Euler’s approximation to P (a) of equations (2b–1), we get


John P. Gilbert and Frederick Mosteller

$$P(a) \approx \frac{r-1}{n}\log\frac{n-1}{r-2}.$$

TABLE 3. OPTIMUM STRATEGY (r*, s*) FOR THE TWO-CHOICE DOWRY PROBLEM, AND ITS PROBABILITY OF WINNING, TRUNCATED AT FIVE DECIMALS, FOR n = 1(1)50(10)100, ∞

   n   r*   s*   π(r*,s*,n)  |    n   r*   s*   π(r*,s*,n)  |    n   r*         s*     π(r*,s*,n)
   1    1    1    1.00000    |   21    5    9    .61561     |   41   10         16     .60378
   2    1    2    1.00000    |   22    6    9    .61456     |   42   10         16     .60348
   3    1    2     .83333    |   23    6    9    .61405     |   43   10         17     .60307
   4    1    2     .70833    |   24    6   10    .61316     |   44   11         17     .60263
   5    2    3     .70833    |   25    6   10    .61210     |   45   11         17     .60253
   6    2    3     .69305    |   26    6   10    .61055     |   46   11         18     .60238
   7    2    3     .67222    |   27    7   11    .61039     |   47   11         18     .60211
   8    2    4     .65560    |   28    7   11    .60996     |   48   11         18     .60170
   9    3    4     .65103    |   29    7   11    .60909     |   49   12         19     .60154
  10    3    4     .64632    |   30    7   12    .60829     |   50   12         19     .60143
  11    3    5     .64190    |   31    8   12    .60767     |   60   14         23     .59969
  12    3    5     .63531    |   32    8   13    .60738     |   70   16         27     .59837
  13    4    6     .63058    |   33    8   13    .60701     |   80   19         30     .59741
  14    4    6     .62982    |   34    8   13    .60634     |   90   21         34     .59675
  15    4    6     .62731    |   35    8   14    .60556     |  100   23         38     .59617
  16    4    7     .62441    |   36    9   14    .60550     |    ∞   n/e^(3/2)  n/e    .59100
  17    4    7     .62115    |   37    9   14    .60516     |
  18    5    7     .62014    |   38    9   15    .60480     |
  19    5    8     .61936    |   39    9   15    .60425     |
  20    5    8     .61781    |   40   10   16    .60386     |
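The exact probabilities of equations (2b–1) are easy to evaluate with rational arithmetic. The following Python sketch is one way to do so; the function names are illustrative and not from the paper, and the brute-force search over all pairs (r, s) is an assumption about how one might look for an optimum strategy, not a description of how Table 3 was computed.

```python
from fractions import Fraction


def pi_two_choice(r, s, n):
    """Exact P(win) with starting numbers r < s, summing P(a), P(b), P(c) of (2b-1)."""
    if r == 1:
        p_a = Fraction(1, n)
    else:
        # (r-1)/n * [1/(r-1) + 1/r + ... + 1/(n-1)]
        p_a = Fraction(r - 1, n) * sum(Fraction(1, k) for k in range(r - 1, n))
    # double sum over v = s+1..n and u = s..v-1
    p_b = Fraction(r - 1, n) * sum(
        Fraction(1, (u - 1) * (v - 1))
        for v in range(s + 1, n + 1)
        for u in range(s, v)
    )
    # single sum over v = s..n
    p_c = Fraction(s - r, n) * sum(Fraction(1, v - 1) for v in range(s, n + 1))
    return p_a + p_b + p_c


def best_pair(n):
    """Brute-force search for a pair (r, s), r < s, with the largest exact probability."""
    pairs = [(r, s) for r in range(1, n) for s in range(r + 1, n + 1)]
    return max(pairs, key=lambda rs: pi_two_choice(rs[0], rs[1], n))


print(pi_two_choice(1, 2, 4), pi_two_choice(2, 3, 4))
r_best, s_best = best_pair(10)
print(r_best, s_best, float(pi_two_choice(r_best, s_best, 10)))
```

When two pairs tie, as (1, 2) and (2, 3) do for n = 4, the search simply reports the first one it meets.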

Applying the approximation to the inside sum of P (b), we get

$$P(b) \approx \frac{r-1}{n}\sum_{v=s+1}^{n}\frac{1}{v-1}\log\frac{v-2}{s-2}.$$

Neglecting small additive constants like 1’s and 2’s and replacing the sum by an integration, we get the further approximation

$$P(b) \approx \frac{r}{n}\int_{s}^{n}\frac{1}{v}\log\frac{v}{s}\,dv \approx \frac{r}{2n}\left[\log\frac{n}{s}\right]^{2}.$$

The corresponding approximation for P (c) is

$$P(c) \approx \frac{s-r}{n}\log\frac{n}{s}.$$


To find the asymptotic optimum values of r and s, let $r = \alpha n$, $s = \beta n$; then

$$P[\text{win with } (r, s)] \approx \alpha\log(1/\alpha) + (\alpha/2)[\log(1/\beta)]^{2} + (\beta-\alpha)\log(1/\beta). \tag{2b-2}$$

Differentiating with respect to α and β and setting the resulting expressions equal to zero, we get two pairs of roots. One gives $\beta^{*} = e^{-1}$, $\alpha^{*} = e^{-3/2}$, and the asymptotic optimum probability of winning is

$$P[\text{win with } (r^{*}, s^{*})] \approx e^{-1} + e^{-3/2} \approx 0.5910. \tag{2b-3}$$
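The stationarity conditions behind equation (2b–2) can be written out explicitly. The following is a worked sketch in which $g$, $L_\alpha = \log(1/\alpha)$ and $L_\beta = \log(1/\beta)$ are shorthand introduced here for convenience; they are not notation from the paper. It covers both pairs of roots of the two derivative equations.

```latex
\begin{aligned}
g(\alpha,\beta) &= \alpha L_\alpha + \tfrac{\alpha}{2} L_\beta^{2} + (\beta-\alpha) L_\beta,\\[2pt]
\frac{\partial g}{\partial \beta}
  &= -\frac{\alpha}{\beta} L_\beta + L_\beta - \frac{\beta-\alpha}{\beta}
   = \frac{\beta-\alpha}{\beta}\,(L_\beta - 1) = 0
   \;\Longrightarrow\; \beta = e^{-1} \ \text{or}\ \alpha = \beta,\\[2pt]
\frac{\partial g}{\partial \alpha}
  &= L_\alpha - 1 + \tfrac{1}{2} L_\beta^{2} - L_\beta = 0,\\[2pt]
\beta = e^{-1}:&\quad L_\alpha = \tfrac{3}{2},\quad \alpha = e^{-3/2},\quad
  g = \tfrac{3}{2}e^{-3/2} + \tfrac{1}{2}e^{-3/2} + \bigl(e^{-1} - e^{-3/2}\bigr) = e^{-1} + e^{-3/2},\\[2pt]
\alpha = \beta:&\quad L_\alpha^{2} = 2,\quad \alpha = \beta = e^{-\sqrt{2}},\quad
  g = e^{-\sqrt{2}}\,(\sqrt{2} + 1).
\end{aligned}
```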

Note that the value of s* agrees with that obtained for the one-choice game, which could have been obtained similarly.

The other pair of roots gives $\alpha = \beta = e^{-\sqrt{2}}$, which optimizes the probability of winning with a strategy using a single starting point and expending the two choices on the first two candidates thereafter. Call this single starting point t*. Then for large n

$$P[\text{win with strategy } t^{*}] \approx e^{-\sqrt{2}}(\sqrt{2} + 1) \approx 0.5869. \tag{2b-4}$$

Note that this probability is close to the optimum.

2c. the dowry problem with r choices: asymptotic theory

Now let the player have r choices in the dowry problem. He wins if any of his choices is the tag with the largest number. Our object is to extend the asymptotic theory and thus find the asymptotic strategy that maximizes the probability of winning. As our first step, we get the asymptotic probability distribution of the number of candidates in a connected set of draws. We apply this result to get the asymptotic probability of winning the r-choice game with the strategy that passes the first t draws and chooses the next r candidates, if they exist. Finally, we develop the general asymptotic approach to the optimum strategy for r choices.

Let the draws be indexed by i = 1, 2, · · · , n, where i = 1 corresponds to the first draw. We now prove

Theorem 1. The probability of exactly r candidates in the set of draws for which a < i ≤ b is given asymptotically (n → ∞) by p(r|a, b), where

$$p(r|a, b) = (a/b)[\log(b/a)]^{r}/r!. \tag{2c-1}$$

The proof is by induction.¹ Throughout the development we assume n, a, and b − a are much greater than r. If the largest measurement in the first a draws is the largest among the first b, we get exactly p(0|a, b) = a/b. For r = 1, the kth draw (a < k ≤ b) is the only candidate in the set if: (1) it is the largest measurement in the first b, and (2) the largest of the first k − 1 draws is in the first a. The probability of event (1) is 1/b, of event (2) is a/(k − 1). Consequently, the probability of the joint event is the product a/[b(k − 1)]. Summing over the possible values of k gives

$$p(1|a, b) = \sum_{k=a+1}^{b}\frac{a}{b(k-1)} \approx (a/b)\log[(b-1)/a] \approx (a/b)\log(b/a). \tag{2c-2}$$

¹ In section 2c, if a is the starting number, draw a + 1 is the first one that can be chosen. In sections 2a and 2b, the starting draw itself could be chosen.

For r = 2, the kth draw (a < k ≤ b) is the second and last candidate in the set if it is the largest in the first b and if there is exactly one candidate between draw a + 1 and k − 1 inclusive. The probability of the first event is 1/b. That for the second is obtained from the final formula of (2c–2) by replacing b by k − 1. Their product, [a/b(k − 1)] log[(k − 1)/a], must be summed over the possible values of k. We get

$$p(2|a, b) \approx \sum_{k=a+2}^{b}\frac{a}{b(k-1)}\log[(k-1)/a] \approx (1/b)\int_{a+1}^{b}(a/x)\log(x/a)\,dx \approx (a/b)[\log(b/a)]^{2}/2!. \tag{2c-3}$$

To complete the induction, assume that

$$p(r|a, b) \approx (a/b)[\log(b/a)]^{r}/r!. \tag{2c-4}$$

Then the kth draw is the last and (r + 1)st candidate in the set if it is the largest among the first b and if r candidates appear in the interval a + 1 to k − 1. By essentially the same argument used in going from r = 1 to r = 2, we get the probability at k from p(r|a, k − 1)/b, and then summing and approximating by integration completes the induction and the demonstration of Theorem 1.

Asymptotic probability with r choices and a single starting point. As a crude strategy in the r-choice dowry problem, we could use a single starting point T + 1, and thereafter use up the choices as fast as candidates appear. If the number of candidates from T + 1 to n is zero or exceeds r, we lose; if from 1 to r, we win. Theorem 1 gives the asymptotic probability that the number of candidates is from 1 to r as

$$(T/n)\sum_{i=1}^{r}[\log(n/T)]^{i}/i!.$$

To optimize this asymptotic strategy, let $T = \alpha n$, $\alpha = e^{-a}$; then

$$P(\text{win with } r \text{ choices starting after } T) = e^{-a}\sum_{i=1}^{r}a^{i}/i!.$$


Note that this expression gives the probability that the count from a Poisson experiment with mean a is from 1 to r. Differentiating with respect to a and setting the result equal to zero, we find the optimum value of a is

$$a^{*} = (r!)^{1/r}, \tag{2c-5}$$

and the asymptotic optimum probability of winning is

$$e^{-a^{*}}\sum_{i=1}^{r}a^{*i}/i!. \tag{2c-6}$$

Furthermore, the asymptotic probability of winning with the ith choice is $e^{-a^{*}}a^{*i}/i!$, as one can verify by tracing the origin of the single terms of expression (2c–6). Note that for r = 1, formula (2c–6) reduces to $e^{-1}$ as it should [cf. (2a–10)]; and for r = 2, it reduces to formula (2b–4). Numerical results are given in Table 4. The differences in a* are nearly constant at 0.382 by r = 9, but they slowly drift down toward $e^{-1} \approx 0.368$ as r grows.

TABLE 4. OPTIMUM ASYMPTOTIC PROBABILITIES OF WINNING THE r-CHOICE DOWRY PROBLEM WITH r STARTING NUMBERS AND WITH ONE STARTING NUMBER T* = e^(−a*) n

        r Starting Numbers                                   |   One Starting Number
   r    ε_r        −log u_r = Σ ε_i   u_r        P(win)      |   P(win)   a*      e^(−a*)
   1    1.000000   1.000000           0.367879   0.367879    |   0.368    1       0.368
   2    0.500000   1.500000           0.223130   0.591010    |   0.587    1.414   0.243
   3    0.458333   1.958333           0.141093   0.732103    |   0.726    1.817   0.162
   4    0.438368   2.396701           0.091018   0.823121    |   0.817    2.213   0.109
   5    0.426267   2.822969           0.059429   0.882550    |   0.877    2.605   0.074
   6    0.418026   3.240994           0.039125   0.921675    |   0.917    2.994   0.050
   7    0.411998   3.652992           0.025913   0.947588    |   0.944    3.380   0.034
   8    0.407372   4.060364           0.017243   0.964831    |   0.962    3.764   0.023
   9                                                         |   0.974    4.147   0.016
  10                                                         |   0.982    4.529   0.011
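The one-starting-number columns of Table 4 follow directly from formulas (2c–5) and (2c–6). The following Python sketch evaluates them in floating point; the function name `single_start` is illustrative only.

```python
from math import exp, factorial


def single_start(r):
    """Return (a*, e^(-a*), P(win)) for the single-starting-point strategy with r choices."""
    a_star = factorial(r) ** (1.0 / r)                      # (2c-5)
    p_win = exp(-a_star) * sum(
        a_star ** i / factorial(i) for i in range(1, r + 1)  # (2c-6)
    )
    return a_star, exp(-a_star), p_win


for r in range(1, 11):
    a_star, fraction_passed, p_win = single_start(r)
    print(f"{r:2d}  {p_win:.3f}  {a_star:.3f}  {fraction_passed:.3f}")
```

The printed fraction e^(−a*) is the share of the n draws that is passed before the choices begin.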

Asymptotic theory for optimum strategy in the dowry problem with r choices. Again we define the dowry problem as in section 2a, except that the player is allowed r choices, and he wins if any choice is the tag with the largest number. The form of the optimum strategy involves a set of starting numbers $a_i$, i = 1, 2, · · · , r, where $a_1 < a_2 < \cdots < a_r$. To round out the notation, let $a_{r+1} = n$. The first choice is used on the first candidate following draw $a_1$, the second choice on the first candidate available for it after draw $a_2$, and so on. The point is that up to and including draw $a_i$ (for each i) no more than i − 1 choices may be made.

The starting points $a_i$, i = 1, 2, · · · , r + 1, break the draws into r + 1 sets. The draws that are initially passed are numbered 1, 2, · · · , $a_1$. The first set of draws for which a choice is available runs from draw $a_1 + 1$ to draw $a_2$ inclusive. The ith set runs from draw $a_i + 1$ to draw $a_{i+1}$ inclusive. We plan to get the asymptotic joint distribution of the number of candidates in sets through n and then to get a program for obtaining the optimum large-sample strategy and its asymptotic probability of winning.

For notation let $a_{i+1}/a_i = e^{\delta_i}$, so that $\delta_i = \log(a_{i+1}/a_i)$, i = 1, 2, · · · , r. In this notation Theorem 1 states that the asymptotic probability of exactly i candidates in the kth set of draws where choices are allowed is

$$e^{-\delta_k}(\delta_k^{i}/i!), \qquad 1 \le k \le r. \tag{2c-7}$$

It is not hard to show that the number of candidates $\alpha_k$ in the set from $a_k + 1$ to $a_{k+1}$ is independent of the numbers of candidates in other sets. From this independence, we can obtain the joint distribution of the numbers of candidates in the sets for which choices are allowed by multiplying together terms of the form given in expression (2c–7). Using the independence of $\alpha_i$, i = 1, 2, · · · , r, and Theorem 1, we have

Theorem 2. Asymptotically (n → ∞), the probability of winning with the vector of candidates $(\alpha_1, \alpha_2, \cdots, \alpha_r)$ is given by

$$P(\alpha_1, \alpha_2, \cdots, \alpha_r) = \prod_{i=1}^{r} e^{-\delta_i}\delta_i^{\alpha_i}/\alpha_i!.$$

Note that when $\alpha_i = 0$, the contribution of the factor is $e^{-\delta_i}$.

Rather than try to characterize all candidate vectors $(\alpha_1, \alpha_2, \cdots, \alpha_r)$ that lead to wins, we characterize those winning vectors for the r-choice game with starting points $a_1, a_2, \cdots, a_r$ for which the (r − 1)-choice game with the starting points $a_2, a_3, \cdots, a_r$ would have lost. A few examples illustrate the pattern. In the examples the vectors have r coordinates.

Example 1. The alpha vector is (1, 0, 0, · · · , 0). Such a vector arises from sequences of draws where only one candidate occurs and that in the first set ($a_1 + 1$ through $a_2$) of the r-choice game. Since the (r − 1)-choice game cannot choose a candidate in the first set, it misses the maximum in such sequences.

Example 2. (0, 2, 0, 0, · · · , 0). Sequences with no candidates in the first set, two candidates in the second set, and no later candidates. The (r − 1)-choice game can only choose one candidate in the second set.


If, in the course of the sequence, the additional choice of an r-choice game has been used, then the r- and (r − 1)-choice games become identical as far as later candidates are concerned, as in the examples (1, 1, 2, · · · ) and (0, 2, 1, · · · ). Consequently, to have a new winner for the r-choice game, the vector of candidates must be such that until the win occurs we have at least one more choice available in every set than we have occasion to use. This means that if $\alpha_K$ is the last nonzero α, then for j < K we require

Condition 1.

$$\sum_{i=1}^{j}\alpha_i < j.$$

Furthermore, when we do win, it must be the first (and only) occasion when all available choices have been used, otherwise the (r − 1)-game would have won, and so we require

Condition 2.

$$\sum_{i=1}^{K}\alpha_i = K.$$

Using Theorem 2 to write the probability of such a sequence, we see that if the δ’s in the product are written in the ascending order of their subscripts, the two conditions above imply that the sum of all of the exponents must equal the largest subscript, and that for any proper subset of the factors, the sum of the exponents must be less than the largest subscript in the subset. Let us denote the probability of winning an r-choice game of length n by the strategy whose choice points are $a_1, a_2, \cdots, a_r$ by $p(r|a_1, a_2, \cdots, a_r; n)$. Then we have shown

Theorem 3.

$$p(r|a_1, a_2, \cdots, a_r; n) = p(r-1|a_2, a_3, \cdots, a_r; n) + A_r,$$

where

$$A_r = \exp\left(-\sum_{i=1}^{r}\delta_i\right)\sum_{\alpha\in A}\prod_{i=1}^{r}(\delta_i^{\alpha_i}/\alpha_i!), \tag{2c-8}$$

and where α is a vector of candidates for the r-choice game, and A is the set of α’s meeting conditions 1 and 2 above. These conditions imply that the only vector in A with $\alpha_1 \ne 0$ is (1, 0, 0, · · · , 0). Hence $A_r$ can be put into the form

$$[\delta_1 + f(\delta_2, \cdots, \delta_r)]\exp\left(-\delta_1 - \sum_{i=2}^{r}\delta_i\right), \tag{2c-9}$$

where

$$f(\delta_2, \cdots, \delta_r) = \sum_{\alpha\in A'}\prod_{i=2}^{r}\delta_i^{\alpha_i}/\alpha_i!,$$

and A′ is A except that (1, 0, 0, · · · , 0) has been deleted. Taking the partial derivative of expression (2c–9) with respect to $\delta_1$ gives

$$[1 - \delta_1 - f(\delta_2, \cdots, \delta_r)]\exp\left(-\delta_1 - \sum_{i=2}^{r}\delta_i\right),$$

and therefore the value of $\delta_1$, say $\delta_1^{*}$, that maximizes $p(r|a_1, a_2, \cdots, a_r; n)$ is $\delta_1^{*} = 1 - f(\delta_2, \cdots, \delta_r)$.

We can optimize the δ’s sequentially, because when we are at some point left with k choices, we have either won or we have not. If the latter, the number of choices we have left determines our chance to win; all that matters is that we play the optimum k-choice game, given the number of draws left and the largest number drawn so far. Thus the optimum $a_1$ for a 1-choice game is the optimum $a_2$ in a 2-choice game, and so on. To handle this conveniently, it is preferable to number our starting values from last to first. Let $u_1$ be the fraction of n giving the starting value for the choice in a 1-choice game, $u_2$ that for the first choice in a 2-choice game, and $u_i$ that for the first choice in an i-choice game. As we have seen in section 2b, asymptotically,

$$u_1 = e^{-1} \approx 0.368, \qquad u_2 = e^{-(1+1/2)} \approx 0.223.$$

But the next two u’s have grittier exponents for e:

$$-\log u_3 = 1 + \frac{1}{2} + \frac{11}{24}, \qquad u_3 \approx 0.141,$$

$$-\log u_4 = 1 + \frac{1}{2} + \frac{11}{24} + \frac{505}{1152}, \qquad u_4 \approx 0.091.$$

Therefore, we turned to decimal computation. Let us define

$$-\log u_r = \sum_{j=1}^{r}\epsilon_j,$$

so that $\epsilon_1 = 1$, $\epsilon_2 = \frac{1}{2}$, and so on. Then in Table 4 for r = 1(1)8, we give $\epsilon_r$, $\sum\epsilon_i$, $u_r$, and the probability of winning a large game with r choices.
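The sequential optimization can be carried out mechanically. The Python sketch below is one possible implementation, not the authors' program: it enumerates the candidate vectors allowed by conditions 1 and 2 (with the vector (1, 0, · · · , 0) excluded), sums the corresponding products to get f, and sets each new exponent increment to 1 − f. It assumes that in the r-choice game the optimum δ's for the later sets are the increments already found for smaller games, taken in reverse order, which is how the text describes numbering the starting values from last to first. The names `f_sum` and `epsilons` are illustrative.

```python
from fractions import Fraction
from math import exp, factorial


def f_sum(tail):
    """f(delta_2, ..., delta_r) for tail = [delta_2, ..., delta_r], summed over the set A'."""
    r = len(tail) + 1

    def rec(j, s):
        # j: 1-based index of the set being filled; s = alpha_1 + ... + alpha_{j-1} (< j - 1)
        if j > r:
            return Fraction(0)          # ran out of sets without meeting condition 2
        delta = tail[j - 2]
        total = Fraction(0)
        for a in range(j - s + 1):      # alpha_j = a keeps the partial sum <= j
            term = delta ** a / factorial(a)
            if s + a == j:
                total += term           # condition 2 met: all later alphas are zero
            else:
                total += term * rec(j + 1, s + a)   # condition 1 still holds
        return total

    return rec(2, 0)                    # alpha_1 = 0 for every vector in A'


def epsilons(r_max):
    """Exact increments eps_1, ..., eps_rmax with -log u_r = eps_1 + ... + eps_r."""
    eps = []
    for _ in range(r_max):
        tail = eps[::-1]                # delta_2..delta_r = eps_{r-1}, ..., eps_1
        eps.append(1 - f_sum(tail))     # delta_1* = 1 - f
    return eps


eps = epsilons(4)
u = [exp(-float(sum(eps[: k + 1]))) for k in range(len(eps))]
for r, (e_r, u_r) in enumerate(zip(eps, u), start=1):
    # last column: running sum of the u's, matching the P(win) column of Table 4
    print(r, e_r, round(u_r, 6), round(sum(u[:r]), 6))
```

The running sum of the u's is printed because, by Theorem 3 with $\delta_1^{*} = 1 - f$, each added choice contributes $A_r = u_r$ to the probability of winning.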

2d. the dowry problem with one choice when both the largest and second-largest numbers are counted as wins

Suppose that drawings and decisions are made in the fashion of the one-choice dowry problem, ranks only being reported to the chooser, but he wins if he chooses either the drawing with the highest or second-highest rank among the n tags. It turns out that his best strategy has the form: pass $s_1 - 1$ draws, then choose the first candidate thereafter, but beginning with the $s_2$th draw choose also a drawing that is second best among those so far made. Naturally $s_1$ and $s_2$ are chosen to maximize P, the probability of winning.

For a given n, P breaks into three parts shown as sums on the right-hand side of (2d–1): the probability of winning by choosing the largest so far on draw i, $s_1 \le i \le s_2$; the probability of winning by choosing the largest so far on draw i, $s_2 + 1 \le i \le n$, the algebraic form being altered because of the possible choice of a draw ranked second largest thus far; and the probability of winning by choosing the second-largest draw thus far, $s_2 \le i \le n$. We do not give the derivation, but the total probability of winning is given by 2/n for $s_1 = 1$, otherwise by

$$
\begin{aligned}
P(\text{win with highest or second highest} \mid s_1, s_2; n)
 ={}& \frac{s_1-1}{n(n-1)}\sum_{i=s_1}^{s_2}\left[\frac{2(n-1)}{i-1} - 1\right]\\
 &+ \frac{(s_1-1)(s_2-2)}{n(n-1)}\sum_{i=s_2+1}^{n}\frac{1}{i-2}\left[\frac{2(n-1)}{i-1} - 1\right]\\
 &+ \frac{(s_1-1)(s_2-2)}{n(n-1)}\sum_{i=s_2}^{n}\frac{1}{i-2}.
\end{aligned}
\tag{2d-1}
$$

Performing summations in equation (2d–1) and grouping terms gives

$$P = \frac{2(s_1-1)}{n}\sum_{i=s_1}^{s_2}\frac{1}{i-1} + \frac{(s_1-1)(s_1-s_2)}{n(n-1)} + \frac{2(s_1-1)(s_2-2)}{n}\left[\frac{1}{s_2-1} - \frac{1}{n-1}\right]. \tag{2d-2}$$

TABLE 5. PROBABILITIES OF WINNING THE DOWRY PROBLEM FOR EACH ADMISSIBLE STRATEGY FOR n = 5 WHEN EITHER THE LARGEST OR SECOND-LARGEST NUMBER WINS

  s1   s2   P            |   s1   s2   P
   1   -    2/5          |    3    3   36/60
   2   2    24/60        |    3    4   42/60 = .7
   2   3    39/60        |    3    5   40/60
   2   4    42/60 = .7   |    4    4   36/60
   2   5    41/60        |    4    5   33/60
                         |    5    5   24/60
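Formula (2d–2) can be checked against Table 5 with exact rational arithmetic. The following Python sketch does this; the function name `p_top_two` and the exhaustive scan over all pairs with $s_1 \le s_2$ are illustrative choices, not the authors' procedure.

```python
from fractions import Fraction


def p_top_two(s1, s2, n):
    """P(win) from (2d-2) for starting numbers s1 <= s2; equals 2/n when s1 = 1."""
    if s1 == 1:
        return Fraction(2, n)
    harmonic = sum(Fraction(1, i - 1) for i in range(s1, s2 + 1))
    return (
        Fraction(2 * (s1 - 1), n) * harmonic
        + Fraction((s1 - 1) * (s1 - s2), n * (n - 1))
        + Fraction(2 * (s1 - 1) * (s2 - 2), n)
        * (Fraction(1, s2 - 1) - Fraction(1, n - 1))
    )


n = 5
table = {
    (s1, s2): p_top_two(s1, s2, n)
    for s1 in range(2, n + 1)
    for s2 in range(s1, n + 1)
}
for (s1, s2), p in sorted(table.items()):
    print(s1, s2, p)
print("best:", max(table.values()))
```

`Fraction` prints values in lowest terms, so 42/60 appears as 7/10.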

To get an asymptotic result, let $s_1 = \alpha n$, $s_2 = \beta n$, and write approximately

$$P \approx 2\alpha\log(\beta/\alpha) + \alpha(\alpha - \beta) + 2\alpha(1 - \beta). \tag{2d-3}$$

To maximize the asymptotic value of P, we differentiate approximation (2d–3) with respect to β and set the result equal to zero to find the optimum β to be $\beta^{*} = 2/3$. Treating α similarly, we get the condition

$$e^{\alpha^{*}-1} = 3\alpha^{*}/2,$$

which means that the optimum α is $\alpha^{*} \approx 0.347$. Then the maximum probability of winning is $P \approx \alpha^{*}(2 - \alpha^{*}) \approx 0.574$.

Examples:

1. For $s_1 = 2$, $s_2 = 3$, n = 4, P = 18/24, as it is also for $s_1 = 2$, $s_2 = 4$, n = 4.
2. For n = 5 we give Table 5.

Table 6 gives the optimum strategies and probability of winning for selected values of n. Not all possible ($s_1$, $s_2$) pairs were tested, but a 10 × 10 grid centered on the values (0.347n, 2n/3) was scanned for the largest values of P (win).

TABLE 6. OPTIMUM STRATEGY AND PROBABILITY OF WINNING (TRUNCATED AT FOUR DECIMALS) FOR THE DOWRY PROBLEM WHEN EITHER THE LARGEST OR SECOND-LARGEST NUMBER WINS

        Starting Numbers              |          Starting Numbers
    n   s1     s2     P(win)          |      n   s1        s2      P(win)
    3   2      3      0.8333          |     20   8         14      0.6046
    4   2      3(4)   0.7500          |     50   18        34      0.5857
    5   2(3)   4      0.7000          |    100   35        67      0.5795
    6   3      5      0.6888          |    200   70        134     0.5765
    7   3      5      0.6666          |    500   174       334     0.5747
    8   4      6      0.6517          |   1000   348       667     0.5741
    9   4      7      0.6472          |      ∞   0.3470n   2n/3    0.5736
   10   4      7      0.6366          |
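The asymptotic optimum in the last row of Table 6 can be reproduced numerically. The Python sketch below solves the condition $e^{\alpha-1} = 3\alpha/2$ by bisection and then evaluates approximation (2d–3); the bracketing interval [0.1, 0.6] and the function name `solve_alpha` are choices made here for illustration.

```python
from math import exp, log


def solve_alpha(lo=0.1, hi=0.6, tol=1e-12):
    """Root of exp(alpha - 1) - 1.5 * alpha on [lo, hi], found by bisection."""
    def g(a):
        return exp(a - 1.0) - 1.5 * a

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid        # sign change lies in [lo, mid]
        else:
            lo = mid
    return 0.5 * (lo + hi)


alpha_star = solve_alpha()
beta_star = 2.0 / 3.0
# Approximation (2d-3) evaluated at the optimum, and its closed form alpha*(2 - alpha*).
p_2d3 = (
    2 * alpha_star * log(beta_star / alpha_star)
    + alpha_star * (alpha_star - beta_star)
    + 2 * alpha_star * (1 - beta_star)
)
p_closed = alpha_star * (2 - alpha_star)
print(alpha_star, p_2d3, p_closed)
```

The two printed probabilities agree because, at the root, $\log(\beta^{*}/\alpha^{*}) = 1 - \alpha^{*}$.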

One expects the optimum probability of winning for this problem to be close to that for the dowry problem with two choices when the largest dowry must be chosen to win (section 2b). For n = 5, the optimum probability of winning in the two-choice problem is about 0.708, that for the present problem 0.700; asymptotically the optimum for the two-choice problem is 0.591, for the present problem 0.574. This variant of the dowry problem was suggested to us by Arthur P. Dempster. Since formula (2d–1) is already a bit thick, we did not pursue the problem of finding one’s chance to win if any one of the top r values among the n slips gives the chooser a win.


3. choosing the largest measurement from a sample drawn from a known distribution

In the dowry problem, no information about the distribution of the numbers on the tags was either available in advance or could be gathered during the course of the drawings. To guarantee this degree of ignorance, we suggested reporting to the player only the rank of a new draw among the numbers drawn thus far (section 2a). At another extreme, the player might know exactly the distribution from which the tags’ numbers are to be drawn. (Sometimes the expression “know the distribution” is used loosely for “know the family of the distribution.” Here we mean knowing all about the distribution. For example, knowing that the distribution is normal would not be enough, one would need its mean and variance as well.) If the one-choice dowry game were played with these tags according to the rules laid down earlier, except that at each draw the actual number on the tag were to be reported to the player, he should be able to devise a strategy that produces a higher probability of winning than was achieved in the dowry problem of section 2a. We show in this section that his optimum asymptotic probability of winning rises from the 1/e ≈ 0.37 for the game with unknown distribution to about 0.58 in the new full-information game.

In many statistical problems, some but not all the information about the distribution is assumed to be known. With such partial information, but full reporting of the numbers, the player ordinarily gathers additional information about the distribution as the tags’ numbers are reported. This information plus statistical inference might aid him to make a better choice than that offered by the “pass s* and take the first candidate thereafter” strategy of section 2a. Aside from its own interest, the problem when the distribution is known provides an upper bound to the probability that might be achieved from statistical inference.

3a. the full-information game

One by one, a sample of n measurements is drawn from a population with continuous cumulative distribution F. The continuity assures that ties have probability zero. After each draw, the player, who knows F and n, is informed of its value, whereupon he must decide whether or not to choose that draw. He is to maximize the probability of choosing the draw with the largest measurement in the sample. Since the distribution is known exactly, and since the largest measurement in a sample remains the largest under all monotonic transformations of its variable, we lose no generality by assuming that F is the standard uniform: F(x) = x and density f(x) = 1, on 0 ≤ x ≤ 1. If we make this transformation, we gain some simplicity in the intermediate work. From now on we treat the standard uniform.


Our questions are: What strategy maximizes the probability of winning? What is that optimal probability? What is the asymptotic (n → ∞) probability of winning? If n were modest, say 10, and the first measurement were large, say 0.9998, we would choose it because the chance that the other nine numbers are smaller is $(0.9998)^{9} \approx 0.9982$. Thus in the full-information problem, as opposed to the classical problem treated in section 2, no buildup of experience is needed to set a standard, and a profitable choice can sometimes be made immediately.

3b. the optimum strategy for the full-information game

Before deriving the optimum strategy, we describe a class of strategies that includes the optimum. Corresponding to each draw, assign a decision number. As the drawing proceeds, choose as the largest the first candidate whose value exceeds its decision number. We shall see that the optimum decision numbers depend only upon the number of remaining draws. To get the optimum strategy, we deduce the optimum decision numbers by working backward from the last draw. If we arrive at the final draw before choosing, we automatically win or lose according as it is or is not the largest. Thus the decision number for the final draw is zero.

Suppose we have one number yet to draw and the one in hand with value x is a candidate. For such a candidate, whatever the value of x, the probability that the final draw wins is 1 − x. Then, if x > 1/2 we should choose it; if x < 1/2, reject it; and we are indifferent to x = 1/2. For convenience of exposition, let us say that we choose the draw if x ≥ 1/2. Therefore the optimum decision number for the next to last draw is 1/2. Except for the decision number 0 associated with the final draw, all our optimum decision numbers are indifference numbers, numbers such that we are indifferent between choosing a candidate with that value and going on.
Let $b_n$ be the optimum decision number for the first draw (nth from the end), $b_i$ that for the (n − i + 1)st draw, and $b_1 = 0$ that for the nth draw. Suppose that we have not yet chosen and that the (n − i)th draw is a candidate. There are i draws left. We wish to find a decision or indifference number x (= $b_{i+1}$) such that if x were the value of the (n − i)th draw, the probability of winning with it would be equal to the probability of winning later if the best strategy is used. (To win later, the largest number must be drawn later and we must choose it.) Then if the number in hand exceeds the decision number x, we have a larger chance to win with it than if we pass it. (The decision numbers obviously decrease as we go through the draws, because with fewer draws to go, we have less chance of getting a high number.)

Pretend we have a draw of value x in hand. The probability that the next i numbers are all smaller than x is $x^{i}$. Though we are indifferent to x now, we would not be on a later draw, because the decision number would then be smaller than x. Consequently, we would choose on a later draw the first candidate whose value is at least as large as x.

Suppose we go on. If only one number larger than x occurs later, we choose it; if two occur later, the chance is 1/2 that we choose the larger when we choose the first one; if three occur, the chance of choosing the largest is 1/3; and so on. Thus the probability of winning, if we go on, is

$$\frac{1}{1}\binom{i}{1}x^{i-1}(1-x) + \frac{1}{2}\binom{i}{2}x^{i-2}(1-x)^{2} + \cdots + \frac{1}{i}\binom{i}{i}(1-x)^{i}.$$

To find the value of x, the decision number for draw n − i, we must solve

$$x^{i} = \sum_{j=1}^{i}\binom{i}{j}x^{i-j}(1-x)^{j}/j \quad\text{or}\quad 1 = \sum_{j=1}^{i}\binom{i}{j}z^{j}/j, \tag{3b-1}$$

where z = (1 − x)/x. For i = 1, we get x = 1 − x, or x = 1/2, as found earlier. Thus $b_2 = 1/2$. For i = 2, after simplifying we get $5x^{2} - 2x - 1 = 0$, or $x = b_3 = (1 + \sqrt{6})/5 \approx 0.6899$. For modest values of i we can solve equation (3b–1) for values of x and get numerical results, of which those in the first column of the body of Table 7 are illustrative. We want an approximate solution. Some power series work on (3b–1) yields the second-order and first-order approximations in that sequence:

$$x = b_{i+1} \approx \frac{1}{1 + \dfrac{c}{i} + \dfrac{a_1}{i^{2}}} \approx \frac{1}{1 + \dfrac{c}{i}}, \tag{3b-2}$$

where

$$1 = c + \frac{c^{2}}{2!\,2} + \frac{c^{3}}{3!\,3} + \cdots, \tag{3b-3}$$

with numerical value c = 0.80435226286 · · · , and $a_1$ = 0.183199 · · · . The second and third columns of the body of Table 7 compare the second-order and first-order approximations of (3b–2) with values precisely computed from equation (3b–1). Later considerations will make it clear that little would be lost in P (win) from using such approximations.

Now that we have determined the strategy, we want the probability of choosing the largest number when we use this strategy.
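Equation (3b–1) has no closed-form solution for general i, but its left side is increasing in z, so a bisection search works. The Python sketch below computes the indifference numbers this way, solves (3b–3) for c in the same manner, and evaluates the two approximations of (3b–2). The function names, the tolerance, and the truncation of the series at 60 terms are choices made here; the value of $a_1$ is simply copied from the text.

```python
from math import comb, factorial


def bisect_increasing(func, target, lo, hi, tol=1e-13):
    """Solve func(z) = target for an increasing func with func(lo) < target <= func(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def decision_number(i):
    """b_{i+1}: the indifference number when i draws are still to come, from (3b-1)."""
    if i == 0:
        return 0.0
    z = bisect_increasing(
        lambda z: sum(comb(i, j) * z ** j / j for j in range(1, i + 1)),
        1.0, 0.0, 1.0,                  # the sum is 0 at z = 0 and at least i at z = 1
    )
    return 1.0 / (1.0 + z)              # z = (1 - x) / x


# Constant c of (3b-3); the series is truncated at 60 terms.
C = bisect_increasing(
    lambda c: sum(c ** j / (factorial(j) * j) for j in range(1, 61)),
    1.0, 0.0, 1.0,
)
A1 = 0.183199                            # value quoted in the text


def first_order(i):
    return 1.0 / (1.0 + C / i)


def second_order(i):
    return 1.0 / (1.0 + C / i + A1 / i ** 2)


for i in range(1, 6):
    print(i + 1, decision_number(i), second_order(i), first_order(i))
```

Each printed row corresponds to a row of Table 7, with the first column giving i + 1.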

3c. probability of winning the full-information game with strategies having monotone decreasing decision numbers

Next, we obtain a general formula for the exact probability of winning the one-choice, full-information game with n trials for any strategy whose decision numbers are nonincreasing as the draws progress. In section 3b, it was convenient to number the decision numbers according to the position of the draw, counting from the end, calling the first optimum decision number $b_n$ and the last $b_1$. In the present derivation, it is more convenient to index the decision numbers with the number of the draw, the reverse of the ordering in section 3b. Let us call these newly indexed decision numbers, which are not necessarily optimum, $d_1, d_2, \cdots, d_n$, where

$$d_1 \ge d_2 \ge \cdots \ge d_n, \qquad 1 \ge d_i \ge 0.$$

Thus $d_1$ is the decision number for the first draw. For our uniform distribution, $d_i$ is the probability that the number of the ith draw is less than its decision number. For a general distribution F, $d_i$ has the same probabilistic interpretation, but the decision number would be $k_i$ such that $F(k_i) = d_i$. Next we prove

Theorem 4. For any strategy with monotone nonincreasing decision numbers $d_i$, i = 1, 2, · · · , n, the probability P (r + 1) of winning at draw r + 1 is

$$
\begin{aligned}
P(1) &= (1 - d_1^{n})/n,\\
P(r+1) &= \sum_{i=1}^{r} d_i^{r}/[r(n-r)] - \sum_{i=1}^{r} d_i^{n}/[n(n-r)] - d_{r+1}^{n}/n, \qquad n-1 \ge r \ge 1.
\end{aligned}
\tag{3c-1}
$$

Proof. First we find the chance that the ith draw is largest of the first r and we get to the (r + 1)st draw without choosing. Consider the number $d_i$, 1 ≤ i ≤ r. The probability that the first r draws are less than $d_i$ is $d_i^{r}$. The probability of the ith draw being largest among the first r and of all r draws being less than $d_i$ is $d_i^{r}/r$. Possibly there never will be a choice if the ith draw is largest and no draw among the n was so large. The probability of no choice is $d_i^{n}/n$. The difference $d_i^{r}/r - d_i^{n}/n$ is the probability that the ith draw is smaller than its decision number, largest in the first r, but not the largest in the sample. Since the $d_i$ are monotone decreasing, this condition insures that no candidate before the ith draw could have been larger than its decision number. Summing the difference over the first r draws gives the probability that no choice is made among the first r draws and that the largest draw is among the last n − r. Given this information, the probability that the (r + 1)st draw is largest is 1/(n − r). Thus we can write the probability that no draw among the first r is chosen and that the largest is at draw r + 1 as

$$\left[\sum_{i=1}^{r}\left(d_i^{r}/r - d_i^{n}/n\right)\right]/(n-r).$$

If we choose the (r + 1)st draw when it is largest, we win. The chance that we do not take the (r + 1)st and it is largest is dnr+1 /n. And so at last we get, for the probability of winning, equation (3c–1) quoted in Theorem 4 for r > 0. The case of the first draw is special. The chance that the first draw is largest and all draws are less than d1 is dn1 /n. The probability that the first

20 Recognizing the Maximum of a Sequence

377

TABLE 7. OPTIMUM DECISION OR INDIFFERENCE NUMBERS bi+1 AND THEIR APPROXIMATIONS FOR THE FULL-INFORMATION GAME Computed from

Computed from

1 2 3 4 5

Eq. (3b–1) 0 .50000000 .68989795 .77584508 .82458958

SecondOrder Approx. 0 .503132 .690619 .776113 .824716

FirstOrder Approx. 0 .554215 .713177 .788571 .832578

6 7 8 9 10

.85594922 .87780702 .89391004 .90626530 .91604417

.856019 .877849 .893938 .906284 .916058

11 12 13 14 15

.92397614 .93053912 .93605929 .94076683 .94482887

16 17 18 19 20 21 22 23 24 25

i+1

26 27 28 29 30

Eq. (3b–1) .96855307 .96973619 .97083353 .97185407 .97280561

SecondOrder Approx. .968554 .969737 .970834 .971855 .972806

FirstOrder Approx. .968829 .969992 .971071 .972075 .973012

.861423 .881789 .896935 .908642 .917960

31 32 33 34 35

.97369491 .97452791 .97530977 .97604506 .97673783

.973695 .974528 .975310 .976045 .976738

.973888 .974709 .975480 .976206 .976889

.923986 .930547 .936065 .940772 .944833

.925553 .931860 .937181 .941732 .945668

36 37 38 39 40

.97739165 .97800972 .97859490 .97914975 .97967655

.977392 .978010 .978595 .979150 .979677

.977535 .978145 .978723 .979272 .979792

.94836963 .95148338 .95424297 .95670555 .95891663

.948373 .951486 .954245 .956707 .958918

.949106 .952134 .954823 .957225 .959385

41 42 43 44 45

.98017740 .98065414 .98110851 .98154201 .98195608

.980178 .980654 .981109 .981542 .981956

.980288 .980759 .981209 .981638 .982047

.96091288 .96272413 .96437496 .96588577 .96727367

.960914 .962725 .964376 .965887 .967274

.961337 .963110 .964728 .966210 .967572

46 47 48 49 50

.98235197 .98273085 .98309382 .98344183 .98377582

.982352 .982731 .983094 .983442 .983776

.982439 .982815 .983174 .983519 .983850

51

.98409659 .984097

.984168

i+1

378

John P. Gilbert and Frederick Mosteller

is largest is 1/n. Their difference $(1 - d_1^n)/n$ is the chance of winning on the first draw. This completes the demonstration of Theorem 4.

Summing the probabilities of Theorem 4 over r from 0 to n−1 gives us the probability of winning with a strategy having monotone decreasing decision numbers.

Example. For n = 3, let $d_1 = d_2 = d$, $d_3 = 0$. Then $P(\text{win}) = 1/3 + d/2 + d^2 - 3d^3/2$. For strategies of this form, the best value of d is $(2 + \sqrt{13})/9 \approx 0.6228$, and for that d, P(win) ≈ 0.6702.

3d. numerical results for the full-information game

For the optimum strategy described in section 3b, $d_i = b_{n-i+1}$. These b's of Table 7 were substituted for the d's in equation (3c–1), and we obtained by high-speed computation the probabilities of winning for n = 3(1)5(5)20(10)50. The results are shown in Table 8. When n = 1, we are sure to win; when n = 2 the probability goes down to 0.75; and, indeed, the probability of winning decreases monotonically as n increases.

TABLE 8. PROBABILITIES OF WINNING ONE-CHOICE GAMES OF LENGTH n WITH KNOWN DISTRIBUTION USING OPTIMUM STRATEGY

   n    P(win)        n    P(win)
   1    1.000000     10    .608699
   2    .750000      15    .598980
   3    .684293      20    .594200
   4    .655396      30    .589472
   5    .639194      40    .587126
                     50    .585725
                      ∞    .580164
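The n = 3 example that precedes section 3d can be checked numerically. The sketch below is only an illustration: the function name `p_win_n3` and the grid search are assumptions introduced here, not part of the paper. It evaluates the cubic $1/3 + d/2 + d^2 - 3d^3/2$ and compares a brute-force maximizer with the closed form $(2 + \sqrt{13})/9$.

```python
import math

def p_win_n3(d: float) -> float:
    """P(win) for n = 3 with decision numbers d1 = d2 = d, d3 = 0."""
    return 1 / 3 + d / 2 + d**2 - 1.5 * d**3

# Closed-form maximizer: root of the derivative 1/2 + 2d - (9/2)d^2 = 0.
d_star = (2 + math.sqrt(13)) / 9

# Independent brute-force check on a grid over [0, 1] (hypothetical helper).
grid = [k / 10000 for k in range(10001)]
d_grid = max(grid, key=p_win_n3)

print(f"closed form d = {d_star:.4f}, P(win) = {p_win_n3(d_star):.4f}")
print(f"grid search d = {d_grid:.4f}")
```

The grid search is deliberately crude; it is there only to confirm that the closed-form root is the maximum on [0, 1] rather than a minimum.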

Figure 2 suggests that P(win) is approximately linear in 1/n. If we use the values at n = 40 and n = 50 to extrapolate to the origin, we get an estimated asymptotic value slightly larger than 0.58012, which agrees closely with the limiting value shown in Table 8. The limiting value was obtained by passing to the limit in equation (3c–1) and replacing sums by integrations. The integrations in turn were evaluated by a high-speed computer.

Full-information game with simple strategies. Single decision number. Use the same decision number d for each draw, and choose the first candidate exceeding d. Without developing the theory in full detail, we can sketch the asymptotic ideas for the best in this class of strategies. As n grows large, the best value of d tends to 1 but in such a manner that n(1 − d) tends to a constant µ. That is, the expected number of draws that exceed d tends to

20 Recognizing the Maximum of a Sequence


a constant independent of n. Furthermore, Poisson theory can be applied to compute the asymptotic probability of winning as follows:

$$\lim_{n\to\infty} P(\text{win}) = e^{-\mu}\mu + e^{-\mu}\frac{\mu^2}{2!\,2} + e^{-\mu}\frac{\mu^3}{3!\,3} + \cdots .$$

Figure 2. Approximate linearity of P (win) in 1/n for the game with known distribution

The optimum asymptotic value of µ ≈ 1.503, which produces a limiting probability of winning of about .51735. Example. When this asymptotic theory is applied to the extremely unfavorable case of n = 2, it gives d ≈ 0.25. The probability of winning with d = 1/4 is about 0.6662, while the optimum d for this strategy is d=1/3 with probability of winning 2/3 ≈ 0.6667. Decision numbers d and 0. Use the decision numbers d1 = d2 = · · · = dt = d, and dt+1 = dt+2 = · · · = dn = 0. For large values of n, the casual choice of t = n/2 gives the probability of winning as 0.551. Using theory we omit, we maximize the probability of winning by optimizing both d and t; we find optimum t ≈ 0.6489n, d ≈ 1/(1 + 1.17/n), and the asymptotic probability of winning is about 0.56386. The outstanding feature of this result is that such a simple strategy can produce results nearly as good as those of the optimum strategy, a loss in probability of only about 0.015. An important part of the reason is that for large values of n, the size of the early draws rather than the di ’s usually sets the standard for candidates in the late draws. The following remarks make this reasoning more quantitative.
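The Poisson limit for the single-decision-number strategy is easy to evaluate directly. The following is a minimal sketch, assuming the series $e^{-\mu}\sum_{k\ge1}\mu^k/(k!\,k)$ is truncated after 60 terms; the helper names `p_win_single` and `golden_max` are introduced here for illustration and do not come from the paper.

```python
import math

def p_win_single(mu: float, terms: int = 60) -> float:
    """Asymptotic P(win): exp(-mu) * sum_{k>=1} mu^k / (k! * k), truncated."""
    total = 0.0
    term = mu  # holds mu^k / k!, starting at k = 1
    for k in range(1, terms + 1):
        total += term / k
        term *= mu / (k + 1)
    return math.exp(-mu) * total

def golden_max(f, lo: float, hi: float, tol: float = 1e-7) -> float:
    """Golden-section search for the maximizer of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c, d = b - g * (b - a), a + g * (b - a)
        if f(c) < f(d):
            a = c
        else:
            b = d
    return (a + b) / 2

mu_opt = golden_max(p_win_single, 0.5, 4.0)
print(f"optimum mu ~ {mu_opt:.3f}, limiting P(win) ~ {p_win_single(mu_opt):.5f}")
```

Golden-section search is used only because the objective appears to have a single interior maximum on the bracket chosen; any one-dimensional optimizer would serve.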


Theory we do not give shows that, for large n, the probability of winning on any draw with the optimum strategy of section 3b is roughly $(1 - e^{-c})/n$ (as determined by equation (3b–3), c = 0.804 ···). The chance that all measurements prior to draw i are less than $d_i$ and that draw i exceeds $d_i$ and is the maximum draw is easily found to be $d_i^{\,i-1}(1 - d_i^{\,n-i+1})/(n - i + 1)$. If i = αn, 0 < α < 1, the limiting value of this probability under the optimum strategy is $e^{-c\alpha/(1-\alpha)}(1 - e^{-c})/[n(1 - \alpha)]$. For small values of α (early draws) the factor $e^{-c\alpha/(1-\alpha)}/(1 - \alpha)$ is near unity, and even for α = 1/2 it is 0.9. But by α = 2/3 it is down to 0.6 and by α = 4/5 to 0.2. Thus toward the end of the drawings, few wins are gained from choices that merely pass the decision-number test; most are gained by exceeding an earlier draw whose value is larger than the decision number. Consequently, setting $d_i = 0$ for late draws does little damage.

3e. asymptotic probability of winning on draw i

Theorem 4 of section 3c gives the probability of winning under the optimum strategy at each draw. One would correctly anticipate that as n increases, the probability of winning at a given draw tends to zero. On the other hand, nP(win at draw i) tends to a constant for i/n tending to a constant λ. In Figure 3 we show the limiting curve of nP(win at i) plotted against λ. For most of the range, the ordinates are close to 0.6.

3f. asymptotic theory for the r-choice, full-information game using identical decision numbers

In the full-information game, suppose r choices are permitted, and if any is the largest, the player wins. While one could no doubt fight one's way through the algebra to produce some optimum results, we settle here for a simpler strategy. We find a standard decision number d that depends on n and r and choose the first candidates that exceed d. To find the optimum d for fixed r and large n, proceed as in the simple strategy of the one-choice game in section 3d.

Let n(1 − d) = µ, the expected number of draws whose values exceed d. Approximate the binomial probability of exactly t members of the sample exceeding d by its Poisson limit $P(t) = e^{-\mu}\mu^t/t!$. Suppose t numbers exceed d. For t ≤ r, the probability of choosing the largest number is unity, but for t > r, we must compute the chance that the largest is among the chosen r. This chance is identical to the probability of winning the no-information dowry game with r choices and t draws in all under the strategy of choosing the first r candidates. Note that not all t numbers need be candidates, and only when they are in ascending order are they all candidates. As in the dowry problem, the expected number of candidates among the t is 1 + 1/2 + ··· + 1/t.


Figure 3. Asymptotic probability of winning at draw i.

The first member of the set of t (supposing t > 0) is a candidate, therefore a choice, and the probability that it is largest is 1/t. The probability that the second choice among t ≥ 2 is the largest is

$$T_2(t) = \frac{1}{t}\left(1 + \frac{1}{2} + \cdots + \frac{1}{t-1}\right).$$

Similarly, the probability that the third choice among t ≥ 3 is largest is

$$T_3(t) = \frac{1}{t}\left[\frac{1}{2}(1) + \frac{1}{3}\left(1 + \frac{1}{2}\right) + \cdots + \frac{1}{t-1}\left(1 + \frac{1}{2} + \cdots + \frac{1}{t-2}\right)\right].$$

And more generally, the probability that the rth choice is largest is

$$T_r(t) = \frac{1}{t}\sum_{x_{r-1}=r-1}^{t-1}\ \cdots\ \sum_{x_2=2}^{x_3-1}\ \sum_{x_1=1}^{x_2-1} \frac{1}{x_1 x_2 \cdots x_{r-1}}, \qquad t \ge r. \tag{3f-1}$$

Incidentally,

$$T_r(t) = \frac{1}{t}\sum_{\lambda=r-1}^{t-1} T_{r-1}(\lambda).$$


For completeness, let $T_i(t) = 0$ for i > t, and then we can write for the probability of choosing the largest number among t that exceed d, given r choices, as

$$U_t = \sum_{i=1}^{r} T_i(t), \qquad t = 1, 2, \cdots, n. \tag{3f-2}$$

The interesting cases are r < t, because for r ≥ t, $U_t = 1$. Finally, the asymptotic probability of winning the game with this strategy is

$$P(\text{win with } r \text{ choices}) = \sum_{t=1}^{n} U_t P(t). \tag{3f-3}$$
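The weights $U_t$ and the sum (3f-3) can be evaluated with a few lines of code. The sketch below is an illustration under stated assumptions: it uses the recursion $T_r(t) = \frac{1}{t}\sum_{\lambda=r-1}^{t-1} T_{r-1}(\lambda)$ with $T_1(t) = 1/t$, truncates the Poisson sum at a hypothetical `t_max` = 120, and locates the best µ by a coarse grid search. The names `u_weights`, `p_win` and `best_mu` are introduced here and are not from the paper.

```python
import math

def u_weights(r: int, t_max: int) -> list:
    """U_t = sum_{i<=r} T_i(t) for t = 0..t_max (index 0 unused, set to 0)."""
    prev = [0.0] + [1.0 / t for t in range(1, t_max + 1)]  # T_1(t) = 1/t
    u = prev[:]
    for i in range(2, r + 1):
        cur = [0.0] * (t_max + 1)      # T_i(t) = 0 for t < i
        running = 0.0
        for t in range(i, t_max + 1):
            running += prev[t - 1]     # accumulates T_{i-1}(lambda), lambda <= t-1
            cur[t] = running / t
        u = [a + b for a, b in zip(u, cur)]
        prev = cur
    return u

def p_win(r: int, mu: float, t_max: int = 120) -> float:
    """Sum of U_t times the Poisson probability exp(-mu) mu^t / t!."""
    u = u_weights(r, t_max)
    prob, total = math.exp(-mu), 0.0
    for t in range(1, t_max + 1):
        prob *= mu / t
        total += u[t] * prob
    return total

def best_mu(r: int, lo: float = 0.5, hi: float = 15.0, step: float = 0.005) -> float:
    """Coarse grid search for the mu that maximizes p_win(r, mu)."""
    k_max = int(round((hi - lo) / step))
    return max((lo + step * k for k in range(k_max + 1)), key=lambda m: p_win(r, m))

for r in (1, 2, 3):
    m = best_mu(r)
    print(r, round(m, 3), round(p_win(r, m), 4))
```

As the text notes, the same weights could be applied to binomial rather than Poisson probabilities when n is modest; only the loop over `prob` would change.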

For each Poisson parameter µ, one can by direct calculation find the probability of winning and, by varying µ, obtain the optimum strategy for moderate values of r. We give numerical results in Table 9. Though we have obtained only the asymptotic optimum for this strategy, the weights $U_t$ can be applied directly to binomial probabilities to obtain probabilities of winning with this strategy when n is modest. Earlier we found that the information in the one-choice, full-information game was worth about as much as two choices in the dowry game. The numbers in Table 9 suggest that r choices in the full-information game are worth about 2r choices in the dowry game of section 2c.

4. game against an opponent

Recall the one-choice dowry problem of section 2. Suppose that an opponent (footnote 2) is allowed to choose the order of the members of the sequence and that the player is restricted to information about rank as in section 2. What, then, is the highest probability of choosing the largest member that the player can achieve? In section 4a, we report that the player can be limited to a probability of 1/n of winning.

If the opponent is more restricted in his moves, the player has a better chance. For example, suppose, as we do in section 4b, that the opponent may choose in advance a position from 1 to n and assign it the largest rank without disturbing the randomness of the drawings. How, then, shall the player move to try to find the largest rank, and how shall the opponent choose the position for the largest number?

This particular game suggested itself as a nonparametric analogue for an atomic bomb inspection program with only one inspection. The idea is that time is quantized into n discrete periods. In each period a variable is observed giving one value for the period. Suppose that when a large bomb is set off in some period it produces for its period the largest measurement of the n. And suppose further that this explosion does not disturb

Footnote 2. We shall continue to call the person hunting for the largest number “the player” and we shall call the person, newly introduced in this section, who changes the random sequence “the opponent.”


TABLE 9. ASYMPTOTIC PROBABILITY OF WINNING THE FULL-INFORMATION GAME WITH r CHOICES USING OPTIMUM STRATEGY WITH IDENTICAL DECISION NUMBERS

   r    Optimum µ    Optimum P(win)
   1      1.503        0.5174
   2      2.435        0.7979
   3      3.485        0.9254
   4      4.641        0.9753
   5      5.890        0.9926
   6      7.225        0.9980
   7      8.637        0.99949
   8     10.121        0.99988
   9     11.672        0.999974
  10     13.284        0.9999946

the random course of the other measurements. Then using rank information as in section 2, the player wants to immediately choose for inspection the period with the largest measurement, and the opponent wishes to minimize the player's chance of inspecting during this period. The unrealities in the analogy will be obvious to the reader. For example, if we know nothing about the distribution, how can we be sure the bomb produces the largest measurement? What about small bombs? What about more inspections? Well, we do look into two inspections in section 4c.

4a. opponent arranges sequence

Numbers are presented to the player searching for the largest in an order determined by an opponent who wishes to minimize the player's chance of finding the largest among n numbers. As in the dowry problem, after each draw, the rank of the new number among those so far drawn is reported to the player who must reject or choose the number before the next draw. The opponent can reduce the player's chance to win to 1/n by using the rows of a cyclic Latin square with equal probabilities. For example, for n = 5 he may choose with probability 1/5 rows from the 5 sequences:

   1 2 3 4 5
   2 3 4 5 1
   3 4 5 1 2
   4 5 1 2 3
   5 1 2 3 4

Of course, the opponent cannot reduce the probability of winning below 1/n because the player can always choose a position at random.

4b. finding the largest when the opponent places the largest one

The problem. Drawings are made at random without replacement from among n numbers, as in the dowry problem of section 2. The player, who has


one choice, must choose or reject each number when it appears, and he wins if and only if he chooses the largest. Before drawing begins, the opponent, who has no knowledge of the sequence, must choose one position and replace the number drawn there by a number larger than any other in the sequence. The player wishes to maximize his probability of choosing the largest, the opponent to minimize it. Thus the minimax criterion is appropriate. As an example, suppose the opponent chose position 2 and that the random drawing for n = 4 produced the ranks 1324 in that order; then the new ranks of the four numbers, adjusted for the opponent’s choice, would be 1423. In formulating this problem, we have restricted the strategies of the opponent, the placer of the largest number, to a probability mixture of the n pure strategies: he places the largest number in position i with probability pi . Similarly, we restrict the player to a probability mixture of the following pure strategies: beginning with the ith position, choose the first candidate. To get intuitive ideas about the solution to the problem with strategies more general than those above, we studied the case n = 4 allowing more complicated strategies for both players than those described in the previous paragraph. The results turned out to be equivalent to what could be achieved with the more limited strategies described above. Let us study first what happens to the random sequences when the opponent replaces the rank of the first position by n, because the analysis is the same for all other positions. Originally there were n! arrangements, with (n − 1)! starting with each of the ranks 1, 2, · · · , n. Those beginning with rank 1 are followed by every possible permutation of the numbers 2, 3, · · · , n. When rank 1 is replaced by rank n, each following rank is reduced by 1 and we are left with the (n − 1)! permutations of the numbers 1, 2, 3, · · · , n − 1. 
Much the same adjustment applies to the other starting ranks, so that all told after adjustment we have n sets of (n − 1)! permutations following the first number n. One set of (n − 1)! permutations is indistinguishable from another, and from the player’s point of view there is just one set of (n − 1)! permutations. A similar argument applies to choices of each of the positions by the opponent. Consequently, the player may take the view that every permutation with the largest member in the ith position has probability pi /(n − 1)!, where, as before, pi is the probability that the opponent places the largest number in position i. The player has pure strategies Si , i = 1, 2, · · · , n, where Si means choose the ith element if it is a candidate, otherwise choose the first candidate thereafter. Our plan is to find a strategy for the opponent and to find the optimum probability of winning for the player against this strategy. That gives a probability the opponent can enforce. Then we show that the player can also enforce this same probability, and thus we get a solution for the game. Table 10 shows, for each strategy Si , the probability that the player chooses the largest, given that it is in the rth position. One way to locate a good strategy for the opponent is to find one that produces the same probability of


winning against every strategy of the player. Thus we can take advantage of the triangle of zeros in Table 10. Let us initially assign numbers proportional to $p_i$ and normalize later. Let $p_i \propto W_i$ and take $W_n = 1$. Then if $S_n$ is used, the probability that the player wins is proportional to 1 · 1 = 1. If $S_{n-1}$ is used, we then want the probability that the player wins to be proportional to

$$\sum_{i=1}^{n-2} W_i(0) + W_{n-1}(1) + 1\cdot\frac{n-2}{n-1} = 1,$$

TABLE 10. PROBABILITY THAT LARGEST AT POSITION r IS DETECTED BY STRATEGY S_i

Strategy                              Position of Largest (r)
of Player    1    2    3     4     ···   i          i+1        ···   n−2           n−1           n
S_1          1    0    0     0     ···   0          0          ···   0             0             0
S_2          0    1    1/2   1/3   ···   1/(i−1)    1/i        ···   1/(n−3)       1/(n−2)       1/(n−1)
S_3          0    0    1     2/3   ···   2/(i−1)    2/i        ···   2/(n−3)       2/(n−2)       2/(n−1)
···
S_i          0    0    0     0     ···   1          (i−1)/i    ···   (i−1)/(n−3)   (i−1)/(n−2)   (i−1)/(n−1)
S_{i+1}      0    0    0     0     ···   0          1          ···   i/(n−3)       i/(n−2)       i/(n−1)
···
S_{n−2}      0    0    0     0     ···   0          0          ···   1             (n−3)/(n−2)   (n−3)/(n−1)
S_{n−1}      0    0    0     0     ···   0          0          ···   0             1             (n−2)/(n−1)
S_n          0    0    0     0     ···   0          0          ···   0             0             1

which yields $W_{n-1} = 1/(n - 1)$. Continuing in this vein we find $W_i = 1/i$, i = 1, 2, ···, n − 1, $W_n = 1$. Thus the opponent's strategy uses

$$p_i = \frac{1}{i}\bigg/\left[1 + \sum_{j=1}^{n-1} 1/j\right]. \tag{4-1}$$

This means in turn that the player's probability of choosing the largest number can be reduced to

$$1\bigg/\left[1 + \sum_{j=1}^{n-1} 1/j\right] \tag{4-2}$$

when the opponent uses the proposed strategy. Next we develop a mixture of strategies for the player that makes all positions r for the largest number lead to the same probability of a win for the player. This mixed strategy gives a probability of winning that the player can enforce. The result is identical with expression (4–2), which shows that we have attained the minimax solution.

Let $\pi_i$ be the probability that $S_i$ is used. Then to get equal probabilities of winning for position r and r + 1 (r > 1), we need

$$\frac{\pi_2}{r-1} + \frac{2\pi_3}{r-1} + \frac{3\pi_4}{r-1} + \cdots + \frac{r-2}{r-1}\pi_{r-1} + \pi_r = \frac{\pi_2}{r} + \frac{2\pi_3}{r} + \cdots + \frac{r-2}{r}\pi_{r-1} + \frac{r-1}{r}\pi_r + \pi_{r+1}.$$

This is equivalent to requiring

$$\pi_2 + 2\pi_3 + \cdots + (r-1)\pi_r = r(r-1)\pi_{r+1}.$$

We can solve these equations for the π's, and we find

$$\pi_1 = 1/T, \qquad \pi_i = 1/[(i-1)T], \quad i = 2, 3, \cdots, n,$$

where

$$T = 1 + \sum_{i=1}^{n-1} 1/i$$

is the normalizing factor. With this mixed strategy for the player, the probability of his choosing the largest number when it is at any position r is the same as for any other. We find from r = 1 the probability of winning 1/T, which is also the value the opponent can enforce (see expression (4–2)). Therefore we have solved the game. For example, for n = 4,

   p_1 = 6/17,   p_2 = 3/17,   p_3 = 2/17,   p_4 = 6/17

and

   π_1 = 6/17,   π_2 = 6/17,   π_3 = 3/17,   π_4 = 2/17,

and the value of the game is 6/17 ≈ 0.353. For large values of n, the value of the game is approximately 1/(1 + C + log(n − 1)), where C is Euler's constant, 0.577 ···; for n = 4, this asymptotic estimate gives 0.374.
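The minimax solution of section 4b can be verified with exact rational arithmetic. The sketch below is illustrative only: `payoff` encodes the entries of Table 10, and `opponent_mix`, `player_mix` and `game_value` are hypothetical helper names for the mixtures given by (4–1) and by the π's above. It checks that every pure strategy of one side earns exactly 1/T against the other side's mixture.

```python
from fractions import Fraction

def payoff(i: int, r: int) -> Fraction:
    """Table 10 entry: P(strategy S_i detects the largest at position r)."""
    if i == 1:
        return Fraction(1 if r == 1 else 0)
    if r < i:
        return Fraction(0)
    if r == i:
        return Fraction(1)
    return Fraction(i - 1, r - 1)

def game_value(n: int) -> Fraction:
    """1/T with T = 1 + sum_{j=1}^{n-1} 1/j."""
    return 1 / (1 + sum(Fraction(1, j) for j in range(1, n)))

def opponent_mix(n: int) -> list:
    """p_i proportional to W_i = 1/i for i < n and W_n = 1."""
    v = game_value(n)
    return [v / i for i in range(1, n)] + [v]

def player_mix(n: int) -> list:
    """pi_1 = 1/T and pi_i = 1/((i-1)T) for i = 2..n."""
    v = game_value(n)
    return [v] + [v / (i - 1) for i in range(2, n + 1)]

n = 4
p, pi = opponent_mix(n), player_mix(n)
print("p  =", [str(x) for x in p])
print("pi =", [str(x) for x in pi])
# Win probability of each pure strategy S_i against the opponent's mixture.
print([str(sum(p[r - 1] * payoff(i, r) for r in range(1, n + 1)))
       for i in range(1, n + 1)])
```

Using `Fraction` keeps the equalization exact, so the check does not depend on floating-point tolerance.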


4c. finding the largest when the opponent places the largest one and the player has two choices

The problem. As in the previous game where the player has one choice, initially all permutations of the ranks 1, 2, ···, n are assumed equally likely. The opponent chooses in advance position i with probability $p_i$ and makes the rank of the number in that position n. Then the remaining positions are re-ranked. The player's pure strategies $S_{ij}$ are limited by us to (i, j) pairs, 1 ≤ i < j ≤ n. He may mix these strategies by assigning probabilities to them. Under strategy $S_{ij}$ the player chooses the number in the ith position if it is a candidate, and if not, the first candidate thereafter. His second choice may not be used until the jth position, even though more than one candidate may have appeared in the positions i to j − 1. Thus i and j play the roles of r and s in the two-choice dowry problem of section 2b.

We wish to find minimax strategies for the opponent and player and to obtain the value of the game. Except that the algebra and the argument are more elaborate, the methodology is entirely similar to that of section 4b. The result is that the opponent chooses position i with probability

$$p_1 = \frac{n-2}{(n-1)T}, \qquad p_i = \frac{1}{iT}, \quad 2 \le i \le n-1, \qquad p_n = \frac{1}{T},$$

where

$$T = \frac{n-2}{n-1} + 1 + \sum_{i=2}^{n-1} 1/i = 1 + \sum_{i=1}^{n-2} 1/i.$$

It turns out that the player restricts himself to strategies of the form $S_{1j}$ and $S_{2j}$. Let us describe the mixed strategy of the player by:

$$P(S_{1j}) = \pi_j, \qquad P(S_{2j}) = \tau_j, \qquad \sum_{j=2}^{n} \pi_j + \sum_{j=3}^{n} \tau_j = 1.$$

A family of good strategies for the player has the probabilities

$$\pi_2 = 1 - 2\tau, \qquad \pi_i = 1/(i-1) - \tau/(i-2), \quad 3 \le i \le n,$$

where $\tau = \sum_j \tau_j$. Note that the player could restrict himself to $S_{23}$ and make $P(S_{23}) = \tau$. Because the total probability must be unity,

$$\tau = \frac{\sum_{i=1}^{n-1} 1/i - 1}{1 + \sum_{i=1}^{n-2} 1/i}.$$

With these strategies both the opponent and the player can enforce a probability of winning of

$$[2 - 1/(n-1)]\bigg/\left[2 + \sum_{i=2}^{n-2} 1/i\right],$$


a result approximately double that for the one-choice game. The result suggests, but we have not proved, that for r choices and large n, the probability of winning is about r/[1 + C + log(n − r)].

5. games with other payoffs

In sections 2, 3, and 4, the payoff in each game was essentially 1 for a win, 0 for a loss. The payoff might be proportional to the number actually chosen. For example, some kinds of income are subjected to higher tax rates if they are not reinvested within a fixed period. Investment opportunities arise at a more or less fixed rate, and an opportunity can be appraised as to its income value. As each opportunity arises, it must be accepted or rejected. We shall not consider the fact that the number of opportunities in a given time period may be variable, nor that the appraisal of an opportunity may not be identical with its value.

In the new problem, the optimum strategy depends upon the distribution of the investment values as well as upon the number of investment opportunities and the number of investments one can make. A reasonable criterion for judging a strategy is the expected value of the investment (or investments) one could expect to make using the strategy. We shall discuss only distributions of investment values for which this expected value exists. Furthermore, if we deal with large numbers of opportunities, the property of the distribution we need to know about, and actually do not and cannot know much about in practice, is the shape of the far right-hand tail. Consequently, we investigate the problem for tails of several shapes: the uniform (0), the normal ($e^{-x^2}$), the exponential ($e^{-x}$), and the inverse powers ($1/x^n$), to order them in terms of their ordinates for large x.

The relation of the utility, or payoff, which we have taken here to be x, to the distribution is what counts. For example, a payoff of $\sqrt{x}$ for the uniform distribution on (0, 1) is equivalent to a payoff of x for the distribution with density 2x on (0, 1). Changing the payoff, then, could make other far-right tail shapes natural.

First, we discuss the case of one choice and then that for two or more. To fix ideas, we begin with the uniform distribution and one investment choice in n opportunities.

5a. the uniform game

A sequence of n numbers is to be drawn from the uniform distribution. The player knows n, the number of opportunities, he knows that the distribution is uniform from 0 to 1, and he is informed of the value of each draw after it is made. He must decide after each draw whether to choose or reject it. His score is the value of the draw he chooses. If he has not chosen until the final draw, then he is forced to choose that one. The object is to find a strategy that maximizes the expected value of the score. (Note that the uniform distribution, though general in section 3, here is a special case.)


The optimum strategy. If n = 1, the expected score is 1/2. If n = 2, the player must decide whether to keep the first draw or to go on to the second and last. If he goes on, his expected value is 1/2. Therefore he keeps the current one, x, if x > 1/2, rejects it if x < 1/2, and is indifferent if x = 1/2. Then for n = 2, his expected value is (1/2)(3/4) + (1/2)(1/2) = 5/8.

Let $R_n$ be the expected value of a game of length n under the optimum strategy. Then $R_n$ is also the optimum decision number for the first draw in a game of length n + 1. We get a recurrence relation between the expected value of the game of length n + 1 and that of length n. Let $X_{n+1}$ be the uniform random variable at the (n + 1)st draw from the end. If it exceeds $R_n$, we keep it; and if not, we reject it. If we keep it, our expectation is the average value of a uniform variable, conditional on its being larger than $R_n$. If we reject it, our expectation is $R_n$, that for a game of length n. These two expectations must be weighted by their probabilities $1 - R_n$ and $R_n$, respectively, to give the expectation for a game of length n + 1:

$$R_{n+1} = (1 - R_n)E(X_{n+1} \mid X_{n+1} > R_n) + R_n^2 = (1 - R_n)(1 + R_n)/2 + R_n^2,$$

$$R_{n+1} = \tfrac{1}{2}(1 + R_n^2), \qquad R_1 = 1/2. \tag{5a-1}$$
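Recurrence (5a–1) is straightforward to iterate. The sketch below is one possible way to do so and makes two assumptions of its own: a generic helper named `iterate_uniform` (not from the paper), and the use of exact fractions only for the first few terms, since the denominators double in bit length at each step and floating point is more practical for long runs.

```python
from fractions import Fraction

def iterate_uniform(r1, n_max: int) -> list:
    """Return [R_1, ..., R_n_max] from R_{n+1} = (1 + R_n^2) / 2."""
    values = [r1]
    for _ in range(n_max - 1):
        r = values[-1]
        values.append((1 + r * r) / 2)
    return values

# Exact values for short games.
exact = iterate_uniform(Fraction(1, 2), 4)
print([str(v) for v in exact])

# Floating-point values out to a game of 1000 draws.
approx = iterate_uniform(0.5, 1000)
for n in (10, 100, 1000):
    print(n, f"{approx[n - 1]:.6f}")
```

Because the same function accepts either a `Fraction` or a `float` starting value, the exact and approximate runs share one implementation of the recurrence.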

To repeat, the $R_n$'s play two roles. First, $R_n$ is the optimum decision number for the (n + 1)st draw from the end, and second, it is the optimum expected value of the chosen draw for a game of length n draws. We have $R_1 = 1/2$, $R_2 = 5/8$, $R_3 = 89/128$, $R_4 = 24305/32768$. Calculating $R_n$ recursively is relatively easy, and we have carried it out to n = 1000. Table 11 shows results for selected values and comparisons with an approximation.

Bounds for $R_n$. For the game with uniform distribution, let

$$R_n = 1 - 2/(x_n + 1), \qquad x_1 = 3. \tag{5a-2}$$

Then we can replace the recurrence relation (5a–1) by

$$x_{n+1} = x_n + 1 + 1/x_n. \tag{5a-3}$$

It follows that

$$x_n = n + 2 + \sum_{i=1}^{n-1} 1/x_i. \tag{5a-4}$$

Since the x's are positive, $x_n \ge n + 2$. Consequently,

$$x_n < n + 2 + 1/3 + 1/4 + \cdots + 1/(n+1) = w_n, \qquad n > 1. \tag{5a-5}$$

By studying yn = wn − xn , it is possible to obtain analytically upper and lower bounds for xn and therefore bounds for Rn . The y’s monotonically


TABLE 11. FOR THE UNIFORM GAME, R_n IS THE OPTIMUM DECISION NUMBER FOR THE (n + 1)st DRAW FROM THE END, AND ALSO THE EXPECTED VALUE FOR THE OPTIMUM CHOSEN DRAW IN A GAME HAVING n DRAWS

     n     R_n         Approximation (5a–7)
     1     0.500000
     2     0.625000
     3     0.695312
     4     0.741730
     5     0.775082
     6     0.800376
     7     0.820301
     8     0.836447
     9     0.849821
    10     0.861098    .858806
    20     0.919887    .919392
    30     0.943372    .943183
    40     0.956117    .956025
    50     0.964145    .964093
   100     0.981208    .981200
   200     0.990343    .990341
   300     0.993496    .993495
   400     0.995095    .995095
   500     0.996063    .996063
   600     0.996711    .996711
   700     0.997176    .997176
   800     0.997526    .997526
   900     0.997799    .997799
  1000     0.998017    .998017

increase with n, $y_{10} \approx 0.121$, $y_{1000} \approx 0.303$, and their limit has an upper bound 0.310. From this, bounds for $R_n$ follow:

$$1 - 2/(w_n + 1 - 0.121) \ge R_n \ge 1 - 2/(w_n + 1 - 0.310), \qquad n \ge 10. \tag{5a-6}$$
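The quantities behind inequality (5a–6) can be tabulated directly from (5a–3) and (5a–5). The sketch below is illustrative; the helper names `x_sequence` and `w` are assumptions made here. It computes $y_n = w_n - x_n$ and confirms the two-sided bound on $R_n = 1 - 2/(x_n + 1)$ for 10 ≤ n ≤ 1000.

```python
def x_sequence(n_max: int) -> list:
    """[x_1, ..., x_n_max] from x_1 = 3 and x_{n+1} = x_n + 1 + 1/x_n."""
    xs = [3.0]
    for _ in range(n_max - 1):
        x = xs[-1]
        xs.append(x + 1 + 1 / x)
    return xs

def w(n: int) -> float:
    """w_n = n + 2 + 1/3 + 1/4 + ... + 1/(n + 1)."""
    return n + 2 + sum(1.0 / k for k in range(3, n + 2))

xs = x_sequence(1000)
y = {n: w(n) - xs[n - 1] for n in (10, 100, 1000)}
print({n: round(v, 3) for n, v in y.items()})

# Check both sides of the bound for every n from 10 to 1000.
bound_holds = all(
    1 - 2 / (w(n) + 1 - 0.121) >= 1 - 2 / (xs[n - 1] + 1) >= 1 - 2 / (w(n) + 1 - 0.310)
    for n in range(10, 1001)
)
print("bound (5a-6) holds for 10 <= n <= 1000:", bound_holds)
```

The check covers only the range the paper reports computing; it says nothing about the limit of $y_n$ beyond n = 1000.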

Andrew Gleason [8] was kind enough to suggest this development. If we apply the Euler approximation to the harmonic series and use the lower bound of inequality (5a–6) as an approximation, then for large n

$$R_n \approx 1 - \frac{2}{n + \log(n+1) + 1.767}, \tag{5a-7}$$

and results are shown in Table 11.

We are indebted to a referee for calling our attention to previous work of Leo Moser [13] and Irwin Guttman [14]. Moser developed an approximation very similar to that given above in equation (5a–7) for this uniform game and called attention to the difficulty of studying distributions other than the rectangular. Guttman gave a table of $R_n$ for n = 1(1)10, 20(20)100(50)350.

5b. other distributions

Just as for the uniform distribution, we can derive sets of optimum decision numbers, $R_n$, for other distributions having means. Consider the density f(x) and a game with n draws. We number draws from the end, so that the final


draw has number 1, next to the last number 2, and so on. Using the same argument as that for obtaining recurrence relation (5a–1), we have in general

$$R_{n+1} = \frac{\int_{R_n}^{\infty} x f(x)\,dx}{\int_{R_n}^{\infty} f(x)\,dx}\int_{R_n}^{\infty} f(x)\,dx + R_n\int_{-\infty}^{R_n} f(x)\,dx,$$

or

$$R_{n+1} = \int_{R_n}^{\infty} x f(x)\,dx + R_n\int_{-\infty}^{R_n} f(x)\,dx. \tag{5b-1}$$

(The limits of +∞ and −∞ are formal, considering that f(x) is defined over the whole line.) If f(x) > 0 only for a < x < b, then

$$R_{n+1} = \int_{R_n}^{b} x f(x)\,dx + R_n\int_{a}^{R_n} f(x)\,dx. \tag{5b-2}$$

Thus for the uniform, f(x) = 1, 0 ≤ x ≤ 1, we get

$$R_{n+1} = \int_{R_n}^{1} x\,dx + R_n\int_{0}^{R_n} dx, \qquad R_1 = 1/2,$$

$$R_{n+1} = \frac{x^2}{2}\bigg|_{R_n}^{1} + R_n^2 = (1/2)(1 + R_n^2),$$

as we knew before. The behavior of $R_n$ depends on the form of the distribution and especially of the upper tail. To give variety in the order of tangency with the x-axis, three kinds of distributions in addition to the uniform suggest themselves for investigation: the normal, $e^{-x}$, and $(t-1)/x^t$ for some values of t.

Inverse powers. Consider

$$f(x) = (t-1)/x^t, \qquad 1 \le x < \infty, \quad t > 2.$$

We need t > 2 to get convergence of the integral dealing with the mean. We get the recurrence relation

$$R_{n+1} = R_n + \frac{1}{(t-2)R_n^{\,t-2}}, \qquad R_1 = 1 + 1/(t-2). \tag{5b-3}$$

We have calculated the first 10 R's for t = 2½, 3, 6, 10. In addition, we computed for each t and R the cumulative

$$F(R_n) = \int_1^{R_n} f(x)\,dx = 1 - \frac{1}{R_n^{\,t-1}}, \tag{5b-4}$$

because this quantity is more easily compared across distributions than the R's.


Exponential distribution. For

$$f(x) = e^{-x}, \quad x \ge 0,$$

$$R_{n+1} = R_n + e^{-R_n}, \qquad R_1 = 1, \tag{5b-5}$$

$$F(R_n) = 1 - e^{-R_n}.$$

Folded normal distribution.

$$f(x) = \frac{2}{\sqrt{2\pi}}\,e^{-x^2/2}, \quad x \ge 0,$$

$$R_{n+1} = f(R_n) + R_n F(R_n), \qquad R_1 = 2/\sqrt{2\pi}. \tag{5b-6}$$

Normal distribution.

$$f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}, \quad -\infty < x < \infty,$$

$$R_{n+1} = f(R_n) + R_n F(R_n), \qquad R_1 = 0.$$
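The four recurrences of this section can be iterated side by side. The sketch below is an illustration under stated assumptions: the normal cumulative is computed from `math.erf`, the folded-normal cumulative is taken as $2\Phi(x) - 1$, and the helper names `iterate`, `inverse_power`, `normal_pdf` and `normal_cdf` are introduced here rather than taken from the paper.

```python
import math

def normal_pdf(x: float) -> float:
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def normal_cdf(x: float) -> float:
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def iterate(step, r1: float, n: int = 10) -> list:
    """Return [R_1, ..., R_n] for the recurrence R_{k+1} = step(R_k)."""
    values = [r1]
    for _ in range(n - 1):
        values.append(step(values[-1]))
    return values

def inverse_power(t: float, n: int = 10) -> list:
    """Recurrence (5b-3) for the density (t - 1) / x^t on x >= 1, t > 2."""
    return iterate(lambda r: r + 1 / ((t - 2) * r ** (t - 2)), 1 + 1 / (t - 2), n)

# Recurrence (5b-5): exponential density exp(-x) on x >= 0.
exponential = iterate(lambda r: r + math.exp(-r), 1.0)

# Folded normal: f = 2*phi on x >= 0 and F = 2*Phi - 1.
folded = iterate(lambda r: 2 * normal_pdf(r) + r * (2 * normal_cdf(r) - 1),
                 2 / math.sqrt(2 * math.pi))

# Standard normal on the whole line, starting from the mean R_1 = 0.
normal = iterate(lambda r: normal_pdf(r) + r * normal_cdf(r), 0.0)

for name, seq in [("t=3", inverse_power(3)), ("exp", exponential),
                  ("folded", folded), ("normal", normal)]:
    print(name, [round(v, 4) for v in seq])
```

Passing the one-step map as a function keeps each distribution to a single line, which makes it easy to try other tail shapes with known closed-form partial means.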

In Table 12, we give values for $R_n$ and $F(R_n)$ for the inverse-powers distributions, and in Table 13 for the other four. Note that $F(R_i)$ are in the order to be expected: uniform, normal, folded normal, exponential, descending powers of t. Figures 4 and 5 relate $F(R_n)$ to n/(n + 1). The thought here is that the area to the left of the optimum expected value in a game of length n should be related smoothly to the expected area to the left of the largest order statistic.

Guttman [14] gives results for the standard normal distribution, as we do, for the same grid as he used for the uniform distribution (see close of section 5a). Our Table 13 disagrees with his by two units in the fourth place for n = 2, and by similar amounts up to n = 10. Since the exact value for n = 2 is $1/\sqrt{2\pi}$, and since the computation of $R_n$ is recursive in character, this disagreement raises questions about the digits at and beyond the fourth decimal in Guttman's Table 2 (p. 39). Guttman also studies the normal with mean 1/2 and variance 1/12.

5c. several choices when payoff is value of choice

Let us extend the problem of section 5a to the case of r choices. The chooser knows n the number of opportunities, the distribution of the values of the opportunities, and the number r of choices he must take. The key step in the generalization is in the change from one choice to two. With two choices, the total payoff is the sum of the payoffs for the two chosen investments. As before, $R_n$ is the optimum expected payoff when there are n opportunities of which one must be chosen. Let $S_n$ be the optimum expected total payoff for two investments when there are n opportunities.


TABLE 12. PAYOFF VALUES AND CUMULATIVES FOR INVERSE-POWERS DISTRIBUTION

           t = 2.5            t = 3              t = 6              t = 10
   i     R_i      F(R_i)    R_i     F(R_i)    R_i     F(R_i)    R_i     F(R_i)
   1    3.0000    .8076    2.0000   .7500    1.2500   .6723    1.1250   .6536
   2    4.1547    .8819    2.5000   .8400    1.3524   .7790    1.1737   .7634
   3    5.1359    .9141    2.9000   .8811    1.4271   .8311    1.2084   .8180
   4    6.0184    .9323    3.2448   .9050    1.4874   .8626    1.2359   .8514
   5    6.8337    .9440    3.5530   .9208    1.5385   .8840    1.2589   .8741
   6    7.5987    .9523    3.8345   .9320    1.5831   .8994    1.2787   .8906
   7    8.3243    .9584    4.0953   .9404    1.6229   .9112    1.2962   .9032
   8    9.0175    .9631    4.3394   .9469    1.6589   .9204    1.3119   .9131
   9    9.6835    .9668    4.5699   .9521    1.6920   .9279    1.3261   .9212
  10   10.3262    .9699    4.7887   .9564    1.7225   .9340    1.3392   .9278

TABLE 13. PAYOFF VALUES AND CUMULATIVES FOR UNIFORM, e^{−x}, FOLDED NORMAL, AND NORMAL DISTRIBUTIONS

           Uniform            e^{−x}             Folded Normal      Normal
   i     R_i     F(R_i)     R_i     F(R_i)     R_i     F(R_i)     R_i     F(R_i)
   1    .5000    .5000     1.0000   .6321      .7979   .5751     0        .5000
   2    .6250    .6250     1.3679   .7454     1.0392   .7013      .3989   .6550
   3    .6953    .6953     1.6225   .8026     1.1938   .7674      .6297   .7356
   4    .7417    .7417     1.8199   .8380     1.3074   .8089      .7904   .7854
   5    .7751    .7751     1.9820   .8622     1.3970   .8376      .9127   .8193
   6    .8004    .8004     2.1198   .8799     1.4709   .8587     1.0108   .8439
   7    .8203    .8203     2.2398   .8935     1.5335   .8748     1.0924   .8627
   8    .8364    .8364     2.3463   .9043     1.5878   .8877     1.1621   .8774
   9    .8498    .8498     2.4420   .9130     1.6356   .8981     1.2227   .8893
  10    .8611    .8611     2.5290   .9203     1.6783   .9067     1.2762   .8991

When n = 2, both investments must be taken and the expected total payoff is 2µ, where µ is the mean of the payoff distribution whose density is f. Thus

$$S_2 = 2\mu. \tag{5c-1}$$

For n = 3, let us number the payoff random variables from the end of the investment period. Thus $X_3$ is the payoff for the first investment opportunity and $X_1$ that for the last. When payoff $X_3$ is presented, we accept it if it exceeds or equals some number $s_3$, otherwise reject it. We must find the optimum value $s_3^*$ of $s_3$. If we take $X_3$, we have two opportunities left and one more choice, and we already know that $R_2$ is the optimum gain to be expected. Therefore


Figure 4. The cumulatives $F(R_n)$ plotted as functions of x = n/(n + 1) for the distributions $(t - 1)/x^t$.

we can write as the expected value for given s3 S3 (s3 ) =P (X3 ≥ s3 )[E(X3 |X3 > s3 ) + R2 ] + [1 − P (X3 ≥ s3 )]S2  ∞  ∞ =S2 + xf (x)dx − (S2 − R2 ) f (x)dx. s3

(5c-2)

s3

Since S2 and R2 are constants, indeed their values have already been determined, we need only maximize S3 (s3 ). Differentiating, we get dS3 (s3 ) = −s3 f (s3 ) + (S2 − R2 )f (s3 ). ds3

(5c-3)

Setting the derivative equal to zero and recalling f (s3 ) > 0, we have the optimizing value of s3 as s∗3 = S2 − R2 . (5c-4) A check on the second derivative assures that we have maximized. Therefore the maximum value of S3 (s3 ) is

20 Recognizing the Maximum of a Sequence

395

Figure 5. The cumulatives F (Rn ) plotted as functions of x = n/(n + 1).

 S3 = S 2 +



s∗ 3

 xf (x)dx − (S2 − R2 )



f (x)dx.

(5c-5)

s∗ 3

The same approach extends to Sn , and we have s∗n = Sn−1 − Rn−1  ∞  Sn = Sn−1 + xf (x)dx − (Sn−1 − Rn−1 ) s∗ n



f (x)dx.

(5c-6)

s∗ n

To extend the effort to three choices involves equivalent technology. Let Tn be the optimum payoff with three choices, n ≥ 3. Then T3 = 3µ. As before, let tn be the decision number so that the first choice is used when Xn ≥ tn . Then t∗n = Tn−1 − Sn−1  ∞  Tn = Tn−1 + xf (x)dx − (Tn−1 − Sn−1 ) t∗ n



f (x)dx.

(5c-7)

t∗ n

To write formulas for the r-choice problem requires only that we tool up some double-subscripted notation, which hardly seems worth the effort since the solution in equation (5c–7) extends so naturally.
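The recursions (5c-6) and (5c-7) share one pattern, which the following Python sketch makes explicit. It is an illustration under stated assumptions, not the authors' program: the names `gain_uniform`, `gain_exponential`, and `optimal_payoffs` are hypothetical, the two integrals are combined into the single quantity G(s) = E[max(X − s, 0)], and the one-choice values are assumed to follow the same pattern with the zero-choice payoff taken as 0.

```python
import math


def gain_uniform(s):
    """G(s) = E[max(X - s, 0)] for X uniform on (0, 1)."""
    if s <= 0.0:
        return 0.5 - s
    if s >= 1.0:
        return 0.0
    return (1.0 - s) ** 2 / 2.0


def gain_exponential(s):
    """G(s) = E[max(X - s, 0)] for X with density exp(-x), x >= 0."""
    if s <= 0.0:
        return 1.0 - s
    return math.exp(-s)


def optimal_payoffs(gain, mean, choices, n_max):
    """Return V where V[r][n] is the optimum expected payoff with r choices
    and n opportunities (defined for n >= r; assumes n_max >= choices).

    V[1], V[2], V[3] play the roles of R, S, T in the text.
    """
    V = [[0.0] * (n_max + 1) for _ in range(choices + 1)]
    for r in range(1, choices + 1):
        # With exactly r opportunities left, all of them must be taken.
        V[r][r] = r * mean
        for n in range(r + 1, n_max + 1):
            # Optimal cutoff, e.g. s*_n = S_{n-1} - R_{n-1} as in (5c-6).
            cutoff = V[r][n - 1] - V[r - 1][n - 1]
            # Combined form of the two integrals in (5c-6) and (5c-7).
            V[r][n] = V[r][n - 1] + gain(cutoff)
    return V


if __name__ == "__main__":
    uniform = optimal_payoffs(gain_uniform, 0.5, 3, 10)
    exponential = optimal_payoffs(gain_exponential, 1.0, 3, 10)
    for n in range(3, 11):
        print(n, [round(uniform[r][n], 2) for r in (1, 2, 3)],
              [round(exponential[r][n], 2) for r in (1, 2, 3)])
```

Writing the update as a single gain term shows why the r-choice case needs no new idea: only the row index of the table changes.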

TABLE 14. EXPECTED PAYOFF VALUES FOR THE UNIFORM, EXPONENTIAL, AND NORMAL DISTRIBUTIONS, WITH 1, 2, AND 3 CHOICES

           Uniform                e^{-x}                 Normal
  n      R      S      T        R      S      T        R      S      T
  1     .50                    1.00                    0
  2     .62   1.00             1.37   2.00             .40    0
  3     .70   1.20   1.50      1.62   2.53   3.00      .63    .63    0
  4     .74   1.32   1.74      1.82   2.93   3.63      .79   1.03    .79
  5     .78   1.41   1.91      1.98   3.26   4.13      .91   1.32   1.32
  6     .80   1.48   2.03      2.12   3.54   4.55     1.01   1.55   1.72
  7     .82   1.53   2.13      2.24   3.78   4.91     1.09   1.73   2.04
  8     .84   1.57   2.21      2.35   4.00   5.24     1.16   1.89   2.30
  9     .85   1.61   2.28      2.44   4.19   5.53     1.22   2.03   2.53
 10     .86   1.64   2.33      2.53   4.36   5.79     1.28   2.15   2.73

Table 14 compares R's, S's, and T's for the uniform, the exponential, and the normal.

TABLE 15. RESULTS FOR UNIFORM AND EXPONENTIAL FOR LARGE NUMBERS OF OPPORTUNITIES FOR 1, 2, AND 3 CHOICES

            Uniform Distribution         Exponential Distribution
    n      R_n     S_n     T_n          R_n      S_n      T_n
   10     .861    1.636   2.330        2.529    4.362    5.788
   20     .920    1.790   2.614        3.129    5.564    7.593
   30     .943    1.852   2.727        3.497    6.302    8.700
   40     .956    1.885   2.789        3.765    6.837    9.503
   50     .964    1.906   2.827        3.975    7.258   10.134
  100     .981    1.951   2.910        4.641    8.588   12.130
  200     .990    1.975   2.954        5.318    9.943   14.162
  300     .993    1.983   2.969        5.718   10.742   15.361
  400     .995    1.987   2.976        6.002   11.311   16.215
  500     .996    1.990   2.981        6.223   11.754   16.878
 1000     .998    1.995   2.990        6.912   13.132   18.946

For the uniform, with n = 10 opportunities, the expected values of the three largest measurements are 8/11, 9/11, 10/11. Thus the average maximum score from knowing all the values for the whole sequence is 3 − 6/11 ≈ 2.455, as compared with the observed 2.330 for our optimum strategy. When n = 1000, the expected value of the sum of the three largest measurements is 3 − 6/1001 ≈ 2.994, which is close to the 2.990 achieved by the optimum strategy.

By using H. Leon Harter's tables of the expected values of the order statistics for the exponential distribution [9], we can compare the results of using our strategy with the best expected value obtainable when one reviews the whole series after the fact and chooses the largest 1, 2, or 3 measurements in the series. The percentage loss decreases with both increasing choices and increasing n.

  n = 10                R_10 = 2.53     S_10 = 4.36     T_10 = 5.79
      Best expected            2.93            4.86            6.29
  n = 100               R_100 = 4.64    S_100 = 8.59    T_100 = 12.1
      Best expected            5.19            9.37            13.1
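The "best expected" benchmarks can be checked without tables. The sketch below assumes two standard order-statistic facts that the text does not spell out: for n unit uniforms the kth smallest has mean k/(n + 1), and for n unit exponentials the jth largest has mean 1/j + 1/(j + 1) + ... + 1/n. The function names are hypothetical.

```python
from fractions import Fraction


def best_uniform(n, r):
    """Expected sum of the r largest of n Uniform(0, 1) observations."""
    return sum(Fraction(k, n + 1) for k in range(n - r + 1, n + 1))


def best_exponential(n, r):
    """Expected sum of the r largest of n unit-exponential observations."""
    return sum(
        sum(Fraction(1, k) for k in range(j, n + 1))  # mean of the j-th largest
        for j in range(1, r + 1)
    )


if __name__ == "__main__":
    print(float(best_uniform(10, 3)), float(best_uniform(1000, 3)))
    for n in (10, 100):
        print(n, [round(float(best_exponential(n, r)), 2) for r in (1, 2, 3)])
```

Exact fractions are used so that the uniform benchmark can be compared directly with the closed forms 3 − 6/11 and 3 − 6/1001.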

ACKNOWLEDGMENT

We wish to thank Michael Feuer and Mrs. Cleo Youtz for their contributions to calculations in preparation of this work. John Pratt has suggested a number of improvements.

REFERENCES

[1] Arrow, K.J., Blackwell, D., and Girshick, M.A., “Bayes and minimax solutions of sequential decision problems,” Econometrica, 17 (1949), 213–44.
[2] Bissinger, B.H., and Siegel, C., “Problem 5086,” Advanced Problems and Solutions, American Mathematical Monthly, 70 (1963), 336.
[3] Bosch, A.J., “Solution to Problem 5086: Optimum strategy in a guessing game,” Advanced Problems and Solutions, American Mathematical Monthly, 71 (1964), 329–30.
[4] Chow, Y.S., Moriguti, S., Robbins, H., and Samuels, S.M., “Optimum selection based on relative rank (the ‘Secretary Problem’),” Israel Journal of Mathematics, 2 (1964), 81–90.
[5] Chow, Y.S., and Robbins, H., “On optimal stopping rules,” Z. Wahrscheinlichkeitstheorie, 2 (1963), 33–49.
[6] Fox, J.H., and Marnie, L.G., in Martin Gardner’s column, “Mathematical Games,” Scientific American, 202 (Feb. 1960), 150 and 153.
[7] Gleason, Andrew, personal communication (1955).
[8] Gleason, Andrew, personal communication (1963).
[9] Harter, Leon, Expected values of Exponential, Weibull, and Gamma Order Statistics, ARL 64-31, Office of Aerospace Research, United States Air Force, Wright-Patterson Air Force Base, Ohio (Feb. 1964).
[10] Kaufman, G.M., “Sequential investment analysis under uncertainty,” The Journal of Business of the University of Chicago, 36 (1963), 39–64.
[11] Lindley, D.V., “Dynamic programming and decision theory,” Applied Statistics, 10 (1961), 39–51.
[12] Moser, Leo, and Pounder, J.R., in Martin Gardner’s column, “Mathematical Games,” Scientific American, 202 (March 1960), 178 and 181.
[13] Moser, Leo, “On a problem of Cayley,” Scripta Mathematica, 22 (1956), 289–92.
[14] Guttman, Irwin, “On a problem of L. Moser,” Canadian Mathematical Bulletin, 3 (1960), 35–39.

Reprinted from Demography (1967), 4, 850–858

21. The Distribution of Sums of Rounded Percentages

Frederick Mosteller, Cleo Youtz, and Douglas Zahn*

SUMMARY

When percentages are computed for counts in several categories or for several positive measurements each taken as a fraction of their sum, the rounded percentages often fail to add to 100 percent. We investigate how frequently this failure occurs and what the distributions of sums of rounded percentages are for (1) an empirical set of data, (2) the multinomial distribution in small samples, (3) spacings between points dropped on an interval (the broken-stick model), and (4) simulation for several categories. The several methods produce similar distributions. We find that the probability that the sum of rounded percentages adds to exactly 100 percent is 1 for two categories, about three-fourths for three categories, about two-thirds for four categories, and about $\sqrt{6/(c\pi)}$ for larger numbers of categories, c, on the average when categories are not improbable.

* Harvard University. This work was facilitated by grants from the National Science Foundation.

SUMS OF ROUNDED PERCENTAGES

In tabulating percentages, even for exhaustive categories, everyone finds that the sums of the rounded percentages often fail to add to 100. For example, the accompanying four-category tabulation of counts was percentaged and rounded off to the nearest 10 percent.

                         Category
                      1     2     3     4     Total
  Counts              1     3     2     3       9
  Rounded percent    10    30    20    30      90
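The tabulation above can be reproduced in a few lines of Python. This is a minimal sketch: the function name `rounded_percentages` is hypothetical, and exact ties are rounded to the even grid multiple, which is what Python's `round` does for `Fraction` values.

```python
from fractions import Fraction


def rounded_percentages(counts, grid_percent):
    """Percentage each count, then round to the nearest multiple of grid_percent."""
    total = sum(counts)
    rounded = []
    for x in counts:
        exact = Fraction(100 * x, total)      # exact percentage, no float error
        steps = round(exact / grid_percent)   # nearest grid multiple, ties to even
        rounded.append(steps * grid_percent)
    return rounded


if __name__ == "__main__":
    percents = rounded_percentages([1, 3, 2, 3], 10)
    print(percents, sum(percents))
```

Using exact fractions sidesteps the problem of a value such as 2.5 being stored as 2.4999999.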

This failure to add to 100 percent occurs so frequently that, if very many sums of percentages do add to 100 percent in a set of reported tables, one begins to suspect the reporter of fudging. How often should the percentages fail to add up correctly? How wrong are they likely to be?

We investigate these questions (1) by illustrating the distribution of the sums of rounded percentages for a large collection of tables; (2) by examining the distribution of sums for the multinomial distribution, for small samples; (3) by turning to the continuous analogue of the multinomial for large samples (the broken-stick model) and getting exact distributions, means and variances, and asymptotic approximations; and (4) by simulating the broken-stick model for larger numbers of categories. We find that the probability of the sum of rounded percentages adding to exactly 100 percent is 1 for two categories, about 3/4 for three categories, 2/3 for four categories, and $\sqrt{6/(c\pi)}$ for larger numbers of categories, c, on the average, when categories are not improbable.

The details of the probability distribution of sums would, of course, depend upon such things as the number of categories and the fineness of the rounding. Our numerical work is mainly with the standard decimal roundings: to the nearest 10 percent, the nearest 1 percent, or the nearest 0.1 percent. We convert these quantities to proportions and call them the grid spacings δ: 0.1, 0.01, and 0.001. Once in a while, people round to the nearest 5 percent; δ = 0.05. A grid width δ that divides the interval from 0 to 1 into an integral number of parts yields n = 1/δ parts. The n + 1 possible values that could occur as a result of rounding to the nearest δ are 0 percent, 100δ percent, 200δ percent, ..., 100nδ percent. For example, in a sample of 9 rounded to the nearest 10 percent (δ = 0.1), a count of 0 would yield 0 percent; a count of 1, 10 percent; of 2, 20 percent; 3, 30 percent; 4, 40 percent; 5, 60 percent; 6, 70 percent; 7, 80 percent; 8, 90 percent; and of 9, 100 percent. Note that although 50 percent was available, it was not used.

What sums are possible? For either one or two categories, the correctly rounded percentages cannot fail to add to exactly 100 percent. (If a number ends in a 5 that must be rounded, our rule is to round to the nearest even digit, but a high-speed computer often has views of its own about this, since it may produce 2.5 as 2.4999999, depending on just how it carried out the calculation. Consequently, we set this rounding 5's problem aside for the moment and regard it as a rare event not worth special attention in this study of average results.) Sometimes a category practically never occurs. In such cases we suppose that the distribution of the sums of the rounded percentages behaves much as if this category were not present. We have not investigated this point. One could use rules for rounding other than rounding to the nearest grid point, but we do not investigate these rules here. If a rule is used that forces the totals to add to 100 percent, it is probably well to state it.

EMPIRICAL DISTRIBUTIONS FROM THE NATIONAL HALOTHANE STUDY

To illustrate the behavior of distributions of the sums of the rounded percentages, we computed percentages for a considerable collection of tables from the National Halothane Study¹, a study of death rates associated with surgery and anesthesia (sponsored by the National Academy of Sciences–National Research Council). These tables included large numbers of lines of counts associated with varying numbers of categories. Some typical variables are reports of numbers dying categorized by class intervals of age, by physical status (a 7-category scale), and by anesthetic administered (5 categories). Tables for surgical patients were constructed by using the same variables.
The total for any given line is likely to be in the hundreds or the thousands. Thus the totals are fairly large, a matter whose relevance will be clarified below. Table 1 shows, at the bottom, the numbers of lines that we used for each number of categories from 3 to 9. The same line was not reused, as it could have been (a five-category line could make ten three-category lines). But the same patient may be represented in several lines. We do not think this overlap affects the conclusions.

We converted each line of data to decimal fractions, rounded to the nearest 10 percent, 1 percent, and 0.1 percent, and, for each of these rounding grids, formed the distribution of the sums. Table 1 gives these distributions. The 100 percent line of each panel of Table 1 shows that, as we expect, increasing numbers of categories generally lead to decreasing percentages of sums that add to exactly 100 percent. The middle line of the bottom panel suggests that when the rounding is very fine for three categories, the fraction totaling exactly 100 percent is about 3/4; for four categories, about 2/3; and for five categories, about 7/12. Below, we relate these results to theoretical formulations.

¹ J. P. Bunker, W. H. Forrest, Jr., F. Mosteller, and L. D. Vandam (eds.), The National Halothane Study: A Study of the Possible Association Between Halothane Anesthesia and Postoperative Hepatic Necrosis (Report of the Subcommittee on the National Halothane Study of the Committee on Anesthesia, National Academy of Sciences–National Research Council [1969]).

The behavior of the empirical rounding error distributions seems fairly regular as a function of the grid width and of the number of categories k + 1. We note that the distributions for rounding to the nearest 1 percent or 0.1 percent are approximately symmetrical. The 10 percent rounding led to distributions that had more lines producing less than 100 percent than lines producing more than 100 percent. We shall return to this asymmetry below.

DISTRIBUTION OF THE SUMS OF ROUNDED PERCENTAGES FOR SMALL SAMPLES FROM THE MULTINOMIAL DISTRIBUTION HAVING EQUALLY LIKELY CATEGORIES

One way to view tables of counts is as arising from observed samples drawn from multinomial distributions. Here we consider the distributions of sums of rounded percentages arising from the multinomial distribution. Recall that if there are c equally likely categories and a sample of size N drawn from this distribution, the probability of observing a count of $x_i$ in category i, where i = 1, 2, ..., c, is given by

\[ P(x_1, x_2, \ldots, x_c) = \frac{N!\,(1/c)^N}{x_1!\,x_2! \cdots x_c!}, \tag{1} \]

where $\sum x_i = N$. To get the final probability of a particular kind of partition, we must further multiply by the number of arrangements A of the partition itself.

For a given number of categories, we can write out all the possible samples of N for such a multinomial distribution, compute the δ-rounded percentages for each category, and add. Then, associated with each possible sum, we will have a probability which is the sum of the probabilities associated with the samples that gave the sum. Thus we get the exact distribution of the sum.

Table 2 shows the details of the procedure for a particular case, N = 7 with four categories. Panel A shows the prototypic partitions of 7 into four categories and their associated probabilities. For example, $P(7, 0, 0, 0) = 7!\,(1/4)^7/(7!\,0!\,0!\,0!) = 1/4^7$ and A = 4!/(3! 1!) = 4; therefore, $P \times A = 1/4^6 = 1/4096 \approx 0.00024$. Panel B shows the percentages computed for each partition to two decimals. Panel C shows the percentages rounded to the nearest 1 percent, and their totals. Panel D gives the final distribution of sums. For example, 101 occurs on lines 8 and 11, and their probabilities in panel A add to 0.30762 or 30.76 percent.

Table 3 summarizes the percentages of samples that add to exactly 100 percent for each sample size from N = 1–20 and each number of categories from 3 to 6 for our three standard rounding grids. The pattern of the numbers is informative. For example, N = 9 generates a good deal of rounding error

Table 1. Empirical Distribution of Sums of Rounded Percentages for Rounding to the Nearest 10 Percent, 1 Percent, and 0.1 Percent, for Each Number of Categories from 3 to 9

  Sum of rounded                     Number of categories
  percentages          3        4        5        6        7        8        9

  δ = 10 percent
    Total(a)         100      100      100      100      100      100      100
    120              ···      ···     0.29      ···      ···      ···      ···
    110             8.97    17.91    18.93    15.90    12.41    15.59    17.65
    100            68.79    63.39    58.63    50.87    59.31    49.81    26.47
     90            22.24    18.70    21.66    31.21    28.28    31.94    41.18
     80              ···      ···     0.49     2.02      ···     2.66    14.71

  δ = 1 percent
    Total            100      100      100      100      100      100      100
    102              ···     0.16     0.10     0.87     0.00     1.52     0.00
    101            14.48    17.27    20.49    17.34    24.83    28.14    35.29
    100            72.93    66.40    58.73    58.38    46.90    39.54    42.65
     99            12.59    16.01    20.49    21.68    27.59    27.76    22.06
     98              ···     0.16     0.20     1.73     0.69     3.04     0.00

  δ = 0.1 percent
    Total            100      100      100      100      100      100      100
    100.2            ···      ···     0.29     1.16     2.76     1.90     1.47
    100.1          12.59    16.96    20.49    21.10    22.76    21.67    19.12
    100.0          74.48    66.40    58.44    53.18    53.10    52.85    48.53
     99.9          12.93    16.64    20.59    23.70    20.00    20.15    23.53
     99.8            ···      ···     0.20     0.87     1.38     3.42     7.35

  Number of lines    580      631     1025      346      145      263       68

(a) Column sums may fail to add to 100 percent because of rounding.

for the standard grids. Note also that some otherwise nice numbers like N = 4 start off poorly for the wider grids before producing sums of exactly 100 percent for all finer standard decimal grids. Beyond the pattern of results displayed in Table 3, we can look at the average behavior over the first twenty values of N as shown in Table 4. Turning at once to the finest of the standard grids, we see that the average percentage of exactly 100 percent sums is again nearly 3/4 for 3 categories, 2/3 for 4, and 7/12 for 5, as it was for the empirical data of the National Halothane Study. It is interesting that such small N's as 20 and less give averages similar to those

Table 2. Illustration of Multinomial Calculations for Samples of 7 in 4 Categories

  Panel A: Partitions of N = 7 into c = 4 categories
  Panel B: Percentages to be rounded

                     Panel A                                   Panel B
  Partition       Category                                    Category
  number        1   2   3   4    P(x1, x2, x3, x4)A       1       2       3       4
    1           7   0   0   0        .00024            100.00    0.00    0.00    0.00
    2           6   1   0   0        .00513             85.71   14.29    0.00    0.00
    3           5   2   0   0        .01538             71.43   28.57    0.00    0.00
    4           4   3   0   0        .02563             57.14   42.86    0.00    0.00
    5           5   1   1   0        .03076             71.43   14.29   14.29    0.00
    6           4   2   1   0        .15381             57.14   28.57   14.29    0.00
    7           3   3   1   0        .10254             42.86   42.86   14.29    0.00
    8           3   2   2   0        .15381             42.86   28.57   28.57    0.00
    9           4   1   1   1        .05127             57.14   14.29   14.29   14.29
   10           3   2   1   1        .30762             42.86   28.57   14.29   14.29
   11           2   2   2   1        .15381             28.57   28.57   28.57   14.29
  Total                             1.00000

  Panel C: Percentages rounded to the nearest 1 percent

  Partition         Category
  number         1     2     3     4     Total
    1          100     0     0     0      100
    2           86    14     0     0      100
    3           71    29     0     0      100
    4           57    43     0     0      100
    5           71    14    14     0       99
    6           57    29    14     0      100
    7           43    43    14     0      100
    8           43    29    29     0      101
    9           57    14    14    14       99
   10           43    29    14    14      100
   11           29    29    29    14      101

  Panel D: Distribution of sums

  Sum     Percentage of samples occurring
  101               30.76
  100               61.04
   99                8.20

obtained for very large N's. As before, we note that increasing the number of categories reduces the average percentage adding to exactly 100 percent.

Space precludes showing all the distributions of sums of rounded percentages, but Table 5 gives the rounded results for c = 3 and 5 for those N's for which not all partitions add to 100 percent, for δ = 0.001. The bottom line of this table shows the average for the first twenty N's, not just the tabled ones, and we see that for three categories the distribution is asymmetric around 100 percent with about 1/6 falling at 100.1 and 1/12 falling at 99.9. We do not expect this asymmetry for large values of N. Furthermore, we note that this asymmetry goes in a direction opposite to that for the halothane data where the sample sizes are very large.

Table 3. Percentages of Multinomial Samples Having Sums of Rounded Percentages Totaling Exactly 100 Percent, for Sample Sizes 1 to 20; 3 to 6 Categories and Three Rounding Grids

               δ = .1                δ = .01               δ = .001
   N   c:    3    4    5    6      3    4    5    6      3    4    5    6
   1       100  100  100  100    100  100  100  100    100  100  100  100
   2       100  100  100  100    100  100  100  100    100  100  100  100
   3        78   62   52   44     78   62   52   44     78   62   52   44
   4        26   16   10    7    100  100  100  100    100  100  100  100
   5       100  100  100  100    100  100  100  100    100  100  100  100

   6        75   71   62   52     75   71   62   52     75   71   62   52
   7        75   51   34   23     65   61   63   62     52   28   17   11
   8        53   45   47   52     25   13    7    4    100  100  100  100
   9        43   20   10    5     43   20   10    5     43   20   10    5
  10       100  100  100  100    100  100  100  100    100  100  100  100

  11        37   14    6    3     37   14    6    3     37   14    6    3
  12        69   41   19    9     78   70   62   57     78   70   62   57
  13        77   61   48   32     70   59   55   53     31   10    4    1
  14        74   70   56   49     64   67   47   27     75   73   62   47
  15        78   70   63   58     78   70   63   58     78   70   63   58

  16        72   51   43   39     63   48   34   28     25   12    6    3
  17        71   66   60   55     66   59   29   13     81   51   45   48
  18        75   74   62   47     75   74   62   47     75   74   62   47
  19        75   50   31   19     75   63   51   39     75   62   49   43
  20        61   54   46   34    100  100  100  100    100  100  100  100
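The enumeration behind Tables 2 and 3 can be sketched in Python as follows. This is an illustration, not the authors' program: the function names are hypothetical, ordered samples are enumerated directly (so the arrangement factor A is absorbed into the enumeration), and exact ties are rounded to the even grid point as in the stated rule. Entries for sample sizes that produce exact ties depend on the tie rule, so they need not match a computation that treated ties differently.

```python
from collections import Counter
from fractions import Fraction
from math import factorial


def compositions(total, parts):
    """Yield every ordered tuple of `parts` non-negative counts summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def sum_distribution(N, c, grid):
    """Exact distribution of the sum of grid-rounded proportions for an
    equally likely c-category multinomial sample of size N.

    `grid` is the grid spacing as a Fraction, e.g. Fraction(1, 100).
    Returns {sum of rounded proportions: probability}.
    """
    dist = Counter()
    for counts in compositions(N, c):
        coef = factorial(N)
        for x in counts:
            coef //= factorial(x)              # multinomial coefficient
        prob = Fraction(coef, c ** N)          # equation (1)
        steps = sum(round(Fraction(x, N) / grid) for x in counts)
        dist[steps * grid] += prob
    return dict(dist)


if __name__ == "__main__":
    for total, prob in sorted(sum_distribution(7, 4, Fraction(1, 100)).items()):
        print(float(100 * total), round(100 * float(prob), 2))
```

For N = 7 and four categories on the 1 percent grid, the printed percentages can be compared with Panel D of Table 2.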

Table 4. Average Percentage Summing to Exactly 100 Percent

                                    Categories
  Average over N =      δ         3     4     5     6
  1(1)10               0.1       75    66    62    58
                       0.01      79    73    69    67
                       0.001     85    78    74    71
  1(1)20               0.1       72    61    52    46
                       0.01      75    68    60    55
                       0.001     75    66    60    56

Table 5. Distribution of Sums of Rounded Frequencies for δ = 0.001 for N's Not Always Yielding Sums of 100 Percent, Through N = 20, for the Multinomial with 3 and 5 Categories

                        3 categories                     5 categories
  N                 99.9   100.0   100.1     99.8    99.9   100.0   100.1   100.2
   3                 22      78                        48      52
   6                 12      75      12                 6      62      33
   7                         52      48                        17      83
   9                 57      43                        90      10

  11                         37      63                         6      94
  12                 11      78      11                19      62      18
  13                         31      69                         4      96
  14                  7      75      18      0(a)      32      62       6

  15                 11      78      11                19      63      18
  16                         25      75                         6      63      31
  17                 14      81       5       5        43      45       7
  18                 20      75       5                 6      62      31       0
  19                 12      75      13                 5      49      43       3

  Average for 20     8.3    75.2    16.7     0.2      13.4    60.0    24.6     1.7

(a) Blanks mean nothing occurred, 0 means the entry rounded to zero.

THE BROKEN-STICK MODEL AND ITS RELATION TO THE MULTINOMIAL

Insofar as we use the exact multinomial model for the rounding error of percentages, we are limited by what we can do on the computer, and of course, N! soon outruns the capabilities of even the largest computer. To pursue larger values of N, we have to turn to other theory. In large samples there is an intimate relation between the distribution of the spacings between the order statistics in a sample from a unit uniform distribution and the distribution of the observed proportions in the equally-likely-category multinomial.² Let us describe the relation.

Fig. 1. Illustrating the relation of the order statistics $z_i$ to the interval lengths $y_i$.

When a sample of c − 1 points is dropped at random on the unit interval, these points, together with the two endpoints, can be used to form c adjacent intervals. If we label the dropped points, after their values have been ordered, as $z_1, z_2, \ldots, z_{c-1}$, then the intervals will have lengths $y_1, y_2, \ldots, y_c$, as shown in Figure 1. Note that $y_1 = z_1$, $y_i = z_i - z_{i-1}$ for i = 2, ..., c − 1, and $y_c = 1 - z_{c-1}$. The z's are called the order statistics of the sample. The y's must add to 1, just as the unrounded percentages in the multinomial problem must always add to 100 percent. Then $y_i$ plays the role of a proportion in a category in the equally likely multinomial problem.

If $x_1/N$ is the proportion of observations in the first category of a c-category multinomial, for example, then for large N its distribution is approximately that of the length of the leftmost interval created when a sample of c − 1 observations is drawn from the unit uniform. Further, the distribution of $x_i/N$ is approximately that of the length of the ith interval, and the joint distribution of all the $x_i/N$ is approximately that of the joint distribution of the lengths of the intervals $y_1, y_2, \ldots, y_c$. When N grows large, this distribution theory for the y's matches that for the $x_i/N$'s very closely. However, some specific N's, such as 10,000, will be commensurate with standard decimal δ's, and our rounding error theory will be inaccurate for such values of N. We neglect this in what follows. We are discussing large N's “on the average.”

² J. W. Tukey, “Non-parametric Estimation II. Statistically Equivalent Blocks and Tolerance Regions—Continuous Case,” The Annals of Mathematical Statistics, XVIII (1947), 529–39.

With this relationship between the order statistics and the multinomial established, we can push a bit further. From the geometric theory of the order statistics, we have worked out the exact distribution of the sums of the rounded percentages (rounded $y_i$'s expressed as percentages) for two, three, and four categories. For two categories the theory is easy, because under our rule all sums add to 100 percent exactly. For three categories, matters are a bit harder, but the analysis only requires breaking up an equilateral triangle into many smaller ones and counting them. The result is simple. Let n = 1/δ, as before; then we have the accompanying tabulation.

  Sum            Probability
  100 + δ        (n − 1)/(8n)
  100            6n/(8n) = 3/4
  100 − δ        (n + 1)/(8n)

For four categories, the geometry becomes three-dimensional and involved. The distribution of the sums is shown in the accompanying tabulation.

  Sum            Probability
  100 + δ        2n(2n − 1)(2n − 2)/(48n^3) = (2n − 1)(2n − 2)/(24n^2)
  100            4(2n + 1)(2n)(2n − 1)/(48n^3) = 4(4n^2 − 1)/(24n^2)
  100 − δ        (2n + 2)(2n + 1)(2n)/(48n^3) = (2n + 2)(2n + 1)/(24n^2)

We have not carried the exact large-sample theory beyond four categories at this time. Note, first, that sums larger than 100 percent are less likely than sums smaller than 100 percent, agreeing with the halothane data rather than with the average for the small N multinomials. Second, note that for large n (small δ) the probability of a sum being exactly 100 percent is 1 for two categories, 3/4 for three categories, and tends to 2/3 for four categories. These results

are gratifyingly close to those given by the halothane data and by the small-sample multinomial averages for a fine grid. This theory also models the percentaging and rounding of continuous measurements, such as proportions of protein, fats, and carbohydrates in a diet.

MEAN AND STANDARD DEVIATION

Although we found this large-sample theory difficult to press beyond c = 4, it is tedious but fairly straightforward to obtain the mean and variance of these distributions. Let $\mu_{\text{sum}}$ be the mean of the sum of the rounded intervals and $\mu_I$ be the mean rounded percentage for a single interval; then

\[ \mu_{\text{sum}} = c\,\mu_I = c \sum_{i=1}^{n} \left[ 1 - \left(i - \tfrac{1}{2}\right)\delta \right]^{c-1}, \tag{2} \]

and for c ≥ 3, the variance of the sum is

\[ \sigma^2_{\text{sum}} = 2c \sum_{i=1}^{n} i \left[ 1 - \left(i - \tfrac{1}{2}\right)\delta \right]^{c-1} + (c-1)c \sum_{i=1}^{n-1} i\,(1 - i\delta)^{c-1} - \mu_{\text{sum}} - \mu^2_{\text{sum}}. \tag{3} \]
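Equations (2) and (3) are easy to evaluate exactly. The sketch below is an illustration under one assumption: both formulas are read in units of the grid spacing δ = 1/n, so that a sum of exactly 100 percent corresponds to n. The function names are hypothetical. As a check, the moments are compared with those of the exact three-category distribution tabulated earlier.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def mean_sum(c, n):
    """Equation (2): mean of the sum of rounded intervals, in units of delta = 1/n."""
    d = Fraction(1, n)
    return c * sum((1 - (i - HALF) * d) ** (c - 1) for i in range(1, n + 1))


def var_sum(c, n):
    """Equation (3): variance of the sum of rounded intervals, in units of delta**2."""
    d = Fraction(1, n)
    mu = mean_sum(c, n)
    first = 2 * c * sum(i * (1 - (i - HALF) * d) ** (c - 1) for i in range(1, n + 1))
    second = (c - 1) * c * sum(i * (1 - i * d) ** (c - 1) for i in range(1, n))
    return first + second - mu - mu ** 2


def exact_three_categories(n):
    """Three-category tabulation, with sums expressed in units of delta."""
    return {
        n + 1: Fraction(n - 1, 8 * n),
        n: Fraction(3, 4),
        n - 1: Fraction(n + 1, 8 * n),
    }


if __name__ == "__main__":
    n = 10
    dist = exact_three_categories(n)
    mean = sum(k * p for k, p in dist.items())
    var = sum(k * k * p for k, p in dist.items()) - mean ** 2
    print(mean_sum(3, n) == mean, var_sum(3, n) == var)
    print(float(var_sum(25, 1000)), 25 / 12)
```

The last line compares the variance for 25 categories on a fine grid with c/12.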

For c = 1 or 2 the variance is zero.

Limiting case. For fixed c, as the grid gets fine (δ → 0), it can be shown that the mean of the sum of the rounded values tends to n in units of δ, or to 1 in units of nδ. The variance tends to c/12 in units of δ². This result is fairly reasonable if we assume that the rounding errors are uncorrelated and uniformly distributed over an interval of length δ for each of the c pieces. Using the normal approximation for small δ gives

\[ P(\text{sum} = 1) \approx 1/\sqrt{2\pi c/12} \approx 1.38/\sqrt{c}. \tag{4} \]

For three categories, this gives 0.798 instead of 0.75; for four categories, 0.69 instead of 0.67, and the approximation improves with increasing values of c.

It is rather interesting from the point of view of conjecture that the distribution of the sum of two unit uniforms has 3/4 of its probability within 1/2 a unit of its mean, thus corresponding to the case of c = 3. And the sum of three independent unit uniforms has, under its middle unit interval, the probability 2/3, thus corresponding to the case c = 4. This raises the question whether the limiting probabilities could be given exactly by distributions computed from the sum of c − 1 independent uniforms. Z. W. Birnbaum³ quotes Laplace as having discovered the areas to the left of z under the distribution of the sum of u uniform random variables, each on the interval −1 to 1 (not unit uniforms), to be given by

\[ F(z) = \frac{1}{u!} \sum_{i \le (z+u)/2} (-1)^i \binom{u}{i} \left( \frac{z+u}{2} - i \right)^{u}. \tag{5} \]

³ Z. W. Birnbaum, “On Random Variables with Comparable Peakedness,” The Annals of Mathematical Statistics, XIX (1948), 80.

For our problem z = −1, and the estimate is 1 − 2F(z). This formula breeds denominators that match those of the exact theory given earlier, c = 2: 8 = 2!·2², c = 3: 48 = 3!·2³, and the next from the Laplace formula is 4!·2⁴. In any case, as the number of categories grows, the normal approximation based on c categories and the Laplace numbers are bound to grow close to one another, since the sums of the uniforms are tending to the normal. As c grows large, the discrepancy 1/12 in variance makes less numerical difference. Thus the Laplace numbers, being exactly correct for small c (2, 3, 4) and correct asymptotically, cannot go far astray.

SIMULATION AND COLLATION

To check the results for the broken-stick model and to look into larger values of c, we ran 1,000 samples in which we drew randomly c − 1 observations from a uniform, thus generating c values of y; we δ-rounded these and summed, and obtained the frequency distribution of totals. For this study we used c = 4, 6, 8, 10, 25. For c = 25, δ = 0.001, we observed a variance of 2.013δ² whose coefficient is very close to the theoretical 25/12 = 2.0833.

Table 6. True and Estimated Probabilities of the Sum of the Rounded Percentages Adding to Exactly 100 Percent, for Various Numbers of Categories for the Broken-Stick Model

  Number of               Normal approxi-   Laplace                   Halothane   Multi-
  categories     True     mations(a)        number(b)   Simulation    data(c)     nomial(d)
   1             1        0.917             ...         ...           ...         ...
   2             1        0.779             1           ...           ...         ...
   3             0.75     0.683             0.75        ...           0.745       0.75
   4             0.667    0.614             0.667       0.645         0.664       0.66
   5             ...      0.561             0.599       ...           0.584       0.60
   6             ...      0.521             0.550       0.536         0.532       0.56
   7             ...      0.487             0.511       ...           0.531       ...
   8             ...      0.460             0.479       0.490         0.528       ...
   9             ...      0.436             0.453       ...           0.485       ...
  10             ...      0.416             0.430       0.410         ...         ...
  25             ...      0.271             0.275       0.273         ...         ...

(a) Using integral with correction term.
(b) Based on c − 1 categories.
(c) δ = 0.001.
(d) Average for first 20 N's, δ = 0.001.
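The Laplace numbers can be recomputed from equation (5) with u = c − 1 and z = −1. This Python sketch is illustrative; the names `laplace_cdf`, `laplace_number`, and `normal_approximation` are hypothetical, and `normal_approximation` implements the simple form (4) rather than the corrected integral used for the normal column of Table 6.

```python
from fractions import Fraction
from math import comb, factorial, floor, pi, sqrt


def laplace_cdf(z, u):
    """Equation (5): P(sum of u independent Uniform(-1, 1) variables <= z)."""
    x = (Fraction(z) + u) / 2
    total = Fraction(0)
    for i in range(0, floor(x) + 1):           # all i with i <= (z + u) / 2
        total += (-1) ** i * comb(u, i) * (x - i) ** u
    return total / factorial(u)


def laplace_number(c):
    """Estimate of P(sum is exactly 100 percent) for c >= 2 categories."""
    return 1 - 2 * laplace_cdf(-1, c - 1)


def normal_approximation(c):
    """Equation (4): 1 / sqrt(2 * pi * c / 12)."""
    return sqrt(6 / (pi * c))


if __name__ == "__main__":
    for c in (2, 3, 4, 5, 6, 7, 8, 9, 10, 25):
        print(c, round(float(laplace_number(c)), 3), round(normal_approximation(c), 3))
```

Exact rational arithmetic avoids cancellation in the alternating sum, which becomes noticeable in floating point as u grows.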

Table 6 gives the various estimates of the probability for the sum of the rounded percentages adding to exactly 100 percent. Both the Laplace result and the normal approximation are close to the simulated value for c = 6, 8, 10, 25, and for order of magnitude, either is adequate. All results give compatible probabilities. We could, if we wished, explore the distribution further, but the normal approximation should be adequate. The agreement would be improved still further if the normal approximation were moved one unit down in Table 6, except for the c = 25, which would be replaced by 0.276. The column would then agree closely with $\sqrt{6/(c\pi)}$. Apparently the latter formula handles the flatter shape of the true distribution for small c better than the normal we fitted does.
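A simulation of the kind described in this section can be sketched as follows. This is not the authors' program: the function names, the seed, and the sample size are illustrative choices, and sums are kept in grid units so that a total of exactly 100 percent corresponds to n = 1/δ.

```python
import random
from collections import Counter


def rounded_sum(c, n, rng):
    """Drop c - 1 uniform points on (0, 1), round each of the c spacings to the
    nearest multiple of 1/n, and return the sum in grid units."""
    cuts = sorted(rng.random() for _ in range(c - 1))
    edges = [0.0] + cuts + [1.0]
    lengths = [right - left for left, right in zip(edges, edges[1:])]
    return sum(round(y * n) for y in lengths)


def simulate(c, n, samples, seed=0):
    """Frequency distribution of rounded sums over `samples` broken sticks."""
    rng = random.Random(seed)
    return Counter(rounded_sum(c, n, rng) for _ in range(samples))


if __name__ == "__main__":
    n, samples = 1000, 20000
    for c in (3, 4, 6, 8, 10, 25):
        freq = simulate(c, n, samples, seed=c)
        print(c, freq[n] / samples)
```

With 20,000 samples the standard error of each estimated probability is roughly 0.003, somewhat smaller than that of the 1,000-sample runs reported above.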

Reprinted from The Annals of Mathematical Statistics (1969), 40, pp. 644–647

22. The Expected Coverage to the Left of the ith Order Statistic for Arbitrary Distributions

Barry H. Margolin¹ and Frederick Mosteller
Yale University and Harvard University

1. Introduction. The coverage of the $i$th order statistic $X_{(i)}$, $i = 1, 2, \ldots, n$, in a sample of size $n$ drawn from the continuous distribution $F$ is $F(X_{(i)})$. The distribution of $F(X_{(i)})$ is well known ([2], p. 236) to be a beta distribution with parameters $i$ and $n - i + 1$, and the expected coverage is

$$E(F(X_{(i)})) = i/(n + 1). \tag{1}$$

We want a definition of coverage of the $i$th order statistic that has expectation $i/(n + 1)$ in the general case where the parent distribution may have atoms. A natural way to define coverage in the general case involves the Scheffé-Tukey transformation [1], described below, plus a special randomization when the $i$th ordered observation falls at an atom. This approach generates coverages distributed according to the same beta distribution as the usual coverages generated by samples from a continuous distribution. Instead of using this approach for the general case, we introduce below a modified definition of coverage that avoids randomization and nevertheless has expected coverage equal to $i/(n + 1)$. For a continuous parent distribution $F$, the modified definition agrees with the usual one; if the parent has at least one atom, the modified coverage is not beta-distributed, and its distribution also has at least one atom.

2. The modified definition of coverage and its expectation. Let $X$ be a random variable (whose distribution is continuous, discrete, or mixed), and let

$$F^-(x) = \Pr\{X < x\}, \quad F(x) = \Pr\{X \le x\}, \quad p(x) = F(x) - F^-(x) = \Pr\{X = x\}, \quad V = \{x \mid p(x) > 0\}. \tag{2}$$

¹ Received 30 October 1967; revised 29 July 1968. The first author's research was supported in part by the Army, Navy, Air Force and NASA under a contract administered by the Office of Naval Research. The second author's research was facilitated by a National Science Foundation grant (GS-341). The authors wish to express appreciation for discussions with Paul Holland and I.R. Savage and for suggestions from the referees.


In a random sample of size $n$ from $F$, let $X_{(i)}$ be the $i$th ranked observation in ascending order of magnitude, so that $X_{(1)} \le X_{(2)} \le \cdots \le X_{(i)} \le \cdots \le X_{(n)}$. If there are ties in the sample, we may not be able to say which of the tied observations is the $i$th, only that it lies in a particular clump. For given values of $i$ and $n$, suppose that in the sample

$T_i$ observations have values less than $X_{(i)}$,
$W_i$ observations have values equal to $X_{(i)}$,     (3)
$n - T_i - W_i$ observations have values greater than $X_{(i)}$.

$T_i$ may take on the values $0, 1, \ldots, i - 1$, and consequently $W_i$ can take on the values $i - T_i, \ldots, n - T_i$.

Definition. The modified coverage of $X_{(i)}$ for a sample of size $n$ is defined as

$$C_i(X_{(i)}, T_i, W_i) = F^-(X_{(i)}) + (i - T_i)(W_i + 1)^{-1} p(X_{(i)}) \tag{4}$$

where $T_i$ and $W_i$ are described in (3). We use $X_{(i)} \in V$ to mean that the value of $X_{(i)}$ is an atom. If $X_{(i)} \in V$, then since $1 \le i - T_i \le W_i$,

$$F^-(X_{(i)}) < C_i(X_{(i)}, T_i, W_i) < F(X_{(i)}). \tag{5}$$

Note that if in a sample $X_{(i)} \notin V$, then $p(X_{(i)}) = 0$, $F^-(X_{(i)}) = F(X_{(i)})$, $W_i = 1$ and $T_i = i - 1$ with probability 1, and the modified coverage $C_i(X_{(i)}, T_i, W_i) = F(X_{(i)})$, the usual coverage for the continuous case. In a random sample of size $n$, we have the

Theorem.

$$E(C_i(X_{(i)}, T_i, W_i)) = i/(n + 1). \tag{6}$$

Proof. Our proof uses the Scheffé-Tukey ([1], p. 189) transformation, which we now describe. Let $X^*$ be a random variable having a uniform distribution on the interval from 0 to 1. Let $U$ denote the cumulative distribution function of $X^*$, i.e., if $x^*$ is a value of $X^*$,

$$U(x^*) = \begin{cases} 0 & \text{if } x^* < 0, \\ x^* & \text{if } 0 \le x^* \le 1, \\ 1 & \text{if } 1 < x^*. \end{cases}$$

Recall that $F$ is the cdf of the random variable $X$. Consider the transformation $X^* \to g_F(X^*)$ such that

$$F(g_F(X^*) - 0) \le U(X^*) \le F(g_F(X^*) + 0).$$

Observe that if $F^-(x) < x^* \le F(x)$, then $g_F(x^*) = x$, where $x \in V$. Scheffé and Tukey observed that to every $x^*$, $-\infty \le x^* \le \infty$, there corresponds at least one $g_F(x^*)$ and that this $g_F(x^*)$ is unique unless it lies in an interval to which $F$ assigns zero probability. In this case they (and we) assume that some


value in the interval is designated to be $g_F(x^*)$; which value is immaterial for our purposes. Scheffé and Tukey proved that $g_F(X^*)$ has the cdf $F$ and can thus be identified with the random variable $X$. A random sample $X_1^*, \ldots, X_n^*$ from $U$ transforms into a random sample $X_1, \ldots, X_n$ from $F$. For fixed $i$, consider those samples from $U$ in which:

$T_i$ observations are less than or equal to $F^-(X_{(i)})$,
$W_i$ observations have values in the interval $(F^-(X_{(i)}), F(X_{(i)})]$,
$n - T_i - W_i$ observations are greater than $F(X_{(i)})$,
$T_i = 0, \ldots, i - 1$, $\quad W_i = i - T_i, \ldots, n - T_i$,

i.e., for these samples the $i$th order statistic from the uniform sample, $X^*_{(i)}$, falls in the half-open interval $(F^-(X_{(i)}), F(X_{(i)})]$. The conditional distribution of $X^*_{(i)}$, given $X_{(i)}, T_i, W_i$ for $X_{(i)} \in V$, is that of the $(i - T_i)$th order statistic of a sample of size $W_i$ from a uniform distribution on the interval $(F^-(X_{(i)}), F(X_{(i)})] = (F^-(X_{(i)}), F^-(X_{(i)}) + p(X_{(i)})]$. Thus from the well-known theorem on order statistics from the uniform distribution, the expected value of $X^*_{(i)}$, given $X_{(i)}, T_i, W_i$, for $X_{(i)} \in V$, is

$$E(X^*_{(i)} \mid X_{(i)}, T_i, W_i) = F^-(X_{(i)}) + (i - T_i)(W_i + 1)^{-1} p(X_{(i)}) = C_i(X_{(i)}, T_i, W_i). \tag{7}$$

This is obviously true as well for $X_{(i)} \notin V$. We conclude that

$$E(C_i(X_{(i)}, T_i, W_i)) = E(E(X^*_{(i)} \mid X_{(i)}, T_i, W_i)) = E(X^*_{(i)}) = i/(n + 1). \tag{8}$$
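As a concrete check of the theorem, the following sketch enumerates every ordered sample of size n from a small discrete parent and computes the expectation of the modified coverage exactly with rational arithmetic. The three-point distribution, the sample size, and the function names are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction
from itertools import product


def modified_coverage(sample, i, pmf):
    """Modified coverage C_i of definition (4) for a discrete parent.

    sample: tuple of observed values; i: 1-based rank; pmf: value -> Pr{X = value}.
    """
    ordered = sorted(sample)
    x_i = ordered[i - 1]
    t_i = sum(1 for x in ordered if x < x_i)    # observations below X_(i)
    w_i = sum(1 for x in ordered if x == x_i)   # observations tied with X_(i)
    f_minus = sum((p for x, p in pmf.items() if x < x_i), Fraction(0))
    return f_minus + Fraction(i - t_i, w_i + 1) * pmf[x_i]


def expected_modified_coverage(pmf, n, i):
    """Exact E(C_i), summing over all ordered samples of size n."""
    total = Fraction(0)
    for sample in product(pmf, repeat=n):
        prob = Fraction(1)
        for x in sample:
            prob *= pmf[x]
        total += prob * modified_coverage(sample, i, pmf)
    return total


# Hypothetical parent with three atoms (probabilities sum to 1).
example_pmf = {0: Fraction(1, 2), 1: Fraction(3, 10), 2: Fraction(1, 5)}
example_n = 4

for rank in range(1, example_n + 1):
    print(rank, expected_modified_coverage(example_pmf, example_n, rank),
          Fraction(rank, example_n + 1))
```

Because the arithmetic is exact, the two printed fractions on each line should coincide, in agreement with (6); a purely discrete parent is used only because it makes full enumeration possible.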

Corollary 1. If the distribution $F$ has at least one atom, then

$$E(F^-(X_{(i)})) < i/(n + 1) < E(F(X_{(i)})). \tag{9}$$

Proof. This follows from the strict inequalities of (5).

Corollary 2.

$$(i + 1)^{-1} p_* P_i(V) \le E(F(X_{(i)})) - i(n + 1)^{-1} \le (n - i + 1)(n - i + 2)^{-1} p^* P_i(V)$$

where $P_i(V) = \Pr\{X_{(i)} \in V\}$, $p_* = \inf_{x \in V} p(x)$, and $p^* = \sup_{x \in V} p(x)$.

Proof. For $X_{(i)} \in V$,

$$F(X_{(i)}) - C_i(X_{(i)}, T_i, W_i) = [1 - ((i - T_i)/(W_i + 1))]\, p(X_{(i)}), \quad T_i = 0, \ldots, i - 1, \quad W_i = i - T_i, \ldots, n - T_i, \tag{10}$$


and for $X_{(i)} \notin V$,

$$F(X_{(i)}) - C_i(X_{(i)}, T_i, W_i) = 0.$$

Hence, for all $X_{(i)}$,

$$F(X_{(i)}) - C_i(X_{(i)}, T_i, W_i) = [1 - ((i - T_i)/(W_i + 1))]\, p(X_{(i)})\, I_V(X_{(i)})$$

where $I_V(x)$ is the indicator for the set $V$. Let $R_i = W_i + 1 - (i - T_i)$. Then as $W_i$ goes from $i - T_i$ to $n - T_i$, $R_i$ goes from 1 to $n - i + 1$. Now

$$1 - ((i - T_i)/(W_i + 1)) = 1 - ((i - T_i)/(i - T_i + R_i)),$$

and is monotonically increasing in both $R_i$ and $T_i$. Hence,

$$1/(i + 1) \le 1 - ((i - T_i)/(W_i + 1)) \le (n - i + 1)/(n - i + 2).$$

Therefore,

$$(i + 1)^{-1} p_* I_V(X_{(i)}) \le F(X_{(i)}) - C_i(X_{(i)}, T_i, W_i) \le (n - i + 1)(n - i + 2)^{-1} p^* I_V(X_{(i)}).$$

Taking expectations gives

$$(i + 1)^{-1} p_* P_i(V) \le E(F(X_{(i)})) - i/(n + 1) \le (n - i + 1)(n - i + 2)^{-1} p^* P_i(V).$$

Corollary 3.

$$(n - i + 2)^{-1} p_* P_i(V) \le i/(n + 1) - E(F^-(X_{(i)})) \le i(i + 1)^{-1} p^* P_i(V). \tag{11}$$

Proof. Similar to that for Corollary 2.

Remark. Corollaries 2 and 3 are probably more useful for the special case of a discrete distribution $F$, for which $P_i(V) = 1$, than for the mixed case.

References

[1] Scheffé, H. and Tukey, J.W. (1945). Non-parametric estimation, I. Validation of order statistics. Ann. Math. Statist. 16 187–192.
[2] Wilks, Samuel S. (1962). Mathematical Statistics. Wiley, New York.

Reprinted from Psychometrika (1971), 36, pp. 1–19

23. Bias and Runs in Dice Throwing and Recording: A Few Million Throws

Gudmund R. Iversen, Willard H. Longcor, Frederick Mosteller, John P. Gilbert, and Cleo Youtz

Mr. Longcor is from Waukegan, Illinois; the other authors are from Harvard University. Dr. Iversen has moved to the University of Michigan.

An experimenter threw individually 219 different dice of four different brands and recorded even and odd outcomes for one block of 20,000 trials for each die, 4,380,000 throws in all. The resulting data on runs offer a basis for comparing the observed properties of such a physical randomizing process with theory and with simulations based on pseudo-random numbers and RAND Corporation random numbers. Although generally the results are close to those forecast by theory, some notable exceptions raise questions about the surprise value that should be associated with occurrences two standard deviations from the mean. These data suggest that the usual significance level may well actually be running from 7 to 15 percent instead of the theoretical 5 percent. The data base is the largest of its kind. A set generated by one brand of dice contains 2,000,000 bits and is the first handmade empirical data of such size to fail to show a significant departure from ideal theory in either location or scale.

1. Introduction

How well do the laws of chance actually work? When a die is repeatedly thrown and its outcomes recorded, do imperfections in the die, in the throwing, in the perception of the outcome, and in recording appear? What sorts of deviations from chance do we find? Weldon's dice data [Fry, 1965] and Kerrich's coin tossing monograph [Kerrich, 1946] both give us some experience with large bodies of data produced by humanly run physical randomizing devices whose idealized probabilities and properties are known to a good approximation. In a sense, such experiments are controls on other experiments where probability plays an important role. For example, such dice and coin experiments give us an idea of how seriously we should take small departures from mathematically predicted results in investigations where we search for small departures from a standard. They do

The analysis was facilitated by a National Science Foundation grant GS-341 and its continuation GS-2044X. It forms part of a larger study of data analysis.


this by showing the sizes and kinds of departures observed in an experiment with no planned human or material effects. They are placebo experiments. If one does not believe in extra-sensory perception, then many ESP investigations also would be judged to qualify, but if one does believe in ESP then, in such experiments, departures from mathematical forecasts are contaminated by small effects over and above the procedural ones mentioned above. In the latter case, results of experiments like the one reported here are especially relevant because they provide a baseline for departures from perfection (unless of course the experimenter is making use of psychokinesis). Examining such data gives us a background of experience with physical devices for carrying out randomization. This experience can be compared with the results of pseudo-randomizing devices such as those used in high-speed computers. We report here an analysis of long sequences of throws of dice of four brands by a single experimenter, together with two control series—one based on the RAND random digits [1955], the other produced by a pseudo-random number generator on a high-speed computer. In the present investigation the experimenter, Willard Longcor, was curious as to whether enormously long sequences of throws would continue to behave according to the laws of chance. He had taken dice throwing and recording as a long-time hobby. In advance of the investigation but after pilot work, he and Frederick Mosteller agreed that the outcome of the throw of a single die would be recorded as even or odd, because Longcor had much experience with this particular way of recording. Testing for bias in dice with holes for pips might have been more powerful using “low” (1, 2, 3) versus “high” (4, 5, 6), but in pilot work Longcor found that this method produced recording errors for him and he scrapped the data based on it and returned to the more familiar even and odd. 
Recording the actual number was not an option the experimenter wished to try at that time. The matter is discussed further in Appendix 2. Longcor and Mosteller agreed that lengths of runs of evens would be the basis for summarizing the data. Inevitably this focuses the experimenter’s attention on runs and their lengths, but perhaps not more than it would naturally have been. Long runs are the natural source of surprise. Therefore anyone would look for bias in them. We shall then be attending to long runs, but we also want to look at the joint behavior of numbers of runs of various lengths, through their covariances, and at other aspects of the data, because biases might very easily appear in these second-order statistics that few can know or compute. Some sorts of bias, of course, cannot be detected by this even-odd method of recording. For example, if the probabilities of opposite sides are equally inflated and other opposites equally deflated as they might be if the die were a rectangular parallelepiped instead of a cube, this bias would not be detected by our recording method. And more generally, any bias leaving P (even) = P (odd) would not be detected.


The distribution theory of runs of two or more kinds of elements was developed in a pathbreaking paper [Mood, 1940], and we have leaned on this work for our theoretical values. In trying to apply Mood's formulas we found some errors, which we have corrected using Mood's methods of derivation. The required formulas are in Appendix 1 of this paper. In developing such formulas, errors arise from two sources. First, there are separate cases which lead to different formulas, and second, the algebra is tedious. Each of our formulas has been checked against a set of numerical examples large enough to illustrate all the cases.

2. The Experiment

Procedure. The experimenter fitted a desk top with carpeting and a step against which a die could be thrown. The die was thrown and the outcome recorded as "1" or "0" according as the top face of the die was even or odd. The die was thrown 20,000 times. Then the frequency distribution for even runs was constructed and totals checked. The data were placed in an envelope together with the die that produced them. The next 20,000 throws used a new die. The accumulated envelopes were shipped to Frederick Mosteller. Cleo Youtz checked samples of the tabulations and prepared the data for the computer.

Dice. Several different brands of dice were used. We were curious whether an inexpensive die with holes for the pips, with a drop of paint in each hole, would show bias, and whether precision dice such as used in the great Las Vegas gambling houses would be more accurate. Those familiar only with 5-and-10-cent store dice may wish a description. These precision-made dice, about 0.75 inch¹ on a side, have sharp edges meeting at 90° angles, and the pips are either lightly painted or constructed from extremely thin disks, the whole object being polished beautifully. They are shipped in plastic cases, each die fitting in a slot cut in foam rubber. At the time of purchase several years ago, they were about $1.25 a pair.

We shall call the brands used A, B, C, and X. A, B, and C were precision dice, while brand X were the usual inexpensive drilled plastic dice about 0.6 inch on a side. Brand A has 100 blocks of 20,000 throws, B has 30 blocks, C has 31 blocks, and X has 58 blocks, all told 4,380,000 throws. In addition to the four sets of data based upon real dice, we twice simulated throwing 100 blocks of 20,000, once using the tape of the RAND random numbers [1955], and once using a pseudo-random number generator on a high-speed computer. These sets of data have been analyzed in parallel with the rest.

¹ More precisely, the sizes were about 0.77 inches on a side for brand A, 0.71 for brand B, and 0.75 for brand C. The inexpensive dice were about 0.61 inches on a side.


3. Bias

To study imperfections in the dice or their throwing we can compare the relative frequency of even and odd throws. If the proportion of even throws in a series differs from 0.5 more than is reasonable on the basis of sampling variation, we conclude bias exists in the process. We may not be able to conclude that a single die is biased, but since we have many sets of throws for different kinds of dice, we may find evidence that a particular brand of dice is biased. A single die with probability 0.5071 of coming up even has about an 0.5 probability of producing a number of evens more than 2 standard deviations away from 10,000. To have an 0.95 chance of producing a count beyond 2 standard deviations, the die would need to have probability 0.5129 of coming up even. Thus we cannot expect a good chance of detecting bias in single dice unless the bias is of the order of 0.01 or 0.02.

In a series of n = 20,000 independent throws of a fair die, the number of even throws averages np = (20,000)(0.5) = 10,000 and has variance np(1 − p) = (20,000)(0.5)(0.5) = 5,000 and therefore standard deviation $\sqrt{5000} = 70.7$. For each of six sets of data we have many blocks of 20,000 throws, and Table 1 shows the percentage distributions of the number of even throws in the blocks for each set of data. In addition, the table shows a theoretical distribution obtained from the normal distribution with mean 10,000 and variance 5,000. The table gives means, variances, and standard deviations for the various distributions. The means of the distributions for brands A, B, C, and the random simulations are all within 1.2 standard deviations of the theoretical fair die mean; indeed, brand B's result is remarkably close to zero. Thus the means show no evidence of bias.

Since brand X has holes to show the numbers on each of its sides, we wanted to see whether its dice would be biased toward even tosses. The net effect of drilling holes to show the numbers makes the even sides 12 − 9 = 3 units lighter than the odd numbered ones, and so we expect the throwing to result in more even than odd sides. The effect on each throw of the die is slight, but it is enough to make the average number of even faces per block 10,145 and the proportion of observed even faces equal to 0.5072, or about 16 standard deviations too high. This deviation shows a large bias in the dice of brand X. Not only is brand X's mean large, but its every block of 20,000 tosses has more than 10,000 even faces. See Appendix 2 for further discussion of biased dice.

The observed variances divided by the theoretical variance should give F ratios with b − 1 and ∞ degrees of freedom, where b is the number of blocks. The approximate descriptive levels of significance are:

Data               A      B      C      R      P      X
P(F ≥ observed)   .07    .11    .54    .04    .69    .50
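The detection probabilities quoted at the start of this section (about 0.5 for p = 0.5071 and about 0.95 for p = 0.5129) can be reproduced with a normal approximation to the binomial. This is a sketch under that assumption; the function names are illustrative, and the paper does not say exactly how its figures were computed.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_beyond_two_sd(p_even, n=20_000):
    """Chance that the number of evens in n throws lies more than two
    fair-die standard deviations from n/2, when Pr(even) = p_even."""
    fair_mean = n * 0.5
    fair_sd = math.sqrt(n * 0.25)                 # 70.7 for n = 20,000
    mean = n * p_even
    sd = math.sqrt(n * p_even * (1.0 - p_even))
    upper_tail = 1.0 - normal_cdf((fair_mean + 2 * fair_sd - mean) / sd)
    lower_tail = normal_cdf((fair_mean - 2 * fair_sd - mean) / sd)
    return upper_tail + lower_tail


for p in (0.5, 0.5071, 0.5129):
    print(p, round(prob_beyond_two_sd(p), 3))
```

For a fair die the same function returns the familiar two-sided two-standard-deviation tail of roughly 0.05, which is the baseline against which the biased cases are judged.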

The RAND random numbers produced a one-tailed descriptive level of significance of 0.04, none of the rest were significant at the 0.05 level. We had no


TABLE 1
Percentage Distributions for Blocks of 20,000 Throws According to the Number of Even Throws in the Blocks for Theoretical Distributions and for Six Sets of Data, Together with Observed Means, Variances, and Standard Deviations, and Standard Scores for Mean.

Percentage Distribution of Blocks of 20,000 Throws
(R = RAND random numbers, P = Pseudo-random numbers)

Number of Even Throws      Theoretical      A       B       C       R       P       X
10,281 - 10,320                                                                     3
10,241 - 10,280                             1                                       7
10,201 - 10,240                             1                               1      12
10,161 - 10,200                 1           1               3       4       2      19
10,121 - 10,160                 3           5      10       6       3       2      16
10,081 - 10,120                 8           6      17      10       7       5      22
10,041 - 10,080                16          19       7      10      11      17      17
10,001 - 10,040                22          21       7      19      17      28       3
 9,961 - 10,000                22          20      20      35      21      21
 9,921 -  9,960                16          15      17      13      19      14
 9,881 -  9,920                 8           6      20       3      12       8
 9,841 -  9,880                 3           4       3               5       1
 9,801 -  9,840                 1           1                       1       1
Total*                        100%        100%    101%     99%    100%    100%     99%

Number of Blocks of
  20,000 Throws                           100      30      31     100     100      58
Mean − 10,000                   0           9      −1      14      −7       6     145
Variance of Block Totals     5000        6124    6651    4776    6348    4618    4933
Standard Deviation             71          78      82      69      80      68      70
Standard Deviation of Mean                7.8    14.9    12.4     8.0     6.8     9.2
Standard Score for Mean
  Based on Observed S.D.                 1.15    −.06    1.16    −.82     .86   15.70

* Totals may not add to 100 because of rounding (see Mosteller, Youtz, and Zahn [1967], where it is shown that the percentage of distributions adding to 100% exactly is about $100\sqrt{6/(c\pi)}$, where c is the number of categories used and π = 3.14159...).


reason to think that the observed variation would be smaller or larger than the theoretical, and so the reader may prefer to note that none of the data reached the two-sided 0.05 level, or that one set did reach the two-sided 0.10 level.

TABLE 2
Observed Means, Variances, and Standard Deviations of the Number of Runs of 10 or Longer Even Throws in Blocks of 20,000 Throws for the Six Sets of Data and the Corresponding Theoretical Parameters.

              Number of     Mean Number                              Standard     Standard Score
              Blocks of     of Runs of                               Deviation    (Based on
              20,000        Length 10     Estimated    Standard      of the       Observed
Data          Throws, b     or More       Variance     Deviation     Mean         S.D.)
Theoretical       1           9.76          9.66         3.1         3.1/√b
A               100          10.08         12.7          3.6         0.36           0.9
B                30           9.67          7.8          2.8         0.51          −0.2
C                31          10.52          9.1          3.0         0.54           1.4
R               100          10.17         10.0          3.2         0.32           1.3
P               100           9.77         10.6          3.3         0.33           .03
X                58          11.36         12.1          3.5         0.46           3.5; 0.6*

* Based on expected value of 11.10 for p = 0.5072.

4. Long runs

Turning to an examination of the experimenter as a possible source for results departing from the ideal model, we investigate the phenomenon of long runs. The occurrence of a long run of even throws represents an unusual event for the person throwing the die. We have investigated the data to see whether there are more such unusual events reported than could reasonably be expected under the chance model. Let us define a long run as one of length 10 or more. Table 2 shows the observed means, variances, and standard deviations of the number of long runs in each block of 20,000 throws.

When we investigated bias, we noted that brand X produced more even throws than theory predicts. It could then be that more even throws are associated with more long runs of evens in each block of 20,000 throws. The mean number of runs of length k or more is $p^k[(n - k)(1 - p) + 1]$. For k = 10, n = 20,000, changing p from 0.5 to 0.5072 would increase this mean from 9.76 to about 11.10. The latter is less than 1 standard deviation from the observed value. Therefore brand X's high mean of 11.36 can be attributed to bias in p. The mean numbers of runs of 10 or more of the other brands and the simulation data all are within 1.4 standard deviations of the expected value, and so the agreement is close. For the pseudo-random numbers, perhaps it is too close.
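The expected number of long runs quoted above follows directly from the formula for runs of length k or more. A minimal sketch (the function name is an illustrative choice):

```python
def mean_runs_at_least(k, n, p):
    """Expected number of runs of evens of length k or more in n throws,
    p**k * ((n - k) * (1 - p) + 1), as given in the text."""
    return p**k * ((n - k) * (1.0 - p) + 1.0)


fair = mean_runs_at_least(10, 20_000, 0.5)
biased = mean_runs_at_least(10, 20_000, 0.5072)
print(round(fair, 2), round(biased, 2))
```

The two values correspond to the theoretical 9.76 of Table 2 and the adjusted expectation of about 11.10 used for brand X.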


Similarly, the observed variances are not far from their theoretical values, the most extreme being for brand A whose F ratio is about at the 0.025 level. Since long runs are rare events, we might expect them to be approximately distributed according to the Poisson so that their mean and variance would be about the same, and they are close (see Table 2, first line). The longest runs of evens observed were:

A     B     C     R     P     X
24    19    19    21    29    23

To appreciate these record lengths of runs, we need more information about the distribution of the longest run. From Feller [1968, pp. 325–326, problem 26, p. 341],

$$P(\text{longest run} \ge m) \approx 1 - e^{-nqp^m} \qquad (n = 20{,}000,\ p = q = \tfrac{1}{2}).$$

We compute the following probability that the longest run of evens in 20,000 tosses is m or more:

 m    P(longest run ≥ m)         m    P(longest run ≥ m)
 8        1.00000                19        .01889
 9        1.00000                20        .00949
10         .99994                21        .00476
11         .99242                22        .00238
12         .91296                23        .00119
13         .70498                24        .00060
14         .45684                25        .00030
15         .26301                26        .00015
16         .14152                27        .00007
17         .07346                28        .00004
18         .03743                29        .00002
                                 30        .00001

mean = 13.62, variance = 3.51, standard deviation = 1.87
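The tabulated probabilities can be regenerated from the approximation quoted from Feller. The sketch below assumes that the stated mean was obtained as the sum of the tail probabilities P(longest run ≥ m) over m ≥ 1; that is a standard identity for a positive integer-valued variable, but the paper does not state its method, and the function names are illustrative.

```python
import math


def prob_longest_run_at_least(m, n=20_000, p=0.5):
    """Approximate P(longest run of evens >= m) = 1 - exp(-n * q * p**m)."""
    q = 1.0 - p
    return 1.0 - math.exp(-n * q * p**m)


def mean_longest_run(n=20_000, p=0.5, max_m=200):
    """Mean of the longest run as the sum of tail probabilities for m = 1..max_m."""
    return sum(prob_longest_run_at_least(m, n, p) for m in range(1, max_m + 1))


for m in range(8, 31):
    print(m, f"{prob_longest_run_at_least(m):.5f}")
print("mean", round(mean_longest_run(), 2))
```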

In 100 independent trials, the largest measurement would have an average probability of 1/101 ≈ 0.01 above it, and so we expect to find the longest run in 100 blocks of 20,000 at about 19 or 20. The longest run for RAND random numbers is not far off at 21, but brand A at 24 is rather high and the pseudo-random numbers are extremely high at 29. Brands B and C with about 30 blocks should give a longest run of about 18, and their observed 19’s are close. Finally brand X with 58 blocks should give about 19 or 20, instead of the observed 23. Generally, the longest run is higher than theory suggests, but only in the pseudo-random numbers is it outrageously high.


5. Persistence

Our main investigation of these data concentrates on the analysis of the runs of even throws. But from the data on runs and the knowledge of the total number of even throws, we can recover the number of transitions of each type (EE, EO, OE, OO) in consecutive throws, at least within 1. In particular, the number of even-even and odd-odd transitions tells whether there exists any tendency to have persistence, meaning that a throw of one particular kind is more likely to be followed by a throw of the same kind. Just as we might expect long runs from human factors, we also may expect more persistence than randomness would predict.

In a block of 20,000 throws of a die there are 19,999 consecutive transitions. Let $r_1$ and $r_2$ be the numbers of runs of evens and odds, respectively. Let $t_{oe}$ and $t_{eo}$ be the numbers of transitions from odd to even or even to odd. It is easy to see that $|t_{oe} - r_1| \le 1$ and $|t_{eo} - r_1| \le 1$. For the sake of simplicity, we assume that the number of odd-even transitions equals the number of even-odd transitions, and that both these numbers equal the number of runs of evens. Each number of these two kinds of transitions may therefore be one larger than the true number of transitions, but with around 5,000 transitions of each kind in a block of 20,000 throws, this possible source of error is small. However, since the two numbers of transitions may be one larger than the true numbers, we compensate by arbitrarily increasing the total number of transitions by one, to 20,000.

The mean numbers of the various transitions are shown in Table 3 for the six sets of data together with the theoretical values. We note that for the four sets of data (brands A, C, X, and the pseudo-random numbers) where the proportion of even throws is at least 0.5000, the highest number of transitions is the number of even-even transitions.

TABLE 3
Mean Number of the Various Transitions in Blocks of 20,000 Throws and Proportion of Throws Where a Throw of One Kind Is Followed by a Throw of the Same Kind for Six Sets of Data and Theoretical Values

                                                                  Proportion of Persistent
                   Mean Number of Transitions                     Transitions: Even-Even
Data           Even-Even   Even-Odd   Odd-Even   Odd-Odd          and Odd-Odd
Theoretical      5,000      5,000      5,000      5,000             0.5000
A                5,016      4,993      4,993      4,998             0.5007
B                4,996      5,004      5,004      4,996             0.4996
C                5,014      5,000      5,000      4,986             0.5000
R                4,989      5,005      5,005      5,001             0.4995
P                5,007      4,999      4,999      4,995             0.5001
X                5,160      4,985      4,985      4,870             0.5015


The last column of Table 3 shows some evidence of persistence in two of the six sets of data. That is, when we look at the proportions of throws where a throw of one kind is followed by a throw of the same kind, this proportion is significantly larger than the theoretical proportion of 0.5 for the data from brand A and brand X. (The standard deviation of the persistence proportion is $\sqrt{1/(4bn)}$, where b is the number of blocks and n = 20,000.) Another approach to studying persistence looks at the mean number of even runs, which is given in the even-odd column in Table 3. That mean has a theoretical standard deviation of $35/\sqrt{b}$ (see Appendix 1, Equation 5, setting k = 1), where b is the number of blocks of 20,000 throws. The data from brand A give a standard score of (4,993 − 5,000)/3.5 = −2.0 and the brand X data give a standard score of (4,985 − 5,000)/4.6 = −3.3 for the mean number of runs. If we correct for the bias in the brand X data and take the observed proportion 0.5072 of evens as the probability of an even throw, we expect to get 4,999 runs of evens. The observed mean of 4,985 runs then results in a standard score of (4,985 − 4,999)/4.6 = −3.0. Allowing for bias makes little difference. Thus, the data from brand X in particular show a smaller number of runs of even throws than is expected. A smaller number of runs of evens means fewer transitions of the kinds odd-even and even-odd and therefore more persistence. We presume that this persistence is a result of observer behavior rather than anything in the dice. Dice, it has been remarked, have neither memory nor conscience. Although persistence seems to be real, its total amount is slight. At an extreme, one could have 2 runs, one of 10,000 evens and one of 10,000 odds, and still have half evens.
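The standard scores in this paragraph are simple to recompute from the Table 3 entries. In this sketch the function names are illustrative, and the single-block standard deviation of 35 for the number of runs of evens is taken from the text rather than derived.

```python
import math


def persistence_score(proportion, b, n=20_000):
    """Standard score of a persistence proportion against 0.5,
    with standard deviation sqrt(1 / (4 * b * n))."""
    return (proportion - 0.5) / math.sqrt(1.0 / (4.0 * b * n))


def run_count_score(mean_runs, expected_runs, b, sd_single_block=35.0):
    """Standard score of the mean number of runs of evens over b blocks."""
    return (mean_runs - expected_runs) / (sd_single_block / math.sqrt(b))


print("A persistence:", round(persistence_score(0.5007, 100), 1))
print("X persistence:", round(persistence_score(0.5015, 58), 1))
print("A runs:", round(run_count_score(4_993, 5_000, 100), 1))
print("X runs:", round(run_count_score(4_985, 5_000, 58), 1))
print("X runs, bias-adjusted:", round(run_count_score(4_985, 4_999, 58), 1))
```

Both views lead to the same reading: brand A sits near two standard deviations and brand X near three, whichever statistic is used.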
For each set of data the four observed persistence proportions in Table 3 could be regarded as the sizes of Markov probabilities suggested by the data if they were thought of as generated by a Markov chain model instead of an independent throwing process. The departure from 0.5 would measure the extent of dependence. The departures from 0.5 are slight.

6. Mean number of runs of lengths 1 to 10

For each set of data we give in Table 4 the mean numbers of even runs of lengths 1 to 10. Table 5 shows the standard scores of the observed means in Table 4 using theoretical means and standard deviations based on the probability of an even number in a throw being 0.5. The data for brand X show a number of large standard scores. We recall that data from brand X has the highest proportion of even throws, and this high proportion of even throws has produced too few runs of lengths 1 and 2. Instead we get too many runs of lengths in the middle range from 3 to 8. If we use the observed proportion 0.5072 of evens as the probability of an even throw for brand X in calculating theoretical means and variances, we get the second column for brand X in Table 5, labeled X*. Among these adjusted standard scores, two lie outside the range from −2 to +2. The adjusted standard score for runs of length 1 is −2.30, which means that we still have

TABLE 4
Mean Number of Runs of Lengths from 1 to 10.

Length        Theoretical
of Run        (Approx.)       A        B        C        R        P        X
 1              2,500       2,492    2,510    2,497    2,509    2,499    2,450
 2              1,250       1,244    1,240    1,248    1,245    1,244    1,244
 3                625         631      628      634      626      629      636
 4                312         313      314      306      315      313      322
 5                156         155      158      159      155      158      164
 6                 78          78       79       75       77       79       84
 7                 39          40       38       40       38       39       42
 8                 20          20       18       19       19       19       23
 9                 10          10       10       11       10       10       10
10                4.9         5.0      5.1      4.9      4.8      4.9      5.5
b = number of blocks
of 20,000 throws              100       30       31      100      100       58

too few runs of that length. It is this shortage of runs of length 1 that gives too few transitions of the kinds even-odd and odd-even and which therefore led us to conclude that the brand X data showed slight but real persistence.

TABLE 5
Standard Scores for the Mean Number of Runs of Lengths 1 to 10 in Table 4 Based on Theoretical Means and Standard Deviations When the Probability of an Even Throw Equals 0.5. For Brand X the Standard Scores Are also Computed When the Probability of an Even Throw Equals 0.5072, Brand X's Observed Proportion of Even Tosses.

Length                                                                        ½(Total) for
of Runs       A        B        C        R        P        X        X*       A, B, C, X*
 1          −1.74     1.09     −.44     1.94     −.21    −8.24    −2.30        −1.70
 2          −2.02    −1.82     −.34    −1.48    −1.76    −1.33    −1.20        −2.69
 3           2.79      .63     2.15      .38     1.79     3.48      .58         3.08
 4            .58      .43    −2.00     1.47      .17     4.23      .15         −.42
 5           −.94      .83     1.26     −.99     1.07     4.99      .70          .92
 6            .22      .79    −1.69    −1.47      .75     4.96      .91          .12
 7           1.03     −.82     1.12    −1.06     −.82     3.27     −.27          .53
 8           1.47    −1.45     −.63    −1.09     −.74     5.37     2.25          .82
 9           −.68     −.40     1.64     −.59      .15      .92    −1.51         −.48
10            .68      .54      .06     −.23      .22     2.02     −.02          .63

If we attend to the standard scores for A, B, C, and X∗ of Table 5, half their sum is a standard score for the line if throwing is independent (see last column). We see that runs of length 1 and 2 are too infrequent and that this lack is compensated by too many runs of length 3. Beyond this the pooled effect is negligible. We could have chosen a weighted average here instead of weighting equally. Our approach takes the view that each brand has a fairly well determined set of parameters of its own and we want to estimate their mean (across brands). Other workers might prefer a different weighting or a different question. A slight approximation is involved in our handling of X∗, the referee reminds us, because we owe a degree of freedom for estimating the mean. Fortunately we have a good many degrees of freedom and so the effect should be slight, but if there were few degrees of freedom we would probably need to adopt a different approach altogether. Again, the effects being measured here are ones that would be due to the behavior of the observer in throwing, perceiving, and recording, rather than features caused by imperfections in a die.

7. Variances and covariances of runs of lengths 1 to 10

To see how the numbers of runs of a given length vary across the blocks of 20,000 throws in each set of data, we computed the observed variances $s_i^2$, $i = 1, \ldots, 10$, of the number of runs of length i. We also have the corresponding theoretical variances $\sigma_i^2$. If b is the number of blocks of 20,000 throws in a set of data, the quantities $(b-1)s_i^2/\sigma_i^2$ are distributed as chi-square with (b − 1) degrees of freedom. Since the degrees of freedom are not the same for the six sets of data, we make the chi-squares more comparable by computing the standard scores

$u_i = [(b-1)s_i^2/\sigma_i^2 - (b-1)]/\sqrt{2(b-1)}$

for the runs of various lengths. Since b is large, these standard scores are approximately normally distributed with means zero and variances one. The standard scores of the chi-squares are shown in Table 6. No unusual variations occur in brand A's data. Brand X shows a large variation in the number of runs of length 8. Large variations are found in brand C for runs of lengths 1 and 2.

The pseudo-random number simulation shows a large variation in the number of runs of length 10 from block to block, varying from 0 to 13 in the one hundred blocks, with an observed variance of 6.6 as opposed to a theoretical variance of 4.9. No simple explanation can be found for the large standard scores for brand C, especially for runs of length 1 and 2, for the random simulation R for runs of length 5 (a deviation of 2.63), or for brand X for runs of length 3 and 8. It is chastening for the random simulations to produce two large values. There are 6 values outside −1.96 to 1.96 in the table. We would expect 6 × 10 × 0.05 = 3.

Standard scores also provide a way to investigate the observed covariances $s_{ij}$ of the number of runs of length i and the number of runs of length j. We have the expected values $\sigma_{ij} = E(s_{ij})$, and when the numbers of runs of lengths i and j are normally distributed we have [Anderson, 1958, p.75]

$\operatorname{var}(s_{ij}) = (\sigma_{ij}^2 + \sigma_i^2\sigma_j^2)/(b-1)$
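The chi-square standard score used for Table 6 can be sketched in a few lines. The example below applies it to the one cell for which the text quotes both variances (pseudo-random numbers, runs of length 10, b = 100). The function name is illustrative, and because the quoted variances are rounded, the result is close to the tabled 2.57 but not identical to it.

```python
import math


def chi_square_standard_score(s2, sigma2, b):
    """Standard score of (b - 1) * s2 / sigma2, treated as chi-square on b - 1 df.

    s2     : observed variance of the number of runs across blocks
    sigma2 : theoretical variance
    b      : number of blocks of 20,000 throws
    """
    df = b - 1
    chi_square = df * s2 / sigma2
    # A chi-square variable on df degrees of freedom has mean df and variance 2*df.
    return (chi_square - df) / math.sqrt(2 * df)


# Pseudo-random simulation, runs of length 10: rounded variances quoted in the text.
u = chi_square_standard_score(s2=6.6, sigma2=4.9, b=100)

# Expected number of scores outside +/-1.96 among 6 data sets x 10 run lengths.
expected_extreme = 6 * 10 * 0.05
```

With the rounded inputs the score comes out near 2.44.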

TABLE 6
Standard Scores of $(b-1)s_i^2/\sigma_i^2$ for Runs of Length i, i = 1, . . . , 10.
b: Number of Blocks of 20,000 Throws
$s_i^2$: Observed Variance of the Number of Runs of Length i
$\sigma_i^2$: Theoretical Variance of the Number of Runs of Length i
Standard score: $[(b-1)s^2/\sigma^2 - (b-1)]/\sqrt{2(b-1)}$

Length
of Run      A       B       C       R       P       X
   1       0.53   −0.53    2.19    0.95    0.03   −1.08
   2       1.58   −0.26    3.61    1.24    0.45   −0.32
   3      −0.65    0.79    1.55    0.08    1.90   −2.06
   4      −0.31    0.49    1.38   −0.92   −0.87    0.16
   5       1.86    0.85   −1.86    2.63   −0.28   −1.25
   6       1.46   −0.64    0.07   −0.14   −0.11   −0.23
   7      −0.40    0.92    0.35    0.26    0.71    0.21
   8      −0.21   −0.12   −0.51    1.13   −0.20    3.67
   9      −1.61   −0.97   −0.18   −0.41    0.43   −0.11
  10      −0.32    0.37    0.56    0.92    2.57    0.65

The numbers of runs of lengths from 1 to 10 result in 45 covariances. The observed distributions of standard scores of these covariances are shown in

Table 7 for the six sets of data. A total of 19 of the 270 standard scores, or 7.0 per cent of the scores, lie outside the range from −2 to 2. In a standard normal distribution 4.6 per cent of the probability lies outside the same range. None of the means of the six distributions in Table 7 are significantly different from zero, and only the variance for the RAND numbers is significantly different from one. We conclude from the results in Table 7 that the observed covariances, especially for the dice, are not significantly different from the theoretical covariances when compared using standard scores.

Let us compare the theoretical and observed covariance matrices by another method. Let Σ be the p × p covariance matrix of a multivariate normal distribution. Bartlett [1950], Anderson [1958, p.265], and Kullback [1959] have developed the theory for testing the null hypothesis $H_0: \Sigma = \Sigma_0$ against the alternative hypothesis $\Sigma \neq \Sigma_0$. They obtain the test statistic

$L_p = (b-1)[\ln(|\Sigma_0|/|S|) - p + \operatorname{tr}(S\Sigma_0^{-1})]$

where b is the number of blocks and S is the unbiased sample covariance matrix. The asymptotic distribution of $L_p$ is chi-square with p(p + 1)/2 degrees of freedom. Using our theoretical covariance matrix, we computed the value of $L_p$ for each of the six data sets for p = 1(1)10, for a total of 60 values of L. These values are shown in the top panel of Table 8. When going from a (p − 1) × (p − 1) covariance matrix to a p × p covariance matrix, we get p additional terms. The increase in the statistic is $L_p - L_{p-1}$, and the number of degrees of freedom increases by p(p + 1)/2 − (p − 1)p/2 = p. In Table 8 we show in the bottom panel the quantity

TABLE 7
Distributions of the Standard Scores $\sqrt{b-1}\,(s_{ij}-\sigma_{ij})/(\sigma_{ij}^2+\sigma_i^2\sigma_j^2)^{1/2}$ of the Observed Covariances $s_{ij}$ of the Number of Runs of Length i and the Number of Runs of Length j, i = 1, . . . , 9, j = i + 1, . . . , 10.

Standard score classes: fourteen classes of width 0.5, from 3.0 to 3.5 at the top down to −3.5 to −3.0 at the bottom.

Class frequencies, listed from the highest occupied class downward (empty classes are not shown, so the alignment with the classes is not recoverable here):

A: 1 1 2 3 8 10 8 8 2 2
B: 2 5 4 8 11 6 5 2 1, with 1 further covariance in a separate class
C: 7 9 5 8 8 5 2 1
R: 3 4 4 5 8 9 3 1 2 2 3 1
P: 1 5 8 8 10 5 3 2 2 1
X: 1 1 6 7 8 8 11 1 2

                            A       B       C       R       P       X
Number of Covariances      45      45      45      45      45      45
Mean                     −0.03   −0.01   −0.08   −0.02   −0.07    0.07
Variance                  1.12    0.90    0.89    1.97    1.11    0.75
S.D.                      1.06    0.95    0.94    1.40    1.05    0.87
S.D. of Mean∗             0.16    0.14    0.14    0.21    0.16    0.13

∗ This standard deviation is a little off because we have neglected correlation in computing it.

$(L_p - L_{p-1})/p$, which is the increase in L per degree of freedom. This gives an additional way of studying the changes in $L_p$ with increasing p. Under the null hypothesis, the average value of an entry in the bottom panel of Table 8 is 1. By making the calculation for successively larger numbers of lengths of runs, we can get a clue to where unusual values arise. All the theoretical covariances are negative. For brand A a large entry occurs in Table 8 for p = 7, primarily due to the appearance of a positive observed covariance in the (2, 7) cell. Brand C has large discrepancies both for p = 1 and p = 2 that have no easy explanation; further in brand C, the additional increase at p = 5 results from the appearance of two positive covariances. The first positive covariance for the pseudo-random number simulation appears at p = 6, where one observes a rather large difference in the bottom panel. Of the 60 numbers in the top panel of Table 8, we expect 3 to be significant at the 2-sided 0.05 level, and we find that 6 are. Since there is some correlation among the numbers in a column, once a given column is high (or low), it may persist in this state. Thus, brand C for p = 3 should scarcely be counted as a new high value. Only brand C has a record that looks poor using these statistics.

8. Summary

The dice and the experimenter seem to have behaved nearly like an unbiased, independent, randomizing device for the variables that we have measured (means, variances, covariances, persistence, and long runs), the only exception being the dice with holes for pips. The persistence, though in a few cases significant, was slight in magnitude. Except for the nonprecision dice (p ≈ 0.5072), the bias was small. The mean numbers of runs showed small departures from expectations: too few runs of length 1 or 2, and too many of length 3. Aside from this the joint behavior of the numbers of runs followed theory fairly closely. Brands A and X, and especially the pseudo-random numbers, produced longest runs that were too large for theory.

As a basis for comparing a physical randomizing device (including a human processor) with an ideal randomizing device, we found the following frequency distributions of "standard scores" for all the standard scores that we show in this publication (except those for the unadjusted statistics of brand X dice). These standard scores are all based on "large" numbers of degrees of freedom, and asymptotic sampling theory leads us to expect these scores to follow approximately the standard normal distribution for idealized dice throwing.

                        −∞ to   −3 to   −2 to   −1 to    0 to    1 to    2 to    3 to
                          −3      −2      −1       0       1       2       3       ∞     Total
A, B, C, X: Observed               7      31      94      86      40       6       3      267
            Normal        .4     5.7    36.3    91.1    91.1    36.3     5.7      .4    267.0
R:          Observed       1       5       7      19      18      13       4               67
            Normal        .1     1.4     9.1    22.9    22.9     9.1     1.4      .1     67.0
P:          Observed               3       6      22      26       8       2               67
            Normal        .1     1.4     9.1    22.9    22.9     9.1     1.4      .1     67.0

All three sets of data generally follow the normal distribution. All three show a tendency for there to be too many values beyond ±2 standard deviations. The ratio between observed and expected is about 1.3 for the dice, about 3 for the RAND random numbers, and about 1.7 for the pseudo-random numbers. Thus when we think we are working at a significance level of about 5%, we may well be somewhere between 7 and 15%. The bias of the nonprecision die has been removed from the effects of the standard scores since this was an obvious physical defect in the system whose magnitude was to be measured rather than tested for. A few questions remain about the variability of the dice themselves, and these are being pursued with further experimentation.

TABLE 8
Table of $L_p$ to Two Significant Digits, Testing Whether Observed Variances and Covariances of Run Statistics Agree with Theoretical Values. $L_p$ has Chi-square Distribution with p(p + 1)/2 Degrees of Freedom.

  p       A        B        C        R        P        X      p(p + 1)/2
  1      .27      .30      3.5      .82      .00 l    1.4          1
  2      3.3      1.2     14. h     2.3      .22      2.3          3
  3      5.2      4.0     15. h     5.4      3.3      8.9          6
  4     13.       9.0     18.      14.       5.3     11.          10
  5     20.      13.      29. h    27.       7.0     17.          15
  6     29.      23.      32.      34.      21.      21.          21
  7     45. h    28.      36.      43.      27.      31.          28
  8     50.      37.      43.      50.      30.      42.          36
  9     63.      42.      47.      58.      41.      53.          45
 10     74.      48.      60.      84. h    56.      63.          55

h: Value Above 0.975 Level of the Cumulative.
l: Value Below 0.025 Level of the Cumulative.

Values of $(L_p - L_{p-1})/p$ for the Six Sets of Data for p = 1, 2, . . . , 10, Where $L_0 = 0$.

  p       A       B       C       R       P       X
  1     0.27    0.30    3.51    0.82    0.00    1.36
  2     1.53    0.43    5.39    0.74    0.11    0.48
  3     0.63    0.95    0.32    1.03    1.02    2.18
  4     1.90    1.24    0.72    2.12    0.51    0.63
  5     1.46    0.74    2.09    2.67    0.33    1.05
  6     1.47    1.76    0.55    1.21    2.29    0.81
  7     2.32    0.75    0.60    1.20    0.84    1.37
  8     0.60    1.04    0.85    0.86    0.48    1.41
  9     1.43    0.62    0.42    0.95    1.22    1.19
 10     1.15    0.58    1.35    2.55    1.45    0.98
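The "Normal" rows of the standard-score frequency table in the Summary, and the observed-to-expected ratios beyond ±2 quoted there, can be re-derived from the standard normal distribution. The following is a minimal sketch; the function names are illustrative, and the observed counts are copied from that table with blank cells taken as zero (an assumption).

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_counts(total):
    """Expected counts in the eight classes (-inf,-3], (-3,-2], ..., (3, inf)."""
    edges = [-math.inf, -3, -2, -1, 0, 1, 2, 3, math.inf]
    return [total * (normal_cdf(hi) - normal_cdf(lo))
            for lo, hi in zip(edges, edges[1:])]


def tail_ratio(observed):
    """Observed / expected number of scores beyond +/-2 standard deviations."""
    expected = expected_counts(sum(observed))
    observed_tail = sum(observed[:2]) + sum(observed[-2:])
    expected_tail = sum(expected[:2]) + sum(expected[-2:])
    return observed_tail / expected_tail


# Observed counts from the Summary table (blank cells treated as 0).
dice = [0, 7, 31, 94, 86, 40, 6, 3]      # A, B, C, X
rand = [1, 5, 7, 19, 18, 13, 4, 0]       # R
pseudo = [0, 3, 6, 22, 26, 8, 2, 0]      # P

ratios = {name: tail_ratio(obs)
          for name, obs in [("dice", dice), ("R", rand), ("P", pseudo)]}
```

The ratios come out near 1.3, 3.3, and 1.6, in line with the rounded figures given in the text.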


9. Acknowledgments

Joseph I. Naus of Rutgers University has been particularly helpful in suggesting references relating to the distribution of long runs. Paul Holland of Harvard University and Bradley Efron of Stanford University developed a mathematical treatment on bias and persistence going well beyond that given in section 5. We had helpful discussions with Arthur Dempster, especially concerning the analysis of the covariance data. We wish to thank the referee for a number of suggestions helping us to clarify the exposition.

Appendix 1
Formulas for means, variances, and covariances of runs of one of two kinds of elements

In this appendix we follow Mood's [1940] notation so that the reader may refer readily to that basic paper. On each of n independent trials either a or b occurs, with $P(a) = p_1$, $P(b) = p_2$, $p_1 + p_2 = 1$. (The total number of a's is not fixed.) Let $r_{1i}$ be the number of runs of a's of length exactly i and let $s_{1k} = \sum_{i \ge k} r_{1i}$. The mean number of runs of a's of length i is $E(r_{1i}) = p_1^i p_2[(n - i - 1)p_2 + 2] = p_1^i$

i 2, the difference $p_*^2 - p^2$ is the difference between the squares of two odd primes, which means that the interval length is divisible by 24. This gives a convenient range of possible values of k, the number of segments; we use k = 4, 8.


Frederick Mosteller

4. Distributions in Four Segments

In Table 1 we show, for each homogeneous interval whose left end point is the square of an odd prime of 29 or less, the distribution of the primes into four intervals of equal size. We show for each interval the computed value of $\chi^2$, which would be a sample from an approximately chi-square distribution with 3 degrees of freedom, and therefore average 3, if the multinomial distribution applied. The counts of primes in each row of Table 1 are very nearly uniformly distributed: each segment has nearly one-fourth the primes in the row. Note that no value of $\chi^2$ was as large as the mean, 3, and that all but one were much smaller. It is almost embarrassing that the totals do not decrease from left to right, since 1/ln x is monotonically decreasing.

5. Distribution into Eight Segments

When we have more primes available per interval, as we do for the larger p's, we divided the homogeneous intervals into eighths rather than quarters and looked again at the distribution. Table 2 shows this distribution for p's used to determine the left end point of the interval running through successive primes from 19 to 293. Table 3 shows some selected additional intervals, the last of which has a lower boundary of $2953^2 \approx 9 \times 10^6$. The $\chi^2$ values in Tables 2 and 3 never get as high as 7, the theoretical average for the urn model. The tables show clearly then that the primes, at least the early primes, are distributed much more nearly uniformly than samples from the multinomial with equal probabilities would yield. The $\chi^2$'s seem to be rising slightly as p increases, but very slightly.

Table 1
Distribution of Primes into Four Equal Intervals between Successive Squared Primes $p^2$ and $p_*^2$

   Primes         Intervals of length (1/4)($p_*^2 - p^2$), for $p^2 \le x < p_*^2$
   p     p^2         1       2       3       4      Total N     χ^2
   3       9         1       1       2       1          5        .60
   5      25         1       1       2       2          6        .67
   7      49         3       5       3       4         15        .73
  11     121         2       2       2       3          9        .33
  13     169         6       4       6       6         22        .55
  17     289         1       4       2       4         11       2.45
  19     361         7       7       6       7         27        .11
  23     529        11      14      11      11         47        .57
  29     841         4       4       4       4         16       0
  Totals            36      42      38      42        158       6.01
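The rows of Table 1 can be recomputed directly: count the primes in each of k equal segments of the interval from p² up to (but not including) the next squared prime, then compute the usual equal-probability χ². The sketch below is one way to do this; the function names are illustrative, and trial division is used only because the intervals in Table 1 are small.

```python
def is_prime(x):
    """Trial-division primality test, adequate for the small ranges used here."""
    if x < 2:
        return False
    d = 2
    while d * d <= x:
        if x % d == 0:
            return False
        d += 1
    return True


def segment_counts(p, p_next, k):
    """Counts of primes in k equal segments of [p^2, p_next^2).

    Assumes k divides the interval length p_next^2 - p^2.
    """
    lo, hi = p * p, p_next * p_next
    width = (hi - lo) // k
    counts = [0] * k
    for x in range(lo, hi):
        if is_prime(x):
            counts[(x - lo) // width] += 1
    return counts


def chi_square(counts):
    """Chi-square statistic against equal expected counts in every segment."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


row_17 = segment_counts(17, 19, 4)   # the p = 17 row of Table 1
chi_17 = chi_square(row_17)
```

For p = 17 this reproduces the tabled counts 1, 4, 2, 4 and the χ² value 2.45.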

24 The Distribution of Primes and Litters of Primes

Table 2
Distribution of Primes into Eight Equal Intervals between Successive Squared Primes $p^2$ and $p_*^2$

   Primes          Intervals of length (1/8)($p_*^2 - p^2$), for $p^2 \le x < p_*^2$
   p      p^2        1     2     3     4     5     6     7     8    Total N    χ^2
  19      361        3     4     3     4     4     2     4     3       27      1.15
  23      529        4     7     7     7     4     7     4     7       47      2.87
  29      841        1     3     3     1     2     2     2     2       16      2.00
  31      961        7     9     8     6     6     7     9     5       57      2.09
  37     1369        3     6     7     6     5     6     7     4       44      2.55
  41     1681        3     2     3     3     2     3     2     2       20      0.80
  43     1849        7     5     4     8     5     7     7     3       46      3.74
  47     2209       10    11    10     8     8     8    15    10       80      3.80
  53     2809        9    11     8     9    10     9    13     9       78      1.79
  59     3481        2     6     4     4     4     4     3     5       32      2.50
  61     3721       10    12    12    12     9    15     9    11       90      2.44
  67     4489        8     7    10     7     9     4    11    10       66      4.30
  71     5041        2     6     3     4     3     4     4     4       30      2.53
  73     5329       14    13    15    12    17     9    13    13      106      2.83
  79     6241       11    12     7     7     8    11     9    10       75      2.76
  83     6889       16    12    14    11    15    17    15    14      114      1.93
  89     7921       19    21    19    19    25    19    21    20      163      1.47
  97     9409       14     8    12    12    12     8    11    12       89      2.78
 101    10201        4     7     7     4     4     7     3     6       42      3.71
 103    10609       12     9    12    10    12     8    13    11       87      1.92
 107    11449        6     5     4     4     6     3     8     6       42      3.33
 109    11881       15    10    12    12    12    14    14    11      100      1.60
 113    12769       47    38    46    40    47    44    46    46      354      1.84
 127    16129       12    11    12    12    14    11    15    12       99      1.12
 131    17161       20    22    21    20    22    22    22    16      165      1.45
 137    18769        5     4     7     6     6     6     8     7       49      1.78
 139    19321       36    40    39    34    34    34    42    40      299      1.98
 149    22201        5     8     6     7     9     6     9     8       58      2.14
 151    22801       23    23    19    26    25    29    18    19      182      4.64
 157    24649       21    24    24    22    24    23    26    22      186      0.75
 163    26569       20    16    16    13    13    17    12    21      128      4.75
 167    27889       26    20    31    26    23    27    24    21      198      3.54
 173    29929       24    23    26    26    27    23    27    19      195      2.13
 179    32041       11     9     9    11     9    12     8     7       76      2.11
 181    32761       49    47    41    49    42    46    41    41      356      2.07
 191    36481       10     9    11    10    12     9     7     9       77      1.65
 193    37249       18    23    15    17    16    18    16    21      144      2.89
 197    38809        9     9     9    11    10    11     7     9       75      1.27
 199    39601       60    51    60    62    62    58    51    59      463      2.40
 211    44521       61    57    58    58    62    58    63    62      479      0.65
 223    49729       23    24    19    20    16    21    19    26      168      3.43
 227    51529       11    10    12    11     9    11    11     7       82      1.71
 229    52441       20    22    23    21    18    23    22    17      166      1.71
 233    54289       39    29    32    31    35    30    36    38      270      2.95
 239    57121       12    12     9    10    12    13    10    12       90      1.20
 241    58081       56    57    56    53    55    55    52    54      438      0.36
 251    63001       31    43    30    30    32    35    38    36      275      4.24
 257    66049       29    35    39    39    37    26    37    32      274      4.72
 263    69169       35    29    39    35    39    41    35    39      292      2.79
 269    72361       10    10    15    10    15    11     8    12       91      3.86
 271    73441       37    37    36    39    34    39    36    34      292      0.71
 277    76729       24    26    28    28    21    24    23    25      199      1.64
 281    78961        8    15    13    10    15    10    17    11       99      5.48
 283    80089       67    64    71    60    62    61    63    64      512      1.37
 293    85849       87    97    81   101    88    90   103    88      735      4.45

Table 3
Distribution of Primes into Eight Equal Intervals between Selected Successive Squared Primes $p^2$ and $p_*^2$

   Primes            Intervals of length (1/8)($p_*^2 - p^2$), for $p^2 \le x < p_*^2$
    p       p^2        1      2      3      4      5      6      7      8    Total N     χ^2
   439    192721      40     36     38     30     36     39     33     38       290      2.14
   727    528529      80     88     73     90     80     77     81     97       666      5.14
   769    591361      53     62     67     52     66     55     57     58       470      3.87
  1097   1203409     111    116    109    111    122    129    122    115       934      2.67
  1483   2199289      94     98    106    106     98    110     98    119       829      4.59
  1667   2778889      53     55     60     58     51     54     66     53       450      2.98
  1987   3948169     206    183    197    191    186    186    194    202      1545      2.43
  2029   4116841     323    323    325    318    333    325    351    338      2636      2.44
  2411   5812921     235    222    237    217    235    237    217    237      1837      2.60
  2659   7070281     181    165    178    166    169    168    165    190      1382      3.45
  2953   8720209     187    188    181    196    181    178    188    185      1484      1.20

Totals (Tables 2 and 3 combined):
                    2689   2666   2708   2653   2684   2675   2721   2744     21540    168.16


But if the computed $\chi^2$'s are ever to match the theoretical chi-square distribution, we can see that a close fit, if there is one, would have to be sought much further out in the integers than we have gone. If the urn model is asymptotically correct, then as the primes get sparser, the $\chi^2$ values should slowly drift up to an average of 7 per line in an extension of Table 3. Tables 1, 2, and 3 were constructed with the aid of the tables of primes by Lehmer (1956).

6. Pressure Toward Uniformity

Suppose that the N balls are tossed into M boxes in k sets, where M = kT, T an integer. Under free tossing the variance of the number of balls in any specific segment is $N(1/k)(1 - 1/k)$ and the expected value of $\chi^2$ is $(k/N)\,k\,N(1/k)(1 - 1/k) = k - 1$, as we already knew. The point of the derivation, though, is to see the role of the binomial variation. If we have sampling without replacement as noted earlier, the effect is to multiply the binomial variance by 1 − f, where f is the fraction that the sample is of the possible values. Initially, if we attend only to the fact that the same numbers cannot be used as primes more than once, the value of f is N/M. But as soon as we note that even numbers are not allowed, then M is essentially cut in half and f = 2N/M. As further integers are disallowed, the multiplier of N/M increases. As we noted earlier, we shall use f as our measure of pressure toward uniformity of spacing of primes.

Instead of trying to evaluate the change by theoretical methods, we can find the average value of f over the first 90,000 integers by using the data from Table 3. The average $\chi^2$ is about 2.5 instead of 7, implying that f must be about 0.74. In sampling problems we usually have small values of f, but this value is large, indeed much larger than it would be if the "without replacement" feature were the only pressure, for then it would be less than 0.1.

7. Larger Integers

These results suggest that we should move to much larger values of the integers if we are to see f reduce. Fortunately we had available the tables by Gruenberger and Armerding (1961). These tables give the counts of primes in intervals of length 50,000. We took the counts in intervals of length 50,000 starting with $10 \times 10^6$. The numbers from $10 \times 10^6$ to $12.5 \times 10^6$ give 50 such counts. We computed the mean and the variance of such counts. Since 1/ln x does decrease a bit over such an interval, in computing the empirical variance we computed sums of squares for sets of 10 successive intervals and then pooled them. The result is an estimate of variance based upon 45 degrees of freedom instead of 49, but it should be less affected by the change in mean. (See Snedecor, 1946.) The means and variances appear in Table 4 for the first interval of length $2.5 \times 10^6$ following whole multiples of $10 \times 10^6$, plus an extra line in the middle to give us 10 sets. The mean value of the number of primes in these intervals of length 50,000 is 2,845. The variances vary from 615 to 1,286, about a factor of two, and they average 975. According to discrete Poisson theory the mean would be equal to the variance, so the combined effect of sampling without replacement and of restriction is measured by the value of f that makes 2,845(1 − f) = 975. This means an average value of f = 0.66; thus f has been reduced by only about 10% from 0.74 as we moved from $90 \times 10^3$ to $90 \times 10^6$. There may be a slight effect here due to the different technology of measurement in the small and the large sets of integers, but it seems doubtful. The main point is that f is still large, and the counts of primes do not vary as much from interval to interval as the urn scheme requires.
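The two calculations in this section, pooling the within-set sums of squares and solving mean × (1 − f) = variance for f, can be sketched as follows. The function names are illustrative, and the list of 50 counts is made-up placeholder data, not values from Gruenberger and Armerding; only the averages 2,845 and 975 come from the text.

```python
from statistics import variance


def pooled_variance(counts, group_size=10):
    """Pool within-group sums of squares over successive groups of counts.

    Returns (pooled variance, degrees of freedom). With 50 counts in
    groups of 10 this gives 5 * 9 = 45 degrees of freedom.
    """
    groups = [counts[i:i + group_size] for i in range(0, len(counts), group_size)]
    sum_of_squares = sum(variance(g) * (len(g) - 1) for g in groups)
    df = sum(len(g) - 1 for g in groups)
    return sum_of_squares / df, df


def pressure_f(mean_count, variance_count):
    """Solve mean * (1 - f) = variance for f, the pressure toward uniformity."""
    return 1.0 - variance_count / mean_count


# Hypothetical counts for 50 successive intervals (placeholder data only).
hypothetical_counts = [3080 + (i % 10) - i // 10 for i in range(50)]
pooled, df = pooled_variance(hypothetical_counts)

# Averages quoted in the text for primes in intervals of length 50,000.
f_primes = pressure_f(2845, 975)
```

With the quoted averages, f comes out at about 0.66, as stated above.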

8. Prime Litters: Twin, Triple, Quadruple

If primes are to be more nearly Poissonly distributed, apparently to see it one will have to study much larger integers than we have. More attractive would be to show theoretically that the distributions would or would not approximate the urn scheme asymptotically.

Another line of empirical study is suggested by the multiple primes: the twin, triple, and quadruple primes. Instead of going further out in the numbers, by choosing a subset of the primes that is rarer, one might hope to see the Poissonness, if it exists, exert itself earlier in the integers. The tables of Gruenberger and Armerding (1961) give tabulations of these values. Our Table 4 shows again for intervals from $10 \times 10^6$ to $90 \times 10^6$ by intervals of 50,000 (except for twin primes, where intervals of 100,000 were used) calculations comparable to those just described for the primes.

Twin primes occur when p and p + 2 are prime (example: 11 and 13). Triple primes do not exist except for 1, 3, 5 (if you allow 1 as a prime) and 3, 5, 7, and so the name of triple prime is reserved for sets of the form p, p + 2, and p + 6 all prime, called a 2–4 triple prime, with a 4–2 triple prime being a set of the form p, p + 4, p + 6, all prime. The counts for quadruple primes given by Gruenberger and Armerding (1961) are for type 2–4–2.

For the twin primes the averages of the means and variances of the counts are 425 and 304 respectively, shown in Table 4; therefore, we compute the average f to be f = 0.28, which does represent a substantial move in the direction of Poissonness compared to f = 0.66 for primes. For triple primes, averaging the results for the two kinds gives an average f = 0.12, a serious further reduction. The quadruple primes gave f = 0.13. This last proportion raised the question of whether large sampling error may have entered in. We therefore made further calculations.
Table 5 gives three different frequency distributions of quadruple primes (one each for intervals starting at 10 million, at 50 million, and at 90 million). After allowing for the effect of the slightly changing means, we computed f for each of these distributions and found the average f to be 0.08, a considerable reduction compared to 0.13 based upon fewer intervals.
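The values of f quoted in this section follow from the averages in Table 4 by f = 1 − variance/mean, and the summary rows of Table 5 follow from its frequency distribution. The sketch below recomputes both; the dictionary names are illustrative, and the figures are copied from Tables 4 and 5.

```python
# (mean, variance) averages from the bottom row of Table 4.
table4_averages = {
    "twin": (425, 304),
    "triple 2-4": (26.6, 23.4),
    "triple 4-2": (26.4, 23.1),
    "quadruple": (2.17, 1.89),
}

# Pressure toward uniformity, f = 1 - variance / mean, for each kind of litter.
f = {name: 1.0 - var / mean for name, (mean, var) in table4_averages.items()}
f_triple = (f["triple 2-4"] + f["triple 4-2"]) / 2

# First column of Table 5: number of quadruple primes per interval of 50,000
# between 10,000,000 and 20,000,000, with the frequency of each count.
frequency = {0: 13, 1: 29, 2: 42, 3: 48, 4: 42, 5: 15, 6: 8, 7: 3}
n = sum(frequency.values())
mean = sum(x * k for x, k in frequency.items()) / n
var = sum(k * (x - mean) ** 2 for x, k in frequency.items()) / (n - 1)

# Average of the three f values reported in the last row of Table 5.
f_quadruple_table5 = (0.14 + 0.0 + 0.11) / 3
```

These reproduce f = 0.28, 0.12, and 0.13 from Table 4, the Table 5 mean 2.84 and variance 2.47, and the average f of about 0.08.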


Table 4
Means and Variances of Numbers of Primes, Twin Primes, Triple Primes, and Quadruple Primes in Intervals of Length∗ 50,000 for Integers from $10 \times 10^6$ to $92.5 \times 10^6$

Interval                Primes               Twin Primes∗
(millions)          Mean     Variance      Mean    Variance
10.0  12.5        3082.48     615.20      493.60    394.08
20.0  22.5        2960.66    1139.40      462.10    275.04
30.0  32.5        2897.80     924.60      441.06    448.18
40.0  42.5        2850.92    1024.79      424.56    265.94
50.0  52.5        2815.86     840.09      415.38    210.02
55.0  57.5        2805.50    1286.49      413.76    347.68
60.0  62.5        2787.44    1125.77      408.28    185.93
70.0  72.5        2766.90    1127.40      400.66    344.43
80.0  82.5        2747.56     867.06      397.44    307.25
90.0  92.5        2730.30     797.02      393.40    261.42
Average           2845        975         425       304

Interval          Triple Primes 2–4    Triple Primes 4–2    Quad Primes 2–4–2
(millions)         Mean   Variance      Mean   Variance      Mean   Variance
10.0  12.5        33.38    27.71       33.78    25.44        2.84     2.99
20.0  22.5        29.78    24.27       30.20    19.28        2.62     1.71
30.0  32.5        28.02    23.79       27.60    29.49        2.06     1.74
40.0  42.5        26.40    24.44       26.10    20.91        2.30     2.19
50.0  52.5        25.06    21.56       24.58    25.47        1.94     1.51
55.0  57.5        26.76    31.84       24.88    34.12        2.08     2.44
60.0  62.5        24.40    24.48       25.72    23.24        2.08     1.66
70.0  72.5        24.16    18.28       24.28    16.99        1.96     1.88
80.0  82.5        24.82    13.28       24.24    17.52        2.06     1.48
90.0  92.5        23.36    24.40       22.28    18.73        1.72     1.27
Average           26.6     23.4        26.4     23.1         2.17     1.89

Note: The computations are based on intervals of length 50,000 except for twin primes, where it is 100,000.
∗ The interval for twin primes is doubled, beginning at same lower bound, for example, 10.0 – 15.0.

Table 5
Distributions of Number of Quadruple Primes in Intervals of 50,000

Number of                               Frequency
Quadruple Primes      10,000,000       50,000,000       90,000,000
in Intervals              to               to               to
of 50,000             20,000,000       60,000,000      100,000,000
    0                     13               27               23
    1                     29               52               63
    2                     42               49               69
    3                     48               43               30
    4                     42               20                6
    5                     15                5                4
    6                      8                2                4
    7                      3                1                1
    8                    ...                1              ...
Total                    200              200              200

Mean                    2.84             2.06             1.83
Variance                2.47             2.11             1.65
Residual variance∗      2.43             2.06             1.63
f                       0.14             0                0.11

∗ Adjusted for linear change in mean over the 200 intervals.

9. Conclusion

When short intervals of the integers are broken into segments of equal lengths, the empirically observed numbers of primes in the segments are more nearly equal than a simple multinomial model would predict, for $x \le 100 \times 10^6$. We quantify this "pressure toward uniformity" by f, by extension of the idea that the factor 1 − f gives the reduction in variance for effects of sampling without replacement. If f = 0, the sampling variability corresponds to that of an urn model or the Poisson; when f = 1, the variability is zero and we get uniform spacing within the interval. For primes up to about $90 \times 10^3$, f = 0.74, and moving up to $90 \times 10^6$, f = 0.66. We see that f is decreasing slowly at best. Similar calculations for twin primes between $10 \times 10^6$ and $90 \times 10^6$ give an average f = 0.28. Triple primes in this interval give an average f = 0.12, and quadruple primes give f = 0.08.

References

Gruenberger, F., and G. Armerding. Oct. 1961. Statistics on the first six million prime numbers. The RAND Corporation, P-2460.
Lehmer, D.N. 1956. List of prime numbers from 1 to 10,006,721. New York: Hafner.
Snedecor, George W. 1938. Statistical methods, pp. 362–378. Ames: Iowa State College Press.
——, 1946. Statistical methods, 4th ed., pp. 214–216. Ames: Iowa State College Press.
Snedecor, George W., and William G. Cochran, 1967. Statistical methods, 6th ed., sect. 9.3, pp. 231–233. Ames: Iowa State College Press.

Reprinted from The University of Chicago Law Review (1974), 41, pp. 242–253

25. A Conversation About Collins

William B. Fairley and Frederick Mosteller
Associate Professor, Public Policy Program, Kennedy School of Government, Harvard University and Professor, Department of Statistics, Harvard University

People who wish to apply probability, statistics, and mathematics to legal work find careful analyses of the facts in specific cases rewarding. People v. Collins¹ is a particularly good case for such study, because the prosecution used a probabilistic argument to try to establish an identification. Unfortunately, students tend to point out an initial difficulty in the prosecutor's argument and then dismiss the case. As the court recognized, however, the issues involved are deeper and deserve far more attention.

Professor Hans Zeisel is famed for his ability to give clear explanations of very difficult statistical ideas. In his honor, we shall try to emulate him by discussing some of the issues raised by this case in a dialogue between a lawyer and a statistician. Neither of the participants presumes the other to have a specialist's knowledge of his field.²

The dialogue is intended to discuss three points. First, it considers the question of whether dependent or independent probabilities were the basis for the prosecutor's argument in Collins and offers an interpretation more favorable to the prosecution than that usually given. Next, it suggests that data could have been assembled to make a reasonable estimate of the order of magnitude of the probability. The prosecution, however, did not provide any such evidence, and the court considered the consequences of accepting the prosecution's unsupported numbers. The court, in an appendix to its opinion, developed a probabilistic model intended to show that the identification was weak even if the prosecution's numbers were accepted. The court apparently concluded that if a random item sampled from a large population belongs to a rare type, then the probability of at least one more item of that type in the population is about 0.41. The third goal of this dialogue is to consider the validity of the model that gave rise to this number.

The conclusion reached in Collins deserves to be reviewed with some care. More than one valid model, and more than one interpretation, is possible; our treatment is intended to provoke further discussion of these matters.

The authors are indebted to Lloyd Weinreb for numerous helpful comments. We also wish to acknowledge helpful comments from Jack Appleman, Michael Brown, Byron Burnett, Stephen Fienberg, Michael Finkelstein, Richard Hill, David Hoaglin, Nan Hughes, Charles Kagay, Don Karl, Joel Kleinman, Gilbert Kujovich, Alan Lederman, David Oakes, James Reeds, Karen Reeds, and Bernard Rosner. The work was partially aided by a National Science Foundation Grant (GS 32327X1).
¹ 68 Cal. 2d 319, 438 P.2d 33, 66 Cal. Rptr. 497 (1968).
² We call upon specialist knowledge in only one or two footnotes, and these notes may be passed over. For a discussion by a lawyer of Collins that presents an extensive exposition of elementary probability theory, see Cullison, Identification by Probabilities and Trial by Arithmetic (A Lesson for Beginners in How to be Wrong with Greater Precision), 6 Houston L. Rev. 471 (1969).

I. Prefatory Note on People vs. Collins

Janet and Malcolm Collins were convicted in a jury trial in Los Angeles of second-degree robbery. Malcolm appealed his conviction to the Supreme Court of California, and the conviction was reversed. The court described the events of the robbery as follows:³

On June 18, 1964, about 11:30 a.m. Mrs. Juanita Brooks, who had been shopping, was walking home along an alley in the San Pedro area of the City of Los Angeles. She was pulling behind her a wicker basket carryall containing groceries and had her purse on top of the packages. She was using a cane. As she stooped down to pick up an empty carton, she was suddenly pushed to the ground by a person whom she neither saw nor heard approach. She was stunned by the fall and felt some pain. She managed to look up and saw a young woman running from the scene. According to Mrs. Brooks the latter appeared to weigh about 145 pounds, was wearing "something dark," and had hair "between a dark blond and a light blond," but lighter than the color of defendant Janet Collins' hair as it appeared at trial. Immediately after the incident, Mrs. Brooks discovered that her purse, containing between $35 and $40, was missing.
About the same time as the robbery, John Bass, who lived on the street at the end of the alley, was in front of his house watering his lawn. His attention was attracted by “a lot of crying and screaming” coming from the alley. As he looked in that direction, he saw a woman run out of the alley and enter a yellow automobile parked across the street from him. He was unable to give the make of the car. The car started off immediately and pulled wide around another parked vehicle so that in the narrow street it passed within six feet of Bass. The latter then saw that it was being driven by a male Negro, wearing a mustache and beard. At the trial Bass identified defendant as the driver of the yellow automobile. However, an attempt was made to impeach his identification by his admission that at the preliminary 3

68 Cal. 2d at 321, 438 P.2d at 34, 66 Cal. Rptr. at 498.

25 A Conversation About Collins


hearing he testified to an uncertain identification at the police lineup shortly after the attack on Mrs. Brooks, when defendant was beardless. In his testimony Bass described the woman who ran from the alley as a Caucasian, slightly over five feet tall, of ordinary build, with her hair in a dark blond ponytail, and wearing dark clothing. He further testified that her ponytail was “just like” one which Janet had in a police photograph taken on June 22, 1964. At the trial, following testimony by a college mathematics instructor, the prosecutor introduced a table in which he hypothesized the following probabilities of occurrence of the reported characteristics of the two people involved in the crime:4

Characteristic                     Individual Probability
A. Partly yellow automobile        1/10
B. Man with mustache               1/4
C. Girl with ponytail              1/10
D. Girl with blond hair            1/3
E. Negro man with beard            1/10
F. Interracial couple in car       1/1000

The prosecutor multiplied these individual probabilities and arrived at a figure of 1/12,000,000, which represented the probability “that any couple possessed the distinctive characteristics of the defendants.”5 In his summation, the prosecutor repeated this analysis and emphasized the extreme unlikelihood, “somewhat like one in a billion,” that a couple other than the defendants had all these characteristics.

II. A Conversation About Collins

Lawyer: The California Supreme Court made a lot out of the prosecutor’s unproven assumption that each of the six characteristics was “statistically independent.” Will you explain that to me? Statistician: Statistical independence between X and Y means that it is neither more nor less likely that Y happens whether or not X happens. For example, in drawing a card from an ordinary pack of playing cards, let X mean the card is an ace and Y mean the card is black. The probability that the card drawn is black (rather than red) is 1/2, whether or not the card is an ace.
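The card example and the prosecutor's multiplication can both be checked by direct enumeration. The sketch below is only an illustration: the deck encoding and the names `prob`, `is_ace`, `is_black`, and `table` are assumptions introduced here, and the six fractions are simply the prosecutor's hypothesized figures, not verified frequencies.

```python
from fractions import Fraction
from itertools import product

# A standard 52-card deck as (rank, suit) pairs (encoding chosen for this sketch).
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "clubs", "hearts", "diamonds"]
deck = list(product(ranks, suits))


def prob(event):
    """Exact probability that one card drawn uniformly at random satisfies `event`."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def is_ace(card):
    return card[0] == "A"


def is_black(card):
    return card[1] in ("spades", "clubs")


p_ace = prob(is_ace)
p_black = prob(is_black)
p_both = prob(lambda card: is_ace(card) and is_black(card))

# Independence: P(black | ace) equals P(black), so P(ace and black) = P(ace) * P(black).
p_black_given_ace = p_both / p_ace
independent = p_both == p_ace * p_black

# The prosecutor's table, multiplied as if the six characteristics were independent.
table = {
    "A. Partly yellow automobile": Fraction(1, 10),
    "B. Man with mustache": Fraction(1, 4),
    "C. Girl with ponytail": Fraction(1, 10),
    "D. Girl with blond hair": Fraction(1, 3),
    "E. Negro man with beard": Fraction(1, 10),
    "F. Interracial couple in car": Fraction(1, 1000),
}
naive_product = Fraction(1)
for p in table.values():
    naive_product *= p

print(p_black_given_ace, p_black, independent)  # 1/2 1/2 True
print(naive_product)                            # 1/12000000
```

Exact fractions are used so that the equality test for independence is not blurred by floating-point rounding.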
L: Then the prosecutor’s assumption was not merely unproven, it was almost certainly wrong. For example, it must be much more likely that a man has a mustache if he has a beard than if he does not. Beards and mustaches tend to run together. And, if you have a “girl with blond hair” and a “Negro man with beard,” the chance that you have an “interracial couple” must be close to 1000 times greater than the estimate of 1/1000. Right?

4 Id. at 325-26 n. 10, 438 P.2d at 37, 66 Cal. Rptr. at 501.
5 Id. at 325, 438 P.2d at 37, 66 Cal. Rptr. at 501.


S: Yes, everyone notices these dependencies. L: I don’t believe that the prosecutor could have blundered that badly. S: The court seems to have thought that he did. The 1/12,000,000 figure was certainly consistent with treating each of the characteristics as statistically independent and, according to the product rule, multiplying the probabilities of each to get the probability of all together. L: What is the product rule? S: You use it yourself all the time. What are the odds of getting heads if you toss a coin? L: Fifty-fifty. S: Yes, that makes the probability one in two or one-half. And if you toss it again? L: Same. S: Right. What is your chance of getting heads both times? L: One in four. S: Right; one-half times one-half. That’s the product rule. The prosecutor may have arrived at 1/12,000,000 by multiplying all of the individual probabilities of each characteristic occurring alone. L: But if there is dependence that wouldn’t be right. S: Yes, but I will explain how the individual probabilities used in the case are also consistent with an interpretation that is more favorable for the prosecutor. L: What is that? S: He may have been employing the product rule for dependent events. If we work backward in the table from (F) to (A) and appropriately define the characteristics, this interpretation is at least plausible. For (F), at the time of the trial, 1 in 1000 might have been an accurate frequency for interracial couples who sometimes drive cars. For (E), among interracial couples who sometimes drive, perhaps 1 in 10 had a Negro man with a beard. L: In 1964 beards were much rarer than in 1974. S: For (D), among couples having both of these characteristics, one can argue that 1 in 3 has a woman with blond hair. We could go on, interpreting each of the individual probabilities as a conditional probability, that is, given all of the preceding characteristics. 
If the individual probabilities were constructed in this manner, and if they were numerically correct, then their product would give the probability of the joint occurrence of the separate events or characteristics—those no longer regarded as independent. Of course I don’t know if that’s what the prosecutor had in mind, but it’s a lot more plausible than the assumption of independent events. L: But the resulting figure is the same either way.6 6

Charles Kingston has made an illustrative calculation of the probability of the joint occurrence of characteristics (A) through (F) that gives a figure close to 1/12,000,000. He made explicit assumptions about dependencies among the characteristics and, in addition, used a revised list of characteristics that removes the


S: Yes. L: Will you explain that? S: Ordinarily we would have two different sets of probabilities for the characteristics, one representing the unconditional probabilities of events and the other representing the conditional probabilities for the characteristics in the order we choose for computation. If the probabilities in the table are the approximately correct conditional probabilities, then they are not likely to be the correct unconditional probabilities. For example, consider (E), “Negro man with beard.” The population proportion of Negroes would be 10 or 12 percent, and the proportion of Negro males would be 5 or 6 percent; the proportion of Negro males with beards would be smaller still, say 2 percent. For (E) in 1964, the unconditional probability might therefore be about 1/50, rather than 1 /10. If we change all of the probabilities to unconditional probabilities in this manner, then their product will not—except by a fluke—equal 1/12,000,000. L: In this case, if you used unconditional probabilities as if there were independence, that would be a mistake, wouldn’t it? S: Yes; but if the separate probabilities given in the table are conditional probabilities, then the product rule for dependent probabilities would have been correctly used if we multiplied them together as we described.7 L: So the prosecutor may have been basically correct in getting an order of magnitude estimate for the frequency of couples with the characteristics of the robbers. S: Possibly. L: Still, even if the prosecutor intended to use the product rule for dependent events, he had not verified the individual probabilities he quoted. ambiguities and overlaps in the original list. Kingston, Probability and Legal Proceedings, 57 J. Crim. L.C. & P.S. 93 (1966). 7 A simple example may clarify things. Suppose we have two young men and three old men; suppose also that one of the old men is a plumber and none of the other men are. 
The probability that a man drawn at random is a plumber is 1/5; the probability that a man drawn at random is old is 3/5. The assumption of independence between being older and being a plumber would give us the probability of selecting an older man who is a plumber as 3/5 times 1/5, or 3/25, which is clearly wrong. A correct attack would be to note that the probability of drawing an older man from this group is 3/5. Given that a man is old, the probability of that person also being a plumber is 1/3. We multiply 3/5 times 1/3 to get the correct probability, 1/5, of drawing an old man who is a plumber. This last approach is an application of the product rule for dependent events. More generally, if P(A) is the probability of event A occurring, and P(B|A) is the probability that event B occurs given that A occurs, then the probability of both A and B occurring is given by the product: P(A and B) = P(A)[P(B|A)].
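The plumber example in note 7 is small enough to enumerate. The following sketch lists the five men and compares the mistaken independence product with the product rule for dependent events; the dictionary layout and the helper name `prob` are illustrative assumptions only.

```python
from fractions import Fraction

# The five men of note 7: two young, three old, and exactly one old plumber.
men = [
    {"old": False, "plumber": False},
    {"old": False, "plumber": False},
    {"old": True, "plumber": True},
    {"old": True, "plumber": False},
    {"old": True, "plumber": False},
]


def prob(predicate, population):
    """Exact probability that a man drawn at random from `population` satisfies `predicate`."""
    return Fraction(sum(1 for man in population if predicate(man)), len(population))


p_old = prob(lambda man: man["old"], men)          # P(A)
p_plumber = prob(lambda man: man["plumber"], men)  # P(B)

# Mistaken calculation: treat "old" and "plumber" as independent.
wrong = p_old * p_plumber

# Product rule for dependent events: P(A and B) = P(A) * P(B | A).
old_men = [man for man in men if man["old"]]
p_plumber_given_old = prob(lambda man: man["plumber"], old_men)  # P(B | A)
right = p_old * p_plumber_given_old

# Direct count of old plumbers, as a check on the product rule.
direct = prob(lambda man: man["old"] and man["plumber"], men)

print(wrong, right, direct)  # 3/25 1/5 1/5
```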


S: But he could have taken some small steps in this direction. Suppose he had a small sample survey; this technique is sometimes used, though not always accepted by courts.8 He could then have estimated the proportion of white girls with ponytails, the proportion of yellow cars, the proportion of black men with beards, and so on. L: Since these numbers would have been just estimates, would the court have accepted them? S: I don’t know; that’s your field, not mine.9 But it would have at least had a chance to consider whether the method of gathering the data was adequate for the purpose. Remember that this analysis was an order of magnitude argument, and the court might have been sympathetic to a carefully made one. The court certainly suggested this attitude by developing its own mathematical model. Indeed, part of the court’s complaint was that there was no effort by the prosecution to connect the numbers with the real world. L: You mean he might just go out and count numbers of people of different types passing corners near the scene of the robbery at about 11:30 in the morning? Or ask the Registrar of Motor Vehicles to tell him what fraction of cars were yellow? S: Yes, at a minimum. He might do better than that, but let’s not get very far into the possible designs of his sample survey. L: How could he handle the dependence problem? S: By looking at pairs and triples of characteristics in his sample. For example, he might have counted 1,000 cars in the relevant neighborhood at the relevant time. He might have seen no cars with a couple like the one described, and he might have seen a few having as many as three of the required characteristics. Then he might be able to assess the rarity of the separate combinations of characteristics. Although it might be hard to determine a single collective probability, he would be able to define ranges of possible answers. The prosecution would then have a solidly based estimate of the rarity of the couple’s characteristics.
There are special statistical tools for handling probabilities of several characteristics that are not independent.10 It would be an uncertain estimate, for example, fixing the probability between 1/10,000 and 1/100,000. L: Would he be estimating conditional probabilities or unconditional ones? S: He would estimate the probability that all of the relevant characteristics are found in a randomly drawn couple. That probability is the figure that all of these calculations are intended to estimate.

8 Cf. Hans Zeisel, Statistics as Legal Evidence, 15 International Encyclopedia of the Social Sciences 246 (1968); Zeisel & Diamond, “Convincing Empirical Evidence” on the Six-Member Jury, 41 U. Chi. L. Rev. 281, 293 n. 52 (1974).
9 See Zeisel, The Uniqueness of Survey Evidence, 45 Cornell L. Q. 322 (1960). See also Sprowls, The Admissibility of Sample Data into a Court of Law: A Case History, 4 U.C.L.A. L. Rev. 222 (1957).
10 Cf. Y. Bishop, S. Fienberg, & P. Holland, Discrete Multivariate Analysis (MIT Press, in press).
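One way to make "ranges of possible answers" concrete, which the conversation itself does not spell out, is a one-sided binomial confidence bound for the case where a survey finds no matching couple at all. The sketch below is a standard textbook calculation rather than the authors' method: it assumes the 1,000 cars behave like independent random draws, and the function name `zero_count_upper_bound` is hypothetical.

```python
def zero_count_upper_bound(n, confidence=0.95):
    """Upper confidence bound on a proportion when 0 of n sampled items have the trait.

    Solves (1 - p) ** n = 1 - confidence for p: any larger p would make
    seeing zero matches in n independent draws less likely than 1 - confidence.
    """
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


# Survey of 1,000 cars with no couple like the one described (the text's example).
bound = zero_count_upper_bound(1000)
rule_of_three = 3 / 1000  # common shortcut approximation to the 95% bound

print(f"95% upper bound on the frequency: about 1 in {1 / bound:.0f}")
print(f"rule-of-three approximation:      about 1 in {1 / rule_of_three:.0f}")
```

On these assumptions a sample of 1,000 can only show that such couples are rarer than roughly 1 in 330, which suggests why pairs and triples of characteristics, or a much larger sample, would be needed to support figures like 1/10,000.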


L: I’m intrigued by the court’s claim that “the prosecution’s figures actually imply a likelihood of over 40 percent that the Collinses could be ‘duplicated’ by at least one other couple who might equally have committed the San Pedro robbery.”11 What bothers me, I suppose, is the idea that a mathematical appendix could prove something about the likelihood of the existence of couples of a certain type. That’s an empirical question. I’d have thought that a sample survey or a census would be the only way to find out how many couples sharing a given set of characteristics were in the area of the robbery. S: It certainly is an empirical question, and we have discussed in principle how the data needed to answer the question might be gathered. The court, however, was trying to show what could be said without any data. They said, in effect, “Let’s assume with the prosecution that the probability that a random couple would share the robbers’ characteristics is 1 in 12,000,000.” L: And then the court imagines randomly selecting some number of couples from all of the couples who could have committed the crime, one of whom (the defendants) was observed by the police to have the listed characteristics. S: Yes. Let’s call this number N. If we independently select N couples, each with a probability of 1 in 12,000,000 of sharing the robbers’ characteristics, we can then calculate the probabilities of selecting none or one or more couples with these characteristics. If we put all those probabilities together, we get what’s called the probability distribution of the number of such couples among the N couples. L: How do you find these probabilities? S: That calculation is made within probability theory; that is, these probabilities can be computed from a formula using the assumptions the court made. But we can carry out a simulation of the selection process that would illustrate these probabilities. L: How?
S: We would randomly select N couples from the general population and record the number of couples selected that have the unusual characteristics. We would perform this selection process a number of times—perhaps a thousand—and the relative frequency of times we found no couples, one couple, two couples, and so on, would approximate the respective probabilities of finding, among N couples, no couples, one couple, two couples, and so on, with the unusual characteristics. L: The court discusses conditional probabilities of different numbers of such couples. What are those? S: The court says that the defendants are one couple with the characteristics of the robbers. According to the court, we should therefore be interested in probabilities that are conditional on that fact. That is, instead of asking for the chances of none or one or more, ask: if one, what are the chances of 11

68 Cal. 2d at 331, 438 P.2d at 40, 66 Cal. Rptr. at 504 (emphasis in original).


more than one? What is the probability of at least one more couple having the robbers’ characteristics, given that there is one?12 L: I suppose that you could also simulate that probability. S: We could go back to our earlier simulation and pick out all those selections of N couples having at least one couple with the required characteristics. We would then group these selections into those in which exactly one couple with the characteristics was selected and those in which more than one were selected. Of course, we ordinarily use a formula to make the calculation—as the court did. L: So the court was saying that, in our case, if we collect all the selections of N in which at least one couple like the robbers was chosen, 41 percent of them will have more than one such couple. Is that right? S: Yes. You understand the court perfectly. But in fact the court was wrong. A crucial point at which the appendix is mistaken is that the conditional probability of there being more than one couple like the robbers, given that there was at least one, is 0.41 only if N is about 12,000,000.13 And the 12

The issue stressed here is whether there is more than one couple. One could note that, even when there is more than one couple, each couple has a chance of being the couple of interest. The total probability that the defendants are the couple in question would then be a sum. The sum is composed of the probability that there was exactly one couple like the robbers plus part of the probability that there are exactly two, and so on. We shall not go into this aspect of the problem here. For a discussion of the probability of duplication and the probability of identity, see Kingston, Applications of Probability Theory in Criminalistics, 60 J. Am. Stat. Ass’n, 70 (1965). See also Cullison, supra note 2.
13 The court formulated a model leading it to compute a binomial probability where the number of trials (couples) is N and the probability of a success (picking a robberlike couple) on any one trial is p. The court gives the well-known general formula for solving this problem, which we do not reproduce here. See 68 Cal. 2d at 334, 438 P.2d at 42, 66 Cal. Rptr. at 507. In the general formula p need not be 1/N, but the court used p as 1/12,000,000 to follow up the prosecutor’s argument. The expected number of successes in N trials is Np. Because N is large and p is small, Poisson probabilities closely approximate the binomial probabilities, where Np is the mean of the approximating Poisson distribution. If we denote Np by m, the conditional probability sought is

\[
\frac{\text{probability of more than one}}{\text{probability of one or more}}
= 1 - \frac{\text{probability of exactly one}}{\text{probability of one or more}}
= 1 - \frac{m e^{-m}}{1 - e^{-m}}
= 1 - \frac{m}{e^{m} - 1}.
\]


court itself recognized that we don’t know what number N really is.14 L: How did the court go wrong? S: I don’t know. To reach its conclusion the court had to believe that N was 12,000,000. And it says it didn’t believe that. L: Doesn’t the 1 in 12,000,000 figure that the court took from the prosecutor imply a selection of at least 12,000,000 couples? S: No. Suppose we flip a coin ten times. There are 1,024 possible outcomes; if heads and tails have equal chances of coming up on each toss, then each possible outcome has an equal chance, 1/1,024, of occurring. In this instance the probability comes from the product rule, which is applicable because each flip of the coin is an independent event. We might have only one set of 10 tosses yielding only one outcome, yet we have a probability for that outcome of 1 in 1,024. So there is no need to have anything like a thousand flips. L: So there’s no special case for N being 12,000,000 just because the probability of each selection is 1/12,000,000. Suppose we try a small N , perhaps because we believe the robbers must have come from a small area in San Pedro near the scene of the robbery. Try 250,000 instead of 12,000,000. S: O.K. We can get the answer to this question from the general formula given by the court that shows how the conditional probability varies with N .15 When N is 250,000, assuming that the probability of any couple having the robbers’ characteristics is still 1/12,000,000, the probability of finding more than one couple if there is at least one is about 0.01.16 L: That’s about one-fortieth of the court’s figure. S: Yes. L: No wonder we have that quote from Disraeli about liars, damn liars, and statisticians. S: Actually, the original saying referred to expert witnesses, not statisticians.17 The number e is the base of the natural logarithm and has an approximate value of 2.71828. 
Here $e^{-m}$ is the probability of zero couples sharing the characteristics of the robbers, $1 - e^{-m}$ the probability of one or more such couples, and $m e^{-m}$ the probability of exactly one. When m is small the required conditional probability is approximated by m/2. For m = 1, this approximation gives 0.5 instead of the more nearly exact value of 0.41. When p is taken as 1/N, then m = 1 and we get the court’s results. But p might not be 1/N and thus m need not be 1; it could be much smaller or much larger.
14 “N, which represents the total number of all couples who might conceivably have been at the scene of the San Pedro robbery, is not determinable . . .” Id. at 335, 438 P.2d at 43, 66 Cal. Rptr. at 507.
15 Id. at 334, 438 P.2d at 42, 66 Cal. Rptr. at 507.
16 Using m = Np = (250,000) × (1/12,000,000) = 1/48 ≈ 0.02, we get m/2 ≈ 0.01.
17 “There is probably no department of human inquiry in which the art of cooking statistics is unknown, and there are sceptics who have substituted ‘statistics’ for ‘expert witnesses’ in the well-known saying about classes of false statements. Miss Nightingale’s scheme for Uniform Hospital Statistics seems to require for its realization a more diffused passion for statistics and a greater delicacy of statistical conscience than a voluntary and competitive system of hospitals is likely to create.” Sir Edward Cook, 1 The Life of Florence Nightingale 433–434 (1913).

Your common sense is basically right about this aspect of the problem. A couple could have such rare properties that it wouldn’t be observed once in a month of Sundays. We all know this. Further, we know that the fact that there is one such couple does not imply there are more. There may be, and the court’s model is one way to think about it. But we should not get carried away with the details of the model or the implication of the court’s example. The court’s example is somewhat misleading. It suggests that when probabilities are small and we observe one rare object, there are quite likely two or more of them, in fact, 41 percent likely. L: Earlier, I chose a number for N that made it likely that the defendants were the robbers. Couldn’t the actual numbers have favored them even more than the court’s example? S: Certainly. If there are 250,000 couples and the probability that a given couple has the required characteristics is, for example, 1/25,000 instead of 1/12,000,000, then the odds are overwhelming18 that there is at least one more couple with those characteristics. L: So, if I understand you, the court’s argument (that the prosecution’s probability of 1/12,000,000 implied a probability of 0.41 for a second couple having the robbers’ characteristics) is true only if N equals 12,000,000, an assumption that the court rejected because N was, in its words, “not determinable.” S: That sums it up accurately. L: I’d like to clear up the meaning of these “selections” of N couples used by the court. I can understand the meaning of the probabilities we have discussed if I think of Los Angeles as a large goldfish bowl, which I am told it is, and imagine selecting N couples at random from the bowl and noting their characteristics. But how do we connect it all to Collins? S: I think of it a little differently. I think of Los Angeles as generating couples who go out into the San Pedro area, different sets of couples each day, or even each hour. We could find out roughly how many couples there are in the area during a short time period around the robbery, and that figure could be regarded as the number of couples. Some days there would be no couples of a given unusual type, some days one, and so on. If we used the court’s formula, we would estimate the probability of a couple having the robbers’ characteristics, and then apply the N that is appropriate for the place and time of day.19

18 Under this assumption, the odds are about 2,200 to 1. The simple approximation technique used above cannot be used here, because m, or Np, is not small. See notes 13 & 16 supra.
19 The Collinses were identified as having been in San Pedro at the time of the robbery, 68 Cal. 2d at 322, 438 P.2d at 34, 66 Cal. Rptr. at 498, so the question remains: given one, what is the probability of more than one?
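The figures quoted in the conversation and in notes 13, 16, and 18 can be reproduced from the conditional probability 1 − m/(e^m − 1), with m = Np. The sketch below computes it under both the binomial model and its Poisson approximation, and then imitates the simulation the statistician describes. All function names are illustrative assumptions, and the simulation draws the number of matching couples from the Poisson approximation instead of literally selecting N couples, which would be far slower.

```python
import math
import random


def cond_poisson(m):
    """P(more than one | at least one) for a Poisson count with mean m: 1 - m/(e^m - 1)."""
    return 1.0 - m / math.expm1(m)


def cond_binomial(n, p):
    """The same conditional probability under the exact binomial(n, p) model."""
    log_q = math.log1p(-p)                      # log(1 - p), accurate for tiny p
    p_none = math.exp(n * log_q)                # (1 - p)^n
    p_one = n * p * math.exp((n - 1) * log_q)   # n p (1 - p)^(n - 1)
    return 1.0 - p_one / (1.0 - p_none)


def sample_poisson(m, rng):
    """Draw one Poisson(m) variate by Knuth's multiplication method (fine for small m)."""
    limit = math.exp(-m)
    count, running = 0, rng.random()
    while running > limit:
        count += 1
        running *= rng.random()
    return count


def simulate(m, trials, seed=0):
    """Among simulated selections with at least one matching couple, the share with more."""
    rng = random.Random(seed)
    at_least_one = more_than_one = 0
    for _ in range(trials):
        k = sample_poisson(m, rng)
        at_least_one += k >= 1
        more_than_one += k >= 2
    return more_than_one / at_least_one


P = 1 / 12_000_000
court = cond_binomial(12_000_000, P)   # court's case, N = 12,000,000: about 0.418
small_n = cond_binomial(250_000, P)    # lawyer's N = 250,000: about 0.0104
common = cond_poisson(250_000 / 25_000)  # note 18's case, m = 10
odds_common = common / (1.0 - common)  # roughly 2,200 to 1
simulated = simulate(1.0, 200_000)     # Monte Carlo check of the court's case

print(f"N = 12,000,000: {court:.3f}; simulated {simulated:.3f}")
print(f"N = 250,000:    {small_n:.4f}")
print(f"m = 10:         odds about {odds_common:,.0f} to 1")
```

The exact value for m = 1 is slightly above the 0.41 quoted in the text; the comparison across the three cases shows how strongly the answer depends on the unknown N.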


L: Wouldn’t there be a tendency for a given region to have the same couples every day? S: Yes, and for that reason we might want to use a somewhat more complicated model than that presented by the court, but let’s save that for another time. L: In addition to the unknown N , I don’t know what to do with the 1 in 12,000,000 frequency figure, or even a 1 in 12,000 figure. In either case we are likely to find a substantial possibility of one or more other couples having the robbers’ characteristics. In the first case there may be only one other couple, whereas in the second there may be a thousand. But the logical point of our discussion is that, without any other evidence, the chance that there was only one couple like the robbers is now something between 100 to 1 in favor and 2,200 to 1 against. S: Yes, and that shows how important it is to get a fair estimate of the proportion of robber-like couples and the number of eligible couples. Before you heard this logical point, the unusual observed characteristics seemed telling, didn’t they? L: Yes, and that is what is puzzling me now. Reading all the facts, including the evidence in addition to that included in the probability model, I thought I saw a strong case against the defendants, but now I have doubts, and I can’t justify that earlier view or intuition. S: If we had time, I would discuss with you a way to think about the odds that takes into account the unusual characteristics and the other evidence in the case. I’d also like to discuss the implications of possible errors of observation by the eyewitnesses to the robbery.20 There is also the fact that the particular characteristics in this case were selected, and omitted, in a post hoc way, and that might have made a difference.21 Finally, we have suggested the need for somewhat more complicated models than those we have discussed. L: Perhaps these models should be discussed in another conversation. S: No doubt it will take more than one conversation. 
The Collins court, in its opinion and appendix, took a big step towards moving the ideas of probability into legal discussion in such cases. A first step into new territory is rarely the last.

20 These questions are taken up in Fairley, Probabilistic Analysis of Identification Evidence, 2 J. Legal Studies (1973). See also Finkelstein & Fairley, A Bayesian Approach to Identification Evidence, 83 Harv. L. Rev. 489 (1970); Tribe, Trial by Mathematics: Precision and Ritual in the Legal Process, 84 Harv. L. Rev. 1329 (1971); Finkelstein & Fairley, A Comment on “Trial by Mathematics,” 84 Harv. L. Rev. 1801 (1971); Tribe, A Further Critique of Mathematical Proof, 84 Harv. L. Rev. 1810 (1971).
21 Cullison remarks that “the prosecutor’s descriptions seem to do a better job of describing the defendants than of describing what was fairly established about the robbers.” Cullison, supra note 2, at 503.

Reprinted from Science (1977), 198, pp. 684–689

26. Statistics and Ethics in Surgery and Anesthesia
John P. Gilbert, Bucknam McPeek, and Frederick Mosteller1
Harvard University

Ethical issues raised by human experimentation, especially in medicine, have been of increasing concern in the last half of the 20th century. Except for issues of consent and capacity to consent, ethical concerns raised by controlled trials center about the fact that individuals are being subjected, randomly, to different treatments. Two arguments are raised, and in each the patients are seen to be the losers. The first argument is an expression of the fear that the trial, by withholding a favorable new therapy, imposes a sacrifice on the part of some of the patients (the control group). The second argument raises the opposite concern that, by getting an untested new therapy, some patients (those in the experimental group) are exposed to additional risk. To a large extent, both arguments imply that investigators know in advance which is the favorable treatment. Some empirical evidence on these issues can be obtained by examining how potential new therapies are evaluated and what the findings are. How often do new therapies turn out to be superior when they are tested, and how much better or worse is a new therapy likely to be than the standard treatment? We have investigated such questions for surgery and anesthesia.

The Sample of Papers

For an objective sample we turned to the National Library of Medicine’s MEDLARS (Medical Literature Analysis and Retrieval System). For almost

1 J. P. Gilbert is staff statistician at the Office of Information Technology, Harvard University, Cambridge, Massachusetts 02138 and assistant in biostatistics in the Department of Anesthesia, Massachusetts General Hospital, Boston 02114. B. McPeek is anesthetist to the Massachusetts General Hospital and assistant professor of anesthesia, Harvard University. F. Mosteller is professor of mathematical statistics in the Department of Statistics and chairman of the Department of Biostatistics, School of Public Health, Harvard University, 7th Floor, 677 Huntington Avenue, Boston, Massachusetts 02115.


15 years, this computerized bibliographic service has provided exhaustive coverage of the world’s biomedical literature. Articles are classified under about 12,000 headings, and computer-assisted bibliographies are prepared by crosstabulating all references appearing under one or more index subjects. For example, all articles indexed under prostatic neoplasms, prostatectomy, and postoperative complications might be sought. We obtained our sample from the MEDLARS system by searching for prospective studies and a variety of surgical operations and anesthetic agents, such as cholecystectomy, hysterectomy, appendectomy, and halothane (1). The papers appeared from 1964 through 1973. We found 46 papers that satisfied our four criteria: The study must include (i) a randomized trial with human subjects, (ii) with at least ten people in each group, (iii) it must compare surgical or anesthetic treatments, and (iv) the paper had to be written in English because of our own language limitations. All the papers we found, by the MEDLARS search, that met these criteria are included in the sample. Although this sample is neither a strictly random sample nor a complete census of the literature of the period covered, the method does largely exclude personal biases in selection. These papers evaluated two types of therapy. One type is designed to cure the patient’s primary disease. An example is the trial of radiation therapy in addition to surgery for the treatment of cancer of the lung (2). The second type of therapy is used to prevent or decrease the rate of an undesirable side effect of the primary therapy. Examples are the various trials of anticoagulants to decrease the incidence of thromboembolism after operations on the hip. Because we felt that these two types of therapies might differ in the distributions of improvements we wished to study, as indeed they seemed to, we have recorded them separately using the terms primary and secondary therapies, respectively. 
While each of our sample papers has provided important information concerning the treatment of a specific disease or condition, prognosis, complications, the natural history of disease, and the like, we have concerned ourselves only with the comparison of effectiveness between competing therapies. We have classified each therapy either as an innovation or as the standard treatment, to which the innovation was being compared. Although this distinction is usually clear, in a few instances some readers might disagree with our decisions. We took the position of the investigators, who usually indicated which therapies they regarded as the standards for comparison. Some papers report trials where several innovations were tested against one standard, or one innovation was sometimes tested against several standards, or the comparison was made for several distinct types of patients. To prevent one paper from having an undue effect on the total picture, no more than two comparisons were taken from any one paper, the choice being based on the importance of the comparisons for the surgery. When two comparisons were used, each was weighted one-half. When several papers reported the same investigation, we used the most recent one.
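The weighting rule just described (at most two comparisons per paper, each weighted one-half when two are used) amounts to giving every paper a total weight of one. A minimal sketch follows; the paper names and outcome labels are made up for illustration and are not taken from the 46 papers in the sample, and `weighted_tally` is a hypothetical helper.

```python
from collections import defaultdict
from fractions import Fraction

# Made-up entries for illustration only; NOT data from the study.
papers = {
    "paper_a": ["innovation preferred"],
    "paper_b": ["about the same", "innovation preferred"],
    "paper_c": ["standard preferred", "standard preferred"],
}


def weighted_tally(papers):
    """Tally outcomes so that each paper contributes a total weight of exactly one."""
    totals = defaultdict(Fraction)
    for name, outcomes in papers.items():
        if not 1 <= len(outcomes) <= 2:
            raise ValueError(f"{name}: expected one or two comparisons per paper")
        weight = Fraction(1, len(outcomes))  # 1 for a single comparison, 1/2 for two
        for outcome in outcomes:
            totals[outcome] += weight
    return dict(totals)


tally = weighted_tally(papers)
for outcome, weight in sorted(tally.items()):
    print(f"{outcome}: {weight}")
```

Because the weights sum to the number of papers, percentages computed from such a tally describe papers, not individual comparisons, which is the stated purpose of the rule.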

26 Statistics and Ethics in Surgery and Anesthesia

459

Comparisons of Innovations and Standards

To give a rough qualitative idea of how the innovations (I) compared with the standards (S), we have classified the outcomes by “highly preferred to” (>>), “preferred to” (>), and “about the same as” (=) in Table 1. In the first set designated =, the innovation was regarded as a success because it did as well as the standard and did not have other disadvantages, such as high cost, dangerous side effects, or the requirement of extra skill or training in its administration. Thus, it offers the surgeon an extra therapy when the standard may have drawbacks. In the second set designated =, the investigators seemed indifferent to the equality; in the third set, the innovations were regarded as a disappointment because of undesirable features. The preferences reported reflect closely the views of the original investigators.

About 49 percent of the innovations were successful when compared to their matched standards, and 13 percent were highly preferred. Among pairs of primary therapies, the innovation was highly preferred in 5 percent, and among pairs of secondary therapies the innovation was highly preferred in 18 percent of the comparisons. Indeed, the totals of the two extreme categories were smaller in the primary comparisons than in the secondary: 10.5 percent as compared to 27 percent.

The overall impact of the data in Table 1 is to suggest that, when assessed by randomized clinical trials, innovations in surgery and anesthesia are successful about half the time. Since innovations brought to the stage of randomized trials are usually expected by the innovators to be sure winners, we see that in the surgery and anesthesia area the evidence is strong that the value of the innovation needs empirical checking.

Table 1. Qualitative comparisons between innovations (I) and standards (S) stratified by primary and secondary therapies. Where a paper had two comparisons, each was weighted one-half.
Preference                 Primary   Secondary   Total   Percent
I >> S                     1         5           6       13
I > S                      4         4½          8½      18
I = S (success)            2½        6           8½      18
I = S (indifferent)        1½        1           2½      5
I = S (disappointment)     6         5           11      23
S > I                      3         4           7       15
S >> I                     1         2½          3½      7
Total                      19        28          47∗     (99)

∗ One paper contributed to both the primary and the secondary column.
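The percentages quoted in the text can be checked directly from the weighted counts in Table 1. The following is a minimal sketch, not part of the original analysis; the variable names are illustrative, and exact fractions are used only so that the half-weights are carried without rounding error.

```python
from fractions import Fraction

# Weighted counts from Table 1; a paper with two comparisons gives 1/2 to each.
half = Fraction(1, 2)
categories = ["I >> S", "I > S", "I = S (success)", "I = S (indifferent)",
              "I = S (disappointment)", "S > I", "S >> I"]
primary = [1, 4, 2 + half, 1 + half, 6, 3, 1]
secondary = [5, 4 + half, 6, 1, 5, 4, 2 + half]

# Row totals and the rounded percentage column.
total = [p + s for p, s in zip(primary, secondary)]
grand_total = sum(total)
percent = [round(100 * t / grand_total) for t in total]

# Share of innovations counted as successes (first three categories).
success_percent = sum(percent[:3])

# Share of comparisons in the two extreme categories, by type of therapy.
extreme_primary = 100 * (primary[0] + primary[-1]) / sum(primary)
extreme_secondary = 100 * (secondary[0] + secondary[-1]) / sum(secondary)

for name, pct in zip(categories, percent):
    print(f"{name:<24} {pct:>3} percent")
print("successes:", success_percent, "percent")
print("extremes, primary:", round(float(extreme_primary), 1), "percent")
print("extremes, secondary:", round(float(extreme_secondary), 1), "percent")
```

Because each percentage is rounded separately, the column sums to 99 rather than 100, which is why the table shows the total in parentheses.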

460

John P. Gilbert, Bucknam McPeek, Frederick Mosteller

Quantitative Comparisons

In addition to the qualitative comparisons of Table 1, we want to compare the performance of the innovation more quantitatively with the standard. For those primary therapies where survival gives a suitable measure of performance, we examine the distribution of the difference in survival percentages (I minus S). For the secondary therapies, we compare the percentages of patients not getting a specific complication such as abdominal infection or thrombosis. (Where we have used two complications in one study, each has been weighted one-half, as in Table 1.)

If we merely take the observed differences, they are subject to variation over and above the true differences because of sampling error due to the finite samples used in the experiments. To adjust for these sampling errors, we use an empirical Bayes procedure, as described in the appendix. Efron and Morris (3) describe the general idea through an instructive sports example: If we observed the batting averages for their first 50 times at bat for 200 major league batters, we might find them ranging from 0.080 to 0.450, yet we know that major league averages for a season ordinarily run from about 0.200 to 0.350 these days. The excess spread comes from the sampling error based on only 50 times at bat rather than the season’s total experience. To adjust this, we can shrink the results toward the center of the distribution (roughly 0.275). How this is done is explained by Efron and Morris (3, 4) and more simply by them in (5); the explanation is given in detail for the present situation in (6).

After the shrinking is carried out, we can estimate the distribution of the true gains or losses associated with the innovation by methods discussed in the appendix and in (6). In Fig. 1, we give the estimated cumulative distribution for the true gains of secondary innovations. The graph suggests that about 80 percent of the innovations offer gains between -10 percent and +30 percent.
In about 24 percent of the studies, gains of at least 20 percent occur. In about 10 percent of the studies, gains of more than 30 percent occur. About 12 percent of the time, losses of more than 10 percent occur. The sharp dip just to the right of zero improvement in Fig. 1 could, in a replication, move a few percent to the left or right of its present position. We have to emphasize that the cumulative is based essentially on a sample of 24 papers (not all secondary papers in Table 1 could be used here); but each paper is worth rather less than one whole observation of the difference because of the sample sizes in the investigations. If the sample sizes were infinite, we would not have the shrinking problem, and each paper would provide a full observation. Gains or losses of modest size, such as 10 percent, while extremely valuable, are hard to detect on the basis of casual observation. We need careful experimentation and good records to identify such gains and losses. To get an idea of how hard it is to detect a difference of 10 percent, say that between 55 percent and 45 percent, it may help to know that two samples of size about 545 are required to be 95 percent sure of detecting the difference, by a one-sided test of significance at the 5 percent level. To be 50 percent sure requires samples of 136. Such large trials were rare in our samples.
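The sample sizes quoted above can be approximated with the usual normal-theory formula for comparing two proportions. The sketch below is an assumption about the kind of calculation involved, since the text does not state the formula used; the function name and the choice of a pooled variance at the average proportion are illustrative. Under these assumptions the 50 percent case reproduces the quoted 136, and the 95 percent case comes out a little below the quoted figure of about 545.

```python
import math
from statistics import NormalDist


def per_group_sample_size(p1, p2, alpha=0.05, power=0.95):
    """Approximate size of each of two equal groups needed by a one-sided
    test at level alpha to detect the difference p1 - p2 with the given
    power (normal approximation, variance pooled at the average proportion).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)   # critical value of the test
    z_power = std_normal.inv_cdf(power)       # 0 when power is one-half
    p_bar = (p1 + p2) / 2
    variance_of_difference = 2 * p_bar * (1 - p_bar)
    return (z_alpha + z_power) ** 2 * variance_of_difference / (p1 - p2) ** 2


n_95 = per_group_sample_size(0.55, 0.45, power=0.95)
n_50 = per_group_sample_size(0.55, 0.45, power=0.50)
print("95 percent sure:", math.ceil(n_95), "per group")
print("50 percent sure:", math.ceil(n_50), "per group")
```

The fourfold difference between the two answers follows from the formula: at 50 percent power the second normal quantile is zero, so the squared sum of quantiles is one quarter of its value when both quantiles are equal.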

Fig. 1. Secondary therapies: estimated cumulative distribution of true gains (reduction in percentage with a particular complication)
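A cumulative distribution of this kind can be computed along the lines given in the appendix. The sketch below follows the appendix formulas for the shrunken means and the averaging of normal probabilities, but two things in it are assumptions: the trial data are invented for illustration, and the estimates of M and A use a simple method-of-moments rule, whereas the authors' own estimates are described in (6) and are not reproduced here. The weighting of papers with two comparisons is also omitted.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

# HYPOTHETICAL trials, invented for illustration: proportion free of the
# complication and sample size, for the innovation and then the standard.
trials = [
    (0.90, 50, 0.70, 50),
    (0.80, 40, 0.82, 40),
    (0.95, 60, 0.85, 60),
    (0.60, 30, 0.75, 30),
    (0.88, 80, 0.86, 80),
    (0.70, 45, 0.40, 45),
]


def observed_gain_and_variance(p_innov, n_innov, p_std, n_std):
    """Return W (observed gain) and D (its binomial sampling variance)."""
    gain = p_innov - p_std
    var = p_innov * (1 - p_innov) / n_innov + p_std * (1 - p_std) / n_std
    return gain, var


def estimated_cdf(z, gains, variances):
    """Estimate P(Z < z) by averaging Phi(c_i) over the trials."""
    # ASSUMPTION: simple moment estimates of M and A, not the method of (6).
    m_star = mean(gains)
    a_star = variance(gains) - mean(variances)
    if a_star <= 0:
        raise ValueError("moment estimate of A is not positive")
    phi = NormalDist().cdf
    total = 0.0
    for w, d in zip(gains, variances):
        b = d / (a_star + d)                      # shrinking factor B_i
        z_star = m_star + (1 - b) * (w - m_star)  # shrunken mean Z_i*
        c = (z - z_star) / sqrt((1 - b) * d)      # standardized distance c_i
        total += phi(c)
    return total / len(gains)


pairs = [observed_gain_and_variance(*t) for t in trials]
gains = [w for w, _ in pairs]
variances = [d for _, d in pairs]

for z in (-0.10, 0.0, 0.10, 0.20, 0.30):
    p = estimated_cdf(z, gains, variances)
    print(f"P(true gain < {z:+.2f}) is estimated as {p:.2f}")
```

Trials with small samples have large D and are pulled strongly toward the overall mean, which is the sense in which each paper counts for less than one full observation.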

Nonrandomized Controlled Trials

In addition to the randomized clinical trials, 11 less well-controlled trials seemed appropriate for reporting. Results are shown in Table 2 in a manner similar to that used in the randomized trials. By and large, the distribution leans more favorably toward innovations than that seen in Table 1. A tendency for nonrandomized trials to favor innovations is frequently noted. Although speculation is easy, the reasons for this are unclear.

While in general a randomized trial provides stronger evidence than a corresponding nonrandomized trial, there are occasions where a nonrandomized trial may be convincing. A nonrandomized study of abdominal stab wounds seems especially instructive because it provides strong evidence favoring a new policy. The hospital’s standard policy had been to perform a laparotomy (surgical exploration of the abdominal cavity) on all patients with abdominal stab wounds. In 1967, the hospital instituted a change in policy, the results of which Nance and Cohn (7) report. The new policy demanded exploration only when the attending


surgeon judged it necessary. (A patient might be observed for a period and then explored.)

Table 2. Summary for controlled nonrandomized trials.

Preference                 Primary   Secondary   Total
I >> S                     2         3           5
I > S                      1         1           2
I = S (disappointment)     0         2           2
S > I                      0         1           1
S >> I                     0         1           1
Total                      3         8           11

The investigators give a record of (i) the substantial number of complications (25 percent) emerging from routine laparotomy when, in retrospect, the patient had not required surgical repair for the stab wound; (ii) the recovery without complications in the approximately 8 percent of patients who declined or otherwise passed by the former administrative rule of always performing a laparotomy; and (iii) evidence that delay before exploration under the old policy was not associated with an increase in the complication rate. These observations suggest that omitting the laparotomy for selected patients might be good practice. Some might have said, on the basis of the data presented in (i), (ii), and (iii), that the proposed new policy of judgmental surgical decisions would be clearly preferable to routine laparotomy. Nevertheless, such inductive leaps have often failed in other attractive circumstances, sometimes because the new policy loses some advantages that the old one had, or falls prey to the fresh problems that may arise when any policy is totally changed. Changing from set policy to the regular use of judgmental surgical decisions plus keeping records provided an inexpensive type of quasi-experiment. The method has a grave weakness because the time period is not common to the differently treated groups; and, therefore, causes other than the change in treatment may produce at least part of the observed differences. For the stab wounds, the need for a randomized clinical trial is not now compelling for the hospital partly because, in addition to the logic and data of (i), (ii), and (iii) above, the final quasi-experiment produced a large improvement. Although the percent requiring repair of the stab wound was about the same under the old and new policies (30 percent as compared to 28 percent), the overall complication rate dropped substantially from 27 to 12 percent. 
One fear would be that the unexplored group would produce a proportion of very severe complications. The evidence goes the other way. Among those not explored, the number with complications remained at zero even though the number not explored rose from 38 to 72 patients, and the percent explored fell from 92 to 40 percent. The average length of hospitalization over all patients dropped from 7.9 to 5.4 days. Had the effect been small, one might still be concerned whether possible biases and other changes could have given misleading results. All told, the evidence favoring the new policy seems persuasive for this hospital.

Comparisons of Degrees of Control

Although randomized clinical trials are not the only strong form of evidence about therapies in humans, weakly controlled investigations may not give the same results as better controlled ones. Chalmers and his colleagues have compared (8, 9) views of many investigators who had made studies of a single therapy, with respect to the degree of control used in each investigation. We give the results of one example of such collections of investigations (8). Table 3 shows the association between degree of enthusiasm and degree of control for the operation of portacaval shunt [slightly revised, by adding two cases, from Grace, Muench, and Chalmers (8)]. The counts in Table 3 are not of patients but of investigations.

Table 3. Degree of control versus degree of investigator enthusiasm for portacaval shunt operation in 53 studies with at least ten patients. The table is revised from Grace, Muench, and Chalmers (8), table 2, p. 685 (1966, Williams and Wilkins, Baltimore). Chalmers advised us of two additional studies to add to the well-controlled, moderate cell, raising the count from 1 to 3.

                      Degree of enthusiasm
Degree of control     Marked   Moderate   None   Total
Well controlled       0        3          3      6
Poorly controlled     10       3          2      15
Uncontrolled          24       7          1      32
Total                 34       13         6      53

Table 3 shows that, among the 53 investigations, only six were classified as “well-controlled.” Among the 34 associated with “marked enthusiasm,” none were rated by the investigators as “well-controlled.” The “poorly-controlled” and the “uncontrolled” investigations generated approximately the same distribution of enthusiasm: about 72 percent “marked,” 21 percent “moderate,” and 6 percent “none.” The six “well-controlled” investigations split 50-50 between enthusiasm levels “moderate” and “none.” Muench, who participated in collecting these data, has a set of statistical laws (10), one of which says essentially that nothing improves the performance of an innovation as much as the lack of controls. Because tables for other therapies have given similar results, one must be cautious in accepting results of weakly controlled investigations.
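The enthusiasm percentages quoted for Table 3 come from pooling the two weakly controlled rows. The following short sketch redoes that arithmetic from the counts in the table; the dictionary layout and the helper name are illustrative choices, not part of the original paper.

```python
# Counts of investigations from Table 3 (rows: degree of control).
table3 = {
    "Well controlled":   {"Marked": 0,  "Moderate": 3, "None": 3},
    "Poorly controlled": {"Marked": 10, "Moderate": 3, "None": 2},
    "Uncontrolled":      {"Marked": 24, "Moderate": 7, "None": 1},
}


def enthusiasm_percentages(row_names):
    """Pool the named rows and return each enthusiasm level as a rounded
    percentage of the pooled number of investigations."""
    pooled = {"Marked": 0, "Moderate": 0, "None": 0}
    for name in row_names:
        for level, count in table3[name].items():
            pooled[level] += count
    n = sum(pooled.values())
    return {level: round(100 * count / n) for level, count in pooled.items()}


weak = enthusiasm_percentages(["Poorly controlled", "Uncontrolled"])
well = enthusiasm_percentages(["Well controlled"])
n_studies = sum(sum(row.values()) for row in table3.values())

print("weakly controlled:", weak)
print("well controlled:  ", well)
print("investigations:   ", n_studies)
```

Pooling is reasonable here only because the two weakly controlled rows have similar distributions, as the text notes.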


In Table 3, the rows for “poorly controlled” and “uncontrolled” studies suggest that repeated, weakly controlled trials are likely to agree and build up an illusion of strong evidence because of the large count of favorable studies. Not only may this mislead us into adopting and maintaining an unproven therapy, but it may make proper studies more difficult to mount, as physicians become less and less inclined, for ethical reasons, to subject the issue to a carefully controlled trial lest the “benefits” of a seemingly proven useful therapy be withheld from some patients in the study.

Strengths of Belief

A controlled trial of innovative therapy may sometimes impose a sacrifice on the part of some patients by withholding the more favorable of a pair of treatments. However, prior to the trial we do not know which is the favorable therapy. Only after the trial can the winner be identified. Some will say that the physician must have an initial guess, however ill-founded. It is unlikely that his view of the two competing treatments is exactly 50-50. The question then arises: If the physician fails to act on such a preference, is the patient getting responsible care?

To help consider this question, let us review information obtained from experiments on incidental information. Alpert and Raiffa (11) have performed a number of experiments on guessing behavior. Individuals were asked to estimate quantities about which they might have been expected to have some incidental information, such as the fraction of students in their class having a particular characteristic. Subjects were graduate students in the Faculty of Arts and Sciences and in the Graduate School of Business Administration at Harvard University. In addition to the basic estimate, the graduate students were asked to provide numbers below which various subjective probabilities would lie.
If we think of the upper and lower 1 percent intervals as ones where a responder would be seriously surprised to find the true answer (that is to say, the responder felt 98 percent sure that the answer would lie between the chosen 1 percent and the 99 percent levels), then these responders were seriously surprised in 42.6 percent of the guesses or about 21 times as often as they should have been if the subjective estimates matched the true frequencies. Alpert and Raiffa’s work (11) shows that experienced adults are likely to overrate the preciseness of their estimates. These people were too sure of their information. Although these people were not physicians in a patient relation, they were well educated and engaged in thoughtful work. Until we get contrary information from more relevant studies, such data suggest that strong initial preferences for therapies yet to be tested by controlled trials should be viewed with reserve. And, of course, the distribution shown in Fig. 1 and the results of Table 1 also show that, for therapies tested in trials, holding a view not far from 50-50 has some empirical foundation for surgery. Shapiro (12) gives examples of wide variation among different physicians’ estimates of probabilities in therapeutic situations, data pertinent to this discussion, but not the same as the Alpert-Raiffa point. Shapiro shows that


physicians differ a great deal in their estimates; Alpert and Raiffa show that people are very frequently much further off than they expect to be.

Do We Owe the Past or Future?

Let us consider the question of whether a present patient should give up something for future patients. We, or our insurance carriers, pay the monetary cost of our care. What we do not pay for is the contribution to the medical system by past patients. These patients, through their suffering and participation in studies, have contributed through their illness and treatments to the present state of evidence for all patients. Such contributions cannot be purchased by money but can be repaid in part by making, when appropriate, a contribution to the same system. One good way is through participation in well-designed clinical trials when the patient falls into the limbo of medical knowledge. Other nonmonetary ways are donating blood and organs. So one may feel an obligation to the system of medicine that has reached its present state without his or her assistance, and in addition each person has an interest in its general improvement as we next explain. [For a recent treatment of this point see Almy (13).]

In some circumstances, participation in the trial may turn out to be of help to the patient. Aside from the luck of getting the best therapy of several that are offered, this occurs, for example, when the patient has a disease for which treatments can be readily changed after the trial. Nevertheless, there are circumstances when the treatment is not reversible and when the chances are that the specific trial will be of little individual benefit, that is, when it has but slight chance of being a benefit to the patient, his family, or friends. Under these circumstances, the patient may still be willing to participate in a trial.
If the trial is recognized as part of a general system of trials in which patients participate only on such occasions as they qualify and when a trial seems necessary, then the patient may well benefit in the future not so much from the results of the particular trial he or she participates in but from the system that gives rise to it. Findings will come forward on many other diseases and the patient, or someone dear to him, will be likely to suffer from some of those diseases whose trials will have produced useful findings. It is not so much, then, the direct payoff of this present trial that we should have our eye on, but pooled benefits of the whole system. The longer the patient lives, the more likely it is that he or she will suffer from some other of the diseases being studied by careful trials. And insofar as they are not studied by careful trials, the appropriate conclusions may be slow in coming. By putting off the day when strong evidence is obtained, we reduce the patient’s chances of benefiting most fully from modern medicine. Thus the patient has an interest not only in the trial he or she has the opportunity to engage in, but also a stake in a whole system that produces improved results that may well offer benefits in the future, if the patient survives the present difficulty. Thus, the social system will likely offer benefits through the larger system even when a


particular component of the system may fail to pay off directly for a patient, his family, friends, or some other social group he belongs to. A further statistical point that may not be much appreciated by potential participants in randomized trials is that the inferences apply primarily to the population sampled in the study. To the extent that individuals or groups decline to participate in studies, and to the extent that their responses may differ from those of the rest of the population (an interaction between participation and response to therapy), the treatments selected may not apply to them as well as to participants and people “like” them. For example, if those in the lower economic status were less likely to participate and if economic status related to the differential effectiveness of therapies, say, through additional lack of compliance, the study will not properly appreciate the value of the therapy for the nonparticipating group. The lone individual may seem to have little incentive to participate because one seems so few among many. But the stake is not in any one person appearing in this study; it is in having people from segments of the population that represent that individual being properly represented in this and other studies so that the results of the whole system may be more assuredly applied to this patient when disease strikes. The idea is similar to that of being told not to complain of the system when one does not vote. But the extra feature here is that one gets to vote on certain special occasions, and then only a few are admitted to the booth, and so each opportunity to vote weighs much more heavily than usual. If certain groups tend not to participate in the evaluative system, then they will not find medical evaluations of therapies as well pointed to their needs as if they did participate. Thus, each individual has a stake in wanting people like themselves represented. 
Since it is hard to say what “people like themselves” means, the good solution is to have the whole appropriate population volunteering in all the therapies tested. Participating presumably encourages others like me to participate too, and vice versa. The main point of this discussion is that if participation seems to the patient to be a sacrifice, it should be noted that others are making similar sacrifices in aid of the patient’s future illnesses. So even if the particular trial may not help the patient much, the whole system is being upgraded for his or her benefit. We have a special sort of statistical morality and exchange that needs appreciation.

Responsibility for Research

Much of current popular discussion of the ethical issue takes the position that physicians should use their best judgment in prescribing for a patient. To what extent the physician is responsible for the quality of the judgment is not much discussed, except to say that he must keep abreast of the times. Some physicians will feel an obligation to find out that goes beyond the mere holding of an opinion. Such physicians will feel a responsibility to contribute to research. In similar fashion, some current patients may feel a responsibility


to contribute to the better care of future patients. The current model of the passive patient and the active ongoing physician is not the most effective one for a society that not only wants cures rather than sympathy, but insists on them, a society that has been willing to pay both in patient cooperation and material resources for the necessary research.

Quality of Life

In addition to a society willing to support medical research through responsible experimentation on human beings, in addition to physicians dedicated to acquiring knowledge on behalf of the sick, we must be certain that controlled trials are designed to seek answers to the appropriate questions. In our survey, we found most concern with near-term outcomes, both mortality and morbidity. We need additional data about the quality of life of patients. Among our initial sample of 107 papers drawn through the MEDLARS search, quality of life seemed often to be a major consideration, although rarely did papers address more than a few features of that quality (14). Because much of medicine and surgery is intended to improve quality rather than to save life, measuring the improvement is important. As we have indicated above, different therapies frequently produce about the same mortality and morbidity, and so the ultimate quality of life achieved would bear heavily on the choice. Thus, for proper evaluation of alternatives, we need to assess the patient’s residual symptoms, state of restored health, feeling of well-being, limitations, new or restored capabilities, and responses to these advantages or disadvantages.

For surgery, we need long-term follow-up and both objective and subjective appraisals of the patient’s quality of life. Frequently, the long-term follow-up is carried out, but overall quality of life is rarely measured. For example, among 16 cancer papers in the initial sample of 107, follow-ups ranged from 2 months to 2 decades.
With few exceptions, survival and recurrence data were the principal information given, and because different treatments usually had similar rates, it would be fruitful to report contrasts among the treatments in the quality of life or death experienced by patients with the same disease but having different treatments. This might be especially appropriate because the therapies involved such features as castration, hormones, irradiation, chemotherapy, and various amounts of surgery. Developing and collecting suitable measures for quality of life after surgery requires leadership from surgeons and the cooperation of social scientists. We hope these developments will soon take place.

Summary

Approximately half the surgical innovations tested by randomized clinical trials provide improvements. For those where reduction in percent of complications was a useful measure, we estimate that about 24 percent of the


innovations gave at least a 20 percent reduction in complications. Unfortunately, about 12 percent of the innovations gave at least a 10 percent increase in complications. Therefore, keeping gains and discarding losses requires careful trials. Gains of these magnitudes are important but are hard to recognize on the basis of incidental observations. When well-controlled trials have not been used, sometimes data have piled up in a direction contrary to that later found by well-controlled trials. This not only impedes progress but may make carefully controlled trials harder to organize. Most of the trials we studied did not have large sample groups. To dependably identify gains of the magnitude we found in the discussion on surgery and anesthesia, trials must be designed carefully with sufficient statistical power (large enough sample sizes) and appropriate controls, such as may be provided by randomization and blindness. As Rutstein (15) suggests: It may be accepted as a maxim that a poorly or improperly designed study involving human subjects . . . is by definition unethical. Moreover, when a study is in itself scientifically invalid, all other ethical considerations become irrelevant. There is no point in obtaining “informed consent” to perform a useless study. When we think of the costs of randomized trials, we may mistakenly compare these costs with those of basic research. A more relevant comparison is with the losses that will be sustained by a process that is more likely to choose a less desirable therapy and continue to administer it for years. The cost of trials is part of the development cost of therapy. Sometimes costs of trials are inflated by large factors by including the costs of the therapies that would in any case have been delivered rather than the marginal cost of the management of the trial. 
This mistake is especially likely to be made when a trial is embedded in a large national program, and this is also the place where trials are highly valuable because their findings can be extended to a whole program. Surgical treatment frequently trades short-term risk and discomfort for an improved longer term quality of life. While long-term follow-up is frequently reported, a vigorous effort is needed to develop suitable measures of quality of life. Table 1 gives empirical evidence that, when surgical trials are carried out, the preferable treatments are not known in advance. Although a common situation in a trial would be that the innovation was expected to be a clear winner, the outcome is in grave doubt. Empirical evidence from nonmedical fields suggests that educated “guesses” even by experienced, intelligent adults are way off about half the time. For these reasons we discount the pretrial expectations or hunches of physicians and other investigators. Most innovations in surgery and anesthesia, when subjected to careful trial, show gains or losses close to zero when compared to standards, and the occasional marked gains are almost offset by clear losses. The experimental


group is neither much better nor much worse off than the control group in most trials, and we have little basis for selecting between them prior to the trial. The one sure loser in this system is a society whose patients and physicians fail to submit new therapies to careful, unbiased trial and thus fail to exploit the compounding effect over time of the systematic retention of gains and the avoidance of losses. Let us recall that our whole financial industry is based on a continuing return of a few percentage points.

All in all, the record in surgery and anesthesia is encouraging. We regard a finding of 50 percent or more successes for innovations in surgical and anesthetic experiments as a substantial gain and a clear opportunity for additional future gains. Well-conducted randomized clinical trials are being done. All of us, as potential patients, can be grateful for a system in which new therapeutic ideas are subjected to careful systematic evaluation.

Appendix

Estimating the distribution of gains. The model of the process is that of two-stage sampling. We regard the innovation and its paired standard as drawn from a population of pairs of competing therapies. Let $Z$ be the random variable corresponding to the improvement offered by the innovation (innovation minus standard), with mean $M$ and variance $A$. For the $i$th innovation with true gain $Z_i$, the experiment assesses the gain as $W_i$, and $W_i$ has mean $Z_i$ and variance $D_i$. If we assume as an approximation that the distributions of $Z_i$ and $W_i$ are normal, then the posterior distribution of $Z_i$ has mean

\[ Z_i^* = M^* + e_i (W_i - M^*), \]

where

\[ e_i = A^* / (A^* + D_i), \]

$A^*$ is an estimate of $A$, and $M^*$ is an estimate of $M$. The posterior distribution of $Z_i$ is approximately normal with mean $Z_i^*$ and variance $(1 - B_i) D_i$, where

\[ B_i = D_i / (A^* + D_i). \]

In the current problem the $D$'s are estimated from binomial theory because the $W$'s are the difference between two independent observed proportions. Details of obtaining $A^*$ and $M^*$ are given in (6). To estimate the cumulative distribution of $Z$, we compute for each observation $W_i$

\[ c_i = \frac{z - Z_i^*}{\sqrt{(1 - B_i) D_i}}; \]

then using normal theory we compute

\[ \Phi(c_i) = P(X < c_i), \]

where $X$ is a standard normal random variable. Thus

\[ \Phi(c_i) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{c_i} \exp\left(-\tfrac{1}{2} x^2\right) dx. \]

Finally

\[ \sum_{i=1}^{k} \Phi(c_i) / k \]

estimates P(Z < z) for each value of z. We thus release ourselves from the original normal approximation for Z and get a new distribution that is not normal but should be an improved approximation of the true distribution. When weights were used because one study gave two comparisons, they modified both the estimation of A and W and the estimation of P(Z < z).

References and Notes

1. For a discussion of MEDLARS, see M. Day, Fed. Proc. Fed. Am. Soc. Exp. Biol. 33, 1717 (1974). Indexing is done by specially trained abstractors at the National Library of Medicine. The MEDLARS contents vary over time as additions are made to correct omissions. Our initial search turned up 36 randomized clinical trials. These are listed in (6), appendix 9-1, pp. 145–154. A repeat search, approximately 18 months later, done according to the same search instructions, revealed 13 additional randomized clinical trials as follows: R.B. Noone, P. Randall, S.E. Stool, R. Hamilton, R.A. Winchester, Cleft Palate J. 10, 23 (1973); R. Smith, Trans. Ophthalmol. Soc. Aust. 27, 17 (1968); B. Brehmer and P.O. Madsen, J. Urol. 108, 719 (1972); J.E. Rothermel, J.B. Wessinger, F.E. Stinchfield, Arch. Surg. 106, 135 (1973); R.K. Laros, G.I. Zatuchni, G.J. Andros, Obstet. Gynecol. 41, 397 (1973); W.H. Harris, E.W. Salzman, R.W. DeSanctis, R.D. Coutts, J. Am. Med. Assoc. 220, 1319 (1972); J.W. Roddick, Jr., and R.H. Greenelaw, Am. J. Obstet. Gynecol. 109, 754 (1971); I.L. Rosenberg, N.G. Graham, F.T. DeDombal, J.C. Goligher, Br. J. Surg. 58, 266 (1971); R. Brisman, L.C. Parks, J.A. Haller, Jr., Ann. Surg. 174, 137 (1971); J.M. Lambie, D.C. Barber, D.P. Dhall, N.A. Matheson, Br. Med. J. 2, 144 (1970); D.B. Haverstadt and G.W. Leadbetter, Jr., J. Urol. 100, 297 (1968); J.A. Haller, Jr., et al., Ann. Surg. 177, 595 (1973); D.J. Pinto, East Afr. Med. J. 49, 643 (1972).
2. A.B. Miller, W. Fox, R. Tall, Lancet 1969-II, 501 (1969).
3. B. Efron and C. Morris, J. Am. Stat. Assoc. 68, 117 (1973).
4. B. Efron and C. Morris, J. R. Stat. Soc. B 35, 379 (1973).
5. B. Efron and C. Morris, Sci. Am. 236 (No. 5), 119 (1977).
6. J.P. Gilbert, B. McPeek, F. Mosteller, in Costs, Risks, and Benefits of Surgery, J.P. Bunker, B.A. Barnes, F. Mosteller, Eds. (Oxford Univ. Press, New York, 1977), chap. 9, pp. 124–169. For formulas see pp. 156–161.
7. F.C. Nance and I. Cohn, Jr., Ann. Surg. 170, 569 (1969).
8. N.D. Grace, H. Muench, T.C. Chalmers, Gastroenterology 50, 684 (1966).
9. T.C. Chalmers, J.B. Block, S. Lee, N. Engl. J. Med. 287, 75 (1972).
10. J.E. Bearman, R.H. Loewenson, W.H. Gullen, “Muench’s postulates, laws, and corollaries,” Biometrics Note 4 (National Eye Institute, Bethesda, Md. 1974).


11. M. Alpert and H. Raiffa, “A progress report on the training of probability assessors,” unpublished paper, Harvard University (28 August 1969).
12. A.R. Shapiro, N. Engl. J. Med. 296, 1509 (1977).
13. T.P. Almy, ibid. 297, 165 (1977).
14. B. McPeek, J.P. Gilbert, F. Mosteller, in Costs, Risks, and Benefits of Surgery, J.P. Bunker, B.A. Barnes, F. Mosteller, Eds. (Oxford Univ. Press, New York, 1977), chap. 10, pp. 170–175. The results of the initial sample of 107 are reported in (6). Of these, 36 were randomized and 34 could be used, 11 were nonrandomized controlled trials, and 59 were series (study of one therapy). Our additional sample added 13 randomized trials for use in Table 1. Chapter 10 discusses quality of life.
15. D. Rutstein, Daedalus 98, 523 (1969).
16. This work was facilitated by NIH grant GM 15904 to Harvard University, by the Miller Institute for Basic Research in Science, University of California, Berkeley, and by NSF grant SOC75-15702 A01. We appreciate the advice and assistance of A. Bigelow, M. Ettling, M. Gasko-Green, D. Hoaglin, V. Miké, A. Perunak, K. Soper, J. W. Tukey, and G. Wong.

Reprinted from Bulletin of the International Statistical Institute 47 (1977), pp. 559–572

27. Experimentation and Innovations

Frederick Mosteller
Harvard University, U.S.A.

1. Introduction

Evaluating social and medical programs has become a large business in the United States of America. A special journal, Evaluation, is now devoted to this work, and a new society has been formed, the Evaluation Research Society. This surge arises because social programs themselves are huge enterprises, and the nation wonders whether the benefits of the programs are worth their costs. Some legislation provides a small percentage of the appropriation for evaluation. In some instances programs that are planned for the future are tried out on a small scale, with a view to appreciating the merits and difficulties before they become full scale ventures. In other directions, the medical area has developed a large program in clinical trials intended to evaluate new therapies so as to judge their efficacy and safety. In this paper we discuss the need for using controlled field trials in assessing social and medical programs.

2. Performance of social innovations

One might suppose that innovations are generally valuable and successful, and therefore that they scarcely need testing, but the data arising from various field trials do not bear this out. Table 1 shows the distribution of the outcomes of a collection of social and socio-medical reforms that were subjected to randomized controlled field trials which compared the innovation with the standard treatment. The rating of the innovation compared with the standard was made by Gilbert, Light, and Mosteller (1975) on the basis of the reports by the original authors of the field trial. In making the evaluation, they did not include any measure of the cost of the innovation; instead they rated on the basis of the advantage of the innovation found in the trial. Therefore, if subjected to a cost-benefit analysis, some innovations might get a reduced rating. In the rating system a ++ indicates a substantial improvement, a + a modest improvement, a 0 no improvement, with similar definitions for − and −−. Short titles of the projects whose innovations are being rated are included in the table to give some idea of the variety of subjects being investigated.


TABLE 1. RATINGS OF SOCIAL AND SOCIO-MEDICAL INNOVATIONS

                                    Rating
                             −−    −    0    +    ++
Social innovations
  Subtotal                    0    0    3    2    3
Socio-medical innovations
  Subtotal                    0    1    4    2    1
Total                         0    1    7    4    4

Projects rated, by group and rating:
Social innovations
  0:   Welfare Workers; Girls at Voc. High; Pretrial Conf.
  +:   Cottage Life; L.A. Police
  ++:  Neg. Income Tax; Manhattan Bail; Emergency School Assistance Program
Socio-medical innovations
  −:   Kansas Blue-Cross
  0:   Comp. Med. Care; Drunk Probation; Nursing Home; Family Medical Care
  +:   Psychiatric After-care; Mental Illness
  ++:  Tonsillectomy

Source: Gilbert, Light, Mosteller (1975)

The main finding is that half of the innovations offered no improvement over the standard treatments. A quarter offered substantial improvements corresponding to the ++ ratings. The experiments were applied to juvenile delinquency, legal proceedings, welfare, mental illness and rehabilitation, handling drunks, advising parents of children having surgical operations, handling health insurance, and training physicians. Although not included in the table, an additional experiment compared two curricula in secondary school physics and produced two outcomes. The innovative program did no better, as measured by a common test, in teaching physics to the students than the standard one, but its students did like the course better and thought non-mathematical students could profitably study it as well as those more mathematically inclined.

3. Gains from medical innovations

In the medical area, Gilbert, McPeek, and Mosteller (1977b) have provided a corresponding set of ratings for randomized controlled clinical trials in the area of surgery and anesthesia. Table 2 gives their results. Again, the ratings come from an appraisal of what the investigators of the original experiments report. The experiments were sorted into two groups. The first group deals with the performance of the operation in its capacity as the primary therapy. The question for such operations is whether they are curing the patient’s disease, and whether the innovation causes increased survival or less morbidity. The second group deals with secondary therapies, which physicians use to reduce the frequency of bad side effects of the primary therapy; thus secondary therapies prevent infections or thrombosis or other difficulties following surgery and anesthesia.


TABLE 2. QUALITATIVE COMPARISONS BETWEEN INNOVATIONS AND STANDARDS STRATIFIED BY PRIMARY AND SECONDARY THERAPIES. WHERE A PAPER HAD TWO COMPARISONS, EACH WAS WEIGHTED ONE-HALF.

Rating of Innovation    Primary   Secondary   Total    %
++                      1         5           6        13
+                       4         4½          8½       18
0 (success)             2½        6           8½       18
0 (indifferent)         1½        1           2½       5
0 (disappointment)      6         5           11       23
−                       3         4           7        15
−−                      1         2½          3½       7
Total                   19        28          47*      (99)

*One paper contributed to both the primary and the secondary column.
Source: Gilbert, McPeek, Mosteller (1977b)
“Success” with 0 rating means that although the innovation was not superior to the standard, it offers advantages in some circumstances and performs equally well.
“Disappointment” with 0 rating means that the innovation, although performing at about the same level as the standard, has other drawbacks such as cost, side-effects, or requirements of special training that make it less desirable.

This collection of randomized trials has a somewhat stronger basis as a sample than those rated in Table 1. Whereas the social innovations were merely those the authors had been able to collect from broad reading, the surgical studies came from a computer search of the National Library of Medicine’s MEDLARS (Medical Literature Analysis and Retrieval System). This system lists most of the world’s output of medical articles. The innovations rated in Table 2 dealt with many areas of surgery such as: cancer (several forms), ulcer, transplantation, cirrhosis, bowel and colon surgery, liver injuries, stab wounds, open-heart surgery, and appendectomy. Examining Table 2, one sees that about half of the innovations have been rated as successful, with about 13% regarded as highly successful. Also about 22% got negative ratings. For a discussion of the amounts of improvement see Gilbert, McPeek, and Mosteller (1977a). Tables 1 and 2 together suggest that social innovations and surgical innovations are successful about half the time. Since innovations brought to the stage of being tested by randomized trials are ordinarily thought by their innovators to be sure winners, the evidence is strong that innovations do need testing.
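The proportions quoted for Tables 1 and 2 can be checked directly from the totals in those tables. The short Python sketch below is only an illustration added for the reader, not part of the original analysis; the dictionary and function names are hypothetical, and the counts are copied from the Total row of Table 1 and the Total column of Table 2.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# Total row of Table 1 (16 rated innovations); keys use ASCII "-" for the minus sign.
TABLE1_TOTALS = {"--": 0, "-": 1, "0": 7, "+": 4, "++": 4}

# Total column of Table 2 (47 weighted comparisons). Halves arise because a
# paper with two comparisons contributed one-half to each.
TABLE2_TOTALS = {
    "++": Fraction(6),
    "+": 8 + HALF,
    "0 (success)": 8 + HALF,
    "0 (indifferent)": 2 + HALF,
    "0 (disappointment)": Fraction(11),
    "-": Fraction(7),
    "--": 3 + HALF,
}


def share(counts, ratings):
    """Exact fraction of all rated innovations that fall in the given ratings."""
    total = sum(counts.values())
    return Fraction(sum(counts[r] for r in ratings), total)


def as_percent(fraction):
    """Round an exact fraction to the nearest whole percent."""
    return round(100 * float(fraction))


# Table 1: innovations rated 0 or worse, and those rated ++.
no_improvement_t1 = share(TABLE1_TOTALS, ["--", "-", "0"])
substantial_t1 = share(TABLE1_TOTALS, ["++"])

# Table 2: "successful" is read here as ++, +, or 0 (success).
successful_t2 = share(TABLE2_TOTALS, ["++", "+", "0 (success)"])
highly_successful_t2 = share(TABLE2_TOTALS, ["++"])
negative_t2 = share(TABLE2_TOTALS, ["-", "--"])

print("Table 1, no improvement:", no_improvement_t1)
print("Table 1, substantial improvement:", substantial_t1)
print("Table 2, successful (%):", as_percent(successful_t2))
print("Table 2, highly successful (%):", as_percent(highly_successful_t2))
print("Table 2, negative (%):", as_percent(negative_t2))
```

Treating "0 (success)" as a success is the reading that reproduces the "about half" figure in the text; the other two zero categories are counted as neither successful nor negative.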


4. Why randomize? Evidence from non-randomized trials

Students and teachers of statistics often ask for evidence that randomization is needed in making comparisons among treatments. Usually they are thinking about the mathematical evidence. Probably the more influential evidence comes from the field studies themselves. We have two kinds. The first comes from findings parallel to Table 2 but with less well controlled trials. The second comes from results from much the same study repeated under varying conditions. Examples of the first of these are given in Table 3, and of the second in Table 4.

For controlled but non-randomized trials, the investigator’s report leads us to give nearly half of the innovations a very high rating (++), a substantial improvement over the standard. Thus we see that the studies where randomization was not used led to giving the innovation a higher rating more often than did the randomized studies as shown in Table 2. The argument can be made that the Table 3 experiments were not comparable to those on which Table 2 is based. For example, one might argue that where it was clearer that the therapy was excellent, a less well controlled trial might be satisfactory and be more often used.

TABLE 3. SUMMARY FOR CONTROLLED NON-RANDOMIZED TRIALS IN SURGERY AND ANESTHESIA

Rating of Innovation    Primary   Secondary   Total    %
++                      2         3           5        45
+                       1         1           2        18
0 (disappointment)      0         2           2        18
−                       0         1           1        9
−−                      0         1           1        9
Totals                  3         8           11       (99)

“Disappointment” with zero rating: see footnote to Table 2.
Source: Gilbert, McPeek, Mosteller (1977a)
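As a rough check on the comparison between Table 3 and Table 2, the sketch below recomputes the percentage column of Table 3 and sets the share of ++ ratings beside the corresponding share for the randomized trials. It is an illustrative calculation only; the names are hypothetical and the counts are taken from the Total columns of the two tables.

```python
# Total column of Table 3 (controlled, non-randomized trials).
TABLE3_TOTALS = {"++": 5, "+": 2, "0 (disappointment)": 2, "-": 1, "--": 1}

# From Table 2 (randomized trials): 6 of 47 weighted comparisons were rated ++.
RANDOMIZED_PLUS_PLUS = 6
RANDOMIZED_TOTAL = 47


def whole_percent(part, whole):
    """Percentage rounded to the nearest whole number, as printed in the tables."""
    return round(100 * part / whole)


n_table3 = sum(TABLE3_TOTALS.values())
table3_percent = {
    rating: whole_percent(count, n_table3)
    for rating, count in TABLE3_TOTALS.items()
}
randomized_percent = whole_percent(RANDOMIZED_PLUS_PLUS, RANDOMIZED_TOTAL)

print("Table 3 percentages:", table3_percent)
print("Sum of rounded percentages:", sum(table3_percent.values()))
print("++ share, non-randomized vs randomized:",
      table3_percent["++"], "vs", randomized_percent)
```

The rounded percentages add to 99 rather than 100, which is why the tables show the total in parentheses. With only 11 non-randomized trials, the gap between the two ++ shares is suggestive rather than precise.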

Thomas Chalmers and his colleagues have looked into a sequence of situations where trials of the same therapy have been repeated by many different investigators under a variety of degrees of control. Since we can rarely repeat experiments in the political, social, or economic world, these medical results are most informative for those who believe that what happens in experimentation in one field of endeavor is likely to recur in another. Table 4 gives the result of 53 different clinical trials—that is 53 different studies all of the operation portacaval shunt. This operation is designed to reduce pressure from the blood stream in the esophagus and thus prevent or stop hemorrhaging in the patient. The operation has a substantial death rate.


TABLE 4. DEGREE OF CONTROL VERSUS DEGREE OF INVESTIGATOR ENTHUSIASM FOR PORTACAVAL SHUNT OPERATION IN 53 STUDIES HAVING AT LEAST 10 PATIENTS

                         Degree of Enthusiasm
Degree of Control        Marked   Moderate   None   Totals
Well controlled          0        3          3      6
Poorly controlled        10       3          2      15
Uncontrolled             24       7          1      32
Totals                   34       13         6      53

Revised from Grace, Muench, and Chalmers (1966), their Table 2, p. 685. Copyright 1966, The Williams and Wilkins Co., Baltimore, Md. Dr. Chalmers advised me of two additional studies to add to the well-controlled and moderate cell, raising the count from 1 to 3.
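One way to read Table 4 is to convert each row to proportions, so that the degrees of control can be compared despite their different numbers of studies. The Python sketch below does this; it is an added illustration with hypothetical names, using only the cell counts printed in Table 4.

```python
# Cell counts from Table 4: rows are degree of control, columns are the
# investigator's degree of enthusiasm after the study.
TABLE4 = {
    "Well controlled": {"Marked": 0, "Moderate": 3, "None": 3},
    "Poorly controlled": {"Marked": 10, "Moderate": 3, "None": 2},
    "Uncontrolled": {"Marked": 24, "Moderate": 7, "None": 1},
}


def row_shares(row):
    """Convert one row of counts into proportions of that row's total."""
    total = sum(row.values())
    return {level: count / total for level, count in row.items()}


shares = {control: row_shares(row) for control, row in TABLE4.items()}

for control, row in shares.items():
    cells = "  ".join(f"{level}: {p:.0%}" for level, p in row.items())
    print(f"{control:18s} {cells}")
```

The share of studies reporting marked enthusiasm rises as control weakens, from none of the 6 well controlled studies to three quarters of the 32 uncontrolled ones. The well controlled row rests on only 6 studies, so its proportions are coarse.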

Table 4 relates the investigator’s enthusiasm for the operation after the study is completed to the degree of control exercised in the study. It shows clearly that the less well controlled the study, the more enthusiasm the investigators have for the operation. (It also shows that well controlled trials are rare.) The information in Table 4 then is similar to that in the comparison between Table 2 and Table 3. It again shows that poorer control seems to lead, on the average, to better ratings for the innovation.

In addition to the study whose results are shown in Table 4, Chalmers and his associates have reviewed other therapies where many studies have been made and have found for each therapy that less control goes with more enthusiasm from the investigator after the study has been completed. These findings fit well with one of Muench’s postulates (Bearman, Loewenson and Gullen, 1974), which says in essence that nothing improves the performance of an innovation more than the lack of controls. Thus some support for using randomization in trials comes from the information about the differential outcomes when randomized assignments are not used.

5. When will experiments be most useful?

Since experimentation is relatively new in the policy field, the question naturally arises as to when it is a useful technique. Alice Rivlin (1974) has listed four types of social situations where one might consider experimentation, has provided examples, and has indicated problems of execution and of inference for each. Because experimentation often has political difficulties, requires a great deal of care in execution, and may be expensive, one does not want to recommend it unless it seems especially useful. She considers four types of questions:

Type 1. Small independent units. What is the response of individuals, households, or other micro units to a change in economic incentives?


Examples: income-maintenance experiment, housing demand experiment, health insurance experiment.

Type 2. Market response. What is the market response to a change in economic incentives? Examples: housing supply experiment, effect of health insurance on supply of health services.

Type 3. Output. What is the production function of a public service, such as health, education, or manpower training? Examples: performance of special curricula or special manpower training programs, family planning.

Type 4. Changing incentives. How can incentives be altered in order to affect the production of a service and in turn to affect ultimate outcomes? Examples: education vouchers as an alternative way to obtain education instead of the public school system; health maintenance organizations as a device for encouraging the health providers to take more interest in keeping people well inexpensively; performance contracting in education, which encourages the teaching organization to improve the performance of the students by paying the teachers’ contractor more when the students perform better.

All experimentation takes a great deal of care, and social experiments can be interfered with by daily events in drastic ways, as when an income-maintenance experiment meets up with a mass lay-off or changes in welfare rules.

Generally speaking, the Type 1 experiments, where individual units are treated separately, are the easiest to do and have the clearest cut results. The inference may have difficulties, because we may want to know what will happen when the treatment is given to all people, or on a permanent basis, and these issues cannot be tested by a sample from the population carried out over a short time.

Type 2 investigations of markets are difficult to carry out because of their scope and because of their high cost. Besides, the sample size is likely to be very small. For example, one has to realize that a study done on a whole market in one city, say Milwaukee, is just a sample of size one.
The political problems can also block the study. Problems of scale can be overwhelming. Perhaps experimentation is not very workable for this kind of problem. Type 3 questions which ask about the effects of inputs on outputs, although regarded by Rivlin as more difficult than those of Type 2 do not seem so to me. Her examples include studies of the educational programs Follow Through and Head Start designed to improve the education of underprivileged children. In such studies one has to deal with substantial units rather than with individuals, and therefore one has to work at getting comparability of units, at the same time keeping the units separate or independent and to get comparability between the experimental and control units. The sample size is essentially the number of units and not the number of children. In the Planned Variation studies of Follow Through no attempt was made to get comparability of units. Developers of several different teaching programs, called models, arranged to install their model in a number of classrooms in different school districts. Then


the instruction proceeded and measurements of gain in academic achievement were taken. Extensive statistical analysis did not have much success unraveling the relative merits of the several teaching programs. Since we know from a great many educational investigations (but not many experiments) that very large effects owing to modifications in the educational system have not been found, we should plan on measuring small effects. This in turn means that much care in the design is required, including comparability of initial conditions. (These remarks do not mean that an educational system has small effects, but that changes in a well established system are likely to produce small improvements.) Nevertheless, keeping independence between the units is hard. For example, in the USA in the experimental studies of the Emergency School Aid Act (ESAA), a program designed to improve the education in school systems undergoing desegregation, some schools within districts were given funds from the ESAA program, at random, and others not. A matched pair system was used to get comparability among schools in the experimental and control groups. School administrators had several sources of funds, and they were able to reassign the funds from other sources so that schools got comparable total funds even though they did not get them all from the same sources. Thus when a comparison was made to see what the additional funds from ESAA had accomplished, the effect of the differential money was essentially wiped out by this reallocation of funds. Later investigations were able to arrange their allocations so that this did not happen, but clearly it is an important issue. One advantage of assessing a large public program is that it is already in place and the program already is being paid for. 
Thus experimentation within the program can be relatively inexpensive since only the marginal costs of the investigation are being charged, and one does not have also to find the funds for the basic service being delivered. For example at one time there were 18,000 Head Start units. The opportunity to experiment when there are so many units is almost an obligation. When so much money is being spent there need to be developments to see whether the system can be improved by evolutionary investigations. The Type 4 questions are especially difficult to address by experiment, or perhaps in any way. For example, the state might give parents educational vouchers so that they can buy their children’s education where they wish, and thus stimulate schools to improve their teaching in order to compete for students. Such an innovation requires a substantial time for the system to respond. An entrepreneur can scarcely afford to develop a first class school in response to a field experiment that may terminate at any time. Other difficulties about the stability of the findings will occur to the reader. A second instructive example is performance contracting where contracts were let for the education of poorly achieving children. The more the children learned, the more the contractors were paid. The government gave little lead time for the contractors and the program lasted only one year. It is an example


where extensive political opposition caused difficulties. Furthermore the contractors, by and large, lost money. Raising the attainments of the children at the bottom of the class can only be a difficult assignment because so many of the reasons for low achievement are outside the control of the teacher. Rivlin says of these studies that the “Performance contracting did not produce dramatic results in one year.” Gramlich and Koshel’s data (1973) suggest that the contractors obtained a gain of perhaps 9% over the control groups in spite of their quick start and inexperience. The whole experience shows how difficult a Type 4 question can be. The complications of the political situation, and the time it takes for institutions to be built and to stabilize themselves, make these approaches very difficult to look into with experiments, because the effect is complicated and is achieved only through influencing organizations that in turn must influence groups and individuals.

And so, while appreciating Rivlin’s classification, I would quibble a bit and change the order of difficulty from 1, 2, 3, 4 to 1, 3, 2, 4, or even 1, 3, 4, 2. Until we get more experience, experiments on the Type 1 and 3 questions seem to me to have more chance of success. Demonstration projects seem more likely efforts for Types 2 and 4 at this stage of our understanding. Other ways of sorting problem areas may be useful, and I hope economists and others will give further thought to this issue.

6. Special considerations

The idea of carrying out genuine experiments on people is one that makes many of us recoil, and the objections can be extensive. Let us review and comment briefly on some of these objections.

6.1 Putting People at Risk

In some experiments, one group gets a given treatment and another group does not, and the treatment may be harmful. Several issues must be faced. The first is how can we appraise the new treatment if we do not do comparative experiments? One common method is to decree that a new plan applies to all. For example, a new mathematics program may be installed in a whole school system. Or a nation-wide change in laws may be instituted. Generally speaking, the usual way of handling innovations is to introduce them without any way of getting information about their value, and then to replace them on the basis of equally weak information. We usually have no plan for improving the idea, if it is a good one, by gathering suitable data on the outcome of appropriate variations of the innovation. The undercurrent of this complaint is the secret belief that the information about whether a change is going to be for the better must already be known. This view does not face the reality that frequently innovations are useless or worse. Thus the usual way of handling innovations, which is just to install them without regard to their actual performance, is one that should be regarded


with some suspicion because it does not recognize the empirical results we have provided in Tables 1 and 2 of this article. They show that we do not know in advance what innovations are helpful and what not. They only show that some will be and that it is necessary to test and to select if we are to benefit from the process of innovation.

6.2 Withholding a Good

The opposite of the risk argument is that of withholding a good. Here the supposition is that the innovation is a good one and that people should not be deprived of it. Again, it often happens not only that the innovation is not a good one but that it has negative value, as our tables show. In the medical area a practice that is especially troublesome to many is the giving of placebos. Where there is a standard treatment, the standard treatment will ordinarily be the alternative to the innovation rather than a placebo. Placebos probably should be used rarely. Several situations need special comment.

a) Placebos. Occasionally in medical work the placebo is the innovation, and indeed in one of the studies found in the MEDLARS search (Table 2) the placebo turned out to be superior to two standard treatments, better than one and much better than the other. Thus what Henry Beecher, the great anesthesiologist, called “the powerful placebo” should not be sold short. Current medical opinion seems to be changing, and the newly discovered biochemical systems in the brain are believed by some to be stimulated by placebos, so that the treatment may be legitimized in a new way.

b) Terminal patients. In extreme situations where the patient is at death’s door and established treatments are known to be of limited value, one can readily appreciate a willingness to try some entirely unproven treatment. By the same token, a placebo may be regarded as inappropriate in these circumstances. Clearly the case of terminally ill patients requires very special attention.

c) Special groups. Many object when special groups are selected for experimentation and not others. Naturally this can be an evil. This seems especially dangerous when the groups have little control over their own lives and little political power or are poorly able to defend themselves, such as when children, prisoners, and the poor are used in experiments. At the same time, some of these groups suffer especially from certain diseases and they should not be denied the proper appraisal of therapies for these diseases just because they belong to special groups. Similarly in the social area, experiments on income-maintenance inevitably must be associated with people who need the money. We require rules and safeguards to protect special groups, but the rules should not overprotect. Thus informed consent by individuals in medical experimentation is a valuable requirement, but in addition we depend upon ethics groups to review the overall plan and guard against group-wide abuses.

d) Participation. Many kinds of studies employ volunteers. Even though there may be a differential effect of treatment on volunteers and non-volunteers, the choice of treatment can only depend on the outcome for the volunteers.


Thus non-participants can expect that their lack of representation may sometimes lead to less than optimum social and medical programs for them.

e) Scarce resources. When resources for a treatment are scarce, treating some and not others in an experimental mode seems especially justifiable.

6.3 Time

Experiments take a long time to set up and do carefully. And if we reflect on laboratory work in physical and biological science, we would not expect the first experiment or even the first few experiments to be done well enough to provide the answers we need. And so experiments do indeed take time. At the same time, we need to realize that the political process in the USA (perhaps in some other countries as well) moves slowly, and so ideas for major social innovations are actually adopted after lengthy consideration. It is true that we have frequent flurries of reports of instant action on the part of legislatures and our Congress, but by and large they die to return another year. Usually a great deal of discussion occurs before a major new policy goes into effect. Consequently, far from delaying anything, a careful study of an issue, especially a good controlled field experiment, might be of great help to the law makers in their deliberations.

Often too, it is claimed that the demand for a study is merely a delaying action. Perhaps, but if a political body has decided to delay, it will not be due to the wait for a study, though it may commission one as part of its delay.

In the USA as in many other countries the great domestic issues have been with us for a long time, and they have been the subject of dozens of studies, many taking years. Some examples are: what to do about the drug problem, national health insurance, the volunteer army, urban renewal, inflation, unemployment, manpower training, abortion, juvenile delinquency, and mass transit. Not all these problems are to be treated by way of experimentation, but some profitably can be. The main point is that, for many public problems, the issue of urgency is not as impressive once one has reflected on them.

When my son was planning to apply to college he was asked to write an essay saying what he regarded as the most pressing problem in the world and what he would do about it. I asked what he told them. Mass transit! I was surprised and suggested that other problems in the world might be regarded as more pressing. He said that it was true that war and inflation were bigger problems, but they couldn’t be very pressing because they had been around since long before he was born and people didn’t seem to do much about them. It is a hard lesson first that problems that are urgent to individuals may not be urgent to nations and second that all problems may not have solutions.

Finally, we should realize also that studies that are not done now will not be available to assist policy makers when they are needed later. If studies take a good deal of time, then it would be well to start them. Even when new policies are instituted they may be changed over a period of a few years.


The short-term perspective of years that we tend to use should perhaps be replaced by decades in policy work.

6.4 Costs

The costs of policy experimentation can of course be large. As mentioned above, one must distinguish between the marginal cost of managing an investigation and the cost of the treatment which would have been applied even in the absence of experimentation. In addition, in counting the net cost of an experiment, one has to weigh not just the absolute cost of managing the experiment but also the long run cost of applying an ineffective program for years. Thus we have costs of not doing experimentation as well as costs of doing.

6.5 Other methods

The principal other methods in use in policy analysis are, in alphabetical order, cost-benefit analysis, data banks, guessing, longitudinal studies, management information systems, observational studies, quasi-experiments, regression models, sample surveys, simulation, and theory. Discussing each of these in turn with respect to its possible accomplishments could be a very useful activity, but is not within the scope of this paper. Instead, I will mention that except for quasi-experiments, which ordinarily use recent history as their control, none of these methods actually change a treatment in society or in an individual and record the response. What they do is allow individuals to be treated by the events of society and report on what happened to those that happened to be treated in various ways. At best this latter approach makes for a weak inference because there are so many other things that could have caused the performance that is regarded as the outcome measure.

7. Acknowledgment

This work was facilitated by grant SOC75-15702 to Harvard University from the National Science Foundation.

References

Bearman, J. E., Loewenson, R. B. and Gullen, W. H. (1974). Muench’s postulates, laws and corollaries. Biometrics Note No. 4, Bethesda, National Eye Institute, Department of Health, Education, and Welfare.
Evaluation, Program Evaluation Resource Center, Minneapolis Medical Research Foundation, Inc., 501 South Park Avenue, Minneapolis, Minnesota 55415.
Evaluation Research Society, c/o Dr. Judith Garard, Dept. of Psychiatry, Univ. of Minnesota, Box 393, Mayo Memorial Bldg., Minneapolis, Minnesota 55455.
Gilbert, J. P., Light, R. J. and Mosteller, F. (1975). Assessing social innovations: an empirical base for policy; in: Evaluation and Experiment: Some Critical Issues in Assessing Social Programs; edited by Bennett, C. A. & Lumsdaine, A. A. Academic Press, New York, 39–193.


Gilbert, J. P., McPeek, B. and Mosteller, F. (1977a). Progress in surgery and anesthesia: benefits and risks of innovative therapy; in: Costs, Risks, and Benefits of Surgery; edited by Bunker, J. P., Barnes, B. A. & Mosteller, F. Oxford University Press, New York. Gilbert, J. P., McPeek, B. and Mosteller, F. (1977b). Statistics and ethics in surgery and anesthesia, Science, 198, 684-689. Grace, N. D., Muench, H. and Chalmers, T. C. (1966). The present status of shunts for portal hypertension in cirrhosis, Gastroenterology, 50, 684. Gramlich, E. and Koshel, P. (1973). Social Experiments in Education: the Case of Performance Contracting. Brookings Institution, Washington, D.C. Rivlin, A. (1974). Allocating resources for public research: how can experiments be more useful? American Economic Review, 64, 346-354. Key words Evaluation, social experimentation, controlled trials, randomized controlled trials, non-randomized trials, randomized clinical trials. Abstract Society cannot afford to institute poorly tested reforms because a substantial fraction of innovations, when carefully tested, prove not to be beneficial. Poorly controlled investigations are likely to make an innovation appear to perform better than it actually does. Where applicable, well controlled field trials of innovations can offer valuable information. The better opportunities occur when dealing with effects on small, independent, individual units such as individuals, households, classrooms, or small geographic units. The difficult problems of changing organizations or markets where the chain of causation leading to the output is long seem less likely to yield to this method, though demonstrations may still be useful. 
Although individuals must be safeguarded in setting up experiments, the usual arguments against social experimentation such as concern for putting people at risk, opposition to withholding a good, the time-consuming nature of experimentation and its costliness, do not stand up well when compared with the alternatives of haphazard installation of innovations. Such methods give us little chance to appraise the benefits of innovations. Alternative methods of investigation, although often valuable, have the weakness that they compare different situations as they stand but do not actually make changes in treatment in the field and observe their effect. Society needs to continue to extend and improve its methods of controlled experimentation in the appraisal of social, economic, and medical innovations.

Résumé

La société n’a pas les moyens de promouvoir des réformes mal testées parce qu’une proportion appréciable d’innovations, quand soigneusement testées, se sont révélées peu profitables. Des études sans contrôles ont des chances de faire apparaître les performances d’une réforme meilleures qu’elles ne sont en réalité. Quand c’est possible, mettre à l’épreuve les innovations sur le terrain avec de bons contrôles offre des renseignements précieux. Les occasions les plus favorables se produisent quand on a affaire à de petites unités individuelles et indépendantes telles que des individus, familles, classes ou de petites régions géographiques. Il ne semble

27 Experimentation and Innovations


pas que les problèmes difficiles de changer les organismes ou marchés puissent être résolus par cette méthode, bien que des démonstrations soient probablement utiles.

Bien que les individus doivent être protégés quand on organise une expérience, les arguments habituels contre l’expérimentation sociale tels que le souci des risques encourus, le refus de retenir un progrès, le temps et coût des expériences, ne se défendent pas bien quand on les compare avec l’alternative de réformes désordonnées. De telles méthodes nous laissent peu de chance d’apprécier les bénéfices de l’innovation. Certaines autres méthodes d’études, bien que souvent valables, ont le défaut de comparer des situations différentes au lieu de faire des changements sur le terrain et observer leurs résultats. La société a besoin de continuer, généraliser et améliorer ses méthodes d’expérimentation contrôlée pour l’évaluation des innovations dans le domaine social, économique et médical.

Reprinted from Journal of Contemporary Business (1979), 8, 79–92

28. New Statistical Methods in Public Policy. Part I: Experimentation Frederick Mosteller and Gale Mosteller Harvard University and University of Chicago

Using data from the population as it stands is a dangerous substitute for testing.

Editor’s Note: This article by the Mostellers has been divided into two parts. Part I discusses experimentation and Part II discusses exploratory data analysis.

As quantitative methods have been applied more and more to public policy work, the whole panoply of statistical devices becomes appropriate for the policy student. Fairley and Mosteller (1977) have recently provided a book illustrating the applications of many statistical methods to problems in public policy. We cannot treat all of these here. We choose to discuss two areas in some detail. The first deals with data gathering, more specifically experimentation and its importance in policy work. We do not discuss how to do it, but we do try to give an empirical base for understanding the need as well as limitations. Second, we treat exploratory data analysis. We try to give enough of the flavor of the methods and of the ideas behind them that readers will have a running start when they turn to more comprehensive works, and so that they will be able to follow the reasoning associated with some exploratory analyses. We have not here carried out large-scale practical applications, because we have to exposit the methods themselves and their rationale.

Key Words: Controlled Trials, Evaluation, Experimentation, Innovations.

EXPERIMENTATION

Until fairly recently, economists thought it infeasible to inform public policy by using experiments with direct interventions and control groups instead of observational studies or theoretical analyses. But over the past two decades some substantial attempts to use experiments and near-experiments to study


Frederick Mosteller and Gale Mosteller

areas such as aid to the working poor, educational programs, and national health insurance have brought these techniques into the public eye. Our discussion treats the need for and the limitations of experiments for public policy.

What do experiments tell us?

The motivating idea, well expressed by George Box, is that if we want to see what happens when we change something in a complicated system, we need to change it and observe what happens. In causal work, policy analysts make heavy use of regression methods for analyzing observational studies, and this leads to difficulties. We are all used to saying quickly that “correlation is not causation” and then illustrating the weakness with the high correlation between quantities that seem ridiculous to relate. An example would be the correlation between the number of Presbyterian ministers in New York and the amount of Scotch whiskey produced in Great Britain. Although such amusing illustrations make the point about spurious correlation, they do not illuminate the causative problem with observational studies as brightly as does the next example.

The Population as it Stands

Let us consider a familiar example: height versus weight for male adults in the United States. We can certainly fit a regression line of height on weight and then use weight to predict height. Thus, in a sense, the regression line serves two purposes, as a predictor and as a summary of the relation between the two variables in the population as it stands. The regression does not tell us what will happen to an individual’s height when we change his weight. We know from many sad experiences that, in adults, changes in weight produce practically no change in height. Thus the regression for predicting height from weight was misleading for predicting what would happen when weight changes. We are invited to confusion by a regression equation like

$$\hat{h} = 28 + \tfrac{1}{4}\,w$$

where $w$ is the weight in pounds and $\hat{h}$ the estimate of height in inches.
The equation seems to say that if we change w by 4 pounds we get an increase of 1 inch in the estimate of height. If the equation is derived from an observational study, changing w refers to choosing different individuals, moving from those in the population with one weight, say 160 pounds, to others in the population weighing 164 pounds. The reference is to the population as it stands. Increasing w means measuring heights for other individuals who are heavier; it does not mean changing the weights of the individuals. The silent equation cannot remind us of this difference. More generally, when we relate a predictor variable to an outcome variable, based only on an observational study, we cannot count on the fitted relation to tell what will happen to outcome when the predictor variable is changed. In practical policy situations, we may introduce innovations by changing the values of some of the variables and observing changes in outcome variables.
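The difference between the population as it stands and an actual change can be sketched with a small simulation. The Python fragment below is an illustration under invented assumptions, not an analysis from the article: the population parameters (mean height 69 inches, about 4 pounds per inch, the noise level) and the helper `fit_line` are hypothetical. It fits height on weight in a simulated population, then compares the height change the fitted line seems to promise for a 4-pound gain with the change that follows when adult height does not respond to weight.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least-squares fit of y on x; returns (intercept, slope)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population as it stands: height is a fixed adult attribute,
# and taller men tend to weigh more (all numbers are assumptions).
heights = [rng.gauss(69.0, 3.0) for _ in range(5000)]
weights = [4.0 * (h - 28.0) + rng.gauss(0.0, 5.0) for h in heights]

intercept, slope = fit_line(weights, heights)

# What the "silent equation" seems to promise for a 4-pound gain.
predicted_gain = slope * 4.0

# The intervention: every man gains 4 pounds. Adult height does not respond,
# so the heights after the change are the heights before it.
weights_after = [w + 4.0 for w in weights]
heights_after = list(heights)
actual_gain = statistics.fmean(heights_after) - statistics.fmean(heights)

print(f"fitted slope (inches per pound): {slope:.3f}")
print(f"gain suggested by the regression: {predicted_gain:.2f} inches")
print(f"gain after actually changing weight: {actual_gain:.2f} inches")
```

The fitted slope is a faithful summary of how height varies across men of different weights; it is only the reading of that slope as the effect of changing one man's weight that fails.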

28 Public Policy Experimentation


For example, we change the industrial taxes in a city and then observe the change in the amount of industry. Essentially, we are carrying out part of an experiment. This activity is not the same as looking at the relation between taxes and amounts of industry across cities. Researchers applying regression analysis to the population as it stands may be tempted to use their results to make invalid policy predictions. For example, a large income elasticity for “schooling” does not imply that low income families will substantially increase “schooling” for their families if given an income supplement. The elasticity, instead, shows that higher-income families spend much larger proportions of their income on “schooling” than lower-income families. To give another example, production-function regressions based on the population as it stands do not suggest the size of an output change that can be gotten from an input change unless special assumptions hold, notably that the production function is stationary across time or is the same for all countries or firms in the sample. Instead such production functions describe the relationship between inputs and outputs across time, countries, or other observational units. When regression methods are applied to data arising from experiments, where the changes are actually made and the outcomes observed, the causation argument is more likely to hold. This, of course, is a major attraction of experiments: that we may be able to assess the effect of a change in policy. Reforms and Experiments If innovations were always blessings, we would scarcely need to test them, and if they were always failures, the same would be true. What can we say about the track record of innovations? To illustrate what has happened when social reforms have been assessed by experiments, we review the results of Gilbert, Light, and Mosteller (1975). They assessed the value of each reform on a five-point scale: very good ++, good +, neutral 0, bad −, very bad −−. 
The score measures how well the innovation performs as compared with the standard treatment it replaced. The scoring assumed that the innovations were cost-free. If cost were included, the scores would drop somewhat. These studies were classified into social, socio-medical, and medical innovations. The first group includes eight, of which four are negative income tax, studies of bail, training of police, and attempts to reduce delinquency among girls. The second group includes eight, of which four are experiments on probation for drunks, effects of widening health insurance benefits, training mothers whose children have tonsillectomy, and training physicians in comprehensive medical care. The third group includes twelve, of which four are the Salk vaccine experiment for polio prevention, treatment of cancer of the bronchus, vagotomy, and gastric freezing for ulcer. Table 1 shows the scoring. Table 1 shows that among those rated, fewer than half showed improvements, though few seemed to have a negative effect. A neutral score (0) need not be bad if the innovation costs no more than the standard treatment. The


social and socio-medical innovations usually did cost more. The medical ones were mixed.

These authors were concerned that their sample might not be satisfactory because they chose social and socio-medical experiments as known to them, and they had no systematic source. Medical innovations came, with two exceptions, from a systematic computer search of the medical literature. To get a look at what might happen with objective sampling from a larger medical literature, Gilbert, McPeek, and Mosteller (1977) studied surgical innovations that had been tested by experiments (randomized clinical trials). They made one change in the scoring system: the 0’s were broken into two groups, those regarded as a success, because they provided a valuable alternative treatment for some patients, and those regarded as a disappointment, because they were more trouble, more expensive, or more dangerous than the standard.

Table 1 Summary of ratings of social, socio-medical and medical innovations as compared with the standard treatment

                              −−     −     0     +    ++   Total
Social Innovations                         3     2     3       8
Socio-medical Innovations            1     4     2     1       8
Medical Innovations            1     1     6     2     2      12
Totals                         1     2    13     6     6      28
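The proportions quoted from Table 1 can be checked with a short tally. This is a sketch, assuming blank cells are zeros and that the entries in the two negative columns sit in the rows implied by the printed row and column totals; the variable names are illustrative.

```python
# Columns in order: very bad (--), bad (-), neutral (0), good (+), very good (++).
ratings = ["--", "-", "0", "+", "++"]

# Counts as read from Table 1, with blank cells taken as zero (an assumption).
table1 = {
    "Social": [0, 0, 3, 2, 3],
    "Socio-medical": [0, 1, 4, 2, 1],
    "Medical": [1, 1, 6, 2, 2],
}

# Column totals across the three groups, and the grand total.
column_totals = [sum(col) for col in zip(*table1.values())]
n_innovations = sum(column_totals)

improved = column_totals[3] + column_totals[4]  # rated + or ++
harmful = column_totals[0] + column_totals[1]   # rated - or --
neutral = column_totals[2]

print(dict(zip(ratings, column_totals)))
print(f"improved: {improved} of {n_innovations} ({improved / n_innovations:.0%})")
print(f"neutral:  {neutral} of {n_innovations}")
print(f"harmful:  {harmful} of {n_innovations}")
```

Twelve of the 28 innovations were rated as improvements and three as worse than the standard, which is the sense in which fewer than half improved while few did harm.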

They also broke the innovations into groups. One group included innovations in the therapy for the disease, called primary treatments, and the other included innovations for the recovery from the therapy (reduced infections, reduced thrombosis, reduced complications following surgery), called secondary treatments. Table 2 shows the results. Again slightly fewer than half of the innovations improved on the standards. Before the tests, the innovations were highly regarded or they would not be likely to have been brought to expensive randomized trials. Thus innovations do not automatically turn out to be improvements even in the hands of the most careful people. Poorly Controlled Trials What happens when investigations with less careful controls have been carried out? In the policy area, we rarely have comparability of studies, but we can draw some analogies with the surgical area. Gilbert, McPeek, and Mosteller (1977) also report the outcomes for controlled non-randomized trials


Table 2 Ratings for surgical innovations: randomized controlled clinical trials

              Disappointment         Success
                −−     −     0     0     +    ++   Total
Primary          1     3     7     2     5     1      19
Secondary        3     3     3     2     2     4      17
Totals           4     6    10     4     7     5      36

in surgery as shown in Table 3, just as in Table 2. Controlled trials that did not have randomization produced nearly half ++’s and more than half showed the innovation to be an improvement.

An even more telling example is given by Grace, Muench, and Chalmers (1966), who report on many separate studies of the same new operation, portacaval shunt. Among these, 6 were well controlled and 47 were poorly controlled or uncontrolled. After each experiment, the investigators reported their degree of enthusiasm for the new operation. Table 4 (slightly extended from the original article as a result of correspondence with Dr. Chalmers) shows the results of the 53 studies. Each had at least 10 patients.

In both examples, less well-controlled studies produce more enthusiasm for innovations than do well-controlled ones. We also note that well-controlled investigations are rarer in this instance. These results all support Hugo Muench’s law, which essentially says that nothing improves the performance of an innovation more than lack of controls. We have here a fair amount of empirical evidence that when innovations are assessed without well-controlled experiments, they are likely to be overrated.

Table 3 Ratings for controlled non-randomized trials

              Disappointment         Success
                −−     −     0     0     +    ++  Totals
Primary                                  1     2       3
Secondary        1     1     2           1     3       8
Totals           1     1     2     0     2     5      11

Table 4 Ratings of portacaval shunt operations, 53 investigations

                            Degree of enthusiasm
                        None   Moderate   Marked   Totals
Well-controlled            3          3        0        6
Not well-controlled        3         10       34       47
Totals                     6         13       34       53
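The contrast drawn from Tables 2, 3, and 4 can be put in numbers. The sketch below uses only the column totals printed in those tables; the final hypergeometric calculation is an added illustration, not a computation reported by the authors, and it treats the 53 studies as exchangeable, which is an assumption.

```python
from math import comb

# Share of surgical innovations rated a success (0-success, +, ++).
randomized_success = (4 + 7 + 5) / 36     # Table 2 totals
nonrandomized_success = (0 + 2 + 5) / 11  # Table 3 totals

# Share rated ++ in each kind of trial.
randomized_top = 5 / 36
nonrandomized_top = 5 / 11

# Table 4: share of studies reporting marked enthusiasm.
marked_well_controlled = 0 / 6
marked_not_well_controlled = 34 / 47

# Illustration only: if the 34 "marked" reports were scattered at random over
# the 53 studies, the chance that none of the 6 well-controlled studies would
# draw one is a hypergeometric probability.
p_none_marked = comb(53 - 34, 6) / comb(53, 6)

print(f"success, randomized:     {randomized_success:.2f}")
print(f"success, non-randomized: {nonrandomized_success:.2f}")
print(f"++ share, randomized vs not: {randomized_top:.2f} vs {nonrandomized_top:.2f}")
print(f"marked enthusiasm, well vs not well controlled: "
      f"{marked_well_controlled:.2f} vs {marked_not_well_controlled:.2f}")
print(f"chance of no marked report among the 6 by luck: {p_none_marked:.4f}")
```

Under that simple chance model the absence of marked enthusiasm among the well-controlled studies would be a rare event, which is consistent with the reading that weaker controls go with stronger enthusiasm.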

ECONOMIC SITUATIONS THAT USE EXPERIMENTS Social experiments have merit but they are difficult to do, and may be expensive and, for a variety of reasons, inconclusive. To appreciate the degrees of difficulty and the favorable and unfavorable circumstances for experiments in the socio-economic or policy realm, we review and expand on Alice Rivlin’s (1977) instructive analysis of experiments. She divides policy experiments into four categories and discusses problems of execution and statistical inference. We give these categories names: micro-experiments, macro-experiments, production experiments, and reorganization experiments. Micro-experiments Micro-experiments can come close to a statistician’s ideal. In these experiments, small independent demanders (or suppliers) respond to taxes, subsidies, or income supplements. By independent, we mean that the decision of one demander is not linked to those of other demanders and that each demander, by joining the market, raises the price of the good to all other demanders by a small, almost negligible, amount. For example, households or individuals chosen randomly from a large geographic area would decide how much of a subsidized good, such as physicians’ services, to purchase independently of other households in the experiment. The total increase in the quantity demanded by all the households in the experiment would be small relative to the market so that we can claim that the market demand and supply curves are held nearly constant. That is, the price received by suppliers and paid by non-subsidized buyers is unchanged, and prices for substitutes and complements are unchanged. So the micro-experiment gives us an estimate of the elasticity of demand in the region of the original market price and the subsidy price, holding constant all other prices. Micro-experiments do not always involve subsidies. A tax experiment such as the negative income tax changes the “price” of leisure or income foregone. 
To greatly simplify, a negative income tax guarantees an individual a minimum income, and taxes each dollar earned by a certain percent. With this system, those who earn less than the minimum income receive a “negative tax,” while those who earn more pay a “positive tax.” By changing the marginal tax rate and the minimum income level for a sample of households, the investigator can measure the effect of different negative income taxes on hours worked and earnings while scarcely shifting the labor supply curve.

Theoretically, micro-experiments could test the response of suppliers in an industry, but in many industries of interest the suppliers have large market shares and their actions influence industrial activity. For example, large firms can raise the prices of inputs by expanding production. Unlike micro-experiments involving individual workers (the unorganized labor market) or families, those involving firms are likely to violate the independence assumption.

Three distinctive elements of micro-experiments are: (a) treatment is a price change (tax or subsidy) or an income change (voucher or income supplement) or both; (b) the investigator controls the treatment, such as the magnitude of the price change; and (c) the participants can be randomly chosen and independent.

Macro-experiments

In macro-experiments, the whole market responds to taxes or subsidies. The taxes or subsidies may be placed on either the demand or supply side of the market. To study the magnitude of response in the whole market, the market must be saturated, and the experiment’s participants need not be small or independent. Since the impact of the policy will be large, the market demand or supply curve or both will shift, and the investigator will not have control over the size of the price change.

The subsidy macro-experiment lowers the price paid by all demanders and thereby raises the quantity they demand. This market-wide expansion of demand (shift of the demand curve) raises the market price received by sellers and affects prices for substitutes, complements, and inputs.
So the subsidy macro-experiment increases the quantity that is demanded and sold, raises the market price, and changes prices in other markets. The subsidy macro-experiment gives us an estimate of the elasticity of supply in the region of the original and the new market prices without holding constant other prices. Recall that the subsidy micro-experiment gave an estimate of the elasticity of demand holding constant other prices. To get an estimate of the elasticity of demand without holding constant other prices, the macro-experiment would have to tax or subsidize the supply side of the market. Macro-experiments give a more complete picture of the interaction between demand and supply curves that is lacking when we study the demand or supply curve alone. In addition, macro-experiments affect markets for substitutes, complements, and inputs, and reveal further consequences of policy.
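A linear market makes the contrast between the two kinds of subsidy experiment concrete. The sketch below is a hypothetical example, not a model from the article: the linear demand and supply curves, their coefficients, and the function name `equilibrium` are all assumptions chosen for easy arithmetic, and effects on other markets are ignored.

```python
def equilibrium(a, b, c, d, subsidy=0.0):
    """Market-clearing outcome for a hypothetical linear market.

    Demand: Q = a - b * p_buyer.   Supply: Q = c + d * p_seller.
    A per-unit subsidy to every buyer means p_buyer = p_seller - subsidy.
    Returns (p_seller, p_buyer, quantity).
    """
    p_seller = (a - c + b * subsidy) / (b + d)
    quantity = c + d * p_seller
    return p_seller, p_seller - subsidy, quantity


# Assumed coefficients and subsidy, for illustration only.
a, b, c, d, s = 100.0, 2.0, 10.0, 1.0, 6.0

# Market before any experiment.
p0, _, q0 = equilibrium(a, b, c, d)

# Micro-experiment: a few sampled buyers pay p0 - s while the market price
# stays at p0, so their response is read off the demand curve.
micro_quantity = a - b * (p0 - s)
demand_slope = (micro_quantity - q0) / ((p0 - s) - p0)

# Macro-experiment: every buyer is subsidized, the demand curve shifts, and
# the seller price rises; old and new equilibria both lie on the supply curve.
p_seller, p_buyer, q1 = equilibrium(a, b, c, d, subsidy=s)
supply_slope = (q1 - q0) / (p_seller - p0)

print(f"before: price {p0}, quantity {q0}")
print(f"micro:  buyer price {p0 - s}, demand-curve slope {demand_slope}")
print(f"macro:  seller price {p_seller}, buyer price {p_buyer}, quantity {q1}")
print(f"macro:  supply-curve slope {supply_slope}")
```

In this toy market the micro-experiment recovers the slope of the demand curve, while the saturating subsidy moves the market along the supply curve and passes only part of the subsidy on to buyers.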


Production Experiments Production experiments measure the relation between inputs and outputs for a product or service. In many settings, production experiments can also come close to the statistician’s ideal. For example, an agricultural production experiment on pigs might study alternative diets that are low in cost and chemical additives. Because a great deal is known about appropriate foods to feed pigs, experiments can be developed without much difficulty. The treatments would be various measured combinations of food, and the output would be total weight gained during a specific time period. In other experiments, the inputs and outputs seem harder to standardize. An education production experiment might study alternative programs to increase the vocabulary of fourth graders. Mechanical ways of studying vocabulary, such as flash cards, can be specified carefully, but more creative vocabulary games are harder to standardize. The goal of production experiments is to encourage the discovery and development of new methods of production rather than relying upon techniques currently employed in the market. Reorganization Experiments Reorganization experiments strive to alter market and institutional structures and conditions to deliver different final outcomes. These experiments combine elements of macro-experiments and production experiments and depend on a chain of interrelations. By giving parents education vouchers and the opportunity to choose schools for their children, “schooling” acquires more of the characteristics of a traditional market. Parents who want particular kinds of training for their children can offer their vouchers to schools that respond to these demands. To meet demands for a large variety of specialties, each studied by a few, a traditional school system must be well-endowed. With the voucher system, schools compete for students, and parents may seek schools that teach the desired specialty. 
By changing the ground rules, reorganization experiments alter both the means of production and the final product. Since reorganization experiments rely on a chain of interrelations, the final product is likely to be much more difficult to achieve than the product of a direct input-output process. Reorganization experiments may require much time and substantial investment because creating a new kind of factory or a new kind of university may require a great deal of capital. For all these reasons such experiments are the most difficult to carry out. Problems of Execution and Inference Table 5 summarizes many problems that arise in the four kinds of experiments.


Table 5 Summary of problems of execution and inference of four types of experiments

1) Specification and measurement of treatments and outcomes
Micro-experiments and macro-experiments: Treatments can be carefully specified in the form of tax-subsidy or income supplement schedules. Outcomes in the labor and product market can often be carefully specified, while those in service markets may be difficult to specify. Indivisible products or products with a spectrum of characteristics (such as houses) may also be difficult to measure.
Production experiments: Inputs and outputs can be easy or hard to specify depending on the nature of the area studied and the extent of accumulated knowledge in the area.
Reorganization experiments: These experiments involve all specification and measurement problems of the first three experiment types.

2) Sample selection
Micro-experiments: These experiments may use randomized selection of the participants in the control and treatment groups.
Macro-experiments: These experiments require saturation of a region.
Production experiments: These experiments may use randomized selection of the participants in the control and treatment groups.
Reorganization experiments: These experiments require saturation of a region.

3) Cost
Micro-experiments: These experiments may be low in cost.
Macro-experiments: These experiments tend to be high cost due to the saturation of a region.
Production experiments: These experiments may be low in cost.
Reorganization experiments: These experiments tend to be high cost due to the saturation of a region and to the development of new economic and social organization.

4) Interference from political institutions
Micro-experiments: These experiments may undergo little interference from political institutions.
Macro-experiments: These experiments are likely to undergo interference from political institutions due to the saturation of a region.
Production experiments: These experiments may undergo little interference from political institutions.
Reorganization experiments: These experiments are likely to undergo interference from political institutions due to the saturation of a region and to the conflict between political institutions and social reorganization.

5) Interference from participants
Micro-experiments and macro-experiments: These experiments may undergo little interference from participants because of the direct connection between treatment and outcome.
Production experiments and reorganization experiments: These experiments are likely to undergo interference from participants because of the indirect connection between treatment and outcome.

6) Interference from administrators
Micro-experiments and macro-experiments: These experiments may undergo little interference from administrators.
Production experiments and reorganization experiments: These experiments are likely to undergo interference from administrators when areas such as health and education are studied.

7) Temporary and permanent effects
All four types: In all experiments that run for short periods of time, the results may reflect temporary effects rather than permanent ones. To develop assurance that policy effects are permanent, longer-term experiments must be run.

8) Hawthorne effect
All four types: This confounding effect may be present in all of the experiment types.

By political interference (item 4), we mean both opposition to performing the experiment and modification of the experimental conditions and treatments. Such interference prevents the policy analyst from discovering the effects of the treatments that he or she wants to study, both by restricting the selection of subjects and by forcing tests of treatments different from the ones intended.


Participants in and administrators of experiments (items 5 and 6) also may interfere. Participants can undermine the goals of the experiment by altering their behavior during the experimental period. Because families may have goals that conflict with those of the experimenter, they can be a powerful interfering force (item 5). The interference may be in the best interest of the family, but not of the experiment. When one family member receives an “income supplement,” such as a school lunch, the family may transfer some of the supplement to other family members, perhaps by revising eating schedules and serving sizes. The money saved may be used to buy other things such as clothing and heat. Micro-experiments and macro-experiments seem less prone to the interference of participants because the treatment and the outcome are directly connected.

Administrators can foil experiments by modifying the treatments (item 6). In trying to “help” subjects in the control group, administrators may supplement the control treatment with additional resources. Such interference may be more likely in production and reorganization experiments involving long-range investment in future well-being and opportunities.

The Hawthorne effect (item 8) states that people may behave differently when they know that they are being observed. For example, they may work harder.

Table 5 argues that micro-experiments and production experiments are more feasible to carry out and easier to interpret than macro-experiments and reorganization experiments. The former kinds tend to be lower in cost and involve less interference from political institutions than the latter kinds. Micro-experiments and direct production experiments may also have less interference from participants. These are strong reasons for pursuing micro-experiments and production experiments. But we must not overlook the contribution that macro-experiments and reorganization experiments make to the full assessment of policy.
Though the results of repeated sampling in a market may look promising, the outcome of a new policy in the market as a whole may differ considerably because of interactions within as well as outside the market. Macro-experiments and reorganization experiments provide a more complete picture of the policy even if we must place less confidence in the results because the controls are less effective and because the sample size may be one. Even in the presence of political interference, large-scale experiments may suggest whether policies could be successfully implemented without distortion.

CONCLUSIONS

Policy experimentation is needed because experimentation is a strong tool for assessing changes in policy. Using data from the population as it stands is a dangerous substitute for testing the behavioral response of individuals and groups to policy changes.


Fewer than half of 28 social reforms assessed by experimentation showed improvements, though few seemed to have negative effects. A majority of the social reforms studied cost more than the standard treatment. Experiments that are not well-controlled provide unreliable assessments of innovations. Surgical innovations were much more likely to be rated as improvements over the standard surgery when non-randomized rather than randomized trials were used. Less well-controlled investigations also generated more enthusiasm for the portacaval shunt operation than well-controlled ones. Of the four types of experiments we discuss, micro-experiments and production experiments tend to be more feasible to carry out and to have more reliable interpretations than macro-experiments and reorganization experiments. Nevertheless, macro-experiments and reorganization experiments focus on the outcomes of interactions among individuals and institutions inside and outside the experiments, and so give a fuller view of policies.

REFERENCES

William B. Fairley and Frederick Mosteller, Eds., Statistics and Public Policy. (Reading, Massachusetts: Addison-Wesley Publishing Company, 1977).

John P. Gilbert, Richard J. Light and Frederick Mosteller, “Assessing Social Innovations: An Empirical Base for Policy,” in C.A. Bennett and A.A. Lumsdaine, Eds., Evaluation and Experiment. (New York, New York: Academic Press, Inc., 1975), pp. 39–193. Abbreviated version in Fairley and Mosteller (1977), pp. 185–241.

John P. Gilbert, Bucknam McPeek and Frederick Mosteller, “Progress in Surgery and Anesthesia: Benefits and Risks of Innovative Therapy,” in J.P. Bunker, B.A. Barnes and F. Mosteller, Eds., Costs, Risks, and Benefits of Surgery. (New York, New York: Oxford University Press, 1977), pp. 124–169.

N.D. Grace, H. Muench and T.C. Chalmers, “The Present Status of Shunts for Portal Hypertension in Cirrhosis,” Gastroenterology (Vol. 50, No. 5, May 1966), pp. 684–691.

Alice M. Rivlin, “Allocating Resources for Policy Research: How Can Experiments Be More Useful?” American Economic Review (Vol. 64, No. 2, May 1974), pp. 346–354 and in Fairley and Mosteller (1977), pp. 243–254.

ACKNOWLEDGEMENT

We wish to thank Elizabeth Allred, John Emerson, Kathy Godfrey, David Hoaglin, Marjorie Olson, Anita Parunak, Edward Snyder, and Michael Stoto for their suggestions and contributions. This paper was facilitated in part by National Science Foundation Grant SOC 75-15702 to Harvard University.

Reprinted from The American Statistician (1980), 34, pp. 11–17

29. Classroom and Platform Performance 

Frederick Mosteller

Roger I. Lee Professor, Departments of Statistics and Biostatistics, Harvard University, Cambridge, MA 02138.

By discussing some features of elementary and intermediate courses in applied statistics, I plan to bring out a few ideas that I have found useful in teaching. Although no two teachers are alike, I have often found devices used by others adaptable to my own style, and perhaps some of you will resonate to one or two of my remarks. I shall speak as if to new teachers eager for my advice. Some of these remarks also apply to talks given away from home.

When I think of teaching a class, I think of five main components, not all ordinarily used in one lecture. They are

1. Large-scale application
2. Physical demonstration
3. Small-scale application (specific)
4. Statistical or probabilistic principle
5. Proof or plausibility argument.

1. LARGE-SCALE APPLICATION

Only rarely do mathematics and statistics courses give their students views of large-scale applications of the work they study. One may come away from an algebra or a statistics course with a feeling of having collected many tiny clever ideas that work themselves into a major bundle, but without any feeling that the work may have important applications. Nevertheless, students like to know that their studies do relate to important activities in the practical world. By the time students get to college, they tire of being put off with the advice that “You will see applications of this later.” Some realize that if they do not see the applications soon, they never will, at least not in formal courses.

This paper was facilitated by Grant SOC75-15702 from the National Science Foundation. The author wishes to thank David C. Hoaglin and Anita Parunak for their suggestions, and thousands of students, through 40 years of teaching, for theirs.


Teachers are often reluctant to spend class time on these applications because, of course, the student is not likely ever to be directly involved in such work, and much time is needed to clarify the routine class material that the teacher feels obligated to deliver. We cannot fault this view very much but maybe a little. Until recently, few teachers had a collection of substantial uses of statistics at hand unless they had worked in many projects or had spent a great deal of time collecting such examples and writing them up. Today this is easier because we have the book Statistics: A Guide to the Unknown (Tanur et al. 1978), which has 46 essays (in the 1978 second edition) on important uses of statistical methods in many practical areas. This book was prepared for the Joint Committee on Statistics and Probability of the American Statistical Association and the National Council of Teachers of Mathematics. (The American Statistical Association receives the royalties and uses part of them to support its educational efforts.) These essays have been so popular that the publisher has broken the book into three parts and published paperbacks in the business and economics areas, in the political and social areas, and in the biomedical and health areas (Tanur et al. 1976, 1977a,b). In the first edition, these essays had no accompanying discussion questions, but these have been added owing to the need perceived by Erich Lehmann, the publisher’s editor, and so both the small specialized paperbacks and the second edition of the full work have problem material. My main point is that with these essays it is much easier for the teaching statistician to open a lecture with a few remarks about some real-world problem that uses the specific method to be treated in the day’s lecture. I do not begrudge a little time thus spent. 
In the long run, most of these students will not actually use statistical methods in detail, and what they may get out of the course is the idea of the important or intriguing applications. Furthermore, one never knows how a student is listening. All teachers are, I am sure, astonished again and again, as I have been, with what sticks in a student’s mind about a course.

A second resource of practical problems is the four volumes of Statistics by Example (Mosteller et al. 1973). The fourth is especially oriented to statistical modeling. For content, then, I would say that in many elementary courses it may be possible and profitable to start many of the lectures with some remarks about a problem in which the statistics of the lesson plays a role, even though a minor one.

2. PHYSICAL DEMONSTRATION

Gathering relevant data in class and then analyzing them with the help of the class gets the students involved because we are analyzing their data and they can contribute to the analysis.


One way to obtain a set of data that can be intriguing is to collect for males and for females, separately, the number of brothers and the number of sisters each student has. One has to specify the degree of relationship, or the matter can get very complicated. I stick to brothers and sisters who have the same biological father and mother as the respondent. Usually the men have more sisters than the women have, and the women have more brothers than the men. Sorting out why this is can be instructive. But the data can also be generated by the students themselves. We can look at guessing sequences. For example, one can ask each student to act like a random-number table, except that the student chooses one of the numbers 1, 2, or 3 with replacement after a signal and writes down the number thought of. Ask the students to close their eyes and as you time them (five seconds is time enough between signals to guess one number), write a sequence of numbers, one every five seconds, on a piece of paper. An occasional student will pick a number and always use it, say 3,3,3,3,3,3,3,3,3,3, I suppose, to show that he or she is not a random-number machine—forgetting that the problem was to what extent the machine could be imitated. In a large class, it is instructive to analyze the numbers of successive pairs of digits generated by the students, skipping past, say, the first three marks to get past the start-up phase. Or each student can have a die and be asked to cast it and record its outcome 10 times in sequence, and then the data can be collected and analyzed. You need to make rules about what tosses count or don’t count (how about not falling flat or falling on the floor, etc.). It is preferable if the students keep their own data, or at least a copy of them, to keep them involved. What should be analyzed? The sequence (tendency of numbers to follow one another and thus see if we have independence, or something close to it). 
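As a minimal sketch of the successive-pairs analysis just described, the following Python tallies how often each digit follows each other digit once the first three marks are skipped. The function names and the three sample sequences are invented for illustration and do not come from any class. The chi-square figure is only a rough descriptive index of departure from equal pair counts, because overlapping pairs from the same student are not independent observations.

```python
from collections import Counter
from itertools import product


def pair_counts(sequences, skip=3, symbols=(1, 2, 3)):
    """Count successive (previous, next) pairs across all students' sequences."""
    # Start every possible pair at zero so that absent pairs still appear.
    counts = Counter({pair: 0 for pair in product(symbols, repeat=2)})
    for seq in sequences:
        tail = seq[skip:]                   # drop the start-up phase
        counts.update(zip(tail, tail[1:]))  # overlapping successive pairs
    return counts


def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected counts per pair."""
    total = sum(counts.values())
    expected = total / len(counts)
    return sum((obs - expected) ** 2 / expected for obs in counts.values())


# Hypothetical guesses from three students (illustrative data only).
students = [
    [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    [2, 1, 3, 2, 1, 3, 2, 3, 1, 2],
]
counts = pair_counts(students)
stat = chi_square_uniform(counts)

for (first, second), n in sorted(counts.items()):
    print(first, second, n)
print("chi-square index:", stat)
```

With a real class, the table of nine pair counts is usually more instructive than the single index: human guessers tend to avoid repeating a digit, so the pairs (1,1), (2,2), and (3,3) are often the ones to look at first.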
Another grabber is the distribution of the lengths of runs. This can be recovered from the same data because longer runs are found if we redefine the die tosses as odd and even (rather than the numbers from 1 to 6) so that the die can simulate a 50–50 coin. Many teachers have collections of colored balls with paddles or other devices for drawing the balls out in an essentially random way so that we can observe real-life binomial variation in a somewhat controlled environment. I wish that every class I taught could have a physical demonstration of these sorts that each student can be involved with, and the more mathematical the class, the greater the need. Once in a while, a good physical demonstration goes astray. For example, I had a nice set-up once that depended on some cups having numbered tags in them, so that we could carry out a sampling demonstration. The helpers who were passing through the class got mixed up and did not always report the minus signs on numbers that were drawn, and one of the helpers also did not recover the tags after they were drawn and replace them and mix them up. When a demonstration of this kind begins to go wrong in such ways, I have a firm rule: Abandon it for that day. Any attempt to get everything fixed up and back in order will merely destroy the rest of the hour.


A demonstration has to be worked out in all details before one comes to class. The helpers, if there are to be any, have to be poised and already trained. The students need to know what the instructions are. If you think of possible misunderstandings, do not rely on the fact that no one asks a question after you ask for questions. If no one asks, then say, “All right, then I have a few,” and then go over the instructions by asking questions about them and calling on specific students to reply. You will likely find that some of the instructions have not been understood, especially if there is anything the least bit tricky. It is good to remember that at any time some people are not paying attention, and at other times, no one is. Needless to say, we are not trying to shame the students, we are just making sure that the usual sort of Murphy’s law (anything that can go wrong will go wrong) has a lower rate of application in your classroom. The teacher who uses physical demonstrations must be prepared for them to foul up and also to take such failures in good humor when they occur. Some students get fun out of a teacher’s discomfiture, so be prepared both to be cheerful anyway and not to blame anyone, including yourself. When I speak of foul-ups, I am talking about dropping 1,000 beads on the floor, or someone’s forgetting to write down the numbers in a manner useful for analysis. I am not talking about the possibility that the outcome of the experiment may differ somewhat from theoretical results. Part of the purpose of the investigation is to experience the sort of departure from theory that naturally occurs. We need to remember that Walter Shewhart, in setting up quality-control methods, regarded such distributions as the Poisson, the binomial, and the normal as ideals to be achieved when all the assignable causes had been removed after considerable engineering work and research. 
We cannot expect one-shot classroom experiments to have the kind of perfection that industry can achieve only at great expense and effort. Therefore, when the students try to “cheat” by picking big numbers out of a hat or peeking ahead, and so on, please regard these as aberrations from the ideal that you should call attention to, without rancor, and then look forward to seeing whether their effects may show in the analyses. Today, of course, the interactive computer gives us a chance to do more and better experiments of a stochastic nature than we formerly could. But let us remember that that is all going on in a black box. We still have the question, “Do physical objects behave as they are supposed to according to our ideal theory?” Therefore, some experience with real coins, cards, or dice can be most instructive before the computer takes over. And, indeed, it can be informative to compare the results that are achieved by real objects or real people with the performance of computerized random-number simulations of them. After a small classroom experiment with coins, I have found some of the class interested in Willard H. Longcor’s millions of tosses of dice (Iversen et al. 1971).
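One way to produce the computerized side of such a comparison is sketched below, assuming Python's standard pseudo-random generator as a stand-in for the ideal die. It simulates 10 tosses per student, recodes each toss as odd or even so that the die imitates a 50–50 coin, and tallies the lengths of runs. The function names, class size, and seed are illustrative assumptions.

```python
import random
from collections import Counter
from itertools import groupby


def run_lengths(outcomes):
    """Lengths of maximal runs of identical consecutive outcomes."""
    return [len(list(group)) for _, group in groupby(outcomes)]


def simulate_parity_runs(n_students, n_tosses=10, seed=0):
    """Simulate fair-die tosses, recode as odd/even, and tally run lengths."""
    rng = random.Random(seed)  # fixed seed so the table can be reproduced
    tally = Counter()
    for _ in range(n_students):
        tosses = [rng.randint(1, 6) for _ in range(n_tosses)]
        parity = [t % 2 for t in tosses]  # 1 = odd, 0 = even
        tally.update(run_lengths(parity))
    return tally


tally = simulate_parity_runs(n_students=200)
for length in sorted(tally):
    print(length, tally[length])
```

The same `run_lengths` function can be applied to the odd/even recoding of the students' own recorded tosses, so that the class table and the simulated table can be set side by side.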


3. SMALL–SCALE APPLICATION

In preparing to present a specific technique, we need a concrete real-life problem that uses the technique. This is not quite the same as the large-scale application in which the technique is mentioned under Section 1; instead, it is the specific use of the technique, often complete with numbers. Ideally, the material is available on a handout, which the student either has from the beginning of the semester in a syllabus or workbook, or it may be handed out on the day of the lesson. Fresh handouts revive the student, if their number is not overwhelming. This concrete example makes sure that the student understands what the specific technique is supposed to accomplish and that the technique has a practical use. These applications are best drawn from the areas of the students’ interests, because students like to learn their own subject matter at the same time that they are learning methods and techniques.

THE LESSON:
4. STATISTICAL OR PROBABILISTIC PRINCIPLE
5. PROOF OR PLAUSIBILITY ARGUMENT

You may be thinking, perhaps, “Between the big illustration, the small illustration, and the demonstration, we are never going to get to the lesson,” but here we are with it. I have very little to say about it today except that a friend of mine, Robert E.K. Rourke, once taught me an important idea for presenting new material. He calls it PGP. These letters stand for Particular General Particular. In our lesson strategy, the small illustration (Section 3) is the first particular. It motivates the lesson and clarifies the technique. The general is the general treatment of the technique that you present. Let us say it is a treatment of the unbiased estimate of the variance of a distribution based on a random sample. We will display the general formula and, depending on the class, prove unbiasedness, or discuss and motivate the N − 1 in the denominator, where N is the sample size, paying special attention, say, to samples of size 1 and 2.
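For a class that prefers arithmetic to proof, the role of the N − 1 can be checked by brute force. The sketch below is an illustration under stated assumptions rather than anything taken from the lecture: the population is the six faces of a fair die, samples are drawn with replacement, and the function names are invented. It averages the estimate over every equally likely sample, once with divisor N − 1 and once with divisor N, using exact fractions.

```python
from fractions import Fraction
from itertools import product


def mean(values):
    """Exact arithmetic mean of a collection of Fractions."""
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


def average_estimate(population, n, divisor):
    """Average of sum((x - xbar)^2) / divisor over all equally likely
    ordered samples of size n drawn with replacement."""
    estimates = []
    for sample in product(population, repeat=n):
        xbar = mean(sample)
        ss = sum((x - xbar) ** 2 for x in sample)
        estimates.append(ss / divisor)
    return mean(estimates)


# Assumed population: the faces of a fair die.
population = [Fraction(v) for v in (1, 2, 3, 4, 5, 6)]
mu = mean(population)
sigma2 = mean((x - mu) ** 2 for x in population)  # population variance, 35/12

for n in (2, 3):
    with_n_minus_1 = average_estimate(population, n, n - 1)
    with_n = average_estimate(population, n, n)
    print(n, with_n_minus_1, with_n)
```

Dividing by N − 1 reproduces the population variance 35/12 exactly for both sample sizes, while dividing by N gives (N − 1)/N times that value. For a sample of size 1 the sum of squares is always 0 and N − 1 is 0, so the estimate is undefined, which is one way to open the discussion of very small samples.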
At any rate, when this is completed, we will have handled the general part of the discussion. Now Rourke’s idea is to drive the whole thing home with a second particular, that is, with a new and further use of the method. The second particular could be a new example, similar to the old one, followed by a slightly different one, perhaps with some extra fillip. For example, if we have been looking at the variability of single measurements, we might apply the same technique to the variability of the sum of two independent measurements to see how the results compare with those for single observations. In a mathematically oriented class, I do not regard it as satisfactory merely to prove a theorem and apply it. I like to spend some time, whenever I can,


seeing if we, the teacher and the class, can make the whole thing seem reasonable as well, often qualitatively. In the problem of variability just mentioned, we ought to think about whether the variability of a sum is going to be larger or smaller than that of the single measurements, in intuitive terms as well as formal mathematical ones. If I have plenty of time during the lesson I like to work in a discussion of the law of sums and the law of averages simultaneously here. At any rate, such material would complete the PGP principle for the lesson. All told, then, when I prepare a lesson, I think of five main kinds of content in a statistics or probability lesson, the five previously mentioned. I rarely find it possible to get all five in, I am sorry to say, but I can usually do four. To do this requires a good deal of planning and sometimes timing. Let me turn to a few remarks about lesson preparation. Lesson Preparation. When trying for a good demonstration, either a physical one or a mathematical one, the teacher needs a clear notion of how long it is going to take and also needs to know what language to use. In some demonstrations, certain sentences, when developed just right, are very easy to understand and they carry the message. But the same ideas, when started out a little wrong, never seem to come to an end, except possibly the dreaded “You know what I mean even if I can’t say it.” If I wind up there, I can then be sure that students don’t know what I mean. Thus, memorizing starts of a few key sentences in a lecture can make smooth going. Definitions and specific explanations often need this special treatment. I find that when I am going to give a new demonstration, I want to sit down and tell it silently to myself because I am very comfortable and cozy that way, and I have my notes, and I don’t have to say it out loud. Unfortunately, this sort of comfort is almost sure to lead to trouble later. 
What the teacher needs to do is to lay down the notes, get chalk, write down the time, and begin the demonstration out loud, writing on the board as planned. This brings out the embarrassing tough spots. When the rehearser doesn’t know what comes next, long silences or much “ah-ahing” goes on while the clock ticks on. Furthermore, there is not enough board space in the world to put down all that has been planned. And when the rehearser comes to the end of a “five minute” demonstration and finds that half an hour has gone by, he or she is likely to believe that in the class it will all go much more quickly. It won’t. It will go more slowly because someone will ask a tough question and derail the instructor right at a key point. We need practice, and we need to figure out just what equations are to appear on the board and what example is to be used. We must time the example as well, especially if it has any calculation, algebraic or arithmetic. Again, however clever we are, it is well to have the example completely worked out and to know exactly how it goes. Even then, once in a while we may find that the results found in class are not identical with what was worked out at home.


I realize that some people can use what I call the stumblebum technique effectively. They start out on an example, and seem to make mistakes and get the students caught up in the demonstration helping them. I once admired an elementary school teacher who had 50 mathematicians vying to correct him as he kept apologizing for not knowing as much as they did. I suddenly realized that he was egging them on the same way he did the children. It was great. I’ve used his actual lecture to teach the idea of the distribution of the sum of two random variables, but not the stumblebum technique. This is a most effective device, but I believe that it can be used by few teachers and by these only when they have complete command of the material and total self-confidence. I would not recommend memorizing talks, although that is what some teachers do, but it is well to be so familiar with a talk that one can concentrate on how to say it rather than on what it is that will be said. I would like to summarize this point by saying that rehearsals are extremely helpful, and rehearsals with timing very instructive. Rehearsals are, I think, the single best way of improving one’s lecture work. Backup in the form of notes in big, black block letters can also be most helpful. Length. When one finds that the talk is too long, one’s first impulse is to plan to speak faster. This is fatal. The basic rule should be that when a lecture or a talk is too long, we have to abandon whole chunks: a demonstration, a proof, an example, and so on. The lesson plan would ideally have an arrangement something like a news story that is designed still to be informative when cut off at the close of any paragraph. The intent is to do thoroughly what is done. 
As George Miller says, “The job is not to cover material but to uncover it.” Thus, it is good to have a lesson in which the main body of the material is delivered by about two-thirds the way through the class period, and the rest of the lesson plan amplifies through examples and special cases or more esoteric ideas, as far as time is available. I think it well to recognize that students remember very little of what is said in the last couple of minutes before the bell rings, and I doubt if anything remains of what is said after the bell rings, unless possibly it is an example said to be on the final examination. Shortness. A related matter does occasionally arise when a teacher fears not having enough material to fill up the time. Then there is likely to be an opening with anecdotes and filler material before getting into the heart of the material. Inevitably, toward the end of the lecture, the teacher apologizes by saying that there is a lot of interesting material that there is not time to tell. Visiting speakers to a class are especially prone to this sort of thing. The better students become very restless. If I think there isn’t enough material for the lesson, and I have no way to get more prepared in time, I start right in with the best material as if there was not enough time, and deliver it at the regular rate, and then take advantage of the leftover time for discussion. Usually some difficulty arises that uses the time profitably.


Handwriting. In spite of the 20th-century visual aids, blackboard and chalk are still a mainstay for most of us. I see nothing wrong with this. Some people, like me, have rather poor handwriting, and to them I say just one thing, write large. This helps a great deal. Then there are those who write beautifully and small, and very lightly and swiftly, so that students can’t keep up. I assume that there is a special part of the Inferno waiting for them, but I cannot imagine a sufficiently severe punishment. For those who stand in front of what they have written and block it, no doubt the afterlife holds still worse. Projectors. The overhead projector has brought us a long way. I have recently been subjected, however, to a great many lectures given by people using transparencies. There are a few features of working with the projector that I would like to call to your attention. Figure 1 is a typical transparency. (The speaker displayed a transparency produced by typing from a manuscript, single spaced, at full 8½-inch width after reduction, 62 lines long.) It has many advantages: it has everything on it, all the formulas, all the lecture, and nobody can read it, except possibly the lecturer, who then bores the audience to tears by doing so.

A Typical Transparency (Mosteller, F., and Tukey, J.W. (1977), Data Analysis and Regression. Reading, Mass.: Addison-Wesley.)

I think the rule most often broken in using the overhead projector is that material for it should be written large and with few lines to the page. Using several transparencies instead of one is all right. It may sometimes be desirable to have two projectors going instead of one.


These machines are easy to use, and the overlays can be easily and quickly prepared even by an amateur. Time and trouble are repaid for good-looking overlays, though. This projector is a great supplement for a handout. One can write with a grease pencil or a special pen on the overhead projector’s overlay instead of on a blackboard, but it takes practice. The speaker may not realize how much small movements are magnified by the projector. When pointing to something with a pencil, rest the tip on the plastic to stop the quivering, or lay it down. Don’t point with fingers, and don’t wave hands over the transparency. All of these sudden movements are hard on the viewers’ eyes and make some viewers seasick. Aside from crowding, sudden scrubbing and waving movements are the most frequently employed poor form in using transparencies. Finally, when finished with a transparency, talk about something else for a bit before removing it. Sometimes one gives a lecture in a special place or plans to use special equipment for the day—a special movie projector, an unusual slide projector, or other equipment. Unless you have actually tried this equipment out, are sure that it will be available when the day comes, and are sure that you can run it or that you will surely have a competent helper, you must be prepared for evasive action when this equipment does not work out. I once had a special short film that had to be run on an unusual machine. I got the machine and a young man from our audiovisual-aids department, and we spent a couple of hours learning how to run the machine so that it produced exactly the effect desired. I was very pleased. Next morning when the class was to start at 9:00 a.m., the helper did not appear, but rather late someone else appeared, explaining that the other chap had a class and couldn’t come. The new fellow couldn’t thread the machine in the first hour. 
One must be prepared for such a disaster and plan to abandon the show and replace it instantly with other relevant material. One can’t afford to waste the class’s time. Questions to the Audience. I have intimated and emphasized that it is good to have the students involved. For this reason, even in large lectures, I feel that it is good to have questions for the audience. I might mention that I have a friend who has a trick that I haven’t tried yet but that I would like to try. He distributes to his students cubes differently colored on several sides. Then he asks the students multiple-choice questions and tells them to show the red side if they prefer one answer, the yellow side for another, the green side for a third, and so on. This device seems to keep large classes alert, and I pass the idea along for what it is worth. The direct participation of the class has the advantage of keeping all its members involved with the material. The rest of the paper is oriented more to lectures given on trips to unfamiliar places. The Handout. People like to take something away from a talk. If you have a handout, you have already made sure that they will. You have also taken out insurance against a major omission in your presentation. You have prepared


for amplification in case of questions. You have put the listeners in the position of being able to go at a slightly different pace than you do, and they can go back and check the definitions and conditions in a leisurely manner. You have also shown that you cared enough about the audience to take the trouble to prepare. Try to find out about how many handouts you will need and take plenty. Do not depend on the mail. Be sure your handout has a title, a date, your name and affiliation, reference to foundation or other support, if any, and numbered pages. Some speakers include a reference to the place where the talk is being given. Beyond this, a lot depends on the topic, but key definitions, theorems, proofs, formulas, references, tables, figures, and results are examples of valuable things for the reader to carry away and for you to lean on in your presentation. On the one hand, you need not give out the whole paper; on the other hand, you may wish to distribute more than you can present. Variety. In keeping people involved, the keynote is variety. One source of variety is the blackboard, colored chalk, the examples, the lecture, moving around, the projector, the physical demonstration, the questions and answers, and the handout. The changes in method of presentation keep bringing the audience back to the talk. Try not to use just one method of presentation, even if you can. Another source of variety is content, but that is another story. Hazards. In giving speeches away from home, be prepared for the possibility of poor facilities. Lack of chalk is a common problem, no eraser is another. In some restaurants and hotels, the chalk won’t write on the board. If you carry a few pieces of soft chalk to such talks, you may be pleased. If you need colored chalk, carry your own, but be sure to try it out first. Only rather bright-colored chalk shows up. Those beautiful dark purples and blues are useless.
A lecturer with a handout is ready when told “We had no idea this place had no blackboard; they hold conferences here all the time.”


Before a talk, go to a bit of extra trouble. Find out where it is to be held. Go there and try everything out. See if you can turn the lights on and off. If there is a projector, be sure you can plug it in and that it works; focus it. In the case of an overhead projector encourage the persons who provide it to provide an extra bulb in case the bulb fails. If it does, don’t burn your hand by touching it directly. If there is audio equipment, try it out. Find out how far you should be from the mike to make the right sort of noise. Have some water nearby in case your throat goes dry. A sip clears up lots of voice troubles for a speaker. You may feel embarrassed about all these preparations, and anyone testing a mike always feels silly. But you’ll look even sillier fumbling around during the talk. If you haven’t enough handouts, get the ushers to distribute one to every second or third person. If you run out, offer to send material to those who give their names and addresses. One can be surprised when a lecture for 30 turns into 300 because some teachers are attracted by the title and bring their classes at the last minute. Don’t distribute the handouts yourself. The speaker has other things to do. Strange things happen. Be ready for them.

1. The room where the talk is to be given is occupied. Relax. Let the local people work it out. Introduce yourself to people waiting around and chat with them. Don’t complain.
2. The room isn’t big enough! Already you are a great success. Don’t apologize for the hosts. You can say how happy you are to see such a large audience. Don’t apologize for late starting because of local arrangements. Only apologize for lateness if it is your fault.
3. Only two or three people show up. Frequently there will then be a dither about whether the speech should be given since there are so few. Indicate that you would like to give the speech and give it, perhaps a little more intimately. Don’t talk about the smallness of the audience. I have made some long-term acquaintances this way, for example, on one evening during a 14-inch snowstorm.

However carefully one prepares, time problems can arise not of one’s own making. You were promised 45 minutes, but the previous speaker ran over. Everyone thought lunch was at 1:00 p.m., but the caterer says 12:30 p.m. or not at all, and now you have 15 minutes including the question period. What do you do? First, don’t complain. Second, the Gettysburg Address didn’t take 15 minutes. If you have prepared to dump, as prescribed before, you know exactly what the most important 15 minutes of your talk is. It is 12 minutes long, plus 3 minutes for questions. Don’t rush. Carefully tell what is important and what the implications are, and then stop for questions. Whatever you do, don’t talk about not having much time. That just upsets the audience, and they start paying attention to how you are managing the talk instead of what you are saying. Finally, don’t feel insulted when things


like this happen. They happen to everyone everywhere. It is too bad, but just step in and perform. People will be pleased with a nice short speech. I believe that Paul Halmos, a very great lecturer, noted that in a lifetime of giving and attending mathematics lectures he had never heard complaints about a seminar ending early. The classical difficult situation occurs when you are the final discussant and the introducer says, “After the next speaker, we will turn to cocktails, which I am informed are already in the hall.” I think you would do well to restrain yourself to about 7 minutes in such circumstances. Conclusion. I’m sure that there is no one good way to teach, but that there are many, and, unfortunately, still many more ways to do it badly. What a speaker hopes to accomplish in a talk like this is to distribute a few ideas. I once complained about a cookbook to a friend who is a master chef. He said that if one gets one good recipe out of a cookbook, that is the most that one can expect. I hope you have found one thing in my discussion that you can adapt to your teaching. And I shall be most interested if, in the question period, you comment on ways you find useful to think about a lesson.

REFERENCES

Iversen, Gudmund R., Longcor, Willard H., Mosteller, Frederick, Gilbert, John P., and Youtz, Cleo (1971), “Bias and Runs in Dice Throwing and Recording: A Few Million Throws,” Psychometrika, 36, 1–19.
Mosteller, Frederick, Kruskal, William H., Link, Richard F., Pieters, Richard S., and Rising, Gerald R. (eds.) (1973), Statistics by Example: Detecting Patterns, Menlo Park, Calif.: Addison-Wesley.
— (1973), Statistics by Example: Exploring Data, Menlo Park, Calif.: Addison-Wesley.
— (1973), Statistics by Example: Finding Models, Menlo Park, Calif.: Addison-Wesley.
— (1973), Statistics by Example: Weighing Chances, Menlo Park, Calif.: Addison-Wesley.
Tanur, Judith M., Mosteller, Frederick, Kruskal, William H., Link, Richard F., Pieters, Richard S., Rising, Gerald R., and Lehmann, Erich L. (eds.) (1976), Statistics: A Guide to Business and Economics, San Francisco: Holden-Day.
— (1977a), Statistics: A Guide to Political and Social Issues, San Francisco: Holden-Day.
— (1977b), Statistics: A Guide to the Biological and Health Sciences, San Francisco: Holden-Day.
— (1978), Statistics: A Guide to the Unknown (2nd ed.), San Francisco: Holden-Day.
Zelinka, Martha, with the assistance of Michael Sutherland (1973a), Teachers’ Commentary and Solutions Manual for Statistics by Example: Exploring Data, Menlo Park, Calif.: Addison-Wesley.


— (1973b), Teachers’ Commentary and Solutions Manual for Statistics by Example: Finding Models, Menlo Park, Calif.: Addison-Wesley.
Zelinka, Martha, with the assistance of Sanford Weisberg (1973a), Teachers’ Commentary and Solutions Manual for Statistics by Example: Detecting Patterns, Menlo Park, Calif.: Addison-Wesley.
— (1973b), Teachers’ Commentary and Solutions Manual for Statistics by Example: Weighing Chances, Menlo Park, Calif.: Addison-Wesley.

Reprinted from New England Journal of Medicine (1980), 302, pp. 630–631

30. The Clinician’s Responsibility for Helping to Improve the Treatment of Tomorrow’s Patients

Bucknam McPeek¹, John P. Gilbert¹, and Frederick Mosteller²

¹Massachusetts General Hospital and ²Harvard School of Public Health

In addition to using his or her knowledge and skills for treating today’s patient, the clinician has a responsibility to strengthen medical knowledge and upgrade the quality of care that will be available in the future. The ways in which clinicians can make such a contribution include devising clinically relevant descriptive scales, avoiding pitfalls, advising patients eligible to participate in trials, and supporting efforts to improve the testing of new therapies.

The physician knows that applying a therapeutic procedure in practice is vastly different from testing it in the carefully regulated conditions of the laboratory. The controlled trial provides a middle ground between these extremes and serves to test whether the ideas developed in the laboratory work in patients. The active participation of clinicians in the design and execution of trials secures their relevance to everyday practice.

The clinician can contribute to the analysis of therapeutic experience by developing quantitative scales for describing patients’ conditions. It is difficult to compare two treatments unless we agree on what constitutes a better outcome. Some examples of scales already in common use are the New York Heart Association Index, the Apgar score, the Iowa hip score, the Visick score for ulcers, and the American Society of Anesthesiologists’ physical status classification. The more such a scoring scheme depends on a few easily observed characteristics, the more reliable it will tend to be (i.e., different observers will tend to give the same score to a given patient).

Clinicians can also use their experience to help investigations avoid pitfalls such as the regression effect. For example, when the patient’s condition naturally fluctuates, with better periods followed by poorer periods followed by better periods, and so on, it is easy to be misled. The patient is feeling worse and comes to see the physician. Drug X is prescribed and soon the patient is feeling better.
(But, as Lewis Thomas says, “most things are better in the morning.”) In many cases, the patient will be less careful about continuing the medication at this point and shortly thereafter feels worse. (Thomas forgot to say that many things are worse later.) Thus, we see periods of therapy followed by improvement and periods of poor compliance followed by relapse.

The regression effect is only one of the factors that make it difficult to assess a treatment. Another problem familiar to clinicians is the strong placebo effect that a patient derives from the belief that his or her disease is being treated. Beecher has shown that about 35 per cent of patients improve on placebo over a wide range of ailments including angina, headache, postoperative wound pain, and seasickness. These examples illustrate how easily one can be misled by unstructured observational studies, and they point out the need for clinicians to encourage more carefully controlled evaluations.

Patients contribute to research as well as benefit from it. Past patients, through their suffering and participation in studies, have contributed to the treatments now in use. Their contributions should be honored, when possible, by our participating in the same system of medical research. The inferences made from these trials apply primarily to the population studied. To the extent that individuals or groups decline to participate (or are not given an opportunity to participate), and to the extent that their responses may differ from those of the rest of the population, the treatment that seems best may not apply to them as well as to the participants and the people similar to the participants. If special groups, such as children and the well-to-do, tend not to be included in the system, then they will find medical evaluations of therapies not as well pointed to their needs.

The clinician is often in the position of advising patients who have been approached or who are eligible to participate in clinical trials. The clinician can take the responsibility for explaining the medical issues involved in the trial and relate these to the circumstances of the patient so that a reasonable and responsible decision can be made.
The patient who is upset by the uncertainty of participating in a double-blind study should probably be advised not to participate even if the differences in treatments are innocuous. On the other hand, many patients will derive a great deal of satisfaction from being able to make a contribution to medicine. Physicians are in a position to evaluate the importance of the problem under investigation and the reputation of the investigators, and to provide each patient with a concerned but dispassionate viewpoint.

The clinician can serve both to encourage effective trials and to protect patients from overzealous investigations. An ineffective treatment, a drug that doesn’t work, not only is unsatisfactory for the patients involved, but also entails resources stolen from the system, an expense that we can ill afford as health care comes under increasing financial pressures.

References

1. Beecher HK. Measurement of subjective responses: quantitative effects of drugs. New York: Oxford University Press, 1959: 66–7.
2. Gilbert JP, McPeek B, and Mosteller F. Statistics and ethics in surgery and anesthesia. Science. 1977; 198:684–9.

Reprinted from Science (1981), 211, pp. 881–886

31. Innovation and Evaluation

Frederick Mosteller

Frederick Mosteller is Roger I. Lee Professor, Department of Biostatistics, School of Public Health, Harvard University, Boston, Massachusetts 02115. He has just retired as president of AAAS. This article is the text of his address at the annual meeting of the AAAS in Toronto, Ontario, Canada, 6 January 1981.

Summary. Social, medical, and technological innovations are discussed, first with reference to historical examples and then with modern studies. I show the need for evaluating both the innovations themselves and the research processes leading to them. I suggest some kinds of research that need to be carried out if we are to continue to have a vigorous program of scientific and technological innovation. Finally, I explain the new initiative by the AAAS in science and engineering education.

My topic here is innovation and evaluation. I begin with an early experiment in nutrition. It was designed by Daniel of the Lions’ Den, but for humans rather than lions. Daniel was held hostage in Nebuchadnezzar’s court and, possibly for religious reasons, disapproved of the rich food, wine, and meat served there. The eunuch in charge feared for his own head if he were to give Daniel and his three friends merely the simple Judean vegetable fare called pulse (such as peas and beans). Daniel asked for a 10-day trial and promised to turn to the court’s diet if the Judean hostages weren’t then as healthy as the others. To turn to a translation of the original article, Daniel 1:12-15 (1): Prove thy servants, I beseech thee, ten days; and let them give us pulse to eat, and water to drink. Then let our countenances be looked upon before thee, and the countenance of the children that eat of the portion of the king’s meat: and as thou seest, deal with thy servants. So he [the eunuch] consented to them in this matter, and proved them ten days. And at the end of ten days, their countenances appeared fairer and fatter in flesh than all the children which did eat the portion of the king’s meat.

Had this study been submitted as a report to Science, the reviewer might make the following remarks. First, there is no sampling problem because Daniel needed only to prove that he and his three friends were better off with the diet. He did not have to make the generalization to, say, the entire population of Judea or the human race. This is unusual because ordinarily we are trying to make such generalizations. For Daniel it was fortunate
as well, because with such a small sample—Daniel, Shadrach, Meshach, and Abednego—the eunuch would have had to insist on using Student’s t-test, and this would not be invented for another 2500 years, almost exactly. Second, the length of the trial, 10 days, seems short for a nutrition experiment. Third, the end point “fairer and fatter in flesh” seems not well defined. Other translations speak of “sleeker” which also is vague. From the eunuch’s point of view, the diet of pulse was an innovation, while the court’s regular diet was the standard. And so Daniel designed a comparative experiment, an early evaluation of an innovation. I turn to a historical, but more policy-oriented example: Another nutrition experiment was carried out by James Lancaster starting in 1601 when the East India Company sent its first expedition to India. He was general of four ships and a victualler (2). They sailed from Torbay in England in April 1601. At that time scurvy was the greatest killer of the navy and of expeditions and explorations, worse than accidents or warfare or all other causes of death together. More than half a crew might die of scurvy on a long voyage. In 1497 Vasco da Gama sailed around the Cape of Good Hope with a crew of 160 men: 100 died of scurvy (3). Lancaster served three teaspoons of lemon juice every day to the sailors on the largest ship of his fleet and few became ill. By the time the fleet got to the Cape of Good Hope, so many sailors on the three smaller ships were sick from scurvy that Lancaster had to send sailors from the large ship to rig the smaller ones. When they reached the Cape of Good Hope 110 men had died, mostly from the 278 men who started on the three smaller ships. Clear evidence that lemon juice prevents scurvy? Maybe. At any rate, the evidence is so strong that the East India Company and the British Navy could surely be expected to follow up this investigation with further research. Not at all! Policy moves more majestically. 
About 150 years later, 1747 to be precise, the physician James Lind (4) carried out an experiment consisting of adding something special to the diets of scurvy patients on the ship Salisbury. He had six dietary additions: 1) Six spoonfuls of vinegar. 2) Half-pint of sea water. 3) Quart of cider. 4) Seventy-five drops of vitriol elixir. 5) Two oranges and one lemon. 6) Nutmeg. Lind assigned two sailors ill from scurvy to each treatment. Those who got the citrus fruit were cured in a few days and were able to help nurse the other patients. The supply of citrus fruit ran out in about 6 days. Lind knew about Lancaster’s work as well. With this dramatic and crucial experiment plus the backup of Lancaster’s earlier voyage surely the British Navy now will adopt citrus fruit for prevention of scurvy from long sea voyages? No! Forty-eight years later policy caught up. In 1795 the British Navy

began using citrus juice on a regular basis and wiped out scurvy in the service (5). The British Board of Trade followed suit after a delay of only 70 years (1865) and wiped it out in the mercantile marine (5). We often talk about how slow we are to make use of innovations, but this case study of citrus juice should give us a little encouragement. Today we are worrying about 20-year lags. Here is one of 264 years.

Evaluation of Today’s Social Programs

We need both to have a larger number of innovations and to be sure that they are beneficial. This requires both inventiveness and evaluation. Let me demonstrate first with social programs. To see whether social programs that had been evaluated carefully were successful, Gilbert, Light, and I (6) reviewed a substantial number of programs. Each had been evaluated by a randomized controlled experiment. I mention only a few, to give you a feeling for their variety. After we studied the evaluation of a program, we scored the program on a scale running from a double plus down to a double minus, with zero meaning there was essentially no gain from the program, a double plus meaning that the program was an excellent innovation, and a double minus that it was much worse than the treatment that it replaced. Our ratings did not include costs of the program; had they done so, they would probably have had to be reduced somewhat. The studies were classified into social, sociomedical, and medical innovations.

Four of the eight social innovations we studied were: negative income tax, studies of bail, training of police, and attempts to reduce delinquency among girls. Let me describe one. The study of delinquent girls was intended to reduce the amount of juvenile delinquency by instituting a social program. It had two steps. First, the investigators needed to identify the potentially delinquent girls, and second to apply the program to them and so prevent their delinquency. What happened in the experiment?
First, the innovators were very successful in identifying those girls who were likely to become delinquents. Second, they had no success at all in diverting the young girls from their course. Thus we assigned this innovation a zero. Although it is worth something to be able to identify potential delinquents, and this feature would be useful in future studies, the purpose of the innovation was to reduce delinquency. Since it did not do this, it was rated a zero. It did not increase the delinquency rate either, and so it did not get either a minus or a double minus. We also studied eight sociomedical innovations, of which four were: experiments on probation for conviction for public drunkenness, effects of broadening health insurance benefits, training mothers whose children have tonsillectomies, and training physicians in comprehensive medical care. Let me briefly describe the probation experiment. The judge assigned these habitual offenders to one of three treatments in a randomized manner. Such

offenders with two arrests in the previous 3 months or three in the previous year were fined $25, given a 30-day suspended sentence, and assigned to one of three groups:
1) No treatment.
2) An alcoholic clinic.
3) Alcoholics Anonymous.
The payoff variables were number of rearrests and time before first rearrest. The results were that the “No treatment” group performed somewhat better than the other two groups, each of which performed equally well. The original authors, Ditman et al. (7), concluded that the study gave no support to the policy of short-term referrals. Thus we scored the innovation a zero. It might possibly have been scored a minus.

In this same group was the Kansas-Blue Cross experiment. It had been suggested that one reason for excessive use of hospitalization and consequently of the rising costs of medical care was that the insurers would pay only for work done in the hospital, whereas some work could be taken care of more cheaply with outpatient care. The insurance company responded to this suggestion with a substantial randomized experiment. They put 5000 people into a group that had added benefits of ambulatory care, free of charge, in addition to the regular hospitalization, and compared their results over a year with those of 10,000 patients on the regular program. The results came out contrary to expectations. The amount of hospitalization for the group with extra ambulatory benefits went up by 16 percent, while that for the group with regular benefits increased by only 3 percent. Thus the overall effect went in the opposite direction to that hoped for. There was more information. For example, there were 15 percent fewer short-term stays in the extra-benefit group. But the decrease was more than offset by the added longer stays. This innovation received a minus because the results went in the direction opposite to that hoped for.
It is, of course, possible that there may be a benefit overall in this approach because of finding things wrong early that need attention and thus preventing later health problems. That would take a substantial further investigation to establish. We studied 12 medical innovations, including the following four: the Salk vaccine experiment for polio prevention, treatment of cancer of the bronchus, vagotomy, and gastric freezing for ulcer. The Salk vaccine was a major success and has nearly stamped out paralytic polio. In one investigation, children were allocated to two groups, those injected with the vaccine and those injected with a saline solution. The vaccine was highly effective, and we rated it a double plus. The treatment for cancer of the bronchus presented a difficult rating problem because on one hand there was a substantial improvement in survival, but on the other we did not have clear evidence about the quality of the lengthened life. The comparison was between surgery and radiation therapy. Patients receiving the radiotherapy survived almost 50 percent longer—up

from about 200 days to nearly 300, so we gave the treatment a double plus. But a plus might have been appropriate because further information about quality of life might have changed our views.

These few examples illustrate some experiments and reforms and their evaluation. How did it all come out? Out of 28 innovations, 12 were positive, and of these 6 got double pluses, 3 were negative, and 13 were rated zero. Thus less than half of well-tested innovations we discovered in the literature were beneficial even if costs were neglected. This suggests strongly that social or medical innovations do need to be evaluated.

You may be concerned that our sample of innovations was rather catch-as-catch-can. We were troubled about this too, and therefore did a further study. We evaluated surgical innovations (8). By using the MEDLARS search system, a computer-based bibliography of medical literature, we obtained a population of surgical studies that was selected in an objective way. We chose randomized clinical trials. In all, we found for the period under study, 1964 to 1973, a total of 36 trials comparing a surgical innovation against a standard treatment. Among these, 44 percent were regarded as successful. Of these, 11 percent were not improvements over the standard, but they were the equal of it and offered new approaches that might be preferred in special circumstances. The actual improvements were about 33 percent, of which the excellent ones comprised 14 percent. Thus we find again that when innovations are put to trial, they are successes only about half the time, and that substantial improvements are relatively rare, about one in seven.

Ethics and Self-Interest

This gives us an extra piece of information. One reason people often give for not using randomized clinical trials is that they are unethical. That is, one should not give a patient an inferior treatment.
The information obtained from the controlled studies of surgery and anesthesia that we reviewed showed that the physician does not know which way a trial will come out. This goes far toward resolving the ethical issue.

In a period when the population is tightening up its attitude about participation in experiments, in sample surveys, and generally in information-producing activities, we need also to think about self-interest. Let us focus on the medical situation. Sometimes participation in a trial may directly help the patient. The patient may be lucky and get the preferable therapy, or the treatment may be reversible so that after the trial, patients who had the less useful treatment can be given the better one. Nevertheless, sometimes the outcome of the trial may be of little benefit to the individual or his or her relatives or friends.

In spite of this, the patient may still wish to participate in the trial. If we recognize the trial as part of a general system of trials in which patients participate only when they qualify and when we require a trial to find the better treatment, we see that the patient may benefit not from the particular trial but from the system of trials. Findings from other trials will help the patient, or relatives or friends. We should have our eye on the pooled benefit of the whole system. The longer the patient lives, the more likely he or she is to suffer from some of the diseases that we learn about through this system of trials. The patient will then be the beneficiary of this information. If trials are not made, then the information will be slow in coming in, if it comes at all. Thus the patient has a stake in the whole system, not just in a particular trial. The inferences derived from trials apply to the populations who participate in them. To the extent that individuals decline to participate, and to the extent that their responses may differ from those of others, the treatments may not apply as well to them and people “like” them. If special groups either deliberately fail to participate or if they are barred from participation, then the trials cannot be expected to apply as well to them as to the groups who do participate. It is hard to say just what “people like me” means, and a good solution is to have volunteers from the whole appropriate population. If participation seems to be a sacrifice, others are making similar sacrifices in aid of “my” future illnesses, and the whole system is being upgraded for “my” benefit. Thus a special sort of statistical morality and exchange needs appreciation (9). Measuring benefits. In studying costs, risks, and benefits of surgery, we found that measuring benefits was our weakest point. Survival is the most-used measure of benefit. 
But much surgery, maybe most of it, is intended not as lifesaving but for improving quality of life. This means that we need to assess quality of life (convenience and comfort) to find out how much we are improving matters. Before we try this in social programs generally, we would do well to develop our methods in an area like surgery. My colleagues and I have been trying this out on an exploratory basis using a brief questionnaire, and we are much encouraged. Safe surgery dilemma. Information about economics, about outcome, and agreed-on ethics cannot entirely determine social policy. What I call the safe surgery dilemma illustrates this (10). Consider a safe surgical operation like that for appendicitis. If physicians operated whenever they saw even slight signs and symptoms, the total lives lost might be minimized. But if they operated only when the symptoms were severe, this would minimize the total number of days spent by patients recuperating. None of us want to be operated on needlessly, nor do we want to die because we have shown only mild symptoms. What should the policy be? To save the last life may require a million extra operations or hundreds of lifetimes of recuperation. This is a problem that society must settle, and while information and ethical considerations can help, the decision is a social one.
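The dilemma can be made concrete with a toy calculation. The sketch below is only an illustration: the seven patients, their severity scores, the 7 days of recuperation per operation, and the 0.2 risk of death for an unoperated patient who has the disease are all invented for the example and come from no study. It compares operating policies that differ only in how severe the symptoms must be before surgery is done.

```python
# Toy model of the safe surgery dilemma. Every number here is hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    severity: float     # symptom severity, from 0 (slight) to 1 (severe)
    has_disease: bool   # whether the patient truly needs the operation


RECUPERATION_DAYS = 7          # assumed days lost per operation
DEATH_RISK_IF_UNTREATED = 0.2  # assumed risk for a diseased patient not operated on


def outcomes(patients, threshold):
    """Operate on every patient whose severity is at least `threshold`.

    Returns (expected deaths, total recuperation days) for that policy.
    """
    operated = [p for p in patients if p.severity >= threshold]
    missed = [p for p in patients if p.severity < threshold and p.has_disease]
    expected_deaths = DEATH_RISK_IF_UNTREATED * len(missed)
    recuperation_days = RECUPERATION_DAYS * len(operated)
    return expected_deaths, recuperation_days


# Hypothetical patients: one has the disease despite only slight symptoms.
patients = [
    Patient(0.1, True), Patient(0.2, False), Patient(0.3, False),
    Patient(0.5, False), Patient(0.6, True), Patient(0.8, True),
    Patient(0.9, True),
]

# From "operate at the slightest sign" (0.0) to "operate only on severe cases" (0.85).
thresholds = [0.0, 0.25, 0.55, 0.85]
table = {t: outcomes(patients, t) for t in thresholds}

fewest_deaths = min(thresholds, key=lambda t: table[t][0])
fewest_days = min(thresholds, key=lambda t: table[t][1])

for t in thresholds:
    deaths, days = table[t]
    print(f"threshold {t:.2f}: expected deaths {deaths:.1f}, recuperation days {days}")
print("policy with fewest expected deaths:", fewest_deaths)
print("policy with fewest recuperation days:", fewest_days)
```

With these made-up numbers, operating at the slightest sign gives the fewest expected deaths and the most days of recuperation, while operating only on severe cases does the reverse. No single threshold is best on both counts, so the choice among them cannot be read off the data alone.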

Note that the conflict is not the usual one between society’s interests and those of the individual, but is a conflict within the individual. We have here the classical mathematical difficulty that we cannot expect to maximize two functions at the same time. That is why the happy principle of “the greatest good for the greatest number” is only a slogan and rarely a useful tool.

Linkage and confidentiality. While I am on the subject of information, let me mention a further matter. A great deal of valuable information about the economy and about health is tied up in government computer systems. It is difficult to relate various kinds of information for statistical study purposes because we have become more and more concerned about privacy and confidentiality. This leaves us with a serious question. Are we going to purchase this information all over again with a time delay in order to solve new problems, or are we going to use what we have? To use it requires that we link up information about an individual from separate statistical series. We do not need to know the person’s name after the linkage has taken place, but some identification is required to make the link. Under suitable auspices, the linkage can be made and then the names erased. After that, statistical analyses can be made.

As an example of the need, associated with evaluation, we have many statistical series in the United States of exposure to various chemicals. Let us call these the input information. We also have many series concerning deaths or disablements or morbidity from various diseases. Let us call these the output information. What we do not have is many series where the exposure input data are linked to the health outcome data. A recent study at the National Center for Health Statistics (11) found only four series that related environmental input to health outcome, and the linkage was primarily of a geographic aggregate nature rather than on a single individual basis.
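The linkage step described here, joining an input series to an output series on an identifier and then erasing the identifier, can be sketched in a few lines. Everything in the example is hypothetical: the records, the field names (`person_id`, `chemical`, `years_exposed`, `diagnosis`), and the function `link_and_anonymize` are invented for illustration and do not describe any actual government series or procedure.

```python
# Sketch: link exposure ("input") records to health ("output") records on an
# identifier, then erase the identifier before any statistical analysis.
# All records and field names below are hypothetical.

exposure_series = [
    {"person_id": "A17", "chemical": "solvent", "years_exposed": 12},
    {"person_id": "B42", "chemical": "solvent", "years_exposed": 0},
    {"person_id": "C08", "chemical": "solvent", "years_exposed": 7},
]

outcome_series = [
    {"person_id": "A17", "diagnosis": "respiratory disease"},
    {"person_id": "C08", "diagnosis": None},
    {"person_id": "D99", "diagnosis": "respiratory disease"},
]


def link_and_anonymize(inputs, outputs, key="person_id"):
    """Join the two series on `key`, keeping only individuals found in both,
    and drop `key` from each linked record once the link has been made."""
    outputs_by_id = {row[key]: row for row in outputs}
    linked = []
    for row in inputs:
        match = outputs_by_id.get(row[key])
        if match is None:
            continue  # no outcome record for this individual
        merged = {**row, **match}
        del merged[key]  # the identifier is needed only to make the link
        linked.append(merged)
    return linked


linked = link_and_anonymize(exposure_series, outcome_series)
for record in linked:
    print(record)
```

The analyst then works only with `linked`, which pairs each individual's exposure with that individual's outcome but carries no identifier. This is a simplification: a real system would also have to guard against re-identification from the remaining fields.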
Thus if we want to clean up the environment, we need data linkage to tell us how to spend our money to reduce damage to health. We need to know where the most damage occurs and how effective expenditures would be in reducing health losses. If we knew that one social policy would save a person-year of life for each $10,000 spent, and another policy would cost $500,000 per year of life saved, this information might well influence us in deciding how to spend the money. Lest you suppose that such extremes do not arise, let me say that we can document wider extremes in some current lifesaving and safety policies (12). What sort of strategy might we have for getting this information? We need more linkage of data about individual exposure and life histories and their relation to health outcome. We are trying to control health events that may take 20 or 30 years to develop, and we do have data of potential value in choosing such controls. The self-interest of the society might well decide that instead of starting out now to gather such data from scratch, we would do better to have some linkage and consider the various amounts of damage that different forms of exposure create, and what it might cost society to reduce

untoward effects. Nevertheless, this is a political issue, and society may prefer its privacy and confidentiality to providing information that may save lives and disablements.

Research on Research

There have been several studies of basic research, or perhaps of research in general. We need many studies in this area, not to discover whether basic research yields dividends, but to find out something about the prospects for success of various kinds in research and development.

The first such study was Project Hindsight, which was carried out by the U.S. Department of Defense (13). Its general conclusion, which did not cheer up basic researchers, was that basic research did not contribute much to the development of weapons systems. It concluded that targeted research contributed more. The second was the study of Comroe and Dripps (14, 15). This study traced some major biomedical developments to their basic research roots, and showed the essential role of basic research in inventing new therapies.

The go-no-go approach to basic research seems to be a not very helpful concept. We need basic research for new developments. The problems must be what basic research is needed, and how much is worthwhile in a given area. Can evidence be adduced which would help with the funding and educational and occupational decisions that must be made? It is one thing to say that nobody knows, but another to face the fact that someone has to decide how much money to provide and how to spend it for the public good. Although such questions are political, definite quantitative information can help us with such decisions.

Let me describe what I found common to the Comroe-Dripps and the Hindsight studies in spite of their opposing conclusions. The first finding was that major practical advances in both weaponry and biomedical therapies seemed to require not just one innovation or breakthrough, but a bundle of them, often as many as a dozen.
The second was that there is a substantial period, often 20 years, between a basic science innovation and its use in weaponry or therapies. If a variety of new things have to be assembled to make a whole, it is not surprising that they might on average be somewhat aged before being used in a major innovation. Comroe and Dripps studied the origins of the ten most important clinical advances in cardiopulmonary medicine and surgery occurring between 1945 and 1975. Of 529 key research articles leading to these advances, 41 percent “reported work that, at the time it was done, had no relation whatever to the disease that it later helped to prevent, diagnose, treat, or alleviate” (14, p. 12). A report with the acronym TRACES (16), prepared by the Illinois Institute of Technology Research Institute for the National Science Foundation, dealt with five advances: magnetic ferrites, video tape recorder, the oral contraceptive pill, the electron microscope, and matrix isolation. By studying

a longer time period than Project Hindsight, the investigators found that key events leading to these advances divided into three groups: 70 percent nonmission research, 20 percent mission-oriented research, and 10 percent development and application. The distribution of nonmission events had a mode between 20 and 30 years prior to the innovation, while mission-oriented events peaked during the decade prior to the innovation. For these case studies, time from conception to demonstration ran about 9 years. Ten years prior to the innovation, 90 percent of the nonmission-oriented research had been completed.

The Battelle Columbus Laboratories (17) extended this research by adjoining the heart pacemaker, hybrid grains and the Green Revolution, electrophotography, input-output economic analysis, and organophosphorus insecticides to the magnetic ferrites, video tape recorder, and the pill studied by TRACES. The average time from conception to first realization of the innovations was 19 years. This set of innovations took longer to realize than those of TRACES, and of the significant events leading to the innovations, the distribution was 34 percent nonmission, 38 percent mission-oriented, 26 percent development, and 3 percent nontechnical. Thus the distribution of key events into the categories varies depending on the choices of innovations to study and perhaps on who classifies them. It seems clear, however, that both mission- and nonmission-oriented research are important, and that the nonmission work goes on generally well in advance of the mission-oriented research, which in turn tends to precede the developmental work.

We need some additional kinds of studies that are retrospective and prospective. For example, we need to have an idea about the population of research being done, and what it emits, in addition to a method that starts with highly selected output and works back.
Once this idea has been worked over carefully so that we understand what we need to find out, we then might engage in a truly prospective study. That is, we ultimately need to develop a study based on research as it starts. The major difference between a forward-looking retrospective study and a prospective study is that we have the opportunity to gather the data we want in the prospective study. In the forward-looking retrospective study we have to make do with the data history has provided. Recollections often differ, and the older I grow, the more I distrust oral history. Funding agencies generally, and the U.S. Congress in particular (18), especially desire more research of this kind. We do not know much about how to do it. Blume (19) says that we should not expect universal principles of scientific management, but that the comparative analysis of scientific communities might do much to help us understand the workings of science. In such studies, he says, we might find out how organizational factors, resources, and division of labor vary in their effects from one specialty to another. Although we can scarcely instruct anyone how to do this research on scientific productivity and scientific management, we should encourage a good deal more of it and not expect much payoff soon.

Successful technological innovations. We need considerably more work in the area of research on research, both in basic science and in innovations in technology. I illustrate this for the technology side using the British study called Project Sappho. Investigators at the Science Policy Research Unit at the University of Sussex studied the reasons for success and failure in industrial innovation (20). They combined a matched-pair and a case-study approach. They chose instances in which a technological innovation had been introduced at least twice, at least once successfully and at least once unsuccessfully. Then they carefully applied the case-study method to the details of both the successful and the unsuccessful innovation. In all, they studied 29 pairs of innovations, drawn from either the chemical industry or the scientific instruments industry. They wanted to find the characteristics that separate winners from losers. The main finding was that no one variable seems to distinguish successful from unsuccessful innovations. Beyond this, their detailed findings can be summarized as follows: Successful innovators better understand user needs; pay more attention to marketing; develop more efficiently, but not necessarily faster; make better use of outside technology and advice; have responsible individuals with greater seniority and authority (mostly the business innovator rather than the technical innovator). Many features did not seem relevant though often mentioned in business lore: size of firm, management techniques, use of qualified scientists and engineers, timing (being first or second to market the innovation), initial familiarity with markets and technology, structure of research, in-house versus out-of-house ideas, market pressures. Of course, the sample studied has special features. It does not discuss technologies where just one attempt at introduction succeeded or failed, and these might form the majority of cases.
Thus it would be valuable to have some further studies. In examining single successes and failures, we cannot readily create the comparability that the matching in Project Sappho provided.

Scientists and engineers. In recommending research on research and development, I wish that I could say that research on scientists and research on engineers were mutually supporting efforts and that what works for one works for the other. Research at the Massachusetts Institute of Technology suggests that this is not true. Even for research scientists working in the same firm with engineers, the goals are not the same (21; 22, p. 310). Their priorities are almost reversed, with scientists more oriented to the world outside the company and engineers turned more inward toward the company. Allen (22) reports that, for engineers, ideas suggested by people outside the firm for solving company problems have a low success rate compared with ideas developed within the firm. Research scientists, in contrast, find that suggestions from outside the firm have a good success rate. Allen reports another difference. At first his group felt that engineers did not read the literature while the research scientists did. Further investigation
showed that a few engineers acted as technological gatekeepers. They read the literature and interacted with the rest, keeping them informed. Of course, these gatekeepers were soon promoted to management, where they no longer interacted technically with the engineers and could no longer follow the literature. Among research scientists, the tendency was for each person to keep up with an appropriate literature, each scientist acting as his or her own technological gatekeeper. These remarks merely support my earlier point that we cannot expect our research efforts to have universal applicability for scientists and engineers.

Technological innovation and the economy. In late 1979 the American Chemical Society held a symposium on innovation and research (23). In June 1980 the AAAS held its Fifth Annual R & D Colloquium (24). Both the symposium and the colloquium considered what could be done to stimulate innovation. Edwin Mansfield (25) pointed out that the economist Zvi Griliches (26) used data from about 900 manufacturing firms to indicate that a firm’s rate of productivity increase is directly related to the amount it has spent on R & D. Nestor Terleckyi (27) has shown corresponding results for whole industries. Mansfield (28) has found that there is a direct relation between amount spent on basic research and rate of productivity increase after adjusting for total R & D expenditure. The participants at these conferences had many suggestions for increasing the rate of innovation, mainly through changes in government policy. Several speakers pointed to Germany and Japan, where they felt that the cooperation between government and industry to promote industrial development was a pattern to emulate. Others encouraged closer relationships on what Daniel Boorstin (29) calls “the fertile verge” between industry and universities. Improved patent policy would help, some say. Others suggested reductions in regulations.
I shall not try to pull together or evaluate these suggestions made by others more qualified in this area than I. William D. Carey (30), Executive Officer of AAAS, pointed out that creating a substantial turn-around in public policy toward innovation would be a lot to expect because innovation has a rather small constituency. To enlarge it would require cooperation among large and small industries, labor organizations, economists, professional groups, media, and elected representatives. He doubts this will happen. He points out too that since innovations take perhaps 10 years to develop, it is hard to evaluate the effectiveness of any specific policy shift in the process of technological innovation. He sums up these complications in a quotation from a friend who says “On a clear day, you can see practically nothing.”

AAAS Initiative in Science and Engineering Education

With such a complicated outlook on the government and industrial side, what more can AAAS do? A fundamental ingredient in both scientific and
technological innovation is the strength of the scientists and engineers who are available. Another important component is a well-educated public who can appreciate the value of research of all kinds and recognize the need to nourish them. In the United States we have seen an erosion in education in science and mathematics both in amount and quality. The citizen has become less well informed, as we know from the many studies by the National Assessment of Educational Progress. It may take 10 years to develop a technological innovation, but it takes 20 years to make a citizen or a scientist or engineer. We must find methods to do this better. As Antoine de Saint-Exupéry said (31, p. 155), “As for the Future, your task is not to foresee it, but to enable it.” For example, I have been impressed with the educational work of the Ontario Science Centre. After Dr. Tuzo Wilson introduced me to it, I encouraged our AAAS Committee on the Public Understanding of Science to review the possible use of science and technology centers and museums as a resource for strengthening science education. That committee, chaired by John Truxal, has already taken steps in that direction. This is just one step. The committee has made several recommendations to the Board of Directors designed to strengthen the AAAS effort in education. The board, after being informed of recent studies of science education in various countries, has resolved to mount a program to help improve science and engineering education both for the citizen and for professionals. James Rutherford will advise the board on developing a program for science education. Our first steps will be to work with our affiliated science and engineering societies to advance science and engineering education in the 1980’s. We will urge the President of the United States and the Congress to address the need for excellence in science and engineering education.
I hope, personally, that our Canadian members will join us in considering what cooperative steps may be taken. Finally, for the 1982 AAAS Meeting in Washington, D.C., we have directed that a major theme be “Toward a national commitment to educational excellence in science and engineering for all Americans.”

Conclusion

To get and retain the benefits of social, medical, and technological innovations, we need to evaluate their effectiveness in practice. Otherwise we will find ourselves paying for poor innovations both in dollars and in delay of introducing better ones. Individuals have some self-interest in participating in experiments and sample surveys and allowing the linkage of information for statistical purposes, because innovations may not be suited to the nonparticipants, and we may mount expensive programs that have little payoff. Still society must decide whether it approves. Solid information helps decisions about some innovations but cannot settle some issues where different payoff evaluations lead to differing policy actions. To help the process of innovation and to inform the funding agencies and the public, we need more research on the research process itself both in basic science and in technology. To maintain a strong economy we require constant innovation. This requires vigorous basic science and technological R & D programs. Providing these requires a citizenry well-educated in science and technology as well as strong training for specialists in science and engineering. This education needs a great deal of improvement. The AAAS plans to join with other societies and institutions in strengthening these educational programs.

References and Notes

1. The Holy Bible, King James Version, Daniel 1: 12–15.
2. S. Purchas, Hakluytus Posthumus or Purchas His Pilgrimes: Contayning a History of the World in Sea Voyages and Lande Travells by Englishmen and others (1625) (reprinted by James MacLehose and Sons, Glasgow, 1905), vol. 2, pp. 392–398.
3. S. Davidson, A.P. Meiklejohn, R. Passmore, Human Nutrition and Dietetics (Williams & Wilkins, Baltimore, 1959), p. 419.
4. J. Lind, A Treatise of the Scurvy (1753) (University Press, Edinburgh, reprinted 1953).
5. Encyclopaedia Britannica (Benton, Chicago, 1964) vol. 20, pp. 231–232. Jane Teas has kindly called my attention to C. Lloyd’s “The introduction of lemon juice as a cure for scurvy” [Bull. Hist. Med. 35, 133 (1961)]. Lloyd attributes the time lag of introducing lemon juice in the British Navy primarily to misleading reports from Cook’s voyages and reports from Cook himself and secondarily to the many competing remedies, failure to experiment properly, and to Lind’s own lack of prominence. Buckham McPeek has shown me “Vitamin C status of submariners” by S.C. Gilman, R.J. Biersner, and R.D.
Thornton (Report 933, Naval Submarine Medical Research Laboratory, Submarine Base, Groton, Conn., June 1980). The authors find lowered plasma vitamin C in submariners after patrols, and encourage vitamin C dietary supplements, closing “After 400 years, the mistakes of the past should not be repeated.”
6. J.P. Gilbert, R.J. Light, F. Mosteller, in Evaluation and Experiment: Some Critical Issues in Assessing Social Programs, C.A. Bennett and A.A. Lumsdaine, Eds. (Academic Press, New York, 1975), chap. 2, pp. 39–193.
7. K.S. Ditman, G.G. Crawford, E.W. Forgy, H. Moskowitz, C. MacAndrew, Am. J. Psychiatry 124, 160 (1967).
8. J.P. Gilbert, B. McPeek, F. Mosteller, in Costs, Risks, and Benefits of Surgery, J.P. Bunker, B.A. Barnes, F. Mosteller, Eds. (Oxford Univ. Press, New York, 1977), chap. 9, pp. 124–169.
9. See J.P. Gilbert, B. McPeek, and F. Mosteller [Science 198, 684 (1977)] for a more detailed treatment of this issue.
10. F. Mosteller, J. Surg. Res. 25, 185 (1978).
11. B.B. Cohen, J.A. Perlman, D.P. Rice, L.E. Duncan, S.C. Marcus, M.D. Kunst, P.E. Leaverton, W.F. Stewart, Environmental Health: A Plan for Collecting and Coordinating Statistical and Epidemiologic Data (National Center for Health Statistics, Government Printing Office, Washington, D.C., 1980).
12. N.J. Bailey, Reducing Risks to Life: Measurement of the Benefits (American Enterprise Institute for Public Policy Research, Washington, D.C., 1980).
13. C.W. Sherwin and R.S. Isenson, First Interim Report on Project Hindsight (Office of Director of Defense Research and Engineering, Washington, D.C., 30 June 1966, revised 13 October 1966).
14. J.H. Comroe, Jr., Retrospectroscope: Insights into Medical Discovery (Von Gehr, Menlo Park, Calif., 1977).
15. J.H. Comroe, Jr. and R.D. Dripps, Science 192, 105 (1976).
16. Technology in Retrospect and Critical Events in Science (TRACES vol. 1, report to the National Science Foundation, prepared by the Illinois Institute of Technology Research Institute, 1968).
17. Interactions of Science and Technology in the Innovative Process: Some Case Studies (final report to the National Science Foundation, prepared by Battelle Columbus Laboratories, Columbus, Ohio, 1973).
18. R.N. Giaimo, in A.H. Teich et al. (24), pp. 29–37.
19. S.S. Blume, Science 207, 48 (1980).
20. Science Policy Research Unit, University of Sussex, Success and Failure in Industrial Innovation (Centre for the Study of Industrial Innovation, London, 1972).
21. R.R. Ritti, The Engineer in the Industrial Corporation (Columbia Univ. Press, New York, 1971).
22. T.J. Allen, Managing the Flow of Technology: Technology Transfer and the Dissemination of Technological Information Within the R&D Organization (MIT Press, Cambridge, Mass., 1977).
23. W.N. Smith and C.F. Larson, Eds., Innovation and U.S. Research: Problems and Recommendations (ACS Symposium Series 129, American Chemical Society, Washington, D.C., 1980).
24. A.H. Teich, G.J. Breslow, G.F. Payne, Eds., R & D in an Inflationary Environment: Federal R & D, Industry and the Economy, Universities, Intergovernmental Science (American Association for the Advancement of Science, Washington, D.C., 1980).
25. E. Mansfield, in A.H. Teich et al. (24), pp. 84–91.
26. Z. Griliches, in New Developments in Productivity Measurement and Analysis, J. Kendrick and B. Vaccarra, Eds. (National Bureau of Economic Research, Chicago, 1980), pp. 419–454.
27. N. Terleckyi, Effects of R & D on the Productivity Growth of Industries: An Exploratory Study (National Planning Association, Washington, D.C., 1974).
28. E. Mansfield, Am. Econ. Rev. 70, 863 (1980).
29. D.J. Boorstin, “The fertile verge: Creativity in the United States,” an address given at the Carnegie Symposium on Creativity, the inaugural meeting of the Council of Scholars of the Library of Congress, 19–20 November 1980 (Library of Congress, Washington, 1980).
30. W.D. Carey, in W.N. Smith and C.F. Larson (23), chap. 23, pp. 231–234.
31. A. de Saint-Exupéry, The Wisdom of the Sands, translated by S. Gilbert from the French Citadelle (Univ. of Chicago Press, Chicago, 1979; reprint of the edition published by Harcourt, Brace, New York, 1950).
32. I thank a number of persons for help in the preparation of this manuscript: T.J. Allen, J. Bailar, W.D. Carey, J. Charette, E.E. Clark, B.J. Culliton, R. Day, R. Dersimonian, D.C. Hoaglin, P. Lavori, A. Lisbon, B. McPeek, E. Mansfield, M. Olson, K. Patterson, L. Robinson, J. Teas, D. Wolfle, C. Youtz. Preparation of this paper was aided by National Science Foundation grant NSF SOC 75-15702 to Harvard University.

Reprinted from Utilitas Mathematica (1982), 21A, pp. 155–178

32. Combination of Results of Stated Precision: I. The Optimistic Case

Frederick Mosteller and John W. Tukey

Harvard University and Princeton University and Bell Laboratories

ABSTRACT. We propose a technique for combining estimates of the same quantity into a single number and its estimated variance. Several series (or experiments) provide estimates, $y_i$, of the quantity of interest and of the variance of $y_i$, estimated internally to the series, namely $s_i^2$. In the special case where all $y_i$ estimate the same number, Cochran has proposed a method called partial weighting, which groups together the half to two-thirds of the estimates that have the smallest estimated variances and gives them equal weights, and weights each of the other $y_i$ separately, inversely as their variance. We propose a parallel method where the grouping leans on the order statistics of an appropriate chi-squared distribution or distributions. Let $\nu_i + 1$ be the number of measurements in series $i$ and $s^2(i) = s_i^2(\nu_i + 1)$, and let $n$ be the number of series. Let $c(i|n,\nu)$ be the median of the $i$th order statistic in a sample of $n$ from the $\chi^2_\nu/\nu$ distribution, where $\chi^2_\nu$ follows the chi-squared distribution with $\nu$ degrees of freedom. For brevity, let $c_i = c(i|n,\nu_i)$. Then the $s^2(i)/c_i$ estimate a common $\sigma^2$ for all those $i$ for which that $\sigma^2$ applies. Our broad purpose is to group together the $y_i$ associated with the smallest of these variance estimates and weight them equally. Essentially, we group together all $y_j$ with $s^2(j)/c_j < 2 \min_i s^2(i)/c_i$. Then we repeat the process with the remaining series, reducing $n$ as we go. The required details appear in Section 4. We illustrate both Cochran’s method and ours on a collection of series of carefully measured accelerations of gravity at Washington, D.C.

1. Introduction. Frequently we have to combine the results of several series of measurements, where each series provides both an estimate, $y_i$, of the quantity of interest and an estimate, $s_i^2$, of that estimate’s approximate variability. The proper analysis differs substantially, depending on whether we can assume that each series is estimating the same number, or whether we have to be more realistic by assuming that the different series are estimating different numbers. (In the latter case, further assumptions about series-to-series variation are clearly going to appear.) In Cochran’s 1954 account, the decision between these approaches is made by a “test for interactions”. It seems to us that it may often be better to make this decision on the basis of subject-matter knowledge, making the “test for interaction” only if subject-matter knowledge suggests that the values being estimated are likely to be the same and planning to allow the test result to overrule such a subject-matter decision wherever the test indicates that the values being estimated appear to be different. We shall try to treat the first of these approaches (which is often unrealistic) in the present paper, leaving related considerations concerning the second approach to another place.

The best-supported technique for the first approach, during the last 25 years, has been that of Cochran (1954), which developed from the work initiated in a paper by Yates and Cochran (1938). In this approach, called “partial weighting” (not the same thing as “semiweighting”), one begins by obtaining preliminary weights for the results of the several series as numbers proportional to the reciprocals of the estimated variances. As a second step, Cochran groups together from one-half to two-thirds of the series with the smallest $s_i^2$, and gives each of these latter series the same weight, the mean of the preliminary weights for the series making up this combined subgroup. (It is natural to group at the bottom, rather than the top, because a few unusually small $\sigma_i^2$ are quite unlikely, although unusually large ones are all too common.) We then have to provide an estimate of the variance of the combined results, and if we wish to go on to a confidence interval, an estimated number of degrees of freedom for that estimate. Cochran found it adequate to estimate the variance as if the weights had been estimated without regard to the data and to use the Smith-Welch-Fisher (Smith 1936, Welch 1947, 1949) approximation for the number of degrees of freedom, which would then be appropriate.

If we are to do better than following Cochran’s suggestion directly, we will need to take more detailed account of the relative sizes of the $s_i^2$ ($= 1/w_i$). In doing this, we want a procedure which performs well in at least three extreme situations: first, where all the $\sigma_i^2$ are equal; second, where the series fall into a small number of groups with the same $\sigma_i^2$ within each group, but substantially differing values from group to group; and third, where the values of the $\sigma_i^2$ are merely scattered.

We shall find it convenient to group together the series that we wish to combine and to do different kinds of combinations (so far as estimated variance and estimated degrees of freedom go) between and within groups. In doing this, we are using a computational device intended to provide good performance in diverse circumstances. We are not asserting that there actually are real groups, that is, groups such that the variances within a group are more alike than they are among adjacent groups. And if there should be groups, we are not assuming that the groups that we have formed correspond to the groups within which the variances are constant.

2. A gravity example treated by Cochran’s method. We now illustrate Cochran’s procedure for data on the acceleration of gravity at Washington, D.C., measured by timing a counted number of pendulum swings, provided by Heyl and Cook (1936). In their work, Heyl and Cook made a number of series of careful determinations, of which 8 are reported in considerable detail in Tables 5 to 12 of their paper. The earliest of these series (later table numbers) used materials in the experimental apparatus which were later replaced by more satisfactory materials. We have not chosen this example as one for which the optimistic approach is reasonable. Rather it has been chosen as one with all the standard difficulties rather clearly present. We plan to treat it more extensively and more reasonably both in a later paper, and in a fuller account now in preparation. Here it serves only to illustrate the arithmetic. Exhibit 1 gives the details of the experimental results, together with means and $s^2$ for the various series. We use $s^2(i)$ for the $s^2$ calculated for the individual observations of a series, and $s_i^2 = s^2(i)/n_i$ for the (internally) estimated variance of the series mean (almost surely an underestimate, as we will see elsewhere).
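The two variance summaries just defined can be checked numerically. The following is a minimal Python sketch, not part of the original paper: the names `deviations` and `summarize` are illustrative assumptions, and the data are the Table 5 and Table 8 deviations listed in Exhibit 1.

```python
from statistics import mean, variance

# Deviations from 980080 for two of the Heyl and Cook series (Exhibit 1),
# in units of 0.001 cm/sec^2. The dictionary name is an illustrative choice.
deviations = {
    5: [10, 12, 11, 8, 3, 2, 3, 6, 9, 7, 4, 4, 4],
    8: [-1, -1, 1, 2, -5, -9, -2, 1],
}


def summarize(d):
    """Return (n_i, mean deviation, s^2(i) per observation, s_i^2 for the mean)."""
    n = len(d)
    s2_obs = variance(d)  # sample variance with divisor n - 1
    return n, mean(d), s2_obs, s2_obs / n


for table, d in deviations.items():
    n, dbar, s2_obs, s2_mean = summarize(d)
    print(f"Table {table}: n={n}, mean={dbar:.2f}, "
          f"s2(i)={s2_obs:.4f}, s2_i={s2_mean:.4f}")
```

For the Table 5 series this gives a mean deviation of about 6.38, an $s^2(i)$ of about 11.2564, and an $s_i^2$ of about .8659, in agreement with the corresponding row of Exhibit 1.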
Since the series are of modestly variable length, $8 \le n_i \le 13$, in this case, it could make a difference whether we carry out the partial weighting process on the $s_i^2$ or on the $s^2(i)$. We shall begin by working with the $s^2(i)$, returning to the alternative choice before long. If we partially weight the $s^2(i)$, we will get a combined $s^2(L)$ for the lower group. The corresponding weights to replace the naive $1/s_i^2 = n_i/s^2(i)$ for series in that group will then be $n_i/s^2(L)$. Exhibit 2 treats these 8 series by partial weighting as we think someone who had read Cochran carefully might have done, so far as the weighted mean goes, with 5 (out of 8) series in the low group ($1/2 < 5/8 < 2/3$). The (naive) estimated variance ($= 1/\sum w_i$), given for the answer, is clearly not exactly what Cochran would prescribe. It is probable that Cochran would also have pooled the $s^2(i)$, taking account of their degrees of freedom, reaching an $s^2(P)$ of 27.5123 and weights of .4725, .2908, .4725, .3998 and .2908, rather than taking the mean of these $s^2(i)$, 29.9959, as we have done (though he seems to recommend this). This alternate route would lead to a sum of weights of 2.0312 and a weighted sum of deviations of 6.972, providing a (partially)

Exhibit 1
The Heyl and Cook measurements of g, expressed in units of 0.001 cm/sec$^2$ (8 series of 8 to 13 measurements each)

Table #   Base      $n_i$   Deviations, $d$, from 980080       $s^2(i)$    mean $\bar{d}$   $s_i^2$
(H&C)     value
5         980086    13      10,12,11,8,3,2,3,6,9,7,4,4,4       11.2564     6.38             .8659
6         980083    13      5,2,4,2,0,2,2,1,2,5,−1,−4,−13      22.4359     .54              1.7258
7         980085    11      4,4,4,12,13,7,−1,−7,1,8,9          34.0909     4.91             3.0992
8         980078    8       −1,−1,1,2,−5,−9,−2,1               13.3571     −1.75            1.6696
9         980086    8       20,15,1,1,12,4,2,−4                68.8393     6.38             8.6049
10        980082    9       30,8,1,0,−24,1,18,0,−13            248.2500    2.33             27.5833
11        980092    11      9,17,20,22,31,31,22,3,−3,−10,−11   233.8909    11.91            21.2628
12        980068    8       −2,4,5,−24,−43,−32,9,−10           370.5536    −11.62           46.3192

Notes: Table # is from their paper in J. Res. N.B.S. 17: 805–839, 1936. Base value is their reported value for the series, after making deletions judged appropriate to the series. $s^2(i)$ (per observation) is $\sum (d - \bar{d})^2/(n - 1)$. $s_i^2$ for mean is $s^2(i)/n_i$. The series from Tables 9 and 10 were measured about 6 months earlier, those from Tables 11 and 12 were measured about 1 year earlier; the series from Tables 9, 10 and 11 were measured with stellite knife edges, that from Table 12 was measured with agate knife edges, both later stated to be unsatisfactory.

weighted mean of deviations of 3.43 ± .70, only a hair different from the result of Exhibit 2. We will return below (Section 7) to Cochran’s calculation of an estimated variance for this partially weighted mean and of an estimated degrees of freedom for this estimated variance.

We are now going to supplement the analysis of all the results reported by Heyl and Cook by an analysis of a portion of their results. We will select 3 observations from each series, avoiding the first one as likely to be unusual and the later ones as subject to a decreasing trend. Such an additional example serves three purposes: (1) it will show, when we come to our grouping, something of how our grouping process is affected by the degrees of freedom assigned to the $s^2(i)$, with greater fluctuations anticipated for fewer degrees of freedom, (2) it will provide an example with equal degrees of freedom, and (3) by producing a shortened example for which the naive standard deviation of the overall result is still even less than for the analysis of the full series, it will show how far from ideal conditions the data at hand must be. We offer next, for later comparison, the results of similar calculations for only 3 (the 2nd, 3rd and 4th) of the determinations in each series. Exhibit

Exhibit 2
Partial weighting applied to the series in Exhibit 1 (units 0.001 cm/sec$^2$, deviations still from 980080)

Table #   $n_i$   $y_i$ = mean      $s^2(i)$        $w_i = n_i/s^2(i)$   $w_i y_i$
(H&C)             of deviations     (for indiv.)
5         13      6.38              11.2564         .4334*               2.765
8         8       −1.75             13.3571         .2667                −.467
6         13      0.54              22.4359         .4334                .234
7         11      4.91              34.0909         .3667                1.800
9         8       6.38              68.8393         .2667                1.702
(sum)                               (149.9796)      (1.7669)             (6.034)
mean                                29.9959
11        11      11.91             233.8909        .0470                .560
10        9       2.33              248.2500        .0362                .084
12        8       −11.62            370.5536        .0216                −.251
sum                                                 1.8717               6.427

partially weighted mean: 6.427/1.8717 = 3.43
naive estimated variance: $1/\sum w_i$ = .534 = (.731)$^2$
naive result: 980083.43 ± .73
Note: (*) In the first 5 entries of this column, $w_i = n_i/s^2(L) = n_i/29.9959$.
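The arithmetic of Exhibit 2 can be retraced from the raw deviations of Exhibit 1. The following Python sketch is an illustration only, not the authors' own computation: the names `DEVIATIONS` and `partial_weighting` are assumptions, and $s^2(L)$ is taken as the unweighted mean of the low group's $s^2(i)$, which is the choice stated in the note to Exhibit 2.

```python
from statistics import mean, variance

# Deviations from 980080 (units of 0.001 cm/sec^2), keyed by H&C table number.
DEVIATIONS = {
    5: [10, 12, 11, 8, 3, 2, 3, 6, 9, 7, 4, 4, 4],
    6: [5, 2, 4, 2, 0, 2, 2, 1, 2, 5, -1, -4, -13],
    7: [4, 4, 4, 12, 13, 7, -1, -7, 1, 8, 9],
    8: [-1, -1, 1, 2, -5, -9, -2, 1],
    9: [20, 15, 1, 1, 12, 4, 2, -4],
    10: [30, 8, 1, 0, -24, 1, 18, 0, -13],
    11: [9, 17, 20, 22, 31, 31, 22, 3, -3, -10, -11],
    12: [-2, 4, 5, -24, -43, -32, 9, -10],
}


def partial_weighting(series, n_low):
    """Partially weighted mean and naive variance 1 / sum(w_i).

    The n_low series with the smallest s^2(i) share a common s^2(L),
    taken here as the unweighted mean of their s^2(i).
    """
    # One (s^2(i), n_i, y_i) triple per series, ordered by s^2(i).
    stats = sorted((variance(d), len(d), mean(d)) for d in series.values())
    s2_low = mean(s2 for s2, _, _ in stats[:n_low])
    weights = [n / s2_low for _, n, _ in stats[:n_low]]
    weights += [n / s2 for s2, n, _ in stats[n_low:]]
    total_w = sum(weights)
    weighted_sum = sum(w * y for w, (_, _, y) in zip(weights, stats))
    return weighted_sum / total_w, 1.0 / total_w


estimate, naive_var = partial_weighting(DEVIATIONS, n_low=5)
print(f"{estimate:.2f} +/- {naive_var ** 0.5:.2f}")
```

With `n_low=5` the sketch returns a partially weighted mean of about 3.43 and a naive variance of about .534, matching the bottom lines of Exhibit 2.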

3 shows the numbers. Notice that, because we have avoided much of the downward trend evident in several of the series: a) the (naive) estimated variance of the final result is smaller for 3 determinations per series than it is for all determinations in each series, and b) the (partially weighted) final value is about 3.72 units (.0037 cm/sec$^2$) larger. (An additional reason for this last effect is clear; this analysis has allowed the Table-11 series to receive substantial weight. Although the deviation for the Table-5 series is quite positive, its change only accounts for one unit of increase.) We shall return to this example later, also.

If, in place of Exhibit 2, we had combined the $s_i^2$ (for means), rather than the $s^2(i)$ (for individuals), we would have, in this particular example, put the series in the same order, as Exhibit 1 shows. Presumably, then, we would have put the same 5 series in the low-$s^2$ group, when they would all get weight .3132 (the other weights being unchanged). The totals would be 1.6708 for weights and 5.548 for weighted deviations, giving a partially weighted answer

Exhibit 3
Combination of short-series results (only the 2nd, 3rd and 4th determination from each series) (unit still 0.001 cm/sec$^2$; deviation still from 980080) (overdotted figures repeat indefinitely)

Table #   $y_i$ = mean      $s^2(i)$        $w_i = 3/s^2(i)$   $w_i y_i$
(H&C)     of deviations     (for indiv.)
6         2.66̇             1.33̇           .4500*             1.200
8         .66̇              2.33̇           .4500              .300
5         10.33̇            4.33̇           .4500              4.650
11        19.66̇            6.33̇           .4500              8.850
10        3.00              19.00           .4500              1.350
(sum)                       (33.33̇)        (2.2500)           (16.350)
mean                        6.66̇
7         6.66̇             21.33̇          .1406              .937
9         5.66̇             65.33̇          .0459              .260
12        −5.00             271.00          .0111              −.056
totals                                      2.4476             17.491

partially weighted mean: 17.491/2.4476 = 7.15
naive estimated variance: $1/\sum w_i$ = .4086 = (.639)$^2$
naive result: 980087.15 ± .64
Note: (*) The first 5 entries in this column are $w_i = 3/s^2(L) = 3/6.6\dot{6}$.

of 980083.32. (This is about 1/6 a naive standard deviation smaller than those for the full example.)

3. Allowing for the natural variability among $s^2$ in choosing groups. We want to compare the observed variability of our variance estimates $s^2(i)$ (or $s_i^2$) with what would be reasonable in a highly homogeneous case. We plan to do this by looking at the idealized case and then throwing in a reasonable adjustment factor.

The natural idealized case is that where all $s^2(i)$ estimate the same $\sigma^2$, each with the same number of degrees of freedom, all the underlying populations are Gaussian, and all samples are drawn independently. (Of course, none of these assumptions will be guaranteed in practice.) We must contemplate both this unique situation, and a great variety of situations where the $s^2(i)$ estimate different $\sigma_i^2$ (with diverse patterns for the values of the $\sigma_i^2$). The only convenient starting point is the unique situation with all $\sigma_i^2$ equal; so it is natural to start there. Accordingly, it is natural to order the $s^2(i)$ by magnitude and to compare their pattern with what would be reasonable for an ordered sample from some multiple of the chi-squared distribution with the indicated number of degrees of freedom. It is a computationally verifiable result that the medians of the distributions of the order statistics in a sample from any population are given, precisely enough for practical purposes, as the values where the cumulative distribution of a single observation attains the value $(i - \frac{1}{3})/(n + \frac{1}{3})$. Let $c(i|n)$ be such working values for the medians of order statistics in a sample of $n$ from the unit Gaussian distribution

$$c(i|n) = \mathrm{Gau}^{-1}\left(\frac{i - \frac{1}{3}}{n + \frac{1}{3}}\right) = \mathrm{Gau}^{-1}\left(\frac{3i - 1}{3n + 1}\right),$$

where $\mathrm{Gau}^{-1}$ is the inverse of the unit Gaussian cumulative. (These $c(i|n)$ can also be taken, closely enough for most purposes, as .99 times the corresponding order statistic means [often tabulated as rankits, as normal scores, as scores for ordinal data (Fisher and Yates 1963), or as mean positions of ranked normal deviates (Pearson and Hartley 1958)].) Now the Wilson-Hilferty (1931) approximation for “chi-square/its degrees of freedom” leads to the formula

$$c(i|n,\nu) = \left[1 - \frac{2}{9\nu} + c(i|n)\sqrt{\frac{2}{9\nu}}\right]^3$$

for working values of medians of $\chi^2_\nu/\nu$ order statistics. While various approximations used in this expression can be slightly improved, this will not ordinarily be either necessary or useful, so long as $\nu \ge 3$. For $\nu = 1$ and 2 we can apply $[i - \frac{1}{3}]/[n + \frac{1}{3}]$ directly to, respectively, the half-Gaussian and the exponential, finding

$$c(i|n,1) = \left[\mathrm{Gau}^{-1}\left(\frac{i + n}{2n + \frac{2}{3}}\right)\right]^2$$

and

$$c(i|n,2) = -\ln\left(\frac{n - i + \frac{2}{3}}{n + \frac{1}{3}}\right).$$

If we were providing a table of c(i|n, ν) we might prefer to use improved calculations rather than the Wilson-Hilferty approximation, in its calculation, but the formulas now given will meet our needs in the present paper.
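The working values $c(i|n)$ and $c(i|n,\nu)$ are easy to compute. The following Python sketch is an illustration under stated assumptions rather than the authors' own program: the function names `c_gauss` and `c_chi2` are hypothetical, the standard-library `statistics.NormalDist` supplies the inverse Gaussian cumulative, and the $\nu = 1$ branch squares the half-Gaussian quantile so that the result is on the $\chi^2_1$ scale.

```python
from math import log, sqrt
from statistics import NormalDist

_GAU = NormalDist()  # unit Gaussian


def c_gauss(i, n):
    """Working median of the i-th order statistic (i = 1..n) of n unit Gaussians."""
    return _GAU.inv_cdf((3 * i - 1) / (3 * n + 1))


def c_chi2(i, n, nu):
    """Working median of the i-th order statistic of n draws from chi^2_nu / nu."""
    if nu == 1:
        # Half-Gaussian quantile, squared to return to the chi-squared scale.
        return _GAU.inv_cdf((i + n) / (2 * n + 2 / 3)) ** 2
    if nu == 2:
        # chi^2_2 / 2 is a unit exponential, so the quantile is explicit.
        return -log((n - i + 2 / 3) / (n + 1 / 3))
    # Wilson-Hilferty cube-root approximation, intended for nu >= 3.
    a = 2 / (9 * nu)
    return (1 - a + c_gauss(i, n) * sqrt(a)) ** 3


# Working values for 8 series, each with 12 degrees of freedom.
print([round(c_chi2(i, 8, 12), 3) for i in range(1, 9)])
```

As a rough check, with a single series ($n = 1$, $i = 1$) the $\nu = 2$ branch reduces to $\ln 2$, the median of the unit exponential, and the $\nu = 1$ branch gives about .455, the median of $\chi^2_1$.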

Given an ordered set of s²(i), all estimating the same σ² with about the same ν, we would expect them to look something like the ordered values of {σ² c(i|n, ν)} for i = 1, 2, . . . , n, although we would expect them to be, accidentally, somewhat more irregular. Thus it is natural to calculate the quantities s²(i)/c_i = s²(i)/c(i|n, ν), each of which can reasonably be thought of as estimating the common σ².

4. Using the s²(i) to guide our grouping. By analogy with Cochran’s procedure, we will want to begin by selecting a group of values of i corresponding to small s²(i)’s which might be reasonably homogeneous, that is, might reasonably have come from a common σ². A natural approach is to take the smallest value of s²(i)/c_i and include in the preliminary group all the j’s for which

s²(j)/c_j < K (min s²(i)/c_i)

and then to add to this preliminary group any leapfrogged k’s, that is, any k for which the corresponding s²(k) is less than s²(j) for some j already included. The second step ensures that we are consistent about which s²(i) we consider to be “small (and possibly homogeneous)”.

Having constructed a first trial for the low group, we ought now ask whether it should be modified. Modification may be needed because the values of c_k = c(i|n, ν_k) for k’s just outside the initial low group boundary will be smaller, perhaps considerably so, than the values of c*_k = c(i|n′, ν_k), where n′ counts the number of series in the initial low group, and one (or more) of those next to it. The easy computation is to take n′ equal to the number of series in the low group plus one, and to look at the corresponding values of s²(j)/c*_j, applying exactly the same algorithm. If the one added series now seems to belong to the low group, we add it. In such a case, we would repeat the calculation, after adding one more to the enlarged low group. We continue if necessary. When we stop, we have determined the final low group.
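The steps just described can be sketched in Python as follows. This is only one reading of the procedure, not the authors' own program. The names `c_exp`, `low_group_size`, and `group_series` are hypothetical, the working medians are supplied as a function argument `c(i, n, nu)`, and K is left as a parameter (2.0 is the value adopted later in the paper). Positive s² values without ties are assumed. The demonstration uses the s²(i) of the short gravity example of Exhibit 5, read as exact thirds, for which every ν is 2.

```python
import math


def c_exp(i, n, nu=2):
    """Working median of the i-th smallest of n values of chi-square_2 / 2."""
    return -math.log((n - i + 2 / 3) / (n + 1 / 3))


def low_group_size(s2, nu, c, K=2.0):
    """Size of the final low group among s2 values given in ascending order."""
    def trial(m):
        # Ratios s2(i)/c(i|m, nu_i) for the m smallest series.
        ratios = [s2[j] / c(j + 1, m, nu[j]) for j in range(m)]
        limit = K * min(ratios)
        # Highest position passing the test; lower ones are leapfrogged in.
        return max(j + 1 for j in range(m) if ratios[j] < limit)

    size = trial(len(s2))
    # Try to extend the group by one series at a time.
    while size < len(s2) and trial(size + 1) == size + 1:
        size += 1
    return size


def group_series(s2, nu, c, K=2.0):
    """Split the series into successive groups of original indices."""
    order = sorted(range(len(s2)), key=lambda j: s2[j])
    groups = []
    while order:
        k = low_group_size([s2[j] for j in order], [nu[j] for j in order], c, K)
        groups.append(order[:k])
        order = order[k:]
    return groups


short_s2 = [4 / 3, 7 / 3, 13 / 3, 19 / 3, 19.0, 64 / 3, 196 / 3, 271.0]
print(group_series(short_s2, [2] * 8, c_exp))
```

Passing the median function as an argument keeps the grouping logic separate from the choice of approximation, so a Wilson-Hilferty version for ν ≥ 3 could be substituted without touching the rest.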
We wish to continue, proceeding as if those s2 (i) in the low group came from lower σi2 than any of the others. Having removed this final low group from our total collection, we can now identify a second group by going through the whole procedure again for the remaining series, with a new value of remaining n, and hence new c(i|n, ν)’s, and so on, until all s2 (i) have been assigned to groups. Each of the groups thus assigned can be considered plausibly homogeneous, so that it seems reasonable to combine the results, yi , with equal weights within each of the groups. We have yet to choose K. Here we want to compromise our desires to find only one group when all the s2i estimate the same σ 2 , on the one hand, and on the other, to make a suitable number of groups when different s2i estimate different σi2 . At present, the best available basis for choosing K seems to be analogy. The two most nearly relevant analogies seem to be the use of a ratio in Paull’s


procedure for “sometimes pooling” (Paull, 1950) and the easily tolerable misweighting ratio in assessing the possible inefficiency of incorrectly weighted means (Tukey, 1948). Since the number 2 appears prominently in both these analogies, we shall choose K = 2 here. (Applying such a ratio to the raw s2 (i)’s would not be reasonable.) Again, we might choose to do all this for the s2i rather than the s2 (i). 5. Grouping in the gravity example. We turn now to the full gravity example, as prepared in Exhibit 2, where, initially, n=8 and the values (i − 1/3)/(n + 1/3) are .08, .20, .32, .44, .56, .68, .80 and .92, so that the values of c(i|8) are −1.4051, −.8416, −.4677, −.1510, +.1510, +.4677, +.8416 and +1.4051, so that the approximate values of c(i|8, ν) are as in the upper panel of Exhibit 4, which carries out first the initial grouping, and then the trial of adding a fifth series to these 4. The lowest group now consists of 4 series. When, at the second step, we examine the remaining 4 series, using (i − 1/3)/(n + 1/3)=.1538, .3846, .6154 and .8462, so that c(i|4) = −1.0203, −.2934, .2934 and 1.0203, we get the c(i|4, ν)’s in the lower panel of Exhibit 4, which completes the grouping. While the s2 (i)/ci for H&C’s Table 11 is more than twice 141.5, this Table-11 series is still included in the second group being formed, because it is leapfrogged (twice). We turn next to the short gravity example, prepared in Exhibit 3, with the results shown in Exhibit 5. We now expect more dispersion, since the νi are only 2, and we are not surprised to find a larger low group (6 this time). Notice that we started with five (on trying all 8), found it indicated to add a sixth (on trying 6), but not a seventh (on trying 7). (The appearance of the Table-10 series in parentheses, on the third try, is no reason for removal—it is leapfrogged.) The partial weightings for the groups from Exhibits 4 and 5 are carried out in Exhibits 6 and 7, respectively. 
On the whole, as one might expect, the changes due to a different grouping (this Section vs. Section 2) are not very large. The result for the full example shifted by a little less than a naive standard deviation, whereas that for the short example shifted by less than one quarter of its naive standard deviation. If, as an alternative to Exhibits 4 and 6, we again consider combining s2i instead of s2 (i), as we did at the close of Section 2, we find the same grouping (4 low, 4 higher), and mean s2i of 1.8401 and 25.9426, for the low and higher groups, respectively. The weights are now .5434 for the low group and .0385 for the high group, and the partially weighted mean of the deviations is 2.50, corresponding to an answer of 980082.50 ± .66 (naive) differing from that of Exhibit 6 by about 2/3 of a naive standard deviation. The general impression that the writers have of the situation is that shifting one series (H&C Table 9) out of the low group for the full example—and shifting one into the low group (H&C Table 7) for the short example—were both changes for the better, if we look at the columns of s2 (i)/ci or s2 (i)/c∗i in the light of anticipated (minimum) sampling deviations. Thus we prefer the

Exhibit 4
Grouping the full gravity example

A. Initial Step

Table #   ν_i   c_i =          s²(i)       s²(i)/c_i   c*_i =         s²(i)/c*_i
(H&C)           c(i|8, ν_i)                (*)         c(i|5, ν_i)    (**)
5         12    .4935          11.2564     22.81       .5614          20.05
8         7     .5480          13.3571     24.37       .6842          19.52
6         12    .7732          22.4359     29.02       .9455          23.73
7         10    .8717          34.0909     39.11       1.1598         29.39
9         7     .9855          68.8393     (69.85)     1.6149         (42.63)
11        10    1.1494         233.8909    (203.49)    –              –
10        8     1.3769         248.2500    (180.30)    –              –
12        7     1.8096         370.5536    (204.77)    –              –

Notes: (*) Values of s²(i)/c_i greater than twice 22.81 are in parentheses. (**) Values of s²(i)/c*_i greater than twice 19.52 are in parentheses.

B. Second Step

Table #   ν_i   c_i =          s²(i)       s²(i)/c_i
(H&C)           c(i|4, ν_i)                (*)
9         7     .4864          68.8393     141.5
11        10    .8149          233.8909    (287.0)
10        8     1.0647         248.2500    233.2
12        7     1.5211         370.5536    243.6

Note: (*) Values of s²(i)/c_i greater than twice 141.5 are in parentheses.

calculation based on Exhibit 4—on principle, and for its long run properties, whatever effect it may have had on the partially weighted means. The fact that the naive standard deviations were smaller in the grouped cases is interesting, but we doubt its value as evidence of better performance. 6. An important caution. In extreme circumstances, the procedures just outlined may produce a very small lowest group. Subject-matter knowledge should be called upon as thoroughly as possible to help in deciding whether such a group contains too few series. In the absence of adequate guidance from subject matter, we should probably impose a formal requirement that the smallest group contain at least a prescribed fraction of all the series. We should then plan to combine groups starting with the lowest until this requirement is met.

Exhibit 5
Grouping the short gravity example (all ν_i = 2; unit still 0.001 cm/sec², etc.)

A. Initial Step

Table #   c_i =        s²(i)     s²(i)/c_i   c*_i =       s²(i)/c*_i   c**_i =      s²(i)/c**_i
(H&C)     c(i|8, 2)              (*)         c(i|6, 2)    (**)         c(i|7, 2)    (***)
6         .0834        1.33˙     15.99       .1112        11.99        .0953        13.99
8         .2231        2.33˙     10.46       .3054        7.64         .2578        9.05
5         .3857        4.33˙     11.24       .5465        7.93         .4520        9.59
11        .5798        6.33˙     10.92       .8650        7.32         .6931        9.14
10        .8210        19.00     (23.14)     1.3350       14.23        1.0116       (18.78)
7         1.1394       21.33˙    18.72       2.2513       9.48         1.4816       14.40
9         1.6094       65.33˙    (40.59)     –            –            2.3979       (27.25)
12        2.5257       271.00    (107.30)    –            –

Notes: (*) Values here more than twice 10.46 in parentheses. (**) Values here more than twice 7.32 would be in parentheses. (***) Values here more than twice 9.05 in parentheses.

B. Second Step

Table #   c_i =        s²(i)     s²(i)/c_i
(H&C)     c(i|2, 2)              (*)
9         .3365        65.33˙    194.2
12        1.2528       271.00    216.3

Note: (*) Values here more than twice 194.2 would be in parentheses.

Although Cochran’s technique used a threshold fraction of 1/2 or 2/3, with our grouping techniques it seems plausible that fractions as small as 1/4 or 1/5 might well be reasonable. Clearly our examples, with 1/2 in one low group and 3/4 in the other, do not need any such treatment.
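The formal requirement suggested in this section can be sketched as a small Python helper. The function name `enforce_min_fraction` and the example groups are hypothetical, and the sketch assumes that the requirement applies to the lowest group (the one with the smallest variances), which is merged with its successors until it holds at least the prescribed fraction of all series.

```python
def enforce_min_fraction(groups, min_fraction):
    """Merge groups from the lowest upward until the lowest one is large enough.

    groups: list of lists of series labels, ordered from lowest to highest
    apparent variance.  min_fraction: required share of all series.
    """
    groups = [list(g) for g in groups]
    total = sum(len(g) for g in groups)
    while len(groups) > 1 and len(groups[0]) < min_fraction * total:
        # Absorb the next group into the lowest group.
        groups[0] += groups.pop(1)
    return groups


# Hypothetical grouping of 8 series with a lowest group of a single series.
print(enforce_min_fraction([[0], [1, 2], [3, 4, 5, 6, 7]], 0.25))
```

A grouping such as 4 low and 4 high series already satisfies a fraction of 1/4 and would be returned unchanged.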

Exhibit 6
The partially weighted estimate for the full gravity example (as grouped in Exhibit 4).

A. Bases for weights.
Low group mean: s²(L) = 20.28. High group mean: s²(H) = 230.38.

B. Partial weighting

Table #   ν_i   y_i = mean       w_i =         w_i y_i
(H&C)           of deviations    n_i/above
5         12    6.38             .6410         4.090
8         7     −1.75            .3945         −.690
6         12    0.54             .6410         .346
7         10    4.91             .5424         2.663

9         7     6.38             .0347         .221
11        10    11.91            .0477         .568
10        8     2.33             .0391         .091
12        7     −11.62           .0347         −.403

sums                             2.3751        6.886

partially weighted mean of deviations: 6.886/2.3751 = 2.899
naive estimated variance: 1/Σw_i = .4210 = (.649)²
naive result: 980082.90 ± .65
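The arithmetic of Exhibit 6 can be retraced with a short Python sketch. The function name `partially_weighted_mean` is hypothetical. The sketch assumes that each group's s² is the simple average of its s²(i), which reproduces the values 20.28 and 230.38 above, and that n_i = ν_i + 1. The s²(i) are taken from Exhibit 4.

```python
def partially_weighted_mean(groups):
    """groups: list of groups, each a list of (y_i, n_i, s2_i) triples.

    Returns (weighted mean, naive variance, total weight).
    """
    sum_w = 0.0
    sum_wy = 0.0
    for g in groups:
        # Unweighted mean of the s2(i) in the group.
        s2_group = sum(s2 for _, _, s2 in g) / len(g)
        for y, n, _ in g:
            w = n / s2_group  # equal weight per observation within a group
            sum_w += w
            sum_wy += w * y
    return sum_wy / sum_w, 1.0 / sum_w, sum_w


low = [(6.38, 13, 11.2564), (-1.75, 8, 13.3571),
       (0.54, 13, 22.4359), (4.91, 11, 34.0909)]
high = [(6.38, 8, 68.8393), (11.91, 11, 233.8909),
        (2.33, 9, 248.2500), (-11.62, 8, 370.5536)]
mean, variance, total_weight = partially_weighted_mean([low, high])
print(mean, variance, total_weight)
```

Small differences from the printed exhibit in the last digit are to be expected, since the exhibit sums rounded weights.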

7. The two steps of combination. We have proceeded as if:

• the grouping determines the weights
• the weights determine the partially weighted value

without stopping to ask whether this was just what we wanted to do. Except to the extent to which the values determine the s2 (i), and thus, somewhat indirectly, the grouping, which then determines the weights, it is not unreasonable to act, as Cochran did for his lower group, as if the weights within groups were externally provided. Their ratios are fixed by the ni (each = νi + 1), which then fix the (partially) weighted mean for that group. (If we work with the s2i , the weights would be just 1/s2i .) It is true that if the series of H&C’s Table 7 had produced an s2 (i)/ci of 45.71, instead of 39.11, this series would have been shifted out of the low group, thus very substantially changing its weight. But the biasing effect, on either: a) the partially weighted means for the groups, or

Exhibit 7
The partially weighted estimate for the short gravity example (as grouped in Exhibit 5).

A. Bases for weights.
Low group mean: s²(L) = 9.111˙. High group mean: s²(H) = 168.166.

B. Partial weighting

Table #   y_i = mean       w_i =       w_i y_i
(H&C)     of deviations    3/above
6         2.66˙            .3293       .878
8         .66˙             .3293       .220
5         10.33˙           .3293       3.403
11        19.66˙           .3293       6.476
10        3.00             .3293       .988
7         6.66˙            .3293       2.195

9         5.66˙            .0178       .101
12        −5.00            .0178       −.089

sum                        2.0114      14.172

partially weighted mean of deviations: 14.172/2.0114 = 7.046
naive estimated variance: 1/Σw_i = .4972 = (.705)²
naive result: 980087.05 ± .70

b) the stability that ought to be assigned to the groups does not seem likely to be large. While some experimental sampling investigations would surely be in order, we feel that losses here are less important than gains from more effective grouping. Thus we are content with what we have done within groups, namely: a) used the weights we found to make combination within groups, and b) plan to use the sum of these weights for the combination across groups. The latter, of course, is just what Cochran did. Thus we are content with:


a) the final partially weighted means, and
b) the total weights assigned to each group (and often, the corresponding degrees of freedom).

What remains to be considered is the calculation of an estimated variance (and its degrees of freedom) for combination across groups.

8. The final stage. For our grouped analyses, then, we have the following results, first for the full gravity example in Exhibit 8 and then for the short version in Exhibit 9. (We now use W_G, here in the form of W_L or W_H, for the total weight of a group and W_+ for the total weight of all groups.)

Exhibit 8
Group summary data for the full gravity example as in Exhibit 4 and partially weighted in Exhibit 6

G (= group)   W_G           ν_G    W_G/W_+   u_G       ν*_G
L             2.219         41*    .934      .247**    41***
H             .156          32*    .066      .247**    32***
total         2.375 = W_+

* These are the sums of ν_i over the group. (Quite appropriate if we had pooled carefully and presumably adequate for our present purposes.)
** u_G = 4(W_G/W_+)(1 − (W_G/W_+)); always common to only 2 groups.
*** Equal to ν_G here because there are only two groups (see below).

Exhibit 9
Group summary data for the short gravity example as grouped in Exhibit 5 and partially weighted in Exhibit 7

G (= group)   W_G           ν_G    W_G/W_+   u_G        ν*_G
L             1.976         12*    .982      .0703**    12***
H             .036          4*     .018      .0703**    4***
total         2.012 = W_+

* These are again the sums of ν_i over the group.
** u_G = 4(W_G/W_+)(1 − (W_G/W_+)); still common when there are only 2 groups.
*** Still equal to ν_G because there are only 2 groups.


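The formulas that we recommend, first for calculating some of the quantities above, and then going on to the final steps are:

\[ u_G = 4\,\frac{W_G}{W_+}\left(1 - \frac{W_G}{W_+}\right), \]

\[ \nu^*_G = \nu_G - \left(\text{lesser of } 4 \text{ and } \frac{\nu_G}{2.25}\right)\frac{\#\text{ of groups} - 2}{\#\text{ of groups} - 1}, \]

and

\[ s^2_{\mathrm{final}} = \frac{1}{W_+}\left(1 + \sum_G \frac{u_G}{\nu^*_G}\right), \qquad \frac{1}{\nu_{\mathrm{final}}} = \sum_G \frac{1}{\nu^*_G}\left(\frac{W_G}{W_+}\right)^{2}. \]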
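The formulas above can be sketched in Python as follows. The function name `final_stage` is hypothetical, and the leading factor is taken to be 1/W_+, following the arithmetic of Exhibit 10. The example call uses the group weights and degrees of freedom of Exhibit 8.

```python
def final_stage(W, nu):
    """Across-group variance and d.f. from group weights W_G and d.f. nu_G.

    Returns (s2_final, nu_final).
    """
    k = len(W)
    W_plus = sum(W)
    # Factor (# of groups - 2)/(# of groups - 1); zero for one or two groups.
    shrink = (k - 2) / (k - 1) if k > 1 else 0.0
    bracket = 1.0
    inv_nu_final = 0.0
    for W_G, nu_G in zip(W, nu):
        frac = W_G / W_plus
        u_G = 4 * frac * (1 - frac)
        nu_star = nu_G - min(4, nu_G / 2.25) * shrink
        bracket += u_G / nu_star
        inv_nu_final += frac ** 2 / nu_star
    return bracket / W_plus, 1.0 / inv_nu_final


s2_final, nu_final = final_stage([2.219, 0.156], [41, 32])
print(s2_final, nu_final)
```

With only two groups the adjustment term vanishes, so ν*_G equals ν_G, as noted under Exhibits 8 and 9.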

The latter two differ from those given by Cochran by the appearance of ν*_G rather than ν′_G in the formula for s²_final, and by its appearance in place of ν′_G in the formula for 1/ν_final. Here ν′_G (called f_i by Cochran) is given by

\[ \nu'_G = \nu_G - 4\,\frac{\#\text{ of groups} - 2}{\#\text{ of groups} - 1} \]

provided the ν’s are large enough, and with the aid of Cochran’s Table 12 otherwise. We believe that ν*_G, as defined above, adequately captures the results expressed in Cochran’s (1954) Table 12 and his discussion in Sections 8 and 9. In these days of computers, the advantage of formulas (and algorithms) over tables has been greatly increased. Exhibit 10 uses these formulas to reach estimated variances and estimated degrees of freedom for our two examples, as grouped, and for the full example, treated as in Exhibit 2. Our final results, then, are:

full, grouped: 980082.90 ± .65 (46.7 df)
short, grouped: 980087.05 ± .70 (12.4 df)
full, Exhibit 2: 980083.43 ± .73 (50.5 df)

where we regard the grouping (of Exhibits 4 and 6) as more satisfactory than using Cochran’s original prescription. In this example, then, the changes from the naive estimated standard errors were quite trivial. (We cannot always expect this to be the case.) In dealing with this example in the way we have, we must be careful to treat it only as an academic exercise. The optimistic approach, of acting as if the different series were estimating the same quantity, has no justification at all for this example, as is so frequently the case. The partially weighted means found here, as we hope to explain in more detail in a later paper, are, in fact, quite close to those that will be found using more powerful techniques. (Contrary to what might be supposed on the basis of some customary analyses.) The

Exhibit 10
The final steps of the two gravity examples.

A. The full example, as grouped.

\[ s^2_{\mathrm{final}} = \frac{1}{2.375}\left(1 + \frac{.247}{41} + \frac{.247}{32}\right) = .4268 = (.653)^2 \]

\[ \frac{1}{\nu_{\mathrm{final}}} = \frac{1}{41}(.934)^2 + \frac{1}{32}(.066)^2 = .0214 = 1/46.7 \]

B. The short example, as grouped.

\[ s^2_{\mathrm{final}} = \frac{1}{2.012}\left(1 + \frac{.0703}{12} + \frac{.0703}{4}\right) = .5087 = (.713)^2 \]

\[ \frac{1}{\nu_{\mathrm{final}}} = \frac{1}{12}(.982)^2 + \frac{1}{4}(.018)^2 = .0804 = 1/12.4 \]

C. The full example, as per Exhibit 2 (Cochran).

G      W_G            ν_G   W_G/W_+   u_G    ν′_G
L      1.7669         48    .9440     .211   45.33˙
11     .0470          10    .0251     .098   7.33˙
10     .0362          8     .0193     .076   5.33˙
12     .0216          7     .0115     .045   4.33˙
sum    1.8717 = W_+

\[ s^2_{\mathrm{final}} = \frac{1}{1.8717}\left(1 + \frac{.211}{45.33} + \frac{.098}{7.33} + \frac{.076}{5.33} + \frac{.045}{4.33}\right) = .5571 = (.746)^2 \]

\[ \frac{1}{\nu_{\mathrm{final}}} = \frac{1}{45.33}(.9440)^2 + \frac{1}{7.33}(.0251)^2 + \frac{1}{5.33}(.0193)^2 + \frac{1}{4.33}(.0115)^2 = .0198 = 1/50.5 \]

estimated standard errors are, of course, far too small (so that their estimated degrees of freedom are quite irrelevant). 9. Conclusions. We have presented and illustrated a novel scheme for partial weighting that a) makes use of the median behavior of order statistics from any multiple of chi-square (we choose to work with the order statistics of χ2 /ν), b) is likely to make more than one group of more than one series each,


c) represents, we believe, the best available technique for those rare situations, where it is reasonable to act as if the different series were all estimating the same value.

We hope to return to more realistic cases soon.

10. Acknowledgments. This work was supported in part by the National Science Foundation through Grant SES8023644 to Harvard University and prepared in part in connection with research at Princeton University sponsored by the U.S. Army Research Office (Durham). We owe the necessary careful checking of the numerical results and formulas to Cleo Youtz.

REFERENCES

Cochran, W.G. (1954). The combination of estimates from different experiments. Biometrics 10, 101-129.
Fisher, R.A. and Yates, F. (1963). Statistical Tables for Biological, Agricultural and Medical Research, sixth edition. New York, Hafner, especially table XX, page 94.
Heyl, P.R. and Cook, G.S. (1936). The value of gravity at Washington. J. Res. Nat. Bur. Stand. 17, 805-839.
Paull, A.E. (1950). On a preliminary test for pooling mean squares in the analysis of variance. Annals of Math. Statist. 21, 539-556.
Pearson, E.S. and Hartley, H.O. (1958(1962)). Biometrika Tables for Statisticians. Cambridge University Press. Especially table 28.
Smith, H.F. (1936). The problem of comparing the results of two experiments with unequal errors. Journal of the Council of Scientific and Industrial Research (Australia) 9, 211-212.
Tukey, J.W. (1948). Approximate weights. Annals of Math. Statist. 19, 91-92.
Welch, B.L. (1947). The generalization of “Student’s” problem when several different population variances are involved. Biometrika 34, 28-35.
Welch, B.L. (1949). Further note on Mrs. Aspin’s tables and on certain approximations to the tabled function. Biometrika 36, 293-296.
Wilson, E.B. and Hilferty, M.M. (1931). The distribution of chi-square. Proc. Nat. Acad. Sci. 17, 684-688.
Yates, F. and Cochran, W.G. (1938). The analysis of groups of experiments. Jour. Agr. Sci. 28, 556-580.

Reprinted from W.G. Cochran’s Contributions to Statistics (1984), P. Rao and J. Sedransk, eds. New York: Wiley, pp. 223–252

33. Combination of Results of Stated Precision: II. A More Realistic Case

Frederick Mosteller and John W. Tukey
Harvard University and Princeton University and Bell Laboratories

The first paper in this sequence (Mosteller and Tukey, 1982) dealt with the case where the estimates from the various series to be combined were thought to be estimating the same thing, the optimistic case.¹ It included a grouping procedure for first combining results of groups of (estimates from) series with plausibly similar population variances of the estimates, σ_i². This grouping was driven by comparison with order-statistic medians of appropriate χ²_ν/ν distributions and involved a factor-of-2 criterion, as we review here. Grouping can be conducted either in terms of s_i² (the estimate, for the ith series, of the variance, σ_i², of that series’s result) or in terms of s²(i) (the estimate, for the ith series, of the variance σ²(i) per individual in that series). In this paper we concentrate on the s_i² approach, referring the reader to Paper I (Mosteller and Tukey, 1982) for the changes needed for the s²(i) approach. Here we treat the more realistic case, where we admit that the results of the various series estimate different things, but where the weights for combining the results are not prescribed. This means that weights assigned according to precision can and usually do change what the combination estimates, but this is something we are prepared to accept. We begin, in Sections 2.1–2.3, with grouping by s_i². Because it leads to equal weights within each group, we can easily form group means, assess their variances and degrees of freedom, and then combine group means. Notice that if only one group is formed, this procedure assigns equal weights to all series.

¹ This work was supported in part by the National Science Foundation through Grant SES8023644 to Harvard University and was prepared in part in connection with research at Princeton University sponsored by the U.S. Army Research Office (Durham). We owe the necessary careful checking of the numerical results to Cleo Youtz and Katherine Halvorsen.


After a speculation about the primary example in Section 2.4, we next discuss, in Sections 3.1 to 3.5, what to do when a group is very small, so that its own estimate of variability among series means is either nonexistent or so variable as to make its replacement by a value or values from another group or groups desirable. Finally, in Section 4.1 we discuss a modification to the grouping procedure to make more complete allowance for differing numbers of degrees of freedom. In assessing estimated variance and assigned degrees of freedom for a weighted combination with individually estimated weights needed when we combine groups, we use modifications of Cochran’s techniques from Paper I (Mosteller and Tukey, 1982), and here described in Section 2.3.

1. INTRODUCTION

The combination of the results of series of experiments attracted Cochran for over 40 years (Cochran, 1982a, p. 26). Papers such as Yates and Cochran (1938), Cochran (1938), and Cochran (1954) attacked the problem broadly, while others (e.g., Rao, Kaplan, and Cochran, 1981) focus more on the impact of theoretical innovations. The student of this subject will find reading these papers essential and will find their references a good route to enter the related literature. Rao (1983) provides a summary of much of Cochran’s work on this problem.

The present paper, like Paper I (Mosteller and Tukey, 1982), has its roots in the idea of partial weighting (Yates and Cochran, 1938; Cochran 1954, p. 119) in which a substantial fraction of the experiments (those that appear less variable) receive equal weight, while others receive lesser weights. “The method prevents an experiment which happens to have a small estimated error from dominating the result, while allowing the less precise experiments to receive lower weights” (Cochran, 1954, p. 119). A good working rule was said (Cochran, 1954, referring to Cochran, 1937) to be to include one-half to two-thirds of the experiments in the equally weighted group.
Presumably Cochran intended an informal procedure—certainly very effective in his hands—to choose the exact cutoff. We propose here a specific formal procedure for doing this. Our method takes account of the detailed values of s2i [or of s2 (i)] and can lead in extreme cases to (a) all experiments getting equal weight or (b) only a few experiments of low apparent variability getting the same (high) weight. (We offer special techniques usable at the latter extreme.) The paper of Rao, Kaplan, and Cochran (1981) emphasized the comparison of various estimates related to C. R. Rao’s MINQUE estimates with more classical estimates of an analysis of variance character. It did not consider partially weighted estimates. (To include partial weighting it would have been necessary to specify the choice of a low-apparent-variability group in adequate detail.)

33 Combination of Results: More Realistic Case


Realistic combination of parallel results, such as location estimates, must allow for the near certainty that the results, though parallel, are estimating quantities that differ from one another. In more technical language we must allow for additional “among” variability, beyond the “within” variability assessed by the observed precisions. The classic treatment of this subject has assumed that the among variability was the same for all results, whatever their within variabilities seemed to be. A good case can be made, in many situations, that the among variability is proportional to the within variability. It seems reasonable to us that the among variability often increases as the internally assessed variability of a result (based on within variability) increases. In four examples we have studied, only two offer appreciable evidence either way. Both of these—one involving careful measurements of the acceleration of gravity (Heyl and Cook, 1936; Mosteller and Tukey, 1982), the other agricultural fertilizer trials of the early 1930s (see below)—support the hypothesis of increase. In Paper I (Mosteller and Tukey, 1982) we modified Cochran’s technique of grouping to combine results that each estimate the same quantity, but that have potentially different variability. We now extend this method to combine results when they may estimate different quantities. We call the treated situations series: most simply collections of observations, conceived of as random samples from a population, more generally data sets from which both a summary value for the series and an internal estimate of variance of that value, and an assigned d.f. (degrees of freedom) for that estimate of variance can be obtained. 
In this paper we stress the difference between the internal variability of a series summary, which can be estimated from differences within the series, and the total variability of that summary, which usually includes variability, which we call excess or among variability, which cannot be so estimated. (We call it “among” since it reflects differences among what the different series are estimating.) In particular, we might have

\[ y_{ij} = \mu + \alpha_i + \epsilon_{ij} \]

where i labels a series and j an observation within a series. Here µ + α_i is a constant for the ith series (what the ith series estimates) and the ε_ij are independent with mean zero and with variance which may depend upon i, as may the range of j. (The structure of the observations within a series need not be so simple. Thus, factorial structure is likely, and the results to be combined may be differences or regression coefficients.)

The classical add-on model assumes the same excess variability for every series. An alternative scale-up model does not assume a common excess variance, but rather that the excess variability of a result is proportional to its internally estimated variability. Although still other models may be desirable, these two already offer a variety of methods of combining results based on an algorithmic approach that we illustrate.
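One way to write the two models side by side is sketched below. The symbols τ² (a common excess variance) and k (a constant of proportionality) are not in the original and are introduced here only for illustration, with σ_i² standing for the internal variance of the ith result y_i.

\[ \text{add-on:}\quad \operatorname{var}(y_i) = \sigma_i^2 + \tau^2, \qquad\qquad \text{scale-up:}\quad \operatorname{var}(y_i) = \sigma_i^2 + k\,\sigma_i^2 = (1 + k)\,\sigma_i^2 . \]

Under this reading, the scale-up model leaves the ratios of the total variances equal to the ratios of the internal variances, whereas the add-on model pulls those ratios toward 1 as τ² grows.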


The approach used collects series into groups based on their internal variability. For a group, it treats all results (one per series) as if they had the same variability. After forming groups, we combine the results from the series within the group and estimate the variability of the combination generated by the group. Then we combine the combinations from the groups to get a grand combination and an estimate of its uncertainty. Notice that if only one group is formed, this procedure assigns equal weight to all series.

2. GROUPING THE s_i²

2.1. The Grouping Process, Illustrated on the Lewis and Trevains Example

In their 1938 paper, Yates and Cochran (p. 570 ff) rediscussed the analysis of 13 parallel 3 × 3 Latin squares, originally described by Lewis and Trevains (1934). We shall now discuss this yet again. In one part of this agricultural experiment, 13 (= n) “centers” participated, with center i producing a series of measurements with result y_i (a gain in yield from use of a new fertilizer) and an estimated variance of that result, s_i². In this analysis each s_i² has 2 (= ν) degrees of freedom. (In many problems, the series have differing numbers of degrees of freedom.) The centers are ordered in Table 1 according to the size of s_i².

Our grouping proceeds by ordering the s_i² [in such an instance, where degrees of freedom are all the same, this is equivalent to working with the s²(i) themselves], calculating values

c_i = c(i|n, ν), here = c(i|13, 2),

where c(i|n, ν) is chosen to be approximately the median of the ith smallest of n values sampled from χ²_ν/ν, and working with the standardized ratios s_i²/c_i. We use c_i for brevity in tables and formulas, even though its meaning changes with n and ν both from group to group and from problem to problem. In the special case when all the σ_i² equal σ², these ratios s_i²/c_i can be thought of as estimates of σ² taking ordinal position into account.

As a minimum, we group together, as a low group, any series (a) for which this ratio is less than twice its lowest value or (b) where (a) applies to some larger value of s_i². (Other factors than 2 could be considered.) By requiring (b), we ensure that the s_i² for the low group include, with s_i², any s_j² < s_i². With a tentative low group of size k formed, we can then try to add one more series to it by repeating the calculation for a proposed group of only k + 1 series (n will be replaced by k + 1) and repeating the trial (for k + 2, etc.) if successful. Then we take the series not included in the final low group and repeat the whole calculation. Table 1 shows these stages for the Lewis and Trevains data.

Table 1. Grouping the Series in the Lewis and Trevains Example

                          Stage 1 (n = 13)      Stage 2 (n = 4)       Stage 3 (n = 10)
 i    Center   s_i²       c_i      s_i²/c_i     c_i      s_i²/c_i     new i   c_i      s_i²/c_i
 1    10       0.0046     0.0513   0.090        0.1671   0.0275
 2    11       0.0073     0.1335   0.055        0.4855   0.0150
 3    9        0.0184     0.2231   0.082        0.9555   0.0193
 4    8        0.1107     0.3216   (0.344)      1.8718   (0.0591)     1       0.0667   (1.660)
 5    12       0.1920     0.4308   (0.446)                            2       0.1759   (1.092)
 6    5        0.2638     0.5534   (0.477)                            3       0.2985   0.884
 7    13       0.2706     0.6931   (0.390)                            4       0.4383   0.617
 8    2        0.3543     0.8557   (0.414)                            5       0.6008   0.590
 9    7        0.4803     1.0498   (0.458)                            6       0.7949   0.604
 10   3        0.5329     1.2910   (0.413)                            7       1.0361   0.514
 11   1        0.8599     1.6094   (0.534)                            8       1.3545   0.635
 12   4        1.1528     2.0794   (0.554)                            9       1.8245   0.632
 13   6        1.7249     2.9957   (0.576)                            10      2.7408   0.629

Notes:
1. In all s_i²/c_i columns, ν = 2 and values greater than twice the smallest are in parentheses.
2. The temporary low group consists of 3 series, so we next try n = 4. Stage 2 then shows that the final low group also consists of 3 series.
3. The remaining 10 series, analyzed in a similar way in Stage 3, are now accepted as a second (high) group.
4. Throughout c_i = c(i|n, 2) = − ln[(3(n + 1 − i) − 1)/(3n + 1)].
5. Source: Yates and Cochran (1938), p. 570.
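The Stage 1 columns of Table 1 can be recomputed from the s_i² alone with the formula in note 4. The following Python sketch does so. The variable names are hypothetical, and the parentheses mark ratios greater than twice the smallest, as in note 1.

```python
import math

# s_i^2 for the 13 centers, in the ascending order used in Table 1.
s2 = [0.0046, 0.0073, 0.0184, 0.1107, 0.1920, 0.2638, 0.2706,
      0.3543, 0.4803, 0.5329, 0.8599, 1.1528, 1.7249]
n = len(s2)

# Note 4: c(i|n, 2) = -ln[(3(n + 1 - i) - 1)/(3n + 1)] for i = 1..n.
c = [-math.log((3 * (n + 1 - i) - 1) / (3 * n + 1)) for i in range(1, n + 1)]
ratios = [s / ci for s, ci in zip(s2, c)]

limit = 2 * min(ratios)
flagged = [r > limit for r in ratios]  # True where Table 1 uses parentheses

for i, (ci, r, f) in enumerate(zip(c, ratios, flagged), start=1):
    shown = f"({r:.3f})" if f else f"{r:.3f}"
    print(f"{i:2d}  {ci:6.4f}  {shown}")
```

Only the first three series stay below the limit, which matches the temporary low group of 3 series described in note 2.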

While the ratio of extreme s2i was 375 to 1, that of extreme c(i|13, 2) was 58 to 1, so that initially the s2i /ci varied only over about 10 to 1 (0.576 to 0.055). The low group clearly consists of centers 10, 11, and 9 (i = 1, 2, 3) and when we turn to the remaining 10 series, our algorithm gives no evidence for heterogeneity—the s2i varying by 16 to 1, while the ci vary by 41 to 1. Note that the two parenthesized ratios in the final column are absorbed under rule (b), which groups a series with the rest if a series further down the list is included because it has s2i /ci less than twice the lowest such ratio. This leads us to two groups, one of 3 series (3 centers), the other of 10. Computations show that for the ith order statistic, yi|n of a sample of n drawn from any distribution F (y): 48.66% ≤ Prob{yi|n ≤ F −1 ((3i − 1)/(3n + 1))} ≤ 51.34% and that the approximation to 50% is usually much better than these bounds indicate. Thus the value of y where F (y) = (3i − 1)/(3n + 1) is an entirely adequate approximation to the median of yi|n (cf. Mosteller and Tukey, 1982). Our concern is with order statistics of χ2ν /ν. For small ν we have a folded


Frederick Mosteller and John W. Tukey

normal or exponential distribution. For larger ν, the Wilson-Hilferty (1931) approximation offers sufficient accuracy. Thus, for small values of ν we use the following formulas: for ν = 1

$$c(i|n,1) = \left[\mathrm{Gau}^{-1}\!\left(\frac{3i+3n}{6n+2}\right)\right]^{2},$$

for ν = 2

$$c(i|n,2) = -\ln\!\left[\frac{3(n+1-i)-1}{3n+1}\right],$$

and, for ν ≥ 3, the Wilson-Hilferty approximation to the percentage points of χ²

$$c(i|n,\nu) = \left[1 - \frac{2}{9\nu} + c(i|n)\sqrt{\frac{2}{9\nu}}\,\right]^{3}$$

where

$$c(i|n) = \mathrm{Gau}^{-1}\!\left(\frac{3i-1}{3n+1}\right)$$

and $\mathrm{Gau}^{-1}$ is the inverse of the unit Gaussian cumulative.

If we had grouped the $s^2(i)$ instead of the $s_i^2$, we would have taken account of their assigned d.f. in pooling the $s^2(i)$ for $i$ in group $G$. And we would have weighted each series proportionally to the number of individual observations used in that series's result. Thus the summary quantities for each group would have been somewhat different. (The examples presented in Paper I illustrate the necessary changes.)

2.2 Analyzing the Groups

In our example, we have a low group L and a high group H. Let the general index for a group be G, consisting of #G series. In our example the numbers of series in each group are #L = 3 and #H = 10. In general, for each group G, we now have a combined $s_G^2$, the mean of the $s_i^2$ for the series included in group G, and an $MS_G$, the mean square among the results of the series of G:

$$MS_G = \frac{1}{\#G - 1}\sum_i (y_i - \bar{y}_G)^2$$

where the summation is over the #G values of $i$ in G, and where $\bar{y}_G$ is the arithmetic mean of the $y_i$ in G. Because we believe that each series has its own true mean $\mu_i$, we think that even for the series in a group the expected variation among results exceeds the average variation within series. Therefore we expect that in the long run

$$\mathrm{ave}\{s_G^2\} \le \mathrm{ave}\{MS_G\}$$

and we note that $s_G^2$ will ordinarily involve considerably more degrees of freedom, $\nu_G$, than does $MS_G$ with its #G − 1 degrees of freedom. In general, $\nu_G$ is

33 Combination of Results: More Realistic Case


the sum of the degrees of freedom associated with each of the series in group G. When all ν's are equal, as in our example, $\nu_G$ equals ν times #G. In the present example, the ratio $\nu_G/(\#G - 1)$ is unusually small, close to 2.

Now we consider two different ways of combining $s_G^2$ and $MS_G$ to get an estimated variance and assigned degrees of freedom for $\bar{y}_G$: one ultraconservative, the other, called Paull-2, that, according to Cochran (1954, p. 106), is a relatively safe sometimes-pool procedure that "may slightly overestimate the standard error." Because we often have few series per group, using $MS_G$ itself would hardly be a satisfactory procedure.

In ultraconservative (UC) combination we use whichever of $s_G^2$ and $MS_G$ is larger, keeping the corresponding degrees of freedom.

In "Paull-2" combination, we look first at $MS_G/s_G^2$. If $MS_G/s_G^2 < 2$, we pool, taking

$$\text{estimated variance of } \bar{y}_G = \frac{1}{\#G}\left[\frac{(\#G-1)\,MS_G + \nu_G\, s_G^2}{(\#G-1)+\nu_G}\right], \qquad \text{assigned d.f.} = (\#G-1)+\nu_G .$$

If $MS_G/s_G^2 \ge 2$, we use $MS_G$ with its assigned d.f.; thus in this case we do not pool, but use instead

$$\text{estimated variance of } \bar{y}_G = MS_G/\#G, \qquad \text{assigned d.f.} = \#G - 1 .$$

This rule was constructed (Paull, 1950) in the course of optimizing the performance of a sometimes-pool rule. It saves effective degrees of freedom as best it can by never taking the smaller of the two estimates. (The most important reason for allowing for degrees of freedom is to compensate for the occurrence of estimates of variability that are inappropriately low.)

Tables 2 and 3 carry the Lewis and Trevains example onward, obtaining the estimated variances (with their assigned d.f.) for each group and both sorts of combination. In the tables, for the low group we set G = L and for the high group we set G = H.

2.3. Conclusion of the Lewis and Trevains Example

We are now ready to combine the two group means, using the modified form of Cochran's calculations presented, with some new notation, in Mosteller and Tukey, 1982.
The new notation uses $W_G$ for the weight to be attached to the result from group G, and $W_+$ for the sum of the weights. The $u_G$ is a convenient form of correction term in computing the variance for the result from combining two groups. The average of the degrees of freedom $\nu_G$ for the groups is $\bar{\nu}$. Finally, $\nu_G^*$ is an adjusted d.f. for group G, which happens to be identical with $\nu_G$ when only two groups are used. Table 4 has the numbers. The formulas are (see Mosteller and Tukey, 1982)


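$$W_+ = \sum W_G$$

$$u_G = 4\,\frac{W_G}{W_+}\left(1 - \frac{W_G}{W_+}\right) \qquad (\text{always} \le 1)$$

$$\nu_G^* = \nu_G - \left(\text{lesser of } 4 \text{ and } \frac{\nu_G}{2.25}\right)\frac{\text{number of groups} - 2}{\text{number of groups} - 1}$$

$$y_{\text{final}} = \sum \frac{W_G}{W_+}\,\bar{y}_G$$

$$s^2_{\text{final}} = \frac{1}{W_+}\left[1 + \sum \frac{u_G}{\nu_G^*}\right], \qquad \frac{1}{\nu_{\text{final}}} = \sum \frac{1}{\nu_G^*}\left(\frac{W_G}{W_+}\right)^{2}$$

and

$$s^2_{\text{final}*} = \frac{s^2_{\text{final}}}{1 + 0.12/(\bar{\nu} - 1)} \qquad (\text{used for 2 groups only})$$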

This last adjustment, from $s^2_{\text{final}}$ to $s^2_{\text{final}*}$, was not given in Paper I. It is the result of examining the first column of Table 4 of Cochran and Carroll (1953), which contains the ratios of Cochran and Carroll's expression (8) to (n + 2)/(n + 1). These ratios vary from 1.125 to 1, but, after division by 1 + [0.12/(ν − 1)], they vary only from 1.004 to 0.996. Cochran and Carroll's table deals only with the case $\nu_L = \nu_H = \nu$, and the more general use of $\bar{\nu}$ is probably conservative. We find this adjustment barely worthwhile, but include it because the methods we propose so often involve a final combination (with estimated weights) of exactly two groups.

Table 2. Further Calculations for the Low Group (G = L) of Lewis and Trevains Example

   i    Center    y_i     s_i^2
   1      10      0.14    0.0046
   2      11      0.08    0.0073
   3       9      0.08    0.0184
  sum             0.30    0.0303

ȳ_L   = 0.30/3 = 0.10
MS_L  = (1/2)[(0.04)^2 + (−0.02)^2 + (−0.02)^2] = 0.0012
s_L^2 = 0.0303/3 = 0.0101
ν_L   = 2 × 3 = 6

UC combination: s_L^2 > MS_L, and so we use s_L^2 and its associated d.f., 2 × 3 = 6
  estimated variance of ȳ_L = (1/3)(0.0101) = 0.00337
  assigned d.f. = 6

Paull-2 combination: MS_L/s_L^2 = 0.0012/0.0101 < 2, therefore pool.
  estimated variance of ȳ_L = (1/3)[(2(0.0012) + 6(0.0101))/(2 + 6)] = (1/3)(0.0079) = 0.00263
  assigned d.f. = 8


Table 3. Further Computations for the High Group (G = H) of the Lewis and Trevains Example

   i    Center     y_i     s_i^2
   4       8      −0.30    0.1107
   5      12      −0.29    0.1920
   6       5      −1.10    0.2638
   7      13      −0.19    0.2706
   8       2       0.96    0.3543
   9       7       0.25    0.4803
  10       3       0.83    0.5329
  11       1       1.94    0.8599
  12       4      −0.92    1.1528
  13       6      −0.01    1.7249
  sum              1.17    5.9422
  sum of squares   7.7033

ȳ_H        = 1.17/10 = 0.117
MS_H       = (1/9)[7.7033 − (1.17)^2/10] = 0.841
s_H^2      = 5.9422/10 = 0.594
ν_H        = 2(10) = 20
MS_H/s_H^2 = 0.841/0.594 = 1.42 < 2
#H − 1     = 9

UC combination: MS_H > s_H^2, use MS_H with 9 d.f.
  estimated variance of ȳ_H = (1/10)(0.841) = 0.0841
  assigned d.f. = 9

Paull-2 combination: MS_H/s_H^2 < 2, therefore pool.
  estimated variance of ȳ_H = (1/10)[(9(0.841) + 20(0.594))/(9 + 20)] = 0.0671
  assigned d.f. = 29
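The two combination rules of Section 2.2 are short enough to state as code. The following is a minimal sketch, with function and argument names of our own choosing; it takes the among-series mean square, the pooled within-series variance, the number of series, and the within d.f. of a group, and returns the estimated variance of the group mean together with its assigned d.f.

```python
def uc_combination(ms, s2, n_series, nu_g):
    """Ultraconservative rule: keep the larger of s2 and MS, with its own d.f."""
    if s2 > ms:
        return s2 / n_series, nu_g
    return ms / n_series, n_series - 1


def paull2_combination(ms, s2, n_series, nu_g):
    """Paull-2 rule: pool MS and s2 when MS/s2 < 2, otherwise use MS alone."""
    among_df = n_series - 1
    if ms / s2 < 2:
        pooled = (among_df * ms + nu_g * s2) / (among_df + nu_g)
        return pooled / n_series, among_df + nu_g
    return ms / n_series, among_df


# Inputs taken from Tables 2 and 3
low_uc = uc_combination(0.0012, 0.0101, 3, 6)
low_p2 = paull2_combination(0.0012, 0.0101, 3, 6)
high_uc = uc_combination(0.841, 0.594, 10, 20)
high_p2 = paull2_combination(0.841, 0.594, 10, 20)
```

Run on the summary values of Tables 2 and 3, these calls give back the tabled variances (0.00337 and 0.00263 for the low group, 0.0841 and 0.0671 for the high group) with assigned d.f. of 6, 8, 9, and 29.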

Notice that the difference between the results of the two analyses is trivial so far as $y_{\text{final}}$ goes, but appreciable as far as $s^2_{\text{final}*}$ goes. When we change our choice of combination, we generally expect less important changes in $y_{\text{final}}$ than in $s^2_{\text{final}*}$. We next need corresponding two-sided 5% values of t. Interpolating in 1/ν, for ν = 6.5 (UC combination) and ν = 8.6 (Paull-2 combination), we find 2.403 and 2.278, respectively, so that the 95% confidence intervals are

0.101 ± 2.403(0.0575) = from −0.037 to 0.239   (UC combination)

and

0.101 ± 2.278(0.0507) = from −0.014 to 0.216   (Paull-2 combination)

both covering zero. The Paull-2 result is probably the more accurate one; that is, it comes nearer to the advertised 95% level. Its lower end is near enough to zero to support a strong suspicion that the actual difference is positive.
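The interpolation in 1/ν can be checked directly. The sketch below assumes the standard tabled two-sided 5% points of Student's t for 6 to 9 d.f. (2.447, 2.365, 2.306, 2.262); the helper names are ours, and the small lookup table would need extending for other d.f.

```python
from math import floor

# Two-sided 5% points of Student's t for integer d.f. (standard table values)
T_5PCT = {6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def t_interpolated(nu: float) -> float:
    """Interpolate the tabled t value linearly in 1/nu between integer d.f."""
    lo = floor(nu)
    if nu == lo:
        return T_5PCT[lo]
    hi = lo + 1
    frac = (1 / lo - 1 / nu) / (1 / lo - 1 / hi)
    return T_5PCT[lo] + frac * (T_5PCT[hi] - T_5PCT[lo])


def interval(center: float, se: float, nu: float):
    """95% interval: center plus or minus t times the standard error."""
    half = t_interpolated(nu) * se
    return center - half, center + half


uc_interval = interval(0.101, 0.0575, 6.5)
p2_interval = interval(0.101, 0.0507, 8.6)
```

Both intervals agree with those quoted above to within rounding in the third decimal place.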


Table 4. The Final Stages of Combination for the Lewis and Trevains Data

UC Combination
  Group    W_G (= 1/var)    ν for group    W_G/W_+    u_G      ν*_G    ȳ_G
  L          296.7              6          0.9614     0.148    6^a     0.100
  H           11.9              9          0.0386     0.148    9       0.117
  total      308.6 = W_+

  y_final    = (0.9614)(0.100) + (0.0386)(0.117) = 0.101
  s^2_final  = (1/308.6)(1 + 0.148/6 + 0.148/9) = 0.00337 = (0.058)^2
  1/ν_final  = (1/6)(0.9614)^2 + (1/9)(0.0386)^2 = 0.1542 = 1/6.5
  s^2_final* = 0.00337/(1 + 0.018) = 0.00331 = (0.0575)^2 = 1/302

Paull-2 Combination
  Group    W_G (= 1/var)    ν for group    W_G/W_+    u_G      ν*_G    ȳ_G
  L          380.2              8          0.9623     0.145    8^a     0.100
  H           14.9             29          0.0377     0.145    29      0.117
  total      395.1 = W_+

  y_final    = (0.9623)(0.100) + (0.0377)(0.117) = 0.101
  s^2_final  = (1/395.1)(1 + 0.145/8 + 0.145/29) = 0.00259 = (0.0509)^2
  1/ν_final  = (1/8)(0.9623)^2 + (1/29)(0.0377)^2 = 0.1158 = 1/8.6
  s^2_final* = 0.00259/(1 + 0.0069) = 0.00257 = (0.0507)^2 = 1/389

^a Since there are only two groups, ν*_G = ν_G.
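The arithmetic of Table 4 can be sketched as a single function. This is an illustration under our own naming (`combine_groups` and its dictionary keys are not from the paper); the ν*_G adjustment is coded as we read the formula in Section 2.3, and only the two-group case, where the adjustment vanishes, is checked against the table.

```python
def combine_groups(groups):
    """groups: list of (group mean, estimated variance of that mean, assigned d.f.)."""
    k = len(groups)
    if k < 2:
        raise ValueError("need at least two groups")
    weights = [1.0 / var for _, var, _ in groups]
    w_plus = sum(weights)
    share = [w / w_plus for w in weights]            # W_G / W_+
    u = [4.0 * p * (1.0 - p) for p in share]         # correction terms u_G
    shrink = (k - 2) / (k - 1)                       # zero when k == 2
    nu_star = [nu - min(4.0, nu / 2.25) * shrink for _, _, nu in groups]
    out = {
        "y_final": sum(p * y for p, (y, _, _) in zip(share, groups)),
        "s2_final": (1.0 + sum(ui / ns for ui, ns in zip(u, nu_star))) / w_plus,
        "nu_final": 1.0 / sum(p * p / ns for p, ns in zip(share, nu_star)),
    }
    if k == 2:
        # extra adjustment, used for two groups only
        nu_bar = sum(nu for _, _, nu in groups) / k
        out["s2_final_star"] = out["s2_final"] / (1.0 + 0.12 / (nu_bar - 1.0))
    return out


uc = combine_groups([(0.100, 0.00337, 6), (0.117, 0.0841, 9)])
paull2 = combine_groups([(0.100, 0.00263, 8), (0.117, 0.0671, 29)])
```

Because the inputs are the rounded variances from Tables 2 and 3, the results agree with Table 4 to about three significant figures rather than exactly.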

2.4. What Actually Happened in the Lewis and Trevains Experiment? If we were half a century closer in time to the actual data collection, it would be easier to try to assess what really happened. Two hypotheses seem, at first glance, reasonable competitors. 1. Some oddity, perhaps a consequence of the granularity of recording, made the three smallest s2i smaller than they should have been—truly the σi2 were all the same. (If this were so, our confidence intervals above would be too short.) 2. The three centers—9, 10, and 11—did have smaller σi2 , the remaining ten σi2 could well all be the same.


For completeness, we return to the data and look at the apparent relation between the s2i , the mean response to the two fertilizers, and the yield with no fertilizer. Here, as Table 5 shows, we may see a touch of relationship, but it is far from being strong enough for us to take seriously. To explore (1) somewhat further, we return to Table 1, and find the median value of s2i /ci , for the 10 highest of s2i among all 13, to be 0.452. To obtain 0.452 for the three smallest s2i would have required s2i of 0.023, 0.060, and 0.101 (rather than the 0.0046, 0.0073, and 0.0184 that were observed). Could this have been the result of some oddity? When we go to the original Lewis and Trevains paper we find: 1. Centers 8, 9, 10, and 11 (the four with smallest s2i ) were the four centers in Yorkshire. 2. The centers with most acid soil were, in order starting with the most acid, 8, 7, 10, 9, 1, 12, 11, and 4. All this is somewhat compatible with a distinctive group with small σi2 , but is not very persuasive. A truly conservative analyst would have, we think, to borrow the variability estimate for these three centers from the group of 10. The statistical evidence for different σi2 for groups L and H is strong enough to rule out pooling. Probably, then, we should either take the low-s2i group at face value, or else we should borrow from the group of 10.

Table 5. Relation of s_i^2 to Mean Response and Base Yield

                       Mean      Base                Ranks^b
  Center    s_i^2    Response    Yield    Sum^a    (response, yield, sum)
    10      0.0046     3.44       7.42    10.86      7.5     4     3
    11      0.0073     1.72       3.70     5.42      1       1     1
     9      0.0184     3.04       9.46    12.50      5       8     7
     8      0.1107     2.33       8.99    11.32      2       7     4
    12      0.1920     3.24       9.62    12.86      6      10     8
     5      0.2638     2.63      11.88    14.51      4      12    12
    13      0.2706     4.16       9.47    13.63     11       9    10
     2      0.3543     4.61       7.19    11.80     12       3     6
     7      0.4803     3.94       6.20    10.14     10       2     2
     3      0.5329     3.44      10.07    13.51      7.5    11     9
     1      0.8599     7.28       7.44    14.72     13       5    13
     4      1.1528     3.69       7.74    11.43      9       6     5
     6      1.7249     2.42      11.94    14.36      3      13    11

  Medians of
   first 5  0.0184     3.04       8.99    11.32      5       7     4
   last 5   0.8599     3.69       7.74    13.51      9       6     9

^a Sum of two preceding columns.
^b Ranks of each of three preceding columns.


3. VERY SMALL GROUPS 3.1 Borrowing: Which? When we have very small groups—this means all groups consisting of only one series, most groups consisting of exactly two series, many groups of three series—using the techniques just explained does not help much. With 0, 1, or even 2 degrees of freedom in MSG , we hardly know enough about the combination of among variance and within variance to be comfortable. The natural thing to do is to borrow information—either about among variability or about the sum of among and internal variability. Two points need to be stressed: 1. The choice of what we borrow may be important. 2. How we borrow ought to depend somewhat upon whether we are borrowing for a low group, an intermediate group, or a high group. Let us introduce the notation Σ 2 for the among variance common to each series and σi2 for the within series variability of a result. Then we could write in words the add-on variability of a result for a series as (common among variance) + (individual per-series within variance) or in symbols Σ 2 + σi2 . If each series had its own within variance, σ 2 (i) per individual observation, we could write, in the simplest case, (common among variance) +

(individual per-observation variance)/(series length)

or in symbols

$$\Sigma^2 + \frac{\sigma^2(i)}{1 + \nu_i}$$

where 1+νi is the number of observations in series i. This latter seems natural where the number of observations in a series varies in some way unrelated to both difficulty (variance, etc.) and level of measurement, and the observations in a series are a reasonable facsimile of a random sample from something. (This might happen, for instance, when the number of replications depends on an individual investigator’s energy, but all investigators used equal care.) The former often seems natural when the within variance (for a series result) is not clearly associated with series length and thus comes, in large part, from some nonsampling source [as seems to be the case, for instance, with Heyl and Cook’s (1936) precision measurement of the acceleration of gravity in Washington] (See Paper I.) Equally cogent, we believe, is the point of view of scale-up variability in which each series is thought to have variance


(individual per-series within variance) × (constant factor)

or in symbols $\sigma_i^2(1 + c)$, or

(individual per-observation within variance) × (constant factor) / (series length)

or in symbols $\sigma^2(i)(1 + c)/(1 + \nu_i)$.

This version is particularly apt when measurements are easier in some series and harder in others, or, often, when measurement involves better techniques in some series, worse in others. It also applies in the presence of any of a variety of other anomalies. Sometimes we can be reasonably clear about whether add-on or scale-up is appropriate. When we are so fortunate as to recognize that we are in such a situation, we should behave accordingly. Otherwise, we plan to compromise between these possibilities.

3.2. Two (or More) Substantial Groups

There will sometimes be three or more groups. And in some of these instances two (rarely more) of these groups will be substantial (i.e., not very small). If A and Z are the substantial groups, and Q is a very small group for which we wish to borrow, then we can ask groups A and Z whether add-on or scale-up is preferred. If there were pure add-on variability, both

(#A) × (estimated per-group variance for A) − $s_A^2$

and

(#Z) × (estimated per-group variance for Z) − $s_Z^2$

would be estimating the add-on (ao) among variance. Let

$$z_{ao} = \frac{1}{2}\ln\!\left(\frac{\text{larger of these (ao) estimates}}{\text{the smaller (ao) estimate}}\right).$$

If, by contrast, there were pure scale-up variability, both

(#A) × (estimated per-group variance for A)/$s_A^2$

and

(#Z) × (estimated per-group variance for Z)/$s_Z^2$

would be estimating the scale-up (su) factor for among variance. Let


$$z_{su} = \frac{1}{2}\ln\!\left(\frac{\text{larger of these (su) estimates}}{\text{the smaller (su) estimate}}\right).$$

(Although an argument can be made for subtracting 1 from both numerator and denominator of this ratio, we deem it undesirable, and so, particularly for the rough-and-ready analysis proposed here, we do not recommend it.) We need to decide whether $z_{ao}$ is much nearer 0 than $z_{su}$, or vice versa, or neither. Thus we wish to separate the $(z_{ao}, z_{su})$ plane into three parts. Some numerical research might ultimately help with this decision. We have, rather arbitrarily, chosen

  $z_{ao} < \tfrac{1}{2} z_{su}$: borrow on add-on basis
  $z_{su} < \tfrac{1}{2} z_{ao}$: borrow on scale-up basis
  otherwise: borrow the average of add-on and scale-up.

Let us abbreviate estimated per-group variance as epgv. Thus when $z_{ao} < \tfrac{1}{2} z_{su}$ we borrow on an add-on basis using

$$\text{epgv for } Q = \frac{1}{\#Q}\left[s_Q^2 + \tfrac{1}{2}\#A\,(\text{epgv for } A) + \tfrac{1}{2}\#Z\,(\text{epgv for } Z) - \tfrac{1}{2}s_A^2 - \tfrac{1}{2}s_Z^2\right].$$

If $z_{su} < \tfrac{1}{2} z_{ao}$ we borrow on a scale-up basis taking

$$\text{epgv for } Q = \frac{s_Q^2}{\#Q}\left[\frac{\#A\,(\text{epgv for } A)}{2 s_A^2} + \frac{\#Z\,(\text{epgv for } Z)}{2 s_Z^2}\right].$$

If the preference is unclear, it seems reasonable to average the values of the two expressions just given.

We have also to deal with the cases where $MS_A < s_A^2$ (when group A shows no tendency to have any additional among-group variance) or where $MS_Z < s_Z^2$ (when this is true for Z) or where both inequalities hold. If A < Z, and $MS_A < s_A^2$, but $MS_Z > s_Z^2$, which is $> s_A^2$, it is harder to believe in "add-on" than in "scale-up," so we use "scale-up." If $MS_Z < s_Z^2$, but $MS_A > s_A^2$, which is $< s_Z^2$, it is harder to believe in "scale-up" than in "add-on," so we use "add-on." If both $MS_A < s_A^2$ and $MS_Z < s_Z^2$, so that there is no evidence for additional among variability, we return to the analysis of Paper I. To summarize these additional cases:

Group A lower than Group Z (i.e., $s_A^2 < s_Z^2$)

  $MS_A < s_A^2$,  $MS_Z > s_Z^2 > s_A^2$            ⇒ scale up
  $MS_Z < s_Z^2$,  $MS_A > s_A^2$ ($s_A^2 < s_Z^2$)   ⇒ add on
  $MS_A < s_A^2$,  $MS_Z < s_Z^2$                     ⇒ Paper I analysis.
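The decision rule of Section 3.2 for the ordinary case (both substantial groups showing additional among variability) can be sketched as follows. The function names and the tuple layout are our own, and the numbers in the usage lines are hypothetical values chosen so that each branch is exercised; they do not come from any data set discussed here. The special cases summarized above, where MS falls below s² in a substantial group, are not handled and raise an error instead.

```python
from math import log


def half_log_ratio(a: float, b: float) -> float:
    """(1/2) ln(larger / smaller); both estimates must be positive."""
    if a <= 0 or b <= 0:
        raise ValueError("nonpositive estimate: use the special-case rules instead")
    return 0.5 * log(max(a, b) / min(a, b))


def borrow_for_q(s2_q, n_q, group_a, group_z):
    """group_a, group_z: (number of series, epgv, pooled s2) for A and Z.
    Returns the basis chosen and the borrowed epgv for the very small group Q."""
    (n_a, epgv_a, s2_a), (n_z, epgv_z, s2_z) = group_a, group_z
    z_ao = half_log_ratio(n_a * epgv_a - s2_a, n_z * epgv_z - s2_z)
    z_su = half_log_ratio(n_a * epgv_a / s2_a, n_z * epgv_z / s2_z)
    add_on = (s2_q + 0.5 * (n_a * epgv_a + n_z * epgv_z - s2_a - s2_z)) / n_q
    scale_up = (s2_q / n_q) * 0.5 * (n_a * epgv_a / s2_a + n_z * epgv_z / s2_z)
    if z_ao < 0.5 * z_su:
        return "add-on", add_on
    if z_su < 0.5 * z_ao:
        return "scale-up", scale_up
    return "average", 0.5 * (add_on + scale_up)


# Hypothetical inputs: A = (5 series, epgv 0.4, s2 1.0), Q has one series with s2 = 4
case_add = borrow_for_q(4.0, 1, (5, 0.4, 1.0), (5, 2.2, 10.0))
case_scale = borrow_for_q(4.0, 1, (5, 0.4, 1.0), (5, 4.0, 10.0))
case_mixed = borrow_for_q(4.0, 1, (5, 0.4, 1.0), (5, 2.4, 10.0))
```

In the first case both groups imply the same add-on among variance (1.0), so Q borrows it and gets 5.0; in the second both imply the same scale-up factor (2), so Q's variance is doubled to 8.0; the third falls between and takes the average of the two borrowings.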


3.3. More Than One Very Small Group

If two or more very small groups are adjacent, we treat each as if it were there alone. A very small group will thus either be an instance of the work of Section 3.2 (where there are two substantial groups), or will be treated as either a high or a low group (where there is only one substantial group).

3.4. Borrowing for a Low Group

When our very small group consists of the series with the smallest $s_i^2$, we must be particularly careful. Very small groups of large $\sigma_i^2$ are fairly common, but very small groups of truly small $\sigma_i^2$ are rare. Thus we need extra caution when our very small group involves the smallest $s_i^2$. Almost the most conservative reaction, one not infrequently justified, is to say, in effect, "I don't believe these little $s_i^2$, so I will borrow the estimated per-series variance from the next larger group." In the Lewis and Trevains example this would mean (see Table 3) taking a UC variance per series of 0.841 or a Paull-2 variance per series of 0.671 as appropriate for all 13 series, with the results in Table 6. The resulting intervals are much longer than when we gave full credence to the apparent quality of the series of the low group.

Table 6. Equal Weighting in the Lewis and Trevains Example As a Result of Total Borrowing^a

UC Combination (estimated variance per series = 0.841)
  Group    #G     ȳ_G
  L         3    0.100
  H        10    0.117
  pool           0.113

  y_final   = 0.113
  s^2_final = (1/13)(0.841) = 0.0647 = (0.254)^2
  ν_final   = 9
  interval  = 0.113 ± 2.262(0.254) = from −0.462 to 0.688

Paull-2 Combination (estimated variance per series = 0.671)
  Group    #G     ȳ_G
  L         3    0.100
  H        10    0.117
  pool           0.113

  y_final   = 0.113
  s^2_final = (1/13)(0.671) = 0.0516 = (0.227)^2
  ν_final   = 29
  interval  = 0.113 ± 2.045(0.227) = from −0.351 to 0.577

^a See Table 3.


The most conservative action of course would be to say, in effect, "I don't believe the low group $s_i^2$ and my doubt extends to the corresponding $y_i$, so I shall set aside all the series in the low group!" There are, as we now see, examples where this seems to be the best alternative.

In his 1954 paper Cochran (p. 120) gives summary data for seven observers in an impaled-fly experiment. This study evaluated observers' ability to estimate the number of flies settling on a grill. Each of seven observers looked briefly at grills that had known numbers of flies impaled on them. Each observer made an estimate for each grill five times, presumably independently. Our analysis deals with one grill which had 161 flies impaled. Table 7 shows the consequences of grouping the $s_i^2$. We get two very small groups, one low and one high.

The $s_i^2$ for the low-$s^2$ answer is so low (only about 1/20 of the pooled $s^2$ in the middle group) that rational belief that observer 2 made five independent judgments is almost impossible. So we shall be wise to set aside this observer entirely. (We return to the final analysis of these data below, after we have dealt with the very small high group.)

When neither of these extreme judgments seems appropriate, we should proceed in a more balanced way. The simple borrowings give, with L the very small group and Z the single substantial one, for the estimated per-group variance of group L

$$\frac{1}{\#L}\left[s_L^2 + \#Z\,(\text{estimated per-group variance for } Z) - s_Z^2\right] \qquad \text{(add-on)}$$

$$\frac{s_L^2}{\#L}\cdot\frac{\text{estimated per-group variance for } Z}{s_Z^2/\#Z} \qquad \text{(scale-up)}$$

We recommend using the 50-50 compromise—the average of these two expressions. We could calculate degrees of freedom to assign to these estimated variances. However, as the assigned d.f. for L, we recommend using the smaller of νL and the assigned d.f. for Z. Tables 8 and 9 show the corresponding calculations when the low group of three in the Lewis and Trevains data are treated as “very small.” The resulting (Paull-2) interval is intermediate in length between the original treatment (Sections 2.1–2.3) and the result of total borrowing (Table 6). 3.5 Borrowing for a Top Group Here again, when detailed insight is lacking, we recommend the half-andhalf compromise between add-on and scale-up. Tables 10 and 11 treat, in this way, the very small top group for the impaled-fly example.
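The add-on, scale-up, and half-and-half borrowings for a very small group next to a single substantial group Z can be sketched in a few lines. The function name and argument names are ours; the two usage lines take their inputs from the Lewis and Trevains low group (Paull-2 values, with Z the high group of 10) and from the impaled-fly high group (with the middle group of five observers in the role of Z).

```python
def borrow_from_one_group(s2_small, n_small, s2_z, n_z, epgv_z):
    """Return (add-on, scale-up, half-and-half) estimated per-group variances
    for a very small group, borrowing from one substantial group Z."""
    add_on = (s2_small + n_z * epgv_z - s2_z) / n_small
    scale_up = (s2_small / n_small) * epgv_z / (s2_z / n_z)
    return add_on, scale_up, 0.5 * (add_on + scale_up)


# Low group of the Lewis and Trevains example, borrowing from the high group
low = borrow_from_one_group(0.0101, 3, 0.594, 10, 0.0671)

# High group (observer 5) of the impaled-fly example, borrowing from the middle group
top = borrow_from_one_group(1064.6, 1, 166.6, 5, 35.38)
```

The first call gives roughly 0.0290, 0.0038, and 0.0164; the second gives roughly 1074.9, 1130.4, and 1102.7, the last differing slightly from a compromise computed from already rounded components.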


Table 7. Grouping the Impaled-Fly Example (All ν_i = 4)

                               Stage 1               Stage 2                  Stage 3
  Observer   i    s_i^2     c_i     s_i^2/c_i     c_i     s_i^2/c_i    new i    c_i     s_i^2/c_i
     2       1      8.1    0.2497     32.4       0.5336     15.2
     6       2     51.2    0.4532   (113.0)      1.2521    (40.9)        1     0.2738    187.0
     1       3    117.0    0.6410   (182.5)                              2     0.5026    232.8
     7       4    134.0    0.8424   (159.1)                              3     0.7229    185.4
     3       5    235.9    1.0821   (218.0)                              4     0.9744    242.1
     4       6    295.0    1.4077   (209.6)                              5     1.3086    225.4
     5       7   1064.6    1.9964   (533.3)                              6     1.9046   (559.0)

Note: Stages chosen as in Table 1.

Result:
  Low group    = observer 2                     (s_i^2 = 8.1)
  Middle group = observers 6, 1, 7, 3, and 4    (pooled s_i^2 = 166.6)
  High group   = observer 5                     (s_i^2 = 1064.6)

Middle group:
  Observer    i      y_i      y_i − ȳ_M
     6        2     158.0       −3.1
     1        3     183.2       22.1
     7        4     143.0      −18.1
     3        5     154.0       −7.1
     4        6     167.2        6.1
   mean             161.1
   sum of squares               913.25

  MS_M = 913.25/4 = 228.3
  (y_i for observer 5 = ȳ_H = 187.2)

  Paull-2 estimated per-group variance = (1/5)[(4(228.3) + 20(166.6))/24] = (1/5)(176.9) = 35.38 = (5.95)^2

Source: Cochran (1954), p. 120.

4. SUPPLEMENTARY POINTS 4.1 More Careful Grouping when the νi Differ In our treatment so far, we paid little attention to the differences among the νi . We ordered the raw s2i , and calculated divisors—as c(i|n, νi )—in terms of this raw order. If the νi are different enough, this procedure comes in question. So in this section we offer an alternative procedure which (1) treats the νi much more carefully and (2) may give somewhat better results. Such a


Table 8. Add-on and Scale-up Assessed per Group Variances for the Low Group in the Lewis and Trevains Example^a

Basics (Z = high group of 10)
  s_L^2 = 0.0101    #L = 3
  s_Z^2 = 0.594     #Z = 10

Add-on
  (1/3)[0.0101 + 10(0.0671) − 0.594] = (1/3)(0.0871) = 0.0290 = (0.170)^2

Scale-up
  (0.0101/3)[0.0671/(0.594/10)] = (1/3)(0.0114) = 0.00380 = (0.062)^2

Half-and-half
  (1/2)(0.0290) + (1/2)(0.0038) = 0.0164 = (0.128)^2
  ν assigned = lesser of 6 and 29 = 6

^a Calculations given for Paull-2 combination (cf. Tables 2 and 3).

Table 9. Final Calculations for the Lewis and Trevains Example with the Low Group Treated as "Very Small"^a

  G (= group)    W_G = 1/var    ν for group    W_G/W_+    u_G      ν*_G    ȳ_G
  L                 61.0             6          0.8037    0.631     6      0.100
  H                 14.9            29          0.1963    0.631    29      0.117
  total             75.9 = W_+

  y_final    = (0.8037)(0.100) + (0.1963)(0.117) = 0.103
  s^2_final  = (1/75.9)(1 + 0.631/6 + 0.631/29) = 0.0148 = (0.122)^2
  1/ν_final  = (1/6)(0.8037)^2 + (1/29)(0.1963)^2 = 0.1090 = 1/9.2
  s^2_final* = 0.0148/(1 + 0.007) = 0.0147 = (0.121)^2 = 1/68
  interval   = 0.103 ± (2.255)(0.121) = (−0.170 to +0.376)

^a Paull-2 combination, see Tables 3 and 4.
Note: As per text, the estimated variance underlying W_L is a 50-50 compromise between "add-on" and "scale-up" calculations.

modification might begin to pay when νmax /νmin = 2 and might be essential when νmax /νmin = 10. Reordering from Chi-square to Gaussian If the s2i have very different νi , but come from the same σ 2 , the (raw) ordering of the s2i may well not reflect their relative extremeness. Thus, for example, we would not be too surprised, for νi = 2, to see s2i /σ 2 = 2.9687 (corresponding approximately to—with approximately the same upper tail area


Table 10. The High Group (Observer 5) in the Impaled-Fly Example (M is for the Middle Group of Five Observers)

Basics
  #H = 1    s_H^2 = 1064.6
  #M = 5    s_M^2 = 166.6
  Estimated per-group variance for M = 35.38 = (1/5)(176.9)

Add-on
  (1064.6 + 176.9 − 166.6) = 1074.9 = (32.8)^2

Scale-up
  (1064.6)[35.38/(166.6/5)] = 1064.6(176.9/166.6) = 1130.4 = (33.62)^2

Half-and-half
  (1/2)(1075) + (1/2)(1130) = 1102 = (33.20)^2
  ν assigned = lesser of 4 and 20 = 4

Table 11. Final Calculations for the Impaled-Fly Example^a

  G (group)    W_G (1/var)    ν for group    W_G/W_+    u_G      ν*_G    ȳ_G
  L            set aside
  M              0.0283           24          0.9692    0.119    24      161.1
  H              0.0009            4          0.0308    0.119     4      187.2
  total          0.0292 = W_+

  y_final    = (0.9692)(161.1) + (0.0308)(187.2) = 161.9
  s^2_final  = (1/0.0292)(1 + 0.119/24 + 0.119/4) = 35.4 = (5.95)^2
  1/ν_final  = (1/24)(0.9692)^2 + (1/4)(0.0308)^2 = 0.0394 = 1/25.4
  s^2_final* = 35.4/(1 + 0.009) = 35.1 = (5.92)^2 = 1/0.0286
  interval   = 161.9 ± 2.058(5.92) = from 149.7 to 174.1

^a Compare Tables 7 and 10.
Note: Actual number of flies was 161.

on the null hypothesis as—a standard Gaussian Xi = 1.645 by the WilsonHilferty approximation) because 2s2i /σ 2 = 5.9374 is at about the 5% level for chi-square with 2 d.f. (1.645 is the corresponding one-sided Gaussian upper 5% point). We would be very surprised, for νj = 50, to see s2j /σ 2 = 1.7348 (corresponding approximately to the Gaussian deviate Xj = 3.0902). Ordering our s2i by their Xi , if we knew them, would be a much more sensible order than ordering them by their raw values. We do not know σ 2 , but we might be willing to use a trial value T 2 in its place. We could then look at a variety of T 2 and see to what orders their use would correspond. Table 12 illustrates this point through a simple hypothetical example.


Whether we look at the largest $|X_i|$, or at the sum of squares of the $X_i$, we are led to use something like $T^2 = 10$. Table 13 explores a narrower range of values of $T^2$. Here we get the same order anywhere between $T^2 = 7$ and $T^2 = 12$. Making the largest $|X_i|$ small directs us to a $T^2$ somewhat larger than 9. Making the sum of the $X_i^2$ small directs us to a $T^2$ somewhat larger than 10.

Calculations after Reordering

A convenient way to choose a single working $T^2$ is to pool the $s_i^2$ and take $T^2 = s^2_{\text{pool}}$. In our example this gives $T^2 = 10$, so that the actual calculation becomes what we see in Table 14. It is clear, from the panel headed "reordering," that 0.5 and 1.5, both at ν = 50, differ in a more important way than do 0.2 and 1.8 at ν = 2. Thus it makes more sense (in this extreme example considerably more) to use the revised order in calculating c(i|n, ν) values for each $s_i^2$ preparatory to grouping. Gnanadesikan and Wilk (1970) offer another way to approach the general problem.
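The reordering step can be sketched in code. The function names below are ours; the deviate is the inverse Wilson-Hilferty transform given in the footnotes to Tables 12 and 14, and the trial value T² is the pooled s² weighted by d.f., as just described. The inputs in the usage line are the four hypothetical series of Table 12.

```python
from math import sqrt


def wh_deviate(s2: float, nu: float, t2: float) -> float:
    """Approximate Gaussian deviate X for s2/t2 viewed as chi2_nu/nu
    (inverse of the Wilson-Hilferty transform)."""
    a = 2.0 / (9.0 * nu)
    return ((s2 / t2) ** (1.0 / 3.0) - 1.0 + a) / sqrt(a)


def reorder(s2, nu):
    """Pool the s2 values (weights nu), convert each to a deviate, and rank them."""
    t2 = sum(v * s for v, s in zip(nu, s2)) / sum(nu)
    x = [wh_deviate(s, v, t2) for s, v in zip(s2, nu)]
    by_x = sorted(range(len(x)), key=x.__getitem__)
    rank = [0] * len(x)
    for position, index in enumerate(by_x, start=1):
        rank[index] = position
    return t2, x, rank


t2, deviates, new_order = reorder([2, 5, 15, 18], [2, 50, 50, 2])
```

For these inputs the pooled value is 10 and the new order is 2, 1, 4, 3, in agreement with the T² = 10 column of Table 12.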

When Should We Reorder?

When we make our initial ordering, there is typically some confusion. Especially because the $s_i^2$ are likely to belong to two or more quite distinct groups, the reordering calculations are likely to involve rather unrealistically large values of $X_i$. So we do not propose to carry out our reordering at such a stage. Once we have formed a tentative group, and are trying the combination of that tentative group with one more $s_i^2$, things have pretty well settled down. Here reordering would seem to be natural, especially if we use, for $T^2$, the pooled $s^2$ based on all the $s_i^2$ in the tentative group, leaving out of our pooling the one more series we are trying to include. We would, of course, continue using reordering if the first step of addition is successful, so that we continue to try further additions, one $s_i^2$ at a time. When we stop enlarging the first group, and start again, we would still avoid reordering in the first stage.

4.2. What We Have Not Discussed

Both in this paper and in Paper I, we have refrained from extending these techniques to the use of robust/resistant methods to summarize either the observations within a series or the series within a group. (There seems little reason not to do such things, but the present paper is surely long enough.) We have also not discussed other versions of Paull-2-like combinations, including that proposed by Paull himself.

5. CONCLUSIONS

5.1. Recommendations

We regard using the procedures set out in this paper, with Paull-2 combination, as, currently, middle-of-the-road practice in almost any situation


Table 12. Some Sample Reorderings^a (Values of ν_i and s_i^2 Given)

                      Values of X_i for Different Assumptions About T^2
  ν_i   s_i^2    T^2 = 1    T^2 = 2    T^2 = 5    T^2 = 10   T^2 = 20   T^2 = 50   T^2 = 100
   2      2       1.1131     0.3333    −0.4562    −0.9123    −1.2742    −1.6407    −1.8523
  50      5      10.7163     5.4248     0.0667    −3.0278    −5.4839    −7.9710    −9.4073
  50     15      22.0598    14.4282     6.7004     2.2374    −1.3049    −4.8918    −6.9634
   2     18       5.1956     3.5736     1.9312     0.9827     0.2298    −0.5325    −0.9728

  ordering        1 3 4 2    1 3 4 2    1 2 4 3    2 1 4 3    3 1 2 4    3 1 2 4    3 1 2 4
  largest |X_i|    22.1       14.4        6.7        3.0        5.5        8.0        9.4
  Σ X_i^2         629.7      250.5       48.8       16.0       33.5       90.4      141.4

^a We use the inverse of the Wilson-Hilferty transform:

$$(\chi^2_\nu/\nu \text{ deviate}) = \left[1 - \frac{2}{9\nu} + X_i\sqrt{\frac{2}{9\nu}}\,\right]^{3}$$

to find the X_i, where X_i is a standard Gaussian deviate.

Table 13. A Narrower Range of Trial T^2

                                  Values of X_i When
  ν_i   s_i^2    T^2 = 7    T^2 = 8    T^2 = 9    T^2 = 10   T^2 = 11   T^2 = 12   T^2 = 15
   2      2      −0.6908    −0.7768    −0.8495    −0.9123    −0.9671    −1.0157    −1.1340
  50      5      −1.5248    −2.1085    −2.6023    −3.0278    −3.4001    −3.7298    −4.5329
  50     15       4.4051     3.5633     2.8511     2.2374     1.7004     1.2249     0.0667
   2     18       1.4434     1.2644     1.1131     0.9827     0.8685     0.7675     0.5213

  order           2 1 4 3    2 1 4 3    2 1 4 3    2 1 4 3    2 1 4 3    2 1 4 3    2 1 3 4
  largest |X_i|     4.41       3.56       2.85       3.03       3.40       3.73       4.53
  Σ X_i^2          24.3       19.3       16.9       16.0       16.1       17.0       22.1

Note agreement of new order for 7 ≤ T^2 ≤ 12. Also note that

  pooled s^2 = [2(2) + 50(5) + 50(15) + 2(18)]/(2 + 50 + 50 + 2) = 1040/104 = 10


Table 14. Old and New Ways of Grouping a Hypothetical Example

In Raw Order
  s_i^2    ν_i    c(i|4, ν)    Ratio
    2       2      0.1671     (11.97)
    5      50      0.9297       5.38
   15      50      1.0460     (14.34)
   18       2      1.8718       9.62
  sums: Σ ν_i = 104, Σ ν_i s_i^2 = 1040
  s^2_pool = 1040/104 = 10

Reordering^a
  s_i^2/s^2_pool    ν_i      X_i      Order
       0.2           2     −0.912       2
       0.5          50     −3.022       1
       1.5          50      2.237       4
       1.8           2      0.983       3

  (same in new order)
  ν_i    s_i^2/s^2_pool      X_i
  50         0.5           −3.022
   2         0.2           −0.912
   2         1.8            0.983
  50         1.5            2.237

After Reordering
  ν_i    s_i^2    c(i|4, ν)    Ratio      Ratio for Raw Order
  50       5       0.7980       6.27            5.38
   2       2       0.4855       4.12          (11.97)
   2      18       0.9555     (18.84)           9.62
  50      15       1.2031     (12.47)         (14.34)

^a Here

$$X_i = \frac{(s_i^2/s^2_{\text{pool}})^{1/3} - 1 + 2/(9\nu_i)}{\left(2/(9\nu_i)\right)^{1/2}}$$

so that the Wilson-Hilferty approximation WH applied to X_i would return s_i^2/s^2_pool as χ^2_ν/ν.

where we are allowed to choose weights to maximize precision. [Using the techniques of Paper I (Mosteller and Tukey, 1982), which assume that all series are estimating the same quantity, as we said in that paper, seems to us to be always risky and to be likely to lead to serious error.]

The choice between grouping the $s_i^2$ (the estimated per-series within variances), which was discussed in detail above, and grouping $s^2(i)$ (the estimated per-observation within variance), which was alluded to, seems to us a difficult question. Blind faith in the relevance of a per-observation variance is rarely warranted, but there surely are situations where grouping $s^2(i)$ seems reasonable. Equally, there are cases where some series are known, perhaps in advance, to involve more difficult measurements than others, and where this knowledge has been reflected by making more observations in the more difficult situations. In such cases grouping $s^2(i)$ would usually be foolish, and grouping $s_i^2$ would ordinarily be preferred.

Fortunately, the choice between grouping $s_i^2$ and grouping $s^2(i)$ seems unlikely to have major effects on the answer. (Cases with large ratios between the number of measurements per series are an exception.) The one thing to avoid is doing both, and reporting the one whose final combination is most palatable. So it is good to have any systematic, even though somewhat arbitrary, rule to be used when the preceding paragraphs do not apply. We suggest looking at the numbers of observations in each series, and at the ratio of the largest of these numbers to the smallest. If this is less than 2, group $s_i^2$. If it is greater than 5, group the $s^2(i)$. In between, use the best insight and judgment you have to choose one or the other. [Changes in calculations, when grouping $s^2(i)$, are pointed out at the close of Section 2.1, and treated in more detail in Paper I.]

The use of UC (ultraconservative) combination in practice seems to us to be usually unnecessarily biased and unwise. (We included it here mainly for pedagogical reasons, as a presentation that can be more easily understood because it avoids the complexities of "sometimes pool" procedures.)

The use of robust/resistant methods to summarize observations within a series, so long as we obtain an estimated variance and a crudely relevant d.f. for that estimated variance, calls for no change in the methods of this paper. When these methods are appropriate, such combinations which are robust and/or resistant within a series seem also likely to be appropriate. However the individual series were summarized and however we have grouped the series, one way to give the several series within a group "equal weight" is to combine them robust/resistantly.
This might not be as safe as making robust/resistant combinations within a series, but we would expect it ordinarily to provide the major advantages that robust/resistant techniques usually offer. When it comes time to combine groups, these are often so few in number (i.e., 2) as to prevent simple robust/resistant techniques offering any real help. Thus our (unsupported) view is that if any of the techniques of this paper are applicable, robust/resistant techniques offer their usual advantages (a) within series and (b) within groups, but probably should not be used (c) among groups unless we have at least five groups, preferably each with a reasonable number of series. The final warning is most general and most pervasive. If series estimate different things, differently weighted combinations will also estimate different things. If we know what we must estimate, we must use weights that ensure that we estimate the appropriate quantity, no matter what happens to the variability of our final estimate. If, at the other extreme, it does not matter exactly what we estimate, we may choose our weights to minimize the variability of our final result—something we would do by using the methods of this paper.
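The rule of thumb suggested earlier in this section for choosing between grouping $s_i^2$ and grouping $s^2(i)$ can be written down as a small function. This is merely an illustration: the function name and the returned labels are hypothetical, and the middle range is deliberately left to the analyst, as the text recommends.

```python
def suggested_grouping(n_obs):
    """Apply the ratio rule to the numbers of observations per series.

    Returns a label only; the thresholds 2 and 5 are those stated in the text.
    """
    ratio = max(n_obs) / min(n_obs)
    if ratio < 2:
        return "per-series"        # group the s_i^2
    if ratio > 5:
        return "per-observation"   # group the s^2(i)
    return "judgment"              # in between: no automatic choice

print(suggested_grouping([10, 12, 15]))
print(suggested_grouping([3, 20]))
print(suggested_grouping([4, 12]))
```

Fixing such a rule before looking at the combined results is one way to avoid the temptation, warned against above, of computing both answers and reporting the more palatable one.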


5.2. Summary

In grouping series by the maxims of Cochran (1954), with a view to giving all the series in each group the same weight, it has been the practice to take the detailed values of the $s_i^2$ being grouped into account wholly informally. Thus when Cochran recommends putting one-half to two-thirds of the series in a low group, who can doubt that his choice, in that range, would depend on just how the values of $s_i^2$ clustered. The techniques set down in Paper I (Mosteller and Tukey, 1982) and in the present paper take much more detailed and formalized account of these values. It is to be expected, but has not yet been verified, that doing this will strengthen our processes of combination.

Our procedures allow for a variety of choices, specifically:

1. full freedom in summarization within series, so long as estimated variances and assigned d.f. are provided;
2. grouping based on either per-series or per-observation variabilities (discussed in Section 5.1);
3. combination of series within group by either classical or robust/resistant techniques (which give each series either equal weights or equal opportunity to receive substantial weights); and
4. combination of group results using either Paull-2 (preferred) or UC (ultraconservative) assessments of combined variability for group summaries.

In the body of the paper, we illustrate the detailed techniques which seem at all likely to occur when we group the $s_i^2$. [Changes when $s^2(i)$ are grouped were illustrated in Paper I.] Section 5.1 presents our recommendations about choice.

REFERENCES

Bennett, C.A., and Franklin, N.L. (1954). Statistical Analysis in Chemistry and the Chemical Industry. Wiley, New York.

Cochran, W.G. (1937). Problems arising in the analysis of a series of similar experiments. J. Roy. Stat. Soc., Suppl. 4, 102–118. Also Paper 8 in Cochran (1982b).

Cochran, W.G. (1938). Some difficulties in the statistical analysis of replicated experiments. Empire J. Expt. Agric. 6, 157–175. Also Paper 17 in Cochran (1982b).

Cochran, W.G. (1954). The combination of estimates from different experiments. Biometrics, 10, 101–129. Also Paper 58 in Cochran (1982b).

Cochran, W.G. (1982a). Summarizing the results of a series of experiments. Paper 112 in Cochran (1982b). Also in Proceedings of the 25th Conference on the Design of Experiments in Army Research, Development, and Testing. U.S. Army Research Office, Durham, N.C., ARO Report 80-2 (1980).

Cochran, W.G. (1982b). Contributions to Statistics. Wiley, New York.

Cochran, W.G., and Carroll, S.P. (1953). A sampling investigation of the efficiency of weighting inversely as the estimated variance. Biometrics, 9, 447–459. Also Paper 53 in Cochran (1982b).

Gnanadesikan, R., and Wilk, M.B. (1970). Use of maximum likelihood for estimating error variance from a collection of analysis of variance mean squares. Ann. Math. Stat., 41, 292–304.

Heyl, P.R., and Cook, G.S. (1936). The value of gravity at Washington. J. Res. Natl. Bur. Stand., 17, 805–839.

Lewis, A.H., and Trevains, D. (1934). Investigations of the manurial effectiveness of ammonium phosphate. Part II. Experiments in the British Isles to compare concentrated fertilizers containing ammonium phosphate with ordinary mixtures containing superphosphate. Empire J. Expt. Agric., 2, 239–250.

Mosteller, F., and Tukey, J.W. (1982). Combination of results of stated precision: I. The optimistic case. Utilitas Mathematica, 21A, 155–178. [Paper 32 in the present volume. Ed.]

Paull, A.E. (1950). On a preliminary test for pooling mean squares in the analysis of variance. Ann. Math. Stat., 21, 539–556.

Rao, P.S.R.S. (1983). Variance components models for combining estimates, present volume. [W.G. Cochran’s Contributions to Statistics. Ed.]

Rao, P.S.R.S., Kaplan, J., and Cochran, W.G. (1981). Estimators for the one-way random effects model with unequal error variances. J. Am. Stat. Assoc., 76, 89–97. Also Paper 114 in Cochran (1982b).

Wilson, E.B., and Hilferty, M.M. (1931). The distribution of chi-square. Proc. Natl. Acad. Sci., 17, 684–688.

Yates, F., and Cochran, W.G. (1938). The analysis of groups of experiments. J. Agric. Sci., 28, 556–580. Also Paper 13 in Cochran (1982b).

Reprinted from A Festschrift for Erich L. Lehmann in Honor of His Sixty-Fifth Birthday (1983), P. Bickel, et al., eds. Belmont, CA: Wadsworth, pp. 239–248

34. Allocating Loss of Precision in the Sample Mean to Wrong Weights and Redundancy in Sampling with Replacement from a Finite Population

J.L. Hodges, Jr., Frederick Mosteller, and Cleo Youtz
University of California, Berkeley and Harvard University

ABSTRACT

For samples of the same size, sampling with replacement instead of without replacement from a finite population enlarges the variance of the sample mean. The loss of precision has two sources: (1) from inappropriate weights because the same information is being used repeatedly and (2) from redundancy by not having the maximum possible number of distinct items in the sample. Because, in many statistical problems, variation in weights does not matter much (Tukey, 1948), intuition suggests that the lion’s share of the loss comes from attrition in distinct items in the sample. This intuition is incorrect, and for a large class of practical situations the loss divides nearly evenly between the two causes. We provide both rough approximations and exact mathematical results, as well as some tabulated values.¹

1. THE PROBLEM AND ITS APPROXIMATE SOLUTION

Consider a population of $N$ items, each of which is associated with some numerical value, and suppose that these $N$ values have mean $\mu$ and variance $\sigma^2$. To estimate $\mu$, we draw a random sample of $n$ items without replacement, and calculate the mean $X$ of the $n$ sample values. It is well known that $X$ is an unbiased estimate for $\mu$ and that

$$\operatorname{Var}(X) = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1}. \tag{1}$$

¹ Acknowledgement: This research was facilitated by National Science Foundation Grant SES 8023644 to Harvard University. This collaboration came about from a kaffeeklatsch regularly organized by Erich Lehmann while Mosteller was Miller Research Professor at the University of California at Berkeley when Hodges reported on an example from his elementary statistics course.


Now suppose that we were instead to draw a random sample of $m$ items with replacement and calculate the mean $Y$ of these $m$ observations. Then $Y$ would also be an unbiased estimate for $\mu$, but with

$$\operatorname{Var}(Y) = \frac{\sigma^2}{m}. \tag{2}$$

On comparing the variances (1) and (2), it appears that the with-replacement sample needs to be larger than the without-replacement sample if $Y$ is to be as good an estimate of $\mu$ as $X$. Equating the variances gives

$$m = n \cdot \frac{N - 1}{N - n}, \tag{3}$$

which exceeds $n$ except in the trivial case where $n = 1$. In an important class of special cases, requirement (3) can be put into the form of an easily remembered rule of thumb. Suppose that $n$ is “large” while the sampling fraction $n/N = \alpha$ is “small,” as is often true in practical sampling. Then requirement (3) may be written approximately as

$$m = n(1 + \alpha). \tag{4}$$
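Requirements (3) and (4) are easy to compare numerically. The sketch below is ours (the function names are hypothetical) and takes $n = 100$, $N = 1000$ as illustrative values; exact rational arithmetic is used so that no rounding enters the comparison.

```python
from fractions import Fraction

def m_exact(n, N):
    """Requirement (3): with-replacement sample size whose variance (2)
    equals the without-replacement variance (1)."""
    return Fraction(n * (N - 1), N - n)

def m_approx(n, N):
    """Rule of thumb (4): m = n (1 + alpha), with alpha = n / N."""
    return n * (1 + Fraction(n, N))

n, N = 100, 1000
print(float(m_exact(n, N)), float(m_approx(n, N)))
```

The two results differ by a single draw here, which is the sense in which (4) is an easily remembered stand-in for (3) when $\alpha$ is small.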

In words, to compensate for replacement, one needs to augment the sample size by a proportion about equal to the sampling fraction. With a 10% sample, for instance, suppose that $n = 100$, which is “large,” while $N = 1000$ so that $\alpha = 0.1$, which is “small.” The 10% sampling fraction calls by requirement (4) for a 10% augmentation of the without-replacement sample size $n = 100$, or 10 extra draws, if one wishes to sample with replacement. Formula (3) shows that the needed augmentation is exactly 11.

Can one explain intuitively why with-replacement sampling needs extra draws? The natural first thought is the redundancy of with-replacement sampling. This method allows an item to be drawn more than once; the repeat draws are wasted, since they provide no fresh information. This explanation would be supported were it to turn out that in our special case the proportion of repeats is about $\alpha$. The chance that a population item appears at least once among the $m$ is $1 - (1 - 1/N)^m$, so the number $K$ of distinct items in the with-replacement sample has the expectation

$$E(K) = N\left[1 - \left(1 - \frac{1}{N}\right)^m\right]. \tag{5}$$

On expanding the binomial, we find for $E(K)$ the series $m - m(m-1)/2N + \cdots$, so that in our special case $E(K) \approx m(1 - \alpha/2)$; the wasted repeat draws amounting on average to the proportion $\alpha/2$ of the sample, approximately. This calculation suggests that redundancy accounts for only about half of the inefficiency of $Y$. The exact formula (5), applied to the numerical example,


supports this conclusion. With $N = 1000$, formula (5) shows that $E(K) = 99.7$ when $m = 105$ while $E(K) = 100.6$ when $m = 106$. That is, the damage from redundancy could be offset by only a 5% augmentation, not 10%.

In casting about for an explanation for the other half of $Y$’s inefficiency, one may ask whether this estimate is making the best use of the available information. In computing $Y$, we use each sample value as drawn, so that an item drawn twice is given double weight. But should it be? It is merely a matter of chance which items reappear. Would it not be better to note down the identity of the items drawn before they are replaced, so that one could give equal weight and compute the mean $Z$ of the values of the $K$ distinct items in the sample?

Under the condition that $K = k$, the $k$ distinct items constitute in effect a random sample of $k$ items drawn without replacement, so $E(Z \mid K = k) = \mu$ and by the without-replacement variance (1) $E[(Z - \mu)^2 \mid K = k] = (\sigma^2/k) \cdot (N - k)/(N - 1)$. It follows that, unconditionally, $Z$ is unbiased for $\mu$, with

$$\operatorname{Var}(Z) = E\left[\frac{\sigma^2}{K} \cdot \frac{N - K}{N - 1}\right]. \tag{6}$$

When $m$ is large and $\alpha$ small, $K$ will have a small coefficient of variation and the expectation (6) should be about equal to $\sigma^2/E(K) \cdot [N - E(K)]/(N - 1)$. On substituting our approximation $E(K) = m(1 - \alpha/2)$, we see that, if the properly weighted estimate $Z$ is used, it suffices to put $m = n(1 + \alpha/2)$ to obtain a variance that matches the without-replacement variance (1).

For our special case, therefore, we have a rather neat conclusion: the inefficiency of $Y$ rests about equally on the inefficiency of the sampling design and on the improper treatment of the data it produces. But our conclusion is based on some shaky approximations; let us see what a precise mathematical argument can give. (At this point, to a fanfare of trumpets, the Queen of the Sciences rides into battle).

2. EXACT TREATMENT

In the opening presentation, we have asked how big a sample $m$ with replacement is required to give equal variance to the sample mean for a sample of $n$ drawn without replacement. We found it to be $n(1 + \alpha)$, where $\alpha = n/N$. Had we instead asked how the variance of the sample mean was reduced for equal sample sizes $n$, we would change from

$$\frac{\sigma^2}{n} \cdot \frac{N - 1}{N - 1} \quad \text{(with replacement and equal weights)}$$

to

$$\frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1} \approx (1 - \alpha)\frac{\sigma^2}{n} \quad \text{(without replacement)}.$$
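As a numerical aside, the values of $E(K)$ quoted in Section 1 can be reproduced directly from formula (5). The following is a minimal sketch; the function name is ours.

```python
def expected_distinct(m, N):
    """Formula (5): E(K) = N * (1 - (1 - 1/N)**m), the expected number of
    distinct items among m draws with replacement from N items."""
    return N * (1.0 - (1.0 - 1.0 / N) ** m)

N = 1000
for m in (105, 106):
    print(m, round(expected_distinct(m, N), 1))
```

With $N = 1000$, the expected number of distinct items first reaches 100 between $m = 105$ and $m = 106$, which is the basis for the remark that about a 5% augmentation offsets the redundancy alone.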


Thus, we could think of the loss in precision as measured by $\alpha$. We could also write the variance of a sample mean as

$$\frac{\sigma^2}{n} \cdot \frac{N - x}{N - 1},$$

where $x$ ranged from 1 to $n$. Thus, apparently from our introduction $x/n$ could be regarded as a measure of the precision retained by the method for large $n$ and small $\alpha$:

1. $x/n \approx 0$ for sampling with replacement and equal weights for all observations,
2. $x/n \approx 1/2$ for sampling with replacement and equal weights for the observations on the distinct items,
3. $x/n \approx 1$ for sampling without replacement and equal weights.

We now check this more carefully. To find the required expectation

$$E\left(\frac{\sigma^2}{K} \cdot \frac{N - K}{N - 1}\right),$$

we see we need $E\left(\frac{N - K}{K}\right)$. When one notes that $(N - K)/K$ is essentially the reciprocal of a random variable, one can scarcely hope for a closed result for the expectation. What a happy surprise to discover after tabulating a few values that the expectation is

$$R(N, n) = E\left(\frac{N - K}{K}\right) = \frac{\sum_{k=1}^{N-1} k^{n-1}}{N^{n-1}}. \tag{7}$$

This formula is readily proved for $N = 2$ and any $n$. Then for fixed $n$, an induction on $N$ completes the proof. See the Appendix.

Using this formula and $x/n = [N - nR(N, n)]/n$, we can compute $x/n$ exactly for various values of $N$ and $n$. Table 1 shows some results for $n \le N$. We note that for $n > 1$, the results are generally close to 1/2, and that $\alpha\,(= n/N)$ need not be especially small for this to be true.

For large $N$ and $n$ the calculation becomes clumsy, and Euler’s summation formula (Bromwich, 1942) for a function $f$ is helpful for the numerator. That result gives

$$\sum_{i=1}^{N-1} f(i) = \int_1^{N-1} f(x)\,dx + \tfrac{1}{2} f(N - 1) + \frac{1}{2!} B_1 f'(N - 1) - \frac{1}{4!} B_2 f'''(N - 1) + \cdots \tag{8}$$

where the $B$s are the Bernoulli numbers

$$B_1 = \frac{1}{6}, \quad B_2 = \frac{1}{30}, \quad \ldots.$$

Table 1. Values of x/n

                                                    n
N         1    2     3        4        5        10       50       100      500      1000     5000
10        1    .5    .48333   .47500   .46670   .42570
20        1    .5    .49167   .48750   .48334   .46259
50        1    .5    .49667   .49500   .49333   .48501   .41906
100       1    .5    .49833   .49750   .49667   .49250   .45922   .41829
500       1    .5    .49967   .49950   .49933   .49850   .49183   .48350   .41768
1000      1    .5    .49983   .49975   .49967   .49925   .49592   .49175   .45848   .41760
5000      1    .5    .49997   .49995   .49993   .49984   .49917   .49834   .49168   .48335   .41753
10,000    1    .5    .49998   .49998   .49997   .49995   .49959   .49918   .49585   .49168   .45841
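Entries of Table 1 can be recomputed from formula (7) together with $x/n = [N - nR(N, n)]/n$. The sketch below is ours (hypothetical function names) and uses exact rational arithmetic, which is adequate for moderate $N$ and $n$ but becomes slow for the largest entries of the table.

```python
from fractions import Fraction

def R(N, n):
    """Formula (7): E[(N - K)/K] = sum_{k=1}^{N-1} k**(n-1) / N**(n-1)."""
    return Fraction(sum(k ** (n - 1) for k in range(1, N)), N ** (n - 1))

def x_over_n(N, n):
    """Precision retained when the distinct items are weighted equally."""
    return (N - n * R(N, n)) / n

for N, n in [(10, 3), (10, 10), (100, 10)]:
    print(N, n, f"{float(x_over_n(N, n)):.5f}")
```

For $n = 1$ and $n = 2$ the function returns exactly 1 and 1/2 for every $N$, matching the first two columns of the table.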

For our problem the first three terms are

$$\int_1^{N-1} x^{n-1}\,dx = \frac{(N - 1)^n - 1}{n}, \qquad \tfrac{1}{2} f(N - 1) = \tfrac{1}{2}(N - 1)^{n-1}, \qquad \frac{1}{2!} B_1 f'(N - 1) = \frac{n - 1}{12}(N - 1)^{n-2}.$$

A three-term approximation could be used as it stands, or we could study the limiting ratio as before, say with $n = \alpha N$ for small $\alpha$ and large $n$. For this situation, we recognize that the first neglected term, the $B_2$ term of expansion (8), has the same order of $N$ as the others, but it is of order $\alpha^3$ and has a small coefficient, 1/720. If we neglect the higher terms, we get as the first two terms in the approximation

$$\frac{x}{n} = \frac{1}{2} - \frac{\alpha}{12}. \tag{9}$$

This means that in this situation if we use the correct weights, we suffer a little more than half of the loss from sampling with replacement. Thus a little more than half the loss is due to redundancy, the lack of distinct items. We note that for values in this table the formula works fairly well even for large $\alpha$. For $N = 10$, $n = 10$, $\alpha = 1$, the approximation suggests 0.41667 instead of the tabled 0.42570. For $N = 100$, $n = 10$, $\alpha = 0.1$, the formula gives 0.49167, very close to the correct 0.49250.

Our original intuition about the weights is incorrect because it rests on a first order argument, essentially about the $n$ of $\sigma^2/n$, not about the second order matter of the effect on the correction term that we discuss.

APPENDIX

Induction Proof

The distribution of $K$, the number of distinct items, has probabilities of the form

$$P(k; N, n) = c_{k,n}\, N(N - 1) \cdots (N - k + 1)/N^n. \tag{10}$$

The $c_{k,n}$ depend on a recursion rule $c_{k,n+1} = k c_{k,n} + c_{k-1,n}$. Table 2 shows values of $c_{k,n}$ in a Pascal-like triangle. This rule arises because we get $k$ distinct items at the end of $n + 1$ draws by two routes. First route: given $k$ distinct items after $n$ draws, we have $k$ ways to get no new item; second route: we have $k - 1$ distinct items after $n$ draws, and we have $N - k + 1$ ways to get a new distinct item. The factor $N(N - 1) \cdots (N - k + 1)$ is still needed in the probability for $k$ distinct items with $n + 1$ trials. Thus, in the first route the multiplier $k$ associates itself with $c_{k,n}$. In the second route, the $N - k + 1$ is required to fill out $N(N - 1) \cdots (N - k + 2)$, and so we get only the number 1 multiplying $c_{k-1,n}$.

The quantity whose expected value is desired is

$$R(N, n) = E\left(\frac{N - K}{K}\right) = N^{-n} \sum_{k=1}^{n} c_{k,n}\, N(N - 1) \cdots (N - k + 1)\,[(N - k)/k]. \tag{11}$$

To prove formula (7) we need to prove that

$$(N + 1)^{n-1} R(N + 1, n) - N^{n-1} R(N, n) = N^{n-1} \tag{12}$$

because that is necessary for the induction argument from $N$ to $N + 1$ for given $n$.

Table 2. Table of Coefficients $c_{k,n}$ Illustrating $c_{k,n+1} = k c_{k,n} + c_{k-1,n}$

                                        k
   n |     10      9      8      7      6      5      4      3      2      1
   1 |                                                                     1
   2 |                                                              1      1
   3 |                                                       1      3      1
   4 |                                                1      6      7      1
   5 |                                         1     10     25     15      1
   6 |                                  1     15     65     90     31      1
   7 |                           1     21    140    350    301     63      1
   8 |                    1     28    266   1050   1701    966    127      1
   9 |             1     36    462   2646   6951   7770   3025    255      1
  10 |      1     45    750   5880  22827  42525  34105   9330    511      1
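The recursion behind Table 2, and the agreement between the expectation (11) and the closed form (7), can be checked mechanically for small cases. The code below is a sketch with names of our own choosing; it is a numerical check, not a substitute for the induction proof.

```python
from fractions import Fraction

def coefficients(n_max):
    """Build c[n][k] from c[n+1][k] = k*c[n][k] + c[n][k-1], with c[1][1] = 1."""
    c = {1: {1: 1}}
    for n in range(1, n_max):
        c[n + 1] = {k: k * c[n].get(k, 0) + c[n].get(k - 1, 0)
                    for k in range(1, n + 2)}
    return c

def falling(N, k):
    """The product N (N - 1) ... (N - k + 1); zero whenever k > N."""
    out = 1
    for j in range(k):
        out *= N - j
    return out

def R_from_distribution(N, n, c):
    """Formula (11): E[(N - K)/K] under the probabilities (10)."""
    return sum(Fraction(c[n][k] * falling(N, k), N ** n) * Fraction(N - k, k)
               for k in range(1, n + 1))

def R_closed_form(N, n):
    """Formula (7)."""
    return Fraction(sum(k ** (n - 1) for k in range(1, N)), N ** (n - 1))

c = coefficients(10)
print([c[10][k] for k in range(10, 0, -1)])   # last row of Table 2, k = 10..1
print(all(R_from_distribution(N, n, c) == R_closed_form(N, n)
          for N in range(2, 9) for n in range(1, 11)))
```

The coefficients generated this way are the Stirling numbers of the second kind, which is why the probabilities (10) sum to one over $k$.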


Substituting from formula (11) into formula (12), we have

$$\sum_{k=1}^{n} c_{k,n}\,[N(N - 1) \cdots (N - k + 1) - (N - 1)(N - 2) \cdots (N - k)]/k = \sum_{k=1}^{n} c_{k,n}\,(N - 1)(N - 2) \cdots (N - k + 1).$$

Multiplying numerator and denominator by $N^n$ and slipping an $N$ from the numerator behind the summation sign gives us

$$N^{n-1} \sum_{k=1}^{n} c_{k,n}\, N(N - 1) \cdots (N - k + 1)/N^n = N^{n-1} \sum_{k=1}^{n} P(k; N, n) = N^{n-1}$$

as required for equation (12).

It remains only to provide the starting point of the induction for $N = 2$ and arbitrary $n$. We need to show that

$$R(2, n) = \frac{1}{2^{n-1}}.$$

The number, $k$, of distinct categories is 2 with probability $1 - (1/2)^{n-1}$ and 1 with probability $(1/2)^{n-1}$. The required mean is

$$R(2, n) = \left[1 - \left(\tfrac{1}{2}\right)^{n-1}\right]\left(\frac{2 - 2}{2}\right) + \left(\tfrac{1}{2}\right)^{n-1}\left(\frac{2 - 1}{1}\right) = \frac{1}{2^{n-1}}.$$

This completes the induction proof.

REFERENCES

BROMWICH, T.J.I’a. (1942), An Introduction to the Theory of Infinite Series (2nd ed.), London: MacMillan and Co., p. 304.

DAVID, F.N., and BARTON, D.E. (1962), Combinatorial Chance, New York: Hafner, pp. 13–15.

FELLER, W. (1968), An Introduction to Probability Theory and Its Applications, Vol 1, (3rd ed.), New York: Wiley, pp. 101–106. Feller treats especially the case where n > N, whereas we primarily study n < N.

KOLCHIN, V.F., SEVAST’YANOV, B.A., CHISTYAKOV, V.P., translated by A.V. Balakrishnan (1978), Random Allocations, New York: Wiley. See Chapter 1, The Classical Shot Problem.

TUKEY, J.W. (1948), “Approximate Weights,” Annals of Mathematical Statistics, 19, 91–92.

Reprinted from Surgery (1984), 95, pp. 572–579

35. Reporting Clinical Trials in General Surgical Journals

John D. Emerson, Bucknam McPeek, and Frederick Mosteller

Readers need information about the design and analysis of a clinical trial to evaluate and interpret its findings. We reviewed 84 therapeutic trials appearing in six general surgical journals from July 1981 through June 1982 and assessed the reporting of 11 important aspects of design and analysis. Overall, 59% of the 11 items were clearly reported, 5% were ambiguously discussed, and 36% were not reported. The frequency of reporting in general surgical journals is thus similar to the 56% found by others for four general medical journals. Reporting was best for random allocation (89%), loss to follow-up (86%), and statistical analyses (85%). Reporting was most deficient for the method used to generate the treatment assignment (27%) and for the power of the investigation to detect treatment differences (5%). We recommend that clinical journals provide a list of important items to be included in reports on clinical trials.

From the Departments of Biostatistics and Health Policy and Management, Harvard School of Public Health, and the Department of Anesthesia, Massachusetts General Hospital, Boston, Mass.

Supported in part by grant RF79026 from the Rockefeller Foundation. Accepted for publication June 23, 1983. Reprint requests: Bucknam McPeek, M.D., Department of Anesthesia, Massachusetts General Hospital, Boston, MA 02114. Current address of John D. Emerson: Department of Mathematics, Middlebury College.

Reports of clinical trials often omit information about study design, implementation, and analysis. Such information is essential for accurate evaluation of these research reports.1,2,4,9 DerSimonian and co-workers4 surveyed 67 clinical trials published in four general medical journals and found that for 11 important aspects of design and analysis, 56% of all items were clearly reported, 10% were ambiguously mentioned, and 34% were not reported. These


investigators considered whether published reports included such basic information as whether and how patients were randomized to treatment, the presence and extent of blindness, whether loss to follow-up was reported, and which statistical methods, if any, were used in the analysis. In this study we examined the reporting of therapeutic clinical trials published in six leading general surgical journals: the American Journal of Surgery, the Annals of Surgery, the Archives of Surgery, the British Journal of Surgery, Surgery, and Surgery, Gynecology and Obstetrics. Our study focuses on whether an aspect of design or analysis was reported rather than on whether it was used correctly in the research. For example, if authors reported that patients were not “blind” to the type of treatment, this item is coded as positive because the article provided information needed by readers to assess this aspect of study design. Thus our investigation examined the adequacy of reporting rather than the appropriateness or quality of the design and analysis of the trials we considered. METHODS We selected, somewhat arbitrarily, six major general surgical journals written in English. We examined each issue of the selected journals from July 1981 through June 1982 to identify reports of comparative clinical trials on humans that were prospective therapeutic trials having two or more arms. We excluded any studies (1) in which physicians assigned patients to treatment groups by nonsystematic methods, (2) that were previously reported in detail in other published research articles, or (3) that were designed primarily to assess diagnostic procedures or prognostic factors. We identified the eligible articles according to the inclusion criteria described above, through a three-stage process. First, for the six journals one author (B.M.) 
sequentially reviewed all the articles to determine which ones should be candidates for the study and made notes about the relevant characteristics of each article. Next, a second author (J.D.E.) reviewed the 175 articles identified in the first stage and applied the exclusion criteria; this produced a list of 93 articles, which were then randomly assigned to pairs of readers. Finally, our readers independently applied the exclusion criteria to these articles and on close scrutiny found nine more articles that should not be included. All clinical trials not excluded for the reasons mentioned were included in our study; we obtained 10 clinical trials from the American Journal of Surgery, 17 from the Annals of Surgery, 7 from the Archives of Surgery, 27 from the British Journal of Surgery, 8 from Surgery, and 15 from Surgery, Gynecology and Obstetrics. Pairs of readers independently assessed how frequently certain aspects of design and analysis were reported. The random assignment of readers used the RAND10 table of random integers and balanced as nearly as possible such factors as the journal in which the articles appeared, the order of assignment to each member of a pair of readers, and the numbers of articles the same two readers surveyed. We considered 11 specific items on the basis of their


importance to readers in determining the confidence that should be placed in the authors’ conclusions, their applicability across a variety of subspecialty areas, and their ability to be assessed by the scientifically literate general medical reader. We used the following items, which had been identified by DerSimonian et al.4:

1. Eligibility criteria: Information explaining the criteria for admission of patients to the trial
2. Admission before allocation: Information used to determine whether eligibility criteria were applied before knowledge of the specific treatment assignment was available
3. Random allocation: Information about the assignment of patients to treatment groups
4. Method of randomization: Information about the mechanism used to generate the random assignment
5. Patients’ blindness to treatment: Information about whether the patients knew which treatment they were receiving
6. Blind assessment of outcome: Information about whether the person assessing the outcome knew which treatment had been given
7. Treatment complications: Information describing the presence or absence of side effects or complications after treatment
8. Loss to follow-up: Information about the numbers of patients lost to follow-up and the reasons why they were lost
9. Statistical analyses: Indication of analyses going beyond the computation of means, percentages, or standard deviations
10. Statistical methods: The names of the specific tests, techniques, or computer programs used for statistical analyses
11. Power: Information describing the determination of sample size or the size of detectable differences

Five readers participated in completing survey forms for the papers; two of these readers independently reviewed each paper assigned to them to ascertain whether each of the items was clearly reported, not reported, or unclear. They were informed as to the authors and to the journal in which the article appeared.
All items judged clearly not applicable for a particular study were regarded as reported. Clear implication was counted as reporting; for example, a reader cannot expect a patient to be blind to amputation. The readers indicated on the survey forms the location in the article of the data for each item found to be reported, whether they believed an item to be not applicable, and other comments that could aid subsequent adjudication of any disagreements. We provided the readers with the same survey forms and the same set of substantial written instructions with examples that were developed for the investigation of general medical journals. The readers also participated in training sessions to review clinical trials together, debate their findings with


each other, and question us about the meaning and interpretation of the written instructions and examples. These training sessions totaled approximately 5 hours and helped to settle the inevitable ambiguities before we assigned the papers for review. The readers were all university doctoral students or postdoctoral fellows with scientific backgrounds from the Harvard Departments of Biostatistics and Epidemiology. All readers had considerable prior experience in assessing research designs for medical investigations; two readers were physicians. (Readers’ names appear in the acknowledgment.) One author (J.D.E.) served as an arbitrator for disagreements. To do this, he first tabulated the results of the survey forms and identified items that produced disagreement between the first two readers. He assigned each item one of the following codes: R for “reported” if clear information about the item was present in the article or if it was clearly not applicable; O for “omitted” if no report of the item was found; and ? if there was ambiguous or partial information about the item. The arbitrator based the final code on the written reports of the first two readers when they agreed, and on further study of the paper, by him and by two readers who served also as discussants to resolve disagreements. After considering both the written views of the readers and the views of the discussants, the arbitrator always made the final decision in resolving disagreements among the readers. To aid in assessing the reliability of our procedure for evaluating the reporting in these articles on the 11 selected items, we compiled data about disagreements between the initial two readers. We distinguished disagreements involving reported/omitted (R/O) pairs from those that involved a ? code. Upon completing the data gathering for this study, we selected a stratified random sample of articles from the 67 clinical trials evaluated by DerSimonian et al.4 to calibrate our ratings with theirs. 
Our sample of 15 of those articles was stratified on the four journals used in that study. We subjected these articles to exactly the same review process and arbitration of disagreements that we used for the 84 articles in our study of six general surgical journals. We then compared our results for this sample with those obtained by DerSimonian and co-workers. These data provided some information on the degree of replicability of studies of clinical trials that make use of the methods and instruments adopted.

RESULTS

Table I lists the percentages of articles reporting each of the 11 items for the six journals and for all journals combined. It also lists the overall percentages for the journals when the 11 items are combined. Information about random allocation of patients to treatment, loss to follow-up, and the use of statistical analysis was reported in at least 85% of the 84 articles. Only 27% discussed the method used for randomization, and a mere 5% of the articles discussed power. For all items combined, we found a 59% overall level of reporting; this

Table I. Percentage of articles reporting each item in each journal

Entries are percentages in the order R / ? / O.

Item                              Am J Surg   Ann Surg    Arch Surg   Br J Surg   Surgery     SGO         Total
No. articles                      10          17          7           27          8           15          84
Eligibility criteria              60/10/30    59/18/24    14/14/71    33/15/52    50/0/50     40/7/53     43/12/45
Admission before allocation       70/10/20    76/12/12    57/14/29    48/19/33    75/13/13    53/13/33    61/14/25
Random allocation                 100/0/0     94/0/6      86/0/14     81/4/15     100/0/0     87/0/13     89/1/10
Method of randomization           40/0/60     47/0/53     14/14/71    19/7/74     13/0/88     27/0/73     27/4/69
Patients’ blindness to treatment  80/0/20     71/0/29     71/0/29     63/0/37     75/0/25     60/7/33     68/1/31
Blind assessment of outcome       60/0/40     41/6/53     57/14/29    19/15/67    75/0/25     27/20/53    38/11/51
Treatment complications           80/0/20     82/6/12     71/14/14    74/0/26     88/0/13     80/0/20     79/2/19
Loss to follow-up                 80/20/0     94/6/0      86/14/0     85/7/7      88/0/13     80/7/13     86/8/6
Statistical analysis              70/0/30     94/0/6      100/0/0     81/0/19     100/0/0     73/0/27     85/0/15
Statistical methods               40/0/60     82/0/18     57/0/43     70/0/30     63/0/38     60/0/40     65/0/35
Power                             0/0/100     12/0/88     0/0/100     7/0/93      0/0/100     0/0/100     5/0/95
Mean all items                    62/4/35     68/4/27     56/8/36     53/6/41     66/1/33     53/5/42     59/5/36

Legend: Am J Surg, American Journal of Surgery; Ann Surg, Annals of Surgery; Arch Surg, Archives of Surgery; Br J Surg, British Journal of Surgery; SGO, Surgery, Gynecology and Obstetrics; R, item reported; ?, item unclear; O, item omitted.

588

Emerson, McPeek, and Mosteller

figure compares with a reporting rate of 56% obtained by DerSimonian and co-workers for four prestigious general medical journals (Table II). In general, differences in reporting levels among the six journals are not striking. Because we examined 10 or fewer clinical trials in three of the six journals and because the percentages reported were very close for some journals, we note that a new sample of articles might give a somewhat different comparison. The important result is that 59% of all items in all 84 articles were clearly reported, a figure that is slightly higher than the reporting percentage (56%) for four general medical journals. Table II. Comparison of results from two similar studies∗ Percentages of articles reporting R ? O 84 Surgical trials (this study) 59 5 36 Subsample of 15 clinical trials† This study 67 4 29 DerSimonian et al.4 63 10 27 67 Clinical trials DerSimonian et al.4 56 10 34 ∗ Based on an independent examination of a random subsample of 15 articles, by each of two study groups. †For both of the research teams, the 15 articles refer to the same stratified random subsample of the 67 articles from the New England Journal of Medicine, the Journal of the American Medical Association, the British Journal of Medicine, and the Lancet; DerSimonian et al.4 examined these articles for their reporting of 11 items.
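As a rough arithmetic check on Table I, the Total column appears to be the average of the journal percentages weighted by the number of articles in each journal. The paper does not state this, so the weighting is an assumption; the sketch below uses four of the R rows, and the names `N_ARTICLES`, `R_PERCENT`, and `weighted_total` are illustrative only.

```python
# Article counts per journal, in the column order of Table I:
# Am J Surg, Ann Surg, Arch Surg, Br J Surg, Surgery, SGO
N_ARTICLES = [10, 17, 7, 27, 8, 15]

# Percentage of articles coded R in each journal, for a few rows of Table I
R_PERCENT = {
    "Eligibility criteria": [60, 59, 14, 33, 50, 40],
    "Random allocation": [100, 94, 86, 81, 100, 87],
    "Power": [0, 12, 0, 7, 0, 0],
    "Mean all items": [62, 68, 56, 53, 66, 53],
}


def weighted_total(percentages, weights):
    """Average of the percentages, weighted by articles per journal (assumed)."""
    return sum(p * w for p, w in zip(percentages, weights)) / sum(weights)


totals = {item: weighted_total(p, N_ARTICLES) for item, p in R_PERCENT.items()}

for item, value in totals.items():
    print(f"{item}: {value:.1f}")
```

Rounded to whole percentages, these weighted averages agree with the Total column entries for the same rows (43, 89, 5, and 59).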

We examined the numbers and percentages of all articles that produced disagreements over the reporting between the two initial readers. Disagreement occurred for 23% of the questions, and nearly two thirds of these involved R/O codes. Loss to follow-up produced the highest number of disagreements (45%), and more than half of these (27% of all articles) involved the R/O code pairs. Disagreements of this kind usually meant that one of the readers had missed something; ultimately we can tell that most were settled in favor of R, because loss to follow-up achieved an overall reporting level of 86%. Table III shows the distribution of reported (R), ambiguous (?), and omitted (O) by papers; for example, no paper received 11 Rs and only two papers (2%) received 10 Rs. Six percent of the 84 papers reported clearly on only 2 of the 11 items, and 2% reported on only 3 items. The remaining papers reported between 4 and 9 items, and the overall average was nearly 6 Rs. These articles reported on about 5% of the items in an ambiguous manner and failed to report on 36% of the items.

Table III. Marginal distributions of numbers of items coded in 84 papers

              Percentages of papers†
No. items∗    R     ?     O
0             0     60    0
1             0     30    7
2             6     8     12
3             2     2     26
4             8     0     23
5             10    0     10
6             15    0     10
7             30    0     8
8             15    0     4
9             11    0     1
10            2     0     0
11            0     0     0

∗ Number of items found reported among the 11 items.
† Percentages of papers giving each of the three codes for the indicated number of items. The numbers in a column of the table may not add to 100% because of rounding. The numbers in a row are not closely related in that, for example, we cannot determine for the 6% of papers reporting two items how many of the other items were omitted and how many were unclear.
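The marginal distributions in Table III can be turned back into average counts per paper, which gives a consistency check against the overall reporting levels. The sketch below is illustrative only: it treats each column as a distribution over 0 to 11 items and renormalizes by the column sum, because the rounded percentages do not add to exactly 100. The names `PCT`, `mean_count`, and `shares` are not from the paper.

```python
N_ITEMS = 11

# Percentages of papers with k items given each code, k = 0..11 (Table III)
PCT = {
    "R": [0, 0, 6, 2, 8, 10, 15, 30, 15, 11, 2, 0],
    "?": [60, 30, 8, 2, 0, 0, 0, 0, 0, 0, 0, 0],
    "O": [0, 7, 12, 26, 23, 10, 10, 8, 4, 1, 0, 0],
}


def mean_count(pcts):
    """Mean number of items per paper, renormalizing the rounded percentages."""
    return sum(k * p for k, p in enumerate(pcts)) / sum(pcts)


means = {code: mean_count(p) for code, p in PCT.items()}
# Express each mean as a percentage of the 11 items
shares = {code: 100 * m / N_ITEMS for code, m in means.items()}

for code in PCT:
    print(f"{code}: mean {means[code]:.2f} items, {shares[code]:.1f}% of items")
```

Under these assumptions the implied shares land within about a percentage point of the 59%, 5%, and 36% levels reported for R, ?, and O; the small gaps come from the rounding in the table.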

Table II gives results for overall reporting that aid in comparing the findings for six surgical journals examined here to the findings for four general medical journals. The first row of the table gives the overall reporting levels we found for the six surgical journals, and the last row gives the results obtained for the four general medical journals. Row 2 gives the results we obtained, using our own readers, random assignment scheme, and adjudication process, for a random subsample of 15 of the 67 articles from the study of general medical journals. Row 3 gives the parallel results obtained for these same 15 articles when examined by DerSimonian et al. and their readers.

DISCUSSION

We have selected 11 important aspects of design and analysis and investigated how often they are reported in samples of articles from six leading general surgical journals, five from the United States and one from the United Kingdom. We sought basic methodologic aspects that readers need to appraise the strengths of reported clinical trials; we briefly discuss these 11 aspects here, in the order used for Table I.

Readers usually want to assess whether the findings of a clinical trial can be applied to another patient group, or perhaps to an individual patient. This difficult task requires specific information about who the subjects of a trial were and how they were selected. Over half of the 84 papers failed to report unambiguously on this item. While most papers were clear in stating the exclusion criteria, often they did not identify the group of patients to whom
they applied these criteria. The lack of this information makes generalizing a trial’s findings to groups other than the subjects themselves very difficult. We sought information about whether eligibility and exclusion criteria were applied without knowledge of the treatment to be assigned. Determination of admission to a study after treatment assignment is known may bias subject selection. Approximately two thirds of the papers we reviewed supplied the needed information. Random assignment of treatments reduces bias in a clinical trial. We searched for an indication of whether treatment assignments were random or whether random assignments were possible. Although nearly 90% of the articles reported clearly on the use of randomization, only 27% reported on the method of randomization. The method used should be reported if the reader is to be assured that the randomization was reliable. Although randomization based on coin flipping, card drawing, or use of birthdays or patient identification numbers can seem reliable, it may also have serious weaknesses. Such methods are more easily interfered with and cannot be checked if questions about them arise later. Published tables of random numbers10 that can be used for patient assignment are readily available. An indication that randomization was done “using the method of sealed envelopes,” while somewhat informative, does not tell the mechanism that generated the assignments contained in the envelopes. Sealed envelopes may invite tampering, a danger minimized when treatment assignments are made from a central source after patients have entered the trial. Blindness is a second important bias-reducing technique employed in clinical trials. In searching for reports on this item, we distinguished between patients’ blindness to treatment and blind outcome assessment. In two thirds of the articles, reporting on patients’ blindness was unambiguous. 
Frequently we coded the item R because it was evident from the context (for example, amputation versus alternate operation) that patient blindness was impractical. Whether outcome assessment was blind was much less easily determined because many reports stop after using the phrase “double blind.” While this phrase may imply that the patient is blind to treatment, it leaves unclear whether the second blind party is the patient’s personal physician, the operating surgeon, or a separate evaluator who assesses patient outcome. Often several physicians and other medical personnel are in a position to bias study results if each is not blind to the assigned treatment. Although blindness for all persons may be impossible, the blinding of some of these individuals helps to reduce the number of sources of potential bias.

Surgeons know that treatments often produce side effects with implications for patients’ well-being and for their subsequent care. Unless a report describes a search for, and enumeration of, side effects and their frequencies, readers may fear that a treatment is subject to undisclosed disadvantages. This concern may deter their use of a new and effective treatment.

Investigators are indeed fortunate when they can collect and analyze outcome data from all subjects treated. Often this is not possible because patients
refuse further cooperation with investigators, move away, die, or become lost to follow-up for other reasons. When dropouts occur, the missing data can alter a trial’s analysis and change its conclusions. When authors fail to report clearly on loss to follow-up, some readers may be misled into believing that dropouts did not occur. This misunderstanding may handicap the reader’s efforts to interpret the results and to generalize them to other patient populations. Variation among individual patients and other sources of variability outside the investigators’ control usually require statistical inference to evaluate the outcome of a clinical trial. A reader must know not only that statistical methods were used, but which methods and how they were applied. Readers cannot assess the appropriateness and meaning of a P value or the outcome of a significance test without this information. In many reports of clinical trials, investigators can provide enough information about a statistical analysis to enable interested readers to perform a similar analysis on another data set. Among the 11 aspects of reporting we evaluated, the final aspect— statistical power—was least often reported. When investigators do provide a brief discussion of power, it gives readers information needed to assess both the adequacy of a plan of investigation and the strength of its conclusions. In particular, when a study reports one or more findings of nonsignificance, the reader needs to determine whether these findings indicate that there are no important differences or whether nonsignificance results simply from a study too small to detect important differences. The reporting frequencies found in the present study and in that of DerSimonian et al.4 suggest that most investigators do not appreciate the importance of information about power. The reader may wonder about the power of our own investigation. 
Because the key issue is whether the reports are 100% or something less, and because the results are far from 100%, our study has power near 1 for detecting such a large difference. Thus our study is large enough and reliable enough to be nearly sure to detect such large differences.

Several objectives influenced the design of our study. First, we wanted to be able to compare our results for articles in six general surgical journals with those obtained by DerSimonian and colleagues4 for four English language general medical journals. For this reason, we used the same survey forms and written instructions that were developed for, and adopted in, the previous study. The forms contained 17 items of reporting on aspects of design and analysis. Six of these items were not tabulated in the previous study because raters did not assess them reliably. Thus, whereas for comparability we used the same 17-item checklist, we tabulated and reported upon only the 11 items used in the previous study. We also conducted a small secondary review of a random sample of papers from that study in order to establish a basis for comparability between the two studies.

Our study design was also influenced by our desire to benefit from the experience gained from the study of DerSimonian et al. A letter to the editor3 responding to the previous study reminded us that, although our study of
reporting is not itself a clinical trial, we could incorporate a few additional features of good clinical trial design into our own study. Feinstein and Horwitz8 have suggested explicit ways to do this in investigations which, like ours, are observational studies. Thus we could improve on the methods used for the previous study in several ways. For example, we assigned pairs of readers at random to evaluate the 84 surgical trials. We used a stratified randomization to make certain that all pairs of readers assessed nearly the same percentage of articles from each of the six journals. We also took steps to ensure that both members of a pair of evaluators did not receive a given article as their first (or their last) article to evaluate; thus we balanced on the order in which a pair of readers considered their articles. Unlike the previous study of four general medical journals, our study did not include all comparative clinical trials published in the six journals during the study period. We excluded any trials that had been reported in a previous publication because we believed that authors of such articles might well not repeat some details of the study design and the analysis when they could refer readers to details in a previous report. We also excluded from our study all trials aimed at comparing diagnostic techniques; we restricted attention to therapeutic trials. Our findings about the levels of reporting on the 11 items selected are close to those of DerSimonian and co-workers4 for four general medical journals. For all journals combined, our average level of complete reporting was 3% higher, our average level of ambiguous reporting was 5% lower, and our average level of omission was 2% higher than for the general medical journals. 
For 3 of the 11 items our reporting percentages differed from those found in the previous study by more than 10%; for the surgical clinical trials we found a 13% higher rate of reporting on patients’ blindness to treatment, a 15% higher rate of reporting on treatment complications, and a 20% lower rate of reporting on the statistical methods used. In general, our findings are consistent with those of DerSimonian et al. Our calibration study of 15 papers from the set studied by DerSimonian et al.4 suggests that the main difference in the performances of the two study groups was an approximate 50% reduction in the number of assignments of ambiguous ratings by the second group. This reduction apparently led to roughly proportional increases in reporting and omission ratings. Table II shows that the distributions of ratings in the calibration set and in the two main studies agree very closely, though the calibration set itself seems to have a slightly higher rate of reporting than the two studies as a whole. Although our main finding is that published reports of clinical trials often omit needed reporting on important items, we do find very modest differences among the six general surgical journals in the frequency with which they reported the 11 items. Table I gives the percentages for each item and the mean percentages over all items for each journal. The number of articles studied for each journal was small; we included 10 or fewer articles for three of the six journals.

To assess whether the differences among the six means of all items reported were statistically significant, we performed a one-way analysis of variance. We used the number of Rs reported for each article as that article’s score, and we grouped these scores into six groups corresponding to the journals. We obtained F = 2.5 (with 5 and 78 degrees of freedom), which gives a P value of 0.04. Thus there were statistically significant differences among the six journals in their levels of reporting. However, the relative variability among these six general surgical journals tended to be less than that for the four general medical journals; our estimate of the standard deviation of the six true mean reporting percentages for the surgical journals was 5.2, while the corresponding estimate for the four general medical journals was 9.0.

Fig. 1 displays the reporting scores for the articles in the six journals using parallel boxplots.6 The rectangles indicate the middle 50% of the scores in each group, and the lines stretching away from the rectangles represent the lower 25% and the upper 25% for each group of data. The vertical lines within each of the rectangles designate the median reporting score for each of the journals. The display suggests that the variability within groups, as measured by the interquartile range, is greater than the variability in median reporting scores among the six journals.

Fig. 1. Parallel boxplot display of scores (numbers of items out of 11) reported for each of 84 clinical trials published in six general surgical journals. The boxes represent the middle 50% of the scores for each of the six journals, and the vertical lines through the boxes represent the median scores. The horizontal lines extending from the ends of the boxes represent the lower 25% and upper 25% of the data in each group of scores.

Editors and referees establish and maintain standards for what is published. Editors could enhance substantially the level of reporting of clinical trials by providing authors, referees, and associate editors with a list of items that they believe should be strictly reported. Such lists have already been constructed not only for the reporting of clinical trials11,12 but also for reporting in other medical specialties.7 The specific list of items is certain to vary among journals as well as among specialties, but we recommend that suitable lists be made available, perhaps in journals’ “Information for Authors.” We know that 100% reporting could be achieved with the aid of such lists, because authors generally have all the information needed to report on all the items we have considered; few words are needed to make such reports.

In making this proposal, we appreciate that the items we have chosen for design and analysis may not be the most crucial in that area, and, more important, that reporting in other areas of the research process may need improvement. DerSimonian et al.5 have suggested that a conference of biomedical editors on the reporting problem could lead to further studies of reporting, under the editors’ guidance; these studies could lead in turn to a helpful set of standards.

We believe that higher standards for reporting on design and analysis will eventually bring about the adoption of improved research designs and more suitable methods of analysis in medical investigation. When one plans on reporting an aspect of an investigation, one gives more consideration to determining how that aspect is to be carried out.

The readers for this project were Roselie Bright, Carol Master, Francois Meyer, Charles Poole, and Jane Teas.
We are also indebted to members of the Harvard Study Group on Statistics in the Biomedical Sciences who provided valuable input for many aspects of this study; they include John Bailar, Graham Colditz, Rebecca DerSimonian, Katherine Godfrey, Hossein Hosseini, Philip Lavori, Robert Lew, Thomas Louis, Katherine Taylor-Halvorsen, and John Williamson. Roselie Bright and Charles Poole helped in reassessing articles and the checklists as part of the adjudication of disagreements; Hossein Hosseini checked and rechecked the data; and Rebecca DerSimonian provided invaluable assistance and advice. We also thank Mary Schaefer and Susan Klawansky for preparing the manuscript.

REFERENCES

1. Altman DG: Statistics in medical journals. Statistics in Medicine 1:59–71, 1982
2. Chalmers TC, Smith H, Blackburn B, Silverman B, Schroeder B, Reitman D, Ambroz A: A method of assessing the quality of a randomized control trial. Control Clin Trials 2:31–49, 1981
3. Colton T: Letter to the editor. N Engl J Med 307:1219–20, 1982
4. DerSimonian R, Charette LJ, McPeek B, Mosteller F: Reporting on methods in clinical trials. N Engl J Med 306:1332–7, 1982
5. DerSimonian R, Charette LJ, McPeek B, Mosteller F: Letter to the editor. N Engl J Med 308:596–7, 1982
6. Emerson JD, Strenio J: Boxplots and batch comparison. In Hoaglin DC, Mosteller F, Tukey JW, editors: Understanding robust and exploratory data analysis. New York, 1983, John Wiley & Sons
7. Epidemiology Work Group, Interagency Regulatory Liaison Group: Guidelines for documentation of epidemiologic studies. Am J Epidemiol 114:609–13, 1982
8. Feinstein AR, Horwitz RI: Double standards, scientific methods, and epidemiological research. N Engl J Med 307:1611–7, 1982
9. Mosteller F, Gilbert JP, McPeek B: Reporting standards and research strategies for controlled trials. Control Clin Trials 1:37–58, 1980
10. RAND Corp: A million random digits with 100,000 normal deviates. New York, 1955, The Free Press
11. Department of Health and Human Services, Public Health Service, National Institute of Mental Health: Trials assessment procedure scale. Rockville, MD, 1982, National Institutes of Health
12. Zelen M: Guidelines for publishing papers on cancer clinical trials: responsibilities of editors and authors. J Clin Oncology 1:164–9, 1983

Reprinted from Bulletin of the ISI (1987) 52(4), pp. 571–577

36. Compensating for Radiation-Related Cancers by Probability of Causation or Assigned Shares

Frederick Mosteller
Harvard University, Cambridge, Massachusetts, U.S.A.

1. Introduction

In the United States of America, when an injured person sues for damages, the decision to compensate rests upon a principle of tort law called the preponderance of evidence. When uncertainty about who caused the damage exists, if it is more likely than not that the defendant caused the damage, the plaintiff is paid in full for the damages. Otherwise the plaintiff gets nothing. This all-or-none principle, together with some other drawbacks of the tort system for compensation, has led the Congress of the United States to consider other ways to provide compensation.

When some people among many accidentally exposed to radiation develop cancer, it is impossible to prove that the cancer was caused by the radiation. If for each of many people the chance that the cancer was caused by radiation is less than .5, say .2, then it seems unreasonable that no one gets anything because the evidence is less than half in their favor. Consequently, Congress asked that the Department of Health and Human Services (DHHS) create tables of probability of causation (or what some people prefer to call assigned shares) with the intent that such tables might be used to determine compensation in circumstances like this through a workman’s compensation approach rather than through court cases.

The DHHS established the National Institutes of Health Ad Hoc Working Group to Develop Radioepidemiological Tables (NIHWG). NIHWG consisted of scientists from the National Institutes of Health (NIH) and from other scientific and academic institutions. In its turn, NIH asked the National Research Council (NRC) to form a committee to review the recommendations and procedures of the NIHWG. The NIHWG completed its report in 1985, and thus advanced our knowledge of assigned shares and uncertainty. This paper reports on that work (National Institutes of Health, 1985; Oversight Committee on Radioepidemiological Tables, 1984).

2. Probability of Causation

To focus on the idea of probability of causation, or assigned share, we ask for the probability that the plaintiff’s cancer was caused by the radiation dose
received. Many other causes might have promoted the cancer, such as medical treatments, personal life style, occupation, or background radiation.

Imagine two large groups of men of age 25 who have identical characteristics, except that one group is exposed at that age to a specific dose of radiation and the other group is not, and for simplicity suppose that these two groups are of equal size. Consider those who survive to some specific age (say, 55) without developing a specific type of cancer. Suppose that 100 men in the exposed group and 80 men in the unexposed group are first diagnosed with the specific cancer at age 55. Then, ignoring statistical fluctuations, the assigned share for radiation in the cases in the exposed group is the ratio of the excess number of patients with the specific cancer in the exposed group to the number of cancer cases in the exposed group:

$$\frac{\text{Number of excess cancer cases in exposed group}}{\text{Number of cancer cases in exposed group}} = \frac{100 - 80}{100} = .20. \tag{1}$$

If we interpret the 20 excess cases as having been caused by the exposure and the remaining 80 caused by other factors, then the assigned share or probability of causation represents the probability that the cancer was caused by the radiation dose. A randomly chosen person from the exposed group would have a 20 percent chance that his cancer was caused by his previous radiation exposure. We need not suppose that that interpretation describes the world; rather it gives a way of thinking of the notion of probability. It could be that causation is more complicated and that all the people in the exposed group had some portion of cause of their cancer created by the radiation dose. The result in equation (1) could still be interpretable as a share.

One difficulty with the probability interpretation is that radiation can be cancer-preventing as well as cancer-causing. The formula (1) used above could in principle have had a negative numerator and that would have led to a negative probability. That is one reason some prefer an expression such as assigned share to probability of causation because a negative number has a satisfactory interpretation of share of cancers prevented if such an event should occur.

To get a more precise definition of assigned share, we suppose that we have age-specific death rates for a particular type of cancer for people exposed at a given age to a specific dose of radiation. Then let r(t|d) and r(t|0) be the age-t-specific cancer incidence rates for the groups exposed to dose d and those not exposed to the dose, respectively. Then the assigned share index, AS, for those whose cancer is first diagnosed at age t is

$$\mathrm{AS} = \frac{r(t \mid d) - r(t \mid 0)}{r(t \mid d)}. \tag{2}$$
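The assigned share is simple enough to sketch in a few lines. The function below is an illustration rather than anything from the NIHWG report: `assigned_share` is a hypothetical name, and the reversed counts in the second call are invented to show the negative-share case discussed above.

```python
def assigned_share(exposed, unexposed):
    """Excess in the exposed group as a fraction of the exposed group's total.

    The arguments may be case counts from equal-sized groups or
    age-specific incidence rates; only their ratio matters.
    """
    if exposed <= 0:
        raise ValueError("exposed count or rate must be positive")
    return (exposed - unexposed) / exposed


# The worked example from the text: 100 exposed cases, 80 unexposed cases
share = assigned_share(100, 80)

# Hypothetical protective exposure: fewer cases among the exposed
negative_share = assigned_share(80, 100)

print(share, negative_share)
```

A negative result reads as a share of cancers prevented, which is why the term assigned share is more comfortable here than probability of causation.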

To connect definition (2) with definition (1), suppose that N persons are in each group at risk for cancer at age t. Then we expect N r(t|d) and N r(t|0)
cancer cases in the exposed and unexposed groups, respectively. Substituting these results into equation (1) produces equation (2).

In the law requesting the tables, Congress asked that account be taken of the type of cancer, size of radiation dose, ages at exposure and at diagnosis, and gender. In addition, the Secretary of Health and Human Services was asked to take account of such other factors as she together with advisors thought appropriate. So far, only smoking has been added. Although other variables no doubt make a difference—perhaps region of the country, life style, and occupation—suitable data to evaluate the joint effects together with the other variables would be hard to get.

To illustrate what happens when strata are formed, let us extend our previous example by stratifying into groups of men of high and low economic status. We have 100 exposed men with cancer, 60 from high economic status and 40 from low, and 80 unexposed men, 45 from high economic status and 35 from low. This leads to the results shown in Table 1. For all men the assigned share is .2 as before, but the high-socioeconomic-status men have assigned shares of .25, the low of .13.

TABLE 1. ASSIGNED SHARES FOR VARIOUS REFERENCE SETS

Reference set                    Assigned share
All men                          (100-80)/100 = .20
High socioeconomic status men    (60-45)/60 = .25
Low socioeconomic status men     (40-35)/40 = .13
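The calculation behind Table 1 can be sketched as follows, using the case counts given in the text. The names `CASES`, `by_stratum`, and `pooled_from_strata` are illustrative. The sketch also shows that the share for all men equals the stratum shares averaged with weights proportional to exposed cases, which follows directly from the definition.

```python
# (exposed cases, unexposed cases) in each stratum, from the text
CASES = {
    "High socioeconomic status men": (60, 45),
    "Low socioeconomic status men": (40, 35),
}


def assigned_share(exposed, unexposed):
    """Excess cases among the exposed as a fraction of exposed cases."""
    return (exposed - unexposed) / exposed


by_stratum = {name: assigned_share(e, u) for name, (e, u) in CASES.items()}

total_exposed = sum(e for e, _ in CASES.values())
total_unexposed = sum(u for _, u in CASES.values())
overall = assigned_share(total_exposed, total_unexposed)

# The overall share is the stratum shares weighted by exposed cases
pooled_from_strata = sum(
    by_stratum[name] * e for name, (e, _) in CASES.items()
) / total_exposed

for name, value in by_stratum.items():
    print(f"{name}: {value:.3f}")
print(f"All men: {overall:.3f}")
```

The low-status share is 5/40 = .125, which the table reports as .13.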

The important point of the table is that the assigned share does not produce a unique number for every individual, but instead is a quantity determined for a reference set of values of variables and assigned to every person in the set. The set of variables and their values chosen for making the calculation determines the value of the assigned share for a person. People have different assigned shares depending upon the variables and their values or categories used to make the calculation. Consequently, the choice of variables that could or should be used requires consideration.

3. Construction of the Tables

The details of how the tables were constructed are lengthy; the NIHWG has provided a book (1985) describing their procedures, and the National Research Council Committee has added further discussion (1984). I will run over the key ideas briefly.

The main source for age-specific incidence rates is the Japanese A-bomb survivor data (Kato and Schull, 1982). Some additional sources come from data on radiation for post-partum mastitis (Land, Boice, Shore, Norman, and
Tokunaga, 1980), ankylosing spondylitis (Smith and Doll, 1982), and cancer from uranium miners (Whittemore and McMillan, 1983), dial painters, and other workers exposed to radiation.

Because of limited data, anyone constructing radioepidemiological tables is forced to make simplifying assumptions about how radiation exposure affects the age-specific incidence rate of specific cancers. Although there seems to be a latency period after exposure before an increase in rate occurs, the NIHWG did not use a model that imposed a jump in rate after a given time interval. That might be unsatisfactory from the point of view of a plaintiff because a day here or there might make a difference of a half million dollars in compensation. Consequently, the NIHWG imposed a cubic polynomial to smooth out the curves. They do not argue that it follows any biological theory or process; it merely smooths the curve and seems good enough for the purpose. This illustrates the sort of practical compromise that builders of such tables have to make.

4. Models

Two forms of models that might be used are constant-relative risk and constant-excess risk. More formally, the constant-relative risk approach uses

$$r(t \mid d) = r(t \mid 0)\,[1 + f(d)],$$

where f(d) depends on the dose, but not on t, and f(0) = 0. This is a multiplicative approach. For the additive or constant-excess risk approach,

$$r(t \mid d) = r(t \mid 0) + f(d)$$

for some dose-response function f(d), with f(0) = 0 as before. For the multiplicative model, the relative risk is 1 + f(d) at all ages of diagnosis, and for the additive model the excess incidence rate is f(d) at all ages of diagnosis. Such simplifications are convenient, though they may not be correct in the real world. Though one would like to allow the data to choose the models, often the data are too sparse. NIHWG used a constant relative risk model for all cancers but bone marrow and leukemia. For these they assumed that the rate first increases and then decreases with time.
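Substituting each model into the definition of the assigned share shows how the choice matters: the constant-relative-risk model gives AS = f(d)/[1 + f(d)], which is free of age, while the constant-excess-risk model gives AS = f(d)/[r(t|0) + f(d)], which falls as the baseline rate rises. The sketch below illustrates this with hypothetical numbers; the baseline rates and the two values of f(d) are invented for the example and are not taken from the NIHWG tables.

```python
def share_relative_risk(f_d):
    """AS under r(t|d) = r(t|0)[1 + f(d)]; f_d is a dimensionless relative excess."""
    return f_d / (1.0 + f_d)


def share_excess_risk(f_d, baseline_rate):
    """AS under r(t|d) = r(t|0) + f(d); f_d is an absolute excess rate."""
    return f_d / (baseline_rate + f_d)


# Hypothetical baseline incidence rates r(t|0) by age at diagnosis
BASELINE = {45: 0.0004, 55: 0.0008, 65: 0.0016}

F_RELATIVE = 0.25    # hypothetical f(d) for the multiplicative model
F_ABSOLUTE = 0.0002  # hypothetical f(d) for the additive model

relative = {age: share_relative_risk(F_RELATIVE) for age in BASELINE}
excess = {age: share_excess_risk(F_ABSOLUTE, r0) for age, r0 in BASELINE.items()}

for age in BASELINE:
    print(f"age {age}: relative-risk AS {relative[age]:.3f}, "
          f"excess-risk AS {excess[age]:.3f}")
```

With these made-up inputs the relative-risk share stays at .2 at every age, while the excess-risk share drops from about one third to about one ninth as the baseline rate quadruples.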
The dose response functions considered were linear or linear-quadratic. Linear means f(d) = µd whereas linear-quadratic means f(d) = µ₁d + µ₂d² (where the µ’s are constants). The NIHWG chose linear-quadratic for all except breast and thyroid cancers where they chose the linear model. In adding smoking to the variables under consideration for lung cancer, the constant-relative-risk model was revised by replacing r(t|d) by r(t|d, S) where S is 0 for nonsmokers and 1 for smokers. Then for the constant-relative-risk model a factor (1 + βS) was applied, and in the constant-excess-risk model the term βS was added (where β is a constant). This has the feature that the AS does not change with smoking when the constant-relative-risk model is used, but does for the constant-excess model. For lung cancer, NIHWG used a special model


r(t|d, S) = r(t|0, 0)[1 + f(d) + βS]

so that AS does not vary with age at diagnosis but does with smoking status. Although the Japanese data on A-bomb survival give results for a Japanese population, the problem still remains to project the results to a U.S. population. The NIHWG, in the main, assumed constant-relative-risk models in the U.S. and Japan, even though relative risks in the two countries differed. The relation was somewhat complicated. They adjusted so that the relative risk in the U.S. gave a cumulative excess incidence for each cancer site for a period from 5 to 33 years after exposure that equaled the cumulative excess among A-bomb survivors between 1950 and 1978 (the period available for follow-up). For a linear dose-response model this would produce an AS for radiation for the U.S. of

AS = µd/(p + µd)

where p is the ratio of baseline U.S. rates to Japanese rates. The choice of projection methods can make a substantial difference in results.

5. Uncertainties

As implied above, many uncertainties arise from the data and from the assumptions. The data are few, and consequently choice of model cannot be strongly guided by data. The handling of latency depends on curve-fitting. The choices between linear and the linear-quadratic models are not strongly based in theory or data, and other possible models have not been much considered. Similarly, the issue of constant-relative versus constant-excess risk offers delicate choices. Needless to say, on the claimant’s side, many uncertainties are also present. What was the dose, and if there were repeated doses, how shall these be cumulated? As the NRC group illustrates, changes in AS by factors of 3 or 4 are possible, owing to these uncertainties. In addition, as this work was going forward, the Japanese dosimetry was being re-evaluated, and I do not know the results of this evaluation and its consequences.
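The models of Section 4 can be made concrete with a small numerical sketch. It is not the NIHWG procedure: the function names and every numerical value (the baseline rate, µ, β, and p) are hypothetical, chosen only to show how the assigned share AS = [r(t|d) − r(t|0)]/r(t|d) behaves under each model.

```python
def assigned_share(rate_exposed, rate_unexposed):
    """AS = [r(t|d) - r(t|0)] / r(t|d)."""
    return (rate_exposed - rate_unexposed) / rate_exposed


def linear(dose, mu):
    """Linear dose response f(d) = mu * d."""
    return mu * dose


def linear_quadratic(dose, mu1, mu2):
    """Linear-quadratic dose response f(d) = mu1 * d + mu2 * d**2."""
    return mu1 * dose + mu2 * dose ** 2


def as_relative_with_smoking(f_d, beta, smoker, baseline=2.0e-4):
    """Constant-relative-risk model with the smoking factor (1 + beta * S)."""
    smoking_factor = 1 + beta * smoker
    exposed = baseline * (1 + f_d) * smoking_factor
    unexposed = baseline * smoking_factor
    return assigned_share(exposed, unexposed)


def as_lung_special(f_d, beta, smoker, baseline=2.0e-4):
    """Special lung model r(t|d, S) = r(t|0, 0)[1 + f(d) + beta * S]."""
    exposed = baseline * (1 + f_d + beta * smoker)
    unexposed = baseline * (1 + beta * smoker)
    return assigned_share(exposed, unexposed)


def as_us_projection(dose, mu, p):
    """Projected share AS = mu * d / (p + mu * d) for a linear dose response."""
    return mu * dose / (p + mu * dose)


if __name__ == "__main__":
    f_d = linear(dose=10.0, mu=0.025)  # hypothetical dose and slope
    for smoker in (0, 1):
        print(
            "S =", smoker,
            "relative-risk AS:", round(as_relative_with_smoking(f_d, 9.0, smoker), 3),
            "special lung AS:", round(as_lung_special(f_d, 9.0, smoker), 3),
        )
```

Under these assumptions the relative-risk share is the same for smokers and nonsmokers, while the share from the special lung model falls for smokers, which matches the qualitative statements in the text.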
Although these uncertainties plague the method, they seem less uncertain than the approach used in the current tort system where various experts give their testimony based on no systematic attempt to carry out the evaluation. The suggestion is sometimes made that tables such as these should not be used to settle the compensation claim, but that they could be used by the courts as a starting point for estimating the probabilities or assigned shares, so that others could bring additional information about the case in order to “better” compute assigned shares. I make two comments about this. First, predicting the outcome of adjoining an additional variable or set of variables to a situation where several variables have already been entered requires a degree of intuition few of us have. It is like a situation where many variables have been entered into a regression equation, and an investigator is asked what would happen if an additional variable were entered, when he has had no previous experience with this variable or its relation to the ones


already included. The investigator cannot hope to be good at guessing what will happen, even whether the sign of the new variable will be positive or negative, let alone its value. An exception occurs if none of the variables used so far are predictive, but that is not the circumstance here. Thus the possibility of guessing correctly about the effect of adding new variables does not seem attractive to me. More likely, the user would be misled by the first-order effects of the new variables, rather than their effects after the variables already introduced have been adjusted for. Second, it seems to me to be a social and political question what variables are to be adjusted for. Let us suppose, for example, that smoking makes a difference in the compensation to be given to a person who smokes and that it is unfavorable to that claimant. The claimant might well argue that the smoking is something he or she had a right to do, and that claimants should not be penalized for having exercised their rights. Such matters could be argued in the courts. Consider insurance policy annuities for men and women. For years men got larger annuities because women lived longer. But finally a U.S. Supreme Court decision changed this and now women and men get the same payoffs even though the difference of seven years in longevity continues. Thus society needs to settle what it wants adjusted for, and we have seen in the earlier part of this paper that what is adjusted for matters.

6. Uses of Tables

Just how these tables might be used is a matter for conjecture, but several possibilities have been suggested. The obvious one is to have a standard payment based upon some sort of actuarial assessment of the individual’s worth or future earnings, and then use the assigned share of that amount to compensate the victim or the family.
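As a hedged illustration, the proportional rule just described, and the thresholded alternative taken up in the next paragraph (nothing below AS = .1, proportional up to .5, full value above .5), can each be written as a short payoff function. The function names and the dollar figure are hypothetical; the cutoffs are the example values quoted in the text, not legislated amounts.

```python
def proportional_payment(assigned_share, actuarial_value):
    """Pay the assigned share of the actuarial value."""
    return assigned_share * actuarial_value


def thresholded_payment(assigned_share, actuarial_value, floor=0.1, ceiling=0.5):
    """Pay nothing below `floor`, the full value above `ceiling`,
    and the assigned share of the value in between."""
    if assigned_share < floor:
        return 0.0
    if assigned_share > ceiling:
        return float(actuarial_value)
    return assigned_share * actuarial_value


if __name__ == "__main__":
    value = 500_000  # hypothetical actuarial value in dollars
    for share in (0.05, 0.25, 0.5, 0.75):
        print(share, proportional_payment(share, value), thresholded_payment(share, value))
```

How ties at exactly .1 or .5 are treated is a design choice that the text leaves open; the sketch pays proportionally at both boundaries.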
Another payoff function has been suggested that gives no compensation for AS less than some minimum value such as .1, the intent being to bar many tiny claims whose processing is expensive, then to pay proportional to AS until .5, and for AS > .5 to pay the total actuarial value so as to make this approach seem at least as attractive as the present tort system. Again, this sort of choice would depend upon legislation tempered by the court system.

7. Acknowledgments

This paper was supported in part by Grant SES 8401422 from the National Science Foundation and is based partly upon Lagakos and Mosteller (1986).

BIBLIOGRAPHY

1. Kato, H. and Schull, W. J. (1982). Studies of the mortality of A-bomb survivors. 7. Mortality, 1950-1978: Part 1. Cancer mortality, Radiation Research, 90, 359–432.
2. Lagakos, S. W. and Mosteller, F. (1986). Assigned shares in compensation for radiation-related cancers, Risk Analysis, 6, 345–357 plus Discussion, 359–380.


3. Land, C. E., Boice, J. D., Shore, R. E., Norman, J. E., and Tokunaga, M. (1980). Breast cancer risk from low-dose exposures to ionizing radiation: Results of parallel analysis of three exposed populations of women, Journal of the National Cancer Institute, 65, 353–376.
4. National Institutes of Health (1985). Report of the National Institutes of Health Ad Hoc Working Group to Develop Radioepidemiological Tables, Publication No. 85-2748.
5. National Research Council-National Academy of Sciences (1984). Assigned Share for Radiation as a Cause of Cancer: Review of Radioepidemiologic Tables Assigning Probabilities of Causation (Report of the Oversight Committee on Radioepidemiologic Tables), Washington D.C.: National Academy Press.
6. Smith, P. G. and Doll, R. (1982). Mortality among patients with ankylosing spondylitis after a single treatment course with X ray, British Medical Journal, 284, 449–460.
7. Whittemore, A. S. and McMillan, A. (1983). Lung cancer mortality among U.S. uranium miners, Journal of the National Cancer Institute, 71, 489–499.

SUMMARY

To help compensate people who get cancers after being accidentally exposed to radiation, the Congress of the U.S.A. requested tables of probability of causation (assigned shares) to be computed. The tables took account of type of cancer, gender, size of dose, age at exposure and at diagnosis, and smoking. For most cancers, the assigned share was based on the age-t-specific cancer incidence rate for dose d, r(t|d), using the constant relative risk model for assigned share AS = [r(t|d) − r(t|0)]/r(t|d). The intent is to replace the tort law for damages by a workman’s compensation system.

RÉSUMÉ

Afin de faciliter le dédommagement de ceux qui sont atteints des cancers après être exposés accidentellement à la radiation, le Congrès des États Unis a demandé le calcul des tableaux de probabilité de cause (parts assignés). Les tableaux ont tenu compte du type de cancer, du sexe de l’individu, de la dose, de l’âge à l’exposition, de l’âge au diagnostic et des habitudes à l’égard de fumer. Pour la plupart des cancers le part assigné a été basé sur le taux d’incidence du cancer par âge t, pour la dose d, en employant le modèle de risque relatif constant pour le part assigné (AS) AS = [r(t|d) − r(t|0)]/r(t|d). Le but est de remplacer les lois sur les dommages par un système d’indemnité.

Reprinted from Journal of the American Statistical Association (1989), 84, pp. 853–861

37. Methods for Studying Coincidences ∗∗

Persi Diaconis and Frederick Mosteller

This article illustrates basic statistical techniques for studying coincidences. These include data-gathering methods (informal anecdotes, case studies, observational studies, and experiments) and methods of analysis (exploratory and confirmatory data analysis, special analytic techniques, and probabilistic modeling, both general and special purpose). We develop a version of the birthday problem general enough to include dependence, inhomogeneity, and almost multiple matches. We review Fisher’s techniques for giving partial credit for close matches. We develop a model for studying coincidences involving newly learned words. Once we set aside coincidences having apparent causes, four principles account for large numbers of remaining coincidences: hidden cause; psychology, including memory and perception; multiplicity of endpoints, including the counting of “close” or nearly alike events as if they were identical; and the law of truly large numbers, which says that when enormous numbers of events and people and their interactions cumulate over time, almost any outrageous event is bound to occur. These sources account for much of the force of synchronicity.

KEYWORDS: Birthday problems; Extrasensory perception; Jung; Kammerer; Multiple endpoints; Rare events; Synchronicity.

. . . for the ‘one chance in a million’ will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us.
R. A. Fisher (1937, p. 16)

This article was presented as the R.A. Fisher Memorial Lecture at the Joint Statistical Meetings in San Francisco, August 19, 1987. The authors express their appreciation to David Aldous, David Hoaglin, Ray Hyman, Augustine Kong, Bruce Levin, Marjorie Olson, Laurel Smith, and Cleo Youtz for their contributions to the manuscript. This work was facilitated in part by National Science Foundation Grants SES-8401422 to Harvard University and DMS86-00235 to Stanford University.  Persi Diaconis is Professor of Mathematics, Department of Mathematics, and Frederick Mosteller is Roger I. Lee Professor of Mathematical Statistics Emeritus, Departments of Statistics, Biostatistics, and Health Policy and Management, Harvard University, Cambridge, MA 02138.


1. INTRODUCTION

Coincidences abound in everyday life. They delight, confound, and amaze us. They are disturbing and annoying. Coincidences can point to new discoveries. They can alter the course of our lives; where we work and at what, whom we live with, and other basic features of daily existence often seem to rest on coincidence. Let us begin with a working definition. A coincidence is a surprising concurrence of events, perceived as meaningfully related, with no apparent causal connection. For example, if a few cases of a rare disease occur close together in time and location, a disaster may be brewing. The definition aims at capturing the common language meaning of coincidence. The observer’s psychology enters at surprising, perceived, meaningful, and apparent. A more liberal definition is possible: a coincidence is a rare event; but this includes too much to permit careful study. Early in this century, biologist Paul Kammerer (see Sec. 2) and psychiatrist C. G. Jung separately studied coincidences and tried to develop a theory for them. Our definition is close to that given by Jung, who was deeply concerned with coincidences. Jung wrote a book called Synchronicity: An Acausal Connecting Principle (Jung 1973). Jung argued that meaningful coincidences occur far more frequently than chance allows. To account for this, he postulated synchronicity as “a hypothetical factor equal in rank to causality as a principle of explanation” (p. 19). Jung’s images have captured the popular imagination, and synchronicity has become a standard synonym for coincidence. We have organized this article around methods of studying coincidences, although a comprehensive treatment would require at least a large monograph.

2. OBSERVATIONAL STUDIES

One way to study coincidences is to look at the world around us, make lists, and try to find patterns and structure.
Kammerer summarized years of work on such studies in his book Das Gesetz der Serie: Eine Lehre von den Wiederholungen im Lebens- und im Weltgeschehen (The Law of Series: A Doctrine of Repetitions in Events in Life and Society) (Kammerer 1919). Kammerer’s traumatic career is brilliantly summarized in Arthur Koestler’s (1971) book The Case of the Midwife Toad. This book includes a discussion and some translations of Kammerer’s laws of series. Kammerer is struck with the seeming repetition of similar events. He reports entries from his journal over a 15-year period. The following examples seem typical.

My brother-in-law E. V. W. attended a concert in Bosendorfer Hall in Vienna on 11 Nov. 1910; he had seat #9 and also coatcheck #9. (Kammerer 1919, p. 24)

On the walls of the Artist’s Cafe across from the University of Vienna hang pictures of famous actors, singers, and musicians. On the 5th of May 1917,


I noticed for the first time a portrait of Dr. Tyvolt. The waiter brought me the New Free Press, in which there was an article on the crisis in German Popular Theater, with Dr. Tyvolt as the author. (Kammerer 1919, p. 27)

We would classify these examples as anecdotes. They seem a bit modest to be called case studies, but by making a collection Kammerer has moved them into the category of observational studies. An observational study consists of data collected to find out what things happen or how often events occur in the world as it stands, including comparisons of outcomes for different naturally occurring groups (Hoaglin, Light, McPeek, Mosteller, and Stoto 1982, pp. 6, 8, and 55–75). Kammerer reported some 35 pages of examples. He then gave a rough classification of series into types. His classifications were all illustrated by examples. To give but one, under “order within order” Kammerer wrote, “I arrange some objects in some way or other, play around with them, move them back and forth, and the largest come to rest together. Or, I sort some materials by their contents but the keywords arrange themselves in alphabetical order, without my doing anything” (Kammerer 1919, p. 51). We have not succeeded in finding a clearly stated law of series in Kammerer’s writing. The closest we have come is Koestler’s translation “We thus arrive at the image of a world-mosaic or cosmic kaleidoscope, which, in spite of constant shufflings and rearrangements, also takes care of bringing like and like together” [translated from Kammerer (1919, p. 165) in Koestler (1971, p. 140)].

Modern Observational Studies in Psychology Related to Studying Coincidences. The brain processes and recalls information in ways that we barely understand. Clearly memory failure, selective attention, and the heuristic shortcuts we take in dealing with perceptions can sometimes deceive us into being surprised or lull us into ignoring rare events. The literature offers some work on the psychology of coincidence. Some of this work relies on observational studies, and some of it verges into experiments when the investigator compares opinions on the same topic following different stimuli.
In the interest of continuity we report such work in this section. Over the past 10 years Ruma Falk and her collaborators have made a focused effort to study people’s reactions to coincidences. In Falk (1982), she showed that the way a story is told can change its degree of surprise. Adding specific, superfluous details made the same coincidence seem more surprising. Surprise ratings increased if the same stories were told as potential future events as opposed to things that had just happened. The results suggest that upon hearing somebody’s coincidence story, one is aware of a wide range of possibilities and considers the coincidence as one of many events that could have happened. Falk (in press) and Falk and MacGregor (1983) showed that people found stories that happened to themselves far more surprising than the same stories


occurring to others. Both of these findings agree with common sense; however, we all believe some things that are wrong. This substantial body of careful quantitative work is a fresh breeze in a sea of talk. On a different theme, Hintzman, Asher, and Stern (1978) studied selective memory as a cause of coincidence. They claimed that we subconsciously keep track of many things. When a current event matches anything in memory, that event is retrieved from memory, and the coincidence remembered. We may thus overestimate the relative rate of coincidences. Hintzman et al. (1978) set up a laboratory investigation involving free recall of incidentally learned words, which provides some evidence for the speculations. Falk (in press) reported that Kallai (1985) replicated their findings using events instead of words. The probability problems discussed in Section 7 make the point that in many problems our intuitive grasp of the odds is far off. We are often surprised by things that turn out to be fairly likely occurrences. The body of work of Kahneman, Tversky, and their colleagues explores the way that people assign odds in the face of uncertainty. That work is presented and reviewed in Kahneman, Slovic, and Tversky (1982) and by Nisbett and Ross (1980). The studies just described touch on a handful of important themes. They do not yet provide a broad, solid base for understanding the psychology of coincidences. Nevertheless, psychological factors are a main cause of perceived coincidences, and any systematic study, even if informal, is valuable. Two related methods of studying coincidences are anecdotes and case studies. Unfortunately, space considerations force us to skip them.

3. EXPERIMENTS

Often investigators conduct experiments to test the powers of coincidence. We regard experiments as investigations where the investigator has control of the stimuli and other important determinants of the outcomes.
We do not use experiment as in Harold Gulliksen’s joke “They didn’t have any controls because it was only an experiment.” Many experiments have been performed as tests of extrasensory perception (ESP). We illustrate with a single widely publicized experiment of Alister Hardy and Robert Harvie, reported in their book The Challenge of Chance (Hardy, Harvie, and Koestler 1973; although Koestler is a coauthor of the book, his contributions do not deal with the experiment). Their experiment attempts to amplify any signal from an agent in an ESP study by using many agents simultaneously. A typical phase of their experiment uses 180 sending subjects and 20 receiving subjects, all gathered in a large hall. A picture was shown or drawn for the sending subjects who concentrated on it, trying to send it with their minds to the receivers. The receivers made a drawing or wrote a statement that told what they received. At the end of a minute, monitors collected the drawings from the receivers.


The investigators isolated the receivers in cubicles with masked sides and fronts that faced the drawing. We do not know how secure the cubicles were from outside sounds such as whispering. The experiment used feedback. After every trial the receivers left their cubicles and viewed the target figure. They then returned to the cubicle for the next transmission. Scorers counted close matches (coincidences) between receivers’ drawings and the pictures sent as successes. In informal reports of this and similar experiments, we often see examples of close matches, and Hardy and Harvie presented many examples. Our reaction to such successes is “Maybe there is something to this. How can we test it?” Hardy and Harvie conducted a permutation test as follows: The total number of receiver responses over a seven-week period was 2,112. Out of these, 35 (or 1.66%) were judged direct hits. To obtain control data, the investigators compared targets with responses randomly shown in other trials. If the hits were attributable to chance, then about the same proportion of hits should occur. To ensure that the feedback would not corrupt the comparison, the investigators compared targets with responses chosen at random from trials held before the target trial. The experiment had 35 hits and the control 59; however, the numbers of trials were in the ratio 2 to 5. One standard binomial significance test conditions on the total number of hits, 94 = 35 + 59. It asks for the probability of 35 or more hits in 94 trials when the probability of success is 2/7 = .286. Binomial tables give .0444. A two-sided level is about .09. The experiment offers no strong evidence for ESP or a hidden synchronous force. Looking at some of the closely matched drawings, it is tempting to speculate, as did Hardy and Harvie, that perhaps just a small fraction of hits was really due to telepathy.
Another speculative thought: It would certainly seem that at any rate the majority of the results in the original experiments, which at first sight might have suggested telepathy, can now be just as well explained by the coincidental coming together in time of similar ideas as has been so well demonstrated in the control experiments. (Hardy et al. 1973, p. 110)
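The conditional binomial calculation described above is easy to check directly. The sketch below sums the exact upper tail of a binomial distribution with 94 trials and success probability 2/7; the function and variable names are ours, and doubling the one-sided value is only one simple convention for a two-sided level.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))


hits_experiment = 35
hits_control = 59
total_hits = hits_experiment + hits_control  # 94, the conditioning total
p_null = 2 / 7  # experiment-to-control trials were in the ratio 2 to 5

one_sided = binomial_upper_tail(hits_experiment, total_hits, p_null)
two_sided = 2 * one_sided  # assumption: two-sided level taken as twice the tail

if __name__ == "__main__":
    print(f"one-sided: {one_sided:.4f}, two-sided (doubled): {two_sided:.4f}")
```

The printed tail can be compared with the tabled value of .0444 quoted in the text.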

4. EXPLORATORY DATA ANALYSIS

When data emerge in an unplanned way, we may still profitably analyze them, even using significance tests, though we regard them as having been subjected to exploratory rather than confirmatory data analysis. Usually the results are hypothesis-generating rather than firm conclusions, partly because of heavy affliction with problems of selection and/or multiplicity often unappreciated by even the most perceptive investigator. We provide one instructive illustration. When the Hardy-Harvie experiment led to no convincing evidence of telepathy, the investigators explored other features of the investigation, as any investigator would and should. Sometimes in a single trial two or more receivers would generate much the same drawing or idea, even though their drawings did not seem to have a relation to the target image. In the large


groups with 20 receivers (with sets of receivers in the true experiment compared with equivalent numbers of artificially created control groups who did not have the same images), they found the following numbers of coincidences for equal numbers of opportunities: in the true experiment, 107 pairs, 27 triples, 7 quadruples, and 1 quintuple; in the control, 131 pairs, 17 triples, and 2 quadruples. In small groups with 12 receivers they found 20 pairs in the experiment and 23 pairs in the control. The authors regarded these multiples, which means several people getting the same answer, not as hits on the target, but as a sort of finding. They thought that these coincidences should perhaps be regarded as a form of telepathy. In their book, they put the issue as a question, especially because the controls are producing coincidences at about the same rate as the experimental subjects. The authors noted that triple matches and quadruple matches are more frequent in the experiment. Recall that one feature in the experiment might tend to produce such multiple agreements. After each drawing, all receivers immediately looked at the drawing or slide that they had just tried to match. Possibly this exposure could encourage a next train of thought that might be similar across the several receivers. That concurrence would, however, be torn apart in the control samples. That is why the investigators in retrospect thought that showing the feedback to the receivers immediately after the trial may have been a mistake in design, because we have no way to match that effect in the control part of the study. Though conjectural, this experiment illustrates both the hypothesis generation and the hazards of conclusions from exploratory data analysis. Before they began the study, the investigators did not know that they would want to make this comparison.

5. CONFIRMATORY DATA ANALYSIS

Did Shakespeare use alliteration when he wrote poetry?
This question shocks our friends, many of whom respond with scornful looks and lines like “full fathom five thy father lies.” The psychologist B. F. Skinner (1939, 1941) analyzed this question by looking at Shakespeare’s sonnets and counting how many times the same sound occurs in a line. He compared such coincidences with a binomial null hypothesis. For example, consider the sound s as it appears in lines of the sonnets. Table 1 shows the frequency distribution of the number of s sounds in lines of Shakespeare’s sonnets. (The parentheses give counts Skinner obtained when the same word in a line is not counted more than once.) After comparing the observed and the expected counts, Skinner concluded that, although Shakespeare may alliterate on purpose, the evidence is that binomial chance would handle these observations. He says, “So far as this aspect of poetry is concerned, Shakespeare might as well have drawn his words out of a hat” (Skinner 1939, p. 191). In this problem the noise clearly overwhelms any signal. We find it refreshing to be reminded that things that “we all know must be true” can be very hard to prove.


Table 1. Skinner’s Analysis of Alliteration in Shakespeare’s Sonnets

Number of s sounds in a line    0      1      2      3          4        Total count
Observed                        702    501    161    29 (24)    7 (4)    1,400 (1,392)
Expected                        685    523    162    26         2        1,398

Source: Skinner (1939, p. 189).
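One informal way to summarize how close the observed and expected rows of Table 1 are is a Pearson chi-square statistic. The sketch below is our own illustration, not Skinner’s procedure: pooling the cells for 3 and 4 s sounds (because the expected count of 2 is small) is an assumption, and no p value is attached because the appropriate degrees of freedom depend on how the expected counts were fitted.

```python
# Counts from Table 1, indexed by number of s sounds in a line (0 to 4).
observed = [702, 501, 161, 29, 7]
observed_unique_words = [702, 501, 161, 24, 4]  # parenthesized counts in the table
expected = [685, 523, 162, 26, 2]


def pool_last_two(counts):
    """Combine the cells for 3 and 4 s sounds into a single cell."""
    return counts[:-2] + [counts[-2] + counts[-1]]


def pearson_statistic(obs, exp):
    """Sum of (observed - expected)^2 / expected over the cells."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


stat_all_words = pearson_statistic(pool_last_two(observed), pool_last_two(expected))
stat_unique_words = pearson_statistic(
    pool_last_two(observed_unique_words), pool_last_two(expected)
)

if __name__ == "__main__":
    print(f"pooled statistic, all words:    {stat_all_words:.2f}")
    print(f"pooled statistic, unique words: {stat_unique_words:.2f}")
```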

6. ANALYTICAL TECHNIQUES: FISHER’S CONTRIBUTIONS

Special analytical statistical techniques may be developed for any field. For an R. A. Fisher lecture on coincidences it seems appropriate to display a special method developed by Fisher himself. Fisher (1924, 1928, 1929) wrote three short papers on scoring coincidences. He suggested a way to give partial credit for partial matches and worked out appropriate versions for two different null distributions. Fisher used matching of ordinary playing cards, as in ESP testing, for his main example. If someone guesses at a card, he or she can get it exactly right; get the value, suit, or color right; or merely match as a spot or picture card. The categories can be combined in various ways such as right in color and picture categories. As an example of the need for Fisher’s technique, J. Edgar Coover (1917), who carried out the first systematic studies of ESP, kept track of numbers of exact matches of playing cards and of matches in suits, or of colors. But he had no way of handling these various degrees of agreement simultaneously. Fisher decomposed the possible matches into disjoint categories and suggested −log p as a reasonable score, where p is the chance of the observed match or a better one. The scores are shown in Table 2. The row labels relate to suit. Then 0 stands for no match, C for color, and S for suit. The column labels relate to values. Here 0 stands for no match, R for rank (picture or spot), and N for number (or value). Any pair of cards matches in some cell. For example, the jack of hearts and queen of diamonds match in color and rank, giving rise to the match CR. The table entries show −log10 p, where p is the chance of the observed match or better. Note that 0 means no match or better, so the score for cell 00 is −log10 1 = 0. This computation assumes that we draw both cards at random.
Fisher’s second paper (1928) begins by observing that in usual ESP tests, one card is chosen by a guessing subject, and the second is chosen from a randomly shuffled deck of cards. Subjects guess in a notoriously nonrandom manner. Fisher once discussed with one of us the troubles associated with this nonrandomness


when a radio station investigated ESP by getting readers to send in guesses at the name of a card to be drawn. He said, “What if the first card drawn is the king of spades?” Fisher suggests a new scoring system. His idea uses a conditional score based on the probability of the observed match, conditional on the subject’s guess. Conditioning eliminates the need to model the distribution of the subjects’ guesses explicitly; the score depends on the guess only through its rank. Thus Fisher must provide two tables instead of one. The first table lists scores appropriate if the guesser names picture cards. The second table lists scores appropriate if the guesser names spot cards. Both tables are normalized to have the same mean and variance.

Table 2. Fisher’s Logarithmic Scoring for Various Degrees of Card Matching

                Value
Suit      −0       −R       −N
0−        0        .190     1.114
C−        .301     .491     1.415
S−        .602     .793     1.716

NOTE: Suit: 0−, no match; C−, color; S−, suit. Value: −0, no match; −R, rank (picture or spot); −N, number. Source: Fisher (1924, p. 183).
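The entries of Table 2 can be reproduced from first principles. The sketch below assumes, as the text does, that both cards are drawn at random; it further assumes that the two draws are independent draws from full decks, that jack, queen, and king are the picture cards, and that the ace counts as a spot card. The dictionary and function names are ours.

```python
from fractions import Fraction
from math import log10

# Chance of the stated suit match or a better one.
SUIT_P = {
    "0": Fraction(1),     # no match or better: certain
    "C": Fraction(1, 2),  # same color (includes same suit)
    "S": Fraction(1, 4),  # same suit
}

# Chance of the stated value match or a better one.
VALUE_P = {
    "0": Fraction(1),
    # Same rank class: both picture (3 of 13 values) or both spot (10 of 13).
    "R": Fraction(3, 13) ** 2 + Fraction(10, 13) ** 2,
    "N": Fraction(1, 13),  # same number (value)
}


def fisher_score(suit_match, value_match):
    """-log10 of the chance of the observed match or better."""
    p = SUIT_P[suit_match] * VALUE_P[value_match]
    return log10(float(1 / p))  # log10(1/p) avoids printing -0.0 when p = 1


if __name__ == "__main__":
    print("Suit      -0      -R      -N")
    for suit in ("0", "C", "S"):
        row = "  ".join(f"{fisher_score(suit, v):6.3f}" for v in ("0", "R", "N"))
        print(f"{suit}-     {row}")
```

Because suit and value are independent under these assumptions, each cell is simply the sum of its row and column scores.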

In his exposition, Fisher offers sage advice about practical aspects of evaluating coincidences. For example, he discusses whether people should be more moved when an individual has correctly guessed five cards in succession or on five separate occasions. He prefers the latter because it is hard to know the conditions under which such miracles occur. He wants to protect against fraud and mentions rabbits and conjurers. Remark 1. Fisher’s basic idea of using the value of − log p as a score is now standard statistical practice. He used it most effectively in his work on combining p values across independent experiments. Fisher derived the logarithmic scoring rule from the principle that the difference in scores given to two correctly guessed events should depend only on the ratio of their probabilities. Remark 2. In card guessing, the score would be applied to successive runs through a 52-card deck. Then, the permutation distribution of the sum of scores is a basic object of interest. We observe that this depends on the guesser’s choices. If the guesser always names the queen of hearts, the total score has no randomness. The probability theory for Fisher’s conditional scoring system in this and other applications can be derived from Hoeffding’s combinatorial central limit theorem. Label the values of an n-card deck as 1,


2, . . . , n. Define an n × n array of numbers a(i, j) as follows: The ith row is determined by the guess on the ith trial by setting a(i, j) to the value of the score if j turns up. For a given permutation π of the deck, the total score is $\sum_{i=1}^{n} a(i, \pi(i))$. Hoeffding (1951) showed that this quantity has an approximate normal distribution. Hoeffding’s result puts some restrictions on the a(i, j) that rule out cases where all rows are constant. Bolthausen (1984) gave nonlimiting approximations with error bounds. Remark 3. We do not know why Fisher chose numbered cards versus face cards for the conditioning; a bridge player might have preferred honor cards versus nonhonors. But for ESP purposes, the sets of numbered cards and face cards seem like more homogeneous groups to us. Fisher says this illustration may guide applications to more complex situations. Remark 4. It is tempting to apply Fisher’s idea to other problems. The idea of computing conditional scores is similar to the ideas of skill scoring used to evaluate weather forecasters. Diaconis (1978), Diaconis and Graham (1981), and Chung, Diaconis, Graham, and Mallows (1981) developed these ideas and gave references. Remark 5. We think Fisher’s scoring system is potentially broadly useful and suggest an application to the problem of birthdays and deathdays. Is there a hidden synchronous force causing a person to be born and die at nearly the same time of year? To investigate this, one looks at people’s birth and death dates and scores them for closeness. One problem is the choice of cutoff: What counts as a match, within a day or a month? Fisher’s idea gets around choosing a cutoff by assigning a score of −log p, with p the chance that two points chosen at random are at least as close as the observed pair. Here, distance is measured around a circle (mod 365). Because of homogeneity, the score is just −log10 {(1 + 2 × observed distance)/365}. Hoaglin, Mosteller, and Tukey (1985, chap.
1) reviewed several empirical studies of the correlation between birthday and deathday. 7. MODELING We find it convenient to devide our discussion of probability or statistical modeling into a part based on general-purpose models and a part based on special-purpose models. As Erich Lehmann remarked in the 1988 Fisher lecture, the separation between these two kinds of models is very gray. Indeed, we think it is likely largely a matter of effort. Most of us would start out thinking of the birthday problem as a special model, but with petting and patting, it gradually becomes a general-purpose model, as we illustrate in Section 7.1. 7.1 General–Purpose Models: Birthday Problems We find the utility of the birthday problem impressive as a tool for thinking about coincidences. This section deveops four versions.


Persi Diaconis and Frederick Mosteller

Problem 1: The Standard Birthday Problem. Suppose N balls are dropped at random into c categories, N ≤ c. The chance of no match (no more than one ball) in any of the categories is

$$\prod_{i=1}^{N-1}\left(1 - \frac{i}{c}\right). \tag{1}$$

This formula is easy to calculate, but hard to think about. If c is large and N is small compared to $c^{2/3}$, the following approximation is useful. The chance of no match is approximately

$$\exp(-N^2/2c). \tag{2}$$

This follows easily from Expression (1), using the approximation $\log_e(1 - i/c) \doteq -i/c$. To have probability about p of at least one match, equate (2) to 1 − p and solve for N. This gives the approximations

$$N \doteq 1.2\sqrt{c} \quad \text{for a 50\% chance of a match} \tag{3}$$

and

$$N \doteq 2.5\sqrt{c} \quad \text{for a 95\% chance of a match.} \tag{4}$$
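Expressions (1) through (4) are easy to check numerically. The following is a minimal sketch, not part of the original paper; the function names are illustrative assumptions. It evaluates the exact product in (1), the exponential approximation in (2), and the group size obtained by equating (2) to 1 − p.

```python
import math


def p_no_match_exact(n, c):
    """Expression (1): product over i = 1, ..., n-1 of (1 - i/c)."""
    p = 1.0
    for i in range(1, n):
        p *= 1.0 - i / c
    return p


def p_no_match_approx(n, c):
    """Expression (2): exp(-n^2 / (2c))."""
    return math.exp(-n * n / (2.0 * c))


def people_needed(p_match, c):
    """Solve exp(-N^2 / (2c)) = 1 - p_match for N (not rounded)."""
    return math.sqrt(2.0 * c * math.log(1.0 / (1.0 - p_match)))


def smallest_n_exact(p_match, c):
    """Smallest n whose exact chance of at least one match reaches p_match."""
    n = 1
    while 1.0 - p_no_match_exact(n, c) < p_match:
        n += 1
    return n


if __name__ == "__main__":
    c = 365
    for p in (0.50, 0.95):
        print(p, round(people_needed(p, c), 2), smallest_n_exact(p, c))
    print(p_no_match_exact(23, c), p_no_match_approx(23, c))
```

The unrounded multipliers are the square roots of 2 log 2 and 2 log 20, which the text rounds up to 1.2 and 2.5.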

[The 2.5 in (4) is 2.448, but rounding up seems best.] Thus, if c = 365, N = 22.9 or 23 for a 50% chance and about 48 for a 95% chance. As far as we know, the earliest mention of the birthday problem was made by Von Mises (1939).

Problem 2: Many Types of Categories. The first variant offers a quantitative handle on a persistent cause of coincidences. Suppose that a group of people meet and get to know each other. Various types of coincidences can occur. These include same birthday; same job; attended same school (in same years); born or grew up in same country, state, or city; same first (last) name; spouses’ (or parents’) first names the same; and same hobby. What is the chance of a coincidence of some sort? To start on this, consider the case where the different sets of categories are independent and the categories within a set are equally likely. If the numbers of categories in the sets are $c_1, c_2, \ldots, c_k$, we can compute the chance of no match in any of the categories and subtract from 1 as before. The following fairly accurate approximation is useful. If k different sets of categories are being considered, the number of people needed to have an even chance of a match in some set is about

$$N \doteq 1.2\sqrt{\frac{1}{\dfrac{1}{c_1} + \dfrac{1}{c_2} + \cdots + \dfrac{1}{c_k}}}.$$

The expression under the square root is the harmonic mean of the $c_i$ divided by k. If all $c_i$ equal c, the number of people needed becomes $1.2(c/k)^{1/2}$ so that


multiple categories allow coincidences with fewer people as would be expected. For a 95% chance of at least one match, the multiplier 1.2 is increased to 2.5 as in Expression (4). The preceding approximation for N is derived by using Expression (2) and the independence of the categories. As an illustration, consider three categories: $c_1$ = 365 birthdays; $c_2$ = 1,000 lottery tickets; $c_3$ = 500 same theater tickets on different nights. It takes 16 people to have an even chance of a match here.

Problem 3: Multiple Events. With many people in a group it becomes likely to have triple matches or more. What is the least number of people required to ensure that the probability exceeds 1/2 that k or more of them have the same birthday? McKinney (1966) found, for k = 3, that 88 people are required. For k = 4, 187 are required. Levin (1981) treated this problem using multinomial theory. If N balls are dropped into 365 boxes, what is the chance that the maximum number of balls in a box equals or exceeds k? Using a clever twist of the Bayes theorem, Levin gave an algorithm that allows exact computation. He kindly carried this computation out for us, obtaining the results in Table 3. Thus in an audience of 1,000 people the probability exceeds 1/2 that at least 9 people have the same birthday. We fit a curve to these numbers and find for modest k (say, smaller than 20) that $N \doteq 47(k - 1.5)^{3/2}$ gives a good fit.

Table 3. Number N Required to Have Probability Greater Than 1/2 of k or More Matches With 365 Categories (Bruce Levin)

k     2    3    4    5    6    7    8    9     10     11     12     13
N    23   88  187  313  460  623  798  985  1,181  1,385  1,596  1,813
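Two numerical claims in Problems 2 and 3 can be reproduced directly: the 16 people for the three-category illustration, and the quality of the fitted curve against Table 3. The sketch below is an illustration only; the function names and the dictionary holding Table 3 are assumptions introduced here.

```python
import math

# Table 3 as printed: k -> N (365 categories, probability greater than 1/2).
TABLE_3 = {2: 23, 3: 88, 4: 187, 5: 313, 6: 460, 7: 623, 8: 798,
           9: 985, 10: 1181, 11: 1385, 12: 1596, 13: 1813}


def people_for_match(category_sizes, multiplier=1.2):
    """Problem 2 approximation: multiplier / sqrt(sum of 1/c_i).

    multiplier = 1.2 targets an even chance; 2.5 targets a 95% chance.
    """
    return multiplier / math.sqrt(sum(1.0 / c for c in category_sizes))


def fitted_n(k):
    """Curve fitted in Problem 3: N = 47 (k - 1.5)^(3/2)."""
    return 47.0 * (k - 1.5) ** 1.5


if __name__ == "__main__":
    # Birthdays, lottery tickets, theater tickets.
    print(people_for_match([365, 1000, 500]))
    for k, n in TABLE_3.items():
        print(k, n, round(fitted_n(k), 1))
```

In this comparison the fitted curve tracks the tabled values closely for k of 3 or more and is noticeably low at k = 2.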

In unpublished work we have shown that the following approximation is valid for fixed k and large c, with c the number of categories. The number of people required to have probability p of k or more in the same category is approximately given by solving for N in

$$\frac{N e^{-N/(ck)}}{\left(1 - N/(c(k+1))\right)^{1/k}} = \left[c^{k-1}\, k!\, \log_e\!\left(\frac{1}{1-p}\right)\right]^{1/k}. \tag{5}$$
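Expression (5) has no closed-form solution for N, but a root finder handles it easily. The sketch below is a hedged illustration, not the authors' computation. It assumes the exponential factor is $e^{-N/(ck)}$ and the right side is the kth root of $c^{k-1} k! \log_e(1/(1-p))$, and it uses plain bisection on the interval from 0 to ck, where the left side is increasing. The function names are hypothetical.

```python
import math


def lhs(n, c, k):
    """Left side of (5): N exp(-N/(ck)) / (1 - N/(c(k+1)))^(1/k)."""
    return n * math.exp(-n / (c * k)) / (1.0 - n / (c * (k + 1))) ** (1.0 / k)


def rhs(c, k, p):
    """Right side of (5): [c^(k-1) k! log(1/(1-p))]^(1/k)."""
    return (c ** (k - 1) * math.factorial(k) * math.log(1.0 / (1.0 - p))) ** (1.0 / k)


def solve_n(c, k, p, tol=1e-9):
    """Bisection for N on [0, c*k]; assumes the root lies in that interval."""
    target = rhs(c, k, p)
    lo, hi = 0.0, float(c * k)
    if lhs(hi, c, k) < target:
        raise ValueError("no root in [0, c*k]; approximation not applicable")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lhs(mid, c, k) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(solve_n(30, 3, 0.5))   # triple match in day of the month
    print(solve_n(365, 2, 0.5))  # standard birthday problem
```

With c = 30, k = 3, and p = 1/2 the solution is close to 18, and with c = 365, k = 2 it is close to 23, which is a useful sanity check on the reading of the formula.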

We obtained this result by using an approximation suggested to us by Augustine Kong. Here is an example of its use. A friend reports that she, her husband, and their daughter were all born on the 16th. Take c = 30 (as days in a month), k = 3, and p = 1/2. Formula (5) gives $N \doteq 18$. Thus, among birthdays of 18 people, a triple match in day of the month has about a 50-50 chance.

Problem 4: Almost Birthdays. We often find “near” coincidences surprising. Let us begin with the basic birthday problem. With 23 people it is even odds


that at least 2 have the same birthday. How many people are needed to make it even odds that two have a birthday within a day? Abramson and Moser (1970) showed that as few as 14 people suffice. With seven people, it is about 50-50 that two have a birthday within a week of each other. Changing the conditions for a coincidence slightly can change the numbers a lot. In day-to-day coincidences even without a perfect match, enough aspects often match to surprise us. A neat approximation for the minimum number of people required to get a 50-50 chance that two have a match within k, when c categories are considered, follows from work of Sevast’yanov (1972). The answer is approximately

$$N \doteq 1.2\sqrt{c/(2k+1)}. \tag{6}$$

When c = 365 and k = 1, this approximation gives about 13 people (actually 13.2) compared with the answer 14 from the Abramson-Moser calculation. All variants discussed in this section can be modified to allow unequal probabilities and mild dependencies. Although we have assumed equal probabilities of falling in the various categories, Gail, Weiss, Mantel, and O’Brien (1979) gave several approaches to exact computation and an application to a problem of detecting cell culture contamination.

7.2 Special–Purpose Models

In this section, we illustrate special modeling for a few situations where we are commonly struck by coincidences.

The New Word. Here is a coincidence every person will have enjoyed: On hearing or reading a word for the first time, we hear or see it again in a few days, often more than once. Similarly, if a friend buys a fancy brand of automobile, we notice more and more of them on the road. We explain these coincidences, using new words as an example. Suppose that you are 28 years old and that you graduated from secondary school at age 18. Let us consider only the time since graduation from secondary school, 10 years. You have just recognized a new word.
Thus our first estimate of the rate at which you recognize this word is once in 10 years. After you have been tuned in to it, the word recurs within 10 days. Thus the rate has gone up by a factor of about 365. What are some of the potential sources of change? An actual change in the rate of the appearance of this word could occur because of some change in your behavior—you read a book on a fresh topic and some technical expression arises, such as bit (a basic amount of information). Thus your behavior has changed the rate. This source of appearances of the word should not be regarded as coincidences because it has a fairly obvious cause. In a less than blatant situation, you first come across the new word in your regular reading, say in a newspaper or magazine. Again the word appears in a few days in your reading or work. One possible explanation is that the world has changed. A word or expression formerly not in common use has become


newsworthy. Thus an expression like Watergate can take on special meaning and be seen frequently, whereas formerly it had no real meaning except as a name for a building complex. We do not regard this as synchronicity.

A third causal explanation is heightened perception. Here, your regular reading turns up a word such as formication in an article on an otherwise routine topic, and you discover what it means. You see it again soon. Very likely this word has been going by your eyes and ears at a steady low rate, but you have not noticed it. You are now tuned in to it, and where blindness used to be sits an eagle eye. Thus heightened perception offers a third source of coincidence.

An important statistical phenomenon could also make some contribution to the higher-than-expected frequency. Because of our different reading habits, we readers are exposed to the same words at different observed rates, even when the long-run rates are the same. For high-frequency words like the, the difference is not only small, but irrelevant for this discussion. We are dealing with rare words. Imagine a class of rare words, all of which have about the same long-run frequency. Then, in the simplest situation, imagine being exposed to them at their regular rate. Some words will appear relatively early in your experience, some relatively late. More than half will appear before their expected time of appearance, probably more than 60% of them if we use the exponential model, so the appearance of new words is like a Poisson process. On the other hand, some words will take more than twice the average time to appear, about 1/7 of them ($1/e^2$) in the exponential model. They will look rarer than they actually are. Furthermore, their average time to reappearance is less than half that of their observed first appearance, and about 10% of those that took at least twice as long as they should have to occur will appear in less than 1/20 of the time they originally took to appear.
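The fractions quoted for the exponential model can be recovered from its distribution function. The sketch below is an illustrative check under the stated exponential assumption; the helper name `exp_cdf` is introduced here. For the reappearance figure it evaluates the case of a first wait exactly twice the mean; words that waited longer reappear within 1/20 of that wait somewhat more often.

```python
import math


def exp_cdf(t, mean=1.0):
    """P(waiting time <= t) for an exponential waiting time with the given mean."""
    return 1.0 - math.exp(-t / mean)


# Fraction of words appearing before their expected time: 1 - 1/e.
frac_before_mean = exp_cdf(1.0)

# Fraction taking more than twice the average time: 1/e^2, roughly 1/7.
frac_beyond_twice = 1.0 - exp_cdf(2.0)

# A word whose first appearance took exactly twice the mean reappears
# within 1/20 of that time (0.1 of the mean) with this probability.
frac_quick_return = exp_cdf(2.0 / 20.0)

if __name__ == "__main__":
    print(frac_before_mean, frac_beyond_twice, frac_quick_return)
```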
The model we are using supposes an exponential waiting time to first occurrence of events. The phenomenon that accounts for part of this variable behavior of the words is of course the regression effect.

We now extend the model. Suppose that we are somewhat more complicated creatures, that we require k exposures to notice a word for the first time, and that k is itself a random variable with mean λ + 1. For definiteness suppose k − 1 is distributed according to a Poisson distribution with mean λ = 4. What we are saying is that we must have multiple exposures before we detect a new word, and that the multiple varies from one word to another for chancy reasons. Then, the mean time until the word is noticed is $(\lambda + 1)T$, where T is the average time between actual occurrences of the word. The variance of the time is $(2\lambda + 1)T^2$. Suppose T = 1 year and λ = 4. Then, as an approximation, 5% of the words will take at least time $[\lambda + 1 + 1.65(2\lambda + 1)^{1/2}]T$ or about 10 years to be detected the first time. Assume further that, now that you are sensitized, you will detect the word the next time it appears. On the average it will be a year, but about 3% of these words that were so slow to be detected


the first time will appear within a month by natural variation alone. So what took 10 years to happen once happens again within a month. No wonder we are astonished. One of our graduate students learned the word formication on a Friday and read part of this manuscript the next Sunday, two days later, illustrating the effect and providing an anecdote. Here, sensitizing the individual, the regression effect, and the recall of notable events and the nonrecall of humdrum events produce a situation where coincidences are noted with much higher than their expected frequency. This model can explain vast numbers of seeming coincidences.

The Law of Truly Large Numbers. Succinctly put, the law of truly large numbers states: With a large enough sample, any outrageous thing is likely to happen. The point is that truly rare events, say events that occur only once in a million [as the mathematician Littlewood (1953) required for an event to be surprising] are bound to be plentiful in a population of 250 million people. If a coincidence occurs to one person in a million each day, then we expect 250 occurrences a day and close to 100,000 such occurrences a year. Going from a year to a lifetime and from the population of the United States to that of the world (5 billion at this writing), we can be absolutely sure that we will see incredibly remarkable events. When such events occur, they are often noted and recorded. If they happen to us or someone we know, it is hard to escape that spooky feeling.

A Double Lottery Winner. To illustrate the point, we review a front-page story in the New York Times on a “1 in 17 trillion” event: a woman won the New Jersey lottery twice. The “1 in 17 trillion” number is the correct answer to a not-very-relevant question. If you buy one ticket for exactly two New Jersey state lotteries, this is the chance both would be winners. (The woman actually purchased multiple tickets repeatedly.) We have already explored one facet of this problem in discussing the birthday problem.
The important question is What is the chance that some person, out of all of the millions and millions of people who buy lottery tickets in the United States, hits a lottery twice in a lifetime? We must remember that many people buy multiple tickets on each of many lotteries. Stephen Samuels and George McCabe of the Department of Statistics at Purdue University arrived at some relevant calculations. They called the event “practically a sure thing,” calculating that it is better than even odds to have a double winner in seven years someplace in the United States. It is better than 1 in 30 that there is a double winner someplace in a four-month period, the time between winnings of the New Jersey woman.

8. TOWARD A RATIONAL THEORY OF COINCIDENCES

The preceding review provides some examples of how to think about and investigate coincidences. This final section lists our main findings. Although


we do not yet have a satisfactory general theory of coincidences, a few principles cover a very large measure of what needs to be explained. Here we sketch four principles.

Hidden Cause. Much of scientific discovery depends on finding the cause of a perplexing coincidence. Changes in the world can create coincidences; likewise, changes in our own behavior such as a new pattern of reading or eating can create a pattern. Frequency of forecasting the same dire event improves the chances of simultaneity of forecast and outcome. Forgetting many failed predictions makes success seem more surprising. At the same time, vast numbers of coincidences arise from hidden causes that are never discovered. At the moment, we have no measure of the size of this body of occurrences. Similarly, we have no general way to allow for misrepresentation, mistaken or deliberate, that may lead to many reports of coincidences that never occurred.

Psychology. What we perceive as coincidences and what we neglect as not notable depends on what we are sensitive to. Some research suggests that previous experience gives us hooks for identifying coincidences. Multiple events emphasize themselves, and without them we have no coincidence to recognize. The classical studies of remembering remind us that frequency, recency, intensity, familiarity, and relevance of experience strengthen recall and recognition.

Multiple Endpoints and the Cost of “Close.” In a world where close to identity counts, as it is often allowed to do in anecdotes, and “close” is allowed to get fairly far away, as when Caesar spoke of a military victory as avenging the death in battle, 50 years earlier, of the grandfather of his father-in-law, as if it were a personal revenge (Caesar 1982, p. 33), then the frequency of coincidences rises apace. Some formulas presented here emphasize the substantial effect that multiple endpoints can have.

The Law of Truly Large Numbers.
Events rare per person occur with high frequency in the presence of large numbers of people; therefore, even larger numbers of interactions occur between groups of people and between people and objects. We believe that this principle has not yet been adequately exploited, so we look forward to its further contribution. Concluding Remarks In brief, we argue (perhaps along with Jung) that coincidences occur in the minds of observers. To some extent we are handicapped by lack of empirical work. We do not have a notion of how many coincidences occur per unit of time or of how this rate might change with training or heightened awareness. We have little information about how frequency of coincidences varies among individuals or groups of people, either from a subjective or from an objective point of view. Although Jung and we are heavily invested in coincidences as a


subjective matter, we can imagine some objective definitions of coincidences and the possibility of empirical research to find out how frequently they occur. Such information might help us. To get a better grip on coincidences that matter to people, it might be useful to employ a critical incidence study. The results might help us distinguish between those coincidences that genuinely move people and those that they regard as good fun though not affecting their lives. Such distinctions, if they are valid, would help focus further coincidence studies on matters people think are important.

In a culture like ours based heavily on determinism and causation, we tend to look for causes, and we ask What is the synchronous force creating all of these coincidences? We could equally well be looking for the stimuli that are driving so many people to look for the synchronous force. The coincidences are what drive us. And the world’s activity and our labeling of events generate the coincidences.

The more we work in this area, the more we feel that Kammerer and Jung are right. We are swimming in an ocean of coincidences. Our explanation is that nature and we ourselves are creating these, sometimes causally, and also partly through perception and partly through objective accidental relationships. Often, of course, we cannot compute the probabilities, but when we can, such computations are informative. Where we have solid control and knowledge, the rates of occurrences seem about as expected, as Fisher said, but our inexperience with and lack of empirical information about the kinds of problems coincidences present do make for many surprises.

[Received December 1988. Revised April 1989.]

REFERENCES

Abramson, M., and Moser, W. O. J. (1970), “More Birthday Surprises,” American Mathematical Monthly, 77, 856–858.
Bolthausen, E. (1984), “An Estimate of the Remainder in a Combinatorial Central Limit Theorem,” Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 66, 379–386.
Caesar [Julius] (1982), The Conquest of Gaul (trans. S. A. Handford), Harmondsworth, U.K.: Penguin (original circa 50 B.C.).
Chung, F. R. K., Diaconis, P., Graham, R. L., and Mallows, C. L. (1981), “On the Permanents of Complements of the Direct Sum of Identity Matrices,” Advances in Applied Mathematics, 2, 121–137.
Coover, J. E. (1917), Experiments in Psychical Research at Leland Stanford Junior University (Psychical Research Monograph 1), Stanford University, Leland Stanford Junior University.
Diaconis, P. (1978), “Statistical Problems in ESP Research,” Science, 201, 131–136. (Reprinted in 1985 in A Skeptic’s Handbook of Parapsychology, ed. P. Kurtz, Buffalo, NY: Prometheus, pp. 569–584.)
Diaconis, P., and Graham, R. (1981), “The Analysis of Sequential Experiments With Feedback to Subjects,” The Annals of Statistics, 9, 3–23.
Falk, R. (1982), “On Coincidences,” The Skeptical Inquirer, 6, 18–31.
Falk, R. (in press), “The Judgment of Coincidences: Mine Versus Yours,” American Journal of Psychology.
Falk, R., and MacGregor, D. (1983), “The Surprisingness of Coincidences,” in Analyzing and Aiding Decision Processes, eds. P. Humphreys, O. Svenson, and A. Vári, New York: North-Holland, pp. 489–502.
Fisher, R. A. (1924), “A Method of Scoring Coincidences in Tests With Playing Cards,” Proceedings of the Society for Psychical Research, 34, 181–185. [Reprinted in 1971 in Collected Papers of R. A. Fisher (Vol. 1), ed. J. H. Bennett, Adelaide, Australia: University of Adelaide, pp. 557–561.]
Fisher, R. A. (1928), “The Effect of Psychological Card Preferences,” Proceedings of the Society for Psychical Research, 38, 269–271. [Reprinted in 1972 in Collected Papers of R. A. Fisher (Vol. 2), ed. J. H. Bennett, Adelaide, Australia: University of Adelaide, pp. 241–242.]
Fisher, R. A. (1929), “II The Statistical Method of Psychical Research,” Proceedings of the Society for Psychical Research, 39, 189–192. [Reprinted in 1972 in Collected Papers of R. A. Fisher (Vol. 2), ed. J. H. Bennett, Adelaide, Australia: University of Adelaide, pp. 388–391.]
Fisher, R. A. (1937), The Design of Experiments (2nd ed.), London: Oliver & Boyd.
Gail, M. H., Weiss, G. H., Mantel, N., and O’Brien, S. J. (1979), “A Solution to the Generalized Birthday Problem With Application to Allozyme Screening for Cell Culture Contamination,” Journal of Applied Probability, 16, 242–251.
Hardy, A., Harvie, R., and Koestler, A. (1973), The Challenge of Chance: Experiments and Speculations, London: Hutchinson.
Hintzman, D. L., Asher, S. J., and Stern, L. D. (1978), “Incidental Retrieval and Memory for Coincidences,” in Practical Aspects of Memory, eds. M. M. Gruneberg, P. E. Morris, and R. N. Sykes, New York: Academic Press, pp. 61–68.
Hoaglin, D. C., Light, R. J., McPeek, B., Mosteller, F., and Stoto, M. A. (1982), Data for Decisions: Information Strategies for Policymakers, Cambridge, MA: Abt Books.
Hoaglin, D. C., Mosteller, F., and Tukey, J. W. (eds.) (1985), Exploring Data Tables, Trends, and Shapes, New York: John Wiley.
Hoeffding, W. (1951), “A Combinatorial Central Limit Theorem,” The Annals of Mathematical Statistics, 22, 558–566.
Jung, C. G. (1973), Synchronicity: An Acausal Connecting Principle (trans. R. F. C. Hull; Bollingen Ser.), Princeton, NJ: Princeton University Press.
Kahneman, D., Slovic, P., and Tversky, A. (1982), Judgment Under Uncertainty: Heuristics and Biases, Cambridge, U.K.: Cambridge University Press.
Kallai, E. (1985), “Psychological Factors That Influence the Belief in Horoscopes,” unpublished master’s thesis, Hebrew University, Jerusalem, Dept. of Psychology.
Kammerer, P. (1919), Das Gesetz der Serie: Eine Lehre von den Wiederholungen im Lebens- und im Weltgeschehen, Stuttgart: Deutsche Verlags-Anstalt.
Koestler, A. (1971), The Case of the Midwife Toad, New York: Random House.
Levin, B. (1981), “A Representation for Multinomial Cumulative Distribution Functions,” The Annals of Statistics, 9, 1123–1126.
Littlewood, J. E. (1953), A Mathematician’s Miscellany, London: Methuen. (Reprinted with minor corrections in 1957.)
McKinney, E. H. (1966), “Generalized Birthday Problem,” American Mathematical Monthly, 73, 385–387.
Nisbett, R., and Ross, L. (1980), Human Inference: Strategies and Shortcomings of Social Judgment, Englewood Cliffs, NJ: Prentice-Hall.
Sevast’yanov, B. A. (1972), “Poisson Limit Law for a Scheme of Sums of Dependent Random Variables” (trans. S. M. Rudolfer), Theory of Probability and Its Applications, 17, 695–699.
Skinner, B. F. (1939), “The Alliteration in Shakespeare’s Sonnets: A Study in Literary Behavior,” The Psychological Record, 3, No. 15, 186–192.
Skinner, B. F. (1941), “A Quantitative Estimate of Certain Types of Sound-Patterning in Poetry,” The American Journal of Psychology, 54, 64–79.
Von Mises, R. (1939), “Über Aufteilungs- und Besetzungs-Wahrscheinlichkeiten,” Revue de la Faculté des Sciences de l’Université d’Istanbul, N. S., 4, 145–163. [Reprinted in 1964 in Selected Papers of Richard von Mises (Vol. 2), eds. P. Frank, S. Goldstein, M. Kac, W. Prager, G. Szegö, and G. Birkhoff, Providence, RI: American Mathematical Society, pp. 313–334.]

Reprinted from Journal of the Italian Statistical Society (1993), 2, pp. 269–290

38. A Modified Random-Effect Procedure for Combining Risk Difference in Sets of 2 × 2 Tables from Clinical Trials

John D. Emerson, David C. Hoaglin, and Frederick Mosteller

Middlebury College, U.S.A.; Harvard University, U.S.A.; and Harvard University, U.S.A.

Summary

Meta-analyses of sets of clinical trials often combine risk differences from several 2 × 2 tables according to a random-effects model. The DerSimonian-Laird random-effects procedure, widely used for estimating the population mean risk difference, weights the risk difference from each primary study inversely proportional to an estimate of its variance (the sum of the between-study variance and the conditional within-study variance). Because those weights are not independent of the risk differences, however, the procedure sometimes exhibits bias and unnatural behavior. The present paper proposes a modified weighting scheme that uses the unconditional within-study variance to avoid this source of bias. The modified procedure has variance closer to that available from weighting by ideal weights when such weights are known. We studied the modified procedure in extensive simulation experiments using situations whose parameters resemble those of actual studies in medical research. For comparison we also included two unbiased procedures, the unweighted mean and a sample-size-weighted mean; their relative variability depends on the extent of heterogeneity among the primary studies. An example illustrates the application of the procedures to actual data and the differences among the results.

Keywords: Clinical trials, DerSimonian-Laird method, Meta-analysis, Random effects, Risk difference, Semiweighted mean, 2 × 2 tables, Weighted mean.

This research was supported by Grant HS 05936 from the Agency for Health Care Policy and Research to Harvard University.


John D. Emerson, David C. Hoaglin, and Frederick Mosteller

1. Introduction

As evidence accumulates from clinical trials that compare a treatment to a control, it has become common to combine the results by using systematic statistical methods to obtain numerical estimates of the gain from treatment. Meta-analysis offers a well-developed body of techniques for such synthesis. When the common outcome measure yields a 2 × 2 table as the result of each trial (e.g., by counting cures or counting deaths in the two groups), one natural measure of gain from treatment over control is the risk difference (the difference in event rate between treatment and control). This scale is the easiest to interpret; and for some purposes it is preferable to the usual alternatives, the risk ratio and the odds ratio (Berlin et al., 1989). In the medical literature a substantial number of meta-analyses focus on the risk difference.

Another key consideration involves the relation between the available studies or trials and some population of studies. Fixed-effects analyses regard the available studies as homogeneous, in the sense that they have a common expected risk difference. By contrast, random-effects analyses incorporate heterogeneity among the expected risk differences, usually summarized in a between-study variance. Thus the random-effects approach provides a way to allow for additional variability and to generalize from the findings in a collection of primary studies to future patient populations that differ among themselves in much the same way as those in the primary studies.

For estimating the mean risk difference in the population under a random-effects model, the customary procedure, described by DerSimonian and Laird (1986), weights the risk difference from each primary study inversely proportional to an estimate of its variance. In simulation studies we noticed that, for certain types of data that could reasonably arise in practice, the DerSimonian-Laird procedure exhibits unusual behavior, characterized in part by substantial bias.
The difficulty arises from a lack of independence between the weights and the risk differences being combined. Thus, in the present paper, we propose a modified scheme for estimating the variances (and hence the weights). Extensive simulation experiments show that the modified procedure avoids the bias observed in the original DerSimonian-Laird procedure and also has variance closer to that occurring when ideal weights are known. In addition it performs roughly as well as the better of two potential alternative procedures: the equally weighted mean (unweighted mean) and a sample-size-weighted mean.

38 Random–Effect Procedure


2. Random-effects model and estimation for the population of studies

2.1 Risk difference under a Random-Effects Model

Suppose that each of K randomized trials (the primary studies) has a treated group, T, and a control group, C, with sample sizes $n_{Tk}$ and $n_{Ck}$, respectively. Within each of these groups the response is a binomial proportion, p; we observe p and denote the true rate (of cure, say, or death) by π. Specifically, for k = 1, . . . , K, $n_{Tk}\,p_{Tk} \sim \text{binomial}(n_{Tk}, \pi_{Tk})$ and $n_{Ck}\,p_{Ck} \sim \text{binomial}(n_{Ck}, \pi_{Ck})$; and from these we obtain the observed risk difference $D_k = p_{Tk} - p_{Ck}$.

Under the random-effects model the primary studies constitute a (random) sample from a population of studies, in which $\pi_{Tk}$ has mean $\mu_T$ and variance $\sigma_T^2$, $\pi_{Ck}$ has mean $\mu_C$ and variance $\sigma_C^2$, and $\pi_{Tk}$ and $\pi_{Ck}$ have correlation $\rho$. Thus $D_k$ has conditional expectation $E(D_k \mid \pi_{Tk}, \pi_{Ck}) = \pi_{Tk} - \pi_{Ck}$ and unconditional expectation $E(D_k) = \mu_T - \mu_C$. The meta-analysis seeks to estimate $\mu_T - \mu_C$ from the $D_k$. Further, $D_k$ has conditional variance

$$\operatorname{var}(D_k \mid \pi_{Tk}, \pi_{Ck}) = \frac{\pi_{Tk}(1 - \pi_{Tk})}{n_{Tk}} + \frac{\pi_{Ck}(1 - \pi_{Ck})}{n_{Ck}} \tag{1}$$

and unconditional variance

$$\operatorname{var}(D_k) = \sigma_T^2 + \sigma_C^2 - 2\rho\sigma_T\sigma_C + \frac{\mu_T(1 - \mu_T) - \sigma_T^2}{n_{Tk}} + \frac{\mu_C(1 - \mu_C) - \sigma_C^2}{n_{Ck}}. \tag{2}$$

These expressions guide the choice of weights in certain types of weighted mean.
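Equation (2) follows from equation (1) by the law of total variance: the expected conditional variance supplies the two terms divided by the sample sizes, and the variance of the conditional expectation supplies the between-study part. The sketch below checks this numerically in one simple case. It is an illustration under assumptions made here, not the authors' simulation design: the true rates are taken to be independent (correlation `rho` equal to 0) and each follows a two-point distribution at its mean plus or minus one standard deviation. All function names are hypothetical.

```python
from itertools import product


def conditional_variance(pi_t, pi_c, n_t, n_c):
    """Equation (1): var(D_k | pi_Tk, pi_Ck)."""
    return pi_t * (1.0 - pi_t) / n_t + pi_c * (1.0 - pi_c) / n_c


def unconditional_variance(mu_t, mu_c, sigma_t, sigma_c, rho, n_t, n_c):
    """Equation (2): between-study part plus averaged within-study part."""
    between = sigma_t ** 2 + sigma_c ** 2 - 2.0 * rho * sigma_t * sigma_c
    within = ((mu_t * (1.0 - mu_t) - sigma_t ** 2) / n_t
              + (mu_c * (1.0 - mu_c) - sigma_c ** 2) / n_c)
    return between + within


def total_variance_two_point(mu_t, mu_c, sigma_t, sigma_c, n_t, n_c):
    """Exact var(D_k) by total variance for independent two-point rates."""
    points = list(product((mu_t - sigma_t, mu_t + sigma_t),
                          (mu_c - sigma_c, mu_c + sigma_c)))
    # E[var(D | pi)]: average of equation (1) over the four equally likely pairs.
    mean_cond_var = sum(conditional_variance(a, b, n_t, n_c)
                        for a, b in points) / len(points)
    # var(E[D | pi]): variance of pi_T - pi_C over the same pairs.
    diffs = [a - b for a, b in points]
    mean_diff = sum(diffs) / len(diffs)
    var_cond_mean = sum((x - mean_diff) ** 2 for x in diffs) / len(diffs)
    return mean_cond_var + var_cond_mean


if __name__ == "__main__":
    args = (0.08, 0.06, 0.02, 0.02)
    print(unconditional_variance(*args, 0.0, 100, 100))
    print(total_variance_two_point(*args, 100, 100))
```

The two printed values agree, because the expectation of π(1 − π) equals μ(1 − μ) − σ² for any distribution with mean μ and variance σ².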


2.2 The DerSimonian-Laird Procedure

When current practice adopts a random-effects model, the customary estimator of the mean risk difference $\mu_T - \mu_C$ in the population of studies is a mean of the $D_k$ with weights $w_k^*$ derived by DerSimonian and Laird (1986):

$$DL = \frac{\sum_{k=1}^{K} w_k^* D_k}{\sum_{k=1}^{K} w_k^*} \quad \text{with} \quad w_k^* = \frac{1}{s_k^2 + \hat\sigma_w^2}, \tag{3}$$

where

$$s_k^2 = \frac{p_{Tk}(1 - p_{Tk})}{n_{Tk}} + \frac{p_{Ck}(1 - p_{Ck})}{n_{Ck}} \tag{4}$$

is the within-study estimate of the variance of $D_k$ (that is, it estimates the conditional variance given in equation (1)), and $\hat\sigma_w^2$ estimates the between-study variance, $\sigma_T^2 + \sigma_C^2 - 2\rho\sigma_T\sigma_C$. To get $\hat\sigma_w^2$, DerSimonian and Laird start with $w_k = 1/s_k^2$ and

$$D_w = \frac{\sum_{k=1}^{K} w_k D_k}{\sum_{k=1}^{K} w_k}. \tag{5}$$

They then calculate a sum of squares for constancy of treatment effect,

$$Q_w = \sum_{k=1}^{K} w_k (D_k - D_w)^2, \tag{6}$$

set $Q_w$ equal to its expected value under the random-effects model, require $\hat\sigma_w^2 \ge 0$, and solve to get that

$$\hat\sigma_w^2 = \frac{\max\{0,\; Q_w - (K - 1)\}}{\sum w_k - \left(\sum w_k^2 / \sum w_k\right)}. \tag{7}$$
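Equations (3) through (7) translate directly into a short routine. The following is a minimal sketch written for this section, not code from the paper; the function name and the choice to pass event counts and sample sizes as lists are assumptions. It requires at least two studies and raises an error when some $s_k^2$ is zero, which happens when both observed proportions in a study are 0 or 1 and the weight $1/s_k^2$ is undefined.

```python
def dersimonian_laird(events_t, n_t, events_c, n_c):
    """Return (DL, D_w, sigma2_w) following equations (3)-(7).

    events_t, events_c: event counts in treated and control groups.
    n_t, n_c: the corresponding sample sizes.
    """
    K = len(n_t)
    if K < 2:
        raise ValueError("need at least two studies")
    p_t = [x / n for x, n in zip(events_t, n_t)]
    p_c = [x / n for x, n in zip(events_c, n_c)]
    d = [pt - pc for pt, pc in zip(p_t, p_c)]          # risk differences D_k

    # Equation (4): within-study variance estimates s_k^2.
    s2 = [pt * (1 - pt) / nt + pc * (1 - pc) / nc
          for pt, pc, nt, nc in zip(p_t, p_c, n_t, n_c)]
    if any(v <= 0 for v in s2):
        raise ValueError("some s_k^2 is zero; weights 1/s_k^2 undefined")

    # Equation (5): D_w with weights w_k = 1 / s_k^2.
    w = [1.0 / v for v in s2]
    sum_w = sum(w)
    d_w = sum(wi * di for wi, di in zip(w, d)) / sum_w

    # Equation (6): sum of squares for constancy of treatment effect.
    q_w = sum(wi * (di - d_w) ** 2 for wi, di in zip(w, d))

    # Equation (7): truncated moment estimate of between-study variance.
    denom = sum_w - sum(wi ** 2 for wi in w) / sum_w
    sigma2_w = max(0.0, q_w - (K - 1)) / denom

    # Equation (3): semiweighted mean with w*_k = 1 / (s_k^2 + sigma2_w).
    w_star = [1.0 / (v + sigma2_w) for v in s2]
    dl = sum(ws * di for ws, di in zip(w_star, d)) / sum(w_star)
    return dl, d_w, sigma2_w


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(dersimonian_laird([30, 5, 20], [100, 100, 100],
                            [5, 20, 10], [100, 100, 100]))
```

Note that both `w` and `w_star` are computed from the same observed proportions that define the risk differences, which is the dependence the text goes on to discuss.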

This approach builds on the work of Cochran (1954), who refers to estimators that take the form of equation (3) as semiweighted means. The weights $w_k^*$ in equation (3) give a result that can range from equal weighting to weighting according to the reciprocal of the estimated within-study variance (as in $D_w$). Part of Cochran’s development deals with a set of estimates $x_i$ (i = 1, . . . , K) that have the same mean µ and different variances $\sigma_i^2$. If $\sigma_i^2$ is unknown and one has an estimate $s_i^2$, one might estimate µ by forming a weighted mean of the $x_i$ with weights proportional to $1/s_i^2$. The leading case involves Gaussian data, so that the estimates $x_i$ and $s_i^2$ are independent. In the present binomial setting, however, both $D_k$ and $s_k^2$ are functions of $p_{Tk}$

38 Random–Effect Procedure


and $p_{Ck}$. In practice one hopes that this departure from independence has no serious adverse effect on the performance of DL. Still, Cochran (1954, p. 117) warns that lack of independence will lead to bias, and so the question is whether the bias is large enough to be troublesome in any particular instance. Unfortunately, under some circumstances the relation between $D_k$ and $w_k$ (and hence between $D_k$ and $w_k^*$) appears to introduce substantial bias into both $D_w$ and DL. As part of a study of approaches to meta-analysis of risk differences, we used a random-effects model to generate simulated sets of data from populations with several combinations of $\mu_T$, $\mu_C$, and $\sigma^2$ ($= \sigma_T^2 = \sigma_C^2$). For one combination ($\mu_T = .08$, $\mu_C = .06$, $\sigma = .02$), representing a class of medical situations in which the main concern might be mortality, $D_w$ and DL behaved strangely. The cause of the difficulty turned out to be the relation between $w_k$ (and $w_k^*$) and $D_k$. We discuss the nature of the difficulty later in this paper.

2.3 A Modified DerSimonian-Laird Procedure

Because the weights in $D_w$ and DL are inversely proportional to estimated variances, the unconditional variance of $D_k$ given in equation (2) suggests an alternative set of weights derived from the random-effects model. We let $v_k = 1/\operatorname{var}(D_k)$ (whose calculation we describe below) and define a modified DerSimonian-Laird estimator as

$$MDL = \frac{\sum_{k=1}^{K} v_k D_k}{\sum_{k=1}^{K} v_k}. \tag{8}$$

Because study $k$ contributes explicitly to equation (2) only through $n_{Tk}$ and $n_{Ck}$, we do not expect MDL to show the sort of behavior that we associate with correlation between $D_k$ and its weight. In particular, we do not expect MDL to have any substantial bias. It is unbiased in some particular cases.

To estimate the unconditional variance $\operatorname{var}(D_k)$ in equation (2), we write $\operatorname{var}(D_k)$ in terms of a between-study component and a within-study component:

$$\operatorname{var}(D_k) = \sigma_B^2 + \sigma_k^2$$

with

$$\sigma_B^2 = \sigma_T^2 + \sigma_C^2 - 2\rho\sigma_T\sigma_C$$

and

$$\sigma_k^2 = \frac{\mu_T(1-\mu_T) - \sigma_T^2}{n_{Tk}} + \frac{\mu_C(1-\mu_C) - \sigma_C^2}{n_{Ck}}.$$


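Then (as derived by calculating the expected value of the sample variance of the $D_k$ and the expected values of the average within-sample variances of the treated-group and control-group proportions) we use the estimates

$$\hat\sigma_B^2 = s_D^2 - A_T B_T S_T^2 - A_C B_C S_C^2, \tag{9}$$

with

$$D = \frac{1}{K}\sum_{k} D_k, \tag{10}$$

$$s_D^2 = \frac{1}{K-1}\sum_{k} (D_k - D)^2, \tag{11}$$

$$S_T^2 = \frac{1}{K}\sum_{k} \frac{p_{Tk}(1-p_{Tk})}{n_{Tk}}, \qquad S_C^2 = \frac{1}{K}\sum_{k} \frac{p_{Ck}(1-p_{Ck})}{n_{Ck}},$$

$$A_T = \left[\sum_{k} \frac{n_{Tk}-1}{n_{Tk}^2}\right]^{-1}, \qquad B_T = \sum_{k} \frac{1}{n_{Tk}},$$

$$A_C = \left[\sum_{k} \frac{n_{Ck}-1}{n_{Ck}^2}\right]^{-1}, \qquad B_C = \sum_{k} \frac{1}{n_{Ck}},$$

and

$$\hat\sigma_k^2 = K\left(\frac{A_T S_T^2}{n_{Tk}} + \frac{A_C S_C^2}{n_{Ck}}\right). \tag{12}$$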

If $\hat\sigma_B^2 \le 0$, we use $\hat\sigma_k^2$ (an estimate of the unconditional within-study variance) as the estimate of $\operatorname{var}(D_k)$ in the definition of $v_k$.
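The calculation in equations (8) through (12) can be sketched as follows. This is a minimal Python illustration, not the authors' SAS code. The function name is hypothetical, the sums defining $A_T$, $B_T$, $A_C$, and $B_C$ are taken over the $K$ studies (an assumed reading of the printed formulas), and the sketch assumes that at least one observed proportion lies strictly between 0 and 1 so that no weight is infinite.

```python
def mdl_estimate(p_t, n_t, p_c, n_c):
    """Return (MDL, between-study variance estimate, weights v_k)."""
    k = len(p_t)
    d = [pt - pc for pt, pc in zip(p_t, p_c)]
    d_bar = sum(d) / k                                          # equation (10)
    s2_d = sum((x - d_bar) ** 2 for x in d) / (k - 1)           # equation (11)
    s2_t = sum(p * (1 - p) / n for p, n in zip(p_t, n_t)) / k
    s2_c = sum(p * (1 - p) / n for p, n in zip(p_c, n_c)) / k
    a_t = 1.0 / sum((n - 1) / n**2 for n in n_t)
    a_c = 1.0 / sum((n - 1) / n**2 for n in n_c)
    b_t = sum(1.0 / n for n in n_t)
    b_c = sum(1.0 / n for n in n_c)
    sigma2_b = s2_d - a_t * b_t * s2_t - a_c * b_c * s2_c       # equation (9)
    sigma2_k = [k * (a_t * s2_t / nt + a_c * s2_c / nc)         # equation (12)
                for nt, nc in zip(n_t, n_c)]
    # A non-positive between-study estimate is treated as 0 in the weights.
    v = [1.0 / (max(sigma2_b, 0.0) + s) for s in sigma2_k]
    mdl = sum(vi * di for vi, di in zip(v, d)) / sum(v)         # equation (8)
    return mdl, sigma2_b, v


# Three studies of equal size: the weights coincide, so MDL is the plain mean
mdl_eq, sigma2_b_eq, v_eq = mdl_estimate([0.4, 0.5, 0.3], [10, 10, 10],
                                         [0.3, 0.2, 0.3], [10, 10, 10])
```

Because the weights depend on the individual studies only through their sample sizes, equal sample sizes give equal weights, whatever the observed proportions.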

2.4 The Unweighted Mean and the n-Weighted Mean

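With substantially less computation than in DL or MDL one could use the unweighted mean $D$ (defined in equation (10)), or the n-weighted mean

$$DN = \frac{\sum_{k=1}^{K} u_k D_k}{\sum_{k=1}^{K} u_k} \quad \text{with} \quad u_k = \left[\frac{1}{2}\left(\frac{1}{n_{Tk}} + \frac{1}{n_{Ck}}\right)\right]^{-1} = \frac{2\,n_{Tk}\,n_{Ck}}{n_{Tk} + n_{Ck}} \tag{13}$$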

(the harmonic mean of nT k and nCk ). We include these two estimators because it will be informative to compare their performance against DL and M DL. For an appreciation of how the performance of DL and M DL would compare in actual meta-analyses, we study their behavior in situations with relatively few studies and usually modest sample sizes. We use simulation because


we lack theoretical expressions for the mean and variance of these two estimators under the random-effects model.
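The n-weighted mean of equation (13) needs only the sample sizes and the risk differences. The Python sketch below (function names are hypothetical) applies it to the sample sizes and rounded risk differences that appear in Table 8 of Section 5, so the results can be compared with the DN column of that table up to rounding.

```python
def n_weights(n_t, n_c):
    """Harmonic mean of the two group sizes, u_k in equation (13)."""
    return [2.0 * a * b / (a + b) for a, b in zip(n_t, n_c)]


def n_weighted_mean(d, n_t, n_c):
    u = n_weights(n_t, n_c)
    return sum(ui * di for ui, di in zip(u, d)) / sum(u)


# Sample sizes and (rounded) risk differences as listed in Table 8
n_t = [20, 10, 30, 10, 24, 10, 12, 30, 66, 10, 58]
n_c = [24, 5, 26, 9, 23, 10, 9, 32, 59, 10, 55]
d = [.100, -.200, -.059, -.133, .159, -.300, .056, -.062, -.155, -.100, -.077]

u = n_weights(n_t, n_c)
share = [ui / sum(u) for ui in u]      # weights rescaled to total 1
dn = n_weighted_mean(d, n_t, n_c)
```

The rescaled weights for Studies 1, 2, and 9 come out near .081, .025, and .231, and the estimate is close to the tabled DN; small discrepancies reflect the three-decimal rounding of the inputs.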

2.5 Estimated Variances of the Estimators

To indicate the precision of DL, MDL, $D$, or DN in an actual meta-analysis, we estimate the corresponding standard error from the published rates for the primary studies (not from simulations). As the standard error of DL, DerSimonian and Laird (1986) give the large-sample estimate $(\sum w_k^*)^{-1/2}$. Because in many meta-analyses neither $K$ nor the $n_{Tk}$ and $n_{Ck}$ are large, we estimate the standard error of MDL by using the $\operatorname{var}(D_k)$ discussed in Section 2.3 and the actual weights $v_k$ from equation (8) in the formula

$$\operatorname{var}(MDL) = \frac{\sum v_k^2 \operatorname{var}(D_k)}{\left(\sum v_k\right)^2}. \tag{14}$$

Thus we estimate $\sigma_B^2$ by $\max\{\hat\sigma_B^2, 0\}$ in calculating $v_k$ (as DL does with $\hat\sigma_w^2$ in equation (7)); but we use $\operatorname{var}(D_k) = \max\{\hat\sigma_B^2 + \hat\sigma_k^2, 0\}$ in equation (14), so as to introduce as little bias as possible into this estimate for $\operatorname{var}(D_k)$. For any weighted mean with fixed weights, such as $D$ or DN, we estimate the standard error by substituting those weights for the $v_k$ in equation (14). As a special case $\operatorname{var}(D)$ has a natural nonnegative estimate. We use $s_D^2/K$ from equation (11), because $(1/K)E(s_D^2) = \operatorname{var}(D)$.
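Equation (14) is the usual variance of a weighted mean of independent quantities. A short Python sketch (the function name is hypothetical) makes the substitution of weights explicit. Negative variance estimates are clipped at 0, as described above.

```python
import math


def weighted_mean_se(weights, var_d):
    """Standard error from equation (14) for given weights and var(D_k) values."""
    num = sum(w * w * max(v, 0.0) for w, v in zip(weights, var_d))
    return math.sqrt(num) / sum(weights)


# Equal weights and a common variance of .04 across four studies
se_equal = weighted_mean_se([1, 1, 1, 1], [0.04] * 4)
```

With equal weights and a common variance the formula reduces to the familiar $\sqrt{\operatorname{var}(D_k)/K}$, here $\sqrt{.04/4} = .1$.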

3. Simulation design

3.1 Hypothetical Populations of Studies

The data-dependent weights in DL and MDL prevent us from directly calculating the mean and variance of these estimators (as we can do for $D$ and DN). Thus we studied their properties by choosing several hypothetical populations of primary studies and using simulation. By including $D$ and DN in the simulations we are able to compare their observed means and variances against the theoretical values. We can also compare the performance of DL and MDL to that of these more-familiar estimators. To generate synthetic data according to the random-effects model described in Section 2.1, we proceeded in two phases. First we drew values of $\pi_{Tk}$ and $\pi_{Ck}$ for each of the $K$ primary studies, and then we drew the corresponding observed proportions, $p_{Tk}$ and $p_{Ck}$. The first stage of this sampling process requires a choice of $\mu_T$, $\mu_C$, $\sigma_T^2$, $\sigma_C^2$, and $\rho$. Because, in practice,


treated-group responses often show variation similar to that of control-group responses, we simplified by setting $\sigma_T^2 = \sigma_C^2 = \sigma^2$. Also for simplicity we set $\rho = 0$ (in simulations not presented here we have used other values of $\rho$ and gotten similar results). We then chose five combinations of $\mu_T$, $\mu_C$, and $\sigma$, shown as Designs A through E in Table 1. Design A approximates homogeneity among the primary studies. The other four combinations illustrate a range of values that we have encountered in actual studies from medical research. Design E, with $\mu_T = .08$ and $\mu_C = .06$, represents a class of situations in which the main concern might be mortality. For convenience in tabulating and displaying the results, we took $\mu_T > \mu_C$, though for mortality, of course, we prefer the opposite. To ensure that $\pi_{Tk}$ and $\pi_{Ck}$ lay between 0 and 1, we modeled their population distributions as beta distributions with means and variances $(\mu_T, \sigma^2)$ and $(\mu_C, \sigma^2)$, respectively.

Table 1 Five choices of the parameters µT, µC, and σ for the first stage of sampling in the random-effects model (µT is the mean rate for treated groups in the population of studies, µC is the mean rate for control groups, and σ² is the population variance for both types of groups)

Design    µT     µC     σ
A         .40    .30    .005
B         .40    .30    .10
C         .35    .30    .10
D         .31    .30    .02
E         .08    .06    .02
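A beta distribution with a prescribed mean and standard deviation can be obtained by the method of moments. The Python sketch below shows that step for the Design E treated-group rate. The function name is hypothetical, and the standard-library generator `random.betavariate` stands in for whatever generator a particular implementation uses.

```python
import random


def beta_params(mean, sd):
    """Shape parameters (alpha, beta) of a beta distribution with this mean and sd."""
    common = mean * (1 - mean) / sd**2 - 1
    if common <= 0:
        raise ValueError("sd is too large for a beta distribution with this mean")
    return mean * common, (1 - mean) * common


alpha_t, beta_t = beta_params(0.08, 0.02)     # Design E, treated groups
rng = random.Random(0)
pi_t = rng.betavariate(alpha_t, beta_t)       # one first-stage draw of pi_Tk
```

For Design E the shape parameters are about 14.64 and 168.36, a distribution concentrated near .08 and bounded inside (0, 1).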

3.2 Primary Studies and Sample Sizes

Often the greatest challenges for meta-analysis arise when only a fairly small number of primary studies are available and those studies report modest sample sizes. We designed our simulation to reflect this difficulty, as well as our own experience. With each of the five designs we used $K = 5$, 10, and 20 as the number of primary studies. Because randomized trials ordinarily aim to have $n_{Tk} = n_{Ck}$, we imposed this simplification on the sizes of the treatment and control groups. We denote the common size by $n_k$. Among primary studies, however, $n_k$ often shows substantial variation. We modeled this behavior for small to moderate studies by taking $n_1 = 10$, $n_2 = 20$, $n_3 = 30$, $n_4 = 40$, $n_5 = 50$ with $K = 5$ and repeating


these values in the sequence $n_1 = n_2 = 10, \ldots, n_9 = n_{10} = 50$ with $K = 10$. We also included the simpler choices $n_k = 10$ for all $k$ and $n_k = 50$ for all $k$ (but we did not take all $n_k = 10$ with $K = 5$). To represent somewhat larger meta-analyses and somewhat larger primary studies, we included $K = 20$ with $n_1 = n_2 = 25$, $n_3 = n_4 = 50, \ldots, n_{19} = n_{20} = 250$. Because some meta-analyses encounter a single large primary study, we modified the preceding choice by setting $n_{20} = 2000$. Thus altogether we have seven combinations of $K$ and the $n_k$, which we refer to as patterns, with the abbreviations shown in Table 2. These patterns resemble a range of patterns that we have encountered in published meta-analyses of RCTs. We used each pattern of sample sizes with each of the five designs listed in Table 1.

Table 2 Patterns of numbers of primary studies and sample sizes used in simulations

Pattern^a   K     nk (k = 1, . . . , K)
20V10       20    25(25)250, each used twice
20L10       20    25(25)225, each used twice; n19 = 250; n20 = 2000
1050        10    50 for all 10 studies
10V5        10    10(10)50, each used twice
1010        10    10 for all 10 studies
0550         5    50 for all 5 studies
05V5         5    10(10)50

a. In this abbreviated notation the first two characters give the number of primary studies (K), and the remaining characters summarize the sample sizes (nk). For example, "1050" has 10 primary studies with 50 observations in both groups of each study, and "10V5" has 10 primary studies whose size varies over 5 (equally spaced) values (in both groups).

3.3 Replications

Given a pattern of sample sizes and the randomly drawn values of the $\pi_{Tk}$ and $\pi_{Ck}$, we generated the $p_{Tk}$ and the $p_{Ck}$ ($k = 1, \ldots, K$) according to the binomial distributions with $n_k$ trials and success probabilities $\pi_{Tk}$ and $\pi_{Ck}$, respectively. Each run of the simulation (corresponding to one of the 35 combinations of a design with a pattern of sample sizes) used 2000 replications. In turn, each replication yielded $K$ risk differences, $D_k$, to which we applied the estimators DL, MDL, $D$, and DN described above. (That is, in each run all the estimators received the same data.) For each replication we recorded the


resulting estimates of $\mu_T - \mu_C$, along with other information such as the value of $\hat\sigma_w^2$ in the DerSimonian-Laird procedure. Our analysis of the results focuses mainly on the mean and standard deviation of each estimator over the 2000 replications.
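The two-stage sampling scheme of Sections 3.1 through 3.3 can be sketched in Python as below. This is a simplified illustration rather than a reproduction of the original SAS program. It tracks only the unweighted mean $D$, takes $\rho = 0$ and $\sigma_T = \sigma_C$, draws each binomial count by summing Bernoulli indicators, and uses hypothetical names and an arbitrary seed.

```python
import random
import statistics


def simulate_unweighted_mean(mu_t, mu_c, sd, sizes, reps=2000, seed=1):
    """Mean and standard deviation of the unweighted mean D over `reps` replications."""
    rng = random.Random(seed)

    def beta_params(mean):
        c = mean * (1 - mean) / sd**2 - 1
        return mean * c, (1 - mean) * c

    a_t, b_t = beta_params(mu_t)
    a_c, b_c = beta_params(mu_c)
    estimates = []
    for _ in range(reps):
        d = []
        for n in sizes:
            pi_t = rng.betavariate(a_t, b_t)                    # first stage
            pi_c = rng.betavariate(a_c, b_c)
            x_t = sum(rng.random() < pi_t for _ in range(n))    # second stage
            x_c = sum(rng.random() < pi_c for _ in range(n))
            d.append(x_t / n - x_c / n)
        estimates.append(sum(d) / len(d))
    return statistics.mean(estimates), statistics.stdev(estimates)


# Design E with pattern 1010: ten studies, ten patients per arm
mean_d, sd_d = simulate_unweighted_mean(0.08, 0.06, 0.02, [10] * 10)
```

For this configuration equation (2) gives a theoretical standard deviation of about .037 for $D$, which provides a check on the simulated value.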

3.4 Keeping $s_k^2$ Nonzero

A practical problem arises in calculating $s_k^2$ according to equation (4) when neither $p_{Tk}$ nor $p_{Ck}$ is strictly between 0 and 1. In our designs, especially Design E, the most likely case of this difficulty is $p_{Tk} = p_{Ck} = 0$. Thus, in order to evaluate DL on such a replication, we had to produce a nonzero value of $s_k^2$. For this purpose we adapted an adjustment from exploratory data analysis (Tukey, 1977, Chapter 15; Mosteller and Tukey, 1977, Chapter 5) that replaces a proportion $p = x/n$ by $p^* = (x + 1/6)/(n + 1/3)$. Instead of modifying each $p_{Tk}$ or $p_{Ck}$ in this way, however, we applied the adjustment only to values of $p_{Tk}$ or $p_{Ck}$ that were equal to 0 or 1, and we used the adjusted values only in computing $s_k^2$ (and then only when $s_k^2 = 0$). That is, the adjustment did not affect the value of $D_k$. In the simulation run for pattern 1010 in Design E, 24% of the values of $s_k^2$ involved the adjustment.
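One way to express this rule in code is sketched below in Python. The function name and the use of event counts as inputs are assumptions made for illustration. The sketch relies on the fact that $s_k^2 = 0$ only when both observed proportions equal 0 or 1, so adjusting both in that case matches the rule described above.

```python
def within_study_var(x_t, n_t, x_c, n_c):
    """s_k^2 from equation (4), adjusted only when it would otherwise be zero."""
    p_t, p_c = x_t / n_t, x_c / n_c
    s2 = p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c
    if s2 == 0.0:
        # Both proportions are 0 or 1 here; replace each by (x + 1/6)/(n + 1/3).
        p_t = (x_t + 1 / 6) / (n_t + 1 / 3)
        p_c = (x_c + 1 / 6) / (n_c + 1 / 3)
        s2 = p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c
    return s2


s2_double_zero = within_study_var(0, 10, 0, 10)   # adjusted
s2_single_zero = within_study_var(1, 10, 0, 10)   # left alone, already nonzero
```

The risk difference itself is still computed from the unadjusted proportions, so a double-zero study keeps $D_k = 0$.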

3.5 Theoretically Weighted Mean

Because the values of $\mu_T$, $\mu_C$, and $\sigma^2$ are available during a run of the simulation, one can calculate the true $\operatorname{var}(D_k)$ from equation (2). Using weights inversely proportional to these true variances, we included the theoretically weighted mean of the $D_k$ (which we denote by THW). As the best linear unbiased estimator of $\mu_T - \mu_C$, this procedure provides a theoretical yardstick for the performance of the other procedures.

3.6 Execution

We carried out our simulations in SAS-PC (SAS Institute, Inc. 1985), running under MS-DOS on a microcomputer with a 33 MHz Intel 80486 microprocessor. We used the SAS function RANGAM in generating the $\pi_{Tk}$ and the $\pi_{Ck}$ from synthetic gamma variates (via a standard relation between the beta and gamma distributions) and the SAS function RANBIN in generating the binomial proportions $p_{Tk}$ and $p_{Ck}$. Each simulation run used a distinct starting value for the basic sequences of random numbers.


4. Simulation results

4.1 Unusual Behavior in Design E

As we mentioned in Section 2, the behavior of DL when $\mu_T = .08$, $\mu_C = .06$, and $\sigma = .02$ led us to propose the modified estimator, MDL. Thus we begin with the simulation results for Design E. In Table 3, panel (a) shows the difference between the average value of each estimator and the known target value $\mu_T - \mu_C$. The columns correspond to the seven patterns of sample sizes. Panel (b) gives the corresponding standard deviations. For all patterns, the entry in panel (a) for DL is farther from 0 and thus indicates larger observed bias than the entries for the other estimators, and this behavior is more pronounced for pattern 1010 than elsewhere. It is surprising that the standard deviation of DL for pattern 1010 is about 2/3 as large as the standard deviation of the other four estimators. Because the $n_k$ are equal for this pattern, the $D_k$ all have the same variance. Thus, weighting them equally (as MDL, $D$, DN, and THW do in this instance) theoretically yields the smallest variance (among unbiased estimators). Columns 1050, 1010, and 0550 of panel (a) and panel (b) contain identical entries for MDL, $D$, DN, and THW because for these patterns the four estimators reduce to the same estimator. Thus, applying them to the same $D_k$ at each replication yields the same result.

We attribute the bias and artificially low standard deviation of DL to the relation between the weight $w_k^*$ and the $D_k$ to which it is applied (see equation (3)): both $D_k$ and $s_k^2$ are functions of $p_{Tk}$ and $p_{Ck}$. The effect is more pronounced when $p_{Tk} = 0$ or $p_{Ck} = 0$ and especially when both $p_{Tk} = 0$ and $p_{Ck} = 0$. Unless $\hat\sigma_w^2$ is substantially larger than $s_k^2$ (so that $w_k^* \approx 1/\hat\sigma_w^2$), a zero term in $s_k^2$ increases $w_k^* = 1/(s_k^2 + \hat\sigma_w^2)$. Primary studies that have $p_{Tk} = p_{Ck} = 0$ make a particularly strong contribution, for any reasonable choice of adjustment that makes $s_k^2 > 0$. They have $D_k = 0$; and, when $p_{Tk} = p_{Ck} = 0$ is a likely outcome, its greater weight helps to shift DL toward 0.
At the same time such studies contribute smaller deviations to the sampling variance of DL. This mechanism accounts for the artificially low standard deviation of DL for several patterns. (We say artificially because DL has smaller standard deviation than T HW , which uses the ideal weighting scheme based on the known true variances of the Dk , whereas DL must estimate those variances. We explored the effect of the connection between wk∗ and Dk in a few special simulations for pattern 1010 that departed from Design E by increasing σ 2 . Although, by equation (2), each Dk then has greater variance, the larger values of σ 2 produced smaller standard deviations for DL. We regard such behavior as unnatural). For some values of nk in Design E, studies with pCk = pT k = 0 are a common occurrence. As an indication, when nk = 10 and πCk = .06 (the


Table 3 Results for Design E (µT = .08, µC = .06, σ = .02): Mean (minus µT − µC) and standard deviation over 2000 replications within pattern

Pattern^a   20V10   20L10    1050    10V5    1010    0550    05V5

a. Mean (minus .02), in units of .0001
DL             −9      −7     −12     −20     −75     −12     −23
MDL            −4       0       4      14       5      −2       1
D              −3       1       4      14       5      −2       5
DN             −4       1       4      14       5      −2       1
THW            −4       0       4      14       5      −2       1

b. Standard deviation, in units of .0001
DL             93      94     172     206     244     249     287
MDL            98     106     188     239     366     264     323
D             108     108     188     264     366     264     362
DN             99     144     188     237     366     264     320
THW            97      97     188     237     366     264     320

c. Deviation of mean in standard error units
DL           −4.5    −3.2    −3.1    −4.3   −13.8    −2.1    −3.6
MDL          −1.6       0     0.9     2.6     0.6    −0.3     0.1
D            +1.0     0.6     0.9     2.3     0.6    −0.3     0.6
DN           −1.7     0.2     0.9     2.6     0.6    −0.3     0.1
THW          −1.7    −0.1     0.9     2.6     0.6    −0.3     0.1

a. See Table 2.

mean value in the first-stage distribution), the binomial probability of pCk = 0 is .539; and when πT k = .08, the binomial probability of pT k = 0 is .434. The corresponding probability of pCk = 0 and pT k = 0 is (.539)(.434) = .234. In the simulations the values of πCk and πT k vary about their mean values, but the above calculation illustrates the likelihood of the difficulty of undesirable double zeros. When nk = 50 and πCk = .06, the binomial probability of pCk = 0 is .045; and when πT k = .08, the binomial probability of pT k = 0 is .015. From these illustrative values the probability of the joint event pCk = 0 and pT k = 0 is .0007. In such patterns as 1050 and 0550, then, a study with pCk = pT k = 0


would not be a common occurrence. DL has some apparent (i.e., seeming) bias in these two patterns, but only about one sixth as much as in 1010. The apparent bias in patterns 10V5 and 05V5 is about one quarter as much as in 1010. Because of the importance, in pattern 1010, of primary studies with $p_{Tk} = p_{Ck} = 0$, we investigated an alternative way of adjusting $s_k^2$ in such cases. Instead of replacing $p = 0$ by $p = \frac{1}{6}/(n + \frac{1}{3})$, we used $p = \frac{1}{2}/(n + 1)$. The simulation results for pattern 1010 and other patterns indicated that this change reduced the bias but by no means eliminated it. The simulation results for the modified DerSimonian-Laird procedure, MDL, in Design E (Table 3) indicate that it has no apparent bias. In patterns 1050, 1010, and 0550 it necessarily has the same standard deviation as the theoretically weighted mean, THW; and in the other four patterns its standard deviation is at most slightly greater than that of THW. Thus, in this design the modification avoids the unusual behavior.
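The double-zero probabilities quoted in this section follow from the binomial probability of observing no events, $(1 - \pi)^n$, with the two groups treated as independent. A brief Python check (the function name is hypothetical) uses the mean rates of Design E in place of the varying $\pi_{Tk}$ and $\pi_{Ck}$:

```python
def prob_zero(pi, n):
    """Binomial probability of zero events in n trials with event rate pi."""
    return (1 - pi) ** n


double_zero = {}
for n in (10, 50):
    p_c0 = prob_zero(0.06, n)          # control group shows no events
    p_t0 = prob_zero(0.08, n)          # treated group shows no events
    double_zero[n] = (p_c0, p_t0, p_c0 * p_t0)
```

The values agree with those in the text to the precision quoted there: roughly .539, .434, and .234 for $n_k = 10$, and .045, .015, and .0007 for $n_k = 50$.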

4.2 MDL versus DL in the Other Designs

In view of the favorable performance of MDL in Design E, we proceed to compare DL and MDL in the other four designs. Tables 4 through 7 give the simulation results for Designs A through D, respectively, in the same format as Table 3. Because Design A ($\mu_T = .4$, $\mu_C = .3$, $\sigma = .005$; Table 4) involves negligible between-study variance, we expect (on theoretical grounds) the performance of DN to match that of THW, as the simulation results show. In patterns 20V10, 20L10, 10V5, and 05V5 (where the two estimators are not the same), MDL comes very close to THW, generally closer than $D$. For five of the seven patterns in Design A, MDL comes closer to the true mean than DL. In patterns 20V10, 20L10, 1050, and 0550 the two estimators’ standard deviations are essentially equal, but MDL has smaller standard deviation in patterns 10V5, 1010, and 05V5. On both the mean and the standard deviation, the difference between MDL and DL is largest at pattern 1010 (as also happened for Design E in Table 3). The difference in mean is a plausible consequence of the relation between $w_k^*$ and $D_k$ in DL. In Design A primary studies with $p_{Tk} = p_{Ck} = 0$ are much less likely than in Design E. For example, with $n_k = 10$, $\pi_{Tk} = .4$, and $\pi_{Ck} = .3$, the binomial probability of $p_{Tk} = 0$ and $p_{Ck} = 0$ is .00017, which is small enough to counteract the greater weight, $w_k^*$, when $\hat\sigma_w^2$ is small (as one expects in Design A, which resembles a fixed-effects situation). In Design B ($\mu_T = .4$, $\mu_C = .3$, $\sigma = .1$; Table 5) MDL always comes at least as close to the true mean as DL does, and often the difference is substantial. The standard deviation gives a similar picture: MDL is less variable than DL


Table 4 Results for Design A (µT = .4, µC = .3, σ = .005): Mean (minus µT − µC) and standard deviation over 2000 replications within pattern

Pattern^a   20V10   20L10    1050    10V5    1010    0550    05V5

a. Mean (minus .10), in units of .0001
DL              9       6       2      35      68      20      27
MDL             4       1      −9      13       9      11       9
D               8      −1      −9      17       9      11       9
DN              4       1      −9      13       9      11       7
THW             4       1      −9      13       9      11       7

b. Standard deviation, in units of .0001
DL            129     110     305     403     688     438     589
MDL           129     111     301     392     648     434     575
D             167     164     301     455     648     434     652
DN            127     103     301     385     648     434     562
THW           127     103     301     385     648     434     562

c. Deviation of mean in standard error units
DL            3.3     2.3     0.3     3.9     4.4     2.1     2.0
MDL           1.4     0.4    −1.4     1.5     0.6     1.2     0.7
D             2.1    −0.3    −1.4     1.7     0.6     1.2     0.6
DN            1.3     0.4    −1.4     1.5     0.6     1.2     0.6
THW           1.3     0.4    −1.4     1.5     0.6     1.2     0.6

a. See Table 2.

at each pattern. Thus M DL performs well when the between-study variance is relatively large. In Design C (µT = .35, µC = .30, σ = .1; Table 6) M DL and DL do about equally well on average, and M DL has smaller standard deviation. Again, the between-study variance is relatively large. Further, in Design D (µT = .31, µC = .30, σ = .02; Table 7) the averages of M DL and DL are about the same, pattern by pattern. As in Designs A, B, and C, M DL generally has smaller standard deviation. Occasionally in panel (a) of Tables 3 through 7, all the entries in a column seem a little farther from 0 than one might expect. Because D and DN are both


Table 5 Results for Design B (µT = .4, µC = .3, σ = .1): Mean (minus µT − µC) and standard deviation over 2000 replications within pattern

Pattern^a   20V10   20L10    1050    10V5    1010    0550    05V5

a. Mean (minus .1), in units of .0001
DL              6      23       7      39       4      10      28
MDL            −2      14      −7      13      −6      −2       5
D              −4      13      −7      13      −6      −2       6
DN              1      30      −7      12      −6      −2      −5
THW            −3      14      −7      12      −6      −2       1

b. Standard deviation, in units of .0001
DL            358     352     540     624     847     757     860
MDL           356     350     534     612     808     750     844
D             362     352     534     626     808     750     867
DN            381     669     534     627     808     750     872
THW           356     349     534     605     808     750     842

c. Deviation of mean in standard error units
DL            0.7     2.9     0.6     2.8     2.5     0.6     1.4
MDL          −0.3     1.8    −0.6     0.9    −0.3    −0.1     0.3
D            −0.4     1.7    −0.6     0.9    −0.3    −0.1     0.3
DN            0.1     2.0    −0.6     0.9    −0.3    −0.1    −0.3
THW          −0.3     1.8    −0.6     0.9    −0.3    −0.1     0.1

a. See Table 2.

unbiased, we regard these as chance outcomes of the simulation process. None of them are particularly extreme. By including D and DN , whose theoretical standard deviations we can also calculate, in the simulation, we provided a way of checking on such possible quirks. In summary, M DL avoids the bias that troubles DL in Design E, and in the other four designs M DL gives an apparently unbiased estimate for µT −µC that generally has smaller standard deviation than DL. Together the designs and patterns cover a variety of situations that arise in practice. We would use M DL instead of DL to combine risk differences from a set of primary studies.


Table 6 Results for Design C (µT = .35, µC = .30, σ = .1): Mean (minus µT − µC) and standard deviation over 2000 replications within pattern

Pattern^a   20V10   20L10    1050    10V5    1010    0550    05V5

a. Mean (minus .05), in units of .0001
DL             −2       2      10      13      14      20      23
MDL            −6      −3       3       1     −12      13      13
D              −5      −1       3       2     −12      13      10
DN             −6     −25       3       8     −12      13       5
THW            −6      −2       3       4     −12      13       7

b. Standard deviation, in units of .0001
DL            353     366     545     603     814     760     881
MDL           351     364     540     591     777     753     867
D             353     368     540     613     777     753     896
DN            381     686     540     606     777     753     894
THW           351     364     540     588     777     753     865

c. Deviation of mean in standard error units
DL           −0.2     0.2     0.8     0.9     0.7     1.1     1.2
MDL          −0.7    −0.3     0.2     0.0    −0.7     0.8     0.7
D            −0.6    −0.1     0.2     0.1    −0.7     0.8     0.5
DN           −0.8    −1.6     0.2     0.6    −0.7     0.8     0.3
THW          −0.7    −0.3     0.2     0.3    −0.7     0.8     0.4

a. See Table 2.

4.3 Performance of D and DN

Both $D$ and DN give an unbiased estimate of $\mu_T - \mu_C$ without using weights based on estimates of variability. Besides being able to calculate $\operatorname{var}(D)$ and $\operatorname{var}(DN)$ theoretically (and thus check on the validity of the simulation), we included these estimators in order to learn how they behave in random-effects situations. Whenever a pattern has all the $n_k$ equal (as in 1050, 1010, and 0550), DN and $D$ are the same estimator. This identity, evident in Tables 3 through 7, also extends to THW and MDL and thus simplifies the comparisons among


Table 7 Results for Design D (µT = .31, µC = .30, σ = .02): Mean (minus µT − µC) and standard deviation over 2000 replications within pattern

Pattern^a   20V10   20L10    1050    10V5    1010    0550    05V5

a. Mean (minus .01), in units of .0001
DL              3      −2      −4     −12      −6       7      17
MDL             3      −2      −5     −15      −8       6      15
D               6      −5      −5     −15      −8       6      17
DN              1       1      −5     −15      −8       6      15
THW             2      −2      −5     −15      −8       6      15

b. Standard deviation, in units of .0001
DL            143     140     311     410     682     429     578
MDL           143     147     306     399     645     426     567
D             169     171     306     454     645     426     641
DN            143     163     306     395     645     426     551
THW           142     135     306     395     645     426     551

c. Deviation of mean in standard error units
DL            0.9    −0.7    −0.5    −1.3    −0.4     0.8     1.3
MDL           0.8    −0.5    −0.8    −1.7    −0.6     0.7     1.2
D             1.7    −1.4    −0.8    −1.4    −0.6     0.7     1.2
DN            0.3     0.4    −0.8    −1.7    −0.6     0.7     1.2
THW           0.6    −0.6    −0.8    −1.7    −0.6     0.7     1.2

a. See Table 2.

the estimators. Distinctions can show up only in patterns 20V10, 20L10, 10V5, and 05V5, where nk (= nT k = nCk ) varies among the primary studies. In summarizing the comparative performance of D and DN , we separate the results in Designs A through E according to the extent of the heterogeneity among primary studies. When the heterogeneity is negligible (σ = .005 in Design A) or small (σ = .02 in Designs D and E), DN has smaller (usually substantially smaller) standard deviation than D. The single exception is pattern 20L10 in Design E, where DN has considerably larger standard deviation. The explanation lies in the small values of µT and µC , which reduce the last two terms in var(Dk ) (equation (2)) to the point that, when


$n_k = 2000$, $\sigma_T^2 + \sigma_C^2$ dominates the expression and makes weighting inversely proportional to sample size undesirable. When, on the other hand, the heterogeneity is larger ($\sigma = .1$ in Designs B and C), DN usually has somewhat larger standard deviation than $D$. For pattern 20L10 the standard deviation of DN is much the larger of the two, again because the presence of the large study and the heterogeneity make equal weighting more appropriate ($D$ is only slightly more variable than THW). The patterns with unequal $n_k$ (20V10, 20L10, 10V5, and 05V5) also provide comparisons between MDL and the other two alternative procedures, $D$ and DN. Across the five designs MDL has about the same standard deviation as the better of $D$ and DN, ranging from slightly smaller to somewhat greater. In these same comparisons the standard deviation of MDL is typically 2% larger (and at most 9% larger) than that of THW.

5. Example

In this section we use data from a published meta-analysis to illustrate the differences among the weights that DL, MDL, $D$, and DN can attach to the risk difference from a primary study. Among patients who undergo major surgery, malnutrition tends to increase the risk of complications, such as poor wound healing and infections. One approach to reducing this risk provides nutritional support by delivering predigested nutrients directly to the patient’s bloodstream. Detsky et al. (1987) applied a specific meta-analysis protocol to assess the results of 18 controlled trials of such perioperative total parenteral nutrition. Table 8 gives the data on the rate of complications from major surgery in 11 trials that were randomized or quasi-randomized. The rightmost three columns show the weight that each $D_k$ receives in DL, MDL, and DN, respectively (for convenience in making comparisons, each set of weights totals 1 here). In view of the range of study sizes, it is not surprising that the weights attached by DN vary by a factor of more than 9 (from .025 for Study 2 to .231 for Study 9). The corresponding factors for DL and MDL are 6 and 10, respectively. Furthermore, the three estimators distribute the weight differently among the 11 studies. DN and MDL give their greatest single weight to Study 9, whereas for DL the greatest weight goes to Study 11. DN and MDL give least weight to Study 2, whereas DL gives least weight to Study 4. Study 1 shows the largest variation in weights, mainly because $p_{C1} = 0$ and $p_{T1} = .1$ lead to a relatively small value of $s_1^2$ in DL. The estimates of the mean risk difference in the population range from $D = -.0702$ to $DL = -.0462$. In this example (unlike the simulations) we cannot say which estimate is closest to the correct answer, but $D$, DN, and


Table 8 Treated rate of complications from major surgery, control rate of complications, risk difference, and weights assigned by DL, MDL, and DN in a meta-analysis of randomized or quasi-randomized trials of perioperative total parenteral nutrition (Detsky et al., 1987)

                                                 Weight in
 k    nTk    pTk    nCk    pCk      Dk       DL      MDL     DN
 1     20   .100     24   .000    .100     .178    .083    .081
 2     10   .000      5   .200   −.200     .041    .023    .025
 3     30   .133     26   .192   −.059     .108    .103    .103
 4     10   .200      9   .333   −.133     .033    .035    .035
 5     24   .333     23   .174    .159     .076    .087    .087
 6     10   .100     10   .400   −.300     .040    .037    .037
 7     12   .167      9   .111    .056     .055    .037    .038
 8     30   .500     32   .562   −.062     .074    .117    .115
 9     66   .167     59   .322   −.155     .153    .230    .231
10     10   .100     10   .200   −.100     .051    .037    .037
11     58   .086     55   .164   −.077     .193    .210    .210
Total                                     1.000   1.000   1.000

DL = −.0462, s.e. = .0381
MDL = −.0653, s.e. = .0264
D = −.0702, s.e. = .0404
DN = −.0658, s.e. = .0264

M DL range from −.0702 to −.0653, whereas DL stands somewhat above them at −.0462. Still, the four values do not differ greatly. All four point to a moderate decrease in the risk of complications among patients who received perioperative total parenteral nutrition. When we estimate the variances as described in Section 2.5, we get these estimated standard errors: .0381 for DL, .0264 for M DL, .0404 for D, and .0264 for DN . Their values reflect both properties of the procedures and features of the particular data set. For the estimate of the between-study variance, we obtain −.0032, which we treat as 0. This estimate of no heterogeneity among the Dk reflects an apparently substantial correlation between the πT k and the πCk ; the correlation between the pT k and the pCk is 0.6. Much of this observed correlation, however, arises from the presence of Study 8, whose rate of complications is generally much higher than those in the other studies. It


would be appropriate to ask whether the meta-analysis should include that study. When we use the standard error to assess the significance of the difference between an estimate of $\mu_T - \mu_C$ and 0, DL and MDL give different results. Using a Gaussian reference distribution, we get $z = -.0462/.0381 = -1.21$ for DL and $z = -.0653/.0264 = -2.47$ for MDL, corresponding to two-tailed p-values of .23 and .013, respectively. Thus, MDL would lead us to conclude that total parenteral nutrition reduced the risk of complications, whereas DL would not.

REFERENCES

Berlin, J.A., Laird, N.M., Sacks, H.S. and Chalmers, T.C. (1989), A Comparison of Statistical Methods for Combining Event Rates from Clinical Trials, Statistics in Medicine, 8, 141–151.

Cochran, W.G. (1954), The Combination of Estimates from Different Experiments, Biometrics, 10, 101–129 [reprinted as Paper 58 in Cochran (1982)].

Cochran, W.G. (1982), Contributions to Statistics, New York: Wiley.

DerSimonian, R. and Laird, N. (1986), Meta-analysis in Clinical Trials, Controlled Clinical Trials, 7, 177–188.

Detsky, A.S., Baker, J.P., O’Rourke, K. and Goel, V. (1987), Perioperative Parenteral Nutrition: A Meta-analysis, Annals of Internal Medicine, 107, 195–203.

Mosteller, F. and Tukey, J.W. (1977), Data Analysis and Regression, Reading, MA: Addison-Wesley.

SAS Institute, Inc. (1985), SAS Introductory Guide for Personal Computers, Version 6 edition, Cary, NC: SAS Institute, Inc.

Tukey, J.W. (1977), Exploratory Data Analysis, Reading, MA: Addison-Wesley.
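As a supplementary check on the significance calculation at the end of Section 5, the z-values and two-tailed p-values can be reproduced from the reported estimates and standard errors. The Python sketch below uses a hypothetical function name and relies only on the standard library (`math.erfc` gives the Gaussian tail area).

```python
import math


def z_and_two_tailed_p(estimate, se):
    """z-value and two-tailed Gaussian p-value for testing a zero difference."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p


z_dl, p_dl = z_and_two_tailed_p(-0.0462, 0.0381)     # DL
z_mdl, p_mdl = z_and_two_tailed_p(-0.0653, 0.0264)   # MDL
```

The results match the reported values to the stated precision: z near −1.21 with p near .23 for DL, and z near −2.47 with p near .013 for MDL.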

Reprinted from Harvard Magazine (1999), 101, pp. 34–35

39. The Case for Smaller Classes and for Evaluating What Works in the Schoolroom 

Frederick Mosteller
Harvard University

The United States is now engaged in a large and extensive program to improve our nation’s public-school systems. Last year Congress adopted President Clinton’s initiative to begin the federal funding necessary to add 100,000 elementary-school teachers and a substantial number of new classroom buildings throughout the country. Some states are undertaking similar initiatives. An example is the California program, begun in September 1996 under Republican governor Pete Wilson, that aimed to reduce class size in kindergarten through third grade to 20 students. Although California’s new reduced-class-size programs have been criticized for inadequate preparation, the state’s failure to line up enough additional classrooms and teachers may have encouraged President Clinton’s proposals. Meanwhile, three other states–Florida, Georgia, and Utah–have been considering smaller classes in the early grades. Like many ideas about how to improve or reform education, the effort to reduce class size is controversial. Some critics of public education see the program as a boondoggle, at worst a payoff to entrenched teachers’ unions. Others say strengthening teacher training or generally improving the quality of teachers would be more efficacious. This disagreement over methods of educational reform points to a troubling discovery–we have little, if any, objective, useful data on what really works in education, a quarter-trillion-dollar-plus enterprise that vitally affects our children. In the case of class-size reduction, however, there are such data, and they offer strong evidence that smaller classes in the early grades improve children’s learning. We need, therefore, to pay attention to this example of an educational experiment, both for its immediate bearing on the current issue of class-size reduction and for its larger message about how we ought to go about evaluating what works to improve education. 

Frederick Mosteller, LL.D. ’91, is professor of mathematical statistics emeritus. This article is in part adapted from, and updates, his analysis of the STAR program in the Harvard Education Letter of July/August 1997.


Project STAR (Student/Teacher Achievement Ratio), the state of Tennessee’s four-year study of the educational effects of class size and teachers’ aides in the early grades, is one of the great experiments in education in U.S. history. Its importance derives in part from its being a statewide study and in part from its size and duration. But even more important is the care taken in the study’s design and execution. Not only are the findings valuable, but Project STAR is also extremely important as an example of the kind of experiment needed to appraise other school programs, and as proof that such a project can be implemented successfully on a statewide basis. In the late 1980s, then-Tennessee governor Lamar Alexander (currently a candidate for the Republican presidential nomination) had made education a top priority for his second term. The state legislature and the educational community had been intrigued by a modest-sized Indiana study called Project Prime Time, which found benefits in having small classes in the early grades. The legislature was also aware of an investigation by Gene V. Glass and his colleagues at the University of Colorado and Murdoch University in Australia that used meta-analysis (a way of pooling information from several separate studies to strengthen evidence) to review the literature on the effects of class size. The results of this investigation suggested that a class size of 15 or fewer would be needed to make a noticeable improvement in classroom performance. Meta-analysis, however, was not viewed favorably by all professionals at that time, and the effect of class size continued to be seriously debated. Noting the expense associated with additional classrooms and teachers, the Tennessee legislature decided that it would be wise to have a solid research base before adopting such a major program. In addition to studying class size, the legislature wanted to evaluate the effectiveness of adding a teacher’s aide to a regular-size class. 
It therefore authorized and funded Project STAR. The idea that drove the Tennessee study is that teachers in smaller classes have more time to give to individual children. In addition, teachers and administrators who advocate small classes for students who are beginning school often say they are dealing with a “start-up phenomenon.” When children first come to school, they face a great deal of confusion. They need to learn to cooperate with others, to learn how to learn, and to get organized to become students. They arrive from a variety of homes and backgrounds, and many need training in paying attention, carrying out tasks, and engaging in appropriate behavior toward others in a working situation. The study was carried out in three kinds of groups: small class size (13 to 17 students); regular class size (22 to 25) with a teacher’s aide; and regular class size without a teacher’s aide. The study began in kindergarten and continued through the third grade. The children moved into regular-size classes in the fourth grade. By comparing average pupil performance in the different kinds of classes, researchers were able to assess the relative benefits of small classes and the presence of a teacher’s aide. The experiment involved 79 schools from inner-city, urban, suburban, and rural areas, so that the progress of children


from different backgrounds could be evaluated. In all, the experiment involved about 6,400 students during its four years. As Project STAR approached its final year, the staff requested and received funding for an additional program. The Lasting Benefits Study was designed to follow all three groups of students as they moved into regular-size classes after third grade. Two kinds of tests were used to assess student performance: standardized tests and curriculum-based tests. Standardized tests have the advantage of being used nationwide, but the disadvantage of not being geared directly to the course of study taught locally. Curriculum-based tests reverse those benefits and disadvantages: they measure more directly the increased knowledge of what was actually taught, but usually cannot tell how the results compare with the national picture. After four years, it was clear that smaller classes did bring substantial improvement in early learning in cognitive subjects such as reading and arithmetic (for details on methods and findings, refer to the works by Mosteller et al. cited in the references below). Following the groups further, the Lasting Benefits Study demonstrated that the positive effects persisted into grades 4, 5, 6, and 7, so that students who had originally been enrolled in smaller classes continued to perform better than their grademates who had started in larger classes. In the first two years of Project STAR, the gains of the minority students (primarily African Americans) were twice as great as those of the majority students; in subsequent years, however, they settled back to about the same gain as the rest. The presence of teachers’ aides during Project STAR, though beneficial, did not produce improvements comparable to the effect of the reduction in class size, nor did their presence seem to have as much lasting benefit after third grade. Is reducing class size the best, most cost-effective, reform? The Tennessee study does not prove that. 
Some experts, such as Robert E. Slavin, codirector of Johns Hopkins University’s Center for Research on the Education of Students Placed at Risk, focus on style of teaching and teacher quality as more important. But the valid data needed to assess and compare many alternative strategies simply don’t exist. For example, we do not have strong evidence about the effectiveness (if any) of the widely used, but much debated, procedure of tracking (breaking classes into groups of comparable attainments). What we do know from the Tennessee study is that this kind of investment does have a beneficial result. After reviewing the Project STAR findings, Tennessee policymakers asked themselves where it would be most effective to introduce this intervention. They decided to implement the small-class program in the 17 school districts where the children seemed most at risk of falling behind–those districts with the lowest per-capita incomes. This change meant decreasing class size in only 4 percent of the classrooms in the state. The results of the first three years of this program, called Project Challenge, have been encouraging. Thanks to the smaller classes, the children from these districts are performing better on both standardized and curriculum-oriented


tests than pupils from the same districts in earlier years. Indeed, their end-of-year performance has raised their district ranking in arithmetic and reading from far below the state average for all districts to above average. What we also know from the Tennessee study is that we need more experiments of comparable quality to guide intelligent, effective policymaking for such a huge and vitally important enterprise as education. It seems strange that, after almost a century of educational research, we should be arguing about the outcome of one substantial controlled experiment concerning one classroom feature. I envision collections of districts or states joining together to design studies of mutual interest, just as medical institutions now routinely join together to carry out cooperative randomized clinical trials. The medical and health-care communities have come to expect this. The education community should expect no less.

Extra Credit: Additional Readings

• Frederick Mosteller, “The Tennessee Study of Class Size in the Early School Grades,” The Future of Children, vol. 5, no. 2 (Summer/Fall 1995): 113–127.
• Frederick Mosteller, “Smaller Classes Do Make a Difference in the Early Grades,” Harvard Education Letter, vol. 13, no. 4 (July/August 1997): 5–7.
• Frederick Mosteller, Richard J. Light, and Jason A. Sachs, “Sustained Inquiry in Education: Lessons from Skills Grouping and Class Size,” Harvard Educational Review, vol. 66, no. 4 (Winter 1996): 797–842.
• Robert E. Slavin, Cooperative Learning, second edition (Allyn & Bacon, 1995).

Reprinted from Statistical Science (1988), 3, pp. 136–144

40. Frederick Mosteller and John W. Tukey: A Conversation Moderated by Francis J. Anscombe

This article is adapted from an archival videotaping carried out on May 11, 1987, by the Department of Statistics at the University of Connecticut under the sponsorship of Pfizer Central Research of Groton, Connecticut, in cooperation with the Committee for Filming Distinguished Statisticians of the American Statistical Association. The Project Director was Harry O. Posten and the Associate Project Directors were Alan E. Gelfand, Timothy J. Killeen and Nitis Mukhopadhyay. The article was prepared in the editorial office of Statistical Science. Anscombe: Good afternoon. My name is Frank Anscombe. I am Professor of Statistics at Yale University. I was formerly a member of the Committee for Filming Distinguished Statisticians of the American Statistical Association. This afternoon we have a discussion or conversation between two very famous figures of the statistical world—John Tukey and Frederick Mosteller. John Tukey is Senior Research Statistician and Donner Professor of Science Emeritus and Professor of Statistics Emeritus at Princeton University. He was also, until his recent retirement, Associate Executive Director of Research at the AT&T Bell Telephone Laboratories. Frederick Mosteller is Chairman of the Harvard Department of Health Policy and Management. He is Roger I. Lee Professor of Mathematical Statistics in the Harvard School of Public Health, and Professor of Mathematical Statistics in the Harvard Department of Statistics and in the Harvard Department of Psychology. He is a member of the faculty of the John F. Kennedy School of Government and of the Medical School of Harvard University. John Tukey and Frederick Mosteller have known each other for a long time, since, I believe, 1939. They have collaborated in a number of major research projects and also over the years have had numerous informal contacts and discussions. We are now going to hear from them reminiscences and reflections on developments in statistics during their long association. 
We will hear of their early days at Princeton, of their collaborative studies and of their


attitudes on statistics and the current state of statistical science. Let’s begin with the early days at Princeton. John, you were there a little before Fred. Would you care to begin? Tukey: Thank you. Well, I came to Princeton as a chemist in 1937. I took prelims in math at the end of that year and a Ph.D. in topology a year later. I didn’t know as much about the state of statistics in the mathematics department then as I learned later. It was a difficult time. If it hadn’t been for Luther Pfahler Eisenhart, I don’t think there would have been any statistics in Princeton. And I’m sure that Sam Wilks would have done his important work somewhere else. In May of ’41 I went over to war work and for most of the war, except when they were away, I spent a lot of time with Charlie and Agnes Winsor, with Charlie both day and night and at meals. I learned a lot from real data but I think I learned even more from Winsor. So I came out of the war a statistician, not a topologist. A change I’ve never regretted. Mosteller: I was sent to Princeton as a graduate student by my own mentor, Edwin G. Olds. And I was astonished when I got to Princeton and met Wilks because he looked so much younger than I did. Of course you see me now, but then I looked very young and he looked younger yet. I could hardly believe that he was going to be my teacher. Also, he was a very different man from Edwin Olds. When Edwin Olds wanted you to do something you didn’t have any misunderstanding about that job. He just told you. When Wilks wanted you to do something, it was hard to catch it. You had to be alert for it. He would hint around, “It would be nice if somebody did something of a certain kind.” You had to grasp that he meant you. Coming from Olds it was a little hard for me to make it out sometimes. Wilks was very busy. 
He was in Educational Testing Service work, sample survey work, working for the Navy, working for the National Research Council and a member of the Social Science Research Council—a very busy man. And although tea was a sacrosanct institution at Fine Hall when I was there, Wilks almost never appeared except to make an appointment with somebody or to settle something, some kind of business. He almost never picked up a coffee cup and almost never ate a cookie. There were few statistics students there at that time. Alexander Mood, George Brown, they were both ahead of me. Wilfrid Dixon was in my class. Phil McCarthy, who is now at Cornell, was in my class, and a young man named Ernest Villovaso. The next year Ted Anderson and David Votaw came. The total graduate statistics curriculum consisted of one course which Sam taught throughout the year. We met lots of statisticians because after Wilks built his home famous statisticians were always visiting him at Princeton. All the graduate students met Neyman, Hotelling, A. T. Craig, Cecil Craig, Wald, Koopmans, Deming, Shewhart, Dodge, Romig and so on. Tukey: I don’t have too much to add to this. Before I got involved in other things and started to become statistical, I’ll have to say that not only was Wilks not stopping at tea but I very rarely saw him in any other way. He was hard at work, but not visible on the surface.


Mosteller: The war brought lots of statisticians to Princeton. Cochran came. And as you’ve already mentioned, Winsor, Paul Dwyer, and R. L. Anderson. Tukey, you were working at Fire Control Research and my wife, Virginia, worked there as secretary to Merrill Flood. With the large number of statisticians there in Princeton at that time, a seminar began, an extremely active one. My recollection, John, is that we had many fine speakers but whenever we ran out of speakers you always had something to say. Tukey: Well, I can’t confirm or deny that. And maybe I should clarify the term Fire Control Research; many people in Princeton thought we controlled fires. But really what we were working on in the beginning was stereoscopic height and range finders with an experimental set-up down at Fortress Monroe. Then we moved on into a variety of less obviously related topics including things like armored vehicle fire control and ballistic behavior of rocket powder. And then eventually we finished off the war as the coordinating group for a thing called AC92, which was trying to find out how to make the B29 bomber more useful. But I think the important thing for the present was that there were a number of statisticians there. Particularly toward the end of the war, after Cochran and Mood had set up their two families in a large house (that is now occupied by my lawyer). On Sunday afternoons the statisticians would gather over there. Some people would do The New York Times crossword, some people would do this and that. I first met George Snedecor in the garden of that house, for example, and other people of interest. Mosteller: There was a fair bit of teaching of quality control at this time. Holbrook Working and E. G. Olds and Paul Clifford came and helped develop a quality control program that was going throughout the nation. And Sam, of course, cooperated to set up such an enterprise at Princeton. 
I taught in the program at Princeton and later in Newark and in Philadelphia.¹ I think that experience was one of the reasons I was later involved in NBC’s Continental Classroom. Sam edited The Annals of Mathematical Statistics and I was his assistant. It was amazing how he got the Annals out. On Sunday evening when we would finish pouring some visiting fireman onto the train from Princeton, usually to New York or to Washington, he would turn and say, “I wonder if we shouldn’t spend just a little time getting out the Annals tonight.” And that was always the beginning of a long, hard session because Sam enjoyed the all-night effort. And about five or six in the morning we’d pull out of there and go down to the post office and mail off another issue of the Annals. He also got me a job working with Hadley Cantril. Hadley was a social psychologist who pioneered in using survey research for social science. And as a result of that effort, I ultimately met Sam Stouffer and worked as a sampling consultant to the War Department in Washington. Finally, that led to my coming to

¹ These Princeton Quality Control Courses (short courses we would call them now) were so well liked that participants asked to convert them into an annual event that still continues under the auspices of the Metropolitan Section of the American Society for Quality Control.


Harvard University in the Department of Social Relations after I completed my degree at Princeton. But the war work that I did was partly done in New York with John Williams and Cecil Hastings and Jimmie Savage. We had a very small group; we called it the Princeton Statistical Research Group, Junior because the main group was in Princeton. There was also a Columbia group in the same building. Indeed my wife, Virginia, went to the Columbia group and served as a secretary for Allen Wallis, who headed that group. They had an enormous collection of statisticians including Harold Hotelling, Abe Girshick, Churchill Eisenhart, Albert Bowker, Jimmie Savage, Wald, Wolfowitz and so on. This was a very important move for me, partly because I got to work with Jimmie Savage and partly because I learned to work with problems—real problems—quickly. Problems came in and they always had a deadline on them, and I learned that you had to do what you could, and then write it up and send it off because it wasn’t going to be of any value later. That was a marvelous experience for me. Until then most of my statistical research wasn’t quite finished or quite good enough to display. After that I learned to bundle it up and send it off. Tukey: What Fred didn’t mention, because it never struck him as much as it struck other people, is that just after that experience if you heard somebody speaking in the next room at a party, you couldn’t tell whether it was Jimmie Savage or Fred Mosteller. They had worked together so much that their accents and intonation and everything were really very close. He also didn’t mention Milton Friedman, and it would be a mistake for me to let that omission get by. If I were going to give a paper on statistics anywhere in the ten years after 1945, the one person whose presence in the audience would make me most careful would have been Milton Friedman. 
Not anybody who claimed to be a statistician, but somebody who knew an awful lot of statistics and was very sharp to boot. I think in Princeton we didn’t have the same sort of short fuse experience, but we certainly had the real problems experience and the real data experience and different kinds of experimental design than they teach you in courses. For example, Eastman Kodak was building an experimental height finder and when it was time for us to close down our heightfinder work, I sat down and dictated 101 dictaphone cylinders. That was the first draft of the report on the M1E9 heightfinder. So Fred got one thing which he needed, I must have got another good thing. Mosteller: Well, Milton Friedman gave both Jimmie Savage and me a very, very important lesson. He took a one hundred page manuscript that we had labored over long and carefully and wrote all over it on both sides of the pages. This was at a time when it was hard to get extra copies of anything because they had to be made from scratch. So we were very indignant that he would mess up a manuscript this badly. So we took his 1,000 corrections and explained to him the 22 that we thought were mistaken. He then fought us into the ground. He wouldn’t give on even one. Finally, at the end, he said of


two of them, “Well, those may be matters of taste.” But what was important was that when he got finished he said, “Well, you fellows really have a lot of good things to say, and you ought to learn how to say them. There are some ways to learn how to write. Here are a couple of books that would do you a lot of good.” He sent us off with these books, and Jimmie and I for many months worked very hard on this problem. It made a big difference, I think, to both of us. It was the first time I think anyone ever took the editing of any writing that I had done or that Savage had done seriously enough to give us some motivation for learning how to improve our writing. It was a very important occasion. I learned something else at this time, John, and that was that everything didn’t have to be done by direct mathematics. It was possible to do things by example, by approximation and by simulation. I had done some simulation with Wilks, and with Olds before I had left Carnegie Tech, but simulation seemed to be very often needed in the wartime problems that we had. And indeed, many of them probably couldn’t be done even today by any mechanism other than simulation although it was fairly slow. We didn’t have a lot of heavy calculators available. One of the wartime problems that I had that was rather interesting was this: We had special bombs that could be aimed. Two were called Azon (azimuth only) and Razon (range and azimuth only), and there were also heat-homing bombs. These bombs could be aimed while they were in flight so that they would be likely to hit a road or even a point target. The trouble with them was that they seemed to have a very large standard deviation around their aiming point. And a question before the house was: How was it possible that special bombs could be useful when they had a larger aiming error than standard bombs? So I was called to go to Washington to explain how that could be. 
The answer was essentially this: The radios that controlled the guidance systems in these bombs would break in a substantial proportion of them. And when they broke, they tended to lock their fins in extreme positions with the result that the bombs would just sail off, far, far from the aiming point. On the other hand, when the radio didn’t break then it was possible to aim them very precisely. With Azon you could get a substantial proportion of hits on a long rather narrow target, say thirty or forty feet wide from an altitude of 15,000 feet, even though the overall standard error would be of the order of 2,000 feet. So because John Williams was away I was required to go to the National Academy of Sciences in Washington to explain this to a very distinguished panel of scientists. It was the first important experience I had testifying about a statistical problem. John and I didn’t get to know each other until he came into statistics, though we saw each other at meals at the graduate school and at Fine Hall. But we did work together some in the war on a number of little problems, one of which had to do with the low moments of small samples. I had written a paper that gave the means and the variances of the order statistics of small samples but didn’t give the covariances. When I showed it to Wilks, he felt


that if we didn’t give the covariances the paper wasn’t satisfactory. It was very difficult to get the covariances. Getting accurate covariances for order statistics turned out to be a problem that defeated calculating machines for quite a few years even after they became fairly good. So getting those covariances wasn’t something that was going to come easily. Cecil Hastings, who worked for this Princeton group in New York with me, was very good at this kind of calculation so he worked hard on getting the covariances. Meanwhile, John Tukey and Charlie Winsor were with me one of those afternoons, perhaps a Sunday afternoon, I’m not quite sure, and I explained to them that it would be nice if we had features of these order statistics, not just for the normal distribution, but for some other distributions. I had them, of course, for the uniform distribution but wanted them for other distributions. And John and Charlie agreed to get the corresponding things for some other distributions. John, how did you do that? Tukey: We went for a distribution where the chances of doing it were good. Mosteller: Cheating? Tukey: No, no, not cheating. The main point is that we don’t have to go as far as simulations sometimes to be useful. If you look for a case in which the representing function, the inverse of the cumulative, has a nice algebraic behavior, you can end up doing things of this sort in reasonably closed form. And so we did some of this. So when the paper (Hastings, Mosteller, Tukey and Winsor, 1947) finally came out, it had some uniform information, some Gaussian information, some asymptotic Gaussian information and some information for the special distribution. All this did a fair amount to illuminate how these things varied as you went to distributions with a longer tail. The special distribution was only a little bit longer-tailed than the logistic. We weren’t very brave in those days. 
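Tukey's point about choosing a distribution whose inverse cumulative is algebraically convenient can be illustrated with a short sketch. It assumes, purely for illustration, a quantile function of the form Q(u) = (u^λ − (1 − u)^λ)/λ; this is not claimed to be the special distribution used in the 1947 paper, and the function names are hypothetical. Because the i-th smallest of n uniform draws follows a Beta(i, n − i + 1) distribution, the mean of each order statistic reduces to ratios of gamma functions, with no simulation needed.

```python
from math import exp, lgamma


def beta_power_mean(a: float, b: float, lam: float) -> float:
    """Return E[U**lam] for U ~ Beta(a, b); requires a + lam > 0."""
    # E[U^lam] = Gamma(a + lam) * Gamma(a + b) / (Gamma(a) * Gamma(a + b + lam))
    return exp(lgamma(a + lam) + lgamma(a + b) - lgamma(a) - lgamma(a + b + lam))


def order_stat_mean(i: int, n: int, lam: float) -> float:
    """Mean of the i-th smallest of n draws when the quantile function is
    Q(u) = (u**lam - (1 - u)**lam) / lam (an assumed, illustrative family)."""
    if not 1 <= i <= n:
        raise ValueError("need 1 <= i <= n")
    if lam == 0 or lam <= -1:
        raise ValueError("this sketch handles lam > -1, lam != 0")
    a, b = i, n - i + 1                  # U_(i) ~ Beta(i, n - i + 1)
    upper = beta_power_mean(a, b, lam)   # E[U_(i)**lam]
    lower = beta_power_mean(b, a, lam)   # E[(1 - U_(i))**lam], as 1 - U ~ Beta(b, a)
    return (upper - lower) / lam


if __name__ == "__main__":
    n = 5
    for lam in (1.0, 0.5, -0.5):         # smaller lam gives longer tails
        means = [round(order_stat_mean(i, n, lam), 4) for i in range(1, n + 1)]
        print(lam, means)
```

With λ = 1 the family is the uniform distribution on (−1, 1), so the means reduce to 2i/(n + 1) − 1, which gives a quick check on the formula. Lowering λ stretches the tails and pushes the expected extremes outward, the kind of comparison across tail lengths that the conversation describes.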
Fred mentioned a couple of things that he got into as a result of wartime problems. Maybe I should mention a couple. I got into a robustness problem because when we were working on the B29 we were supposed to be looking at the precision of machine gun fire from the aircraft. Well, everybody knew then as everybody knows now, regrettably, that if you have samples from a Gaussian distribution you should calculate the standard deviation in order to judge how broad it is. It turned out we weren’t getting perfect Gaussian distributions. It turns out that if you add 0.1% of a Gaussian three times as spread out as the basic one, then already the mean deviation has passed the standard deviation as a way of measuring the scale. How you detect whether someone’s been by with a hypodermic needle and put in one part in a thousand of a slightly broader distribution is a very, very difficult point. Indeed, by the time you put in 10% you can be pretty thoroughly ruined. And I think that this was essentially what got me started thinking about robustness—a topic to which I have come back more than once. I worked for Fire Control Research through roughly the end of 1944, then that was sort of winding up. I ended up


going to Bell Laboratories and I worked there full time for a while. Then in September 1945 things were turning down a little so I came back to Princeton half-time and worked on Sam Wilks’ project. Not at Fire Control Research where I’d been, but I saw much more of Fred. While I was at Murray Hill, we had an engineer named Budenbom who had been building a new, especially good tracking radar for tracking aerial targets. He wanted to go to California to give a paper and he wanted a picture to show what his tracking errors were like. So it was a question of calculating the spectrum which was done by the best conventional methods (by ladies using desk calculators). Dick Hamming looked at this and said, “Well, if you only smooth it (1/4, 1/2, 1/4) it will look much better.” So the first thing that we did was to smooth it (1/4, 1/2, 1/4) and sent Bud off with the picture. And the other thing was that Dick and I spent some months trying to understand why it was a good idea to smooth it (1/4, 1/2, 1/4). And that’s where my spectrum analysis education really got started. Mosteller: When I came back from New York with Virginia, she again worked for Merrill Flood who then had his own research company. I buckled down then, at last, to finish writing my thesis. It hadn’t been going too well when I left for New York but when I came back it was going fairly well. The only difficulty was I could rarely see Sam Wilks for the reasons that John has explained earlier. And therefore, John became essentially the advisor on the thesis. I went around to John from time to time and asked him for some suggestions. He always did two things: Took a pass at the problem I asked him about and then he’d always suggest something else, something entirely different to work on. 
And I gradually got an important idea out of that experience which was that it’s important to get out of ruts and into some new activity that may turn out to be more beneficial than the ruts you are already in and can’t handle. So I owe John a great deal for that. The thesis got finished finally (Mosteller, 1946) and though the thesis was one that Wilks had originally suggested, John did a great deal of the final advising on the topic. At the end of that time, I was going to go to Harvard but we were both invited to Lake Junaluska by Gertrude Cox for a conference following a summer teaching program that they had given down there. And John offered to drive me there in his famous stationwagon. So we drove together. Anscombe: Would you explain why this stationwagon was famous? Mosteller: I suppose John should explain why it was famous. Tukey: It wasn’t famous to me, so it has to be Fred. Mosteller: It was the oldest living stationwagon: perhaps the most disreputable looking stationwagon at Princeton. Tukey: It was a 1936 wooden stationwagon and this was only 1946. Mosteller: At any rate, it was a stationwagon, and he was going to drive us to Lake Junaluska in North Carolina for the meeting. We hadn’t been on the road half an hour before he pointed out that he had ideas for a paper that we would write on the way to North Carolina and back. And so we did work away on it. He had an idea for developing something called binomial


probability paper which was very good for plotting certain kinds of binomial information. And we wrote the paper (Mosteller and Tukey, 1949), not finished exactly during the trip, but we did a lot of work on it on the trip and shortly thereafter. Junaluska had a lot of exciting people. Wolfowitz was there, Fred Stephan, Phil Rulon, David Duncan. I think Charlie Winsor was there, Bill Cochran, R. A. Fisher, Gertrude Cox, of course, and many others. Fisher was in very good humor. The first day everyone was asked to say what they would like to hear somebody talk about. And then after that list was put on the board, people were asked to volunteer to give talks on the topics. The thing that interested me very much was that one of the topics requested was Bayesianism and R. A. Fisher volunteered to give that talk. Indeed, I might say it was one of the best talks I ever heard him give in my life. He was always with us in the evening, drinking beer, especially with the younger people. He had a good time at that meeting, and we all had a good time too, both socially and intellectually. I gave a talk about pooling data which was sort of a Bayesian talk (Mosteller, 1948). The idea is that you have means from two different sources and the question is: Can you estimate one of those means better by using information from both these sources instead of only one? Fisher talked to me quite a bit about this idea. He never exactly told me he didn’t like it but he didn’t ever tell me he did like it either. He cross-questioned me very carefully in private for about three-quarters of an hour about it, and at the end we parted, and as far as I could tell, we were still very good friends. So apparently on some occasions, at least, Fisher was very interested in Bayesianism and comfortable with it. Do you remember anything about that, John? Tukey: Well, I would suspect that you weren’t trying to sell it as a matter of high principle. As a practical device I think he could think about it. 
Well, there were various things that Fred has not mentioned like the large organized game of hearts that took place up in the third floor dormitory. It was part of that meeting.

Mosteller: Well designed too.2

2 Tukey was having fun with the design of experiments just then and he set up the investigation so that the players sat in all possible arrangements and took account of the changing position of the dealer. My recollection is that there were some differences among players but not in effect of position of dealer. The players included David Duncan, Fred Stephan, Charlie Winsor, J. W. T. and F. M.

Tukey: Yes. I think we ought to mention for the record the problems of keeping Fisher supplied with beer when we were meeting at a Methodist camp meeting. Fisher treated me very gently. I gave a talk about analysis of covariance from a somewhat nonstandard point of view. I thought he was really quite gentle in pointing out to me how much of it really became the standard one if you just twisted it slightly. I might have expected to be much more roughly handled.

40 Frederick Mosteller and John W. Tukey: A Conversation


Mosteller: After we finished that paper on graphing binomial counts we got involved in some others including a set of papers on industrial quality control, on “quick and dirty” methods (Mosteller and Tukey, 1949-1950). But I think the next joint event probably was the work on the Kinsey Report. Sam Wilks was President of the American Statistical Association and he was requested, I believe by the National Research Council, to appoint a committee to review the statistical methods of the Kinsey Report on sexual behavior in the human male. He asked Cochran to chair the committee and he asked John and me to join on that committee, which we did. It was a substantial effort and we did actually produce a book (Cochran, Mosteller and Tukey, 1954b), and from that book some articles (Cochran, Mosteller and Tukey, 1953 and 1954a) were produced. I think John wrote some very original material on sample surveys in that book, and we got some new thoughts about how social science and behavioral science were being carried out at that time. Dr. Kinsey was a self-reliant scientist and liked to do everything himself. Consequently, when he studied issues like variability he did it essentially by simulation. He was not aware of the substantial statistical work on variability. So some of his work was criticized because he did not use the published literature. Kinsey developed a special method of interviewing that allowed the interview to flow over its many topics in whatever order matters emerged from the respondent. Interviewers needed extensive training to handle this approach. He also had special coding methods to preserve the confidentiality of his respondents. A major problem for his studies was that his respondents were volunteers, although he had some good ideas for getting around this. He tried to gather all material about subjects he studied, and we saw an enormous collection of books and articles in his library on gall wasps, an insect he studied extensively. 
He also had an enormous library on sex.

Tukey: Fred may or may not remember the time when the three of us were going to the train at Princeton to let Fred go back to Harvard and Wullie (Cochran) go back to Johns Hopkins. I think I can still quote Fred very accurately saying, “They couldn’t pay me to do this. They couldn’t pay me to do this.” It was really a labor of love!

Mosteller: The visits to Bloomington, Indiana, required a lot of work and we had many long conferences there. We also had to go to Baltimore and work with Cochran, and Betty acted as hostess. We had to read a great number of critiques of the book. We took each critique and cut it up into little pieces and tried to write a response to each of those critiques. It was a really massive effort but we did finally get it all done. We published both the original statement and our comment in our book.3

3 When you have lots of little slips of paper of irregular size, it is sometimes hard to keep them straight. I recall that one slip of paper was edited by us and entered into the text of our article as our own work. After its publication, W. Allen Wallis pointed out to me that the paragraph was from his own review of the Kinsey Report and that in his opinion it was better said before we had edited it. Let me take this opportunity to acknowledge that accidental bit of plagiarism without bothering to hunt up the offending paragraph.

Tukey: The implication is, if you can do that you can do anything?

Mosteller: [Laughs] I don’t know. We later had an opportunity to work together on the National Halothane Study (Bunker, Forrest, Mosteller and Vandam, 1969). Halothane is an anesthetic and the question was whether halothane could be killing people from massive liver necrosis because it had four halogens in its chemical composition and chemicals with halogens often cause liver damage. We got to working on this enterprise with John Bunker, Lincoln Moses, Bill Brown, John Gilbert, Yvonne Bishop, Morven Gentleman and a host of physicians for the Committee on Anesthesia of the National Research Council. We met almost anywhere in the country, wherever John was. John, I think was on sabbatical that year. At any rate, he was traveling a good deal, and the statistical team would just pick up and go and visit John—Phoenix, or New Orleans, or wherever. We attacked the halothane study with many different methods. There were three main methods, two being statistical and one being a study of the actual tissues from deaths in surgery.

Tukey: Yes, I think there are still things for many statisticians to learn by going to see if they can find a copy of the report on the National Halothane Study and reading the statistical aspects in it. Although, if I don’t say it Fred undoubtedly will, about its having led to the Bishop-Fienberg-Holland book on contingency tables (Bishop, Fienberg and Holland, 1975). Not all the techniques that were piloted in the study got taken to the book stage.

Mosteller: Right now there is renewed interest in this area because Congress, and also the executive branch of the government, especially the Health Care Finance Administration, want to compare hospitals in the quality of their care. This raises the problem of adjusting so as to have some fairness in the comparisons between hospitals that take severe cases and hospitals that take less severe cases. It’s not clear to me now how that should be done even though I was very comfortable with what we did in the halothane study. There seem to be more political questions involved in the new effort than there were in the halothane study. Maybe I’ve become more sensitive, but I think not. I think that’s a realistic part of the new study.

Tukey: Maybe you were technically happier than I was. I wasn’t unhappy in the sense that I knew anything better to do. But if you start rating severity of cases on a five- or seven-point verbal scale, it’s hard to believe that a verbally described severity in a hospital that only sees light cases is the same as that same verbally described severity in one that sees much worse ones. And it’s very hard to understand how to change things so that the word “moribund” will have the same interpretation everywhere. And without that, you have some real problems. Certainly the adjustments that we used were good things. I think we know enough now about making adjustments for broad categories that we might want to take those things somewhat further, if we had all that effort to do again and all that data to work with. “All that data” being information on about 800,000 cases. That’s a lot of data. But carrying on the expositional tradition, there is always the green book (Mosteller and Tukey, 1977). And in that there is material on adjusting for broad categories (pages 240–257). The thing that people forget when they say, “Well, I’m going to cure the fact that there is this background variable. I’ll dichotomize it, and then I’ll look at the two halves of the dichotomy and sort of take what effect there seems to be in each half and pool them together.” Forgetting that if the ratios of the fractions are different, if you sort of cut a distribution across under a knife, the centers of gravity are not going to be in the same place when the distribution is over here and cut there, as when it’s over there and cut here. Dichotomizing helps, but it’s not the whole answer. Being sure you make this correction—or any correction—accurately is not likely in human affairs, but you are a lot better off to make the correction than not. Correction for broad categories is one of the things that really did get into the green book, and I think it’s right to direct people’s attention to it, because it’s a pretty widespread problem that people often sweep under the rug, saying: “Well, we’ll divide it. We’ll at least separate the people with low blood pressure from the people with not-low blood pressure and then not worry about the details anymore.” We learned something about broad categories in the halothane study because the data was all collected on the 800,000 cases before one could get a hard look at it, and the ages had been coded in ten-year blocks. It turned out that the risk of death from surgery about doubled every ten years, so the distinction between sixty-one and sixty-nine, or seventy-two and seventy-nine, was a really important distinction.
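As an editorial illustration of the age-coding remark, the arithmetic can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that risk doubles smoothly and exactly every ten years of age; the function names, the reference age of 60, and the two patient groups are hypothetical and are not taken from the halothane study.

```python
# Illustrative only: assumes risk doubles exactly every ten years of age,
# following a smooth exponential curve. The reference age of 60 is arbitrary.

def relative_risk(age, reference_age=60.0, doubling_years=10.0):
    """Risk at `age` relative to `reference_age` under a pure doubling rule."""
    return 2.0 ** ((age - reference_age) / doubling_years)

def mean_relative_risk(ages):
    """Average relative risk over a group of patients."""
    return sum(relative_risk(a) for a in ages) / len(ages)

# Ages 61 and 69 fall in the same ten-year block, yet their risks differ.
ratio_69_vs_61 = relative_risk(69) / relative_risk(61)

# Two hypothetical groups that would both be coded as the same block,
# with no real difference between them other than age.
younger_group = [60, 61, 62, 63]
older_group = [66, 67, 68, 69]
apparent_ratio = mean_relative_risk(older_group) / mean_relative_risk(younger_group)

print(f"risk(69) / risk(61) = {ratio_69_vs_61:.2f}")
print(f"older vs younger within one block = {apparent_ratio:.2f}")
```

Under that assumption a 69-year-old carries about 1.74 times the risk of a 61-year-old, and two groups sharing one age code can differ by a factor of about 1.5 with no real effect present. That is the kind of residual difference an adjustment on the broad category alone leaves behind.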
If we’d only had better age data, we would have been able to squeeze things a little more. And if we knew more about broad categories, we could have done a little better. I don’t think it would have affected the overall conclusions,4 but it would have been nice to come nearer to getting out of the data what was in there waiting for us.

4 Based on mortality and on death from massive liver necrosis, the study found the anesthetic halothane to be as safe as nitrous oxide with pentothal and safer than cyclopropane; the distribution of the use of ether across hospitals made it impossible to get a reliable comparison between halothane and ether. It was also concluded that even after adjustment for several variables some difference in mortality of hospitals remained, and this ultimately led to a later study of institutional differences at Stanford.

Anscombe: Could I just ask, would you explain the phrase “the green book”?

Mosteller: Well, Gardner Lindzey had asked John and me to write a statistical chapter (Mosteller and Tukey, 1968) for the second edition of the Handbook of Social Psychology. John and I worked on that very hard and we wrote much too much. Whereupon Lindzey took a share of it and we were left then with a considerable extra bundle. So we decided to put the bundle together and create a book around it called Data Analysis and Regression.


And it does include some information about robust methods as well as more classical kinds of techniques. It especially has a substantial discussion of regression and the difficulties and hazards associated with multiple regression. Essentially it says a lot about how little you can do with regression as well as how much. That’s an important feature of the book.

Anscombe: Would you care to say something about what I think you call the “Cambridge writing machine” as a more recent work in collaboration?

Tukey: That’s what I call it. I’ll talk about it a little, and then let Fred talk about it. He wouldn’t call it that. But there has to be some name for the collaborative volumes edited by Hoaglin, Mosteller and Tukey, of which two are out and more are in the mill. Somehow this has worked out very well. Fred and Dave have been very effective at persuading people around Cambridge, or who have been around Cambridge—and it’s surprising how many of them were around Princeton before they were around Cambridge—to write chapters for these books. These are the things that are known in the trade as UREDA and EDTTS at the moment: Understanding Robust and Exploratory Data Analysis (Hoaglin, Mosteller and Tukey, 1983) and Exploring Data Tables, Trends, and Shapes (Hoaglin, Mosteller and Tukey, 1985). I’m not sure why this has been as good a solution as it has. The three editors have obviously been complementary and have beaten on the authors about very different things. The beating has been heavy enough so even though the chapters are separately authored, we’ve had compliments on coherence from reviewers, much to my surprise. Fred, what do you want to say about this in addition?

Mosteller: I think it’s been helped enormously by some people behind the scenes, as well as by the collegiality of people who come and visit. First, the people behind the scenes: We’ve had marvelous secretarial support and two very helpful research assistants.
One assistant was Anita Parunak, for part of this period, and the other, for a much longer period, is Mrs. Cleo Youtz. These people both have an important quality that this kind of work needs, and that is an eagle eye and an unwillingness to pass by something that they regard as possibly mistaken. They just keep after it until finally they force the authors to get it right. I wouldn’t say we don’t have any errors in the books, that would be silly, but the number of errors is fewer by, I would think, about two orders of magnitude than it would be if it were not for the help of these people. And they also redo all the examples and comb through the work. That explains one of the positive aspects of the work. Marjorie Olson manages the secretarial side of the productions with skill, imagination and organization. The second feature is that the number of people who are willing to participate in these enterprises is astonishing. I think they get some fun out of it and then when they leave Cambridge and go home to their own university, they usually have one or two more strings added to their bow which they may use with other students later or in their own work. Third, some of our own graduate students participate and this leads to training and publication for them. Fourth, sometimes Harvard faculty join in. We have had a sustained relation with John Emerson of Middlebury College, as well as Boris Iglewicz at Temple. So it’s been a very productive enterprise and profitable, I think, for all participants.

Tukey: Well, maybe we should turn to a closing topic. I think there are a few lessons that I would believe in very thoroughly. I think Fred would too. The first is that real problems deserve realistic attention. Which implies it’s better to have an approximate solution to the right problem than to have an exact solution to the wrong one. Second, that one should intend to learn from real problems, that they can be extremely suggestive over the long pull about both theory and techniques. Third, that the use of techniques is not confined to the instances that are covered by theory. If you had to have theory to cover every application, very few techniques would ever get used. And I think the corollary to this, or better, to the thing to which this is a corollary, is that statistics needs to be broad, not narrow.

Anscombe: Do you care to add any comments on that kind of theme?

Mosteller: No, I think that pretty well covers it.

Anscombe: Well, then I think we are about at the end of the time that is allotted for us. And I thank you gentlemen very much indeed for your marvelous series of thoughts and reminiscences, and historical insights. Thank you.

REFERENCES

Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. (with the collaboration of R.J. Light and F. Mosteller) (1975). Discrete Multivariate Analysis. MIT Press, Cambridge, Mass.

Bunker, J.P., Forrest, W.H., Jr., Mosteller, F. and Vandam, L.D. (eds.) (1969). The National Halothane Study. National Institutes of Health, National Institute of General Medical Sciences. U. S. Government Printing Office, Washington.

Cochran, W.G., Mosteller, F. and Tukey, J.W. (1953). Statistical problems of the Kinsey report. J. Amer. Statist. Assoc. 48 673-716.

Cochran, W.G., Mosteller, F. and Tukey, J.W. (1954a). Principles of sampling. J. Amer. Statist. Assoc. 49 13-35.

Cochran, W.G., Mosteller, F. and Tukey, J.W. (1954b). Statistical Problems of the Kinsey Report. Amer. Statist. Assoc., Washington.

Hastings, C., Jr., Mosteller, F., Tukey, J.W. and Winsor, C.P. (1947). Low moments for small samples: A comparative study of order statistics. Ann. Math. Statist. 18 413-426.

Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds.) (1983). Understanding Robust and Exploratory Data Analysis. Wiley, New York.

Hoaglin, D.C., Mosteller, F. and Tukey, J.W. (eds.) (1985). Exploring Data Tables, Trends, and Shapes. Wiley, New York.

Mosteller, F. (1946). On some useful “inefficient” statistics. Ann. Math. Statist. 17 377-408.

Mosteller, F. (1948). On pooling data. J. Amer. Statist. Assoc. 43 231-242.

Mosteller, F. and Tukey, J.W. (1949). The uses and usefulness of binomial probability paper. J. Amer. Statist. Assoc. 44 174-212.


Mosteller, F. and Tukey, J.W. (1949-1950). Practical applications of new theory, a review. Part I: Location and scale: tables; Part II: Counted data-graphical methods; Part III: Analytical techniques; Part IV: Gathering information. Industrial Quality Control. 6 (2) 5-8; 6 (3) 5-7; 6 (4) 6-7; 6 (5) 5-7.

Mosteller, F. and Tukey, J.W. (1968). Data analysis, including statistics. In revised Handbook of Social Psychology (G. Lindzey and E. Aronson, eds.) 2 Chapter 10. Addison-Wesley, Reading, Mass.

Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression. Addison-Wesley, Reading, Mass.
