Principles of Biostatistics
Third Edition

Marcello Pagano
Kimberlee Gauvreau
Heather Mattie
Third edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2022 Taylor & Francis Group, LLC

Second edition published 2000 by Brooks/Cole and then Cengage Learning

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Names: Pagano, Marcello, 1945- author. | Gauvreau, Kimberlee, 1963- author. | Mattie, Heather, author.
Title: Principles of biostatistics / Marcello Pagano, Kimberlee Gauvreau, Heather Mattie.
Description: Third edition. | Boca Raton : CRC Press, 2022. | Revised edition of: Principles of biostatistics / Marcello Pagano, Kimberlee Gauvreau. 2nd ed. c2000. | Includes bibliographical references and index.
Identifiers: LCCN 2021057073 (print) | LCCN 2021057074 (ebook) | ISBN 9780367355807 (hardback) | ISBN 9781032252445 (paperback) | ISBN 9780429340512 (ebook)
Subjects: LCSH: Biometry.
Classification: LCC QH323.5 .P34 2022 (print) | LCC QH323.5 (ebook) | DDC 570.1/5195--dc23/eng/20211223
LC record available at https://lccn.loc.gov/2021057073
LC ebook record available at https://lccn.loc.gov/2021057074

ISBN: 978-0-367-35580-7 (hbk)
ISBN: 978-1-032-25244-5 (pbk)
ISBN: 978-0-429-34051-2 (ebk)

DOI: 10.1201/9780429340512

Typeset in TeXGyreTermesX by KnowledgeWorks Global Ltd.

Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.

Access the Support Material: www.routledge.com/9780367355807
This book is dedicated with love to Phyllis, Marisa, John-Paul, Camille and Ivy, Neil and Eliza, Ali, Bud, Connie, Nanette, Steve, Katie and Buddy
Contents

Preface  xiii

1 Introduction  1
  1.1 Why Study Biostatistics  1
  1.2 Difficult Numbers  3
  1.3 Overview of the Text  3
    1.3.1 Part I: Chapters 2–4 Variability  4
    1.3.2 Part II: Chapters 5–8 Probability  6
    1.3.3 Part III: Chapters 9–22 Inference  6
    1.3.4 Computing Resources  11
  1.4 Review Exercises  12

I Variability  13

2 Descriptive Statistics  15
  2.1 Types of Numerical Data  17
    2.1.1 Nominal Data  17
    2.1.2 Ordinal Data  17
    2.1.3 Ranked Data  18
    2.1.4 Discrete Data  18
    2.1.5 Continuous Data  19
  2.2 Tables  20
    2.2.1 Frequency Distributions  20
    2.2.2 Relative Frequency  22
  2.3 Graphs  24
    2.3.1 Bar Charts  24
    2.3.2 Histograms  24
    2.3.3 Frequency Polygons  26
    2.3.4 Box Plots  29
    2.3.5 Two-Way Scatter Plots  30
    2.3.6 Line Graphs  31
  2.4 Numerical Summary Measures  34
    2.4.1 Mean  34
    2.4.2 Median  36
    2.4.3 Mode  37
    2.4.4 Range  38
    2.4.5 Interquartile Range  38
    2.4.6 Variance and Standard Deviation  39
  2.5 Empirical Rule  42
  2.6 Further Applications  47
  2.7 Review Exercises  56

3 Rates and Standardization  67
  3.1 Rates  67
  3.2 Adjusted Rates  70
    3.2.1 Direct Standardization  72
    3.2.2 Indirect Standardization  77
  3.3 Further Applications  77
  3.4 Review Exercises  84

4 Life Tables  89
  4.1 Historical Development  90
  4.2 Life Table as a Predictor of Longevity  95
  4.3 Mean Survival  97
  4.4 Median Survival  101
  4.5 Further Applications  101
  4.6 Review Exercises  106

II Probability  109

5 Probability  111
  5.1 Operations on Events and Probability  111
  5.2 Conditional Probability  115
  5.3 Total Probability Rule  117
  5.4 Relative Risk and Odds Ratio  121
  5.5 Further Applications  126
  5.6 Review Exercises  131

6 Screening and Diagnostic Tests  135
  6.1 Sensitivity and Specificity  136
  6.2 Bayes’ Theorem  137
  6.3 Likelihood Ratios  142
  6.4 ROC Curves  145
  6.5 Calculation of Prevalence  147
  6.6 Varying Sensitivity  149
  6.7 Further Applications  151
  6.8 Review Exercises  155

7 Theoretical Probability Distributions  159
  7.1 Probability Distributions  159
  7.2 Binomial Distribution  161
  7.3 Poisson Distribution  168
  7.4 Normal Distribution  170
  7.5 Further Applications  181
  7.6 Review Exercises  186

8 Sampling Distribution of the Mean  191
  8.1 Sampling Distributions  191
  8.2 Central Limit Theorem  192
  8.3 Applications of the Central Limit Theorem  193
  8.4 Further Applications  198
  8.5 Review Exercises  204

III Inference  207

9 Confidence Intervals  209
  9.1 Two-Sided Confidence Intervals  209
  9.2 One-Sided Confidence Intervals  213
  9.3 Student’s t Distribution  215
  9.4 Further Applications  218
  9.5 Review Exercises  222

10 Hypothesis Testing  227
  10.1 General Concepts  227
  10.2 Two-Sided Tests of Hypothesis  230
  10.3 One-Sided Tests of Hypothesis  233
  10.4 Types of Error  234
  10.5 Power  238
  10.6 Sample Size Estimation  241
  10.7 Further Applications  243
  10.8 Review Exercises  249

11 Comparison of Two Means  253
  11.1 Paired Samples  254
  11.2 Independent Samples  258
    11.2.1 Equal Variances  259
    11.2.2 Unequal Variances  263
  11.3 Sample Size Estimation for Two Means  266
  11.4 Further Applications  267
  11.5 Review Exercises  274

12 Analysis of Variance  279
  12.1 One-Way Analysis of Variance  279
    12.1.1 The Problem  279
    12.1.2 Sources of Variation  282
  12.2 Multiple Comparisons Procedures  286
  12.3 Further Applications  288
  12.4 Review Exercises  293

13 Nonparametric Methods  297
  13.1 Sign Test  297
  13.2 Wilcoxon Signed-Rank Test  301
  13.3 Wilcoxon Rank Sum Test  304
  13.4 Kruskal-Wallis Test  307
  13.5 Advantages and Disadvantages of Nonparametric Methods  311
  13.6 Further Applications  311
  13.7 Review Exercises  318

14 Inference on Proportions  323
  14.1 Normal Approximation to the Binomial Distribution  324
  14.2 Sampling Distribution of a Proportion  326
  14.3 Confidence Intervals  327
  14.4 Hypothesis Testing  329
  14.5 Sample Size Estimation for One Proportion  330
  14.6 Comparison of Two Proportions  332
  14.7 Sample Size Estimation for Two Proportions  335
  14.8 Further Applications  336
  14.9 Review Exercises  345

15 Contingency Tables  351
  15.1 Chi-Square Test  351
    15.1.1 2 × 2 Tables  351
    15.1.2 r × c Tables  356
  15.2 McNemar’s Test  358
  15.3 Odds Ratio  360
  15.4 Berkson’s Fallacy  365
  15.5 Further Applications  366
  15.6 Review Exercises  373

16 Correlation  381
  16.1 Two-Way Scatter Plot  381
  16.2 Pearson Correlation Coefficient  382
  16.3 Spearman Rank Correlation Coefficient  387
  16.4 Further Applications  389
  16.5 Review Exercises  395

17 Simple Linear Regression  399
  17.1 Regression Concepts  399
  17.2 The Model  402
    17.2.1 Population Regression Line  402
    17.2.2 Method of Least Squares  404
    17.2.3 Inference for Regression Coefficients  408
    17.2.4 Inference for Predicted Values  410
  17.3 Evaluation of the Model  413
    17.3.1 Coefficient of Determination  413
    17.3.2 Residual Plots  414
    17.3.3 Transformations  416
  17.4 Further Applications  419
  17.5 Review Exercises  425

18 Multiple Linear Regression  431
  18.1 The Model  431
    18.1.1 Least Squares Regression Equation  432
    18.1.2 Inference for Regression Coefficients  434
    18.1.3 Indicator Variables  435
    18.1.4 Interaction Terms  436
  18.2 Model Selection  438
  18.3 Evaluation of the Model  440
  18.4 Further Applications  442
  18.5 Review Exercises  451

19 Logistic Regression  455
  19.1 The Model  455
    19.1.1 Logistic Function  457
    19.1.2 Fitted Equation  458
  19.2 Indicator Variables  460
  19.3 Multiple Logistic Regression  464
  19.4 Simpson’s Paradox  466
  19.5 Interaction Terms  467
  19.6 Model Selection  468
  19.7 Further Applications  469
  19.8 Review Exercises  474

20 Survival Analysis  479
  20.1 Life Table Method  481
  20.2 Product-Limit Method  487
  20.3 Log-Rank Test  491
  20.4 Cox Proportional Hazards Model  495
  20.5 Further Applications  496
  20.6 Review Exercises  505

21 Sampling Theory  509
  21.1 Sampling Designs  511
    21.1.1 Simple Random Sampling  512
    21.1.2 Systematic Sampling  514
    21.1.3 Stratified Sampling  515
    21.1.4 Cluster Sampling  519
    21.1.5 Ratio Estimator  521
    21.1.6 Two-Stage Cluster Sampling  523
    21.1.7 Design Effect  526
    21.1.8 Nonprobability Sampling  527
  21.2 Sources of Bias  528
  21.3 Further Applications  530
  21.4 Review Exercises  535

22 Study Design  537
  22.1 Randomized Studies  538
    22.1.1 Control Groups  539
    22.1.2 Randomization  539
    22.1.3 Blinding  540
    22.1.4 Intention to Treat  541
    22.1.5 Crossover Trial  541
    22.1.6 Equipoise  541
  22.2 Observational Studies  542
    22.2.1 Cross-Sectional Studies  542
    22.2.2 Longitudinal Studies  543
    22.2.3 Case-Control Studies  543
    22.2.4 Cohort Studies  544
    22.2.5 Consequences of Design Flaws  544
  22.3 Big Data  544
  22.4 Review Exercises  546

Bibliography  547

Glossary  569

Statistical Tables  583

Index  601
Preface
This book was written for students of the health sciences and serves as an introduction to the study of biostatistics – the use of numbers and numerical techniques to extract information from data and facts, and to then use this information to communicate scientific results. However, just as one can lie with words, one can also lie with numbers. Indeed, numbers and lies have been linked for quite some time; there is even a book titled How to Lie with Statistics. This association may owe its origin – or its affirmation at the very least – to the British Prime Minister Benjamin Disraeli. Disraeli is credited by Mark Twain as having said, “There are three kinds of lies: lies, damned lies, and statistics.” One has only to observe any modern political campaign to be convinced of the abuse of statistics. But enough about lies; this book adopts the position of Professor Frederick Mosteller, who said, “It is easy to lie with statistics, but it is easier to lie without them.”
Background

Principles of Biostatistics is aimed at students in the biological and health sciences who wish to learn traditional research methods. The first edition was based on a required course for graduate students at the Harvard T.H. Chan School of Public Health, which is also attended by a large number of health professionals from the Harvard medical area. The course is as old as the school itself, which attests to its importance. It spans 16 weeks of lectures and laboratory sessions; the lab sessions reinforce the material covered in lectures and introduce the computer into the course. We have included a selection of lab materials – either additional examples, or a different perspective on the material covered in a chapter – in the sections called Further Applications. These sections are designed to provoke discussion, although they are sufficiently complete for an individual who is not using the book as a course text to benefit from reading them.

The book includes a range of biostatistical topics, the majority of which can be covered at some depth in one semester in an American university. However, there is enough material to allow the instructor some flexibility. For example, some instructors may choose to omit the sections covering the calculation of prevalence (Section 6.5) or the Poisson distribution (Section 7.3), or the chapter on analysis of variance (Chapter 12), if they consider these concepts to be less important than others.
Structure

Some say that statistics is the study of variability and uncertainty. We believe there is truth to this adage, and have used it as a guide to divide the book into three parts covering the basic principles of vip: (1) variability, (2) inference, and (3) probability. For pedagogical purposes, inference and probability are covered in reverse order in the text. Chapters 2 through 4 deal with the variability inherent in collections of numbers, and the ways in which to summarize, explore, and explain them. Chapters 5 through 8 focus on probability, and serve as an introduction to the tools needed for the subsequent investigation of uncertainty. In Chapter 8 we distinguish between populations and samples and begin to examine the variability introduced by sampling from a population, thus progressing to inference in the book’s remaining chapters. We think that this modular introduction to the quantification of uncertainty is justified by the success achieved by our students. Postponing the slightly more difficult concepts until a solid foundation has been established makes it easier for the reader to comprehend and retain them.
Datasets and Examples

Throughout the text we have used data drawn from published studies to illustrate biostatistical concepts. Not only is real data more meaningful, it is usually more interesting as well. Of course, we do not wish to use examples in which the subject matter is too esoteric or too complex. To this end, we have been guided by the backgrounds and interests of our students – primarily topics in public health and clinical research – to choose examples that best demonstrate the concepts at hand.

There is some risk involved in using published data. We cannot guarantee that all of the examples are honest and that the data were properly collected; for this we must rely on the reputations of our sources. We do not belittle the importance of this consideration. The value of our inference depends critically on the worth of the data, and we strongly recommend that a good deal of effort be expended on evaluating its quality. We assume that this is understood by the reader.

In some cases we have used examples in which the population of the United States is broken down along racial lines. In reporting these official statistics we follow the lead of the government agencies that release them. We do not wish to ratify this racial categorization, since the observed differences may well be due to socioeconomic factors rather than the implied racial ones. One option would be to ignore these statistics; however, this would hide inequities which exist in our health system – inequities that need to be eliminated. We focus attention on the problem in the hope of stimulating interest in promoting solutions.

We have minimized the use of mathematical notation because of its well-deserved reputation of being the ultimate jargon. If used excessively, it can intimidate even the most ardent scholar. We do not wish to eliminate it entirely, however; it has been developed over the ages to be helpful in communicating results. In this third edition, mathematical notation and important formulas used in the text have also been included in summary boxes at the ends of relevant sections.
Computing

There is something about numbers – maybe a little magic – that makes them fun to study. The fun is in the conceptualization more than the calculations, however, and we are fortunate that we have the computer to do the drudge work. This allows students to concentrate on the concepts. In other words, the computer allows the instructor to teach the poetry of statistics and not the plumbing.

To take advantage of the computer, one needs a good statistical package. We use Stata, a product of the Stata Corporation in College Station, Texas, and also R, a software environment available for free download. Stata is user-friendly, accurate, powerful, reasonably priced, and works on a number of different platforms, including Windows, Unix, and Macintosh. R is available on an open-source license, and also works on a number of platforms. It is a versatile and efficient programming language. Other statistical packages are available, and this book can be supplemented by any one of them. We strongly recommend that some statistical package be used for calculations.

Some of the review exercises in the text require the use of a computer. The required datasets are available on the book’s companion website at https://github.com/Principles-of-Biostatistics/3rd-Edition. There are also many exercises that do not require the computer. As always, active learning yields better results than passive observation. To this end, we cannot stress enough the importance of the review exercises, and urge the reader to attempt as many as time permits.
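As a small illustration of what a first session with the software might look like – the file name dataset.csv and the variable sbp below are hypothetical stand-ins for the files on the companion website – a few lines of R suffice to read a dataset and summarize one variable:

    # Read a comma-separated dataset and summarize one variable
    df <- read.csv("dataset.csv")   # hypothetical file name
    mean(df$sbp)                    # sample mean
    sd(df$sbp)                      # sample standard deviation
    summary(df$sbp)                 # minimum, quartiles, mean, maximum

An equivalent command exists in Stata (summarize sbp), and in any of the other packages mentioned above.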
New to the Third Edition

The third edition continues in the spirit of the first edition, but has been updated to reflect some of the advances of the last 30 years. It includes revised and expanded discussions on many topics throughout the book. Major revisions include:

• The chapters on Data Presentation and Numerical Summary Measures from the second edition have been streamlined and combined into a single chapter titled Descriptive Statistics.
• The chapter on Life Tables has been rewritten, and detailed calculations for the life table have been moved into the Further Applications section.
• The material on Screening and Diagnostic Tests – formerly contained within the Probability chapter – has been given its own chapter. This new chapter includes sections on likelihood ratios and the concept of varying sensitivities.
• New sections on sample size calculations for two-sample tests on means and proportions, the Kruskal-Wallis test, and the Cox proportional hazards model have been added to existing chapters.
• Concepts previously covered in a chapter titled Multiple 2 × 2 Tables have now been moved into the Logistic Regression chapter.
• The chapter on Sampling Theory has been greatly expanded.
• A new chapter introducing the basic principles of Study Design has been added at the end of the text.
• Datasets used in the text and those needed for selected review exercises are now available on the book’s companion website at https://github.com/Principles-of-Biostatistics/3rd-Edition.
• The companion website also contains the Stata and R code used to produce the computer output displayed in the text’s Further Applications sections, as well as introductory material describing the use of both statistical packages.
• A glossary of definitions for important statistical terms has been added at the back of the book.
• As previously mentioned, mathematical notation and formulas used in the text have been included in summary boxes at the end of each section for ease of reference.
• Additional review exercises have been included in each chapter.

In addition to these changes in content, previously used data have been updated whenever possible to reflect more current public health information.

As its name suggests, Principles of Biostatistics covers topics which are fundamental to an introduction to biostatistics. Of course we have had to limit the material presented, and some important topics have not been included. Decisions about what to exclude were difficult, especially as the field of biostatistics and data science continues to evolve. No small role in this evolution is played by the computer; the capacity of statistical software seems to increase limitlessly, providing new and exciting inferential tools. However, to truly appreciate these tools and to be able to utilize them properly requires a strong foundation in traditional statistical principles. Those laid out in this text are still essential and will be useful to the reader both today and in the future.
Acknowledgments

A debt of gratitude is owed to a number of people: former Harvard University President Derek Bok for providing the support which got the first edition of this book off the ground, Dr. Michael K. Martin for performing the calculations for the Statistical Tables section, John-Paul Pagano for assisting in the editing of the first edition, and the individuals who reviewed the manuscript. We thank the teaching assistants who have helped us teach our courses over the years and who have made many valuable suggestions. Probably the most deserving of thanks are our students, who have tolerated us as we learned how to best teach the material. We are still learning.

Marcello Pagano
Kimberlee Gauvreau
Heather Mattie
Boston, Massachusetts
1 Introduction
CONTENTS
1.1 Why Study Biostatistics  1
1.2 Difficult Numbers  3
1.3 Overview of the Text  3
    1.3.1 Part I: Chapters 2–4 Variability  4
    1.3.2 Part II: Chapters 5–8 Probability  6
    1.3.3 Part III: Chapters 9–22 Inference  6
    1.3.4 Computing Resources  11
1.4 Review Exercises  12
In 1903, H.G. Wells hypothesized that statistical thinking would one day be as necessary for good citizenship as the ability to read and write. Wells was correct, and today statistics play an important role in many decision-making processes. For example, before any new drug or medical device can be marketed legally in the United States, the United States Food and Drug Administration (fda) requires that it be subjected to a clinical trial, an experimental study involving human subjects. The data from this study is compiled and analyzed to determine not only whether the drug is effective, but also if it is safe. How is this determined? As another example, the United States government’s decisions regarding Social Security and public health programs rely in part on the longevity of the nation’s population; the government must therefore be able to accurately predict the number of years each individual will live. How does it do this? If the government incorrectly forecasts human life expectancy, it could render itself insolvent and endanger the well-being of its citizens. There are many other issues that must be addressed as well. Where should a government invest its resources if it wishes to reduce infant mortality? Should a mastectomy always be recommended to a patient with breast cancer? Should a child play football? What factors increase the risk that an individual will develop coronary heart disease? Will we be able to afford our health care system in the future? Does global warming impact the sea level? Our health? What effect would a particular change in policy have on longevity? To answer these questions and others, we rely on the methods of biostatistics.
1.1 Why Study Biostatistics

The study of statistics explores the collection, organization, analysis, and interpretation of numerical data. The concepts of statistics may be applied to a number of fields, including business, psychology, and agriculture. When focus is on the biological and health sciences, we use the term biostatistics.

Historically, statistics have been used to tell a story with numbers. Numbers often communicate ideas more succinctly than do words. For example, the World Health Organization (who) defines maternal mortality as “the death of a woman while pregnant or within 42 days of termination of pregnancy, irrespective of the duration and site of the pregnancy, from any cause related to or aggravated by the pregnancy or its management but not from accidental or incidental causes” [1]. Therefore, when presented with the graph in Figure 1.1 [2, 3], someone concerned with maternal mortality might react with alarm at the reported striking behavior of the United States and research the issue further.

FIGURE 1.1 Maternal mortality per 100,000 live births, 1990–2015

How useful is the study of biostatistics? Biostatistics are certainly ubiquitous in the health sciences. The Centers for Disease Control and Prevention (cdc) reports that “During the 20th century, the health and life expectancy of persons residing in the United States improved dramatically. Since 1900, the average lifespan of persons in the United States has lengthened by greater than 30 years; 25 years of this gain are attributable to advances in public health” [4–6]. They go on to list what they consider to be ten great achievements:
– Vaccination
– Safer workplaces
– Healthier mothers and babies
– Fluoridation of drinking water
– Decline in deaths from coronary heart disease and stroke
– Motor vehicle safety
– Control of infectious diseases
– Safer and healthier foods
– Family planning
– Recognition of tobacco use as a health hazard
When one reads the recounting of these achievements in subsequent Morbidity and Mortality Weekly Reports, it is evident that biostatistics played an important role in every one of them. Notwithstanding these societal successes, work still needs to be done. The future with its exabytes of data – known as big data – providing amounts of information which are orders of magnitude larger than was previously available is a new challenge. But if we are to progress responsibly, we cannot ignore the lessons of the past [7].

A case in point is our failure to control the number of deaths from guns that has led to a public health crisis in the United States. The statistic blared from a headline in The New York Times in 2018 [8]: “nearly 40,000 people died from guns in u.s. last year, highest in 50 years.” This crisis looks even worse when one considers what is happening with mass shootings in schools. The United States is experiencing a remarkable upward trend in the number of casualties involved. There have been more school shooting deaths in the first 18 years of the 21st century (66) than in the last 60 years of the 20th century (55). The same is true for injuries due to guns, with 260 and 81 in each of these two time periods, respectively [9]. A summary of this situation is made more pithy by the statistics.
1.2 Difficult Numbers

The numbers needed to tell a story are not always easy to come by – examples include attempts to investigate the volume of illicit human trafficking [10], or to measure the prevalence of female genital mutilation [11] – but are indispensable for communicating important ideas. The powerful use of statistics in this argument against continued restrictions on the drug mifepristone’s distribution is clear [12]:

    Since its approval in 2000, more than 3.7 million women have used mifepristone to end an early pregnancy in the United States — it is approved for use up to 70 days into a pregnancy. Nearly two decades of data on its use and effects on patients provide significant new insights into its safety and efficacy. Mifepristone is more than 97% effective. Most adverse effects are mild, such as cramping or abdominal pain, and the rate of severe adverse events is very low: such events occur in less than 0.5% of patients, according to the fda. Many drugs marketed in the United States have higher adverse event rates and are not subject to restricted distribution.

In this example, the numbers provide a concise summary of the situation being studied. They, of course, must be both accurate and precise if we are to trust any conclusions based on them. The examples described deal with complex situations, yet the numbers convey essential information.

A word of caution: we must remain realistic in our expectations of what statistics can achieve. No matter how powerful it is, no statistic will convince everyone that a given conclusion is true. The data on gun deaths in the United States mentioned above are often brushed away with some variant of the aphorism, “Guns don’t kill people, people do.” This should not come as a surprise. After all, there are still deniers of global warming, people who believe that the vaccine for measles, mumps, and rubella causes autism, and members in the Flat Earth Society, whose website states: “This website is dedicated to unravelling the true mysteries of the universe and demonstrating that the earth is flat and that Round Earth doctrine is little more than an elaborate hoax” [13].
1.3 Overview of the Text

The aim of a study using biostatistics is to analyze and present data in a transparent, interpretable, and coherent manner to effectively communicate results and to help lead policy makers to the best informed decisions. This textbook, as its title states, covers the principles of biostatistics. The 21 chapters beyond this one can be arranged into three parts to cover the tenets of biostatistics: (1) variability, (2) inference, and (3) probability. We list them in this order so students can easily remember the acronym vip. For pedagogical reasons, however, we present them in a different order: (1) Chapters 2–4 discuss variability, (2) Chapters 5–8 cover probability, and (3) Chapters 9–22 cover inference.
1.3.1 Part I: Chapters 2–4 Variability

If we wish to study the effects of a new diet, we might place a group of individuals on that diet and measure changes in their body mass over time. Similarly, if we want to investigate the success of an innovative therapy for treating pancreatic cancer, we might record the lengths of time that patients treated with this therapy survive beyond their initial diagnosis. These numbers, however, can display a great deal of variability from one person to another. They are generally not very informative until we begin combining them in some way.

Descriptive statistics, the topic of Chapter 2, are methods for organizing and summarizing a set of measurements. They help us to better understand the attributes of a group or population. For instance, to support the premise that there was racial inequity in who was afflicted by the coronavirus, reporters from The New York Times collected data and displayed it not only in a table, but also as a graph similar to Figure 1.2 [14]. To dig deeper into their analysis and show the impact by age group, they also included Figure 1.3 [14]. This example demonstrates the power of a picture to tell a story.

FIGURE 1.2 Racial breakdown of COVID-19 cases in the United States through May 28, 2020

FIGURE 1.3 Racial breakdown of COVID-19 cases in the United States in 2020, by age

The graphical capabilities of computers make this type of summarization feasible even for the most modest analyses, and use of both tables and graphs to summarize information enables scientists and policy makers to formulate hypotheses that then require further investigation. By definition, a summary captures only a particular aspect of the data being studied; consequently, it is important to have an idea of how well the summary represents the set of measurements as a whole. For example, we might wish to know how long hiv/aids patients survive after diagnosis with one of the opportunistic infections that characterize the disease. If we calculate an average survival time, is this average representative of all patients? Furthermore, how useful is it for planning future health service needs? In addition to tables and graphs, Chapter 2 examines numerical summary measures that help answer questions such as these. The chapter includes an introduction to the mean and standard deviation; the former tells us where the measurements are centered, and the latter how dispersed they are. The chapter ends with the splendid empirical rule, which quantifies the metaphor “the apple does not fall far from the tree.”

Measurements that take on only two distinct values require special attention. In the health sciences, one of the most common examples of this type of data is the categorization of being alive or dead. If we denote survival by 0 and death by 1, we are able to classify each member of a group of individuals using these numbers and then average the results. In this way, we can summarize the mortality associated with the group. Chapter 3 deals exclusively with measurements that assume only two values. The notion of dividing a group into smaller subgroups or classes based on a characteristic such as age or sex is also introduced. Grouping individuals into smaller, more homogeneous subgroups decreases variability, thus allowing better prognosis. For example, it might make sense to determine the mortality of females separately from that of males, or the mortality of 20- to 29-year-olds separately from 80- to 89-year-olds. Chapter 3 also investigates techniques that allow us to make valid comparisons among populations whose compositions may differ substantially.

Chapter 4 introduces the classical life table, one of the most important numerical summary techniques available in the health sciences. Life tables are used by public health professionals to characterize the well-being of a population, and by insurance companies to predict how long individuals will live. In this chapter, the study of mortality begun in Chapter 3 is extended to incorporate the actual time to death for each individual, resulting in a more refined analysis.

Together, Chapters 2 through 4 demonstrate that the extraction of information from a collection of measurements is not precluded by the variability among those measurements. Despite their variability, the data often exhibit a certain regularity as well. For example, here are the birth rates in the United States among women 15–19 years of age over the 5-year time span shown [15]:

Year:                  2011   2012   2013   2014   2015
Birth rate per 1000:   31.3   29.4   26.5   24.2   22.3
Are the numbers showing a natural variability around a constant rate over time – think of how many mistakes can go into the reporting of such numbers – or is this indicative of a real downward trend? This question deserves better than a simple choice between these two options. To answer it properly, we need to apply the principles of probability and inference, the subjects covered in the next two sections of the text.
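To make the 0/1 coding described above concrete, here is a minimal sketch in R; the ten observations and their subgroup labels are invented purely for illustration:

    # Survival status for ten hypothetical individuals: 0 = alive, 1 = dead
    status <- c(0, 0, 1, 0, 1, 0, 0, 0, 1, 0)
    sex    <- c("F", "F", "M", "F", "M", "M", "F", "M", "F", "M")

    mean(status)               # overall mortality: 0.3, or 30%
    tapply(status, sex, mean)  # mortality computed separately for each sex

Averaging the 0s and 1s yields the proportion who died, and computing that average within subgroups is exactly the stratification idea previewed for Chapter 3.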
1.3.2 Part II: Chapters 5–8 Probability
Probability theory resides within what is known as an axiomatic system; we start with some basic truths (axioms), and then build up a logical system around them. In its purest form, this theoretical system has no practical value. Its practical importance comes from knowing how to use the theory to yield useful approximations. An analogy can be drawn with geometry, a subject that most students are exposed to relatively early in their schooling. Although it is impossible for an ideal straight line to exist other than in our imaginations, that has not stopped us from constructing some wonderful buildings based on geometric calculations, including some that have lasted thousands of years. The same is true of probability theory. Although it is not practical in its pure form, its basic principles – which we investigate in Chapter 5 – can be applied to provide a means of quantifying uncertainty.

An important application of probability theory arises in medical screening and diagnostic testing, as we see in Chapter 6. Uncertainty is present because, despite some manufacturers’ claims, no biological test is perfect. This leads to complicated findings, which are sometimes unintuitive, even in the simple situation where the test is diagnosing the presence or absence of a medical condition. Before performing the test, we consider each of four possible classifications: the test result is correct or not, and the person being tested has the condition or not. The relationship between the results of the test and the truth gives rise to important practical questions. For instance, can we conclude that every blood sample that tests positive for hiv actually harbors the virus? All the units in the Red Cross blood supply have tested negative for hiv; does this mean that there are no contaminated samples? If there are contaminated samples, how many might there be? To address questions such as these, we study the average or long-term behavior of diagnostic tests by using probability theory.

Chapters 7 and 8 extend probability theory and introduce some common probability distributions used to describe the variability in a set of measurements. These mathematical models are useful as a basis for the inferential methods covered in the remainder of the text.
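Returning to the diagnostic testing example above, the kind of long-run calculation we have in mind can be sketched in a few lines of R. The sensitivity, specificity, and prevalence values below are invented for illustration; Chapter 6 develops the underlying result (Bayes’ theorem):

    # Hypothetical characteristics of a screening test
    sensitivity <- 0.99    # P(test positive | condition present)
    specificity <- 0.995   # P(test negative | condition absent)
    prevalence  <- 0.001   # P(condition present) in the screened population

    # Probability that a positive result reflects a true case (Bayes' theorem)
    ppv <- (sensitivity * prevalence) /
      (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    ppv   # approximately 0.165

Even with this seemingly excellent test, fewer than one in five positive results comes from a person who truly has the condition when the condition is rare – one of the unintuitive findings alluded to above.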
1.3.3 Part III: Chapters 9–22 Inference
The Cambridge Dictionary defines inference as a guess that is made or an opinion that is formed based on the information available. The paradigm we use in this text is that the inference we make about a population is based on a sample of observations selected from that much larger population. On the basis of the sample, we draw conclusions about the entire population, including the part of the population we did not measure – those not in the sample. Humans are much more similar to each other than dissimilar, and we capitalize on this fact to add credibility to our inference. However, knowing how the sample is chosen and whom the sample represents are also of critical importance for making inference.

An analogy can be made with the way in which traveling salesmen in the late 19th and early 20th centuries in the United States were able to sell their goods to potential customers. Rather than carry all the goods to be sold – including big items such as stoves – they would transport miniature models of the products they were selling; see Figure 1.4 for an example. These replicas were very carefully crafted, so as to convey an honest likeness, albeit a much smaller version of the sale item [16]. Although these are also called samples, this is where the analogy ceases to be useful; to make realistic models, the manufacturers had the real item as a guide. When we sample in biostatistics, it is because we do not know what the measurements look like for the entire target population.

Suppose we want to know whether a new drug is effective in lowering high blood pressure. Since the population of all people in the world who have high blood pressure is very large, it is implausible to think we would have either the time or the resources necessary to locate and examine each and every person with this condition who might be a candidate to use the drug. Out of necessity, we must rely on a sample of people drawn from the population. The limits to our subsequent inference – which are always there – are determined by both the population that we sample, and by how well the sample represents that population.
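When a list of the population is available, a computer can draw such a sample directly. The sketch below, in R, selects a simple random sample of 100 individuals from a hypothetical population of 10,000 identifiers; the idea of a simple random sample is taken up in the paragraphs that follow and in Chapter 21:

    # Draw a simple random sample of size 100 from a population of 10,000;
    # every individual has the same chance of being selected
    population <- 1:10000
    set.seed(42)   # fixed seed so the example is reproducible
    chosen <- sample(population, size = 100)
    head(chosen)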
FIGURE 1.4 Boxed salesman’s sample set of glass bottles, containing samples from the Crocker company (Buffalo, New York) (photo courtesy of Judy Weaver Gonyeau) [16]

The ability to generalize results from a sample to a population is the bedrock of empirical research, and a central issue in this book. One requirement for credible inference is that it be based on a representative sample. In any particular study, do we truly have a representative sample? If we answer yes, this leads to a logical conundrum. To truly judge that we have a representative sample we need to know the entire population. And if we know the entire population, why then focus only on a sample? If we do not have the ability to study an entire population, the best solution available is to utilize a simple random sample of the population. This means, amongst other things, that everyone in the population has an equal chance of being selected into the sample. It assures us that, on average, we have a representative sample. A pivotal side benefit of a simple random sample is that it also provides an estimate of the possible inaccuracy of our inference.

It can often be difficult to obtain a simple random sample. The consequences of mistakenly thinking that a sample is representative when in fact it is not lead to invalid inferences. A case in point is provided by the behavioral sciences, where empirical results are often derived from individuals sampled from western, educated, industrialized, rich, and democratic (weird) societies. An example of this is the undergraduate students who make a few extra dollars by volunteering to be a subject for an on-campus study. Since most of these studies are done in the United States, we can see the problem. Clearly the results will reflect the pool from which the subjects came. Use of the label weird implies a certain contempt for a large number of published findings attacked in an article by Henrich and colleagues [17]. They investigate results in the domains of visual perception, fairness, cooperation, spatial reasoning, categorization and inferential induction, moral reasoning, reasoning styles, self-concepts and related motivations, and the heritability of iq. They conclude that “members of weird societies, including young children, are among the least representative populations one could find for generalizing about humans.” Yet the researchers who published the original results presumably believed that their samples were random and representative.
We have repeated this mistake in the bio-medical sciences, where the consequences can be even more severe. For example, we do not perform as many clinical trials on children as on adults [18]. Trials of adults, even randomized clinical trials, are not representative of children. Children are not small adults who simply require a modification in dosage. Some conditions – such as prematurity and many of its sequelae – occur only in infants and children [19]. Certain genetic conditions such as phenylketonuria (pku) will, if untreated, lead to severe disability or even death in childhood. The diagnosis, prevention, and treatment of these conditions cannot be adequately investigated without studying children. Other conditions such as influenza and certain cancers and forms of arthritis also occur in both adults and children, but their pathophysiology, severity, course, and response to treatment may be quite different for infants, children, and adolescents. Treatments that are safe and effective for adults may be dangerous or ineffective for children.

There are many more examples where certain groups have been largely ignored by researchers. The lack of trials in women [20] and people of color led Congress, in 1993, to pass the National Institutes of Health Revitalization Act, which requires the agency to include more people from these groups in their research studies. Unfortunately, success in the implementation of this law has been slow [21]. The headline in Scientific American on September 1, 2018 – 25 years after the Act was passed – was clinical trials have far too little racial and ethnic diversity; it’s unethical and risky to ignore racial and ethnic minorities [22].

This problem extends beyond clinical trials. The 21st century has seen the mapping of the human genome. Genome wide association studies (gwas) have identified thousands of genetic variants associated with human traits and diseases. This exciting source of information is unfortunately restricted, so inference is constrained or biased. A 2009 study showed that 96% of participants in gwas studies were of European descent [23]. Seven years later this had decreased to 80%, largely due to studies carried out in China and Japan; the representation of Asian populations has increased, but the representation of other groups has not. Since gwas studies are the basis for precision medicine, this has raised the fear that precision medicine will exacerbate racial health disparities [24]. This, of course, is a general trait of artificial intelligence systems: they reflect the information that goes into them.

As an example of the value of inference, we can consider a group of investigators who were interested in evaluating whether, at the time of their study, there was a difference in how analgesics were administered to male versus female patients with acute abdominal pain. It would be impossible to investigate this issue by observing every person in the world with acute abdominal pain, so they designed a study of a smaller group of individuals with this ailment so they could, on the basis of the sample, infer what was happening in the population as a whole. How far their inference should reach is not our focus right now, but it is important to take notice of what they say. Here is a copy of the abstract from the published article [25]:

objectives: Oligoanalgesia for acute abdominal pain historically has been attributed to the provider’s fear of masking serious underlying pathology.
The authors assessed whether a gender disparity exists in the administration of analgesia for acute abdominal pain.

methods: This was a prospective cohort study of consecutive nonpregnant adults with acute nontraumatic abdominal pain of less than 72 hours duration who presented to an urban emergency department (ed) from April 5, 2004, to January 4, 2005. The main outcome measures were analgesia administration and time to analgesic treatment. Standard comparative statistics were used.

results: Of the 981 patients enrolled (mean age ± standard deviation [sd] = 41 ± 17 years; 65% female), 62% received any analgesic treatment. Men and women had similar mean pain scores, but women were less likely to receive any analgesia (60% vs. 67%, difference 7%, 95% confidence interval (ci) = 1.1% to 13.6%) and less likely to receive opiates (45% vs. 56%, difference 11%, 95% ci = 4.1% to 17.1%). These differences persisted when gender-specific diagnoses were excluded (47% vs. 56%, difference 9%, 95% ci = 2.5% to 16.2%). After controlling for age, race, triage class, and pain score, women were still 13% to 25%
less likely than men to receive opioid analgesia. There was no gender difference in the receipt of nonopioid analgesia. Women waited longer to receive their analgesia (median time 65 minutes vs. 49 minutes, difference 16 minutes, 95% ci = 3.5 to 33 minutes).

conclusions: Gender bias is a possible explanation for oligoanalgesia in women who present to the ed with acute abdominal pain. Standardized protocols for analgesic administration may ameliorate this discrepancy.

This is a fairly typical abstract in the health sciences literature – it reports on a clinical study and uses statistics to describe the findings – so we look at it more closely. First consider the objectives of the study. We are told that the goal is to discover whether there is a gender disparity in the administration of drugs. This is not whether there was a difference in administering the drugs between genders in this particular study – that question is easy to answer – but rather a more ambitious finding; namely, is there something in this study that allows us to generalize the findings to a broader population?

The abstract goes on to describe the methods utilized in the study, and then its results. We first learn that the researchers studied a group of 981 patients. To allow the reader to get an understanding of who these 981 patients are, they provide some descriptive statistics about the patients' ages and genders. This is done to lay the groundwork for generalizing the results of the study to individuals not included in the study sample.

The investigators then start generalizing their results. We are told that even though men and women suffered similar amounts of pain, women were less likely – 7% less likely – to receive any analgesia. This difference of 7% is clearly study specific. Had they chosen fewer than 981 patients or more, or even a different group of 981 patients, they likely would have observed a difference other than 7%. How to quantify this potential variability from sample to sample – even though we have observed only a single sample – and how to accommodate it when making inference, is answered by the most useful and effective result in the book. It is an application of the theory covered in Chapter 8, and is known as the central limit theorem.

An application of the central limit theorem allows the study investigators to construct a 95% confidence interval for the difference in proportions, 1.1% to 13.6%. One way to interpret this interval is to appeal to a thought experiment and repetition: If we were to sample repeatedly from the underlying population, each sample might result in a difference other than 7%, and a confidence interval other than 1.1% to 13.6%. However, 95% of these intervals from repeated sampling will include the true population difference between the genders, whatever its value. The interpretations for all the other confidence intervals in the abstract are similar.
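The repeated sampling interpretation can be made concrete with a small simulation. The R sketch below is illustrative only, not the authors' analysis: the "true" proportions and the group sizes (roughly 65% of 981 patients being female) are assumptions chosen to mimic the study.

# Simulate the repeated-sampling interpretation of a 95% confidence
# interval for a difference in proportions. The "true" proportions and
# sample sizes below are hypothetical, chosen to mimic the study.
set.seed(1)
p_men   <- 0.67      # assumed true proportion of men receiving analgesia
p_women <- 0.60      # assumed true proportion of women receiving analgesia
n_men   <- 343       # roughly 35% of 981 patients
n_women <- 638       # roughly 65% of 981 patients

covered <- replicate(10000, {
  phat_m <- rbinom(1, n_men, p_men) / n_men
  phat_w <- rbinom(1, n_women, p_women) / n_women
  diff <- phat_m - phat_w
  se <- sqrt(phat_m * (1 - phat_m) / n_men +
             phat_w * (1 - phat_w) / n_women)
  # Does this sample's 95% confidence interval contain the true difference?
  (diff - 1.96 * se <= p_men - p_women) & (p_men - p_women <= diff + 1.96 * se)
})
mean(covered)   # close to 0.95

Each repetition draws a new sample and builds a new interval; the proportion of intervals covering the true difference is close to 95%, which is exactly what the confidence level promises.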
More general applications of confidence intervals are introduced in Chapter 9, and examples appear throughout the text.

For a study to be of general interest and usefulness, we must be able to extrapolate its findings to a larger population. By generalizing in this manner, however, we inevitably introduce uncertainty. There are various ways to measure and convey this uncertainty, and we cover two such inferential methods in this book. One is to use confidence intervals, as we just saw in the abstract, and the other is to use hypothesis testing. The latter is introduced in Chapter 10. The two methods are consistent with each other, and will lead to the same action following a study. There are some questions, however, that are best answered in the hypothesis testing framework.

As an example, consider the way we monitor the water supply for lead contamination [26]. In 1974, the United States Congress passed the Safe Drinking Water Act, and its enforcement is a responsibility of the Environmental Protection Agency (epa). The epa determines the level of contaminants in drinking water at which no adverse health effects are likely to occur, with an adequate margin of safety. For lead this level is zero, an untenable standard to enforce. As a result, the epa established a treatment technique, an enforceable procedure which water systems must follow to ensure control of a contaminant. The treatment technique regulation for lead – referred to as the Lead and Copper Rule [27] – requires water systems to control the corrosivity of water. The regulation stipulates that to determine whether a system is safe, health regulators must sample taps in the system that are
more likely to have plumbing materials containing lead. The number of taps sampled depends on the size of the system served. To accommodate aberrant local conditions, the system is considered safe if no more than 10% of the sampled taps have more than 15 parts per billion (ppb) of lead. If not, additional actions by the water authority are required.

We can phrase this monitoring procedure in a hypothesis testing framework: We wish to test the hypothesis that the water has 15 ppb or fewer of lead. The action we take depends on whether we reject this hypothesis, or not. According to the Lead and Copper Rule, the decision depends on the measured tap water samples. If more than 10% of the water samples have more than 15 ppb, we reject the hypothesis and take corrective action. Just as with diagnostic testing in Chapter 6, we have the potential to make the wrong decision when conducting a hypothesis test. The chance of such an error is influenced by the way in which the samples are chosen, how many samples we take, and the 10% cutoff rule.

In 2015, the city of Flint, Michigan, took water samples in order to check the level of lead in the water [28]. According to the Lead and Copper Rule, they were supposed to take 100 samples from houses most likely to have a lead problem. They did not. First, they took only 71 samples; second, they chose the 71 in a seemingly haphazard fashion rather than targeting high-risk houses. Setting aside these contraventions, they found that 8 of the 71 samples had more than 15 ppb. This is more than 10% of the samples, and thus they were required to alert the public and take corrective action. Instead, the State of Michigan forced Flint to drop two of the water samples, both with more than 15 ppb of lead. This meant that there were only 69 samples, of which 6 had more than 15 ppb of lead. Thus fewer than 10% crossed the threshold, and the authorities felt free to tell the residents of Flint that their water was fine. This is yet another example of ignoring the message produced by the scientific method and having catastrophe follow [29]. It seems the lead problem is repeating itself, only this time in Newark, New Jersey [30].
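The decision rule itself is simple arithmetic on the sampled taps. A minimal sketch in R, using the Flint counts quoted above (the function name is ours, introduced only for illustration):

# Decision rule from the Lead and Copper Rule: corrective action is
# required when more than 10% of sampled taps exceed 15 ppb of lead.
action_required <- function(n_samples, n_over_15ppb) {
  n_over_15ppb / n_samples > 0.10
}

action_required(71, 8)   # TRUE:  8/71 = 11.3%, action required
action_required(69, 6)   # FALSE: 6/69 = 8.7%, after dropping two samples

Dropping the two high samples moved the observed proportion from one side of the 10% cutoff to the other, which is precisely why the choice and number of samples matter so much.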
In Chapter 10 we apply hypothesis testing techniques to statements about the mean of a single population, and in Chapter 11 extend these techniques to the comparison of two population means. They are further generalized to the comparison of three or more means in Chapter 12. Chapter 13 continues the development of hypothesis testing concepts, but introduces techniques that allow the relaxation of some of the assumptions necessary to carry out the tests. Chapters 14 and 15 develop inferential methods that can be applied to enumerated data or counts – such as the numbers of cases of sudden infant death syndrome among children put to sleep in various positions – rather than continuous measurements.

Inference can also be used to explore the relationships among a number of different attributes, with the underlying motivation being to reduce variability. If a full-term infant whose gestational age is 39 weeks is born weighing 4 kilograms, or 8.8 pounds, no one would be surprised. If the infant's gestational age is only 22 weeks, however, then their weight would be cause for alarm. Why? We know that birth weight tends to increase with gestational age, and, although it is extremely rare to find a baby weighing 4 kilograms at 22 weeks, it is not uncommon at 39 weeks. There is sufficient variability in birth weights overall that we are not surprised to hear an infant weighs 4 kilograms at birth; but among infants of a particular gestational age there is much less variability, and once the gestational age is known, 4 kilograms may seem out of place. In other words, our measurements have a more precise interpretation the more information we have about the measurement.

The study of the extent to which two factors are related is known as correlation analysis; this is the topic of Chapter 16. If we wish to predict the outcome of one factor based on the value of another, then regression is the appropriate technique. Simple linear regression is investigated in Chapter 17, and is extended to the multiple regression setting – where two or more factors are used to predict a single outcome – in Chapter 18. If the outcome of interest can take on only two possible values, such as alive or dead, then an alternative technique must be applied; logistic regression is explored in Chapter 19.

In Chapter 20, the inferential methods appropriate for life tables are introduced. These techniques enable us to draw conclusions about the mortality of a population based on the experience of a sample of individuals drawn from the population. This is common in clinical trials, especially in randomized
clinical trials, when the purpose of the trial is to study whether a patient's survival has been prolonged by a treatment [31].

Chapter 21 is devoted to surveys and inference in finite populations. These techniques are very popular around election time in democracies, but also find many uses in public health. For example, the United States Census Bureau supplements the decennial census with an annual survey called the American Community Survey; its purpose is to help "local officials, community leaders, and businesses understand the changes taking place in their communities. It is the premier source for detailed population and housing information about our nation" [32]. In 2017, 2,145,639 households were interviewed. Once again, the mainstay that enables us to make credible inference about the entire United States population, 325.7 million people in 2017, is the simple random sample. We take that as our starting point, and build on it with more refined designs. Practical examples are given by the National Center for Health Statistics within the cdc [33].

Once again it would be helpful if we could control variability and lessen its effect. Some survey designs help in this regard. For example, if we can divide a population into strata where we know the size of each stratum, we can take advantage of that extra information – the size of the strata – to estimate the population characteristics more accurately via stratified sampling. If on the other hand we wish to lower the cost of the survey, we can turn to cluster sampling. Of course, we can combine these ideas and utilize both in a single survey. These design considerations and some of the issues raised are addressed in this chapter.

The last chapter, Chapter 22, could have been the first. Even though it is foundational, one needs the material developed in the rest of the book to appreciate its content. It is here that we bolster the belief that it is not just the numbers that count, but what they represent, and how they are obtained. This was made quite clear during the covid-19 pandemic. The proper monitoring of a viral epidemic and its course requires an enumeration of people infected by the virus. This, unfortunately, did not happen. Miscounting of covid-19 cases occurred across the world [34], including the United States [35,36]. One cannot help but think that this disinformation contributed to the resultant damage from the pandemic. Chapter 22 explores how best to design studies to take advantage of the methods described in this book. It also should whet your appetite to study biostatistics further, as the story gets even more fascinating. To quote what George Udny Yule wrote almost a century ago [37]:

When his work takes an investigator out of the field of the nearly perfect experiments, in which the influence of disturbing causes is practically negligible, into the field of imperfect experiment (or a fortiori of pure observation) where the influence of disturbing causes is important, the first step necessary for him is to get out of the habit of thinking in terms of the single observation and to think in terms of the average. Some seem never to get beyond this stage. But the next stage is even more important, viz., to get out of the habit of thinking in terms of the average, and think in terms of the frequency distribution. Unless and until he does this, his conclusions will always be liable to fallacy.
1.3.4
Computing Resources
In addition to Stata output, R output is presented for all examples in the Further Applications sections of each chapter. All Stata and R code is available online, and can be accessed at https://github.com/Principles-of-Biostatistics/3rd-Edition.
1.4
Review Exercises

1. Design a study aimed at investigating an issue you believe might influence the health of the world. Briefly describe the data you will require, how you will obtain them, how you intend to analyze the data, and the method you will use to present your results. Keep this study design and reread it after you have completed the text.

2. Suppose it is stated that in a given year, 512 million people around the world were malnourished, up from 460 million just five years prior [38].
   (a) Suppose that you sympathize with the point being made. Justify the use of these numbers.
   (b) Are you sure that the numbers are correct? Do you think it is possible that 513 million people were malnourished during the year in question rather than 512 million?

3. In addition to stating that "the Chinese have eaten pasta since 1100 b.c.," the label on a box of pasta shells claims that "Americans eat 11 pounds of pasta per year," whereas "Italians eat 60 pounds per year." Do you believe that these statistics are accurate? Would you use these numbers as the basis for a nutritional study?
Part I
Variability
2 Descriptive Statistics
CONTENTS
2.1 Types of Numerical Data
    2.1.1 Nominal Data
    2.1.2 Ordinal Data
    2.1.3 Ranked Data
    2.1.4 Discrete Data
    2.1.5 Continuous Data
2.2 Tables
    2.2.1 Frequency Distributions
    2.2.2 Relative Frequency
2.3 Graphs
    2.3.1 Bar Charts
    2.3.2 Histograms
    2.3.3 Frequency Polygons
    2.3.4 Box Plots
    2.3.5 Two-Way Scatter Plots
    2.3.6 Line Graphs
2.4 Numerical Summary Measures
    2.4.1 Mean
    2.4.2 Median
    2.4.3 Mode
    2.4.4 Range
    2.4.5 Interquartile Range
    2.4.6 Variance and Standard Deviation
2.5 Empirical Rule
2.6 Further Applications
2.7 Review Exercises
Every study or experiment yields a set of data. Its size can range from a few measurements to many millions of observations. A complete set of data, however, will not necessarily provide an investigator with information that can be easily interpreted. For example, Table 2.1 lists the first 2560 cases of human immunodeficiency virus infection and acquired immunodeficiency syndrome (hiv/aids) reported to the Centers for Disease Control and Prevention [39]. Each individual was classified as either suffering from Kaposi sarcoma, designated by a 1, or not suffering from the disease, represented by a 0. (Kaposi sarcoma is a malignant tumor which affects the skin, mucous membranes, and lymph nodes.) Although Table 2.1 displays the entire set of outcomes, it is extremely difficult to characterize the data. We cannot even identify the relative proportions of 0s and 1s. Between the raw data and the reported results of the study lies some intelligent and imaginative manipulation of the numbers carried out using the methods of descriptive statistics.
TABLE 2.1 Outcomes indicating whether an individual had Kaposi sarcoma for the first 2560 cases of hiv/aids reported to the Centers for Disease Control and Prevention in Atlanta, Georgia 00000000 00010100 00000010 00001000 00000001 00000000 10000000 00000000 00101000 00000000 00000000 00011000 00100001 01001100 00000000 00000010 00000001 00000000 00000010 01100000 00000000 00000100 00000000 00000000 00100010 00100000 00000101 00000000 00000000 00000001 00001001 00000000 00000000 00010000 00010000 00010000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00001000 00000000 00010000 10000000 00000000 00100000 00000000 00001000 00000010 00000000 00000100 00000000 00010000 00000000 00000000 00000100 00001000 00001000 00000101 00000000 01000000 00010000 00000000 00010000 01000000 00000000 00000000 00000101 00100000 00000000 00000000 00000100 00000000 01000100 00000000 00000001 10100000 00000100 00000000 00010000 00000000 00001000 00000000 00000010 00100000 00000000 00000000 00000000 10001000 00001000 00000000 01000000 00000000 00000000 00001100 00000000 00000000 10000011 00000001 11000000 00001000 00000000 00000000 00000000 00000000 01000000 00000001 00010001 00000000 10000000 00000000 01000000 00000000 00000000 01010100 00000000 00010100 00000000 00000000 00000000 00001010 00000101 00000000 00000000 00010000 00000000 00000000 00000000 00000001 00000100 00000000 00000000 00001000 11000000 00000100 00000000 00000000 00000000 00000000 00000000 00001000 11000000 00010010 00000000 00001000 00000000 00111000 00000001 01001100 00000000 01100000 00100010 10000000 00000000 00000010 00000001 00000000 01000010 01000100 00000000 00010000 00000000 01000000 00000001 00000000 01000000 00000001 00000000 10000000 01000000 00000000 00000000 00000100 00000000 00000000 01000010 00000000 00000000 00000000 00000000 00000000 00000000 00000010 00001010 00001001 10000000 00000000 00000010 00000000 00000000 01000000 00000000 00001000 00000000 01000000 00010000 00000000 00001000 01000010 01001111 00100000 00000000 00100000 00000000 10000001 00000001 00000000 01000000 00000000 00000000 00000000 00000000 01000000 00000000 00000000 00100000 01000000 00100000 00000000 00000011 00000000 01000000 00000100 10000001 00000001 00001000 00000100 00001000 00001000 00100000 00000000 00000000 00000000 00000010 01000001 00010011 00000000 00000000 10000000 10000000 00000000 00000000 00001000 01000000 00000000 00001000 00000000 01000010 00011000 00000001 00001001 00000000 00000001 01000010 01001000 01000000 00000010 00000000 10000000 00000100 00000000 00000010 00000000 00000000 00000010 00000000 00100100 00000000 10110100 00001100 00000100 00001010 00000000 00000000 00000000 00000000 00000000 00000010 00000000 00000000 00000000 00100000 10100000 00001000 00000000 01000000 00000000 00000000 00100000 00000000 01000001 00010010 00010001 00000000 00100000 00110000 00000000 00010000 00000000 00000100 00000000 00010100 00000000 00001001 00000001 00000000 00000000 00000000 00000000 00000010 00000100 01010100 10000001 00001000 00000000 00010010 00010000
Descriptive statistics are a means of organizing and summarizing observations. They provide us with an overview of the general features of a set of data. Descriptive statistics can assume a number of different forms, including tables, graphs, and numerical summary measures. Before we decide which techniques are the most appropriate in a given situation, however, we must first determine what type of data we have.
2.1
Types of Numerical Data
In the study of biostatistics we encounter many different types of numerical data: nominal, ordinal, ranked, discrete, and continuous. The different types of data have varying degrees of structure in the relationships among possible values.
2.1.1
Nominal Data
One of the simplest types of data is nominal data, in which the values fall into unordered categories or classes. As in Table 2.1, numbers are often used to represent the categories. In a certain study, for instance, males might be assigned the value 1 and females the value 0. Although the attributes are labeled with numbers rather than words, both the order and the magnitude of the numbers are unimportant. We could just as easily let 1 represent females and 0 designate males, or 5 represent females and 6 males. Numbers are used for the sake of convenience; numerical values allow us to use computers to perform complex analyses of the data.

Nominal data that take on one of two distinct values – such as male and female, or alive and dead – are said to be dichotomous, or binary, depending on whether the Greek or the Latin root for "two" is preferred. However, not all nominal data need be dichotomous. Often there are three or more possible categories into which the observations can fall. For example, persons may be grouped according to their blood type, where 1 represents type o, 2 is type a, 3 is type b, and 4 is type ab. Again, the sequence or order of these values is not important. The numbers simply serve as labels for the different blood types, just as the letters do.

We must keep this in mind when we perform arithmetic operations on the data. An average blood type of 1.8 for a given population is meaningless. One arithmetic operation that can be interpreted, however, is the proportion of individuals that fall into each group. An analysis of the data in Table 2.1 shows that 9.6% of the hiv/aids patients suffered from Kaposi sarcoma and 90.4% did not.
2.1.2
Ordinal Data
When the order among categories becomes important, the observations are referred to as ordinal data. For example, injuries may be classified according to their level of severity, where 1 represents a fatal injury, 2 is severe, 3 is moderate, and 4 is minor. Here a natural order exists among the groupings; a smaller number represents a more serious injury. We are still not concerned with the magnitude of these numbers, however. We could have let 4 represent a fatal injury and 1 a minor one. Furthermore, the difference between a fatal injury and a severe injury is not necessarily the same as the difference between a moderate injury and a minor one, even though both pairs of outcomes are one unit apart. As a result, many arithmetic operations still do not make sense when applied to ordinal data. Table 2.2 provides a second example of ordinal data; the scale displayed is used by oncologists to classify the performance status of patients enrolled in trials comparing alternative treatments for cancer [40]. Together, nominal and ordinal measurements are called categorical data.
TABLE 2.2 Eastern Cooperative Oncology Group's classification of patient performance status

Status   Definition
0        Patient fully active, able to carry on all pre-disease performance without restriction
1        Patient restricted in physically strenuous activity but ambulatory and able to carry out work of a light or sedentary nature
2        Patient ambulatory and capable of all self-care but unable to carry out any work activities; up and about more than 50% of waking hours
3        Patient capable of only limited self-care; confined to bed or chair more than 50% of waking hours
4        Patient completely disabled; not capable of any self-care; totally confined to bed or chair

2.1.3

Ranked Data
In some situations we have a group of observations that are first arranged from highest to lowest according to magnitude, and then assigned numbers corresponding to each observation’s place in the sequence. This type of data is known as ranked data. As an example, consider all possible causes of death in the United States. We could make a list of all of these causes, along with the number of lives that each one claimed in a particular calendar year. If the causes are ordered from the one that resulted in the greatest number of deaths to the one that caused the fewest and then assigned consecutive integers, the data are said to have been ranked. Table 2.3 lists the ten leading causes of death in the United States in 2016 [41]. Note that cerebrovascular diseases would be ranked fifth whether they caused 117,000 deaths or 154,000. In assigning the ranks, we disregard the actual values of the observations, and consider only their relative magnitudes. Even with this imprecision, it is amazing how much information the ranks contain. In fact, it is sometimes better to work with ranks than with the original data; this point is explored further in Chapter 12.
2.1.4
Discrete Data
For discrete data, both ordering and magnitude of the numbers are important. In this case, the numbers represent actual measurable quantities rather than mere labels. Despite this, discrete data are restricted to taking on only specified values – often integers or counts – that differ by fixed amounts; no intermediate values are possible. Examples of discrete data include the number of fatal motor vehicle accidents in Massachusetts in a specified month, the number of times a female has given birth, the number of new cases of tuberculosis reported in the United States during a one-year period, and the number of beds available in the intensive care unit of a particular hospital.

Note that for discrete data a natural order exists among the possible values. If we are interested in the number of fatal motor vehicle accidents over one month, for instance, a larger number indicates more fatal accidents. Furthermore, the difference between one and two accidents is the same as the difference between four and five accidents, or the difference between 20 and 21 accidents. Finally, the number of fatal motor vehicle accidents is restricted to the nonnegative integers; there cannot be 20.2 fatal accidents. Because it is meaningful to measure the distance between possible data values for discrete observations, arithmetic rules can be applied. However, the outcome of an arithmetic operation performed on two discrete values is not necessarily discrete itself.
TABLE 2.3 Ten leading causes of death in the United States, 2016

Rank   Cause of Death                                 Total Deaths
 1     Diseases of the heart                               635,260
 2     Malignant neoplasms                                 599,038
 3     Unintentional injuries                              161,374
 4     Chronic lower respiratory diseases                  154,596
 5     Cerebrovascular diseases                            142,142
 6     Alzheimer's disease                                 116,103
 7     Diabetes mellitus                                    80,058
 8     Influenza and pneumonia                              51,537
 9     Nephritis, nephrotic syndrome and nephrosis          50,046
10     Intentional self harm (suicide)                      44,965
Suppose, for instance, that in one month there are 15 fatal motor vehicle accidents, whereas there are 22 the following month. The average number of fatal motor vehicle accidents for these two months is (15 + 22)/2 = 18.5, which is not itself an integer.
2.1.5
Continuous Data
Data that represent measurable quantities but are not restricted to taking on certain specified values (such as integers) are known as continuous data. In this case, the difference between any two possible values can be arbitrarily small. Examples of continuous data include weight, age, serum cholesterol level, the concentration of a pollutant, length of time between two events, and temperature. In all instances, fractional values are possible. Since we are able to measure the distance between two observations in a meaningful way, arithmetic operations can be applied. The only limiting factor for a continuous observation is the degree of accuracy with which it can be measured; consequently, we often see time rounded off to the nearest second and weight to the nearest pound or gram or kilogram. The more accurate our measuring instruments, however, the greater the amount of detail that can be achieved in our recorded data.

At times we might require a lesser degree of detail than that afforded by continuous data; hence we occasionally transform continuous observations into ordinal or even dichotomous ones. In a study of the effects of maternal smoking on newborns, for example, we might first record the birth weights of a large number of infants and then categorize the infants into three groups: those who weigh less than 1500 grams, those who weigh between 1500 and 2500 grams, and those who weigh more than 2500 grams. Although we have the actual measurements of birth weight, we are not concerned whether a particular child weighs 1560 grams or 1580 grams; we are only interested in the number of infants who fall into each category. From prior experience, we may not expect substantial differences among children within the very low birth weight, low birth weight, and normal birth weight groupings. Furthermore, ordinal data are often easier to work with than continuous data, thus simplifying the analysis. There is a consequential loss of detail in the information about the infants, however. In general, the degree of precision required in a given set of data depends upon the questions that are being studied.
TABLE 2.4 Cases of Kaposi sarcoma for the first 2560 hiv/aids patients reported to the Centers for Disease Control and Prevention in Atlanta, Georgia

Kaposi Sarcoma   Number of Individuals
Yes                     246
No                     2314
Section 2.1 describes a gradation of numerical data that ranges from nominal to continuous. As we progress, the nature of the relationship between possible data values becomes increasingly complex. Distinctions must be made among the various types of data because different techniques are used to analyze them. As previously mentioned, it does not make sense to speak of an average blood type of 1.8; it does make sense, however, to refer to an average temperature of 36.1°C or 37.2°C, which are the lower and upper bounds for normal human body temperature.
2.2
Tables
Now that we are able to differentiate among the various types of data, we must learn how to identify the statistical techniques that are most appropriate for describing each kind. Although a certain amount of information is lost when data are summarized, a great deal can also be gained. A table is perhaps the simplest means of summarizing a set of observations and can be used for all types of numerical data.
2.2.1
Frequency Distributions
One type of table that is commonly used to evaluate data is known as a frequency distribution. For nominal and ordinal data, a frequency distribution consists of a set of classes or categories along with the numerical counts that correspond to each one. As a simple illustration of this format, Table 2.4 displays the numbers of individuals (numerical counts) who did and did not suffer from Kaposi sarcoma (classes or categories) for the first 2560 cases of hiv/aids reported to the Centers for Disease Control and Prevention [39]. A more complex example is given in Table 2.5, which specifies the numbers of cigarettes smoked per adult in the United States from 1900 through 2015 [42].

To display discrete or continuous data in the form of a frequency distribution, we must break down the range of values of the observations into a series of distinct, nonoverlapping intervals. If there are too many intervals, the summary is not much of an improvement over the raw data. If there are too few, then a great deal of information is lost. Although it is not necessary to do so, intervals are often constructed so that they all have equal widths; this facilitates comparisons among the classes. Once the upper and lower limits for each interval have been selected, the number of observations whose values fall within each pair of limits is counted, and the results are arranged as a table. As part of a National Health Examination Survey, for example, the serum cholesterol levels of 1067 25- to 34-year-old males were recorded to the nearest milligram per 100 milliliters [43]. The observations were then subdivided into intervals of equal width; the frequencies corresponding to each interval are presented in Table 2.6.
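In R, a frequency distribution of this kind can be built from raw measurements with cut() and table(). The sketch below uses a short hypothetical vector chol in place of the 1067 recorded values:

# Hypothetical raw data: serum cholesterol levels in mg/100 ml
chol <- c(180, 145, 210, 95, 302, 188, 243, 176)   # 1067 values in practice

# Break the range into distinct, nonoverlapping intervals of equal width
breaks <- seq(80, 400, by = 40)                  # 80-119, 120-159, ..., 360-399
intervals <- cut(chol, breaks = breaks, right = FALSE)

# Count the number of observations falling in each interval
table(intervals)

Because the data are recorded to the nearest integer, half-open intervals such as [80, 120) count exactly the observations labeled 80–119 in the table.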
TABLE 2.5 Cigarette consumption per person 18 years of age or older, United States, 1900–2015

Year   Number of Cigarettes
1900        54
1910       151
1920       665
1930      1485
1940      1976
1950      3522
1960      4171
1970      3985
1980      3851
1990      2828
1995      2505
2000      2076
2005      1717
2010      1278
2015      1078
TABLE 2.6 Absolute frequencies of serum cholesterol levels for 1067 United States males, aged 25 to 34 years

Cholesterol Level (mg/100 ml)   Number of Males
80–119                                13
120–159                              150
160–199                              442
200–239                              299
240–279                              115
280–319                               34
320–359                                9
360–399                                5
Total                               1067
Table 2.6 gives us an overall picture of what the data look like; it shows how the values of serum cholesterol level are distributed across the intervals. Note that the observations range from 80 to 399 mg/100 ml, with relatively few measurements at the ends of the range and a large proportion of the values falling between 120 and 279 mg/100 ml. The interval 160–199 mg/100 ml contains the greatest number of observations. Table 2.6 provides us with a much better understanding of the data than would a list of 1067 cholesterol level readings. Although we have lost some information – given the table, we can no longer recreate the raw data values – we have also extracted important information that helps us to understand the distribution of serum cholesterol levels for this group of males.

The fact that one kind of information is gained while another is lost holds true even for the simple binary data in Tables 2.1 and 2.4. We might feel that we do not lose anything by summarizing these data and counting the numbers of 0s and 1s, but in fact we do. For example, if there is some type of trend in the observations over time – perhaps the proportion of hiv/aids patients with Kaposi sarcoma is either increasing or decreasing as the epidemic matures – then this information is lost in the summary.

Tables are most informative when they are not overly complex. As a general rule, tables and the columns within them should always be clearly labeled. If units of measurement are involved, such as mg/100 ml for the serum cholesterol levels in Table 2.6, these units should be specified.
2.2.2
Relative Frequency
It is sometimes useful to know the proportion of values that fall into a given interval in a frequency distribution rather than the absolute number. The relative frequency for an interval is the proportion of the total number of observations that appear in that interval. The relative frequency is calculated by dividing the number of values within an interval by the total number of values in the table. The proportion can be left as it is, or can be multiplied by 100% to obtain the percentage of values in the interval. In Table 2.6, for example, the relative frequency in the 80–119 mg/100 ml class is (13/1067) × 100% = 1.2%; similarly, the relative frequency in the 120–159 mg/100 ml class is (150/1067) × 100% = 14.1%. The relative frequencies for all intervals in a table sum to 100%.

Relative frequencies are useful for comparing sets of data that contain unequal numbers of observations. Table 2.7 displays the absolute and relative frequencies of serum cholesterol level readings for the 1067 25- to 34-year-old males depicted in Table 2.6, as well as a group of 1227 55- to 64-year-olds. Because there are more males in the older age group, it is inappropriate to compare the columns of absolute frequencies for the two sets of males. Comparing the relative frequencies is meaningful, however. We can see that, in general, the older males have higher serum cholesterol levels than the younger ones; the younger males have a greater proportion of observations in each of the intervals below 200 mg/100 ml, whereas the older males have a greater proportion in each class above this value.

The cumulative relative frequency for an interval is the percentage of the total number of observations that have a value less than or equal to the upper limit of the interval. The cumulative relative frequency is calculated by summing the relative frequencies for the specified interval and all previous ones. Thus, for the group of 25- to 34-year-olds in Table 2.7, the cumulative relative frequency of the second interval is 1.2 + 14.1 = 15.3%; similarly, the cumulative relative frequency of the third interval is 1.2 + 14.1 + 41.4 = 56.7%. Like relative frequencies, cumulative relative frequencies are useful for comparing sets of data that contain unequal numbers of observations. Table 2.8 lists the cumulative relative frequencies for the serum cholesterol levels of the two groups of males in Table 2.7.

According to Table 2.7, older males tend to have higher serum cholesterol levels than younger ones do. This is the sort of generalization we hear quite often; for instance, it might also be said that males are taller than females, or that females live longer than males. The generalization about serum cholesterol does not mean that every 55- to 64-year-old male has a higher cholesterol level than every 25- to 34-year-old male; nor does it mean that the serum cholesterol level of every male increases with age. What the statement does imply is that for a given cholesterol level, the proportion of younger males with a reading less than or equal to this value is greater than the proportion of older males with a reading less than or equal to the value.
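The calculations above are easily reproduced from the counts in Table 2.6. A minimal sketch in R:

# Absolute frequencies from Table 2.6 (males aged 25 to 34)
counts <- c(13, 150, 442, 299, 115, 34, 9, 5)
names(counts) <- c("80-119", "120-159", "160-199", "200-239",
                   "240-279", "280-319", "320-359", "360-399")

relative   <- 100 * counts / sum(counts)   # relative frequencies, in percent
cumulative <- cumsum(relative)             # cumulative relative frequencies

round(cbind(relative, cumulative), 1)
# e.g. the 120-159 row shows 14.1 and 15.3, matching the text

The cumsum() call makes the definition of cumulative relative frequency explicit: each entry is the running sum of the relative frequencies up to and including that interval.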
TABLE 2.7 Absolute and relative frequencies of serum cholesterol levels for 2294 United States males

                                    Ages 25–34                   Ages 55–64
Cholesterol Level      Number of    Relative       Number of    Relative
(mg/100 ml)            Males        Frequency (%)  Males        Frequency (%)
80–119                    13           1.2             5           0.4
120–159                  150          14.1            48           3.9
160–199                  442          41.4           265          21.6
200–239                  299          28.0           458          37.3
240–279                  115          10.8           281          22.9
280–319                   34           3.2           128          10.4
320–359                    9           0.8            35           2.9
360–399                    5           0.5             7           0.6
Total                   1067         100.0          1227         100.0
TABLE 2.8 Relative and cumulative relative frequencies in percentages of serum cholesterol levels for 2294 United States males

                           Ages 25–34                  Ages 55–64
Cholesterol Level     Relative    Cumulative      Relative    Cumulative
(mg/100 ml)           Frequency   Frequency       Frequency   Frequency
80–119                   1.2          1.2            0.4          0.4
120–159                 14.1         15.3            3.9          4.3
160–199                 41.4         56.7           21.6         25.9
200–239                 28.0         84.7           37.3         63.2
240–279                 10.8         95.5           22.9         86.1
280–319                  3.2         98.7           10.4         96.5
320–359                  0.8         99.5            2.9         99.4
360–399                  0.5        100.0            0.6        100.0
This pattern is more obvious in Table 2.8 than it is in Table 2.7. For example, 56.7% of the 25- to 34-year-olds have a serum cholesterol level less than or equal to 199 mg/100 ml, whereas only 25.9% of the 55- to 64-year-olds fall into this category. Because the relative proportions for the two groups follow this trend in every interval in the table, the two distributions are said to be stochastically ordered. For any specified level, a larger proportion of the older males have serum cholesterol readings above this value than do the younger males; therefore, the distribution of cholesterol levels for the older males is stochastically larger than the distribution for the younger males. This definition will start to make more sense when we encounter random variables and probability distributions in Chapter 6. At that point, the implications of this ordering will become more apparent.
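Stochastic ordering can be verified directly from the cumulative relative frequencies in Table 2.8: one distribution dominates the other when its cumulative frequency is at least as large in every interval. A minimal sketch in R:

# Cumulative relative frequencies from Table 2.8
cum_younger <- c(1.2, 15.3, 56.7, 84.7, 95.5, 98.7, 99.5, 100.0)
cum_older   <- c(0.4,  4.3, 25.9, 63.2, 86.1, 96.5, 99.4, 100.0)

# The younger males' cumulative frequency is at least as large in every
# interval, so the older males' distribution is stochastically larger
all(cum_younger >= cum_older)   # TRUE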
2.3
Graphs
A second way to summarize and display data is through the use of graphs, or pictorial representations of numerical data. Graphs should be designed so that they convey the general patterns in a set of observations at a single glance. Although they are easier to read than tables, graphs often supply a lesser degree of detail. Once again, however, the loss of detail may be accompanied by a gain in understanding of the data. The most informative graphs are relatively simple and self-explanatory. Like tables, they should be clearly labeled, and units of measurement should be indicated.
2.3.1
Bar Charts
Bar charts are a popular type of graph used to display a frequency distribution for nominal or ordinal data. In a bar chart, the various categories into which the observations fall are typically listed along a horizontal axis. A vertical bar is then drawn above each category such that the height of the bar represents either the frequency or relative frequency of observations within that class. Sometimes this format is reversed, with categories listed on the vertical axis and frequencies or relative frequencies along the horizontal axis. Either way, the bars should be of equal width and separated from one another so as not to imply continuity. As an example, Figure 2.1 is a bar chart displaying the relative frequencies of Australian adults experiencing major long-term health conditions, with various health conditions listed on the vertical axis [44].
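As a sketch of how such a chart is drawn in R, the barplot() function takes a vector of frequencies or relative frequencies. The category labels and percentages below are hypothetical placeholders, not the values underlying Figure 2.1:

# Hypothetical relative frequencies (%) for a nominal variable
rel_freq <- c(18, 13, 11, 9, 5)
names(rel_freq) <- c("Arthritis", "MBC", "Asthma", "Back problems", "Diabetes")

# horiz = TRUE lists the categories on the vertical axis, as in Figure 2.1;
# las = 1 keeps the category labels horizontal so they are easy to read
barplot(rel_freq, horiz = TRUE, las = 1,
        xlab = "Relative frequency (%)")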
2.3.2
Histograms
Perhaps the most commonly used type of graph is the histogram. While a bar chart is a pictorial representation of a frequency distribution for either nominal or ordinal data, a histogram depicts a frequency distribution for discrete or continuous data. The horizontal axis displays the true limits of the various intervals. The true limits of an interval are the points that separate it from the intervals on either side. For example, the boundary between the first two classes of serum cholesterol level in Table 2.6 is 119.5 mg/100 ml; it is the true upper limit of the interval 80–119 and the true lower limit of 120–159. The vertical axis of a histogram depicts either the frequency or relative frequency of observations within each interval.

The first step in constructing a histogram is to determine the scales of the axes. The vertical scale should begin at zero; if it does not, visual comparisons among the intervals may be distorted. Once the axes have been drawn, a vertical bar centered at the midpoint is placed over each interval. The height of the bar marks the frequency associated with that interval. As an example, Figure 2.2 displays a histogram constructed from the serum cholesterol level data in Table 2.6. In reality, the frequency associated with each interval in a histogram is represented not by the height of the bar above it but by the bar's area.
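A minimal R sketch of a histogram like Figure 2.2 can be built from the binned counts in Table 2.6 by placing each observation at its interval midpoint; this is an approximation, since the raw values themselves are not available:

# Counts and interval midpoints from Table 2.6
counts <- c(13, 150, 442, 299, 115, 34, 9, 5)
mids <- seq(99.5, 379.5, by = 40)   # midpoints of 79.5-119.5, ..., 359.5-399.5

# Approximate the raw data by repeating each midpoint by its frequency,
# then draw the histogram using the true interval limits as breaks
chol <- rep(mids, times = counts)
hist(chol, breaks = seq(79.5, 399.5, by = 40),
     xlab = "Serum cholesterol level (mg/100 ml)",
     ylab = "Number of males", main = "")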
FIGURE 2.1 Bar chart: Major long-term health conditions experienced by Australian adults, 2014–2015; MBC = mental and behavioral conditions
FIGURE 2.2 Histogram: Absolute frequencies of serum cholesterol levels for 1067 United States males, aged 25 to 34 years
ISTUDY
26
Principles of Biostatistics
FIGURE 2.3 Histogram: Relative frequencies of serum cholesterol levels for 1067 United States males, aged 25 to 34 years

Thus, in Figure 2.2, 1.2% of the total area corresponds to the 13 observations that lie between 79.5 and 119.5 mg/100 ml, and 14.1% of the area corresponds to the 150 observations between 119.5 and 159.5 mg/100 ml. The area of the entire histogram sums to 100%, or 1. Note that the proportion of the total area corresponding to an interval is equal to the relative frequency of that interval. As a result, a histogram displaying relative frequencies – such as Figure 2.3 – will have the same shape as a histogram displaying absolute frequencies. Because it is the area of each bar that represents the relative proportion of observations in an interval, care must be taken when constructing a histogram with unequal interval widths; the height must vary along with the width so that the area of each bar remains in proper proportion.
2.3.3
Frequency Polygons
The frequency polygon, another commonly used graph, is similar to the histogram in many respects. A frequency polygon uses the same two axes as a histogram. It is constructed by placing a point at the center of each interval such that the height of the point is equal to the frequency or relative frequency associated with that interval. Points are also placed on the horizontal axis at the midpoints of the intervals immediately preceding and immediately following the intervals that contain observations. The points are then connected by straight lines. As in a histogram, the frequency of observations for a particular interval is represented by the area within the interval and beneath the line segment.

Figure 2.4 is a frequency polygon of the serum cholesterol level data in Table 2.6. Compare it with the histogram in Figure 2.2, which is reproduced very lightly in the background. If the total number of observations in the data set were to increase steadily, we could decrease the widths of the intervals in the histogram and still have an adequate number of measurements in each class; in this case, the histogram and the frequency polygon would become indistinguishable. As they are, both types of graphs convey essentially the same information about the distribution of serum cholesterol levels for this population of men.
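A frequency polygon for the Table 2.6 counts can be sketched in R by plotting the counts against the interval midpoints, with zero-frequency points added for the flanking intervals:

# Counts from Table 2.6, flanked by empty intervals on either side
counts <- c(0, 13, 150, 442, 299, 115, 34, 9, 5, 0)
mids   <- seq(59.5, 419.5, by = 40)   # interval midpoints: 59.5, 99.5, ..., 419.5

# Connect the points with straight lines to form the polygon
plot(mids, counts, type = "b", pch = 16,
     xlab = "Serum cholesterol level (mg/100 ml)",
     ylab = "Number of males")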
FIGURE 2.4 Frequency polygon: Absolute frequencies of serum cholesterol levels for 1067 United States males, aged 25 to 34 years

We can see that the measurements are centered around 180 mg/100 ml, and drop off a little more quickly to the left of this value than they do to the right. Most of the observations lie between 120 and 280 mg/100 ml, and all are between 80 and 400 mg/100 ml. Because they can be easily superimposed, frequency polygons are superior to histograms for comparing two or more sets of data. Figure 2.5 displays the frequency polygons of the serum cholesterol level data presented in Table 2.7. Since the older males tend to have higher serum cholesterol levels, their polygon lies to the right of the polygon for the younger males.

Although its horizontal axis is the same as that for a standard frequency polygon, the vertical axis of a cumulative frequency polygon displays cumulative relative frequencies. A point is placed at the true upper limit of each interval; the height of the point represents the cumulative relative frequency associated with that interval. The points are then connected by straight lines. Like frequency polygons, cumulative frequency polygons may be used to compare sets of data. This is illustrated in Figure 2.6. By noting that the cumulative frequency polygon for 55- to 64-year-old males lies to the right of the polygon for 25- to 34-year-old males for each value of serum cholesterol level, we can see that the distribution for older males is stochastically larger than the distribution for younger males.

Cumulative frequency polygons can also be used to obtain the percentiles of a set of data. The 95th percentile is a value which is greater than or equal to 95% of the observations and less than or equal to the remaining 5%. Similarly, the 75th percentile is a value which is greater than or equal to 75% of the observations and less than or equal to the other 25%. This definition is only approximate because taking 75% of an integer does not typically result in another integer; consequently, there is often some rounding or interpolation involved. In Figure 2.6, the 50th percentile of the serum cholesterol levels for the group of 25- to 34-year-olds – the value that is greater than or equal to half of the observations and less than or equal to the other half – is approximately 193 mg/100 ml; the 50th percentile for the 55- to 64-year-olds is about 226 mg/100 ml.

Percentiles are useful for describing the shape of a distribution. For example, if the 40th and 60th percentiles of a set of data lie an equal distance away from the midpoint, and the same is true of
FIGURE 2.5 Frequency polygon: Relative frequencies of serum cholesterol levels for 2294 United States males
FIGURE 2.6 Cumulative frequency polygon: Cumulative relative frequencies of serum cholesterol levels for 2294 United States males
the 30th and 70th percentiles, the 20th and 80th, and all other pairs of percentiles that sum to 100, then the data are symmetric; that is, the distribution of values has the same shape on each side of the 50th percentile. Alternatively, if there are a number of outlying observations on one side of the midpoint only, then the data are said to be skewed. If these observations are smaller than the rest of the values, the data are skewed to the left; if they are larger than the other measurements, the data are skewed to the right. The various shapes that a distribution of data can assume are discussed further in Section 2.4.

FIGURE 2.7 Box plot: Crude death rates for each state in the United States, 2016
2.3.4
Box Plots
Another type of graph that can be used to summarize a set of discrete or continuous observations is the box plot. Unlike the histogram or frequency polygon, a box plot uses a single axis to display selected summaries of the measurements [45]. As an example, Figure 2.7 depicts the crude death rates for each of the 50 states and the District of Columbia in 2016, from a low of 587.1 per 100,000 population in Utah to a high of 1241.4 per 100,000 population in West Virginia [46]. (For each state, the "crude" death rate is simply the number of deaths in 2016 divided by the size of the population in that year. In Chapter 3 we will discuss this further, and investigate the differences among crude rates, specific rates, and adjusted rates.)

The central box in the box plot – which is depicted vertically in Figure 2.7 but which can also be horizontal – extends from the 25th percentile, 794.1 per 100,000, to the 75th percentile, 969.3 per 100,000. The 25th and 75th percentiles of a data set are called quartiles of the data. The line running between the quartiles at 891.6 deaths per 100,000 population marks the 50th percentile of the data set; half the observations are less than or equal to 891.6 per 100,000, and the other half are greater than or equal to this value. If the 50th percentile lies approximately halfway between the two quartiles, this implies that the observations in the center of the data set are roughly symmetric.
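The fences that determine the reach of a box plot follow directly from the quartiles. A minimal R sketch of the arithmetic behind Figure 2.7 (when given the raw data, the boxplot() function performs the same computation automatically):

# Quartiles of the 2016 crude death rates, per 100,000 population
q25 <- 794.1
q75 <- 969.3

step <- 1.5 * (q75 - q25)   # 1.5 times the height of the box: 262.8
c(lower = q25 - step,       # 531.3: observations below this are outliers
  upper = q75 + step)       # 1232.1: observations above this are outliers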
The lines projecting out from the box on either side extend to the adjacent values of the plot. The adjacent values are the most extreme observations in the data set that are not more than 1.5 times the height of the box beyond either quartile. In Figure 2.7, 1.5 times the height of the box is 1.5 × (969.3 − 794.1) = 262.8 per 100,000 population. Therefore, the adjacent values are the smallest and largest observations in the data set which are not more extreme than 794.1 − 262.8 = 531.3 and 969.3 + 262.8 = 1232.1 per 100,000 population, respectively. Since there is no crude death rate less than 531.3, the lower adjacent value is simply the minimum value, 587.1 per 100,000. There is one value higher than 1232.1 – the maximum value of 1241.4 per 100,000 – and thus the upper adjacent value is 1078.8 per 100,000, the next largest value. In fairly symmetric data sets, the adjacent values should contain approximately 99% of the measurements. All points outside this range are represented by circles; these observations are considered to be outliers, or data points which are not typical of the rest of the values. It should be noted that the preceding explanation is merely one way to define a box plot; other definitions exist and exhibit varying degrees of complexity [47].

FIGURE 2.8 Box plots: Crude death rates for each state in the United States, 1996, 2006, and 2016

Because the box plot displays only a summary of the data on a single axis, it can be used to make comparisons across groups or over time. Figure 2.8, for example, contains summaries of crude death rates for the 50 states and the District of Columbia for three different calendar years: 1996, 2006, and 2016 [46]. The 25th, 50th, and 75th percentiles of crude death rate all decrease from 1996 to 2006, but then increase again in 2016.
2.3.5
Two-Way Scatter Plots
Unlike the other graphs we have discussed, a two-way scatter plot is used to depict the relationship between two different continuous measurements. Each point on the graph represents a pair of values; the scale for one quantity is marked on the horizontal axis, and the scale for the other on the vertical axis. For example, Figure 2.9 plots two simple measures of lung function – forced vital capacity (fvc) and forced expiratory volume in one second (fev1 ) – for 19 asthmatic subjects who participated
FIGURE 2.9 Two-way scatter plot: Forced vital capacity versus forced expiratory volume in one second for nineteen asthmatic subjects in a study investigating the physical effects of sulfur dioxide exposure [48]. Forced vital capacity is the volume of air that can be expelled from the lungs in six seconds, and forced expiratory volume in one second is the volume that can be expelled after one second of constant effort. Note that the individual represented by the point that is farthest to the left had an fev1 measurement of 2.0 liters and an fvc measurement of 2.8 liters. (There are only 18 points marked on the graph instead of 19 because two individuals had identical values of fvc and fev1 ; consequently, one point lies directly on top of another.) As might be expected, the graph indicates that there is a strong relationship between these two quantities; fvc increases in magnitude as fev1 increases.
2.3.6 Line Graphs
A line graph is similar to a two-way scatter plot in that it can be used to illustrate the relationship between continuous quantities. Once again, each point on the graph represents a pair of values. In this case, however, each value on the horizontal axis has a single corresponding measurement on the vertical axis, and adjacent points are connected by straight lines. Most commonly, the scale along the horizontal axis represents time. Consequently, we are able to trace the chronological change in the quantity on the vertical axis over a specified period. Figure 2.10 displays the trends in the reported rates of malaria that occurred in the United States between 1940 and 2015 [49]. Note the log scale on the vertical axis; this scale allows us to depict a large range of observations while still showing the variation among the smaller values. To compare two or more groups with respect to a given quantity, it is possible to plot more than one measurement along the vertical axis. Suppose we are concerned with the rising costs of health care. To investigate this problem, we might wish to compare the variations in cost that have occurred under two different health care systems in recent years. Figure 2.11 depicts the trends in health care expenditures in both the United States and Canada between 1970 and 2017 [50, 51].
FIGURE 2.10 Line graph: Reported rates of malaria by year, United States, 1940–2015
FIGURE 2.11 Line graph: Health care expenditures as a percentage of gross domestic product (gdp) for the United States and Canada, 1970–2017
FIGURE 2.12 Leading causes of death in South Africa, 1997–2013, in thousands of deaths; colored bands from bottom to top represent other causes, digestive, nervous, endocrine, respiratory system, circulatory system, neoplasm, infectious, external causes, blood and immune disorders
In this section, we have not attempted to examine all possible types of graphs. Instead, we have included only a selection of the more common ones. It should be noted that many other imaginative displays exist [52]. One such example is Figure 2.12, which displays the leading causes of death in South Africa from 1997 through 2013 [53]. The top border of the light blue segment at the bottom is actually a line graph tracking the number of deaths due to “other causes” – those not represented by the nine colored bands above it – over the 17-year time period. The purple segment above this shows the number of deaths due to diseases of the digestive system in each year; the top border of this segment displays the number of deaths due to other and digestive causes combined. The top of the uppermost blue segment displays the total number of deaths due to all causes in each calendar year, allowing us to see that the number of deaths in South Africa increased from 1997 through 2006, and then decreased from 2006 through 2013. Some of this decrease can be attributed to a fall in deaths due to diseases of the respiratory system, the bright pink band; note that this band becomes narrower beginning in the late 2000s. The number of deaths due to infectious disease – the light green band – decreased after 2009. Deaths due to many of the other causes have not changed much over this time period, as evidenced by the segments of constant height. Regardless of the type of display being used, as a general rule, too much information should not be squeezed into a single graph. A relatively simple illustration is often the most effective.
2.4 Numerical Summary Measures
Although tables and graphs are extremely useful methods for organizing, visually summarizing, and displaying a set of data, they do not allow us to make concise, quantitative statements that characterize the distribution of values as a whole. In order to do this, we instead rely on numerical summary measures. Together, the various types of descriptive statistics can provide a great deal of information about a set of observations. The most commonly investigated characteristic of a set of data is its center, or the point about which the observations tend to cluster. This is sometimes called a “measure of central tendency.” Suppose we are interested in examining the response to air pollutants such as ozone and sulfur dioxide among adolescents suffering from asthma. Listed in Table 2.9 are the initial measurements of forced expiratory volume in one second for 13 subjects involved in such a study [54]. Recall that fev1 is the volume of air that can be expelled from the lungs after one second of constant effort. Before investigating the effect of pollutants on lung function, we might wish to determine the “typical” value of fev1 prior to exposure for the individuals in this group.
2.4.1 Mean
The most frequently used measure of central tendency is the arithmetic mean, or average. The mean is calculated by summing all the observations in a set of data and dividing by the total number of measurements. In Table 2.9, for example, we have 13 observations. If x is used to represent fev1, then $x_1 = 2.30$ denotes the first in the series of observations; $x_2 = 2.15$, the second; and so on up through $x_{13} = 3.38$. In general, $x_i$ refers to a single fev1 measurement, where the subscript i can take on any value from 1 to n, the total number of observations in the group. The mean of the observations in the dataset – represented by $\bar{x}$, or x-bar – is

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .$$
Note that we have used some mathematical shorthand. The uppercase Greek letter sigma, $\Sigma$, is the symbol for summation. The expression $\sum_{i=1}^{n} x_i$ indicates that we should add up the values of all of the observations in the group, from $x_1$ to $x_n$. When $\Sigma$ appears in the text, the limits of summation are placed beside it; when it does not, the limits are above and below it. Both representations of a summation denote exactly the same thing. In some cases where it is clear that we are supposed to sum all observations in a dataset, the limits may be dropped altogether. For the fev1 measurements,

$$\bar{x} = \frac{1}{13} \sum_{i=1}^{13} x_i = \frac{1}{13}(2.30 + 2.15 + 3.50 + 2.60 + 2.75 + 2.82 + 4.05 + 2.25 + 2.68 + 3.00 + 4.02 + 2.85 + 3.38) = \frac{38.35}{13} = 2.95 \text{ liters}.$$

The mean can be used as a summary measure for both discrete and continuous measurements. In general, however, it is not appropriate for either nominal or ordinal data. Recall that for these types of observations, the numbers are merely labels; even if we choose to represent the blood types o, a, b, and ab by the numbers 1, 2, 3, and 4, an average blood type of 1.8 is meaningless.
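As a check on the arithmetic above, the same mean can be computed in R with the built-in mean() function; the vector below simply transcribes the fev1 values from Table 2.9.

# fev1 measurements (liters) for the 13 subjects in Table 2.9
fev1 <- c(2.30, 2.15, 3.50, 2.60, 2.75, 2.82, 4.05,
          2.25, 2.68, 3.00, 4.02, 2.85, 3.38)
mean(fev1)                # 2.95
sum(fev1) / length(fev1)  # the same calculation written out explicitly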
TABLE 2.9
Forced expiratory volumes in 1 second for 13 adolescents suffering from asthma

Subject    FEV1 (liters)
1          2.30
2          2.15
3          3.50
4          2.60
5          2.75
6          2.82
7          4.05
8          2.25
9          2.68
10         3.00
11         4.02
12         2.85
13         3.38
One exception to this rule applies when we have dichotomous data, and the two possible outcomes are represented by the values 0 and 1. In this situation, the mean of the observations is equal to the proportion of 1s in the data set. For example, suppose that we want to know the proportion of asthmatic adolescents in the previously described study who are males. Listed in Table 2.10 are the relevant dichotomous data; the value 1 represents a male and 0 designates a female. If we compute the mean of these observations, we find that

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{13}(0 + 1 + 1 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 1 + 0) = \frac{8}{13} = 0.615.$$

TABLE 2.10
Indicators of sex for 13 adolescents suffering from asthma

Subject    Sex
1          0
2          1
3          1
4          0
5          0
6          1
7          1
8          1
9          0
10         1
11         1
12         1
13         0
Therefore, 61.5% of the study subjects are males. It would have been a little more difficult to determine the relative frequency of males, however, if we had represented males by the value 5 and females by 12.

The method for calculating the mean takes into consideration the magnitude of each and every observation in a set of data. What happens when one observation has a value that is very different from the others? Suppose, for instance, that for the data shown in Table 2.9, we had accidentally recorded the fev1 measurement of subject 11 as 40.2 rather than 4.02 liters. The mean fev1 of all 13 subjects would then be calculated as

$$\bar{x} = \frac{1}{13}(2.30 + 2.15 + 3.50 + 2.60 + 2.75 + 2.82 + 4.05 + 2.25 + 2.68 + 3.00 + 40.2 + 2.85 + 3.38) = \frac{74.53}{13} = 5.73 \text{ liters},$$
which is nearly twice as large as it was before. Clearly, the mean is extremely sensitive to unusual values. In this particular example, we would have rightfully questioned an fev1 measurement of 40.2 liters and would have either corrected the error or separated this observation from the others. In general, however, the error might not be as obvious, or the unusual observation might not be an error at all. Since it is our intent to characterize an entire group of individuals, we might prefer to use a summary measure that is not as sensitive to each and every observation.
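Both points in this section are easy to verify numerically. Continuing with the fev1 vector defined earlier, the short sketch below shows that the mean of a 0/1 indicator is a proportion, and how a single recording error shifts the mean; fev1_err is a hypothetical name for the contaminated copy of the data.

# Sex indicators from Table 2.10 (1 = male, 0 = female)
sex <- c(0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0)
mean(sex)             # 0.615, the proportion of males

# The hypothetical recording error described above
fev1_err <- fev1
fev1_err[11] <- 40.2  # subject 11 entered as 40.2 instead of 4.02
mean(fev1_err)        # 5.73 liters, nearly twice the correct mean of 2.95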
2.4.2 Median
One measure of central tendency which is not as sensitive to the value of each measurement is the median. Like the mean, the median can be used as a summary measure for discrete and continuous measurements; however, it can be used for ordinal data as well. The median is defined as the 50th percentile of a set of measurements; if a list of observations is ranked from smallest to largest, then half the values would be greater than or equal to the median, and the other half would be less than or equal to it. If a set of data contains a total of n observations where n is odd, the median is the middle value, or the [(n + 1)/2]th largest measurement; if n is even, the median is usually taken to be the average of the two middlemost values, the (n/2)th and [(n/2) + 1]th observations. If we were to rank the 13 fev1 measurements listed in Table 2.9, for example, the following sequence would result:

2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85, 3.00, 3.38, 3.50, 4.02, 4.05.

Since there is an odd number of observations in the list, the median would be the (13 + 1)/2 = 7th observation, or 2.82. Seven of the measurements are less than or equal to 2.82 liters, and seven are greater than or equal to 2.82. The calculation of the median takes into consideration only the ordering and relative magnitude of the observations in a set of data. In the situation where the fev1 of subject 11 was recorded as 40.2 rather than 4.02, the ranking of the measurements would change only slightly:

2.15, 2.25, 2.30, 2.60, 2.68, 2.75, 2.82, 2.85, 3.00, 3.38, 3.50, 4.05, 40.2.
As a result, the median fev1 would still be 2.82 liters. The median is said to be robust; that is, it is much less sensitive to unusual data points than is the mean.
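The robustness of the median is equally easy to confirm with the same two vectors defined above:

median(fev1)      # 2.82 liters
median(fev1_err)  # still 2.82 liters; the outlier leaves the median unchanged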
2.4.3 Mode
A third measure of central tendency is the mode; it can be used as a summary measure for all types of data, although it is most useful for categorical measurements. The mode of a set of values is the observation that occurs most frequently. The continuous fev1 data in Table 2.9 do not have a unique mode, since each of the values occurs only once. It is not uncommon for continuous measurements to have no unique mode, or to have more than one; this is less likely to occur with nominal or ordinal measurements. For example, the mode for the dichotomous data in Table 2.10 is 1; this value appears eight times, whereas 0 appears only five times.

The best measure of central tendency for a given set of data often depends on the way in which the values are distributed. If continuous or discrete measurements are symmetric and unimodal – meaning that, if we were to draw a histogram or a frequency polygon, there would be only one peak, as in the smoothed distribution pictured in Figure 2.13(a) – then the mean, the median, and the mode should all be roughly the same. If the distribution of values is symmetric but bimodal, so that the corresponding frequency polygon would have two peaks as in Figure 2.13(b), then the mean and median should again be the same. Note, however, that this common value could lie between the two peaks, and hence be a measurement that is extremely unlikely to occur. A bimodal distribution often indicates that the population from which the values are taken actually consists of two distinct subgroups that differ in the characteristic being measured; in this situation, it might be better to report two modes rather than the mean or the median, or to treat the two subgroups separately. The data in Figure 2.13(c) are skewed to the right, and those in Figure 2.13(d) are skewed to the left. When the data are not symmetric, as in these two figures, the median is often the best measure of central tendency. Because the mean is sensitive to extreme observations, it is pulled in the direction of the outlying data values, and as a result might end up either excessively inflated or excessively deflated. Note that when the data are skewed to the right, the mean lies to the right of the median; when they are skewed to the left, the mean lies to the left of the median. In both instances, the mean is pulled in the direction of the extreme values.

Regardless of the measure of central tendency used in a particular situation, it can be misleading to assume that this value is representative of all observations in the group. One example that illustrates this point was included in an episode of the popular news program “60 Minutes,” where it was noted that although the French diet tends to be high in fat and cholesterol, France has a fairly low rate of heart disease relative to other countries, including the United States. This paradox was attributed to the French habit of drinking wine with meals, red wine in particular. Studies have suggested that moderate alcohol consumption can lessen the risk of heart disease. The per capita intake of wine in France is one of the highest in the world, and the program implied that the French drink a moderate amount of wine each day, perhaps two or three glasses. The reality may be quite different, however. According to a wine industry survey, more than half of all French adults never drink wine at all [55]. Of those who do, only 28% of males and 11% of females drink it daily. Obviously the distribution of wine consumption is far more variable than the “typical value” would suggest.
Remember that when we summarize a set of data, information is always lost. Thus, although it is helpful to know where the center of a dataset lies, this information is usually not sufficient to characterize an entire distribution of measurements. As another example, the two very different distributions of data values pictured in Figure 2.14 have the same means, medians, and modes. To know how good our measure of central tendency actually is, we need to have some idea about the variation among the measurements. Do all the observations tend to be quite similar and therefore lie close to the center, or are they spread out
across a broad range of values? To answer this question, we need to calculate a measure of the variability among values, also called a measure of dispersion.

FIGURE 2.13 Possible distributions of data values: (a) unimodal, (b) bimodal, (c) right-skewed, (d) left-skewed
2.4.4 Range
One number that can be used to describe the variability in a set of data is the range. The range of a group of measurements is defined as the difference between the largest and the smallest observations. Although the range is easy to compute, its usefulness is limited; it considers only the extreme values of a dataset rather than the majority of the observations. Therefore, like the mean, it is highly sensitive to exceptionally large or exceptionally small values. The range for the fev1 data in Table 2.9 is 4.05 − 2.15 = 1.90 liters. If the fev1 of subject 11 was recorded as 40.2 instead of 4.02 liters, however, the range would be 40.2 − 2.15 = 38.05 liters, a value 20 times as large.
2.4.5 Interquartile Range
A second measure of variability – one that is not as easily influenced by extreme values – is called the interquartile range. The interquartile range is calculated by subtracting the 25th percentile of the data from the 75th percentile; consequently, it encompasses the middle 50% of the observations. (Recall that the 25th and 75th percentiles of a data set are called quartiles.) For the fev1 data in Table 2.9, the 75th percentile is 3.38. Note that three observations are greater than this value and nine are smaller. Similarly, the 25th percentile is 2.60. Therefore, the interquartile range is 3.38 − 2.60 = 0.78 liters.

If a computer is not available, there are rules for finding the kth percentile of a set of measurements by hand, just as there were rules for finding the median. In that case, we ranked the measurements from smallest to largest, and the rule used depended on whether the number of observations n was even or odd. For other percentiles, we again begin by ranking the measurements from smallest to largest.
FIGURE 2.14 Two distributions with identical means, medians, and modes

If nk/100 is an integer, then the kth percentile of the data is the average of the (nk/100)th and (nk/100 + 1)th largest observations. If nk/100 is not an integer, then the kth percentile is the (j + 1)th largest measurement, where j is the largest integer which is less than nk/100. To find the 25th percentile of the 13 fev1 measurements, for example, we first note that 13(25)/100 = 3.25 is not an integer. Therefore, the 25th percentile is the 3 + 1 = 4th largest measurement (since 3 is the largest integer less than 3.25), or 2.60 liters. Similarly, 13(75)/100 = 9.75 is not an integer, and the 75th percentile is the 9 + 1 = 10th largest measurement, or 3.38 liters.

The interquartile ranges of daily glucose levels measured at each minute over a 24-hour period for a total of 90 days – as well as the 10th and 90th percentiles – are presented for a single individual in Figure 2.15. These interquartile ranges allow us to determine at which times of day glucose has the most variability, and when there is less variability.
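The hand rule for percentiles translates directly into a few lines of R. The function below is a sketch of the rule exactly as stated in the text; it is not how R's own quantile() function interpolates, so results can differ slightly from R's defaults.

# kth percentile of x, following the rule described above
pctl <- function(x, k) {
  x <- sort(x)
  m <- length(x) * k / 100
  if (m == floor(m)) {
    (x[m] + x[m + 1]) / 2  # nk/100 is an integer: average two observations
  } else {
    x[floor(m) + 1]        # otherwise take the (j + 1)th largest
  }
}
pctl(fev1, 25)                   # 2.60 liters
pctl(fev1, 75)                   # 3.38 liters
pctl(fev1, 75) - pctl(fev1, 25)  # interquartile range, 0.78 liters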
2.4.6 Variance and Standard Deviation
Another commonly used measure of dispersion is known as the variance. The variance quantifies how different the observations are from each other by computing half of the average squared distance between the measurements. To find the average distance, we list out all possible pairs of measurements $x_i$ and $x_j$ where $i \neq j$ (we do not want to compare a measurement to itself), calculate the difference between the observations in each pair, square them, sum them all up, and divide by twice the total number of pairs. Since the total number of possible pairs for a sample of size n is n(n − 1), the variance is defined as

$$s^2 = \frac{1}{2n(n-1)} \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} (x_i - x_j)^2 .$$
FIGURE 2.15 Medians and interquartile ranges of daily glucose levels measured over a 24-hour period for a total of 90 days

A mathematically equivalent formula, and the one more commonly used, is

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 ,$$
which is based on the squared difference of each measurement from the sample mean $\bar{x}$. Although less intuitive, this formula is easier to calculate by hand. For the 13 fev1 measurements presented in Table 2.9, the mean is $\bar{x}$ = 2.95 liters, and the difference and squared difference of each observation from the mean are given below.

Subject    x_i      x_i − x̄    (x_i − x̄)²
1          2.30     −0.65      0.4225
2          2.15     −0.80      0.6400
3          3.50      0.55      0.3025
4          2.60     −0.35      0.1225
5          2.75     −0.20      0.0400
6          2.82     −0.13      0.0169
7          4.05      1.10      1.2100
8          2.25     −0.70      0.4900
9          2.68     −0.27      0.0729
10         3.00      0.05      0.0025
11         4.02      1.07      1.1449
12         2.85     −0.10      0.0100
13         3.38      0.43      0.1849
Total     38.35      0.00      4.6596
Therefore, the variance is

$$s^2 = \frac{1}{13 - 1} \sum_{i=1}^{13} (x_i - 2.95)^2 = \frac{4.6596}{12} = 0.39 \text{ liters}^2 .$$
The standard deviation of a set of values is the positive square root of the variance. Thus, for the 13 fev1 measurements above, the standard deviation is

$$s = \sqrt{s^2} = \sqrt{0.39 \text{ liters}^2} = 0.62 \text{ liters}.$$
In practice, the standard deviation is used more frequently than the variance. This is primarily because the standard deviation has the same units of measurement as the mean, rather than squared units. In a comparison of two sets of measurements, the group with the smaller standard deviation has the more homogeneous observations; the group with the larger standard deviation exhibits a greater amount of variability. The actual magnitude of the standard deviation depends on the values in the dataset; what is large for one set of data may be small for another. In addition, because the standard deviation has units of measurement, it is meaningless to compare standard deviations for two unrelated quantities, such as age and weight. Together, these two numbers, a measure of central tendency and a measure of dispersion, can be used to summarize an entire distribution of values. It is most common to see the standard deviation reported with the mean, and either the range or the interquartile range reported with the median.

Summary: Numerical Summary Measures

Term                        Notation / Definition
Mean                        $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Median                      50th percentile
Mode                        Value that occurs most frequently
Range                       Maximum value − minimum value
Interquartile range (IQR)   75th percentile − 25th percentile
Variance                    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{2n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} (x_i - x_j)^2$
Standard deviation          $s = \sqrt{s^2}$
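In R, the variance and standard deviation of the fev1 measurements are one-liners, and the equivalence of the two variance formulas can be verified directly; the diagonal terms of the pairwise sum are zero, so including them is harmless.

var(fev1)  # 0.388 liters^2 (0.39 after rounding)
sd(fev1)   # 0.623 liters

# Check the pairwise definition of the variance
n <- length(fev1)
d <- outer(fev1, fev1, "-")   # matrix of all differences x_i - x_j
sum(d^2) / (2 * n * (n - 1))  # matches var(fev1)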
2.5 Empirical Rule
When a distribution of continuous measurements is symmetric and unimodal, the mean and standard deviation can be used to construct an interval which captures a specified proportion of the observations in the dataset. The empirical rule tells us that approximately 67% of the observations lie in the interval $\bar{x} \pm 1s$, about 95% in the interval $\bar{x} \pm 2s$, and almost all of the observations in the interval $\bar{x} \pm 3s$. Consider the measurements of total cholesterol depicted in the histogram in Figure 2.16, which come from the Framingham Heart Study [56]. This study, which began enrolling subjects who lived in Framingham, Massachusetts in 1948, was the first prospective study investigating risk factors for cardiovascular outcomes. Total cholesterol levels were measured at the time of enrollment for 4380 individuals in the study, and have a symmetric, unimodal distribution. The mean and standard deviation of these observations are 236.8 mg/dL and 43.8 mg/dL, respectively. Therefore, the empirical rule says that the interval

236.8 ± (1 × 43.8), or (193.0, 280.6),

contains approximately 67% of the total cholesterol measurements,

236.8 ± (2 × 43.8), or (149.2, 324.4),

contains 95%, and

236.8 ± (3 × 43.8), or (105.4, 368.2),
contains nearly all of the observations. In fact, for the 4380 measurements, 69.9% are between 193.0 and 280.6 mg/dL, 96.0% are between 149.2 and 324.4 mg/dL, and 99.4% are between 105.4 and 368.2 mg/dL.

The empirical rule allows us to use the mean and the standard deviation of a set of data, just two numbers, to describe the entire group, and interpretation of the magnitude of the mean is enhanced by the rule. As previously noted, however, in order to apply the empirical rule, a distribution of data values must be at least approximately symmetric and unimodal. The closer the distribution is to this ideal, the more precise the descriptions provided by the rule. Deviations from the ideal – especially if they are extreme – not only invalidate the use of the empirical rule, but even call into question the usefulness of the mean and standard deviation as numerical summary measures.
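With data stored in a vector, empirical rule intervals like those above are built directly from the sample mean and standard deviation. A minimal sketch follows, using the reported summary values in place of the raw cholesterol measurements, which are not reproduced here.

xbar <- 236.8  # mean total cholesterol (mg/dL), as reported above
s    <- 43.8   # standard deviation (mg/dL)
xbar + c(-1, 1) * s      # (193.0, 280.6): roughly 67% of observations
xbar + c(-1, 1) * 2 * s  # (149.2, 324.4): roughly 95%
xbar + c(-1, 1) * 3 * s  # (105.4, 368.2): almost all
# With raw data x, the proportion actually captured would be
# mean(x >= xbar - 2 * s & x <= xbar + 2 * s)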
FIGURE 2.16 Total cholesterol measurements at the time of enrollment for individuals participating in the Framingham Heart Study

Returning to the Framingham Heart Study, consider the reported average number of cigarettes smoked per day at the time of enrollment. In addition to this discrete measurement, the researchers also collected a binary measurement of smoking status: smoker versus non-smoker. If d is used to represent smoking status (taking the value 1 for a smoker, and 0 for a non-smoker), while x represents the average number of cigarettes smoked per day, then the ith individual in the group has a pair of measurements $(d_i, x_i)$. The subscript i takes on any value from 1 to 4402, the total number of subjects in the study for whom these values were recorded. Figure 2.17 displays the x values, the average numbers of cigarettes smoked per day. Note that these values are not symmetric and unimodal, and therefore the empirical rule should not be applied. Beyond that, however, we might wonder whether the mean is providing any useful information at all. Recall that we introduced the mean as a measure of central tendency, a “typical” value for a set of measurements. Knowing that the center for the number of cigarettes smoked per day is $\bar{x} = 9.0$ is not particularly helpful. The problem is that there are really two distinct groups of study subjects: smokers and non-smokers. The mean of the x values ignores the information contained in d. Cigarette consumption for the individuals who do not smoke – the 51% of the total cohort for whom $d_i = 0$ – is 0 cigarettes per day, resulting in a mean value of 0 for this subgroup. For the subgroup of smokers – those for whom $d_i = 1$ – the mean cigarette consumption is 18.4 cigarettes per day. The overall mean of $\bar{x} = 9.0$ is not representative of either of these subgroups. It might be useful for the manufacturer who is trying to determine how many cigarettes to make, but it does not help us to understand the health of the population. Instead of attempting to capture the situation with a single mean, it is more informative to present two numerical summary measures: the proportion of the population who smokes, and the mean number of cigarettes smoked per day only for the subgroup of smokers. (Since the binary measurements of smoking status are represented by 0s and 1s, the proportion of 1s – equivalently, the proportion of smokers – is simply the mean of the $d_i$s.) These two numbers give us a more complete summary of the data.

Of course, reporting two means also complicates the interpretation. Suppose that we want to track changes in smoking habits over time. With a single mean, it is easy to see whether cigarette consumption is increasing or decreasing; with two means, it is not. What if fewer people smoke over time, but those who do smoke increase their consumption? Can this be considered an improvement in health?

Additional complexity is introduced if we are dealing with a rare event. The information in Table 2.11 was presented as part of an argument about the loss of human lives attributable to guns [57]. The entries in the table show the number of deaths in each year from 2009 through 2015, by country, attributed to mass shootings. Although there is some disagreement on how to define a mass shooting, here it is defined as an incident resulting in four or more fatalities.
FIGURE 2.17 Average number of cigarettes smoked per day at the time of enrollment for individuals participating in the Framingham Heart Study

The argument utilizing these data focused on a contrast between the United States and Europe. The authors took a country’s mean number of deaths per year over the seven-year period and divided by its population size to calculate the “annual death rate from mass shootings per million people.” Doing this, the United States ranked eighth highest, and it was therefore claimed that it is safer to live in the United States than in the seven European countries which ranked higher. We might consider, however, whether this metric is the most meaningful way to summarize these data. First, note that there are currently 44 countries in Europe, but only 16 are listed in Table 2.11. These 16 countries were selected because they had at least one mass shooting episode over the seven-year period. Since the majority of European countries had no mass shootings at all, the sample of countries shown is not representative. To more fairly compare the situation in Europe to that in the United States, all countries must be included. Second, just as with the cigarette consumption measurements from the Framingham Heart Study, we should consider two dimensions of these data rather than just one: the frequency of mass shootings, and the number of fatalities when a shooting does occur. Both of these pieces of information are important. To better understand the frequency of mass shootings, Table 2.12 contains the number of mass shootings in each year from 2009 to 2015. Over the seven-year period, there were six shootings in France, and two in Belgium, Russia, Serbia, and Switzerland. Each of the other countries in the table had just one mass shooting. The 28 European countries not shown in the table had none at all. At the country level, a mass shooting is a rare event, and the mean number of shootings per year is not a helpful summary measure, as all the means are low. In contrast, over the same time period, the United States had 25 shootings, the same number as all of Europe combined. In fact, looking at the last two rows of the table, the behavior of the two regions is quite similar. Some might say that a fairer comparison would take into account the relative population sizes of Europe and the United States. This would certainly be true if we believe that a certain fixed proportion of a population are potential mass shooters, and therefore a larger population would produce more of these individuals.
TABLE 2.11
Number of deaths per year attributed to mass shootings, 2009–2015

Country          2009  2010  2011  2012  2013  2014  2015   Total    Mean  Median
Albania             0     0     0     0     0     4     0       4    0.57       0
Austria             0     0     0     0     4     0     0       4    0.57       0
Belgium             0     0     6     0     0     4     0      10    1.43       0
Czech Republic      0     0     0     0     0     0     9       9    1.29       0
Finland             5     0     0     0     0     0     0       5    0.71       0
France              0     0     0     8     0     0   150     158   22.60       0
Germany            13     0     0     0     0     0     0      13    1.86       0
Italy               0     0     0     0     0     0     4       4    0.57       0
Macedonia           0     0     0     5     0     0     0       5    0.71       0
Netherlands         0     0     6     0     0     0     0       6    0.86       0
Norway              0     0    67     0     0     0     0      69    9.86       0
Russia              0     0     0     6     6     0     0      12    1.71       0
Serbia              0     0     0     0    13     0     0      19    2.17       0
Slovakia            0     7     0     0     0     0     0       7    1.00       0
Switzerland         0     0     0     0     4     0     4       8    1.14       0
United Kingdom      0    12     0     0     0     0     0      12    1.71       0
United States      38    12    18    66    16    12    37     199   28.40      18
TABLE 2.12
Number of mass shootings per year, 2009–2015

Country          2009  2010  2011  2012  2013  2014  2015   Total   Mean  Median
Albania             0     0     0     0     0     1     0       1   0.14       0
Austria             0     0     0     0     1     0     0       1   0.14       0
Belgium             0     0     1     0     0     1     0       2   0.28       0
Czech Republic      0     0     0     0     0     0     1       1   0.14       0
Finland             1     0     0     0     0     0     0       1   0.14       0
France              0     0     0     1     0     0     5       6   0.86       0
Germany             1     0     0     0     0     0     0       1   0.14       0
Italy               0     0     0     0     0     0     1       1   0.14       0
Macedonia           0     0     0     1     0     0     0       1   0.14       0
Netherlands         0     0     1     0     0     0     0       1   0.14       0
Norway              0     0     1     0     0     0     0       1   0.14       0
Russia              0     0     0     1     1     0     0       2   0.28       0
Serbia              0     0     1     0     1     0     0       2   0.28       0
Slovakia            0     1     0     0     0     0     0       1   0.14       0
Switzerland         0     0     0     0     1     0     1       2   0.28       0
United Kingdom      0     1     0     0     0     0     0       1   0.14       0
Europe              2     2     4     4     4     1     8      25   3.57       3
United States       4     2     3     6     3     3     4      25   3.57       4
FIGURE 2.18 Frequency of number of fatalities per mass shooting, 2009–2015

The population of Europe is more than twice that of the United States – in 2019, there were approximately 740 million people in Europe and 330 million in the United States – and that ratio has been fairly consistent since 2009. Therefore, we can conclude that the proportion of mass shooters in the United States is more than twice as high as in Europe.

Going a step further, the description above did not account for the number of fatalities in each shooting. Figure 2.18 displays the frequencies with which each number of fatalities occurred. In Europe, the mean number of fatalities is 13.7 per event, while in the United States it is 8.0. Note, however, the three outlying values, representing 26 deaths in Newtown, Connecticut in 2012, 67 in Norway in 2011, and 130 in France in 2015. We have seen that the mean is affected by outlying values. To assess their impact, we might consider excluding the outliers and recalculating the means. If we do this, the means are 6.4 fatalities per event in Europe and 7.2 in the United States, both of which are much lower; furthermore, the mean for Europe is now smaller than the mean for the United States.

In summary, there are instances when a single mean does not provide an accurate representation of a complex situation, especially when comparisons are being made. To fully understand what is happening when comparing gun violence in the United States and Europe, the annual death rate from mass shootings per million people does not give the whole picture, especially when Europe is represented by a hand-picked sample of countries, chosen in a biased way so as to make a political point. We might convey a better understanding of the data by noting that the frequency of mass shootings was the same in Europe and the United States over the seven-year period from 2009 through 2015, with 25 mass shootings in each region, even though the population of Europe is more than twice as large. There were three mass shootings with exceptionally high numbers of fatalities, noted above, and excluding these, the mean numbers of deaths per event were 6.4 in Europe and 7.2 in the United States.
TABLE 2.13
Listing of categories of injury death for 100 children between the ages of 5 and 9 years, United States, 2017

1  5  3  1  2  4  1  3  1  5
2  1  1  5  3  1  2  1  4  1
4  1  3  1  5  1  2  1  1  2
5  1  1  5  1  5  3  1  2  1
2  3  1  1  2  1  5  1  5  1
1  2  5  1  1  1  3  4  1  1
1  1  2  1  1  2  1  1  2  3
3  3  1  5  2  3  5  1  3  4
1  1  2  4  5  4  1  5  1  5
5  1  1  5  1  1  5  1  1  5
Summary: Empirical Rule

Term             Definition
Empirical rule   When a distribution of values is symmetric and unimodal, approximately 67% of the observations lie between $\bar{x} \pm 1s$, about 95% lie between $\bar{x} \pm 2s$, and almost all lie between $\bar{x} \pm 3s$.

2.6 Further Applications
Suppose that we wish to reduce the number of childhood deaths caused by injury. We first need to understand the nature of the problem. Displayed in Table 2.13 is a set of observations indicating the causes of death for the first 100 out of 712 children between the ages of 5 and 9 who died as a result of selected types of injury in the United States in 2017 [58]. The data are nominal; 1 represents a motor vehicle accident, 2 a drowning, 3 a death due to fire or burn, 4 a firearm homicide, and 5 a suffocation. Given these observations, what are we able to conclude about childhood injury deaths? When individual causes of death are listed out, it is extremely difficult to make any type of statement about these data, even with only 100 values instead of all 712. If we wish to summarize the observations, however, we could begin by constructing a frequency distribution. For nominal and ordinal data, a frequency distribution is a table made up of a list of categories or classes, along with the numerical counts which correspond to each one. To construct a frequency distribution for the set of data shown above, we would begin by listing the various causes of death; we would then count the number of children who died as a result of each of these causes. The observations are displayed in frequency distribution format in Table 2.14. Using this table, we are able to see that 384 out of the 712 injury deaths were the result of motor vehicle accidents, 147 were caused by drowning, 78 by fires or burns, 68 by firearm homicides, and 35 by suffocation.
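Frequency distributions like Table 2.14 are produced in R with table(), and prop.table() converts the counts to relative frequencies. The sketch below applies both to the 100 codes listed in Table 2.13.

# Cause-of-death codes from Table 2.13
# (1 = motor vehicle, 2 = drowning, 3 = fire/burn,
#  4 = firearm homicide, 5 = suffocation)
causes <- c(1,5,3,1,2,4,1,3,1,5, 2,1,1,5,3,1,2,1,4,1,
            4,1,3,1,5,1,2,1,1,2, 5,1,1,5,1,5,3,1,2,1,
            2,3,1,1,2,1,5,1,5,1, 1,2,5,1,1,1,3,4,1,1,
            1,1,2,1,1,2,1,1,2,3, 3,3,1,5,2,3,5,1,3,4,
            1,1,2,4,5,4,1,5,1,5, 5,1,1,5,1,1,5,1,1,5)
table(causes)                              # absolute frequencies
round(100 * prop.table(table(causes)), 1)  # relative frequencies (%)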
TABLE 2.14
Injury deaths of 712 children between the ages of 5 and 9 years, United States, 2017

Cause                     Number of Deaths
Motor vehicle accident    384
Drowning                  147
Fire / burn               78
Firearm homicide          68
Suffocation               35
Total                     712
Like nominal and ordinal data, discrete and continuous measurements can also be displayed in the form of a frequency distribution. To do this, the range of values must be subdivided into a series of distinct, non-overlapping intervals. The numbers of observations that fall within each pair of limits are then counted and arranged in a table. Suppose we are interested in studying the consequences of low birth weight among newborns in the United States. To put the magnitude of the problem into context, we first examine the distribution of birth weights for all infants born in 2016 [59]. We separate these observations into intervals of equal width; the corresponding frequencies are displayed in Table 2.15. This table provides us with more information about the distribution of birth weights than would a list of 3,974,876 measurements. We can see that most of the observations lie between 2000 and 4499 grams; relatively few measurements fall outside this range. The intervals 3000–3499 and 3500–3999 grams contain the largest numbers of values.

After looking at the actual counts, we might also be interested in determining the relative frequency associated with each interval in the table. The relative frequency is the percentage of the total number of observations that lie within an interval. The relative frequencies for the birth weights displayed in Table 2.15 – which we compute by dividing the number of values in the interval by the total number of measurements in the table and multiplying by 100 – are shown in Table 2.16. The table indicates that 38.9 + 26.7 = 65.6% of the birth weights are between 3000 and 3999 grams, and 5.1 + 18.4 + 38.9 + 26.7 + 6.9 = 96.0% are between 2000 and 4499 grams. Only 2.9% of the children born in 2016 weighed less than 2000 grams.

In addition to tables, we can also use graphs to summarize and display a set of data. For example, we could illustrate the nominal data in Table 2.14 using the bar chart in Figure 2.19. The categories into which the observations fall are placed along the horizontal axis; the vertical bars represent the frequency of observations in each class. The graph emphasizes that a large proportion of childhood injury deaths are the result of motor vehicle accidents.

Of the various graphical displays that can be used for discrete or continuous data, the histogram is perhaps the most common. Like a bar chart, a histogram is a pictorial representation of a frequency distribution. The horizontal axis displays the true limits of the intervals into which the observations fall; the vertical axis depicts the frequency or relative frequency of observations within each interval. As an example, Figure 2.20 is a histogram of the birth weight data summarized in Table 2.16. Looking at the graph, we can see that the data are slightly skewed to the left.

A box plot is another type of graph often used for discrete or continuous data. The plot displays a summary of the observations using a single vertical or horizontal axis. Suppose that we are interested in summarizing health care expenditures for the 36 nations which make up the Organization for Economic Cooperation and Development (oecd). These expenditures are displayed as a percentage of GDP in Figure 2.21, from a low of 4.2% in Turkey to a high of 17.1% in the United States [60].
TABLE 2.15
Absolute frequencies of birth weights for 3,974,876 infants born in the United States, 2016

Birth Weight (grams)    Number of Infants
0–499                   5863
500–999                 20,689
1000–1499               29,040
1500–1999               62,862
2000–2499               202,415
2500–2999               729,673
3000–3499               1,544,024
3500–3999               1,062,456
4000–4499               274,404
4500–4999               38,796
5000–5500               4654
Total                   3,974,876
TABLE 2.16
Relative frequencies of birth weights for 3,974,876 infants born in the United States, 2016

Birth Weight (grams)    Relative Frequency (%)
0–499                   0.1
500–999                 0.5
1000–1499               0.7
1500–1999               1.6
2000–2499               5.1
2500–2999               18.4
3000–3499               38.9
3500–3999               26.7
4000–4499               6.9
4500–4999               1.0
5000–5500               0.1
Total                   100.0
FIGURE 2.19 Injury deaths of 712 children between the ages of 5 and 9 years, United States, 2017
FIGURE 2.20 Relative frequencies of birth weights for 3,974,876 infants born in the United States, 2016
FIGURE 2.21 Health care expenditure as a percentage of GDP for 36 oecd nations, 2017

The three horizontal lines which make up the central box indicate that the 25th, 50th, and 75th percentiles of the data are 7.1%, 8.9%, and 10.3%, respectively. The height of the box is the distance between the 25th and 75th percentiles. The lines extending from either side of the central box mark the most extreme observations which are not more than 1.5 times the height of the box beyond either quartile, or the adjacent values. In Figure 2.21, the adjacent values are 4.2% and 12.3%. The United States is an outlier, with health care expenditures which are not typical of the rest of the oecd nations.

A line graph is one type of display that can be used to illustrate the relationship between two continuous measurements. Each point on the line represents a pair of values; the line itself allows us to trace the change in the quantity on the vertical axis which corresponds to a change along the horizontal axis. For example, Figure 2.22 depicts information about cigarette consumption over time in the United States [61].

FIGURE 2.22 Cigarette consumption per person 18 years of age or older, United States, 1900–2012

Numerical summary measures are single numbers which quantify important characteristics of a distribution of values. In a study investigating the causes of death among individuals with severe asthma, data were recorded for ten patients who arrived at the hospital in a state of respiratory arrest; breathing had stopped, and the subjects were unconscious. Table 2.17 lists the heart rates of the ten patients upon admission to the hospital [62]. How might we characterize or describe this set of observations?

To begin, we might be interested in finding a typical heart rate for the ten individuals. The most commonly used measure of central tendency is the mean. To find the mean of these data, we simply sum all the observations and divide by n = 10. Therefore, for the measurements in Table 2.17,

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{10}(167 + 150 + 125 + 120 + 150 + 150 + 40 + 136 + 120 + 150) = \frac{1308}{10} = 130.8 \text{ beats per minute}.$$
The mean heart rate upon admission to the hospital is 130.8 beats per minute. In this dataset, the heart rate of patient 7 is considerably lower than the heart rates of the other subjects. What would happen if this observation were removed from the group? In this case,

$$\bar{x} = \frac{1}{9}(167 + 150 + 125 + 120 + 150 + 150 + 136 + 120 + 150) = \frac{1268}{9} = 140.9 \text{ beats per minute}.$$
The mean has increased by approximately ten beats per minute; this change demonstrates how much influence a single unusual observation can have on the mean.

A second measure of central tendency is the median, or the 50th percentile of the set of values. Ranking the measurements from smallest to largest, we have:

40, 120, 120, 125, 136, 150, 150, 150, 150, 167

Since there is an even number of observations, the median is taken to be the average of the two middlemost values.
TABLE 2.17
Heart rates for ten asthmatic patients in a state of respiratory arrest

Patient    Heart Rate (beats per minute)
1          167
2          150
3          125
4          120
5          150
6          150
7          40
8          136
9          120
10         150
In this case, these values are the 10/2 = 5th and the (10/2) + 1 = 6th largest observations. Consequently, the median of the data is (136 + 150)/2 = 143 beats per minute, a number quite a bit larger than the mean. Five observations are smaller than the median, and five are larger. The calculation of the median takes into account the ordering and the relative magnitude of the observations. If we were to again remove patient 7, the ranking of heart rates would be:

120, 120, 125, 136, 150, 150, 150, 150, 167

There are nine observations in the list; the median is the [(9 + 1)/2] = 5th largest measurement, or 150 beats per minute. Although the median does increase somewhat when patient 7 is removed, it does not change as much as the mean did; the median is more robust than the mean.

Once we have found the center of the data, we often want to determine the amount of variability among the observations as well; this allows us to quantify the degree to which the “typical value” is representative of the group as a whole. One measure of dispersion is the range. The range of the data is the difference between the largest and smallest measurements. For the heart rates in Table 2.17, the range is 167 − 40 = 127 beats per minute. Since the range considers only the most extreme observations in a data set, it is highly sensitive to outliers. If we were to remove patient 7 from the group, the range of the data would be only 167 − 120 = 47 beats per minute.

The interquartile range of a set of data is defined as the 75th percentile minus the 25th percentile. If we were to construct a box plot using the data in Table 2.17 – as is shown in Figure 2.23 – the interquartile range would be the height of the central box. (Note that for this particular set of measurements, the lower adjacent value is equal to the 25th percentile.) To find the 25th percentile of the data, we note that nk/100 = 10(25)/100 = 2.5 is not an integer. Therefore, the 25th percentile is the 2 + 1 = 3rd largest measurement, or 120 beats per minute. Similarly, 10(75)/100 = 7.5 is not an integer, and the 75th percentile is the 7 + 1 = 8th largest measurement, or 150 beats per minute. Subtracting these two values, the interquartile range for the heart rate data is 150 − 120 = 30 beats per minute; this is the range of the middle 50% of the observations. The range or interquartile range is often used with the median to describe a distribution of values.

The most commonly used measures of dispersion for a set of data values are the variance and the standard deviation. The variance quantifies the amount of variability around the mean of the data; it is calculated by subtracting the mean from each of the measurements, squaring these deviations, summing them, and dividing by the total number of observations minus 1.
FIGURE 2.23 Heart rates for ten asthmatic patients in a state of respiratory arrest

The variance of the heart rates in Table 2.17 is

$$s^2 = \frac{1}{10 - 1} \sum_{i=1}^{10} (x_i - 130.8)^2 = \frac{1}{9}\left[(36.2)^2 + (19.2)^2 + (-5.8)^2 + (-10.8)^2 + (19.2)^2 + (19.2)^2 + (-90.8)^2 + (5.2)^2 + (-10.8)^2 + (19.2)^2\right] = \frac{11323.6}{9} = 1258.2 \text{ (beats per minute)}^2 .$$

The standard deviation is the positive square root of the variance; it is used more frequently in practice because it has the same units of measurement as the mean. For the ten measures of heart rate, the standard deviation is

$$s = \sqrt{1258.2 \text{ (beats per minute)}^2} = 35.5 \text{ beats per minute}.$$

The standard deviation is typically used with the mean to describe a set of values.

Over the years, the use of computers in statistics has increased dramatically. As a result, many formerly time-consuming calculations can now be performed much more efficiently using a statistical package. A statistical package is a series of programs that have been designed to analyze numerical data. A variety of packages are available; in general, they differ with respect to the commands that they use and the format of the output they produce.
TABLE 2.18
Stata output displaying numerical summary measures for heart rate

                Heart rate (beats per minute)
-------------------------------------------------------------
      Percentiles      Smallest
 1%         40              40
 5%         40             120
10%         80             120        Obs                 10
25%        120             125        Sum of Wgt.         10

50%        143                        Mean             130.8
                       Largest        Std. Dev.      35.4708
75%        150             150
90%      158.5             150        Variance      1258.178
95%        167             150        Skewness     -1.772591
99%        167             167        Kurtosis      5.479789
TABLE 2.19
R output displaying numerical summary measures for heart rate

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   40.0   121.2   143.0   130.8   150.0   167.0
One statistical package that is both powerful and relatively easy to use is called Stata. Stata is an interactive program that helps us manage, display, and analyze data. Observations or measurements are saved in columns; each column is assigned a variable name. We then use these names to execute specific analytical procedures or commands. Another powerful tool for statistical analysis is the programming language R. R is open source and freely available to download and use on your own machine. We recommend using the integrated development environment (IDE) RStudio for programming in R, as it is much easier to organize your work using this platform. To illustrate the use of these packages, instead of calculating numerical summary measures for the heart rates in Table 2.17 by hand, we could have used a computer to do the calculations for us. In practice, analysts use computers to perform virtually all calculations. Table 2.18 shows the relevant output from Stata. Selected percentiles of the data are displayed on the left-hand side of the table. Using these values, we can determine the median and the interquartile range. The middle column contains the four smallest and four largest measurements; the minimum and maximum allow us to calculate the range. The information on the right-hand side of the table includes the number of observations, the mean of the data, the standard deviation, and the variance. Table 2.19 shows numerical summary measures for the same set of data computed in R.
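For example, the output in Table 2.19 comes from applying summary() to the heart rate measurements; sd() and var() supply the dispersion measures that appear in the Stata output.

heart_rate <- c(167, 150, 125, 120, 150, 150, 40, 136, 120, 150)
summary(heart_rate)  # min, quartiles, median, mean, max (Table 2.19)
sd(heart_rate)       # 35.4708 beats per minute
var(heart_rate)      # 1258.178 (beats per minute)^2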
2.7 Review Exercises

1. What are descriptive statistics?

2. How do ordinal data differ from nominal data?

3. How do continuous data differ from discrete data?

4. What are the advantages and disadvantages of transforming continuous measurements into ordinal or dichotomous measurements?

5. When constructing a table, when might it be beneficial to use relative rather than absolute frequencies?

6. What types of graphs can be used to display nominal or ordinal observations? What types of graphs can be used for discrete or continuous observations?

7. What are the percentiles of a set of data?

8. Define and compare the mean, median, and mode as measures of central tendency.

9. Under what conditions is the median preferred as a measure of central tendency, rather than the mean?
10. Define and compare the range, the interquartile range, and the standard deviation as measures of variability or dispersion.

11. For each of the following measurements, identify the type of numerical data as nominal, ordinal, discrete, or continuous.
(a) The number of suicides in the United States in a specified year
(b) Response to treatment defined as no response, minor improvement, major improvement, or complete recovery
(c) The concentration of lead in a sample of water
(d) Political party affiliation
(e) Presence or absence of hepatitis C
(f) The length of time that a cancer patient survives after diagnosis
(g) The number of previous miscarriages an expectant mother has had
(h) Satisfaction with care received during a hospital admission, defined as poor, fair, good, very good, or excellent
(i) The age of a child undergoing tonsillectomy

12. The table below categorizes 10,614,000 office visits to cardiovascular disease specialists in the United States by the duration of each visit [63]. A duration of 0 minutes implies that the patient did not have face-to-face contact with the specialist.
Duration (minutes)    Number of Visits (thousands)
0                     390
1–5                   227
6–10                  1023
11–15                 3390
16–30                 4431
31–60                 968
≥61                   185
Total                 10,614
The statement is made that office visits to cardiovascular disease specialists are most often between 16 and 30 minutes long. Do you agree with this statement? Why or why not?

13. The frequency distribution below displays the numbers of cases of pediatric hiv/aids reported in the United States between 1983 and 1989 [64]. Construct a bar chart showing the number of cases by year. What does the graph tell you about pediatric hiv/aids in this time period?

Year    Number of Cases
1983    122
1984    250
1985    455
1986    848
1987    1412
1988    2811
1989    3098
14. Listed below are the numbers of people who were executed in the United States in each year since the 1976 Supreme Court decision allowing the death penalty to be carried out, up through 1994 [65].

Year    Number of Executions    Year    Number of Executions
1976    0                       1986    18
1977    1                       1987    25
1978    0                       1988    11
1979    2                       1989    16
1980    0                       1990    23
1981    1                       1991    14
1982    2                       1992    31
1983    5                       1993    38
1984    21                      1994    28
1985    18
Use these data to construct a bar chart of number of executions by year. How did the number of executions vary from 1976 through 1994?

15. In an investigation of the risk factors for cardiovascular disease, levels of serum cotinine – a metabolic product of nicotine – were recorded for a group of smokers and a group of nonsmokers [66]. The relevant frequency distributions are displayed below.

Cotinine Level (ng/ml)    Smokers    Nonsmokers
0–13                      78         3300
14–49                     133        72
50–99                     142        23
100–149                   206        15
150–199                   197        7
200–249                   220        8
250–299                   151        9
300+                      412        11
Total                     1539       3445
(a) What type of numerical data is serum cotinine level?
(b) Is it fair to compare the distributions of cotinine levels for smokers and nonsmokers based on the absolute frequencies in each interval? Why or why not?
(c) Compute the relative frequencies of serum cotinine level readings for smokers and for nonsmokers.
(d) Using the relative frequencies, construct a pair of frequency polygons.
(e) Describe the shape of each polygon. What can you say about the distribution of recorded cotinine levels in each group?
(f) For all individuals in this study, smoking status is self-reported. Do you think that any of the subjects might be misclassified? Why or why not?

16. The relative frequencies of blood lead concentrations for two groups of workers in Canada – one studied in 1979 and the other in 1987 – are displayed below [67].

Blood Lead (µg/dl)

0.5. If p = 0.5, as is the case in Figure 7.4, the probability distribution is symmetric. As shown in Figure 7.5, the binomial distribution is highly skewed for values of p near 0 or 1, and more symmetric as p gets closer to 0.5.
FIGURE 7.2 Probability distribution of a binomial random variable with parameters n = 10 and p = 0.14
FIGURE 7.3 Probability distribution of a binomial random variable with parameters n = 10 and p = 0.86
FIGURE 7.4 Probability distribution of a binomial random variable with parameters n = 10 and p = 0.50
FIGURE 7.5 Probability distributions of four binomial random variables with n = 10 and specified values of p
Summary: Binomial Distribution

Term                                   Notation
Number of trials                       n
Probability of success at each trial   p
Binomial random variable               X is Binomial(n, p)
Probability distribution               $P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$
Mean of X                              np
Variance of X                          np(1 − p)
Standard deviation of X                $\sqrt{np(1-p)}$
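In R, binomial probabilities such as those plotted in Figures 7.2 through 7.4 are available through dbinom(); a brief sketch for the distribution with n = 10 and p = 0.14 follows.

# P(X = x) for x = 0, 1, ..., 10 when X ~ Binomial(n = 10, p = 0.14)
round(dbinom(0:10, size = 10, prob = 0.14), 3)
10 * 0.14               # mean, np = 1.4
10 * 0.14 * (1 - 0.14)  # variance, np(1 - p) = 1.204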
7.3 Poisson Distribution

Suppose that X is a random variable representing the number of individuals involved in a motor vehicle accident each year. In the United States, the probability that a particular individual is involved is 0.00024 [160]. Technically, this is a binomial situation in which there are two possible outcomes for each person: accident or no accident. Note, however, that n is very large; we are interested in the entire United States population. When n is very large and p is very small, the binomial distribution is well approximated by another theoretical probability distribution called the Poisson distribution. The Poisson distribution is used to model discrete events that occur infrequently in time or space; it is sometimes called the distribution of rare events.

Consider a random variable X that represents the number of occurrences of some event of interest over a given interval. Since X is a count, it is theoretically able to assume any integer value between 0 and infinity. Let λ (the Greek letter lambda) be a constant that denotes the average number of occurrences of the event in an interval. If the probability that X assumes the value x is

$$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!},$$

then X is said to have a Poisson distribution with parameter λ. The symbol e represents a constant that is approximated by 2.71828; e is the base of the natural logarithms. Like the binomial distribution, the Poisson distribution involves a set of underlying assumptions:

1. The probability that a single event occurs within an interval is proportional to the length of the interval.
2. Within a single interval, an infinite number of occurrences of the event are theoretically possible. We are not restricted to a fixed number of trials.
3. The events occur independently both within the same interval and between consecutive intervals.

The Poisson distribution can be used to model the number of ambulances needed in a city in a given night, the number of particles emitted from a specified amount of radioactive material, or the number of bacterial colonies growing in a Petri dish.
ISTUDY
169
Theoretical Probability Distributions
Recall that the mean of a binomial random variable is equal to np and that its variance is np(1−p). When p is very small, 1 − p is close to 1 and np(1 − p) is approximately equal to np. In this case, the mean and the variance of the distribution are identical and may be represented by the single parameter λ. The property that the mean is equal to the variance is an identifying characteristic of the Poisson distribution. Suppose we are interested in determining the number of people in a population of 10,000 who will be involved in a motor vehicle accident each year. The mean number of persons involved is λ
=
np
=
(10, 000)(0.00024)
=
2.4;
this is also the variance. The probability that no one in this population will be involved in an accident in a given year is e−2.4 (2.4) 0 = 0.091. P(X = 0) = 0! The probability that exactly one person will be involved is P(X = 1)
=
e−2.4 (2.4) 1 1!
=
0.218.
P(X = 2)
=
e−2.4 (2.4) 2 2!
=
0.261,
P(X = 3)
=
e−2.4 (2.4) 3 3!
=
0.209,
P(X = 4)
=
e−2.4 (2.4) 4 4!
=
0.125,
P(X = 5)
=
e−2.4 (2.4) 5 5!
=
0.060,
P(X = 6)
=
e−2.4 (2.4) 6 6!
=
0.024.
Similarly,
and
Since the outcomes of X are mutually exclusive and exhaustive, P(X ≥ 7)
= 1 − P(X < 7) = 1 − (0.091 + 0.218 + 0.261 + 0.209 + 0.125 + 0.060 + 0.024) = 0.012.
Instead of performing the calculations by hand – or using a statistical software package – Table A.2 in the Statistical Tables can be used to obtain Poisson probabilities for selected values of λ. The number of successes x appears in the first column on the left-hand side of the table; λ is in the row across the top. For specified values of x and λ, the entry in the table represents P(X = x). In
ISTUDY
170
Principles of Biostatistics
a population of 10,000 people, what is the probability that exactly three of them will be involved in a motor vehicle accident in a given year? We begin by locating x = 3 in the first column of Table A.2. Rounding 2.4 up to 2.5, we find the column corresponding to λ = 2.5. The table tells us that we can approximate the probability that exactly three individuals are involved in an accident by 0.214. (This result differs from 0.209, the probability calculated above, because it was necessary to round the value of λ in order to use the table.) Figure 7.6 is a graph of the probability distribution of X, the number of individuals in the population involved in a motor vehicle accident each year. The area represented by the vertical bars sums to 1. As shown in Figure 7.7, the Poisson distribution is highly skewed for small values of λ. As λ increases, the distribution becomes more symmetric. Summary: Poisson Distribution Term
Notation
Mean number of occurrences of event in interval
λ
Poisson random variable
X is Poisson(λ)
Probability distribution
P(X = x) =
Mean of X
λ
Variance of X
λ √
Standard deviation of X
7.4
e−λ λ x x!
λ
Normal Distribution
When it follows either a binomial or a Poisson distribution, a random variable X is restricted to taking on integer values only. Under different circumstances, however, the outcomes of a random variable may not be limited to integers or counts. Suppose that X represents height. Rarely is an individual exactly 67 inches tall or exactly 68 inches tall; theoretically, X can assume an infinite number of intermediate values, such as 67.04 inches or 67.8352 inches. In fact, between any two possible outcomes of X we can always find a third value. Although we could argue philosophically that we are only able to measure discrete outcomes due to the limitations of our measuring instruments – perhaps we can measure height only to the nearest tenth of an inch – treating such a variable as if it were continuous allows us to take advantage of powerful mathematical results. As we have seen, the probability distribution of a discrete random variable is represented by an equation for P(X = x), the probability that the random variable X will take on the specific value x. For a binomial random variable with parameters n and p, ! n x P(X = x) = p (1 − p) n−x . x These probabilities may be plotted against x, as in Figure 7.4. Suppose that the number of possible outcomes of X were to become very large and the widths of the corresponding intervals very small.
ISTUDY
Theoretical Probability Distributions
FIGURE 7.6 Probability distribution of a Poisson random variable with parameter λ = 2.4
FIGURE 7.7 Probability distributions of Poisson random variables with various values of λ
171
ISTUDY
172
Principles of Biostatistics
FIGURE 7.8 Probability distribution of a binomial random variable with parameters n = 30 and p = 0.50 In Figure 7.8, for example, n = 30 and p = 0.50. In general, if the number of possible values of X approaches infinity while the widths of the intervals approach zero, the graph will increasingly resemble a smooth curve. A smooth curve is used to represent the probability distribution of a continuous random variable; the curve is called a probability density. For any graph that illustrates a discrete probability distribution, the area represented by the vertical bars sums to 1. For a probability density, the total area beneath the curve must also be 1. Because a continuous random variable X can take on an infinite number of possible values, the probability associated with any particular one of them, such as a height of 67.00000... inches, is effectively equal to 0. However, the probability that X will assume some value in the interval enclosed by the outcomes x 1 and x 2 – such as a height between 67 and 68 inches – is not equal to 0; it is the area beneath the curve that lies between these two values. The most common continuous distribution is the normal distribution, also known as the Gaussian distribution, or the bell-shaped curve. Its shape is that of a binomial distribution for which p is constant but n approaches infinity, or a Poisson distribution where λ approaches infinity. Its probability density is given by the equation f (x)
=
√
1 2π σ
1
e− 2 (
x−µ 2 σ )
where −∞ < x < ∞. The symbol π (pi) represents a constant approximated by 3.14159. The e is the same constant we previously encountered in the formula for a Poisson probability, the base of the natural logarithms. The normal curve is unimodal and symmetric about its mean µ (the Greek letter mu); in this special case, the mean, median, and mode of the distribution are all identical. The standard deviation, represented by σ (sigma), specifies the amount of variability around the mean. Together, the two parameters µ and σ completely define a normal curve. If we know these two values, we know exactly what the probability density looks like.
ISTUDY
173
Theoretical Probability Distributions
The value of the normal distribution will become more apparent when we begin to work with the sampling distribution of the mean. For now, however, it is important to note that many random variables of interest – including blood pressure, serum cholesterol level, height, weight, and body temperature – are approximately normally distributed. The normal curve may therefore be used to estimate probabilities associated with these variables. For example, in a population in which serum cholesterol level is normally distributed with mean µ and standard deviation σ, we might wish to find the probability that a randomly chosen individual has a serum cholesterol level greater than 250 mg/100 ml to help us to plan for future cardiac services. Since the total area beneath the normal curve is equal to 1, we can estimate this probability by determining the proportion of the area under the curve that lies to the right of the point x = 250, or P(X > 250). This may be done using a computer program, or a table of areas calculated for the normal curve. Since a normal distribution could have an infinite number of possible values for its mean and standard deviation, it is impossible to tabulate the area associated with each and every normal curve. Instead, only a single curve is tabulated – the special case for which µ = 0 and σ = 1. This curve is known as the standard normal distribution. Figure 7.9 illustrates the standard normal distribution, and Table A.3 in the Statistical Tables displays the areas in the upper tail of this distribution. Outcomes of the random variable Z are denoted by lowercase z; the whole number and tenths decimal place of z are listed in the column on the left of the table, and the hundredths decimal place is shown in the row across the top. For a particular value of z, the entry in the body of the table specifies the area beneath the curve to the right of z, or P(Z > z). Some sample values of z and their corresponding areas are as follows: z 0.00 1.65 1.96 2.58 3.00
Area in Right Tail 0.500 0.049 0.025 0.005 0.001
Since the standard normal distribution is symmetric about z = 0, the area under the curve to the right of z is equal to the area to the left of −z. −z −0.00 −1.65 −1.96 −2.58 −3.00
Area in Left Tail 0.500 0.049 0.025 0.005 0.001
Suppose we wish to know the area under the standard normal curve that lies between z = −1.00 and z = 1.00; since µ = 0 and σ = 1, this is the area contained in the interval µ ± 1σ, illustrated in Figure 7.10. Equivalently, this area is P(−1 ≤ Z ≤ 1). We can use Table A.3 to determine this probability. First we see that the area to the right of z = 1.00 is P(Z > 1) = 0.159. Therefore, because of the symmetry of the standard normal distribution, the area to the left of z = −1.00 must be 0.159 as well, so P(Z < −1) = 0.159. Furthermore, we know that the events that Z > 1 and Z < −1 are mutually exclusive. Consequently, applying the additive rule of probability, the sum of the area to the right of 1 and to the left of −1 is P(Z > 1) + P(Z < −1)
= 0.159 + 0.159 = 0.318.
ISTUDY
174
FIGURE 7.9 The standard normal distribution with parameters µ = 0 and σ = 1
FIGURE 7.10 The standard normal curve, area between z = −1.00 and z = 1.00
Principles of Biostatistics
ISTUDY
175
Theoretical Probability Distributions
FIGURE 7.11 The standard normal curve, area between z = −2.00 and z = 2.00 Since the total area under the curve is equal to 1, the area between −1 and 1 must be P(−1 ≤ Z ≤ 1)
= 1 − [P(Z > 1) + P(Z < −1)] = 1 − 0.318 = 0.682
Therefore, for the standard normal distribution, approximately 68.2% of the area beneath the curve lies within ±1 standard deviation from the mean. Similarly, we might also wish to calculate the area under the standard normal curve that is contained in the interval µ ± 2σ, or P(−2 ≤ Z ≤ 2). This area is illustrated in Figure 7.11. Table A.3 indicates that the area to the right of z = 2.00 is 0.023; the area to the left of z = −2.00 is 0.023 as well. Therefore, the area between −2.00 and 2.00 must be P(−2 ≤ Z ≤ 2)
= 1 − [P(Z > 2) + P(Z < −2)] = 1.000 − [0.023 + 0.023] = 0.954.
Approximately 95.4% of the area under the standard normal curve lies within ±2 standard deviations from the mean. The two previous calculations form the basis of the empirical rule described in Section 2.5, which stated that if a distribution of values is symmetric and unimodal, then approximately 67% of the observations lie within one standard deviation of the mean, and about 95% lie within 2 standard deviations. Table A.3 can also be used the other way around. For example, we might wish to find the value of z that cuts off the upper 10% of the standard normal distribution, or the value of z for which P(Z > z) = 0.10. Locating 0.100 in the body of the table, we observe that the corresponding value of z is 1.28. Therefore, 10% of the area under the standard normal curve lies to the right of z = 1.28; this area is illustrated in Figure 7.12. Similarly, another 10% of the area lies to the left of z = −1.28.
ISTUDY
176
Principles of Biostatistics
FIGURE 7.12 The standard normal curve, area to the right of z = 1.28 Now suppose that X is a normal random variable with mean µ = 2 and standard deviation σ = 0.5. Subtracting 2 from X would give us a normal random variable that has mean 0; as shown in Figure 7.13, the whole distribution would be shifted two units to the left. Dividing (X − 2) by 0.5 then alters the spread or variability of the distribution so that we have a normal random variable with standard deviation 1. Therefore, if X is a normal random variable with mean 2 and standard deviation 0.5, then Z
=
X −2 0.5
is a standard normal random variable with mean 0 and standard deviation 1. In general, for any arbitrary normal random variable with mean µ and standard deviation σ, Z
=
X−µ σ
has a standard normal distribution. By transforming X into Z, we can use a table of areas computed for the standard normal curve to estimate probabilities associated with X. An outcome of the random variable Z, denoted z, is called a standard normal deviate or a z-score. Let X be a random variable that represents systolic blood pressure. For the population of 18- to 74-year-old males in the United States, systolic blood pressure is approximately normally distributed with mean 129 mm Hg and standard deviation 19.8 mm Hg [161]. This distribution is shown in Figure 7.14. Note that Z
=
X − 129 19.8
is normally distributed with mean 0 and standard deviation 1.
ISTUDY
Theoretical Probability Distributions
177
FIGURE 7.13 Transforming a normal curve with mean 2 and standard deviation 0.5 into the standard normal curve
FIGURE 7.14 Distribution of systolic blood pressure for males 18 to 74 years of age, United States
ISTUDY
178
Principles of Biostatistics
Suppose we wish to find the value of x that cuts off the upper 2.5% of the curve of systolic blood pressures, or equivalently, the value of x for which P(X > x) = 0.025. Using Table A.3, we see that the area to the right of z = 1.96 is 0.025. To obtain the value of x that corresponds to this value of z, we solve the equation z
= 1.96 =
or x
=
x − 129 19.8
129 + (1.96)(19.8)
=
167.8.
Therefore, approximately 2.5% of the males in this population have systolic blood pressures that are greater than 167.8 mm Hg, while 97.5% have blood pressures less than 167.8 mm Hg. In other words, if we randomly select an individual from this adult male population, the probability that his systolic blood pressure is greater than 167.8 mm Hg is 0.025. Because the standard normal curve is symmetric around z = 0, we know that the area to the left of z = −1.96 is also 0.025. By solving the equation z
= −1.96 =
or x
=
x − 129 19.8
129 + (−1.96)(19.8)
=
90.2,
we find that 2.5% of the males have a systolic blood pressure that is less than 90.2 mm Hg. Equivalently, the probability that a randomly selected male has a systolic blood pressure less than 90.2 mm Hg is 0.025. Since 2.5% of the men in the population have systolic blood pressures greater than 167.8 mm Hg and 2.5% have values less than 90.2 mm Hg, the remaining 95% of the males must have systolic blood pressure readings that lie between 90.2 and 167.8 mm Hg. We might also be interested in determining the proportion of males in the population who have systolic blood pressures greater than 150 mm Hg. In this case, we are given the outcome of the random variable X and must solve for the z-score: z
=
150 − 129 19.8
=
1.06.
The z-score tells us that value 150 lies 1.06 standard deviations above the population mean of 129 mm Hg. The area to the right of z = 1.06 is 0.145. Therefore, approximately 14.5% of the males in this population have a systolic blood pressure reading that is greater than 150 mm Hg. Now consider the more complicated situation in which we have two normally distributed random variables. In an Australian national study of risk factor prevalence, two of the populations investigated are males whose blood pressures are within a normal or accepted range and who are not taking any corrective medication, and males who have had high blood pressure but who are presently undergoing antihypertensive drug therapy [162]. For the population of males who are not taking medication, diastolic blood pressure is approximately normally distributed with mean µ n = 80.7 mm Hg and standard deviation σ n = 9.2 mm Hg. For those who are using antihypertensive drugs, diastolic blood pressure is also approximately normally distributed, but with mean µa = 94.9 mm Hg and standard deviation σa = 11.5 mm Hg. These two distributions are pictured in Figure 7.15. Our goal is to be able to determine whether an individual has normal blood pressure or whether he is taking antihypertensive medication solely on the basis of his diastolic blood pressure reading. This exercise provides us with a foundation for hypothesis testing, which we will discuss in Chapter 10.
ISTUDY
179
Theoretical Probability Distributions
FIGURE 7.15 Distributions of diastolic blood pressure for two populations of Australian males The first thing to notice is that, because of the large amount of overlap between the two normal curves, it will be difficult to distinguish between them. Nevertheless, we will proceed; if our goal is to identify 90% of the individuals who are currently taking medication, what value of diastolic blood pressure should be designated as the lower cutoff point? Equivalently, we must find the value of diastolic blood pressure that marks off the lower 10% of this distribution. Looking at Table A.3, we find that z = −1.28 cuts off an area of 0.10 in the lower tail of the standard normal curve. Therefore, solving for x, z
= −1.28 =
and x
=
x − 94.9 11.5
94.9 + (−1.28)(11.5)
=
80.2.
Approximately 90% of the males taking antihypertensive drugs have diastolic blood pressures that are greater than 80.2 mm Hg. If we use this value as our cutoff point, then the other 10% of the population – those with readings below 80.2 mm Hg – represent false negative results; they are individuals currently using medication who are not identified as such. What proportion of the males with normal blood pressures will be incorrectly labeled as antihypertensive drug users? These are the males in the population not taking medication who have diastolic blood pressure readings greater than 80.2 mm Hg. Solving for z (notice that we are now using the mean and standard deviation of the population who are not taking corrective medication), 80.2 − 80.7 = − 0.05. 9.2 An area of 0.480 lies to the left of −0.05; therefore, the area to the right of z = −0.05 must be z
=
1.000 − 0.480 = 0.520.
ISTUDY
180
Principles of Biostatistics
Approximately 52.0% of the males with normal blood pressures would be incorrectly labeled as using medication. Note that these errors are false positive results. To reduce the large proportion of false positive errors, 52.0%, the cut point for identifying individuals who are currently using antihypertensive drugs could be raised. If the cut point were 90 mm Hg, for example, then z
90 − 80.7 9.2
=
=
1.01,
and only 15.6% of the males with normal blood pressures would be incorrectly classified as taking medication. When the cutoff is raised, however, the proportion of males correctly labeled as using antihypertensive medication decreases. Note that z
=
90 − 94.9 11.5
=
− 0.43.
The area to the left of z = −0.43 is 0.334, and 1.000 − 0.334 = 0.666; therefore, only 66.6% of the males using antihypertensive drugs would be identified. The remaining 33.4% of these individuals would be false negatives. A trade-off always exists when we try to manipulate proportions of false negative and false positive results. This is the same phenomenon that was observed when we were investigating the sensitivity and specificity of a diagnostic test in Chapter 6. In general, a smaller proportion of false positive errors can be achieved only by increasing the probability of a false negative outcome, and the proportion of false negatives can be reduced only by raising the probability of a false positive. The relationship between these two types of errors in a specific application is determined by the amount of overlap in the two normal populations being studied.
Summary: Normal Distribution Term
Notation
Normal random variable
X is N(µ, σ)
Probability distribution
f (x) = √
Mean of X
µ
Variance of X
σ2
Standard deviation of X
σ
Standard normal random variable
Z=
1 2π σ
1
e− 2 (
x−µ 2 σ )
X−µ is N(0, 1) σ
ISTUDY
181
Theoretical Probability Distributions
7.5
Further Applications
Suppose that we are interested in investigating the probability that a person who has been stuck with a needle infected with hepatitis B actually develops the disease. Let Y be a Bernoulli random variable that represents the disease status of an individual who has been exposed to an infected needle; Y takes the value 1 if the person develops hepatitis and 0 if he or she does not. These two outcomes are mutually exclusive and exhaustive. If 30% of the individuals who are exposed to hepatitis B become infected [163], then P(Y = 1) = p = 0.30, and
P(Y = 0)
=
1−p
=
1 − 0.30
=
0.70.
If we have n independent observations of a dichotomous random variable such that each observation has a constant probability of “success” p, then the total number of successes X follows a binomial distribution. The random variable X can assume any integer value between 0 and n; the probability that X takes on a particular value x may be expressed as ! n x P(X = x) = p (1 − p) n−x . x Suppose that we select five people from the population of individuals who have been stuck with a needle infected with hepatitis B. The number of people in this sample who develop the disease is a binomial random variable with parameters n = 5 and p = 0.30. Its probability distribution may be represented in the following way: ! 5 P(X = 0) = (0.30) 0 (0.70) 5−0 = (1)(1)(0.70) 5 = 0.168, 0
=
! 5 (0.30) 1 (0.70) 5−1 1
=
(5)(0.30)(0.70) 4
=
! 5 (0.30) 2 (0.70) 5−2 2
=
(10)(0.30) 2 (0.70) 3
=
0.309,
P(X = 3)
=
! 5 (0.30) 3 (0.70) 5−3 3
=
(10)(0.30) 3 (0.70) 2
=
0.132,
P(X = 4)
=
! 5 (0.30) 4 (0.70) 5−4 4
=
(5)(0.30) 4 (0.70)
P(X = 5)
=
! 5 (0.30) 5 (0.70) 5−5 5
=
(1)(0.30) 5 (1)
P(X = 1)
P(X = 2)
and
=
=
=
0.360,
0.028,
0.002.
Rather than calculate these probabilities using the formula, we could have used a statistical package such as Stata or R to generate the probabilities associated with this binomial random variable. The probability that at least three individuals among the five develop hepatitis B is P(X ≥ 3)
=
P(X = 3) + P(X = 4) + P(X = 5)
=
0.132 + 0.028 + 0.003
=
0.163.
ISTUDY
182
Principles of Biostatistics
The probability that at most one develops the disease is P(X ≤ 1)
=
P(X = 0) + P(X = 1)
=
0.168 + 0.360
=
0.528.
In addition, the mean number of persons who would develop the disease of size √ p √ in repeated samples five is np = 5(0.3) = 1.5, and the standard deviation is np(1 − p) = 5(0.3)(0.7) = 1.05 = 1.03. If X represents the number of occurrences of some event in a specified interval of time or space such that both the mean number of occurrences and the population variance are equal to λ, then X has a Poisson distribution with parameter λ. The random variable X can take on any integer value between 0 and ∞; the probability that X assumes a particular value x is P(X = x)
e−λ λ x . x!
=
Suppose we are concerned with the possible spread of diphtheria and wish to know how many cases we can expect to see in a particular year. Let X represent the number of cases of diphtheria reported in the United States in a given year over a 10-year period. The random variable X has a Poisson distribution with parameter λ = 2.5 [164]; the probability distribution of X may be expressed as P(X = x)
=
e−2.5 (2.5) x . x!
Therefore, the probability that no cases of diphtheria will reported during a given year is P(X = 0)
=
e−2.5 (2.5) 0 0!
=
0.082.
The probability that a single case will be reported is P(X = 1)
=
e−2.5 (2.5) 1 1!
=
0.205;
P(X = 2)
=
e−2.5 (2.5) 2 2!
=
0.257,
P(X = 3)
=
e−2.5 (2.5) 3 3!
=
0.214,
P(X = 4)
=
e−2.5 (2.5) 4 4!
=
0.134,
P(X = 5)
=
e−2.5 (2.5) 5 5!
=
0.067.
similarly
and
Since the outcomes of X are mutually exclusive and exhaustive, P(X ≥ 4)
= 1 − P(X < 4) = 1 − (0.082 + 0.205 + 0.257 + 0.214) = 0.242.
ISTUDY
183
Theoretical Probability Distributions
There is a 24.2% chance that we will observe four or more cases of diptheria in a given year. Similarly, the probability that we will observe six or more cases is P(X ≥ 6)
= 1 − P(X < 6)
= 1 − (0.082 + 0.205 + 0.257 + 0.214 + 0.134 + 0.067) = 0.041. √ √ The mean number of cases per year is λ = 2.5, and the standard deviation is λ = 2.5 = 1.58. If X is able to assume any value within a specified interval rather than being restricted to integers only, then it is a continuous random variable. The most common continuous probability distribution is the normal distribution. The normal distribution is defined by two parameters: its mean µ and standard deviation σ. The mean specifies the center of the distribution; the standard deviation quantifies the amount of variability around the mean. The shape of the normal distribution indicates that outcomes of the random variable X which are close to the mean are more likely to occur than values which are far from it. The normal distribution with mean µ = 0 and standard deviation σ = 1 is known as the standard normal distribution. Because its area has been tabulated, it is used to obtain probabilities associated with any normal random variable. For example, suppose that we wish to know the area under the standard normal curve that lies between z = −3.00 and z = 3.00; equivalently, this is the area in the interval µ ± 3σ, pictured in Figure 7.16. Looking at Table A.3, we find the area to the right of z = 3.00 to be 0.001. Since the standard normal curve is symmetric, the area to the left of z = −3.00 must be 0.001 as well. Therefore, the area between −3.00 and 3.00 is P(−3 ≤ Z ≤ 3)
= 1 − [P(Z < −3) + P(Z > 3)] = 1 − 0.001 − 0.001
=
0.998;
approximately 99.8% of the area under a standard normal curve lies within ±3 standard deviations from the mean. If X is an arbitrary normal random variable with mean µ and standard deviation σ, then Z
=
X−µ σ
is a standard normal random variable. By transforming X into Z, we are able to use the table of areas for the standard normal curve to estimate probabilities associated with X. Let X represent height. For the population of 18- to 74-year-old females in the United States, height is normally distributed with mean µ = 63.9 inches and standard deviation σ = 2.6 inches [165]. This distribution is illustrated in Figure 7.17. Then Z
=
X − 63.9 2.6
is a standard normal random variable. If we randomly select a female from this population, what is the probability that she is between 60 and 68 inches tall? For x = 60, =
z and, for x = 68,
60 − 63.9 2.6
=
− 1.50,
68 − 63.9 = 1.58. 2.6 First, the z-scores tell us that 60 inches is 1.50 standard deviations below the mean of 63.9 inches, while 68 inches is 1.58 standard deviations above the mean. Furthermore, the probability that z
=
ISTUDY
184
FIGURE 7.16 The standard normal curve, area between z = −3.00 and z = 3.00
FIGURE 7.17 Distribution of height for females 18 to 74 years of age, United States
Principles of Biostatistics
ISTUDY
185
Theoretical Probability Distributions
x – the individual’s height – lies between 60 and 68 inches is equal to the probability that z lies between −1.50 and 1.58 for the standard normal curve. The area to the left of z = −1.50 is 0.067, and the area to the right of z = 1.58 is 0.057. Since the total area under the curve is equal to 1, the area between −1.50 and 1.58 must be P(60 ≤ X ≤ 68)
= P(−1.50 ≤ Z ≤ 1.58) = 1 − [P(Z < −1.50) + P(Z > 1.58)] = 1 − [0.067 + 0.057]
=
0.876.
The probability that the female’s height is between 60 and 68 inches is 0.876. We might also wish to know the value of height that cuts off the upper 5% of this distribution. From Table A.3, we observe that a tail area of 0.050 corresponds to z = 1.645. Solving for x, z and x
=
=
1.645
=
x − 63.9 2.6
63.9 + (1.645)(2.6)
=
68.2.
Approximately 5% of the females in the United States population are taller than 68.2 inches.
ISTUDY
186
7.6
Principles of Biostatistics
Review Exercises 1. What is a probability distribution? How can a probability distribution be represented? 2. What are three properties associated with the binomial distribution? 3. What are three properties associated with the Poisson distribution? 4. What are the properties of the normal distribution? 5. Explain the importance of the standard normal distribution. 6. Let X be a discrete random variable that represents the number of diagnostic services a child receives during an office visit to a pediatric specialist; these services include procedures such as blood tests and urinalysis. The probability distribution for X appears below [166]. x 0 1 2 3 4 5+ Total
P(X = x) 0.671 0.229 0.053 0.031 0.010 0.006 1.000
(a) Construct a graph of the probability distribution of X. (b) What is the probability that a child receives exactly three diagnostic services during an office visit to a pediatric specialist? (c) What is the probability that he or she receives at least one service? Four or more services? (d) What is the probability that the child receives exactly three services given that he or she receives at least one service? 7. Figure 7.18 displays the probability distribution of the random variable X representing the birth order of a child born in the United States [167]. Using the graph, estimate the following: (a) The probability that a child is its mother’s fourth child (b) The probability that a child is its mother’s first or second child (c) The probability that a child is it mother’s third child or higher 8. Suppose that you are interested in monitoring air pollution in Los Angeles, California, over a one-week period. Let X be a random variable that represents the number of days in a week on which the concentration of carbon monoxide surpasses a specified level. Do you believe that X has a binomial distribution? Explain.
ISTUDY
Theoretical Probability Distributions
187
FIGURE 7.18 Probability distribution of a random variable representing birth order in the United States, 2016 9. Consider a group of seven individuals selected from the population of adults age 65 years and older residing in the United States. The number of persons in this sample who suffer from diabetes is a binomial random variable with parameters n = 7 and p = 0.252 [168]. (a) If you wish to make a list of the seven persons chosen, in how many ways can they be ordered? (b) Without regard to order, in how many ways can you select four individuals from this group of seven? (c) What is the probability that exactly two of the individuals in the sample of size seven have been diagnosed with diabetes? (d) What is the probability that at most two of the seven have been diagnosed with diabetes? (e) What is the probability that four of the seven have diabetes? 10. According to the National Health Survey, 9.8% of the population of 18- to 24-year-olds in the United States are left-handed [165]. (a) Suppose that you select ten individuals from this population. In how many ways can the ten persons be ordered? (b) Without regard to order, in how many ways can you select four individuals from this group of ten? (c) What is the probability that exactly three of the ten persons are left-handed? (d) What is the probability that at least six of the ten persons are left-handed? (e) What is the probability that at most two individuals are left-handed?
ISTUDY
188
Principles of Biostatistics
11. According to the Youth Risk Behavior Surveillance System, 20.7% of all American high school students watch television for three or more hours on a typical school day [169]. (a) If you select repeated samples of size 20 from the population of high school students, what would be the mean number of individuals per sample who watch television for three or more hours per day? What would be the standard deviation? (b) Suppose that you select a sample of 20 individuals and find that 18 of them watch at least three hours of television per day. Assuming that the Surveillance System is correct, what is the probability that you would have obtained results as extreme as or even more extreme than those you observed? (c) Suppose that you select a sample of 20 individuals and find that 8 of them watch at least three hours of television per day. Assuming that the Surveillance System is correct, what is the probability that you would have obtained results as extreme as or even more extreme than those you observed? 12. The number of cases of tetanus reported in the United States during a single month has a Poisson distribution with parameter λ = 4.5 [164]. (a) What is the probability that exactly one case of tetanus will be reported during a given month? (b) What is the probability that at most two cases of tetanus will be reported? (c) What is the probability that four or more cases will be reported? (d) What is the mean number of cases of tetanus reported in a one-month period? What is the standard deviation? 13. In a particular county, the average number of suicides reported each month is 2.75 [170]. Assume that the number of suicides follows a Poisson distribution. (a) What is the probability that no suicides will be reported during a given month? (b) What is the probability that at most four suicides will be reported? (c) What is the probability that six or more suicides will be reported? 14. Let X be a random variable that represents the number of infants in a group of 2000 who die before reaching their first birthdays. In the United States, the probability that a child dies during his or her first year of life is 0.0059 [171]. (a) What is the mean number of infants who would die in a group of this size? (b) What is the probability that at most five infants out of 2000 die in their first year of life? (c) What is the probability that between 15 and 20 infants die in their first year of life? 15. Consider the standard normal distribution with mean µ = 0 and standard deviation σ = 1. (a) (b) (c) (d) (e)
What is the probability that an outcome z is greater than 2.60? What is the probability that z is less than 1.35? What is the probability that z is between −1.70 and 3.10? What value of z cuts off the upper 15% of the standard normal distribution? What value of z cuts off the lower 20% of the distribution?
16. Among females in the United States between 18 and 74 years of age, diastolic blood pressure is normally distributed with mean µ = 77 mm Hg and standard deviation σ = 11.6 mm Hg [161].
ISTUDY
Theoretical Probability Distributions
189
(a) What is the probability that a randomly selected female has a diastolic blood pressure less than 60 mm Hg? (b) What is the probability that she has a diastolic blood pressure greater than 90 mm Hg? (c) What is the probability that the female has a diastolic blood pressure between 60 and 90 mm Hg? (d) Suppose that a randomly selected female has a z-score of 2.15. What does this tell us? 17. The distribution of weights for the population of males in the United States is approximately normal with mean µ = 172.2 pounds and standard deviation σ = 29.8 pounds [165]. (a) (b) (c) (d)
What is the z-score associated with a weight of 130 pounds? What is the probability that a randomly selected male weighs less than 130 pounds? What is the probability that he weighs more than 210 pounds? What is the probability that among five males selected at random from the population, exactly two will have a weight outside the range 130 to 210 pounds? (e) What is the probability that among five males selected at random from the population, at least one will have a weight outside the range 130 to 210 pounds?
18. The Wechsler Adult Intelligence Scale, commonly called an IQ score, is designed to have a normal distribution with mean µ = 100 and standard deviation σ = 15 in the general population. (a) What is the probability that an adult selected from the general population has an IQ score above 100? (b) What is the probability that an adult has an IQ score above 130? (c) What is the probability that an adult has an IQ score below 80? (d) What is the probability that an adult has an IQ score between 85 and 115? 19. In the Framingham Heart Study, serum cholesterol levels were measured for a large number of healthy males. The population was then followed for 16 years. At the end of this time, the males were divided into two groups: those who had developed coronary heart disease and those who had not. The distributions of the initial serum cholesterol levels for each group were found to be approximately normal. Among individuals who eventually developed coronary heart disease, the mean serum cholesterol level was µ d = 244 mg/100 ml and the standard deviation was σ d = 51 mg/100 ml; for those who did not develop coronary heart disease, the mean serum cholesterol level was µ nd = 219 mg/100 ml and the standard deviation was σ nd = 41 mg/100 ml [172]. (a) Suppose that an initial serum cholesterol level of 260 mg/100 ml or higher is used to predict future coronary heart disease. What is the probability of correctly predicting future heart disease for a male who will develop it? (b) What is the probability of predicting disease for a male who will not develop it? (c) What is the probability of failing to predict coronary heart disease for a male who will develop it? (d) What would happen to the probabilities of false positive and false negative results if the cutoff point for predicting future disease is lowered to 250 mg/100 ml? (e) In this population, does initial serum cholesterol level appear to be useful for predicting future coronary heart disease? Why or why not?
ISTUDY
ISTUDY
8 Sampling Distribution of the Mean
CONTENTS 8.1 8.2 8.3 8.4 8.5
Sampling Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Applications of the Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
191 192 193 198 204
In the previous chapter we examine theoretical probability distributions, such as the binomial distribution and the normal distribution. In all cases the relevant population parameters are assumed to be known; this allows us to describe the distributions completely and calculate the probabilities associated with various outcomes. In most practical applications, however, we are not given the values of these parameters. Instead, we must attempt to describe or estimate a population parameter – such as the mean of a normally distributed random variable – using the information contained in a sample of observations selected from the population. The process of drawing conclusions about an entire population based on the information contained in a sample is known as statistical inference.
8.1
Sampling Distributions
Suppose that we are interested in estimating the mean value of a continuous random variable. For example, we might wish to make a statement about the mean serum cholesterol level of all males residing in the United States, based on a sample drawn from this population. The obvious approach would be to use the mean of the sample as an estimate of the unknown population mean µ. The sample mean X is called an estimator of the parameter µ. There are many different approaches to the process of estimation; in this case, because the population is assumed to be normally distributed, the sample mean X is something called a maximum likelihood estimator [173]. The method of maximum likelihood finds the value of the parameter that is most likely to have produced the observed sample data. This method can usually be relied on to yield reasonable estimators. Note, however, that two different samples are likely to result in different sample means; consequently, there is some degree of uncertainty involved. Before we apply this estimation procedure, therefore, we first examine some of the properties of the sample mean and the ways in which it can vary. The population under investigation can be any group that we choose. In general, we are able to estimate a population mean µ with greater precision when the group is relatively homogeneous. If there is only a small amount of variation among individuals, then we can be more certain that the observations in any given sample are representative of the entire group. It is very important that a sample provide an accurate representation of the population from which it is selected. If it does not, then the conclusions drawn about the population may be distorted or biased. For instance, if we intend to make a statement about the mean serum cholesterol level DOI: 10.1201/9780429340512-8
191
ISTUDY
192
Principles of Biostatistics
of all 20- to 74-year-old males in the United States, but sample only those over the age of 60, then our estimate of the population mean is likely to be too high. It is crucial that the sample drawn be random; each individual in the population should have an equal chance of being selected. This point is discussed further in Chapter 21. In addition, we would expect that the larger the sample, the more reliable our estimate of the population mean. Suppose that the mean of the continuous random variable serum cholesterol level is µ and the standard deviation is σ. We randomly select a sample of n observations from the population and compute the mean of this sample. Call the sample mean x¯ 1 . We then obtain a second random sample of n observations and calculate the mean of this new sample. Label this second sample mean x¯ 2 . Unless everyone in the population has exactly the same serum cholesterol level, it is very unlikely that x¯ 1 will equal x¯ 2 . If we were to continue this procedure indefinitely – selecting all possible samples of size n and computing their means – we would end up with a set of values consisting entirely of sample means. Another way to think about this is to observe that the estimator X is actually itself a random variable, with outcomes x¯ 1 , x¯ 2 , x¯ 3 , and so on. If each mean in this series is treated as a unique observation, their collective probability distribution – the probability distribution of X – is known as a sampling distribution of the mean for samples of size n. If we were to select repeated samples of size 25 from the population of males residing in the United States and calculate the mean serum cholesterol level for each sample, we would end up with the sampling distribution of mean serum cholesterol levels of samples of size 25. In practice it is not common to select repeated samples of size n from a given population; understanding the properties of the theoretical distribution of their means, however, allows us to make inference based upon a single sample of size n.
8.2
Central Limit Theorem
Given that the distribution of serum cholesterol levels in the underlying population of adult males has mean µ and standard deviation σ, the distribution of sample means computed for samples of size n has three important properties: 1. The mean of the sampling distribution is identical to the population mean µ. √ 2. The standard deviation of the distribution of sample means is equal to σ/ n, the standard deviation of the underlying population divided by the square root of the sample size. This quantity is known as the standard error of the mean. 3. Provided that the sample size n is large enough, the shape of the sampling distribution is approximately normal. Regarding the first property, we would intuitively expect the means of all the samples to cluster around the mean of the population from which they were drawn. For the second property, although the standard deviation of the sampling distribution is related to the population standard deviation σ, there is less variability among the sample means than there is among individual observations. Even if a particular sample contains one or two extreme values, that these values would likely be offset by the other measurements in the group. Thus, as long as n is greater than 1, the standard error of the mean is always smaller than the standard deviation of the population. In addition, as n increases, the amount of sampling variability decreases. Finally, if n is large enough, the distribution of the sample means is approximately normal. This remarkable result is known as the central limit theorem; it applies to any population with a finite standard deviation, regardless of the shape of the underlying distribution [174]. The caveat is that the farther the underlying population departs from being normally distributed, the larger the value of n needs to be to ensure normality of the
ISTUDY
193
Sampling Distribution of the Mean
sampling distribution. If the underlying population is itself normal, then samples of size 1 are large enough. Even if the population is bimodal or noticeably skewed, however, a sample of size 30 is often sufficient. The central limit theorem is very powerful. It holds true not only for serum cholesterol levels, but for almost any other type of measurement as well. It even applies to discrete random variables. The central limit theorem allows us to quantify the uncertainty inherent in statistical inference without having to make a great many assumptions that cannot be verified. Regardless of the probability distribution of X, because the √ distribution of the sample means is approximately normal with mean µ and standard deviation σ/ n, we know that =
Z
X−µ √ σ/ n
is normally distributed with mean 0 and standard deviation 1, as long as n is large enough. We have simply standardized the normal random variable X in the usual way. As a result, we can use tables of the standard normal distribution – such as Table A.3 in Appendix A – to make inference about the value of a population mean.
8.3
Applications of the Central Limit Theorem
Consider the distribution of serum cholesterol levels for all 20- to 74-year-old males living in the United States. The mean of this population is µ = 211 mg/100 ml, and the standard deviation is σ = 46 mg/100 ml [43]. If we select repeated samples of size 25 from the population, what proportion of the samples will have a mean value of 230 mg/100 ml or above? Assuming that a sample of size 25 is large enough, the central limit theorem states that the distribution of means√of samples √ of size 25 is approximately normal with mean µ = 211 mg/100 ml and standard error σ/ n = 46/ 25 = 9.2 mg/100 ml. This sampling distribution, and the underlying population from which the samples were drawn, are illustrated in Figure 8.1. Note that Z
=
x−µ √ σ/ n
X − 211 9.2 is a standard normal random variable. If x¯ = 230, then =
230 − 211 = 2.07. 9.2 This tells us that a cholesterol level of 230 lies 2.07 standard errors above the population mean. Referring to Table A.3, the area to the right of z = 2.07 is 0.019. Only about 1.9% of the samples will have a mean greater than 230 mg/100 ml. Equivalently, if we select a single sample of size 25 from the population of 20- to 74-year-old males, the probability that the mean serum cholesterol level for this sample is 230 mg/100 ml or higher is 0.019. What mean value of serum cholesterol level cuts off the lower 10% of the sampling distribution of means? Locating 0.100 in the body of Table A.3, we see that it corresponds to the value z = −1.28. Plugging in this value for z and solving for x, ¯ =
z
z
= −1.28 =
x¯ − 211 9.2
ISTUDY
194
Principles of Biostatistics
FIGURE 8.1 Distributions of individual values and means of samples of size 25 for the serum cholesterol levels of 20- to 74-year-old males, United States and
x¯ = 211 + (−1.28)(9.2) = 199.2.
Approximately 10% of the samples of size 25 have means that are less than or equal to 199.2 mg/100 ml. Let us now calculate the upper and lower limits that enclose 95% of the means of samples of size 25 drawn from the population. Since 2.5% of the area under the standard normal curve lies above z = 1.96 and another 2.5% lies below z = −1.96, we have that P(−1.96 ≤ Z ≤ 1.96)
=
0.95.
Therefore, we are interested in outcomes of Z for which −1.96 ≤ Z ≤ 1.96. We must transform this inequality into a statement about X . Substituting (X − 211)/9.2 for Z, we have X − 211 −1.96 ≤ ≤ 1.96. 9.2 Multiplying all three terms of the inequality by 9.2 and adding 211 results in 211 − 1.96(9.2) ≤ X ≤ 211 + 1.96(9.2) or
193.0 ≤ X ≤ 229.0.
ISTUDY
195
Sampling Distribution of the Mean
This inequality statement tells us that approximately 95% of the means of samples of size 25 lie between 193.0 mg/100 ml and 229.0 mg/100 ml. Consequently, if we select a random sample of size 25 which is reported to be from the population of serum cholesterol levels for all 20- to 74-yearold males, and the sample has a mean that is either greater than 229.0 or less than 193.0 mg/100 ml, we should be suspicious of this claim. Either the random sample was actually drawn from a different population, or a rare event has occurred. For the purposes of this discussion, a “rare event” is defined as an outcome that occurs less than 5% of the time. Suppose we had selected samples of size 10 from √ the population rather than samples of size 25. In this case, the standard error of X would be 46/ 10 = 14.5 mg/100 ml, and we would construct the inequality X − 211 −1.96 ≤ ≤ 1.96. 14.5 The upper and lower limits that enclose 95% of the means would be 182.5 ≤ X ≤ 239.5. Note that this interval is wider than the one calculated for samples of size 25. We expect the amount of sampling variation to increase as the sample size gets smaller. Drawing samples of size 50 would result in upper and lower limits 198.2 ≤ X ≤ 223.8; not surprisingly, this interval is narrower than the one constructed for samples of size 25. Samples of size 100 produce the limits 202.0 ≤ X ≤ 220.0. In summary, if we include the case for which n = 1, we have the following results:
n
σ/
√
n
Interval Enclosing 95% of the Means
Length of Interval
1
46.0
120.8 ≤ X ≤ 301.2
180.4
10
14.5
182.5 ≤ X ≤ 239.5
57.0
25
9.2
193.0 ≤ X ≤ 229.0
36.0
50
6.5
198.2 ≤ X ≤ 223.8
25.6
100
4.6
202.0 ≤ X ≤ 220.0
18.0
As the size of the samples √increases, the amount of variability among the sample means — represented by the standard error σ/ n — decreases. Consequently, the limits encompassing 95% of these means move closer together. (The length of an interval is simply the upper limit minus the lower limit.) Note that the intervals we have constructed have all been symmetric around the population mean 211 mg/100 ml; 211 always lies at the center of the interval. Clearly, there are other intervals that would also capture the appropriate proportion of the sample means. Suppose that we again wish to construct an interval that contains 95% of the means of samples of size 25, but this time we begin by noting that 1% of the area under the standard normal curve lies above z = 2.32, and 4% lies below z = −1.75. Therefore, P(−1.75 ≤ Z ≤ 2.32)
=
In this case, we are interested in the outcomes of Z for which −1.75 ≤ Z ≤ 2.32.
0.95.
ISTUDY
196
Principles of Biostatistics
Substituting (X − 211)/9.2 for Z, we find the interval to be 194.9 ≤ X ≤ 232.3. We are able to say that approximately 95% of the means of samples of size 25 lie between 194.9 mg/100 ml and 232.3 mg/100 ml. It is usually preferable to construct a symmetric interval, however, primarily because it is the shortest interval that captures the desired proportion of the means. (An exception to this rule is the one-sided interval; we return to this special case below.) In this example, the asymmetrical interval has length 232.3 − 194.9 = 37.4 mg/100 ml; the length of the symmetric interval is 229.0 − 193.0 = 36.0 mg/100 ml. We now move on to a slightly more complicated question: How large would the samples need to be so that 95% of their means lie within ±5 mg/100 ml of the population mean µ? To answer this, it is not necessary to know the value of the parameter µ. We simply find the sample size n for which P(µ − 5 ≤ X ≤ µ + 5)
=
0.95,
or P(−5 ≤ X − µ ≤ 5)
=
0.95.
√ √ To begin, we divide all three terms of the inequality by the standard error σ/ n = 46/ n; this results in −5 X−µ 5 + P* = √ ≤ √ ≤ √ 46/ n 46/ n , 46/ n
0.95.
√ Since Z is equal to X minus its mean and divided by its standard error, or (X − µ)/(46/ n), we can say that ! −5 5 P = 0.95. √ ≤Z≤ √ 46/ n 46/ n Recall that 95% of the area under the standard normal curve lies between z = −1.96 and z = 1.96. Therefore, to find the sample size n we could use the upper bound of the interval and solve the equation 5 z = 1.96 = √ ; 46/ n equivalently, we could use the lower bound and solve =
z
− 1.96
=
−5 √ . 46/ n
Taking 1.96 =
√ 5 n 46
and multiplying both sides of the equation by 46/5, we find that √ and
" n
=
n
=
1.96(46) 5
1.96(46) 5
#2
=
325.2.
ISTUDY
197
Sampling Distribution of the Mean
When dealing with sample sizes it is conventional to round up. Therefore, samples of size 326 would be required for 95% of the sample means to lie within ±5 mg/100 ml of the population mean µ. Another way to state this is that if we select a sample of size 326 from the population and calculate its mean, the probability that the sample mean is within ±5 mg/100 ml of the true population mean µ is 0.95. Up to this point we have focused on two-sided intervals, and we have found the upper and lower limits which enclose a specified proportion of the sample means. More specifically, we have focused on symmetric intervals. In some situations, however, we are interested in a one-sided interval instead. For instance, we might wish to find the upper bound for 95% of the mean serum cholesterol levels of samples of size 25. Since 95% of the area under the standard normal curve lies below z = 1.645, P(Z ≤ 1.645)
=
0.95.
Consequently, we are interested in outcomes of Z for which Z ≤ 1.645. Substituting (X − 211)/9.2 for Z produces X − 211 ≤ 1.645, 9.2 or
X ≤ 226.1.
Approximately 95% of the means of samples of size 25 lie below 226.1 mg/100 ml. If we want to construct a lower bound for 95% of the mean serum cholesterol levels, we instead focus on values of Z that lie above −1.645; in this case, we solve X − 211 ≥ −1.645 9.2 to find
X ≥ 195.9.
Approximately 95% of the means of samples of size 25 lie above 195.9 mg/100 ml. Always keep in mind that we must be cautious when making multiple statements about the sampling distribution of the means. For samples of serum cholesterol levels of size 25, we found that the probability is 0.95 that a sample mean lies within the interval (193.0 , 229.0). We also said that the probability is 0.95 that the mean lies below 226.1 mg/100 ml, and 0.95 that it is above 195.9 mg/100 ml. Although these three statements are correct individually, they are not true simultaneously. The three events are not independent. For all of them to occur at the same time, the sample mean would have to lie in the interval (195.9 , 226.1). The probability that this happens is not equal to 0.95.
ISTUDY
198
Principles of Biostatistics
FIGURE 8.2 Distribution of age at the time of death, United States, 2015
Summary: Sampling Distribution of X Term
Notation
Sampling distribution
√ X is N(µ, σ/ n) for large sample size n
8.4
Mean of X
µ
Standard error of X
√ σ/ n
Standard normal random variable
Z=
X−µ √ is N(0, 1) σ/ n
Further Applications
Consider the distribution of age at the time of death for the United States population in 2015. This distribution is pictured in Figure 8.2; it has mean µ = 79.2 years, standard deviation σ = 16.4 years, and is far from normally distributed [171]. What do we expect to happen when we sample from this population of ages?
ISTUDY
199
Sampling Distribution of the Mean
FIGURE 8.3 Histograms of four samples of size 25

Rather than draw samples from the population physically, we can write a computer program to simulate this process. To conduct a simulation, the computer is used to model an experiment or procedure according to a specified probability distribution. In our example, the procedure would consist of selecting an individual observation from the distribution pictured in Figure 8.2. The computer is then instructed to repeat the process a given number of times while keeping track of the results. To illustrate this technique, we can use the computer to simulate the selection of four random samples of size 25 from the population of ages at the time of death for the United States population. Histograms of these samples are shown in Figure 8.3; their means and standard deviations are as follows:

Sample of Size 25      x¯      s
        1             77.5   14.4
        2             76.3   23.1
        3             78.7   17.7
        4             80.6   12.9
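A minimal sketch of such a simulation in base R is shown below. Since the exact population distribution of Figure 8.2 is not reproduced here, the vector ages is a hypothetical stand-in generated from a normal distribution with the same mean and standard deviation; the true population of ages at death is far from normal.

    set.seed(2015)                               # make the simulation reproducible
    ages <- rnorm(1e6, mean = 79.2, sd = 16.4)   # stand-in for the population in Figure 8.2

    draws <- replicate(4, sample(ages, size = 25))   # four random samples of size 25
    apply(draws, 2, mean)    # four sample means, analogous to the x¯ column above
    apply(draws, 2, sd)      # four sample standard deviations, analogous to s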
Note that the four random samples are not identical. Each time we select a set of 25 measurements from the population, the observations included in the sample change. As a result, the values of x¯ and s – our estimates of the population mean µ and standard deviation σ – differ from sample to sample. This random variation is known as sampling variability. In the four samples of size 25 selected above, the estimates of µ range from 76.3 years to 80.6 years. The estimates of σ range from 12.9 years to 23.1 years. Suppose that now, instead of selecting samples of size 25, we choose four random samples of size 100 from the population of ages at the time of death. Again we use the computer to simulate
this process. Histograms of the samples are displayed in Figure 8.4, and their means and standard deviations are provided below.

FIGURE 8.4 Histograms of four samples of size 100

Sample of Size 100     x¯      s
        1             80.2   17.3
        2             80.6   14.9
        3             78.2   17.8
        4             79.8   18.3
For these samples, estimates of µ range from 78.2 years to 80.6 years and estimates of σ from 14.9 years to 18.3 years. These ranges are smaller than the corresponding intervals for samples of size 25. We would in fact expect this. As the sample size increases, the amount of sampling variability decreases. We next select four random samples of size 500 from the population of ages at the time of death. Histograms are shown in Figure 8.5, and the means and standard deviations are listed below.

Sample of Size 500     x¯      s
        1             78.3   15.9
        2             79.1   16.8
        3             79.5   16.2
        4             79.8   15.9
Again, the ranges of the estimates for both µ and σ decrease.
FIGURE 8.5 Histograms of four samples of size 500
Looking at Figures 8.3 through 8.5, we see that as the size of the samples increases, their distributions approach the shape of the population distribution pictured in Figure 8.2. Although there are still differences among the samples, the amount of variability in the estimates x¯ and s decreases. This property is known as consistency. As the samples that we select become larger and larger, the estimates of the population parameters approach their target values.

The population of ages at the time of death can also be used to demonstrate an application of the central limit theorem. To do this, we must select repeated samples of size n from the population with mean µ = 79.2 years and standard deviation σ = 16.4 years and examine the distribution of the means of these samples. Theoretically, we must enumerate all possible random samples; for now, however, we select 100 samples of size 25. A histogram of the 100 sample means is displayed in Figure 8.6. According to the central limit theorem, the distribution of the sample means possesses three properties. First, its mean should be equal to the population mean µ = 79.2 years. In fact, the mean of the 100 sample means is 79.8 years. Second, we expect the standard error of the sample means to be σ/√n = 16.4/√25 = 3.3 years. The calculated standard error is 2.9 years. Finally, the distribution of sample means should be approximately normal. The shape of Figure 8.6 – and the theoretical normal distribution superimposed over the histogram – suggest that this third property holds true. Note that this is a large departure from the population distribution illustrated in Figure 8.2, or from any of the individual samples shown in Figures 8.3 through 8.5. If n were larger, we would expect the distribution to look even more normal.

FIGURE 8.6 Histogram of 100 sample means from samples of size 25

Based on the sampling distribution, we can calculate probabilities associated with various outcomes of the sample mean. For instance, among samples of size 25 that are drawn from the population of ages at the time of death, what proportion have a mean that lies between 77 and 81 years? To answer this question, we must find P(77 ≤ X ≤ 81). As we just saw, the central limit theorem states that the distribution of sample means of samples of size 25 is approximately normal with mean µ = 79.2 years and standard error σ/√n = 16.4/√25 = 3.3 years. Therefore,

Z = (X − µ)/(σ/√n) = (X − 79.2)/3.3
is a standard normal random variable. If we represent the inequality in the expression P(77 ≤ X ≤ 81) in terms of Z rather than X, we can use Table A.3 to find the proportion of samples that have a mean value in this range. We begin by subtracting 79.2 from each term in the inequality and dividing by 3.3; thus we can express P(77 ≤ X ≤ 81) as

P((77 − 79.2)/3.3 ≤ (X − 79.2)/3.3 ≤ (81 − 79.2)/3.3)

or

P(−0.67 ≤ Z ≤ 0.55).
We know that the total area underneath the standard normal curve is equal to 1. According to Table A.3, the area to the right of z = 0.55 is 0.291, and the area to the left of z = −0.67 is 0.251. Therefore,

P(−0.67 ≤ Z ≤ 0.55) = 1 − 0.291 − 0.251 = 0.458.

Approximately 45.8% of the samples of size 25 have a mean that lies between 77 and 81 years.
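These normal curve calculations can be checked directly with software; a quick sketch in R, using the population values µ = 79.2 and σ = 16.4 from Figure 8.2:

    mu <- 79.2; sigma <- 16.4
    se25 <- sigma / sqrt(25)                    # about 3.3
    pnorm(81, mu, se25) - pnorm(77, mu, se25)   # P(77 <= X <= 81), about 0.46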
What proportion of the means of samples of size 100 lie between 77 and 81 years? Again we must find P(77 ≤ X ≤ 81). This time, however, X has a normal distribution with mean µ = 79.2 years and standard error σ/√n = 16.4/√100 = 1.64 years. Therefore, we construct the inequality

P((77 − 79.2)/1.64 ≤ (X − 79.2)/1.64 ≤ (81 − 79.2)/1.64)

or

P(−1.34 ≤ Z ≤ 1.10).

According to Table A.3, the area to the right of z = 1.10 is 0.136, and the area to the left of z = −1.34 is 0.090. Therefore,

P(−1.34 ≤ Z ≤ 1.10) = 1 − 0.136 − 0.090 = 0.774.
About 77.4% of the samples of size 100 have a mean that lies between 77 and 81 years. If we were to select a single random sample of size 100 and find that its sample mean is x¯ = 85 years, either the sample actually came from a population with a different underlying mean – something higher than µ = 79.2 years – or a rare event has occurred.

To address a different type of question, we might wish to find the upper and lower limits that enclose 80% of the means of samples of size 100. Referring to Table A.3, we find that 10% of the area under a standard normal curve lies above z = 1.28, and another 10% lies below z = −1.28. Since 80% of the area lies between −1.28 and 1.28, we are interested in values of Z for which −1.28 ≤ Z ≤ 1.28, and values of X for which

−1.28 ≤ (X − 79.2)/1.64 ≤ 1.28.

Multiplying all three terms of the inequality by 1.64 and adding 79.2 results in

79.2 + (−1.28)(1.64) ≤ X ≤ 79.2 + (1.28)(1.64)

or, equivalently,

77.1 ≤ X ≤ 81.3.

Therefore, 80% of the means of samples of size 100 lie between 77.1 years and 81.3 years.
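Again as a sketch, the same limits can be obtained in R with the qnorm function:

    se100 <- 16.4 / sqrt(100)             # 1.64
    79.2 + qnorm(c(0.10, 0.90)) * se100   # limits enclosing the middle 80%: about 77.1 and 81.3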
8.5 Review Exercises

1. What is statistical inference?

2. When making inference, why is it important that a sample drawn from a population be a random sample?

3. Why is it necessary to understand the properties of a theoretical distribution of means of samples of size n drawn from a population, when in practice you will only select a single such sample?

4. What is the standard error of a sample mean? How does the standard error compare to the standard deviation of the population?

5. Explain the central limit theorem.

6. What happens to the amount of sampling variability among a set of sample means x¯1, x¯2, x¯3, . . . as the size of the samples increases?

7. What is consistency?

8. Among adults in the United States, the distribution of the protein albumin in cerebrospinal fluid is roughly symmetric with mean µ = 29.5 mg/100 ml and standard deviation σ = 9.25 mg/100 ml [175]. Suppose that you select repeated samples of size 20 from the population of albumin levels and calculate the mean for each sample.
   (a) If you were to select a large number of random samples of size 20, what would be the mean of the sample means?
   (b) What would be their standard deviation? What is another name for this standard deviation of the sample means?
   (c) How does the standard deviation of the sample means compare with the standard deviation of the albumin levels themselves?
   (d) If you were to take all the different sample means and construct a histogram of these values, what would be the shape of their distribution?
   (e) What proportion of the means of samples of size 20 are greater than 33 mg/100 ml?
   (f) What proportion of the means are less than 28 mg/100 ml?
   (g) What proportion of the means are between 29 and 31 mg/100 ml?

9. Consider a random variable Z which has a standard normal distribution with mean µ = 0 and standard deviation σ = 1.
   (a) What can you say about the distribution of means of samples of size 10 which are drawn from this population? List three properties.
   (b) What proportion of the means of samples of size 10 are greater than 0.60?
   (c) What proportion of the means are less than −0.75?
   (d) What value cuts off the upper 20% of the distribution of means of samples of size 10?
   (e) What value cuts off the lower 10% of the distribution of means?
10. In Denver, Colorado, the distribution of daily measures of ambient nitric acid, a corrosive liquid, is skewed to the right. It has mean µ = 1.81 µg/m3 and standard deviation σ = 2.25 µg/m3 [176]. Describe the distribution of means of samples of size 40 selected from this population.

11. In Norway, the distribution of birth weights for infants whose gestational age is 40 weeks is approximately normal with mean µ = 3500 grams and standard deviation σ = 430 grams [177].
   (a) Given a newborn whose gestational age is 40 weeks, what is the probability that his or her birth weight is less than 2500 grams?
   (b) What value cuts off the lower 5% of the distribution of birth weights?
   (c) Describe the distribution of means of samples of size 5 drawn from this population. List three properties.
   (d) What value cuts off the lower 5% of the distribution of samples of size 5?
   (e) Given a sample of 5 newborns all with gestational age 40 weeks, what is the probability that their mean birth weight is less than 2500 grams?
   (f) What is the probability that only one of the 5 newborns has a birth weight less than 2500 grams?

12. For the population of females between the ages of 3 and 74 who participated in the National Health Interview Survey, the distribution of hemoglobin levels has mean µ = 13.3 g/100 ml and standard deviation σ = 1.12 g/100 ml [75].
   (a) If repeated samples of size 15 are selected from this population, what proportion of the samples will have a mean hemoglobin level between 13.0 and 13.6 g/100 ml?
   (b) If the repeated samples are of size 30, what proportion will have a mean between 13.0 and 13.6 g/100 ml?
   (c) How large must the samples be so that 95% of their means lie within ±0.2 g/100 ml of the population mean µ?
   (d) How large must the samples be so that 95% of their means lie within ±0.1 g/100 ml of the population mean µ?

13. In the Netherlands, healthy males between the ages of 65 and 79 have a distribution of serum uric acid levels that is approximately normal with mean µ = 341 µmol/l and standard deviation σ = 79 µmol/l [178].
   (a) What proportion of the males have a serum uric acid level between 300 and 400 µmol/l?
   (b) What proportion of samples of size 5 have a mean serum uric acid level between 300 and 400 µmol/l?
   (c) What proportion of samples of size 10 have a mean serum uric acid level between 300 and 400 µmol/l?
   (d) Construct an interval that encloses 95% of the means of samples of size 10. Which would be shorter, a symmetric interval or an asymmetric one?

14. For the population of adult males in the United States, the distribution of weights is approximately normal with mean µ = 172.2 pounds and standard deviation σ = 29.8 pounds [165].
   (a) Describe the distribution of means of samples of size 25 which are drawn from this population.
   (b) What is the upper bound for 90% of the mean weights of samples of size 25?
   (c) What is the lower bound for 80% of the mean weights?
   (d) Suppose that you select a single random sample of size 25 and find that the mean weight for the men in the sample is x¯ = 190 pounds. How likely is this result? What would you conclude?
15. At the end of Section 8.3, it was noted that for samples of serum cholesterol levels of size 25 drawn from a population with mean µ = 211 mg/100 ml and standard deviation σ = 46 mg/100 ml, the probability that a sample mean x¯ lies within the interval (193.0, 229.0) is 0.95. Furthermore, the probability that the mean lies below 226.1 mg/100 ml is 0.95, and the probability that it is above 195.9 mg/100 ml is 0.95. For all three of these events to happen simultaneously, the sample mean x¯ would have to lie in the interval (195.9, 226.1). What is the probability that this occurs?
Part III
Inference
9 Confidence Intervals
CONTENTS
9.1 Two-Sided Confidence Intervals
9.2 One-Sided Confidence Intervals
9.3 Student’s t Distribution
9.4 Further Applications
9.5 Review Exercises
Now that we have investigated the theoretical properties of a distribution of sample means, we are ready to take the next step and apply this knowledge to the process of statistical inference. Recall that our goal is to estimate the population mean associated with a continuous random variable using the information contained in a sample of observations drawn from that population. There are two methods of estimation which are commonly used. The first is called point estimation; it involves using the sample data to calculate a single number to estimate the parameter of interest. For instance, we might use the sample mean x¯ to estimate the population mean µ. The problem is that two different samples are very likely to result in different sample means, and thus there is some degree of uncertainty involved. A point estimate does not provide any information about the inherent variability of the estimator; we do not know how close x¯ is to µ in any given situation. While x¯ is more likely to be near the true population mean if the sample on which it is based is large – recall the property of consistency – a point estimate provides no information about the size of this sample. Consequently, a second method of estimation known as interval estimation is often preferred. Interval estimation provides a range of reasonable values which are intended to contain the parameter of interest – the population mean µ, in this case – with a certain degree of confidence. This range of values, called a confidence interval, gives us information about the precision with which the parameter is estimated.
9.1 Two-Sided Confidence Intervals
To construct a confidence interval for µ, we draw on our knowledge of the sampling distribution of the mean from the previous chapter. Given a random variable X that has mean µ and standard deviation σ, the central limit theorem states that

Z = (X − µ)/(σ/√n)

has a standard normal distribution if X is itself normally distributed, and an approximate standard normal distribution if it is not but n is sufficiently large. For a standard normal random variable, 95% of the observations lie between −1.96 and 1.96. In other words, the probability that Z assumes
a value between −1.96 and 1.96 is

P(−1.96 ≤ Z ≤ 1.96) = 0.95.

Equivalently, we could substitute the quantity (X − µ)/(σ/√n) for Z and write

P(−1.96 ≤ (X − µ)/(σ/√n) ≤ 1.96) = 0.95.

Given this expression, we are able to manipulate the inequality inside the parentheses without changing the probability statement. We begin by multiplying all three terms of the inequality by the standard error σ/√n; therefore,

P(−1.96(σ/√n) ≤ X − µ ≤ 1.96(σ/√n)) = 0.95.

We next subtract X from each term so that

P(−1.96(σ/√n) − X ≤ −µ ≤ 1.96(σ/√n) − X) = 0.95.

Finally, we multiply through by −1. Bear in mind that multiplying an inequality by a negative number reverses the direction of the inequality. Consequently,

P(1.96(σ/√n) + X ≥ µ ≥ −1.96(σ/√n) + X) = 0.95

and, rearranging the terms,

P(X − 1.96(σ/√n) ≤ µ ≤ X + 1.96(σ/√n)) = 0.95.
Note that the sample mean X is no longer in the center of the inequality; instead, the probability statement says something about the population mean µ. The quantities X − 1.96(σ/√n) and X + 1.96(σ/√n), which will vary from sample to sample, are the 95% confidence limits for the population mean. We say that we are 95% confident that the interval

(X − 1.96(σ/√n) , X + 1.96(σ/√n))

will cover µ.

Consider the distribution of serum cholesterol levels for all males in the United States who are hypertensive and who smoke. This distribution is approximately normal with an unknown mean µ and standard deviation σ = 46 mg/100 ml. (Even though the mean might be different, we assume for the moment that σ is the same as it was for the general population of adult males living in the United States.) We are interested in estimating the mean serum cholesterol level of this population. Before we go out and select a random sample, the probability that the interval

(X − 1.96(46/√n) , X + 1.96(46/√n))

covers the true population mean µ is 0.95.
Suppose that we now draw a sample of size 12 from the population of hypertensive smokers and that these men have a mean serum cholesterol level of x¯ = 217 mg/100 ml [179]. Based on this sample, a 95% confidence interval for µ is

(217 − 1.96(46/√12) , 217 + 1.96(46/√12))

or
(191 , 243).
While 217 mg/100 ml is our best guess for the mean serum cholesterol level of the population of male hypertensive smokers, the interval from 191 to 243 provides a range of reasonable values for µ. The width of this interval suggests a lack of precision in the estimate, which is not surprising given the small sample size. To interpret this confidence interval, we say that we are 95% confident that the limits 191 to 243 cover the true mean µ. This statement does not mean that µ is a random variable with a 95% probability of assuming a value within the interval (191, 243), nor does it imply that 95% of the population values lie between 191 and 243. The population mean µ is fixed, not random, and it is either between 191 and 243 or it is not. Instead, “95% confidence” means that if we were to select 100 random samples of size 12 from the population and use these samples to calculate 100 different confidence intervals for µ, the same way we just did, approximately 95 of the intervals would cover the true population mean and 5 would not.

The estimator X is a random variable, not µ. Therefore, the interval

(X − 1.96(σ/√n) , X + 1.96(σ/√n))

is random and has a 95% chance of covering µ before a sample is selected. Since µ has a fixed value, once a sample has been drawn and the confidence limits

(x¯ − 1.96(σ/√n) , x¯ + 1.96(σ/√n))

have been calculated, this interval either contains µ or it does not. There is no longer any probability involved.

Although a 95% confidence interval is used most often in practice, we are not restricted to this choice. We might prefer to have a greater degree of certainty regarding the value of the population mean; in this case, we could choose to construct a 99% confidence interval instead of a 95% interval. Since 99% of the observations in a standard normal distribution lie between −2.58 and 2.58, a 99% confidence interval for µ is

(X − 2.58(σ/√n) , X + 2.58(σ/√n)).

Approximately 99 out of the confidence intervals obtained from 100 independent random samples of size n drawn from the population would cover the true mean µ. As we would expect, the 99% confidence interval is wider than the 95% interval; the larger the range of values we consider, the more confident we are that the interval covers µ.

Instead of generating a 95% confidence interval for the mean serum cholesterol level of male hypertensive smokers, we might prefer to calculate a 99% confidence interval for µ. Using the same sample of 12 hypertensive smokers, we find the limits to be

(217 − 2.58(46/√12) , 217 + 2.58(46/√12))

or
(183 , 251).
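As a rough sketch, both intervals can be reproduced in base R from the summary values given above (x¯ = 217, σ = 46, n = 12); the multipliers 1.96 and 2.58 come from qnorm.

    x.bar <- 217; sigma <- 46; n <- 12
    se <- sigma / sqrt(n)
    x.bar + c(-1, 1) * qnorm(0.975) * se   # 95% CI: about (191, 243)
    x.bar + c(-1, 1) * qnorm(0.995) * se   # 99% CI: about (183, 251)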
We are 99% confident that this interval covers the true mean serum cholesterol level of the population. As previously noted, this interval is wider than the corresponding 95% confidence interval.

The general form for a confidence interval for µ can be obtained by introducing some new notation. Let z1−α/2 be the (1 − α/2)th percentile of the standard normal distribution. By the definition of a percentile, the probability that a standard normal random variable Z takes a value less than z1−α/2 is 1 − α/2, and the probability that it takes a value greater than z1−α/2 is 1 − (1 − α/2) = α/2. If α = 0.05, for example, then z1−0.05/2 = z0.975 = 1.96; if α = 0.01, then z0.995 = 2.58. Using this notation, the general form for a 100% × (1 − α) confidence interval for µ is

(X − z1−α/2 (σ/√n) , X + z1−α/2 (σ/√n)).

This interval has a 100% × (1 − α) chance of covering µ before a random sample is selected.

If we wish to make an interval tighter without reducing the level of confidence, we need more information about the population mean. Therefore, we must select a larger sample. As the sample size n increases, the standard error σ/√n decreases; this results in a narrower confidence interval. Consider the 95% confidence limits X ± 1.96(σ/√n). If we choose a sample of size 10, the confidence limits are X ± 1.96(σ/√10). If the selected sample is of size 100, then the limits are X ± 1.96(σ/√100). For an even larger sample of size 1000, the 95% confidence limits would be X ± 1.96(σ/√1000). Summarizing these calculations, we have:

    n       95% Confidence Limits for µ    Length of Interval
    10      X ± 0.620 σ                    1.240 σ
    100     X ± 0.196 σ                    0.392 σ
    1000    X ± 0.062 σ                    0.124 σ
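A brief sketch of this scaling in R, expressing the half-width of the 95% interval in units of σ:

    n <- c(10, 100, 1000)
    round(qnorm(0.975) / sqrt(n), 3)   # half-widths 0.620, 0.196, 0.062 (times sigma)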
As we select larger and larger random samples, the variability of X – our estimator of the population mean µ – becomes smaller. The inherent variability of the underlying population, measured by σ, is always present, however. We sometimes wish to determine the sample size necessary to calculate a confidence interval of a particular length. In the serum cholesterol level example, for instance, the length of the 99% confidence interval is 251 − 183 = 68 mg/100 ml. How large a sample would we need to reduce the length of this interval to only 20 mg/100 ml? Since the interval is centered around the sample mean x¯ = 217 mg/100 ml, we are interested in the sample size necessary to produce the interval (217 − 10 , 217 + 10) or
(207 , 227).
Recall that the 99% confidence interval is of the form

(217 − 2.58(46/√n) , 217 + 2.58(46/√n)).

Therefore, to find the required sample size n, we must solve the equation

10 = 2.58(46)/√n.

Multiplying both sides of the equality by √n and dividing by 10, we find that

√n = 2.58(46)/10

and

n = 140.8.
We would need a sample of 141 men to reduce the length of the 99% confidence interval to 20 mg/100 ml. Although the sample mean 217 mg/100 ml lies at the center of the interval, it does not play any part in determining its length. The length is a function of σ, n, and the level of confidence. As we previously mentioned, this confidence interval also has a frequency interpretation. Suppose that the true mean serum cholesterol level of the population of male hypertensive smokers is equal to 211 mg/100 ml, the mean level for adult males in the United States [43]. If we were to draw 100 random samples of size 12 from this population and use each one to construct a 95% confidence interval, we would expect that, on average, 95 of the intervals would cover the true population mean µ = 211 and 5 would not. This procedure was simulated and the results illustrated in Figure 9.1. The only quantity that varies from sample to sample is X. Although the centers of the intervals differ, they all have the same length. The confidence intervals that do not contain the true value of µ are marked by a dot; note that exactly five intervals fall into this category.
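As a quick sketch, the sample size calculation above can be carried out in R; the ceiling function reflects the convention of rounding up.

    sigma <- 46
    half.width <- 10                                 # desired 99% CI length of 20 mg/100 ml
    ceiling((qnorm(0.995) * sigma / half.width)^2)   # 141 men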
Summary: Two-Sided Confidence Interval for the Mean, Standard Deviation Known

Two-sided confidence interval for mean µ: (X − z1−α/2 (σ/√n) , X + z1−α/2 (σ/√n))

9.2 One-Sided Confidence Intervals
In some situations we are concerned with either an upper limit for the population mean µ or a lower limit for µ, but not both. Consider the distribution of hemoglobin levels – hemoglobin is an oxygen-bearing protein found in red blood cells – for the population of children under the age of 6 years who have been exposed to high levels of lead. This distribution has an unknown mean µ and standard deviation σ = 0.85 g/100 ml [75]. We know that children who have lead poisoning tend to have much lower levels of hemoglobin than children who do not. Therefore, we might be interested in finding an upper bound for µ. To construct a one-sided confidence interval, we consider the area in only one tail of the standard normal distribution. Referring to Table A.3, we find that 95% of the observations for a standard normal random variable lie above z = −1.645. Therefore,

P(Z ≥ −1.645) = 0.95.

Substituting (X − µ)/(σ/√n) for Z and working through some algebra as we did to derive the two-sided interval, we find that

P(µ ≤ X + 1.645(σ/√n)) = 0.95.

Therefore, X + 1.645(σ/√n) is an upper 95% confidence bound for µ. Similarly, X − 1.645(σ/√n) is the corresponding lower 95% confidence bound.
FIGURE 9.1 Set of 95% confidence intervals constructed from samples of size 12 drawn from a normal population with mean 211 (marked by the vertical line) and standard deviation 46
Suppose that we select a sample of 74 children who have been exposed to high levels of lead; these children have a sample mean hemoglobin level of x¯ = 10.6 g/100 ml [180]. Based on this sample, a one-sided 95% confidence interval for µ – the upper bound only – is

µ ≤ 10.6 + 1.645(0.85/√74)
  ≤ 10.8.
We are 95% confident that the true mean hemoglobin level for this population of children is at most 10.8 g/100 ml. In reality, since the value of µ is fixed, the true mean is either less than 10.8 or it is not. However, if we were to select 100 random samples of size 74 and use each one to construct a one-sided 95% confidence interval, approximately 95 of the intervals would cover the true population mean µ.

Summary: One-Sided Confidence Interval for the Mean, Standard Deviation Known

One-sided confidence interval for mean µ (lower bound): µ ≥ X − z1−α (σ/√n)
One-sided confidence interval for mean µ (upper bound): µ ≤ X + z1−α (σ/√n)
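A sketch of the upper bound calculation in R, using the summary values above (x¯ = 10.6, σ = 0.85, n = 74):

    x.bar <- 10.6; sigma <- 0.85; n <- 74
    x.bar + qnorm(0.95) * sigma / sqrt(n)   # upper 95% confidence bound: about 10.8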
9.3 Student’s t Distribution
When computing confidence intervals for an unknown population mean µ, we have up to this point assumed that σ, the population standard deviation, was known. In reality this is unlikely to be the case. If µ is unknown, then σ is probably unknown as well. In this situation, confidence intervals are calculated in much the same way as we have already seen. Instead of using the standard normal distribution, however, the analysis depends on a probability distribution known as Student’s t distribution. (The name “Student” is the pseudonym of the statistician who originally discovered this distribution.)

To construct a two-sided confidence interval for a population mean µ, we began by noting that

Z = (X − µ)/(σ/√n)

has an approximate standard normal distribution if n is sufficiently large. When the population standard deviation is not known, it might seem logical to substitute s for σ, where s is the standard deviation of a sample drawn from the population. This is, in fact, what is done. However, the ratio

t = (X − µ)/(s/√n)
does not have a standard normal distribution. In addition to the sampling variability inherent in X – which we are using as an estimator of the population mean µ – there is also variability in s. The value of s is likely to change from sample to sample. Therefore, we must account for the fact that s may not be a reliable estimate of σ, especially when the sample size is small.

FIGURE 9.2 The standard normal distribution and Student’s t distribution with 1 degree of freedom

If X is normally distributed and a sample of size n is randomly chosen from this underlying population, then the probability distribution of the random variable

t = (X − µ)/(s/√n)
is known as Student’s t distribution with n − 1 degrees of freedom. We represent this distribution using the notation tn−1. Like the standard normal distribution, the t distribution is unimodal and symmetric around its mean of 0. The total area under the curve is equal to 1. However, it has somewhat thicker tails than the normal distribution; extreme values are more likely to occur with the t distribution than with the standard normal. This difference is illustrated in Figure 9.2. The shape of the t distribution reflects the extra variability introduced by the estimate s. In addition, the t distribution has a property called the degrees of freedom, abbreviated df. The degrees of freedom measure the amount of information available in the data that can be used to estimate σ²; hence, they measure the reliability of s² as an estimate of σ². (Heuristically, the degrees of freedom are n − 1 rather than n because we lose 1 df by estimating the sample mean x¯ in order to calculate the sample standard deviation s.) For each possible value of the degrees of freedom, there is a different t distribution. The distributions with smaller degrees of freedom are more spread out; as df increases, the t distribution approaches the standard normal. This occurs because, as the sample size increases, s becomes a more reliable estimate of σ. If n is very large, knowing the value of s is nearly equivalent to knowing σ.
Since there is a different t distribution for every value of the degrees of freedom, it would be quite cumbersome to have a complete table of areas corresponding to each one. As a result, we typically rely on either a computer program or a condensed table that lists the areas under the curve for selected percentiles of the distribution only; for example, a table might contain the upper 5.0, 2.5, 1.0, 0.5, and 0.05% of the distributions. When a computer is not available, condensed tables are sufficient for most applications involving the construction of confidence intervals. Table A.4 in Statistical Tables is a condensed table of areas computed for the family of t distributions. For a particular value of df, the entry in the table represents the value of tn−1 that cuts off the specified area in the upper tail of the distribution. Given a t distribution with 10 degrees of freedom, for instance, t10 = 2.228 cuts off the upper 2.5% of the area under the curve; it is the 97.5th percentile of the distribution. Since the distribution is symmetric, t10 = −2.228 marks off the lower 2.5%. The values of tn−1 representing the 97.5th percentiles of the t distributions with various degrees of freedom are listed in the following table.

    df (n − 1)    tn−1,0.975
         2          4.303
         5          2.571
        10          2.228
        20          2.086
        30          2.042
        40          2.021
        60          2.000
       120          1.980
        ∞           1.960
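These percentiles are easy to verify with software; for instance, in R:

    qt(0.975, df = c(2, 5, 10, 20, 30, 40, 60, 120))   # 4.303, 2.571, 2.228, ..., 1.980
    qnorm(0.975)                                       # 1.96, the limiting value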
For the standard normal curve, z0.975 = 1.96 is the 97.5th percentile marking the upper 2.5% of the distribution. Observe that, as n increases, tn−1,0.975 approaches this value. In fact, when we have more than 30 degrees of freedom, we are able to substitute the standard normal distribution for the t and be off in our calculations by less than 5%.

Consider a random sample of 10 children selected from the population of infants receiving antacids that contain aluminum. These antacids are often used to treat peptic or digestive disorders. The distribution of plasma aluminum levels is known to be approximately normal; however, its mean µ and standard deviation σ are not known. We wish to estimate the mean plasma aluminum level for this population. For the random sample of 10 children, the mean aluminum level is x¯ = 37.2 µg/l and the sample standard deviation is s = 7.13 µg/l [181]. Since the population standard deviation σ is not known, we must use the t distribution to find 95% confidence limits for µ. For a t distribution with 10 − 1 = 9 degrees of freedom, the 97.5th percentile is 2.262, and 95% of the observations lie between −2.262 and 2.262. Replacing σ with s, a 95% confidence interval for the population mean µ is

(X − 2.262(s/√10) , X + 2.262(s/√10)).

Substituting in the values of x¯ and s, the interval becomes

(37.2 − 2.262(7.13/√10) , 37.2 + 2.262(7.13/√10))

or

(32.1 , 42.3).
We are 95% confident that these limits cover the true mean plasma aluminum level for the population of infants receiving antacids. If we are given the additional piece of information that the mean plasma aluminum level for the population of infants not receiving antacids is 4.13 µg/l – not a plausible value of µ for the infants who do receive them, according to the 95% confidence interval – then this would suggest that being given antacids greatly increases the plasma aluminum levels of children.

If the population standard deviation σ had been known and had been equal to the sample value of 7.13 µg/l, then the 95% confidence interval for µ would have been

(37.2 − 1.96(7.13/√10) , 37.2 + 1.96(7.13/√10))

or
(32.8 , 41.6).
In this case, the confidence interval is slightly shorter. Most of the time, confidence intervals based on the t distribution are longer than the corresponding intervals based on the standard normal distribution because, for a given level of confidence, the relevant value of tn−1 is larger than z. This generalization does not always apply, however. Because of the nature of sampling variability, it is possible that the value of the estimate s will be considerably smaller than σ for a given sample.

In a previous example, we examined the distribution of serum cholesterol levels for all males in the United States who are hypertensive and who smoke. Recall that the standard deviation of this population was assumed to be 46 mg/100 ml. On the left-hand side, Figure 9.3 contains the 95% confidence intervals for µ that were calculated from 100 random samples and previously displayed in Figure 9.1. The right-hand side of the figure shows 100 additional intervals that were computed using the same samples; in each case, however, the standard deviation was not assumed to be known. Once again, 95 of the intervals contain the true mean µ, and the other 5 do not. Note that this time, however, the intervals vary in length.
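As a sketch, the t-based interval can be computed in R from the summary statistics above (x¯ = 37.2, s = 7.13, n = 10); given the raw measurements, t.test would report the same limits.

    x.bar <- 37.2; s <- 7.13; n <- 10
    x.bar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)   # about (32.1, 42.3)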
Summary: Two-Sided Confidence Interval for the Mean, Standard Deviation Unknown

Two-sided confidence interval for mean µ: (X − tn−1,1−α/2 (s/√n) , X + tn−1,1−α/2 (s/√n))

9.4 Further Applications
Consider the distribution of heights for the population of individuals between the ages of 12 and 40 who suffer from fetal alcohol syndrome. Fetal alcohol syndrome refers to the severe end of the spectrum of disabilities caused by maternal alcohol use during pregnancy. The distribution of heights is approximately normal with unknown mean µ. We wish to find both a point estimate and a confidence interval for µ. The confidence interval provides a range of reasonable values for the parameter of interest. When constructing a confidence interval for the mean of a continuous random variable, the technique used differs depending on whether the standard deviation of the underlying population is known or not known. For the height data, the standard deviation is assumed to be σ = 6 cm [182].
FIGURE 9.3 Two sets of 95% confidence intervals constructed from samples of size 12 drawn from normal populations with mean 211 (marked by the vertical lines), one with standard deviation 46 and the other with standard deviation unknown
TABLE 9.1 R output displaying a 95% confidence interval for true mean height of individuals with fetal alcohol syndrome, standard deviation known

     n   x.bar   std         se      upper      lower
    31   147.4     6   1.077632   149.5122   145.2878
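The commands behind Table 9.1 are not shown in the text, but output of this form could be produced with a few lines of base R; the variable names here (x.bar, std, and so on) simply mirror the column headings and are otherwise arbitrary.

    n <- 31; x.bar <- 147.4; std <- 6
    se <- std / sqrt(n)
    data.frame(n, x.bar, std, se,
               upper = x.bar + qnorm(0.975) * se,
               lower = x.bar - qnorm(0.975) * se)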
Therefore, we use the standard normal distribution to help us construct a 95% confidence interval. Before a sample is drawn from this population, the interval

(X − 1.96(6/√n) , X + 1.96(6/√n))

has a 95% chance of covering the true population mean µ. A random sample of 31 patients is selected from the underlying population; the mean height for these individuals is x¯ = 147.4 cm. This is the point estimate — our best guess — for the population mean µ. A 95% confidence interval based on the sample is

(147.4 − 1.96(6/√31) , 147.4 + 1.96(6/√31))

or
(145.3 , 149.5).
We are 95% confident that these limits cover the true mean height for the population of individuals between the ages of 12 and 40 who suffer from fetal alcohol syndrome. We can think of this as a range of reasonable values for µ, all of which are compatible with the sample data. In reality, however, the fixed value of µ is either between 145.3 cm and 149.5 cm or it is not. Rather than generate the confidence interval by hand, we could have used a computer to do the calculations for us. Table 9.1 shows the relevant output from R. In addition to the sample size, the table displays the sample mean, the assumed standard deviation, the standard error of the mean, and the upper and lower bounds of the 95% confidence interval.

As a second example, methylphenidate is a drug widely used in the treatment of attention-deficit disorder. As part of a crossover study, 10 children between the ages of 7 and 12 who suffered from this disorder were assigned to receive the drug and 10 were given a placebo [183]. After a fixed period of time, treatment was withdrawn from all 20 children. Subsequently, patients were given the alternative treatment; children who had originally received methylphenidate were given the placebo, and those who had received the placebo now got the drug. (This is what is meant by a crossover study.) Measures of each child’s attention and behavioral status, both while taking the drug and while taking placebo, were obtained using an instrument called the Parent Rating Scale. Distributions of these scores are approximately normal with unknown mean and standard deviation. In general, lower scores indicate an increase in attention. We wish to estimate the mean attention rating scores for children with attention-deficit disorder when taking methylphenidate, and when taking placebo.

Since the standard deviations are not known for either population, we use the t distribution to help us construct 95% confidence intervals. For a t distribution with 20 − 1 = 19 degrees of freedom, 95% of the observations lie between −2.093 and 2.093. Therefore, before a sample of size 20 is drawn from the population, the interval

(X − 2.093(s/√20) , X + 2.093(s/√20))

has a 95% chance of covering the true mean µ.
TABLE 9.2 Stata output displaying a 95% confidence interval for true mean attention rating score for children taking methylphenidate, standard deviation unknown

        Variable |  Obs    Mean    Std. Err.    [95% Conf. Interval]
    -------------+---------------------------------------------------
    score_methyl |   20    10.8     .6484597     9.442758    12.15724
TABLE 9.3 Stata output displaying a 95% confidence interval for true mean attention rating score for children taking placebo, standard deviation unknown

         Variable |  Obs    Mean    Std. Err.    [95% Conf. Interval]
    --------------+--------------------------------------------------
    score_placebo |   20      14    1.073313     11.75353    16.24647
The random sample of 20 children enrolled in the study has mean attention rating score x¯M = 10.8 and standard deviation sM = 2.9 when taking methylphenidate, and mean rating score x¯P = 14.0 and standard deviation sP = 4.8 when taking the placebo. Therefore, a 95% confidence interval for µM, the mean attention rating score for children taking the drug, is

(10.8 − 2.093(2.9/√20) , 10.8 + 2.093(2.9/√20))

or
(9.4 , 12.2),
and a 95% confidence interval for µP, the mean rating score for children taking the placebo, is

(14.0 − 2.093(4.8/√20) , 14.0 + 2.093(4.8/√20))

or
(11.8 , 16.2).
The relevant output from Stata for calculating these intervals is displayed in Tables 9.2 and 9.3. Looking at the intervals, it appears that the mean attention rating score is likely to be lower when children with attention-deficit disorder are taking methylphenidate, implying improved attention. However, there is some overlap between the two intervals.
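Both intervals can be sketched in R from the summary statistics reported above; with the raw scores, t.test would be the natural tool. The helper function ci.t is not from the text, just a convenience for this example.

    ci.t <- function(x.bar, s, n, level = 0.95) {
      t.mult <- qt(1 - (1 - level) / 2, df = n - 1)
      x.bar + c(-1, 1) * t.mult * s / sqrt(n)
    }
    ci.t(10.8, 2.9, 20)   # methylphenidate: about (9.4, 12.2)
    ci.t(14.0, 4.8, 20)   # placebo: about (11.8, 16.2)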
9.5 Review Exercises

1. Explain the difference between point and interval estimation.

2. Describe the 95% confidence interval for a population mean µ. How is the interval interpreted?

3. What are the factors that affect the length of a confidence interval for a mean? Explain briefly.

4. Describe the similarities and differences between the t distribution and the standard normal distribution. If you are constructing a confidence interval, when would you use one rather than the other?

5. The distributions of systolic and diastolic blood pressures for female diabetics between the ages of 30 and 34 have unknown means. However, their standard deviations are σs = 11.8 mm Hg and σd = 9.1 mm Hg, respectively [184].
   (a) A random sample of 10 women is selected from this population. The mean systolic blood pressure for the sample is x¯s = 130 mm Hg. Calculate a two-sided 95% confidence interval for µs, the true mean systolic blood pressure.
   (b) Interpret this confidence interval.
   (c) The mean diastolic blood pressure for the sample of size 10 is x¯d = 84 mm Hg. Find a two-sided 90% confidence interval for µd, the true mean diastolic blood pressure of the population.
   (d) Calculate a two-sided 99% confidence interval for µd.
   (e) How does the 99% confidence interval compare to the 90% interval?

6. Consider the t distribution with 5 degrees of freedom.
   (a) What proportion of the area under the curve lies to the right of t = 2.015?
   (b) What proportion of the area lies to the left of t = −3.365?
   (c) What proportion of the area lies between t = −4.032 and t = 4.032?
   (d) What is the 97.5th percentile of this t distribution?

7. Consider the t distribution with 21 degrees of freedom.
   (a) What proportion of the area under the curve lies to the left of t = −2.518?
   (b) What proportion of the area lies to the right of t = 1.323?
   (c) What proportion of the area lies between t = −1.721 and t = 2.831?
   (d) What is the 2.5th percentile of this t distribution?
8. Before beginning a study investigating the ability of the drug heparin to prevent bronchoconstriction, baseline values of pulmonary function were measured for a sample of 12 individuals with a history of exercise-induced asthma [185]. The mean value of forced vital capacity (fvc) for the sample is x¯ 1 = 4.49 liters and the standard deviation is s1 = 0.83 liters; the mean forced expiratory volume in 1 second (fev1 ) is x¯ 2 = 3.71 liters and the standard deviation is s2 = 0.62 liters.
   (a) Compute a two-sided 95% confidence interval for µ1, the true population mean fvc.
   (b) Rather than a 95% interval, construct a 90% confidence interval for the true mean fvc. How does the length of the interval change?
   (c) Compute a 95% confidence interval for µ2, the true population mean fev1.
   (d) In order to construct these confidence intervals, what assumption is made about the underlying distributions of fvc and fev1?

9. For the population of infants undergoing fetal surgery for congenital anomalies, the distribution of gestational ages at birth is approximately normal with unknown mean µ and standard deviation σ. A random sample of 14 such infants has mean gestational age x¯ = 29.6 weeks and standard deviation s = 3.6 weeks [186].
   (a) Construct a 95% confidence interval for the true population mean µ.
   (b) What is the length of this interval?
   (c) How large a sample would be required so that the 95% confidence interval has length 3 weeks? Assume that the population standard deviation σ is known and that σ = 3.6 weeks.
   (d) How large a sample would be needed for the 95% confidence interval to have length 2 weeks? Again assume that σ is known.

10. Percentages of ideal body weight were determined for 18 randomly selected insulin-dependent diabetics and recorded below [187]. A percentage of 120 means that an individual weighs 20% more than his or her ideal body weight; a percentage of 95 means that the individual weighs 5% less than the ideal.
    107   119    99   114   120   104    88   114   124
    116   101   121   152   100   125   114    95   117   (%)
   (a) Compute a two-sided 95% confidence interval for the true mean percentage of ideal body weight for the population of insulin-dependent diabetics.
   (b) Does this confidence interval contain the value 100%? What does this tell you?

11. When eight persons in Massachusetts experienced an unexplained episode of vitamin D intoxication that required hospitalization, it was suggested that these unusual occurrences might be the result of excessive supplementation of dairy milk [71]. Blood levels of calcium and albumin for each individual at the time of hospital admission are provided below.

    Calcium (mmol/l)   Albumin (g/l)
         2.92               43
         3.84               42
         2.99               40
         2.67               42
         3.17               38
         3.74               34
         3.44               42
   (a) Construct a one-sided 95% confidence interval — a lower bound — for the true mean calcium level of individuals who experience vitamin D intoxication.
   (b) Compute a 95% lower confidence bound for the true mean albumin level of this group.
   (c) For healthy individuals, the normal range of calcium values is 2.12 to 2.74 mmol/l, and the range of albumin levels is 32 to 55 g/l. Do you believe that patients suffering from vitamin D intoxication have normal blood levels of calcium and albumin? Explain.
12. Figure 9.4 displays estimates of the annual number of nonfatal firearm injuries in the United States from 2001 through 2017, along with 95% confidence intervals [188].
   (a) What does this figure tell you about changes in the number of nonfatal firearm injuries over time?
   (b) What does this figure tell you about changes in the precision with which the number of nonfatal firearm injuries is estimated over time?

13. Serum zinc levels for 462 males between the ages of 15 and 17 are saved under the variable name zinc in the data set serum_zinc [75]. The units of measurement for serum zinc level are micrograms per deciliter.
   (a) Find a two-sided 95% confidence interval for µ, the true mean serum zinc level for the population of males between the ages of 15 and 17 years.
   (b) Interpret this confidence interval.
   (c) Calculate a 90% confidence interval for µ.
   (d) How does the 90% confidence interval compare to the 95% interval?

14. The data set lowbwt contains information recorded for a sample of 100 low birth weight infants born in two teaching hospitals in Boston, Massachusetts [81]. Measurements of systolic blood pressure are saved under the variable name sbp, while indicators of sex – where 1 represents a male and 0 a female – are saved under the name sex.
   (a) Compute a 95% confidence interval for the true mean systolic blood pressure of male low birth weight infants.
   (b) Calculate a 95% confidence interval for the true mean systolic blood pressure of female low birth weight infants.
   (c) Do you think it is possible that males and females have the same mean systolic blood pressure? Explain briefly.

15. The Bayley Scales of Infant Development yield scores on two indices – the Psychomotor Development Index (pdi) and the Mental Development Index (mdi) – which can be used to assess a child’s level of functioning in each of these areas at approximately one year of age. Among normal healthy infants, both indices have a mean value of 100. As part of a study assessing the development and neurologic status of children who underwent reparative heart surgery during the first three months of life, the Bayley Scales were administered to a sample of one-year-old infants born with congenital heart disease. The data are contained in the data set bayley [189]. pdi scores are saved under the variable name pdi, while mdi scores are saved under mdi.
   (a) Calculate a point estimate and 95% confidence interval for the true mean pdi score for children born with congenital heart disease who undergo reparative heart surgery during the first three months of life.
   (b) Calculate a point estimate and 95% confidence interval for the true mean mdi score for children born with congenital heart disease who undergo reparative heart surgery during the first three months of life.
   (c) Do either of these confidence intervals contain the value 100? What does this tell you?

FIGURE 9.4 Estimates of the annual number of nonfatal firearm injuries, 2001–2017
10 Hypothesis Testing
CONTENTS
10.1 General Concepts
10.2 Two-Sided Tests of Hypothesis
10.3 One-Sided Tests of Hypothesis
10.4 Types of Error
10.5 Power
10.6 Sample Size Estimation
10.7 Further Applications
10.8 Review Exercises
In our study of confidence intervals, we encountered the distribution of serum cholesterol levels for the population of males in the United States who are hypertensive and who smoke. This distribution is approximately normal with unknown mean µ. However, we do know that the mean serum cholesterol level for the general population of all 20- to 74-year-old males in the United States is 211 mg/100 ml [43]. Therefore, we might wonder whether the mean cholesterol level of the subpopulation of males who are hypertensive smokers is 211 mg/100 ml as well. If we select a random sample of size 25 from this subpopulation and the mean serum cholesterol level for the sample is x¯ = 220 mg/100 ml, is this sample mean compatible with a hypothesized mean of 211 mg/100 ml? We know that some amount of sampling variability is to be expected. What if the sample mean is 230 mg/100 ml, or 250 mg/100 ml? How far from 211 must x¯ be before we can conclude that µ is really equal to some other value?
10.1 General Concepts
We again concentrate on drawing some conclusion about a population parameter – the mean of a continuous random variable – using the information contained in a sample of observations drawn from that population. As we saw in the previous chapter, one approach to statistical inference is to construct a confidence interval for µ. Another is to conduct a hypothesis test. To perform a hypothesis test, we begin by claiming that the mean of the population we are studying is equal to some postulated value µ0 . This statement about the value of the population parameter is called the null hypothesis, represented by H0 . If we want to test whether the mean serum cholesterol level of the subpopulation of hypertensive smokers is equal to the mean of the general population of 20- to 74-year-old males, for instance, the null hypothesis would be H0 : µ = µ0 = 211 mg/100 ml.
The alternative hypothesis, represented by HA, is a second statement that contradicts H0. In this case, we have

HA : µ ≠ 211 mg/100 ml.

Together the null and the alternative hypotheses cover all possible values of the population mean µ. Consequently, one of the two statements must be true.

After formulating the hypotheses, we draw a random sample of size n from the population of interest. For the hypertensive smokers, we selected a sample of size 12. We compare the mean of this sample, x¯, to the postulated mean µ0; specifically, we want to know whether the difference between the sample mean and the hypothesized mean is too large to be attributed to chance or sampling variability alone. If there is sufficient evidence that the sample did not come from a population with mean µ0, then we reject the null hypothesis. This occurs when, given that H0 is true, the probability of obtaining a sample mean as extreme or more extreme than the observed value x¯ – more extreme meaning farther away from the value µ0 – is sufficiently small. In this case, the data are not compatible with the null hypothesis; they are more supportive of the alternative. We therefore conclude that the population mean could not be µ0. Such a test result is said to be statistically significant. Note that statistical significance does not imply clinical or scientific significance; a statistically significant test result could actually have little practical consequence.

If there is insufficient evidence to doubt the validity of the null hypothesis, then we cannot reject this claim. We do not conclude that the population mean is different from µ0. However, we do not say that we accept H0; the test does not prove the null hypothesis. It is still possible that the population mean is some value other than µ0, but that the random sample selected does not confirm this. Such an event could occur, for instance, if the sample chosen is too small. This point is discussed further later in this chapter.

We stated above that if the probability of obtaining a sample mean as extreme or more extreme than the observed x¯ is sufficiently small, then we reject the null hypothesis. But what is a “sufficiently small” probability? In many applications, a probability of 0.05 is chosen [190]. Thus, we reject H0 when the chance that the sample could have come from a population with mean µ0 is less than 5%. This implies that we reject incorrectly 5% of the time; given many repeated tests of significance, five times out of one hundred we will erroneously reject the null hypothesis when it is true. To be more conservative, a probability of 0.01 is sometimes chosen. In this case we mistakenly reject H0 when it is true only 1% of the time. If we are willing to be less conservative, a probability of 0.10 might be used. The probability that we choose – whether 0.05, 0.01, or some other value – is known as the significance level of the hypothesis test. The significance level is denoted by the Greek letter α and must be specified in advance, before the test is actually carried out.

In many ways, a test of hypothesis can be compared to a criminal trial by jury in the United States. The individual on trial is either innocent or guilty, but is assumed innocent by law. After evidence pertaining to the case has been presented, the jury finds the defendant either guilty or not guilty. If the defendant is innocent and the decision of the jury is that he or she is not guilty, then the right verdict has been reached.
The verdict is also correct if the defendant is guilty and is convicted of the crime.

                           Defendant
    Verdict of Jury    Innocent     Guilty
    Not Guilty         Correct      Incorrect
    Guilty             Incorrect    Correct
Analogously, the true population mean is either µ0 or it is not µ0 . We begin by assuming that the null hypothesis H0 : µ = µ0
is correct, and we consider the “evidence” that is presented in the form of a sample of size n. Based on our findings, the null hypothesis is either rejected or it is not rejected. Again there are two situations in which the conclusion drawn is correct: when the population mean is µ0 and the null hypothesis is not rejected, and when the population mean is not µ0 and H0 is rejected.
                           Population
    Test Result        µ = µ0       µ ≠ µ0
    Do Not Reject      Correct      Incorrect
    Reject             Incorrect    Correct
(This table should remind you of the similar table summarizing screening test results in Chapter 6. There, instead of the null and alternative hypotheses, we had members of a population with and without disease. Instead of rejecting or not rejecting the null hypothesis, individuals had either positive or negative screening test results. If an individual has the disease, a positive screening test result is correct and a negative test is incorrect. If they do not have the disease, a negative test result is correct and a positive test is incorrect.)

Like our legal system, the process of hypothesis testing is not perfect; there are two kinds of errors that can be made. In particular, we could either reject the null hypothesis when it is true and µ is equal to µ0, or fail to reject it when it is false and µ is not equal to µ0. These two types of errors – which have much in common with the false positive and false negative results which occur in diagnostic testing – are discussed in more detail in Section 10.4.

The probability of obtaining a mean as extreme or more extreme than the observed sample mean x¯, given that the null hypothesis H0 : µ = µ0 is true, is called the p-value of the test. The p-value is compared to the predetermined significance level α to decide whether the null hypothesis should be rejected. If p is less than α, we reject H0. If p is greater than or equal to α, we do not reject H0. In addition to the conclusion of the test, the p-value itself is typically reported in the literature.

Unfortunately, p-values are commonly misused and misinterpreted in practice. A p-value is a measure of the strength of the evidence against the null hypothesis, with smaller p-values indicating stronger evidence against H0. Because it is a probability, a p-value is continuous. Even though it is customary to report whether results are statistically significant or not, we must take care not to rely exclusively on a rigid division of study conclusions into two distinct boxes. In response to what many consider to be an over-simplification when reporting results, the American Statistical Association released a statement on statistical significance and p-values which clarifies several important points about what p-values can and cannot do in practice [191].

1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
It is necessary to keep these principles in mind when making inference. In addition to the results of a hypothesis test (or confidence interval), we should also consider prior scientific evidence, study design, the way in which the data were collected, and the choice of an analytical method. We will continue to discuss these issues throughout the text.
10.2 Two-Sided Tests of Hypothesis
To conduct a test of hypothesis, we again draw on our knowledge of the sampling distribution of the mean. Assume that the continuous random variable X has mean µ0 and known standard deviation σ. Thus, according to the central limit theorem,

Z = (X̄ − µ0)/(σ/√n)

has an approximate standard normal distribution if the value of n is sufficiently large. For a given sample with mean x̄, we can calculate the corresponding outcome of Z, called the test statistic. We then use either a computer program or a table of the standard normal curve – such as Table A.3 in Statistical Tables – to determine the probability of obtaining a value of Z that is as extreme or more extreme than the one observed. By "more extreme," we mean farther away from µ0 in the direction of the alternative hypothesis. Because it relies on the standard normal distribution, a test of this kind is called a one-sample z-test.

When the population standard deviation is not known, we substitute the sample value s for σ. If the underlying population is normally distributed, the random variable

t = (X̄ − µ0)/(s/√n)

has a t distribution with n − 1 degrees of freedom. In this case, we can calculate the outcome of t corresponding to a given x̄ and consult our computer program or Table A.4 to find the probability of obtaining a sample mean that is more extreme than the one observed. This procedure is known as a one-sample t-test.

To illustrate the process of hypothesis testing, again consider the distribution of serum cholesterol levels for adult males in the United States who are hypertensive and who smoke. The standard deviation of this distribution is assumed to be σ = 46 mg/100 ml; the null hypothesis to be tested is

H0 : µ = 211 mg/100 ml,

where µ0 = 211 mg/100 ml is the mean serum cholesterol level for all 20- to 74-year-old males. Since the mean of the subpopulation of hypertensive smokers could be either larger than µ0 or smaller than µ0, we are concerned with deviations that occur in either direction. As a result, we conduct what is called a two-sided test; the alternative hypothesis is

HA : µ ≠ 211 mg/100 ml.

The test will be conducted at the 0.05 level of significance. The previously mentioned random sample of 12 hypertensive smokers has mean serum cholesterol level x̄ = 217 mg/100 ml [179]. Is it likely that this sample comes from a population with mean 211 mg/100 ml? To answer this question, we compute the test statistic

z = (x̄ − µ0)/(σ/√n) = (217 − 211)/(46/√12) = 0.45.
If the null hypothesis is true, then this statistic is the outcome of a standard normal random variable. According to Table A.3, the area under the standard normal curve to the right of z = 0.45 – which is the probability of observing Z = 0.45 or anything larger, given that H0 is true – is 0.326. The area to the left of z = −0.45 is 0.326 as well. Thus, the area in the two tails of the standard normal distribution sums to 0.652; this is the p-value of the test. Since p > 0.05, we do not reject the null hypothesis. Based on this sample, the evidence is not sufficient to conclude that the mean serum cholesterol level of the population of hypertensive smokers is different from 211 mg/100 ml.

Although it may not be immediately obvious, there is actually a mathematical equivalence between confidence intervals and tests of hypothesis. Because we conducted a two-sided test, any value of z that is between −1.96 and 1.96 would result in a p-value greater than 0.05. (The outcome 0.45 is just one such value.) In each case, the null hypothesis would fail to be rejected. On the other hand, H0 would be rejected for any value of z that is either less than −1.96 or greater than 1.96. Because they indicate when we reject and when we do not, the numbers −1.96 and 1.96 are called the critical values of the test statistic.

Another way to look at this is to note that the null hypothesis would fail to be rejected when µ0 is any value that lies within the 95% confidence interval for µ. Recall that in Chapter 9 we found a 95% confidence interval for the mean serum cholesterol level of hypertensive smokers to be (191, 243). Any value of µ0 that is contained within this interval would result in a test statistic between −1.96 and 1.96. Therefore, if the null hypothesis had been H0 : µ = 240 mg/100 ml, H0 would not have been rejected; the same is true for µ0 = 195 mg/100 ml. In contrast, any value of µ0 that lies outside of the 95% confidence interval for µ – such as µ0 = 260 mg/100 ml – would result in H0 being rejected at the α = 0.05 level of significance. These values produce test statistics either less than −1.96 or greater than 1.96.

Although confidence intervals and tests of hypothesis lead us to the same conclusions, the information provided by each one is somewhat different. The confidence interval supplies a range of reasonable values for the parameter µ and tells us something about the uncertainty in our point estimate x̄. The hypothesis test helps us to decide whether the postulated value of the mean is likely to be incorrect, and provides a specific p-value quantifying the amount of evidence against the null hypothesis.

Returning to the test itself, the value µ0 = 211 mg/100 ml was selected for the null hypothesis because it is the mean serum cholesterol level of the population of all 20- to 74-year-old males. Consequently, H0 claims that the mean serum cholesterol level of males who are hypertensive smokers is identical to the mean cholesterol level of the general population of males. The hypothesis was established with an interest in obtaining evidence that would cause it to be rejected in favor of the alternative; a rejection would have implied that the mean serum cholesterol level of male hypertensive smokers is not equal to the mean of the population as a whole.
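The arithmetic above is easy to check with statistical software. As a minimal sketch in R – the numbers are taken from the example, and the variable names are our own – the two-sided p-value can be computed directly from the standard normal distribution:

    x.bar <- 217     # sample mean serum cholesterol (mg/100 ml)
    mu.0  <- 211     # hypothesized population mean
    sigma <- 46      # known population standard deviation
    n     <- 12      # sample size

    z <- (x.bar - mu.0) / (sigma / sqrt(n))  # test statistic
    p <- 2 * pnorm(-abs(z))                  # area in both tails
    round(c(z = z, p = p), 3)

This returns z = 0.452 and p = 0.651, matching the values read from Table A.3 up to rounding.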
As a second example, consider the random sample of 10 children selected from the population of infants receiving antacids that contain aluminum. The underlying distribution of plasma aluminum levels for this population is approximately normal with an unknown mean µ and standard deviation σ. However, we do know that the mean plasma aluminum level for the sample of size 10 is x̄ = 37.20 µg/l and that its standard deviation is s = 7.13 µg/l [181]. Furthermore, the mean plasma aluminum level for the population of infants not receiving antacids is 4.13 µg/l. Is it possible that the data in our sample could have come from a population with mean µ0 = 4.13 µg/l? To find out, we conduct a test of hypothesis; the null hypothesis is

H0 : µ = 4.13 µg/l,

and the alternative is

HA : µ ≠ 4.13 µg/l.

We are interested in deviations from the mean which could occur in either direction; we would want to know if µ is actually larger than 4.13 or if it is smaller. Therefore, we conduct a two-sided test at the α = 0.05 level of significance. Because we do not know the population standard deviation σ, we use a one-sample t-test rather than a one-sample z-test. The test statistic is

t = (x̄ − µ0)/(s/√n) = (37.20 − 4.13)/(7.13/√10) = 14.67.

If the null hypothesis is true, this outcome follows a t distribution with 10 − 1 = 9 df. Consulting Table A.4, we observe that the total area under the curve to the right of t9 = 14.67 and to the left of t9 = −14.67 is less than 2(0.0005) = 0.001. Therefore, p < 0.05 and we reject the null hypothesis H0 : µ = 4.13 µg/l. This sample of infants provides evidence that the mean plasma aluminum level of children receiving antacids is not equal to the mean aluminum level of children who do not receive them. In fact, since the sample mean x̄ is larger than µ0, the true mean aluminum level is higher than 4.13 µg/l.

Summary: One-Sample, Two-Sided Hypothesis Tests for the Mean

                         Standard deviation known        Standard deviation unknown
Null hypothesis          H0 : µ = µ0, or                 H0 : µ = µ0, or
                         H0 : µ − µ0 = 0                 H0 : µ − µ0 = 0
Alternative hypothesis   HA : µ ≠ µ0, or                 HA : µ ≠ µ0, or
                         HA : µ − µ0 ≠ 0                 HA : µ − µ0 ≠ 0
Test                     One-sample, two-sided z-test    One-sample, two-sided t-test
Test statistic           Z = (X̄ − µ0)/(σ/√n)            t = (X̄ − µ0)/(s/√n)
Distribution of          Standard normal                 t distribution with n − 1
test statistic                                           degrees of freedom
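When only the summary statistics are available, as here, the one-sample t-test can be computed directly; with the raw measurements in hand, R's built-in t.test function would be used instead. A sketch using the values from the text:

    x.bar <- 37.20   # sample mean plasma aluminum (µg/l)
    mu.0  <- 4.13    # hypothesized population mean
    s     <- 7.13    # sample standard deviation
    n     <- 10      # sample size

    t.stat <- (x.bar - mu.0) / (s / sqrt(n))  # test statistic, 14.67
    p <- 2 * pt(-abs(t.stat), df = n - 1)     # two-sided p-value, 9 df
    c(t = t.stat, p = p)

The p-value is far below 0.001, in agreement with the bound obtained from Table A.4.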
10.3 One-Sided Tests of Hypothesis
Before we conduct a test of hypothesis, we must decide whether we are concerned with deviations from µ0 that could occur in both directions – meaning either higher or lower than µ0 – or in one direction only. This choice determines whether we consider the area in two tails of the appropriate probability distribution when calculating a p-value, or the area in a single tail. The decision must be made before a random sample is selected; it should not be influenced by the outcome of the sample. If prior knowledge indicates that µ cannot be less than µ0, then the only values of x̄ that will provide evidence against the null hypothesis H0 : µ = µ0 are those which are much larger than µ0. In a situation such as this, the null hypothesis is more properly stated as

H0 : µ ≤ µ0

and the alternative hypothesis as

HA : µ > µ0.

For example, most people would agree that it is unreasonable to believe that exposure to a toxic substance – such as ambient carbon monoxide or sulfur dioxide – could possibly be beneficial to humans. Therefore, we anticipate only harmful effects and conduct a one-sided test. A two-sided test is always the more conservative choice, however; in general, the p-value of a two-sided test is twice as large as that of a one-sided test.

Consider the distribution of hemoglobin levels for the population of children under the age of 6 who have been exposed to high levels of lead. This distribution has an unknown mean µ; its standard deviation is assumed to be σ = 0.85 g/100 ml [75]. We might wish to know whether the mean hemoglobin level of this population is equal to the mean of the general population of children under the age of 6, 12.29 g/100 ml. We believe that if the hemoglobin levels of exposed children differ from those of unexposed children, they must on average be lower; therefore, we are concerned only with deviations from the mean that are below µ0. The null hypothesis for the test is

H0 : µ ≥ 12.29 g/100 ml,

and the one-sided alternative is

HA : µ < 12.29 g/100 ml.
H0 would be rejected for values of x̄ which are lower than 12.29, but not for those which are higher. We conduct the one-sided test at the α = 0.05 level of significance; since σ is known, we use the normal distribution rather than the t. A random sample of 74 children who have been exposed to high levels of lead has a mean hemoglobin level of x̄ = 10.6 g/100 ml [165]. Therefore, the appropriate test statistic is

z = (x̄ − µ0)/(σ/√n) = (10.6 − 12.29)/(0.85/√74) = −17.10.

According to Table A.3, the area to the left of z = −17.10 is less than 0.001. Since this p-value is smaller than α = 0.05, we reject the null hypothesis H0 : µ ≥ 12.29 g/100 ml in favor of the alternative. Because this is a one-sided test, any value of z that is less than the critical value −1.645 would have led us to reject the null hypothesis. (Also note that 12.29 lies above 10.8, the upper one-sided 95% confidence bound for µ calculated in Chapter 9.)
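For a one-sided alternative, only the single tail in the direction of HA is used. A brief sketch in R for the hemoglobin example, with variable names of our own choosing:

    x.bar <- 10.6    # sample mean hemoglobin (g/100 ml)
    mu.0  <- 12.29   # mean under the null hypothesis
    sigma <- 0.85    # assumed standard deviation
    n     <- 74      # sample size

    z <- (x.bar - mu.0) / (sigma / sqrt(n))  # test statistic, -17.10
    p <- pnorm(z)                            # lower-tail p-value only
    c(z = z, p = p)

The lower-tail area is vanishingly small, so H0 is rejected at the 0.05 level.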
In this example, H0 was chosen to test the statement that the mean hemoglobin level of the population of children who have been exposed to lead is the same as that of the general population, 12.29 g/100 ml. By rejecting H0, we conclude that this is not the case; the mean hemoglobin level for children who have been exposed to lead is in fact lower than the mean for children who have not been exposed.

The choice between a one- and a two-sided test can be controversial. Not infrequently, a one-sided test achieves significance when a two-sided test does not. Consequently, the decision is often made on nonscientific grounds. In response to this, some journal editors are reluctant to publish studies that employ one-sided tests. This may be an overreaction to something that the intelligent reader is able to discern. In any event, we avoid further discussion of this debate.

Summary: One-Sample, One-Sided Hypothesis Test for the Mean

                         Variance known                  Variance unknown
Null hypothesis          H0 : µ ≥ µ0, or                 H0 : µ ≥ µ0, or
                         H0 : µ − µ0 ≥ 0                 H0 : µ − µ0 ≥ 0
Alternative hypothesis   HA : µ < µ0, or                 HA : µ < µ0, or
                         HA : µ − µ0 < 0                 HA : µ − µ0 < 0
Test                     One-sample, one-sided z-test    One-sample, one-sided t-test
Test statistic           Z = (X̄ − µ0)/(σ/√n)            t = (X̄ − µ0)/(s/√n)
Distribution of          Standard normal                 t distribution with n − 1
test statistic                                           degrees of freedom

10.4 Types of Error
As noted in Section 10.1, there are two kinds of errors that can be made when conducting a test of hypothesis. The first is called a type I error; it is also known as a rejection error, or an α error. A type I error is made if we reject the null hypothesis H0 : µ = µ0 when H0 is true. The probability of committing a type I error is determined by the significance level of the test; recall that α is the conditional probability

α = P(reject H0 | H0 is true).
If we were to conduct repeated, independent tests of hypotheses setting the significance level at 0.05, we would erroneously reject a true null hypothesis 5% of the time.
Consider the case of a drug that has been deemed effective for reducing high blood pressure. After being treated with this drug for a given period of time, a population of individuals suffering from hypertension has mean diastolic blood pressure µd, a value that is clinically lower than the mean diastolic blood pressure of the untreated hypertensives. Now suppose that another company produces a generic version of the same drug. We would like to know whether the generic drug is as effective at reducing high blood pressure as the brand name version. To determine this, we examine the distribution of diastolic blood pressures for a sample of individuals who have been treated with the generic drug; if µ is the mean of this population, we use the sample to test the null hypothesis H0 : µ = µd.

What if the manufacturer of the generic drug actually submits the brand name product for testing in place of its own version? Vitarine Pharmaceuticals, a New York-based drug company, reportedly did just that [192]. In a situation such as this, we know that the null hypothesis must be true; we are testing the drug that itself set the standard. Therefore, if the test of hypothesis leads us to reject H0 and pronounce the "generic" drug to be either more or less efficacious than the brand name version, a type I error has been made.

The second kind of error that can be committed during a hypothesis test is a type II error, also known as an acceptance error, or a β error. A type II error is made if we fail to reject the null hypothesis H0 : µ = µ0 when H0 is false. The probability of committing a type II error is represented by the Greek letter β, where

β = P(do not reject H0 | H0 is false).
If β = 0.10, for instance, then the probability that we do not reject the null hypothesis when µ ≠ µ0 is 0.10, or 10%. The two types of errors that can be made are summarized below:

                              Population
 Test Result          µ = µ0              µ ≠ µ0
 Do Not Reject        Correct             Type II Error
 Reject               Type I Error        Correct
Recall the distribution of serum cholesterol levels for all 20- to 74-year-old males in the United States. The mean of this population is µ = 211 mg/100 ml, and the standard deviation is σ = 46 mg/100 ml. Suppose that we do not know the true mean of this population; however, we do know that the mean serum cholesterol level for the subpopulation of 20- to 24-year-old males is 180 mg/100 ml. Since older men tend to have higher cholesterol levels than do younger men on average, we would expect the mean cholesterol level of the population of 20- to 74-year-olds to be higher than 180 mg/100 ml. (And indeed it is, although we are pretending not to know this.) Therefore, if we were to conduct a one-sided test of the null hypothesis H0 : µ ≤ 180 mg/100 ml against the alternative hypothesis H A : µ > 180 mg/100 ml, we would expect H0 to be rejected. It is possible, however, that it would not be. The probability of reaching this incorrect conclusion – a type II error – is β.
What is the value of β associated with a test of the null hypothesis H0 : µ ≤ 180 mg/100 ml, assuming that we select a sample of size 25? To determine this, we first find the mean serum cholesterol level our sample must have in order for H0 to be rejected. Since we are conducting a one-sided test at the α = 0.05 level of significance, H0 would be rejected for z ≥ 1.645; this is the critical value of the test. Writing out the test statistic

z = (x̄ − µ0)/(σ/√n),

we have

1.645 = (x̄ − 180)/(46/√25),

and, solving for x̄,

x̄ = 180 + 1.645(46)/√25 = 195.1.

As shown in Figure 10.1, the area to the right of x̄ = 195.1 corresponds to the upper 5% of the sampling distribution of means of samples of size 25 when µ = 180. Therefore, the null hypothesis H0 : µ ≤ 180 mg/100 ml would be rejected if our sample has a mean x̄ which is greater than or equal to 195.1 mg/100 ml. A sample with a smaller mean would not provide sufficient evidence to reject H0 in favor of HA at the 0.05 level of significance.

Recall that the probability of making a type II error, β, is the probability of not rejecting the null hypothesis given that H0 is false. Therefore, it is the chance of obtaining a sample mean which is less than 195.1 mg/100 ml given that the true population mean is not 180 but is instead µ1 = 211 mg/100 ml. To find the value of β, we again consider the sampling distribution of means of samples of size 25; this time, however, we let µ = 211. This distribution is pictured on the right side of Figure 10.2. Since a sample mean less than x̄ = 195.1 mg/100 ml implies that we do not reject H0, we would like to know what proportion of this new distribution centered at 211 mg/100 ml lies below 195.1. Observe that

z = (195.1 − 211.0)/(46/√25) = −1.73.

The area under the standard normal curve that lies to the left of z = −1.73 is 0.042. Therefore, β – the probability of failing to reject H0 : µ ≤ 180 mg/100 ml when the true population mean is µ1 = 211 mg/100 ml – is equal to 0.042.

Whereas α, the probability of committing a type I error, is determined by looking at the situation in which H0 is true and µ is equal to µ0, β is found when H0 is false and µ does not equal µ0. If µ is not equal to µ0, however, there are an infinite number of possible values that µ could assume. The type II error is calculated for a single such value, µ1; in the previous example, µ1 was chosen to be 211 mg/100 ml. (We selected 211 because in this unusual example, we knew it to be the true population mean.) If we had chosen a different alternative population mean, then we would have computed a different value for β. The closer µ1 is to µ0, the more difficult it is to reject the null hypothesis and the higher β will be.
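The value of β can be reproduced numerically; a sketch in R under the same assumptions as the example:

    mu.0  <- 180     # mean under the null hypothesis
    mu.1  <- 211     # assumed true (alternative) mean
    sigma <- 46
    n     <- 25
    alpha <- 0.05

    # Smallest sample mean that leads to rejection of H0
    x.crit <- mu.0 + qnorm(1 - alpha) * sigma / sqrt(n)  # 195.1

    # beta: chance of falling below the cutoff when mu = mu.1
    beta <- pnorm((x.crit - mu.1) / (sigma / sqrt(n)))   # 0.042
    c(cutoff = x.crit, beta = beta)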
FIGURE 10.1 Distribution of means of samples of size 25 for the serum cholesterol levels of males 20 to 74 years of age, µ = 180 mg/100 ml
FIGURE 10.2 Distributions of means of samples of size 25 for the serum cholesterol levels of males 20 to 74 years of age, µ = 180 mg/100 ml versus µ = 211 mg/100 ml
Summary: Types of Error

 Error           Notation    Definition
 Type I error    α           P(reject H0 | H0 is true)
 Type II error   β           P(do not reject H0 | H0 is false)

10.5 Power
If β is the probability of committing a type II error, then 1 − β is called the power of the test of hypothesis. The power is the probability of rejecting the null hypothesis when H0 is false. In other words, it is the probability of avoiding a type II error;

power = P(reject H0 | H0 is false).
The power may also be thought of as the likelihood that a particular study will detect a deviation from the null hypothesis given that one exists. Like β, the power must be computed for a particular alternative population mean µ1. In the serum cholesterol example, the power of the one-sided test of hypothesis is

1 − β = 1 − 0.042 = 0.958.

Consequently, for a test conducted at the 0.05 level of significance and using a sample of size 25, there is a 95.8% chance of rejecting the null hypothesis H0 : µ ≤ 180 mg/100 ml given that H0 is false and the true population mean is µ1 = 211 mg/100 ml. Note that this could also have been expressed in the following way:

power = P(reject µ ≤ 180 | µ = 211)
      = P(X̄ ≥ 195.1 | µ = 211)
      = P(Z ≥ −1.73)
      = 1 − P(Z < −1.73)
      = 1 − 0.042
      = 0.958.
The quantity 1 − β would have assumed a different value if we had set µ1 equal to 200 mg/100 ml, and yet another value if we had let µ1 be 220 mg/100 ml. If we were to plot the values of 1 − β against all possible alternative population means, we would end up with what is known as a power curve. A power curve for the test of the null hypothesis H0 : µ ≤ 180 mg/100 ml is shown in Figure 10.3. Note that when µ1 = 180,

power = P(reject µ ≤ 180 | µ = 180)
      = P(reject µ ≤ 180 | H0 is true)
      = α
      = 0.05.
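A curve like the one in Figure 10.3 can be traced by repeating the β calculation over a grid of alternative means. A sketch in R, continuing the example above:

    mu.0   <- 180
    sigma  <- 46
    n      <- 25
    alpha  <- 0.05
    x.crit <- mu.0 + qnorm(1 - alpha) * sigma / sqrt(n)

    mu.1  <- seq(160, 230, by = 1)   # candidate alternative means
    power <- 1 - pnorm((x.crit - mu.1) / (sigma / sqrt(n)))
    plot(mu.1, power, type = "l",
         xlab = "Alternative population mean", ylab = "Power")

At µ1 = 180 the curve takes the value α = 0.05, and at µ1 = 211 it reaches 0.958, as calculated above.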
FIGURE 10.3 Power curve for µ0 = 180, α = 0.05, and n = 25

The power of the test approaches 1 as the alternative mean moves farther and farther away from the null value of 180 mg/100 ml. Investigators generally try to design tests of hypotheses so that they have high power. It is not enough to know that we have a small probability of rejecting H0 when it is true; we would also like there to be a large probability of rejecting the null hypothesis when it is false. In most practical applications, a power less than 80% is considered insufficient.

One way to increase the power of a test is to raise the significance level α. If we increase α, then we cut off a smaller portion of the tail of the sampling distribution centered at µ1. Correspondingly, β becomes smaller, and the power, 1 − β, increases. If α had been equal to 0.10 for the test of the null hypothesis H0 : µ ≤ 180 mg/100 ml, for instance, then β would have been 0.018 and the power 0.982. This situation is illustrated in Figure 10.4; compare it to Figure 10.2, where α was equal to 0.05. Bear in mind, however, that by raising α we are increasing the probability of making a type I error.

This trade-off between α and β is similar to that observed to exist between the sensitivity and specificity of a diagnostic test. Recall that by increasing the sensitivity of a test, we automatically decreased its specificity; alternatively, increasing the specificity lowered the sensitivity. The same is true for α and β. The balance between the two types of error is a delicate one, and their relative importance varies depending on the situation. In 1692, during the Salem witch trials, Increase Mather published a sermon signed by himself and fourteen other parsons wherein he stated [193]:

    It were better that ten suspected witches should escape, than that one innocent person should be condemned.
FIGURE 10.4 Distributions of means of samples of size 25 for the serum cholesterol levels of males 20 to 74 years of age, µ = 180 mg/100 ml versus µ = 211 mg/100 ml

In the 18th century, Benjamin Franklin said:

    It is better that 100 guilty persons should escape than that one innocent person should suffer.

More recently, however, an editorial on child abuse claimed that it is "equally important" to identify and punish child molesters and to exonerate those who are falsely accused [194].

The more information that we have – meaning the larger our sample – the less likely we are to commit an error of either type. Regardless of our decision, however, the possibility that we have made a mistake always exists. The only way to decrease α and β simultaneously is to reduce the amount of overlap in the two normal distributions – the one centered at µ0 and the one centered at µ1. One way that this can be accomplished is by considering only large deviations from µ0. The farther apart the values of µ0 and µ1, the greater the power of the test. (The difference should be clinically meaningful, however.) An alternative is to increase the sample size n. By increasing n, we decrease the standard error σ/√n; this causes the two sampling distributions to become more narrow, which in turn lessens the amount of overlap. The standard error also decreases if we reduce the underlying population standard deviation σ, but this is usually not a viable option. Another possibility that we have not yet mentioned is to find a "more powerful" test statistic. This topic is discussed further in Chapter 13.

Summary: Power

 Term    Notation    Definition
 Power   1 − β       P(reject H0 | H0 is false)
10.6 Sample Size Estimation
In the previous section, we outlined a method for calculating the power of a test conducted at the α level of significance using a sample of size n. In the early stages of planning a study, however, investigators usually want to reverse the situation and determine the sample size that will be necessary to provide a specified power. For example, suppose that we wish to test the null hypothesis

H0 : µ ≤ 180 mg/100 ml

at the α = 0.01 level of significance. Once again, µ is the mean serum cholesterol level of the population of 20- to 74-year-old males in the United States; the standard deviation is σ = 46 mg/100 ml. If the true population mean is as large as 211 mg/100 ml, then we want to risk only a 5% chance of failing to reject the null hypothesis; consequently, we set β equal to 0.05 and the power of the test to 0.95. Under these circumstances, how large a sample would we require?

Since α = 0.01 instead of 0.05, we begin by noting that H0 would be rejected for z ≥ 2.32. Substituting (x̄ − 180)/(46/√n) for the normal deviate z, we let

2.32 = (x̄ − 180)/(46/√n).

Solving for x̄,

x̄ = 180 + 2.32(46/√n).

Therefore, we would reject the null hypothesis if the sample mean x̄ takes any value greater than or equal to 180 + 2.32(46/√n).

Now consider the desired power of the test. If the true mean serum cholesterol level were actually µ1 = 211 mg/100 ml – so that the normal deviate z could be expressed as (x̄ − 211)/(46/√n) – then we would want to reject the null hypothesis with probability 1 − β = 1 − 0.05 = 0.95. The value of z that corresponds to β = 0.05 is z = −1.645; therefore,

−1.645 = (x̄ − 211)/(46/√n)

and

x̄ = 211 − 1.645(46/√n).

Setting the two expressions for the sample mean x̄ equal to each other,

180 + 2.32(46/√n) = 211 − 1.645(46/√n).

Multiplying both sides of the equality by √n and collecting terms,

√n (211 − 180) = [2.32 − (−1.645)](46),

and

n = [(2.32 + 1.645)(46)/(211 − 180)]² = 34.6.

By convention, we always round up when calculating a sample size. Therefore, a sample of 35 individuals would be required.
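The same calculation can be scripted so that the inputs are easy to vary. A sketch in R:

    mu.0  <- 180
    mu.1  <- 211
    sigma <- 46
    alpha <- 0.01
    beta  <- 0.05

    # One-sided test: n = [(z_{1-alpha} + z_{1-beta}) sigma / (mu.1 - mu.0)]^2
    n <- ((qnorm(1 - alpha) + qnorm(1 - beta)) * sigma / (mu.1 - mu.0))^2
    ceiling(n)   # 35

Using qnorm(0.99) = 2.326 rather than the rounded 2.32 gives n = 34.7, which still rounds up to 35.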
Using notation introduced in Chapter 9, it is possible to write a more general formula for calculating sample size. Recall that z1−α represents the value which cuts off an area of α in the upper tail of the standard normal distribution; it is the (1 − α)th percentile of the distribution. If we conduct a one-sided test of the null hypothesis

H0 : µ ≤ µ0

against the alternative

HA : µ > µ0

at the α level of significance, then H0 would be rejected for any test statistic which takes a value z ≥ z1−α. Similarly, considering the desired power of the test 1 − β, the generic value of z that corresponds to a probability β is z = −z1−β. The two different expressions for x̄ are

x̄ = µ0 + z1−α (σ/√n)

and

x̄ = µ1 − z1−β (σ/√n),

and setting them equal to each other gives us

n = [ (z1−α − (−z1−β))(σ) / (µ1 − µ0) ]²
  = [ (z1−α + z1−β)(σ) / (µ1 − µ0) ]².

This is the sample size necessary to achieve a power of 1 − β when conducting a one-sided test at the α level of significance.

Several factors influence the size of n. If we reduce the type I error α, then z1−α – the cutoff point for rejecting H0 – would increase in value; this would result in a larger sample size. Similarly, if we lower the type II error β or increase the power, then z1−β gets larger. Again this would produce a larger value of n. If we consider an alternative population mean that is closer to the hypothesized value, then the difference µ1 − µ0 would decrease and the sample size increase. It makes sense that we would need a bigger sample size to detect a smaller difference. Finally, the larger the variability of the underlying population σ, the larger the sample size required.

In the serum cholesterol level example, we knew that the hypothesized population mean µ0 had to be smaller than the alternative µ1; consequently, we conducted a one-sided test. If it is not known whether µ0 is larger or smaller than µ1, then a two-sided test is appropriate. In this case, we must modify the critical value of z that would cause the null hypothesis to be rejected. When α = 0.01, for instance, H0 : µ = 180 mg/100 ml would be rejected for z ≥ 2.58, not z ≥ 2.32. Substituting this value into the equation above,

n = [ (2.58 + 1.645)(46) / (211 − 180) ]² = 39.3,

and a sample of size 40 would be required. More generally, H0 would be rejected at the α level for z ≥ z1−α/2 (and also for z ≤ zα/2), and the sample size formula becomes

n = [ (z1−α/2 + z1−β)(σ) / (µ1 − µ0) ]².
Note that the sample size for a two-sided test is always larger than the sample size for the corresponding one-sided test.
Summary: Sample Size for One-Sample Test on the Mean

 Test        Significance level    Power    Sample Size
 One-sided   α                     1 − β    n = [(z1−α + z1−β)(σ)/(µ1 − µ0)]²
 Two-sided   α                     1 − β    n = [(z1−α/2 + z1−β)(σ)/(µ1 − µ0)]²

 Always round n up to the nearest integer.
10.7 Further Applications

Consider once again the distribution of heights for the population of 12- to 40-year-olds who suffer from fetal alcohol syndrome. This distribution is approximately normal with unknown mean µ; its standard deviation is σ = 6 cm [182]. We might wish to know whether the mean height for this population is equal to the mean height for individuals in the same age group who do not have fetal alcohol syndrome.

The first step in conducting a test of hypothesis is to make a formal claim about the value of µ0. Since the mean height of 12- to 40-year-olds who do not suffer from fetal alcohol syndrome is 160.0 cm, the null hypothesis would be

H0 : µ = 160.0 cm.

We are concerned with deviations from µ0 that could occur in either direction, so the alternative hypothesis is

HA : µ ≠ 160.0 cm.

The second step is to set the significance level of the test. Here we will set α = 0.05 for the two-sided test.

The third step of the hypothesis test is to examine the data and calculate a test statistic. For a random sample of size 31 selected from the population of 12- to 40-year-olds who suffer from fetal alcohol syndrome, the mean height is x̄ = 147.4 cm. The test statistic is

z = (x̄ − µ0)/(σ/√n) = (147.4 − 160.0)/(6/√31) = −11.69.
The fourth step is to identify the probability distribution of the test statistic, and use it to calculate a p-value. The p-value answers the question: If the true mean height of the population of 12- to 40-year-olds who suffer from fetal alcohol syndrome is µ = 160.0 cm, what is the probability of selecting a sample with a mean as low as 147.4, or even more extreme than this? Since the value of σ is known, we use a one-sample z-test rather than a one-sample t-test, and assume that the test statistic follows a standard normal distribution if the null hypothesis is true. Referring to Table A.3,
TABLE 10.1 R output for the one-sample z-test

  n    x.bar   std   null            alt              z        p_value
  31   147.4   6     H_0: mu = 160   H_1: mu != 160   -11.69   <0.001

TABLE 10.2 Stata output for the one-sample t-test

    Ha: mean < 81            Ha: mean != 81            Ha: mean > 81
 Pr(T < t) = 1.0000    Pr(|T| > |t|) = 0.0000    Pr(T > t) = 0.0000
TABLE 10.3 R output for the one-sample t-test

  n   x.bar   se      df   std   null           upper    alt             t        p_value
  7   151     3.402   6    9     H_0: mu = 81   159.32   H_1: mu != 81   20.578   <0.001

TABLE 11.1 Stata output for the paired t-test

 Ha: mean(diff) < 0        Ha: mean(diff) != 0        Ha: mean(diff) > 0
 Pr(T < t) = 0.0059     Pr(|T| > |t|) = 0.0119     Pr(T > t) = 0.9941
TABLE 11.2 R output for the paired t-test

        Paired t-test

data:  data$air_percd and data$co_percd
t = -2.5928, df = 62, p-value = 0.01186
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.738644  -1.518129
sample estimates:
mean of the differences
              -6.628387
Now consider a study designed to investigate the effects of lactose consumption on carbohydrate energy absorption among premature infants. In particular, we are interested in determining whether a reduction in the intake of lactose – a sugar found in milk – either increases or decreases energy absorption. In this study, one group of newborns was fed their mothers’ breast milk; another group received a formula that contained only half as much lactose [208]. The distributions of carbohydrate energy absorption for the two populations are approximately normal. We believe it is reasonable to assume that they have equal variances, and would like to know whether they also have identical means. Since we are concerned with deviations that could occur in either direction, we test the null hypothesis H0 : µ1 = µ2 against the two-sided alternative
HA : µ1 ≠ µ2.
A random sample of n1 = 8 infants fed their mothers' breast milk has mean energy absorption x̄1 = 87.38% and standard deviation s1 = 4.56%; a sample of n2 = 10 newborns who were given formula has mean x̄2 = 90.14% and standard deviation s2 = 4.58%. Since the samples are independent and the underlying population variances are assumed to be equal – an assumption which seems reasonable based on the values of s1 and s2 – we apply the two-sample t-test.
TABLE 11.3 Stata output for the two-sample t-test, assuming equal variances

Two-sample t test with equal variances
----------------------------------------------------------------------
   Group |  Obs     Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+------------------------------------------------------------
    milk |    8    87.38    1.612203        4.56    83.56774   91.19226
 formula |   10    90.14    1.448323        4.58    86.86367   93.41633
---------+------------------------------------------------------------
combined |   18    88.91    1.096934       4.654    86.59901   91.22766
---------+------------------------------------------------------------
    diff |         -2.76    2.168339                -7.356674   1.836674
----------------------------------------------------------------------
    diff = mean(milk) - mean(formula)                      t = -1.2729
    Ho: diff = 0                              degrees of freedom = 16

    Ha: diff < 0            Ha: diff != 0             Ha: diff > 0
 Pr(T < t) = 0.1106   Pr(|T| > |t|) = 0.2213   Pr(T > t) = 0.8894
We begin by calculating the pooled estimate of the variance,

sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
    = [(8 − 1)(4.56)² + (10 − 1)(4.58)²] / (8 + 10 − 2)
    = 20.90.

The value sp² combines information from both samples of children to produce a more reliable estimate of the common variance σ². The test statistic is

t = [(x̄1 − x̄2) − (µ1 − µ2)] / √(sp²[(1/n1) + (1/n2)])
  = [(87.38 − 90.14) − 0] / √((20.90)[(1/8) + (1/10)])
  = −1.27.
For a t distribution with 8 + 10 − 2 = 16 degrees of freedom, the total area under the curve to the left of −1.27 and to the right of 1.27 is greater than 2(0.10) = 0.20. Therefore, we fail to reject the null hypothesis. Based on these samples, lactose intake in newborns cannot be said to have an effect on carbohydrate energy absorption. Once again we could have used a computer to conduct the test of hypothesis for us. Output from Stata is contained in Table 11.3, and output from R in Table 11.4. Whereas Table A.4 enabled us to say that p > 0.20 for a two-sided test, either Stata or R would give us a more precise p-value, p = 0.221.

Suppose we did not feel we had any reason to assume that the variances of the two distributions of carbohydrate energy absorption are equal; even if the means for the populations of premature infants being fed breast milk versus formula are the same, the amount of variability among the measurements might not be. In this case, we test the null hypothesis

H0 : µ1 = µ2

against the two-sided alternative
HA : µ1 ≠ µ2
TABLE 11.4 R output for the two-sample t-test, assuming equal variances

        Two Sample t-test

data:  milk and formula
t = -1.2728635, df = 16, p-value = 0.2212517
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.356674  1.836674
sample estimates:
   mean of milk  mean of formula
          87.38            90.14
TABLE 11.5 Stata output for the two-sample t-test, not assuming equal variances

Two-sample t test with unequal variances
------------------------------------------------------------------
         |  Obs     Mean   Std. Err.   Std. Dev.  [95% Conf. Interval]
---------+--------------------------------------------------------
    milk |    8    87.38    1.612203       4.56    83.56774   91.19226
 formula |   10    90.14    1.448323       4.58    86.86367   93.41633
---------+--------------------------------------------------------
combined |   18   88.913    1.096934     4.6539    86.59901   91.22766
    diff |         -2.76    2.167219               -7.3748     1.85476
------------------------------------------------------------------
    diff = mean(x) - mean(y)                              t = -1.2735
    Ho: diff = 0          Satterthwaite's degrees of freedom = 15.1719

    Ha: diff < 0            Ha: diff != 0             Ha: diff > 0
 Pr(T < t) = 0.1110   Pr(|T| > |t|) = 0.2220   Pr(T > t) = 0.889
using the modified version of the two-sample test. The test statistic is

t = [(x̄1 − x̄2) − (µ1 − µ2)] / √((s1²/n1) + (s2²/n2))
  = [(87.38 − 90.14) − 0] / √((4.56)²/8 + (4.58)²/10)
  = −1.27.
Note from the Stata output in Table 11.5 and the R output in Table 11.6 that the approximate degrees of freedom are 15. This is a little more conservative than the test assuming equal variances. For a t distribution with 15 degrees of freedom, the p-value is 0.222. Once again, lactose intake in newborns cannot be said to have an effect on carbohydrate energy absorption.
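With the individual measurements loaded into two vectors, the outputs in Tables 11.4 and 11.6 come from R's built-in t.test function. The vector names milk and formula follow the labels in the output; the raw data themselves are not listed in the text.

    # Pooled two-sample t-test (equal variances assumed)
    t.test(milk, formula, var.equal = TRUE)

    # Welch two-sample t-test (unequal variances; the default)
    t.test(milk, formula)

Setting var.equal = TRUE pools the sample variances exactly as in the hand calculation; omitting it yields the Welch test with Satterthwaite's approximate degrees of freedom.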
TABLE 11.6 R output for the two-sample t-test, not assuming equal variances

        Welch Two Sample t-test

data:  milk and formula
t = -1.2735213, df = 15.1719, p-value = 0.2219991
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.374763  1.854763
sample estimates:
   mean of milk  mean of formula
          87.38            90.14
When designing this study, suppose the investigators postulated that the difference in mean carbohydrate energy absorption between the two groups would be 3.0%, and that the standard deviation in each group is 4.6%. What sample size would have been required to have 90% power to detect a difference of this magnitude? Assuming they planned to conduct a two-sided test at the 0.05 level of significance,

n = (z0.975 + z0.90)²(σ1² + σ2²) / (µ1 − µ2)²
  = (1.96 + 1.28)²(4.6² + 4.6²) / (3.0)²
  = 49.4.

The researchers would have required 50 patients per group, or a total sample size of 100 infants. Sample size calculations performed in Stata and R are shown in Tables 11.7 and 11.8, respectively. (Small discrepancies are due to differences in rounding.) With only 18 infants, the study was underpowered to detect a clinically relevant difference in energy absorption between groups.
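The formula can be evaluated directly; a quick sketch in R:

    sigma <- 4.6   # common standard deviation (%)
    delta <- 3.0   # postulated difference in means (%)
    n <- (qnorm(0.975) + qnorm(0.90))^2 * (2 * sigma^2) / delta^2
    ceiling(n)     # 50 per group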
TABLE 11.7 Stata output for a sample size calculation for the comparison of two independent means

Estimated sample sizes for a two-sample means test
Satterthwaite's t test assuming unequal variances
Ho: m2 = m1  versus  Ha: m2 != m1

Study parameters:

        alpha =   0.0500
        power =   0.9000
        delta =   3.0000
           m1 =  87.0000
           m2 =  90.0000
          sd1 =   4.6000
          sd2 =   4.6000

Estimated sample sizes:

            N =      102
  N per group =       51
TABLE 11.8 R output for a sample size calculation for the comparison of two independent means

     Two-sample t test power calculation

              n = 50.38825
          delta = 3
             sd = 4.6
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group
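The R output in Table 11.8 is consistent with a call to the power.t.test function along the following lines, although the exact invocation is not shown in the text:

    power.t.test(delta = 3, sd = 4.6, sig.level = 0.05,
                 power = 0.9, type = "two.sample",
                 alternative = "two.sided")

This returns n = 50.39 per group, which rounds up to 51; the hand calculation above gives 50 because it uses the normal approximation rather than the t distribution.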
11.5 Review Exercises
1. What is the main difference between paired and independent samples of observations?

2. Explain the concept of paired data. In certain situations, what might be the advantage of using paired samples to compare populations rather than independent samples?

3. When can you use the two-sample t-test? When must the modified version of the test be applied?

4. What is the rationale for using a pooled estimate of the variance in the two-sample t-test?

5. Suppose you are interested in determining whether exposure to the organochlorine ddt, which has been used extensively as an insecticide for many years, is associated with breast cancer in women. As part of a study which investigated this issue, blood was drawn from a sample of women diagnosed with breast cancer over a six-year period and a second sample of healthy control subjects matched to the cancer patients on age, menopausal status, and date of blood donation [209]. Each woman's blood level of dde – an important byproduct of ddt in the human body – was measured, and the difference in levels for each patient and her matched control calculated. A sample of 171 differences has mean d̄ = 2.7 ng/ml and standard deviation sd = 15.9 ng/ml.
(a) Test the null hypothesis that the mean blood levels of dde are identical for women with breast cancer and for healthy control subjects, or equivalently, that the mean difference in dde levels is equal to 0.
(b) What is the probability distribution of the test statistic? What is the p-value?
(c) Do you reject or fail to reject the null hypothesis?
(d) What do you conclude?
(e) Would you expect a 95% confidence interval for the true difference in population mean dde levels to contain the value 0? Explain.
(f) Calculate the 95% confidence interval for δ, the true mean difference in dde levels.

6. The data below come from a study that examines the efficacy of saliva cotinine as an indicator for exposure to tobacco smoke. In one part of the study, seven subjects – none of whom were heavy smokers and all of whom had abstained from smoking for at least one week prior to the study – were each required to smoke a single cigarette. Samples of saliva were taken from all individuals 2, 12, 24, and 48 hours after smoking the cigarette. The cotinine levels at 12 hours and at 24 hours are provided below [210]. Assume that cotinine levels are approximately normally distributed.

               Cotinine Levels (nmol/l)
   Subject    After 12 Hours    After 24 Hours
      1             73                24
      2             58                27
      3             67                49
      4             93                59
      5             33                 0
      6             18                11
      7            147                43
Let µ12 represent the population mean cotinine level 12 hours after smoking the cigarette and µ24 the mean cotinine level 24 hours after smoking. It is believed that if µ24 is not equal to µ12, it must be lower.
(a) Construct a one-sided 95% confidence interval for the true difference in population means δ = µ12 − µ24. What does this tell you?
(b) Test the null hypothesis that the population means are identical at the α = 0.05 level of significance. Use a one-sided test. What do you conclude?

7. A study was conducted to determine whether an expectant mother's cigarette smoking has any effect on the bone mineral content of her otherwise healthy child [211]. A sample of 77 newborns whose mothers smoked during pregnancy has mean bone mineral content x̄1 = 0.098 g/cm and standard deviation s1 = 0.026 g/cm; a sample of 161 infants whose mothers did not smoke during pregnancy has mean x̄2 = 0.095 g/cm and standard deviation s2 = 0.025 g/cm. Assume that the underlying population variances of bone mineral content are equal.
(a) Are the two samples of data paired or independent?
(b) Calculate separate 95% confidence intervals for the true mean bone mineral contents of infants in each population. Use these intervals to construct a graph like Figure 11.3. Does the graph suggest that the two population means are likely to be equal to each other?
(c) Formally test the null hypothesis that the two populations have the same mean bone mineral content. State the null and alternative hypotheses of the two-sided test. Calculate the appropriate test statistic, and state the probability distribution of this test statistic assuming that the null hypothesis is true.
(d) If the test is being conducted at the 0.05 level of significance, do you reject the null hypothesis? Explain.
(e) What do you conclude?
(f) Construct a 95% confidence interval for the true difference in population means.

8. In an investigation of pregnancy-induced hypertension, one group of females with this disorder was treated with low-dose aspirin, and a second group was given a placebo [212]. A sample consisting of 23 females who received aspirin has mean arterial blood pressure 111 mm Hg and standard deviation 8 mm Hg; a sample of 24 individuals who were given the placebo has mean blood pressure 109 mm Hg and standard deviation 8 mm Hg.
(a) At the 0.01 level of significance, test the null hypothesis that the two populations of females have the same mean arterial blood pressure. What do you conclude?
(b) Construct a 99% confidence interval for the true difference in population means. Does this interval contain the value 0? How does this compare with the results of the hypothesis test?

9. You plan to conduct a study evaluating an anti-smoking campaign in a large public high school. Prior to implementation of the campaign, you will enroll a sample of students who are all self-reported smokers, and record the number of cigarettes smoked on the day prior to study enrollment. Two months after the launch of the campaign, the same group of students will be contacted and asked to report the number of cigarettes they smoked the day before.
(a) If differences in the number of cigarettes smoked from baseline to two months after the launch of the anti-smoking campaign can be assumed to be normally distributed, what statistical test would be used to evaluate whether there has been a change in the mean number of cigarettes smoked?
(b) What are the null and alternative hypotheses of this test?
(c) Based on pilot data, you expect that the standard deviation of differences in the number of cigarettes smoked will be 5. If you wish to have 80% power to detect a mean difference of 3 cigarettes over the two-month period, how many students should you enroll?
(d) How many students should you enroll if you wish to have 90% power to detect a mean difference of 3 cigarettes?
(e) How many students should you enroll if you wish to have 90% power to detect an expected mean difference of 5 cigarettes?
10. In the study of adults undergoing liver transplantation who completed the Medical Outcomes Study sf-36 questionnaire six months after surgery, we previously compared the mental component summary mcs score for individuals receiving usual care versus those treated with exercise and dietary counseling [202]. Like the mcs score, the physical component summary (pcs) score – which incorporates information about physical functioning, bodily pain, limitations due to physical problems, and general physical health – is scaled to have an approximately normal distribution in the general population. You wish to evaluate whether mean pcs scores are the same in the two populations of transplant patients.
(a) What are the null and alternative hypotheses of the appropriate test?
(b) Six months after liver transplantation, the sample of n1 = 70 adults receiving usual care has mean pcs score x̄1 = 41.8 and standard deviation s1 = 10.6; the sample of n2 = 49 individuals in the intervention group has mean score x̄2 = 42.9 and standard deviation s2 = 11.0. Using this information, conduct the hypothesis test at the 0.05 level of significance.
(c) What do you conclude?
(d) Six months after liver transplantation, what is your overall assessment of differences in physical and mental health-related quality of life for patients treated with usual care versus those receiving an exercise and dietary counseling intervention?

11. Investigators are designing a study to compare forced vital capacity – the maximum amount of air which can be forcibly exhaled from the lungs after fully inhaling – for females between 30 and 35 years of age who smoke at least 10 cigarettes per day and females in the same age group who smoke fewer than 10 cigarettes per day.
(a) If forced vital capacity (fvc) can be assumed to be approximately normally distributed, what statistical test could be used to evaluate whether mean fvc is the same in these two populations of females?
(b) What are the null and alternative hypotheses of this test?
(c) Based on some preliminary measurements, the investigators expect that the standard deviation of fvc will be 0.3 liters in both groups. If they wish to have 90% power to detect a difference of 0.5 liters between the means of the two populations, how many individuals would they need to enroll?
(d) What sample size would they need if the standard deviation of fvc measurements in these two populations was 0.5 liters instead of 0.3 liters?
12. You wish to compare health-related quality of life as measured by the mental component summary (mcs) score of the Medical Outcomes Study sf-36 questionnaire for patients with fibromyalgia undergoing two different treatment regimens. In the general population, mcs scores are normally distributed with standard deviation 10.
(a) If you wish to have 85% power to detect a 5 point difference in means between the two treatment groups, what sample size would you require?
(b) What sample size would you need to have 90% power?

13. A study was conducted to compare physical and psychological characteristics of monozygotic (identical) twins [213]. One question of interest was whether brain volume differed by birth order. The dataset twins contains total brain volumes measured in cm³ for 10 pairs of monozygotic twins. The brain volumes of the first-born twins are saved under the variable name volume_1, and the brain volumes of the second-born twins under the name volume_2.
(a) Are the two samples of total brain volume measurements paired or independent?
(b) What are the appropriate null and alternative hypotheses for a two-sided test?
(c) Conduct the test at the 0.05 level of significance. What is the p-value? What is the probability distribution of the test statistic, assuming that the null hypothesis is true?
(d) What do you conclude about the relationship between birth order and brain volume in monozygotic twins?

14. A crossover study was conducted to investigate whether oat bran cereal helps to lower serum cholesterol levels in hypercholesterolemic males [214]. Fourteen such individuals were randomly placed on a diet which included either oat bran or corn flakes. After two weeks, their low-density lipoprotein (ldl) cholesterol levels were recorded. Each male was then switched to the alternative diet. After a second two week period, the ldl cholesterol level of each individual was again recorded. The measurements from this study are contained in the dataset ldl. Ldl cholesterol levels in millimoles/liter measured two weeks after starting the oat bran cereal diet are saved under the variable name ldl_oat, and ldl levels two weeks after beginning the corn flake diet under the name ldl_corn. You wish to investigate whether mean ldl cholesterol levels are the same on the two different diets.
(a) What are the appropriate null and alternative hypotheses for a two-sided test?
(b) Conduct the test at the 0.05 level of significance. What is the p-value?
(c) What do you conclude?

15. The data set lowbwt contains information for a sample of 100 low birth weight infants born in two teaching hospitals in Boston, Massachusetts [81]. Measurements of systolic blood pressure are saved under the variable name sbp and indicators of sex – with 1 representing a male and 0 a female – under the name sex.
(a) Construct a histogram of systolic blood pressure measurements for this sample. Based on the graph, do you believe that blood pressure is approximately normally distributed?
(b) Test the null hypothesis that among low birth weight infants, the mean systolic blood pressure for boys is equal to the mean for girls. Use a two-sided test at the 0.05 level of significance.
(c) What is the p-value of this test? What does the p-value tell you?
(d) Do you reject or fail to reject the null hypothesis?
(e) What do you conclude?
16. The Bayley Scales of Infant Development provide scores on two indices – the Psychomotor Development Index (pdi) and the Mental Development Index (mdi) – which can be used to assess a child's level of functioning at approximately one year of age. As part of a study investigating the development and neurologic status of children who had undergone reparative heart surgery during the first three months of life, the Bayley Scales were administered to a sample of one-year-old infants born with congenital heart disease. The children had been randomized to one of two different treatment groups, known as "circulatory arrest" and "low-flow bypass." The groups differed in the specific way in which the reparative surgery was performed. Unlike circulatory arrest, which stops blood flow through the brain for a short period of time, low-flow bypass maintains continuous circulation. Although it is felt to be preferable by some physicians, it also has its own associated risk of brain injury. The data for this study are saved in the data set bayley [189]. Pdi scores are saved under the variable name pdi, mdi scores under mdi, and indicators of treatment group under trtment. For this variable, 0 represents circulatory arrest and 1 is low-flow bypass.
(a) At the 0.05 level of significance, test the null hypothesis that the mean pdi score at one year of age for the circulatory arrest treatment group is equal to the mean pdi score for the low-flow group. What is the p-value for this test? What do you conclude?
(b) Test the null hypothesis that the mean mdi scores are identical for the two treatment groups. What is the p-value? What do you conclude?
(c) What do these tests suggest about the relationship between a child's surgical treatment group during the first three months of life and his or her subsequent developmental status at one year of age?
12 Analysis of Variance
CONTENTS
12.1 One-Way Analysis of Variance
     12.1.1 The Problem
     12.1.2 Sources of Variation
12.2 Multiple Comparisons Procedures
12.3 Further Applications
12.4 Review Exercises
In the preceding chapter we cover techniques for determining whether a difference exists between the means of two independent populations. It is not unusual, however, to encounter situations in which we wish to test for a difference among the means of three or more populations rather than just two. The extension of the two-sample t-test to a comparison of the means for three or more groups is a type of analysis of variance.
12.1 One-Way Analysis of Variance
When the groups for which the means are being compared represent the classes defined by a single categorical variable, we use one-way analysis of variance.
12.1.1 The Problem
When discussing the paired t-test in the Further Applications section of Chapter 11, we examine data from a study that investigated the effects of carbon monoxide exposure on patients with coronary artery disease by subjecting them to a series of exercise tests. The males involved in the study were recruited from three different medical centers – the Johns Hopkins University School of Medicine, the Rancho Los Amigos Medical Center, and the St. Louis University School of Medicine. Before combining the subjects into one large group to conduct the analysis, we should have first examined some baseline characteristics to ensure that the individuals from different centers were in fact comparable. One characteristic that we might wish to consider is pulmonary function at the start of the study. If the subjects from one medical center begin with measures of forced expiratory volume in 1 second that are much larger – or much smaller – than those from the other centers, the results of the analysis may be affected. Therefore, given that the populations of patients in the three centers have mean baseline fev1 measurements µ1, µ2, and µ3 respectively, we would like to test the null hypothesis that the population means are identical. This may be expressed as

H0 : µ1 = µ2 = µ3.
Note that the populations being compared are defined by the three classes of the categorical variable medical center. The alternative hypothesis is that at least one of the population means differs from one of the others. In general, we are interested in comparing the means of k different populations. Suppose that the k populations are independent and normally distributed. We begin by drawing a random sample of size n1 from the normal population with mean µ1 and standard deviation σ1. The mean of this sample is denoted by x̄1 and its standard deviation by s1. Similarly, we select a random sample of size n2 from the normal population with mean µ2 and standard deviation σ2, and so on for the remaining populations. The numbers of observations in each sample need not be the same.

                           Group 1   Group 2   ···   Group k
Population
  Mean                     µ1        µ2        ···   µk
  Standard deviation       σ1        σ2        ···   σk
Sample
  Mean                     x̄1       x̄2       ···   x̄k
  Standard deviation       s1        s2        ···   sk
  Sample size              n1        n2        ···   nk
For the study investigating the effects of carbon monoxide exposure on individuals with coronary artery disease, the fev1 distributions of patients associated with each of the three medical centers make up distinct populations. From the population of fev1 measurements for the patients at Johns Hopkins University, we select a sample of size n1 = 21. From the population at Rancho Los Amigos we draw a sample of size n2 = 16, and from the one at St. Louis University we select a sample of size n3 = 23. The data, along with their sample means and standard deviations, are shown in Table 12.1 [207]. A 95% confidence interval for the true mean fev1 of subjects at each medical center is shown in Figure 12.1. Based on this graph, the mean fev1 for the patients at Johns Hopkins may be a little lower than the means for the other two groups; however, all three intervals overlap. We would like to conduct a more formal analysis comparing the population means. To begin, we might attempt to compare the three population means by evaluating all possible pairs of sample means using the two-sample t-test. For a total of three groups, there are three tests required; we would compare group 1 to group 2, group 1 to group 3, and group 2 to group 3. We assume that the variances of the underlying populations are all equal, or σ₁² = σ₂² = σ₃² = σ². The pooled estimate of the common variance, which we denote by $s_W^2$, combines information from all three samples to estimate σ²; in particular,

$$ s_W^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + (n_3-1)s_3^2}{n_1+n_2+n_3-3}. $$
This quantity is simply an extension of $s_p^2$, the pooled estimate of the variance used for the two-sample t-test. Performing all possible pairs of tests is not a problem if the number of populations is relatively small. In the instance where k = 3, there are only three such tests. If k = 10, however, the process becomes much more complicated; in this case, we would have to perform $\binom{10}{2} = 45$ different pairwise tests. Even more important, another problem that arises when all possible two-sample t-tests are conducted is that this procedure is likely to lead to an incorrect conclusion. Suppose that the three
TABLE 12.1
Forced expiratory volume in 1 second (liters) for patients with coronary artery disease sampled at three different medical centers

Johns Hopkins   Rancho Los Amigos   St. Louis
3.23            3.22                2.79
3.47            2.88                3.22
1.86            1.71                2.25
2.47            2.89                2.98
3.01            3.77                2.47
1.69            3.29                2.77
2.10            3.39                2.95
2.81            3.86                3.56
3.28            2.64                2.88
3.36            2.71                2.63
2.61            2.71                3.38
2.91            3.41                3.07
1.98            2.87                2.81
2.57            2.61                3.17
2.08            3.39                2.23
2.47            3.17                2.19
2.47                                4.06
2.74                                1.98
2.88                                2.81
2.63                                2.85
2.53                                2.43
                                    3.20
                                    3.53

n1 = 21              n2 = 16              n3 = 23
x̄1 = 2.63 liters    x̄2 = 3.03 liters    x̄3 = 2.88 liters
s1 = 0.496 liters    s2 = 0.523 liters    s3 = 0.498 liters
FIGURE 12.1
95% confidence intervals for the true mean forced expiratory volumes in 1 second at three medical centers

population means are in fact equal and that we conduct all three pairwise tests. We assume, for argument's sake, that the tests are independent and set the significance level for each one at 0.05. By the multiplicative rule, the probability of failing to reject a null hypothesis of no difference in all instances – and thereby drawing the correct conclusion in each of the three tests – would be

$$ P(\text{fail to reject in all three tests}) = (1-0.05)^3 = (0.95)^3 = 0.857. $$

Consequently, the probability of rejecting the null hypothesis in at least one of the tests would be

$$ P(\text{reject in at least one test}) = 1 - 0.857 = 0.143. $$
Since we know that the null hypothesis is true in each case, 0.143 is the overall probability of committing a type I error. As we can see, the combined probability of a type I error for the set of three tests is much larger than 0.05. In reality the problem is even more complex; since each of the t-tests is conducted using the same set of data, we cannot assume that they are all independent. We would like to be able to use a testing procedure in which the overall probability of making a type I error is equal to some predetermined level α. The one-way analysis of variance is such a technique.
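The arithmetic of this multiple testing problem is easy to check. The short R sketch below (the object names are ours, chosen only for illustration) reproduces the two probabilities above:

    alpha <- 0.05
    # probability of correctly failing to reject in all three independent tests
    p_all_correct <- (1 - alpha)^3   # 0.857
    # probability of rejecting in at least one test: the overall type I error
    1 - p_all_correct                # 0.143

Replacing the exponent 3 with choose(k, 2) shows how quickly this overall error probability grows as the number of groups k increases.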
12.1.2 Sources of Variation
As its name implies, the one-way analysis of variance procedure is dependent on estimates of variability or dispersion. The term “one-way” indicates that there is a single factor or characteristic that distinguishes the various populations from each other; in the study of carbon monoxide exposure, that characteristic is the medical center at which a subject was recruited. When we work with several different populations with a common variance σ 2 , two measures of variability can be computed: the variation of the individual values around their population means, and the variation of the population
means around the overall mean. If the variability within the k different populations is small relative to the variability among their respective means, this suggests that the population means are in fact different. To test the null hypothesis

$$ H_0: \mu_1 = \mu_2 = \cdots = \mu_k $$

for a set of k populations, we first need to find a measure of the variability of the individual observations around their population means. The pooled estimate of the common variance σ² is one such measure; if we let n = n1 + n2 + · · · + nk, then

$$ s_W^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_k-1)s_k^2}{n_1+n_2+\cdots+n_k-k} = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_k-1)s_k^2}{n-k}. $$
This quantity is simply a weighted average of the k individual sample variances. The subscript W refers to the "within-groups" variability. We next need an expression that estimates the extent to which the population means vary around the overall mean. If the null hypothesis is true and the means are identical, the amount of variability expected will be the same as that for an individual population; thus, this quantity also estimates the common variance σ². In particular,

$$ s_B^2 = \frac{n_1(\bar{x}_1-\bar{x})^2 + n_2(\bar{x}_2-\bar{x})^2 + \cdots + n_k(\bar{x}_k-\bar{x})^2}{k-1}. $$
The terms (x̄i − x̄)² are the squared deviations of the sample means x̄i from the grand mean x̄. The grand mean is defined as the overall average of the n observations that make up the k different samples. It can be calculated as

$$ \bar{x} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \cdots + n_k\bar{x}_k}{n_1+n_2+\cdots+n_k} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \cdots + n_k\bar{x}_k}{n}. $$
The subscript B denotes the "between-groups" variability. Now that we have these two different estimates of the variance, we would like to be able to answer the following question: Do the sample means vary around the grand mean more than the individual observations vary around the sample means? If they do, this implies that the corresponding population means are in fact different. To test the null hypothesis that the population means are identical, we use the test statistic

$$ F = \frac{s_B^2}{s_W^2}. $$

Under the null hypothesis – meaning if the null hypothesis is true – both $s_W^2$ and $s_B^2$ estimate the common variance σ², and F is close to 1. If there is a difference among populations, the between-groups variance exceeds the within-groups variance, and F is greater than 1. Under H0, the ratio F has an F distribution with k − 1 and n − k degrees of freedom; the degrees of freedom correspond to the numerator and the denominator of the test statistic, respectively. We represent this probability distribution using the notation F_{k−1, n−k}, or, more generally, F_{df1, df2}. If we have only two independent samples, the F-test reduces to the two-sample t-test.
FIGURE 12.2
The F distribution with 4 and 2 degrees of freedom

The F distribution is similar to the t in that it is not unique; there is a different F distribution for each possible pair of values df1 and df2. Unlike the t distribution, however, the F distribution cannot assume negative values. In addition, it is skewed to the right. The extent to which it is skewed is determined by the values of the degrees of freedom. To illustrate its shape, the F distribution with 4 and 2 degrees of freedom is pictured in Figure 12.2. Table A.5 in Statistical Tables is a table of critical values computed for the family of F distributions. Only selected percentiles are included – in this case, the upper 10.0, 5.0, 2.5, 1.0, 0.5, and 0.1% of the distributions. The degrees of freedom for the numerator are displayed across the top of the table, and the degrees of freedom for the denominator are listed down the left-hand side. For any given combination, the corresponding entry in the table represents the value of F_{df1, df2} that cuts off the specified area in the upper tail of the distribution. Given an F distribution with 4 and 2 degrees of freedom, for example, the table shows that F_{4,2} = 19.25 cuts off the upper 5% of the curve.

Referring back to the fev1 data collected for patients from three different medical centers, we are interested in testing H0: µ1 = µ2 = µ3, the null hypothesis that the mean forced expiratory volumes in 1 second for subjects from each of the three centers are identical. To begin, we verify that the fev1 measurements are approximately normally distributed. Based on the histograms shown in Figure 12.3, this appears to be a reasonable assumption. Next, since we feel it is fair to assume that the population variances are identical (note that the sample standard deviations are all quite close), we compute the estimate of the within-groups variance:

$$ s_W^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + (n_3-1)s_3^2}{n_1+n_2+n_3-3} = \frac{(21-1)(0.496)^2 + (16-1)(0.523)^2 + (23-1)(0.498)^2}{21+16+23-3} = 0.254 \text{ liters}^2. $$
FIGURE 12.3
Histograms for measurements of forced expiratory volume in 1 second for subjects at three medical centers

Since

$$ \bar{x} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + n_3\bar{x}_3}{n_1+n_2+n_3} = \frac{21(2.63) + 16(3.03) + 23(2.88)}{21+16+23} = 2.83 \text{ liters}, $$

the estimate of the between-groups variance is

$$ s_B^2 = \frac{n_1(\bar{x}_1-\bar{x})^2 + n_2(\bar{x}_2-\bar{x})^2 + n_3(\bar{x}_3-\bar{x})^2}{3-1} = \frac{21(2.63-2.83)^2 + 16(3.03-2.83)^2 + 23(2.88-2.83)^2}{3-1} = 0.769 \text{ liters}^2. $$

Therefore, the test statistic is

$$ F = \frac{s_B^2}{s_W^2} = \frac{0.769}{0.254} = 3.03. $$
For an F distribution with k − 1 = 3 − 1 = 2 and n − k = 60 − 3 = 57 degrees of freedom, 0.05 < p < 0.10. Although we would reject the null hypothesis at the 0.10 level, we do not reject it at the 0.05 level. There may possibly be some difference among the mean fev1 measurements for these three populations, but we are unable to state this with our specified level of significance.
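As a check on these hand calculations, the following R sketch computes the same F statistic and its p-value directly from the summary statistics above; the object names are ours and are not part of any data set:

    n    <- c(21, 16, 23)                  # sample sizes
    xbar <- c(2.63, 3.03, 2.88)            # sample means (liters)
    s    <- c(0.496, 0.523, 0.498)         # sample standard deviations
    k    <- length(n); N <- sum(n)
    sw2  <- sum((n - 1) * s^2) / (N - k)   # within-groups variance
    grand <- sum(n * xbar) / N             # grand mean
    sb2  <- sum(n * (xbar - grand)^2) / (k - 1)   # between-groups variance
    F    <- sb2 / sw2                      # approximately 3.03
    pf(F, k - 1, N - k, lower.tail = FALSE)   # p-value, just above 0.05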
Summary: Sources of Variation

Within-groups variability:    $s_W^2 = \dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_k-1)s_k^2}{n-k}$

Between-groups variability:   $s_B^2 = \dfrac{n_1(\bar{x}_1-\bar{x})^2 + n_2(\bar{x}_2-\bar{x})^2 + \cdots + n_k(\bar{x}_k-\bar{x})^2}{k-1}$
Summary: Multi-Sample Hypothesis Test for Means, Independent Samples

Null hypothesis:                 H0: µ1 = µ2 = · · · = µk, where k is equal to the number of groups
Alternative hypothesis:          HA: At least one pair of means differs
Test:                            One-way analysis of variance
Test statistic:                  F = s_B² / s_W²
Distribution of test statistic:  F_{k−1, n−k}; F distribution with k − 1 and n − k degrees of freedom

12.2 Multiple Comparisons Procedures
As we have seen, one-way analysis of variance may be used to test the null hypothesis that k population means are identical, H0: µ1 = µ2 = · · · = µk. What happens, however, if we reject H0? Although we can conclude that the population means are not all equal, we cannot be more specific than this. We do not know whether all the means are different from one another, or if only some of them are different. Once we reject the null hypothesis, therefore, we often want to conduct additional tests to find out where the differences lie. Many different techniques for conducting multiple comparisons exist. They typically involve testing each pair of means individually. In the previous section, we mentioned that one possible approach is to perform a series of $\binom{k}{2}$ two-sample t-tests. As noted, however, performing multiple tests increases the probability of committing a type I error. We can avoid this problem by being more conservative in our individual comparisons; by reducing the individual α levels, we ensure that the overall level of significance is kept at a predetermined level. The significance level for each of the individual comparisons depends on the number of tests being conducted. The greater the number of tests, the smaller it must be. To set the overall probability of committing a type I error at 0.05, for example, we should use

$$ \alpha^* = \frac{0.05}{\binom{k}{2}} $$
as the significance level for an individual comparison. This modification is known as the Bonferroni correction. For the case in which we have k = 3 populations, a total of $\binom{3}{2} = 3$ tests are required. Consequently, if we wish to set the overall level of significance at 0.10, we must use

$$ \alpha^* = \frac{0.10}{3} = 0.033 $$
as the level for each individual test. To conduct a test of the null hypothesis H0: µi = µj, we calculate

$$ t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{s_W^2\,[(1/n_i) + (1/n_j)]}}, $$
the test statistic for a two-sample t-test. Note that instead of using the data from only two samples to estimate the common variance σ², however, we take advantage of the additional information that is available and use all k samples. Under the null hypothesis, t_{ij} has a t distribution with n − k degrees of freedom. For the comparison of baseline fev1 values among the three medical centers, we begin by considering populations 1 and 2, the patients at Johns Hopkins and those at Rancho Los Amigos. In this case,

$$ t_{12} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_W^2\,[(1/n_1) + (1/n_2)]}} = \frac{2.63 - 3.03}{\sqrt{0.254\,[(1/21) + (1/16)]}} = -2.39. $$
For a t distribution with n − k = 60 − 3 = 57 degrees of freedom, p = 0.02. Therefore, we reject the null hypothesis at the 0.033 level of significance and conclude that µ1 is not equal to µ2. Looking at the sample means, the mean baseline fev1 for patients at Johns Hopkins is lower than the mean for subjects at Rancho Los Amigos. We next compare populations 1 and 3, the group at Johns Hopkins and the group at St. Louis. The test statistic is

$$ t_{13} = \frac{\bar{x}_1 - \bar{x}_3}{\sqrt{s_W^2\,[(1/n_1) + (1/n_3)]}} = \frac{2.63 - 2.88}{\sqrt{0.254\,[(1/21) + (1/23)]}} = -1.64. $$
Since p > 0.10, we do not have sufficient evidence to reject the null hypothesis that µ1 is equal to µ3. Finally, we compare the patients at Rancho Los Amigos and those at St. Louis using the test statistic

$$ t_{23} = \frac{\bar{x}_2 - \bar{x}_3}{\sqrt{s_W^2\,[(1/n_2) + (1/n_3)]}} = \frac{3.03 - 2.88}{\sqrt{0.254\,[(1/16) + (1/23)]}} = 0.91. $$
This time p > 0.20, and we are unable to reject the null hypothesis that µ2 is equal to µ3 . In summary, therefore, we find that the mean baseline fev1 measurement for patients at Johns Hopkins is somewhat lower than the mean for the patients at Rancho Los Amigos; we cannot make any further distinctions among the medical centers. One disadvantage of the Bonferroni multiple comparisons procedure is that it can suffer from a lack of power. It is highly conservative and may fail to detect a difference in means that actually exists. However, there are many other competing multiple comparisons procedures that could be used instead [215]. The appropriate technique to apply in a given situation depends upon a variety of factors, including the types of comparisons to be made (are all possible pairwise tests being performed as in the example above, for instance, or are two or more treatment groups being compared to a single control group), whether the comparisons were specified before collecting and summarizing the data or after, and whether all the samples contain equal numbers of observations.
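The three Bonferroni comparisons above can likewise be verified from the summary statistics alone. A minimal R sketch, with names chosen only for this illustration:

    sw2 <- 0.254; df <- 57                               # pooled variance, n - k
    t12 <- (2.63 - 3.03) / sqrt(sw2 * (1/21 + 1/16))     # -2.39
    t13 <- (2.63 - 2.88) / sqrt(sw2 * (1/21 + 1/23))     # -1.64
    t23 <- (3.03 - 2.88) / sqrt(sw2 * (1/16 + 1/23))     #  0.91
    p <- 2 * pt(-abs(c(t12, t13, t23)), df)              # two-sided p-values
    p < 0.10 / 3     # compare each p-value to the Bonferroni level 0.033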
Summary: Bonferroni Multiple Comparisons Procedure

Null hypothesis:                 H0: µi = µj
Alternative hypothesis:          HA: µi ≠ µj
Test:                            Two-sample t-test
Test statistic:                  $t_{ij} = \dfrac{\bar{x}_i - \bar{x}_j}{\sqrt{s_W^2\,[(1/n_i) + (1/n_j)]}}$
Distribution of test statistic:  t distribution with n − k degrees of freedom
Significance level
(Bonferroni correction):         $\alpha^* = \dfrac{0.05}{\binom{k}{2}}$

Repeat for each of the $\binom{k}{2}$ pairwise comparisons.
12.3 Further Applications
A study was conducted to follow three groups of overweight males for a period of one year [216]. The first group decreased their energy intake by dieting but did not participate in an exercise program. The second group exercised regularly but did not alter their eating habits. The third group changed neither their diet nor their level of physical activity. At the end of one year, total change in body weight was measured for each individual. Among these three populations, is there any evidence of a difference in mean change in body weight? We begin by noting that changes in weight tend to be normally distributed, and then select a random sample from each population. The sample of 42 males who dieted has mean change in body weight x̄1 = −7.2 kg and standard deviation s1 = 4.1 kg. The sample of 47 men who participated in an exercise program has mean x̄2 = −4.0 kg and standard deviation s2 = 4.0 kg. Finally, the sample of 42 men who neither dieted nor exercised has mean x̄3 = 0.6 kg and standard deviation s3 = 3.9 kg.
We are interested in testing the null hypothesis that the mean changes in total body weight are identical for the three populations, or H0: µ1 = µ2 = µ3. The alternative hypothesis is that at least one of the population means differs from one of the others. We assume that the underlying population variances are all equal, a reasonable assumption given the very similar sample standard deviations above. It should be noted, however, that the one-way analysis of variance is relatively insensitive to departures from this assumption; even if the variances are not quite identical, the technique still works pretty well. To conduct the test, we begin by computing an estimate of the within-groups variance:

$$ s_W^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + (n_3-1)s_3^2}{n_1+n_2+n_3-3} = \frac{(42-1)(4.1)^2 + (47-1)(4.0)^2 + (42-1)(3.9)^2}{42+47+42-3} = 16.0 \text{ kg}^2. $$
Since the grand mean of the data is

$$ \bar{x} = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + n_3\bar{x}_3}{n_1+n_2+n_3} = \frac{42(-7.2) + 47(-4.0) + 42(0.6)}{42+47+42} = -3.55 \text{ kg}, $$
the estimate of the between-groups variance is

$$ s_B^2 = \frac{n_1(\bar{x}_1-\bar{x})^2 + n_2(\bar{x}_2-\bar{x})^2 + n_3(\bar{x}_3-\bar{x})^2}{3-1} = \frac{42(-7.2+3.55)^2 + 47(-4.0+3.55)^2 + 42(0.6+3.55)^2}{3-1} = 646.2 \text{ kg}^2. $$
Consequently, the test statistic is

$$ F = \frac{s_B^2}{s_W^2} = \frac{646.2}{16.0} = 40.4. $$
For an F distribution with k − 1 = 3 − 1 = 2 and n − k = 131 − 3 = 128 degrees of freedom, p < 0.001. Therefore, we reject the null hypothesis and conclude that the mean changes in total body weight are not identical for the three populations. Now that we have determined that the population means are not all the same, we would like to find out where the specific differences lie. One way to do this is to apply the Bonferroni multiple comparisons procedure. If we plan to conduct all possible pairwise tests and wish to set the overall probability of making a type I error at 0.05, then the significance level for an individual comparison is

$$ \alpha^* = \frac{0.05}{\binom{3}{2}} = \frac{0.05}{3} = 0.0167. $$
We begin by considering the group of males who dieted and those who were enrolled in an exercise program. To test the null hypothesis that the mean changes in total body weight are identical for these two populations, H0 : µ1 = µ2,
we calculate the test statistic for group 1 versus group 2,

$$ t_{12} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_W^2\,[(1/n_1) + (1/n_2)]}} = \frac{-7.2 - (-4.0)}{\sqrt{16.0\,[(1/42) + (1/47)]}} = -3.77. $$
For a t distribution with n − k = 131 − 3 = 128 degrees of freedom, p < 0.001. Therefore, we reject the null hypothesis at the 0.0167 level of significance and conclude that µ1 differs from µ2. Looking at the sample means, we see that the mean decrease in total body weight is greater for the men who dieted. We next compare means for the men who dieted and for those who neither dieted nor exercised. In this case, the test statistic for group 1 versus group 3 is

$$ t_{13} = \frac{\bar{x}_1 - \bar{x}_3}{\sqrt{s_W^2\,[(1/n_1) + (1/n_3)]}} = \frac{-7.2 - 0.6}{\sqrt{16.0\,[(1/42) + (1/42)]}} = -8.93. $$
Since p < 0.001, we again reject the null hypothesis of equal means and conclude that µ1 differs from µ3. The mean decrease in body weight is larger for the group of men who dieted. Finally, we compare means for the men who exercised and those who did not change their lifestyles in any way, calculating

$$ t_{23} = \frac{\bar{x}_2 - \bar{x}_3}{\sqrt{s_W^2\,[(1/n_2) + (1/n_3)]}} = \frac{-4.0 - 0.6}{\sqrt{16.0\,[(1/47) + (1/42)]}} = -5.42. $$
We find that p < 0.001; as a result, we conclude that µ2 is not equal to µ3 . The mean decrease in total body weight is greater for the men who participated in an exercise program. In summary, when we conduct each individual test at the 0.0167 level of significance, we conclude that all three of the population means are different from each other. The mean change in total body weight is lowest for the men who dieted – meaning that they lost the most weight – followed by the men who exercised. The mean is highest for those who made no changes in their lifestyle. Rather than work through all these computations ourselves, we could have used a computer to carry out the analysis for us. To illustrate, Stata output is displayed in Table 12.2. The top portion of the output lists numerical summary measures of change in body weight for each of the three groups of males. Below this is an analysis of variance table. On the far right, the table lists the test statistic F and its associated p-value. In Stata output, a p-value displayed as 0.0000 means that p < 0.0001. The column labeled MS, or mean squares, contains the between- and within-groups estimates of the variance. The columns labeled SS and df contain the numerators and denominators of these estimates, respectively. Note that each of the numerators is actually a sum of squared deviations from the mean; with s B2 we are concerned with the deviations of the sample means from the grand mean, and with sW2 we use the deviations of the individual observations from the sample means. Consequently, the between- and within-groups estimates of the variance may be thought of as averages of squared deviations. The bottom portion of Table 12.2 contains the results of the Bonferroni multiple comparisons procedure. For each of the three possible pairwise t-tests, the table lists the difference in sample
TABLE 12.2
Stata output for one-way analysis of variance with the Bonferroni multiple comparisons procedure

Intervention |  Summary of Change in body weight (kg)
       group |        Mean   Std. Dev.    Freq.
-------------+----------------------------------
        Diet |  -7.2047619   4.1359784       42
    Exercise |  -3.9978723   3.9547159       47
   No Change |    .5595238   3.9002226       42
-------------+----------------------------------
       Total |  -3.5648855   5.0567384      131

                        Analysis of Variance
    Source              SS     df          MS        F     Prob > F
-------------------------------------------------------------------
Between groups  1279.70843      2  639.854217    40.06      0.0000
 Within groups     2044.47    128  15.9724219
-------------------------------------------------------------------
         Total  3324.17844    130  25.5706034

Bartlett's test for equal variances:  chi2(2) = 0.1552   Prob>chi2 = 0.925

Comparison of Change in body weight (kg) by group (Bonferroni)
Row Mean-|
Col Mean |      Diet   Exercise
---------+---------------------
Exercise |   3.20689
         |     0.001
         |
No Change|   7.76429     4.5574
         |     0.000      0.000
TABLE 12.3
R output for one-way analysis of variance with the Bonferroni multiple comparisons procedure

                  Df Sum Sq Mean Sq F value   Pr(>F)
as.factor(group)   2   1280   639.9   40.06 3.09e-14 ***
Residuals        128   2044    16.0
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        Pairwise comparisons using t tests with pooled SD

data:  data$wt_change and data$group

  1       2
2 0.00072
3 1.4e-14 1.1e-06

P value adjustment method: bonferroni
means x̄i − x̄j and the corresponding p-value. Note that Stata has automatically applied a Bonferroni correction which adjusts the p-value for each comparison rather than the significance level. Since we are conducting three separate tests, Stata calculates each p-value and multiplies it by 3, rather than taking the significance level α and dividing by 3. Either way, we would draw the same conclusions. In this example, we again see that all comparisons are highly statistically significant. The corresponding output from R is displayed in Table 12.3. Once again, examining the results of the one-way analysis of variance and the subsequent Bonferroni multiple comparisons procedure, we would infer that all three population mean weight changes differ from each other.
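For reference, output of the form shown in Table 12.3 would be produced by R commands along the following lines, a sketch assuming a data frame named data containing the variables wt_change and group that appear in the output:

    # one-way analysis of variance
    summary(aov(wt_change ~ as.factor(group), data = data))
    # pairwise t tests with pooled SD and Bonferroni-adjusted p-values
    pairwise.t.test(data$wt_change, data$group, p.adjust.method = "bonferroni")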
12.4 Review Exercises
1. When testing the equality of several population means, what problems arise if you attempt to perform all possible pairwise two-sample t-tests?

2. What is the idea behind the one-way analysis of variance? What two measures of variation are being compared?

3. What are the properties of the F distribution?

4. Describe the purpose of the Bonferroni correction for multiple comparisons.

5. Consider the F distribution with 8 and 16 degrees of freedom.
   (a) What proportion of the area under the curve lies to the right of F = 2.09?
   (b) What value of F cuts off the upper 1% of the distribution?
   (c) What proportion of the area under the curve lies to the left of F = 4.52?

6. Consider the F distribution with 3 and 30 degrees of freedom.
   (a) What proportion of the area under the curve lies to the right of F = 5.24?
   (b) What proportion of the area under the curve lies to the left of F = 2.92?
   (c) What value of F cuts off the upper 2.5% of the distribution?
   (d) What value of F cuts off the upper 0.1%?
7. A study of patients with insulin-dependent diabetes was conducted to investigate the effects of cigarette smoking on renal and retinal complications. Before examining the results of the study, you wish to compare the baseline measures of systolic blood pressure across four different subgroups: nonsmokers, current smokers, exsmokers, and tobacco chewers. A sample is selected from each subgroup; the relevant data are provided below [217]. Sample means and standard deviations are expressed in millimeters of mercury (mm Hg). Assume that systolic blood pressure is normally distributed.

                          n     x̄      s
   Nonsmokers           269   115   13.4
   Current smokers       53   114   10.1
   Exsmokers             28   118   11.6
   Tobacco chewers        9   126   12.2

   (a) Calculate the estimate of the within-groups variance for the four populations.
   (b) Calculate the estimate of the between-groups variance.
   (c) Calculate the test statistic to evaluate the null hypothesis that the mean systolic blood pressures of the four groups are identical.
   (d) What is the probability distribution of the test statistic?
   (e) If you are conducting the test at the 0.05 level of significance, do you reject or fail to reject the null hypothesis?
   (f) What do you conclude?
   (g) If you determine that the population means are not all equal, use the Bonferroni multiple comparisons procedure to determine where the differences lie. What is the significance level of each individual test? What do you conclude?

8. One of the goals of the Edinburgh Artery Study was to investigate the risk factors for peripheral arterial disease among persons 55 to 74 years of age. You wish to compare mean LDL cholesterol levels, measured in mmol/liter, among four different populations of subjects: patients with intermittent claudication or interruptions in movement, those with major asymptomatic disease, those with minor asymptomatic disease, and those with no evidence of disease at all. Samples are selected from each population; summary statistics are provided below [218].

                                      n     x̄      s
   Intermittent claudication         73   6.22   1.62
   Major asymptomatic disease       105   5.81   1.43
   Minor asymptomatic disease       240   5.77   1.24
   No disease                      1080   5.47   1.31
   (a) At the 0.05 level of significance, test the null hypothesis that the mean LDL cholesterol levels are the same for the four populations. What are the degrees of freedom associated with this test?
   (b) What do you conclude?
   (c) What assumptions about the data must be true in order to use the one-way analysis of variance technique?
   (d) Is it necessary to take an additional step in this analysis? If so, what is it? Explain.

9. A study was conducted to assess the performance of outpatient substance abuse treatment centers. Three different types of units were evaluated: private for-profit (FP), private not-for-profit (NFP), and public. Among the performance measures considered were minutes of individual therapy per session and minutes of group therapy per session. Samples were selected from each type of treatment center; summary statistics for session length are shown in the table below [219].
                     Individual Therapy          Group Therapy
   Center            n      x̄      s            n      x̄      s
   FP                37    49.5   15.5          30   105.8   42.9
   NFP              312    54.8   11.4         296    98.7   31.3
   Public           169    53.3   11.1         165    94.2   27.1
   (a) Given these numerical summary measures, how do the different types of treatment centers compare with respect to average minutes of therapy per session?
   (b) At the 0.05 level of significance, test the null hypothesis that the mean minutes of individual therapy per session are identical for each type of center. What do you conclude?
   (c) If necessary, carry out a multiple comparisons procedure and state your conclusions.
   (d) At the 0.05 level of significance, test the null hypothesis that the mean minutes of group therapy per session are the same for each type of center. What do you conclude?
   (e) Again carry out a multiple comparisons procedure if necessary, and state your conclusions.
   (f) How do the different types of treatment centers compare to each other?

10. For the study discussed in the text which investigates the effect of carbon monoxide exposure on patients with coronary artery disease, baseline measures of pulmonary function were examined across the three medical centers which enrolled patients. Another characteristic that you might wish to investigate is age. The relevant measurements are saved in a dataset called cad. Values of age are saved under the variable name age, and indicators of center are saved under center.
   (a) Construct both a histogram and a box plot for the measurements of age. Does age appear to be at least approximately normally distributed?
   (b) For each medical center, calculate the sample mean and standard deviation for age.
   (c) At the 0.05 level of significance, test the null hypothesis that, at the time the study was conducted, the mean ages for men with coronary artery disease at each of the three centers are identical. What is the test statistic? What is the probability distribution of the test statistic?
   (d) What is the p-value for the test? Do you reject or fail to reject the null hypothesis?
   (e) What do you conclude?
   (f) If necessary, perform the Bonferroni method for multiple comparisons. What do you conclude?

11. The data set lowbwt contains information for a sample of 100 low birth weight infants born in two teaching hospitals in Boston, Massachusetts [81]. Systolic blood pressure measurements are saved under the variable name sbp and indicators of sex – where 1 represents a male and 0 a female – under the name sex.
   (a) Assuming equal variances for males and females, use the two-sample t-test to evaluate the null hypothesis that among low birth weight infants, the mean systolic blood pressure for girls is equal to that for boys. What do you conclude?
   (b) Even though there are only two populations instead of three or more, test the same null hypothesis using the one-way analysis of variance. What do you conclude?
   (c) In the text, the statement was made that in the case of two independent samples, the F-test used in the one-way analysis of variance reduces to the two-sample t-test. Do you believe this to be true? Explain briefly.
13 Nonparametric Methods
CONTENTS
13.1 Sign Test
13.2 Wilcoxon Signed-Rank Test
13.3 Wilcoxon Rank Sum Test
13.4 Kruskal-Wallis Test
13.5 Advantages and Disadvantages of Nonparametric Methods
13.6 Further Applications
13.7 Review Exercises
For all the statistical tests we have studied up to this point, the populations from which the data were sampled were assumed to be at least approximately normally distributed, or the sample sizes were large enough that the central limit theorem could be applied. Normality of the populations is necessary for the tests to be valid. Since the forms of the underlying distributions are assumed to be known and only the values of certain parameters – the means and standard deviations of the normal distributions – are not, these procedures are called parametric tests. If the data do not conform to the assumptions made by such traditional techniques, nonparametric methods of statistical inference can be used instead. Nonparametric techniques make fewer assumptions about the nature of the underlying probability distributions. As a result, they are sometimes called distribution-free methods. Nonparametric tests of hypothesis follow the same general procedure as the parametric tests we have already studied. We begin by making some claim about the underlying populations in the form of a null hypothesis. We calculate a test statistic based on random samples of observations drawn from the underlying populations. We then use the value of the test statistic to determine a p-value, compare the p-value to the significance level of the test α, and either reject or fail to reject the null hypothesis.
13.1 Sign Test
The sign test may be used to compare two populations which are not independent. In this respect, it is similar to the paired t test. A random sample of paired observations is selected from the two populations of interest. The test then focuses on the difference in values within each pair. However, it does not require that the population of differences be normally distributed. Furthermore, whereas the null hypothesis of the paired t test is that the mean of the underlying population of differences is equal to 0, the null hypothesis of the sign test is that the median difference is equal to 0. Consider a study designed to investigate the amount of energy expended by patients born with cystic fibrosis. We would like to compare energy expenditure at rest for persons suffering from this disease and for healthy individuals matched to the cystic fibrosis patients on important clinical
TABLE 13.1
Resting energy expenditure (ree) for patients with cystic fibrosis and healthy individuals matched on age, sex, height, and weight

        REE (kcal/day)
Pair    CF      Healthy    Difference   Sign
1       1153      996          157       +
2       1132     1080           52       +
3       1165     1182          −17       −
4       1460     1452            8       +
5       1634     1162          472       +
6       1493     1619         −126       −
7       1358     1140          218       +
8       1453     1123          330       +
9       1185     1113           72       +
10      1824     1463          361       +
11      1793     1632          161       +
12      1930     1614          316       +
13      2075     1836          239       +
characteristics. Because the subjects with and without cystic fibrosis are matched, the two groups are not independent. If differences in resting energy expenditure (ree) within each matched pair are normally distributed, we would be able to assess whether the mean difference is equal to 0 using the paired t test. If we do not feel it is appropriate to make this assumption, however, we could instead use the sign test to evaluate the null hypothesis

$$ H_0: \delta_{\text{median}} = 0, $$

where δmedian is the median of the population of differences. For a two-sided test, the alternative hypothesis is

$$ H_A: \delta_{\text{median}} \neq 0. $$

We will conduct the test at the 0.05 level of significance. We begin by selecting a random sample of n pairs of observations from the two populations. Table 13.1 contains the measurements of ree for samples of 13 patients with cystic fibrosis and 13 healthy individuals matched to the patients on age, sex, height, and weight [220]. Using these values, we calculate the difference in ree for each pair of observations. The distribution of these differences is displayed in the histogram in Figure 13.1. Next, if a difference is greater than 0, the pair is assigned a plus sign, indicating that the individual with cystic fibrosis has the higher ree. If the difference is less than 0, the pair receives a minus sign. Here, the healthy subject has the higher ree. Differences of exactly 0 provide no information about which individual in the pair has a higher resting energy expenditure, and are excluded from the analysis. When differences are excluded, the sample size n is reduced accordingly. After assigning the plus and minus signs, we count the number of plus signs in the sample; this total is denoted by D. Under the null hypothesis that the median difference is equal to 0, we would expect to have equal numbers of plus signs and minus signs. Equivalently, the probability that a particular difference is positive is 0.5, and the probability that the difference is negative is also 0.5. If a plus sign is considered to be a "success," the n plus and minus signs can be thought of as the outcomes of a Bernoulli random variable with probability of success p = 0.5. The total number of
FIGURE 13.1
Differences in resting energy expenditure (ree) for patients with cystic fibrosis and healthy individuals matched on age, sex, height, and weight

plus signs D is then a binomial random variable with parameters n and p. The mean number of plus signs in a sample of size n is np = n(0.5) = n/2, and the standard deviation is $\sqrt{np(1-p)} = \sqrt{n/4}$. If D is either much larger or much smaller than n/2, we would want to reject the null hypothesis that the median difference is equal to 0. We evaluate H0 by considering the test statistic

$$ z_+ = \frac{D - (n/2)}{\sqrt{n/4}}. $$
If the null hypothesis is true and the sample size n is large enough, z+ follows an approximate standard normal distribution with mean 0 and standard deviation 1. This test is called the sign test because it depends only on the signs of the calculated differences, not their actual magnitudes. For the data in Table 13.1, there are D = 11 plus signs. If the probability of a positive difference is p = 0.5, the mean number of plus signs for samples of size 13 is

$$ \frac{n}{2} = \frac{13}{2} = 6.5, $$
6.5.
and the standard deviation is r
n 4
r =
√
3.25
=
1.80.
Therefore, the test statistic is z+
=
11 − 6.5 1.80
=
2.50.
The area under the standard normal curve to the right of z = 2.50 and to the left of z = −2.50 is p = 2(0.006) = 0.012. In other words, if the null hypothesis is true and the mean number of plus
signs for samples of size 13 is 6.5, the probability of observing 11 plus signs, or something even more extreme than this, is 0.012. Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the median difference among pairs is not equal to 0. Because most of the differences are positive (and the median difference for the sample of values is 161 kcal/day, which is greater than 0) we can infer that ree is higher among persons with cystic fibrosis than it is among healthy individuals. This could be due to a number of factors, including differences in metabolism and the increased effort required to breathe.

If the sample size n is small, the test statistic z+ cannot always be assumed to follow a standard normal distribution. In this case, we can use a different procedure to test the null hypothesis that the median of the population of differences is equal to 0. Recall that if H0 is true, D is a binomial random variable with parameters n and p = 0.5. Therefore, we could use the binomial distribution itself to calculate the probability of observing D positive differences or some number even more extreme than this. Using this method, we do not calculate a test statistic; instead, we calculate the p-value empirically. A hypothesis test of this type is called an exact test. For the resting energy expenditure data in Table 13.1, we found D = 11 plus signs. Under the null hypothesis that the median difference is equal to 0, we would expect only 13/2, or 6.5. The probability of observing 11 or more plus signs is

$$ P(D \geq 11) = P(D=11) + P(D=12) + P(D=13) = \binom{13}{11}(0.5)^{11}(0.5)^{13-11} + \binom{13}{12}(0.5)^{12}(0.5)^{13-12} + \binom{13}{13}(0.5)^{13}(0.5)^{13-13} = 0.0095 + 0.0016 + 0.0001 = 0.0112. $$
This is the p-value of the one-sided hypothesis test; we are considering only “more extreme” values of D which are larger than 6.5. What about values that are smaller than 6.5? For a two-sided test, we would reject the null hypothesis not only when D is too large, but also when it is too small. For the exact test, extreme values smaller than 6.5 are defined as values d with probabilities less than or equal to the probability of the observed outcome D = 11. Equivalently, we look for values d such that P(D = d) ≤ P(D = 11). Here, P(D = 0) = 0.0001, P(D = 1) = 0.0016, and P(D = 2) = 0.0095. Therefore, the p-value of the two-sided test is P(D ≥ 11)+ P(D ≤ 2) = 0.0224. Once again we would reject the null hypothesis at the 0.05 level of significance, and conclude that resting energy expenditure is higher among patients with cystic fibrosis.
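Both versions of the test are easy to reproduce in R from the values in Table 13.1. In this sketch the vector names are ours; binom.test is the standard function for the exact binomial test:

    cf      <- c(1153, 1132, 1165, 1460, 1634, 1493, 1358, 1453, 1185, 1824, 1793, 1930, 2075)
    healthy <- c(996, 1080, 1182, 1452, 1162, 1619, 1140, 1123, 1113, 1463, 1632, 1614, 1836)
    d <- cf - healthy                 # within-pair differences
    D <- sum(d > 0)                   # number of plus signs: 11
    # normal approximation to the sign test
    z <- (D - 13/2) / sqrt(13/4)      # 2.50
    2 * pnorm(-abs(z))                # p approximately 0.012
    # exact two-sided test based on the binomial distribution
    binom.test(D, n = 13, p = 0.5)    # p approximately 0.0224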
Summary: Two-Sample Nonparametric Hypothesis Test, Paired Samples

Null hypothesis:                 H0: δmedian = 0
Alternative hypothesis:          HA: δmedian ≠ 0
Test:                            Sign test
Test statistic:                  $z_+ = \dfrac{D - (n/2)}{\sqrt{n/4}}$
                                 where D = number of positive differences, n = sample size
Distribution of test statistic:  Standard normal or exact distribution
13.2 Wilcoxon Signed-Rank Test
Although the sign test frees us from having to make any assumptions about the underlying distribution of differences, it also ignores some potentially useful information: the magnitude of these differences. For patients with cystic fibrosis and healthy individuals matched on age, sex, height, and weight, a difference in ree of 8 kcal/day is counted the same as a difference of 472 kcal/day. As a result, the sign test is not often used in practice. Instead, the Wilcoxon signed-rank test can be used to compare two populations that are not independent. Like the sign test – and the paired t test – the signed-rank test does not consider the measurements sampled from the two populations separately. Instead, it focuses on the difference in values for each pair of observations. It does not require that the population of these differences be normally distributed. However, it does take into account the magnitudes of the differences as well as their signs. The Wilcoxon signed-rank test is used to evaluate the null hypothesis that in the underlying population of differences among pairs, the median difference is equal to 0. The alternative hypothesis is that the population median difference is not equal to 0.

Suppose we would like to investigate the use of the drug amiloride as a therapy for patients with cystic fibrosis. It is believed that this drug may help to improve air flow in the lungs, and thereby delay the loss of pulmonary function often associated with the disease. Forced vital capacity (fvc) is the volume of air that a person can expel from the lungs in 6 seconds; we would like to compare the reduction in fvc that occurs over a 25-week period of treatment with the drug to the reduction that occurs in the same patients over a similar period of time during treatment with a placebo. However, we are not willing to assume that differences in reduction in fvc are normally distributed.

To conduct the Wilcoxon signed-rank test, we proceed as follows. After setting the significance level of the test at 0.05, we select a random sample of n pairs of observations from the underlying populations. Table 13.2 contains the measurements of fvc reduction for a sample of 14 patients with cystic fibrosis [221]. We calculate the difference for each pair of observations, and, ignoring the signs of these differences, rank their absolute values from smallest to largest. In the traditional version of the test, a difference of 0 is not ranked. Instead, it is eliminated from the analysis, and the sample size n is reduced by 1 for each pair eliminated. (An alternative procedure will be shown in the Further Applications section of this chapter.) If there are any tied observations, they are assigned an average rank. If the two smallest differences had both taken the value 11, for instance, then each would have received a rank of (1 + 2)/2 = 1.5. Finally, we assign each rank either a plus or a minus sign, depending on the sign of the difference. For example, the difference in Table 13.2 that has the second smallest absolute value is −15; therefore, this observation receives rank 2. Because the difference itself is negative, however, the observation's signed rank is −2.

The next step in the test is to compute the sum of the positive ranks and the sum of the negative ranks. Ignoring the signs, we denote the smaller sum by T. Under the null hypothesis that the median of the underlying population of differences is equal to 0, we would expect the sample to have approximately equal numbers of positive and negative ranks. Furthermore, the sum of the positive ranks should be comparable in magnitude to the sum of the negative ranks. We evaluate this hypothesis by considering the statistic

$$ z_T = \frac{T - \mu_T}{\sigma_T}, $$

where

$$ \mu_T = \frac{n(n+1)}{4} $$

is the mean sum of the ranks and

$$ \sigma_T = \sqrt{\frac{n(n+1)(2n+1)}{24}} $$
TABLE 13.2
Reduction in forced vital capacity (fvc) for a sample of patients with cystic fibrosis

          Reduction in FVC (ml)
Subject   Placebo    Drug    Difference   Rank   Signed Rank
1            224      213         11        1         1
2             80       95        −15        2        −2
3             75       33         42        3         3
4            541      440        101        4         4
5             74      −32        106        5         5
6             85      −28        113        6         6
7            293      445       −152        7        −7
8            −23     −178        155        8         8
9            525      367        158        9         9
10           −38      140       −178       10       −10
11           508      323        185       11        11
12           255       10        245       12        12
13           525       65        460       13        13
14          1023      343        680       14        14
                          Total: sum of positive ranks = 86; sum of negative ranks = −19
is the standard deviation [222]. If H0 is true and the sample size n is large enough,

$$ z_T = \frac{T - \mu_T}{\sigma_T} $$

follows an approximate standard normal distribution. The differences in Table 13.2 are displayed as a histogram in Figure 13.2. The graph confirms that differences in reduction in fvc may not be normally distributed (although we cannot say for sure, since the sample size is small). Proceeding with the Wilcoxon signed-rank test, the sum of the positive ranks is 1 + 3 + 4 + 5 + 6 + 8 + 9 + 11 + 12 + 13 + 14 = 86, and the sum of the negative ranks is −2 + (−7) + (−10) = −19. Ignoring the signs, the smaller sum is T = 19. Also,

$$ \mu_T = \frac{n(n+1)}{4} = \frac{14(14+1)}{4} = 52.5 $$

and

$$ \sigma_T = \sqrt{\frac{n(n+1)(2n+1)}{24}} = \sqrt{\frac{14(14+1)[2(14)+1]}{24}} = 15.93. $$

The test statistic is

$$ z_T = \frac{T - \mu_T}{\sigma_T} = \frac{19 - 52.5}{15.93} = -2.10. $$

The area under the standard normal curve to the left of z = −2.10 and to the right of z = 2.10 is p = 2(0.018) = 0.036. This is the probability of observing a value of T as extreme as or more extreme than 19, given that the null hypothesis is true. Since the p-value is less than α = 0.05, we reject the null hypothesis and conclude that the median difference is not equal to 0. Most of the differences are positive, and the median reduction in forced vital capacity in Table 13.2 is 109.5 ml; this suggests that the reduction in fvc is greater during treatment with the placebo than it is during treatment with the drug. In other words, use of the drug does diminish the loss of pulmonary function.
FIGURE 13.2 Differences in reduction in forced vital capacity for a sample of patients with cystic fibrosis
If the sample size n is small, the test statistic zT cannot be assumed to follow a standard normal distribution. In this case, an exact version of the test can be applied. Because computations are difficult to perform by hand, we will always use a statistical package to calculate exact p-values for the Wilcoxon signed-rank test. This will be demonstrated in Section 13.6.
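In R, for example, the test is available through the standard function wilcox.test. A sketch using the fvc reductions from Table 13.2 (the vector names are ours) would be:

    placebo <- c(224, 80, 75, 541, 74, 85, 293, -23, 525, -38, 508, 255, 525, 1023)
    drug    <- c(213, 95, 33, 440, -32, -28, 445, -178, 367, 140, 323, 10, 65, 343)
    # Wilcoxon signed-rank test on the paired differences;
    # exact = TRUE requests the exact rather than normal-approximation p-value
    wilcox.test(placebo, drug, paired = TRUE, exact = TRUE)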
Summary: Two-Sample Nonparametric Hypothesis Test, Paired Samples

Null hypothesis:                 H0: δmedian = 0
Alternative hypothesis:          HA: δmedian ≠ 0
Test:                            Wilcoxon signed-rank test
Test statistic:                  $z_T = \dfrac{T - \mu_T}{\sigma_T}$
                                 where T = min(sum of positive ranks, sum of negative ranks),
                                 $\mu_T = \dfrac{n(n+1)}{4}$, $\sigma_T = \sqrt{\dfrac{n(n+1)(2n+1)}{24}}$,
                                 n = sample size
Distribution of test statistic:  Standard normal or exact distribution
13.3 Wilcoxon Rank Sum Test
The Wilcoxon rank sum test – or the mathematically equivalent procedure known as the Mann-Whitney test – is used to compare two independent populations. Consequently, it is a nonparametric counterpart of the two-sample t test. Unlike the t test, it does not require that the underlying populations be normally distributed or that their variances be equal. It does, however, assume that the population distributions have the same general shape. The Wilcoxon rank sum test evaluates the null hypothesis that the medians of the two independent populations are identical,

$$ H_0: \text{Median}_1 = \text{Median}_2, $$

versus the alternative hypothesis that the medians are not the same,

$$ H_A: \text{Median}_1 \neq \text{Median}_2. $$

Consider the distributions of normalized mental age scores for two populations of children suffering from phenylketonuria (pku). Individuals with this disorder are unable to metabolize the amino acid phenylalanine. It has been suggested that an elevated level of serum phenylalanine increases a child's likelihood of mental deficiency, which would lead to lower cognitive ability and a lower normalized mental age score. The members of the first population have average daily serum phenylalanine levels below 10.0 mg/dl and are considered to have low exposure; those in the second population have average levels above 10.0 mg/dl and are labeled as high exposure. We would like to compare mental age scores normalized to 48 months for these two groups of children using a test conducted at the 0.05 level of significance. However, we are not willing to assume that normalized mental age scores are normally distributed in patients with pku.

To carry out the Wilcoxon rank sum test, we select independent random samples from each of the populations of interest. Table 13.3 displays samples taken from the two populations of children with pku; there are 21 children with low exposure and 18 children with high exposure [223]. We combine the two samples into one large group, order the observations from smallest to largest, and assign a rank to each one. If there are tied observations, we assign an average rank to all measurements with the same value. Note, for instance, that two of the children in the sample have a normalized mental age score of 37.0 months. Since these observations are fourth and fifth in the ordered list of 39 measurements, we assign an average rank of (4 + 5)/2 = 4.5 to each one. Similarly, three subjects have a normalized mental age score of 51.0 months; these observations each receive a rank of (22 + 23 + 24)/3 = 23.

The next step in the test is to find the sum of the ranks corresponding to each of the original samples. The smaller of the two sums is denoted by W. Under the null hypothesis that the underlying populations have identical medians, we would expect the ranks to be distributed evenly between the two groups. Therefore, the average ranks for each of the samples should be approximately equal. We test this hypothesis by calculating the statistic

$$ z_W = \frac{W - \mu_W}{\sigma_W}, $$

where

$$ \mu_W = \frac{n_S(n_S + n_L + 1)}{2} $$

is the mean sum of the ranks and

$$ \sigma_W = \sqrt{\frac{n_S n_L (n_S + n_L + 1)}{12}} $$
TABLE 13.3
Normalized mental age (nMA) scores for two samples of children suffering from phenylketonuria

Low Exposure (< 10.0 mg/dl)    High Exposure (≥ 10.0 mg/dl)
nMA (mos)   Rank               nMA (mos)   Rank
34.5        2                  28.0        1
37.5        6                  35.0        3
39.5        7                  37.0        4.5
40.0        8                  37.0        4.5
45.5        11.5               43.5        9
47.0        14.5               44.0        10
47.0        14.5               45.5        11.5
47.5        16                 46.0        13
48.7        19.5               48.0        17
49.0        21                 48.3        18
51.0        23                 48.7        19.5
51.0        23                 51.0        23
52.0        25.5               52.0        25.5
53.0        28                 53.0        28
54.0        31.5               53.0        28
54.0        31.5               54.0        31.5
55.0        34.5               54.0        31.5
56.5        36                 55.0        34.5
57.0        37
58.5        38.5
58.5        38.5
Total:      467                Total:      313
FIGURE 13.3
Normalized mental age scores for two samples of children suffering from phenylalaninemia, separated by average daily serum phenylalanine level

is the standard deviation of W [222]. In these equations, nS represents the number of observations in the sample that has the smaller sum of ranks and nL the number of observations in the sample with the larger sum. For large values of nS and nL,

$$ z_W = \frac{W - \mu_W}{\sigma_W} $$

follows an approximate standard normal distribution, assuming that the null hypothesis is true. Using the data in Table 13.3, the values of normalized mental age score for each phenylalanine exposure group are shown in Figure 13.3. While the scores may not be normally distributed – they appear to be skewed to the left – the histograms do have the same basic shape. The sum of the ranks in the low exposure group is 467, and the sum in the high exposure group is 313; therefore, W = 313. In addition,

$$ \mu_W = \frac{n_S(n_S + n_L + 1)}{2} = \frac{18(18+21+1)}{2} = 360, $$

and

$$ \sigma_W = \sqrt{\frac{n_S n_L(n_S + n_L + 1)}{12}} = \sqrt{\frac{18(21)(18+21+1)}{12}} = 35.5. $$

Substituting these values into the equation for the test statistic, we have

$$ z_W = \frac{W - \mu_W}{\sigma_W} = \frac{313 - 360}{35.5} = -1.32. $$
Since p = 2(0.093) = 0.186 is greater than the significance level 0.05, we do not reject the null hypothesis. The samples do not provide evidence of a difference in median normalized mental age scores for the two populations of children. If nS and nL are small, then zW cannot be assumed to follow a standard normal distribution. In this case, as we noted for the Wilcoxon signed-rank test, an exact version of the Wilcoxon rank sum test can be applied. In Section 13.6, we will use a statistical package to perform the calculations for us.
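In R, the same analysis can be sketched with wilcox.test applied to two independent samples; the vectors below hold the normalized mental age scores from Table 13.3, with names chosen for this illustration:

    low  <- c(34.5, 37.5, 39.5, 40.0, 45.5, 47.0, 47.0, 47.5, 48.7, 49.0, 51.0,
              51.0, 52.0, 53.0, 54.0, 54.0, 55.0, 56.5, 57.0, 58.5, 58.5)
    high <- c(28.0, 35.0, 37.0, 37.0, 43.5, 44.0, 45.5, 46.0, 48.0, 48.3, 48.7,
              51.0, 52.0, 53.0, 53.0, 54.0, 54.0, 55.0)
    # two-sample Wilcoxon rank sum (Mann-Whitney) test;
    # because of the tied ranks, R uses the normal approximation here
    wilcox.test(low, high, alternative = "two.sided")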
Summary: Two-Sample Nonparametric Hypothesis Test, Independent Samples

Null hypothesis:                 H0: Median1 = Median2
Alternative hypothesis:          HA: Median1 ≠ Median2
Test:                            Wilcoxon rank sum test
Test statistic:                  $z_W = \dfrac{W - \mu_W}{\sigma_W}$
                                 where $\mu_W = \dfrac{n_S(n_S + n_L + 1)}{2}$, $\sigma_W = \sqrt{\dfrac{n_S n_L(n_S + n_L + 1)}{12}}$,
                                 nS = size of sample with smaller sum of ranks,
                                 nL = size of sample with larger sum of ranks
Distribution of test statistic:  Standard normal or exact distribution

13.4 Kruskal-Wallis Test
The Kruskal-Wallis test is an extension of the Wilcoxon rank sum test which can be used to compare three or more independent populations. It is the nonparametric counterpart to the one-way analysis of variance, but does not require that the underlying populations be normally distributed or that their variances be equal. Like the Wilcoxon rank sum test, it does assume that the k populations being compared all have the same basic shape. The Kruskal-Wallis test evaluates the null hypothesis that the medians of the k populations are identical. The alternative hypothesis is that at least one of the population medians differs from one of the others. In Chapter 12, we used one-way analysis of variance to compare pulmonary function at the time of study enrollment for males with coronary artery disease recruited from three different medical centers. We wanted to determine whether patients from these institutions were in fact comparable before combining them for analysis. We assumed that forced expiratory volume in 1 second (fev1 ) was approximately normally distributed, and tested the null hypothesis that mean fev1 was identical for males with coronary artery disease at each of the k = 3 centers. With a p-value between 0.05 and 0.10 (p = 0.052, using Stata), we were strictly unable to reject the null hypothesis that mean pulmonary function is identical at the three centers at the 0.05 level of significance. What if we do not wish to assume that measurements of fev1 are normally distributed? In this case, we could use the Kruskal-Wallis test to evaluate the null hypothesis that the medians of the
three independent populations are all equal to each other, H_0: Median_1 = Median_2 = Median_3. The values of fev1 from Table 12.1 – representing independent random samples of males from each of the three medical centers – are reproduced in Table 13.4 [207].

To conduct the test, we follow the same procedure used for the Wilcoxon rank sum test. We begin by combining the three samples into one large group, ordering the observations from smallest to largest, and assigning a rank to each one. Tied observations receive the same average rank. We then find the sum of the ranks corresponding to each of the original samples, and label these R_1 for the first sample, R_2 for the second sample, and R_3 for the third sample. Under the null hypothesis that the underlying populations have identical medians, we expect the ranks to be distributed randomly among the three groups, and the average ranks to be approximately equal. For a comparison of k independent populations, we test the null hypothesis by calculating the statistic

$$H = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1)$$
where n is the sum of the individual sample sizes n_1 + n_2 + ... + n_k and k is the number of groups [222].

The probability distribution of the Kruskal-Wallis test statistic is a chi-square (χ²) distribution with k − 1 degrees of freedom. Like the F distribution, the chi-square distribution is not symmetric. A chi-square random variable cannot be negative; it assumes values from zero to infinity and is skewed to the right. As is true for all probability distributions, however, the total area beneath the curve is equal to one. Like the t and F distributions, there is a different chi-square distribution for each possible value of the degrees of freedom. The distributions with small degrees of freedom are highly skewed; as the number of degrees of freedom increases, the distributions become less skewed and more symmetric. This is illustrated in Figure 13.4. Table A.6 in Statistical Tables is a condensed table of areas for the chi-square distribution with various degrees of freedom. For a particular value of df, the entry in the table is the outcome of χ²_df that cuts off the specified area in the upper tail of the distribution. Given a chi-square distribution with 1 degree of freedom, for instance, χ²_1 = 3.84 cuts off the upper 5% of the area under the curve; it is the 95th percentile of that distribution.

Using the data in Table 13.4, R_1 = 499, R_2 = 604.5, and R_3 = 726.5. Furthermore, n = n_1 + n_2 + n_3 = 21 + 16 + 23 = 60. Substituting these values into the equation for the test statistic,

$$H = \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(n+1) = \frac{12}{60(60+1)}\left[\frac{(499)^2}{21} + \frac{(604.5)^2}{16} + \frac{(726.5)^2}{23}\right] - 3(60+1) = 6.00.$$

If the null hypothesis is true, this test statistic follows a chi-square distribution with k − 1 = 3 − 1 = 2 degrees of freedom. The value 5.99 is the 95th percentile of this probability distribution, so the p-value is just slightly below 0.05 (p = 0.0498 in Stata) and we reject the null hypothesis. We have evidence that median fev1 differs among the medical centers. (Note that in instances of borderline statistical significance, two different hypothesis tests can in fact lead to different conclusions; using one-way analysis of variance we found p = 0.052, but with the Kruskal-Wallis test p = 0.0498. These p-values are very similar, but one leads us to reject H_0 at the 0.05 level of significance while the other does not.)
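The arithmetic above is easy to verify with a few lines of R. The sketch below uses only the rank sums and sample sizes just computed; the object names fev1 and center mentioned in the final comment are hypothetical.

    # Kruskal-Wallis statistic computed from the rank sums in Table 13.4
    R   <- c(499, 604.5, 726.5)    # rank sums for the three medical centers
    n_i <- c(21, 16, 23)           # sample sizes
    n   <- sum(n_i)

    H <- 12 / (n * (n + 1)) * sum(R^2 / n_i) - 3 * (n + 1)
    pchisq(H, df = length(n_i) - 1, lower.tail = FALSE)   # H = 6.00, p = 0.0498

    # With the raw measurements, the built-in version (which also adjusts
    # for ties) would be: kruskal.test(fev1 ~ center)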
TABLE 13.4 Forced expiratory volume in 1 second (fev1) for patients with coronary artery disease sampled at three different medical centers

    Johns Hopkins     Rancho Los Amigos     St. Louis
    fev1    Rank       fev1    Rank         fev1    Rank
    1.69       1       1.71       2         1.98     4.5
    1.86       3       2.61    18.5         2.19       8
    1.98     4.5       2.64      22         2.23       9
    2.08       6       2.71    23.5         2.25      10
    2.10       7       2.71    23.5         2.43      11
    2.47    13.5       2.87      32         2.47    13.5
    2.47    13.5       2.88      34         2.63    20.5
    2.47    13.5       2.89      36         2.77      26
    2.53      16       3.17    42.5         2.79      27
    2.57      17       3.22    45.5         2.81      29
    2.61    18.5       3.29      49         2.81      29
    2.63    20.5       3.39    52.5         2.85      31
    2.74      25       3.39    52.5         2.88      34
    2.81      29       3.41      54         2.95      38
    2.88      34       3.77      58         2.98      39
    2.91      37       3.86      59         3.07      41
    3.01      40                            3.17    42.5
    3.23      47                            3.20      44
    3.28      48                            3.22    45.5
    3.36      50                            3.38      51
    3.47      55                            3.53      56
                                            3.56      57
                                            4.06      60
    Total:   499       Total:  604.5        Total: 726.5

Sample median fev1 is 2.61 liters for Johns Hopkins, 3.03 liters for Rancho Los Amigos, and 2.85 liters for St. Louis.
FIGURE 13.4 Chi-square distributions with 1 and 6 degrees of freedom

When we reject the null hypothesis that the medians of the k populations are all identical, we know that at least two of the medians are different, but we do not know which ones. We previously encountered this situation in Chapter 12 when discussing one-way analysis of variance and the concept of multiple comparisons procedures. After rejecting the null hypothesis, we need to conduct additional tests to see where the differences lie, preferably without inflating the probability of making a type I error. One option would be to use the Wilcoxon rank sum test to perform all pairwise comparisons – comparing median fev1 for population 1 versus population 2, population 1 versus population 3, and population 2 versus population 3 – while applying a Bonferroni correction to the significance level of each test, as sketched after the summary below.

Summary: Multi-Sample Nonparametric Hypothesis Test, Independent Samples

Null hypothesis                  H_0: Median_1 = Median_2 = · · · = Median_k, where k is the number of samples
Alternative hypothesis           H_A: At least two medians are different
Test                             Kruskal-Wallis test
Test statistic                   H = [12/(n(n+1))] Σ_{i=1}^{k} (R_i²/n_i) − 3(n+1)
                                 where
                                 R_i = sum of ranks for sample i
                                 n = sum of the individual sample sizes = n_1 + n_2 + ... + n_k
Distribution of test statistic   Chi-square distribution with k − 1 degrees of freedom
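As a sketch of the multiple comparisons step referenced above, a single call in R performs all three pairwise rank sum tests with a Bonferroni adjustment; the vector fev1 and factor center are hypothetical names for the raw data.

    # Pairwise Wilcoxon rank sum tests with Bonferroni-adjusted p-values,
    # assuming `fev1` holds the measurements and `center` identifies the
    # medical center for each patient
    pairwise.wilcox.test(fev1, center, p.adjust.method = "bonferroni")

    # Equivalently, run the three tests separately and compare each
    # p-value to 0.05/3 rather than 0.05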
13.5 Advantages and Disadvantages of Nonparametric Methods
Nonparametric techniques have several advantages over traditional methods of statistical inference. One advantage is that they do not require all the restrictive assumptions characteristic of parametric tests. It is not necessary that the underlying populations be normally distributed, for instance. At most, when two or more independent populations are being compared they should have the same basic shape. Their use of ranks makes nonparametric techniques less sensitive to measurement error than traditional tests. They can also be applied to ordinal or discrete data, in addition to continuous measurements. Since it does not make sense to calculate either a mean or a standard deviation for ordinal values, parametric tests are usually not appropriate for such data.

Nonparametric methods also have disadvantages. If the assumptions underlying a parametric test are satisfied, the nonparametric test is less powerful than the comparable parametric technique. This means that if the null hypothesis is false, the nonparametric test would require a larger sample to provide sufficient evidence to reject it. This loss of power is not substantial, however. If the sample data do come from an underlying normal population, the power of the Wilcoxon tests is approximately 95% of that of the t tests. In other words, if the t test requires 19 observations to achieve a particular level of power, the Wilcoxon test would need 20 observations to have the same power. Another disadvantage is that the hypotheses tested by nonparametric techniques tend to be less specific than those tested by traditional methods, focusing on medians rather than means. Because they rely on ranks rather than on the actual values of the observations, nonparametric tests do not use all the information known about a distribution. This, of course, presumes that our information about the underlying population is correct. Finally, if a large proportion of the observations are tied, then σ_T and σ_W overestimate the standard deviations of T and W, respectively. To compensate for this, a correction term must be incorporated into the calculations [222]. These correction terms are beyond the scope of this text, but they are implemented in most statistical packages that perform nonparametric hypothesis tests.
13.6 Further Applications
A study was conducted to investigate the use of extracorporeal membrane oxygenation (ecmo) – a mechanical system for oxygenating the blood – in the treatment of newborns with neonatal respiratory failure. It was thought that the use of this procedure may reduce the output of an infant's left ventricle, thereby decreasing the amount of blood pumped to the body. Therefore, we would like to compare left ventricular dimension before versus during the use of ecmo. We are not willing to assume that the population of differences in left ventricular dimension is normally distributed; therefore, we use the Wilcoxon signed-rank test to evaluate the null hypothesis that the median difference is equal to 0. We will conduct the test at the 0.05 level of significance.

Table 13.5 shows the relevant data for a sample of 15 infants suffering from extreme respiratory distress [224]. We begin by calculating the difference for each pair of observations, and note that the median difference is 0.0 cm. Then, ignoring the signs of these differences, we rank their absolute values from smallest to largest. Tied observations are assigned an average rank. Since a difference of 0 is not ranked and there are four of these in the data set, the sample size is reduced to n = 11. Each rank is then assigned either a plus or a minus sign depending on the sign of the difference itself. If we ignore the signs, the smaller sum of ranks is T = 18.5. In addition, the mean sum of the ranks is

$$\mu_T = \frac{n(n+1)}{4} = \frac{11(11+1)}{4} = 33,$$
TABLE 13.5 Left ventricular dimension (lvd) for a sample of infants suffering from neonatal respiratory failure

                  LVD (cm)
    Subject   Before   During   Difference   Rank   Signed Rank
                ECMO     ECMO
       1        1.6      1.6       0.0         −          −
       2        2.0      2.0       0.0         −          −
       3        1.2      1.2       0.0         −          −
       4        1.6      1.6       0.0         −          −
       5        1.6      1.5       0.1        2.5        2.5
       6        1.7      1.6       0.1        2.5        2.5
       7        1.6      1.5       0.1        2.5        2.5
       8        1.6      1.7      −0.1        2.5       −2.5
       9        1.6      1.4       0.2        5.5        5.5
      10        1.7      1.5       0.2        5.5        5.5
      11        1.0      1.3      −0.3         8         −8
      12        1.5      1.8      −0.3         8         −8
      13        1.5      1.8      −0.3         8         −8
      14        1.4      1.8      −0.4        10        −10
      15        1.5      2.0      −0.5        11        −11
                        Total (positive ranks):         18.5
                        Total (negative ranks):        −47.5

and the standard deviation is

$$\sigma_T = \sqrt{\frac{n(n+1)(2n+1)}{24}} = \sqrt{\frac{11(11+1)(22+1)}{24}} = 11.25.$$
Therefore, the test statistic is

$$z_T = \frac{T - \mu_T}{\sigma_T} = \frac{18.5 - 33}{11.25} = -1.29.$$
The p-value of the test is 2(0.099), or 0.198. Since p is greater than 0.05, we do not reject the null hypothesis. The sample fails to provide evidence that the median difference in left ventricular dimension is not equal to 0. We are unable to conclude that treatment with ecmo has an effect on the output of the infants' left ventricles.

If we use Stata to implement the Wilcoxon signed-rank test – as shown in Table 13.6 – we see that it performs an alternative version of the test. Rather than eliminate pairs with differences of 0, Stata leaves them in and ranks them as the smallest observations in the dataset. It then assigns half of these ranks to the positive sum, and half to the negative sum. Stata also makes additional corrections to the variance term in the test statistic to account for the presence of 0s, and of ties. Therefore, the test statistic z_T = −0.952 differs from the value calculated above. Since p = 0.34, however, we still fail to reject the null hypothesis. We are unable to conclude that treatment with ecmo affects the output of the infants' left ventricles.
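For readers working in R, the sketch below reproduces this analysis from the data in Table 13.5. Note that the output shown later in Table 13.7 was produced by a different R routine; the built-in wilcox.test used here drops the zero differences, like the hand calculation, but also adjusts the variance for ties, so its statistic is close to – not identical to – the hand-calculated z_T = −1.29.

    # Left ventricular dimension (cm) before and during ecmo, Table 13.5
    before <- c(1.6, 2.0, 1.2, 1.6, 1.6, 1.7, 1.6, 1.6, 1.6, 1.7,
                1.0, 1.5, 1.5, 1.4, 1.5)
    during <- c(1.6, 2.0, 1.2, 1.6, 1.5, 1.6, 1.5, 1.7, 1.4, 1.5,
                1.3, 1.8, 1.8, 1.8, 2.0)

    # Normal approximation without continuity correction; zero
    # differences are dropped automatically
    wilcox.test(before, during, paired = TRUE, exact = FALSE, correct = FALSE)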
TABLE 13.6 Stata output for the Wilcoxon signed-rank test

    Wilcoxon signed-rank test

            sign |      obs   sum ranks    expected
    -------------+---------------------------------
        positive |        5        38.5          55
        negative |        6        71.5          55
            zero |        4          10          10
    -------------+---------------------------------
             all |       15         120         120

    unadjusted variance       310.00
    adjustment for ties        -1.88
    adjustment for zeros       -7.50
                            ----------
    adjusted variance         300.63
    Ho: before = during
        z = -0.952
        Prob > |z| = 0.3413
        Exact Prob = 0.3516

In this study, the sample size n is small, and we may not wish to assume that z_T follows a standard normal distribution. Stata provides us with the p-value for an exact test, p = 0.35, which in this case is very similar to the p-value for the normal approximation test. R is also able to perform the exact version of the Wilcoxon signed-rank test; the output is shown in Table 13.7. Note that R is able to perform both versions of the signed-rank test; the top of the table contains the results of the test which ranks the 0 differences (like Stata), and the bottom the results when the 0 differences are excluded.

Emphysema is a swelling of the air sacs in the lungs that is characterized by labored breathing and an increased susceptibility to infection. Carbon monoxide diffusing capacity, denoted DlCO, is a measure of lung function that has been tested as a possible diagnostic tool for detecting emphysema. Consider the distributions of CO diffusing capacity for a population of healthy individuals, and a population of patients with emphysema. We are not willing to assume that these distributions follow a normal distribution. Therefore, using a Wilcoxon rank sum test conducted at the α = 0.05 level of significance, we evaluate the null hypothesis that the two populations have the same median DlCO. Table 13.8 provides the data for random samples of 13 individuals who have emphysema and 23 who do not [225]. Median DlCO is 15.25 for those with emphysema, and 21.41 for the healthy individuals.

To conduct the test, the observations are combined into one large group and ordered from smallest to largest; a rank is assigned to each one. Separating the ranks according to the original samples, the smaller sum of ranks corresponds to the sample of individuals who have emphysema. Therefore, W = 168. In addition,

$$\mu_W = \frac{n_S(n_S + n_L + 1)}{2} = \frac{13(13 + 23 + 1)}{2} = 240.5$$

is the mean sum of the ranks, and

$$\sigma_W = \sqrt{\frac{n_S n_L(n_S + n_L + 1)}{12}} = \sqrt{\frac{13(23)(13 + 23 + 1)}{12}} = 30.36$$
TABLE 13.7 R output for the Wilcoxon signed-rank test

    data: y by x (pos, neg)
          stratified by block
    Z = -0.95164, p-value = 0.3413
    alternative hypothesis: true mu is not equal to 0

    data: y by x (pos, neg)
          stratified by block
    Z = -1.2989, p-value = 0.194
    alternative hypothesis: true mu is not equal to 0
TABLE 13.8 Carbon monoxide diffusing capacity for samples of individuals with and without emphysema

      Emphysema          No Emphysema
    DlCO     Rank       DlCO     Rank
     7.51       2        6.19       1
    10.81       3       12.11       5
    11.75       4       14.12       8
    12.59       6       15.50      11
    13.47       7       15.52      12
    14.18       9       16.56      13
    15.25      10       17.06      14
    17.40      15       19.59      18
    17.75      16       20.21      19
    19.13      17       20.35      20
    20.93      21       21.05      22
    25.73      28       21.41      23
    26.16      30       23.39      24
                        23.60      25
                        24.05      26
                        25.59      27
                        25.79      29
                        26.29      31
                        29.60      32
                        30.88      33
                        31.42      34
                        32.66      35
                        36.16      36
    Total:    168       Total:    498
TABLE 13.9 Stata output for the Wilcoxon rank sum test

    Two-sample Wilcoxon rank-sum (Mann-Whitney) test

           group |      obs    rank sum    expected
    -------------+---------------------------------
            emph |       13         168       240.5
         no_emph |       23         498       425.5
    -------------+---------------------------------
        combined |       36         666         666

    unadjusted variance       921.92
    adjustment for ties         0.00
                            ----------
    adjusted variance         921.92

    Ho: capacity(group==1) = capacity(group==2)
        z = -2.388
        Prob > |z| = 0.0170
        Exact Prob = 0.0162
is the standard deviation of W. Under the null hypothesis that the two underlying populations have identical medians, the test statistic

$$z_W = \frac{W - \mu_W}{\sigma_W} = \frac{168 - 240.5}{30.36} = -2.39$$
follows an approximate standard normal distribution. The p-value of the test is 2(0.008) = 0.016. Therefore, we reject the null hypothesis at the 0.05 level of significance. The samples provide evidence that the median CO diffusing capacity of the population of individuals who have emphysema is different from the median of the population of those who do not. In general, people suffering from emphysema have lower CO diffusing capacities.

Stata output performing this hypothesis test is given in Table 13.9. Note that when necessary, Stata corrects the variance of the test statistic to account for the presence of ties. Here, because there are no ties, the test statistic calculated by hand and the one calculated by Stata are the same, both leading us to reject the null hypothesis. Also note that the p-value for the exact test is very similar to that for the normal approximation test in this instance. The R output for the same test is shown in Table 13.10.

Duration of hospital stay adjusted for the percentage of the total body surface area (tbsa%) burned is a commonly used outcome measure for burn victims. To look for differences in patient outcomes by the cause of their burn, one study compared adjusted operative stay, defined as the days between hospital admission and the last operation divided by tbsa% [226]. Cause of burn was classified as one of the following: scald, chemical, hot object, electricity, or flame. Because this outcome measure has a skewed distribution, as shown in Figure 13.5, we evaluate the null hypothesis that the five populations of patients defined by cause of burn have the same median adjusted operative stay using the Kruskal-Wallis test. Sample median values of adjusted operative stay are 0.67 days for the 46 patients who were scalded, 0.83 days for the 9 who suffered a chemical burn, 0.35 days for 20 patients injured by a hot
TABLE 13.10 R output for the Wilcoxon rank sum test

    Wilcoxon rank sum test

    data: capacity by group
    W = 77, p-value = 0.01623
    alternative hypothesis: true location shift is not equal to 0
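The output in Table 13.10 can be reproduced from the raw data in Table 13.8; a minimal sketch, with vector names of our choosing:

    # DlCO values from Table 13.8
    emph    <- c(7.51, 10.81, 11.75, 12.59, 13.47, 14.18, 15.25,
                 17.40, 17.75, 19.13, 20.93, 25.73, 26.16)
    no_emph <- c(6.19, 12.11, 14.12, 15.50, 15.52, 16.56, 17.06, 19.59,
                 20.21, 20.35, 21.05, 21.41, 23.39, 23.60, 24.05, 25.59,
                 25.79, 26.29, 29.60, 30.88, 31.42, 32.66, 36.16)

    # With small samples and no ties, R performs the exact test by default;
    # it reports the Mann-Whitney U statistic (W = 77) rather than the
    # rank sum 168
    wilcox.test(emph, no_emph)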
FIGURE 13.5 Operative stay adjusted for total body surface area burned for five samples of burn victims, separated by cause of burn
TABLE 13.11 Stata output for the Kruskal-Wallis test

    Kruskal-Wallis equality-of-populations rank test

    +-------------------------------+
    | cause       | Obs | Rank Sum  |
    |-------------+-----+-----------|
    | Scald       |  46 |   3756.50 |
    | Chemical    |   9 |    961.00 |
    | Hot object  |  20 |   1991.00 |
    | Electricity |  13 |   1513.00 |
    | Flame       | 129 |  15431.50 |
    +-------------------------------+

    chi-squared =  13.058 with 4 d.f.
    probability =  0.0110

    chi-squared with ties =  13.114 with 4 d.f.
    probability =  0.0107
TABLE 13.12 R output for the Kruskal-Wallis test

    Kruskal-Wallis rank sum test

    data: opstay_tbsa by cause
    Kruskal-Wallis chi-squared = 13.114, df = 4, p-value = 0.01073
object, 1.00 days for 13 individuals with an electrical burn, and 1.09 days for the 129 patients with flame burns. The Stata output for the Kruskal-Wallis test is shown in Table 13.11, and the R output in Table 13.12. With p = 0.011, the null hypothesis H0 : Median1 = Median2 = Median3 = Median4 = Median5 is rejected at the 0.05 level of significance. We conclude that median adjusted operative stay is not the same for the five populations of patients. It appears that individuals injured by hot objects have the shortest adjusted operative stays, and those burned by flames the longest. If we wish to more formally compare the outcome measure across the five groups to determine which medians differ, we would need to use a multiple comparisons procedure.
13.7 Review Exercises
1. How do nonparametric tests differ from parametric ones?

2. What are the advantages and disadvantages of using the sign test to analyze paired observations?

3. How does the Wilcoxon signed-rank test differ from the sign test?

4. How do the assumptions of the Wilcoxon rank sum test differ from those underlying the two-sample t test?

5. What are the advantages and disadvantages of using ranks rather than continuous measurements to conduct tests of hypothesis?

6. Refer to the resting energy expenditure (ree) data for patients with cystic fibrosis and healthy individuals matched to the patients on age, sex, height and weight that was provided in Table 13.1 [220]. We previously used the sign test to evaluate the null hypothesis that the median difference in ree is equal to 0.
   (a) Instead of the sign test, use the Wilcoxon signed-rank test to evaluate the null hypothesis that the median of the population of differences is equal to 0. Conduct the test at the 0.05 level of significance. What are the test statistic and the p-value?
   (b) Do you reject or fail to reject the null hypothesis?
   (c) What do you conclude?
   (d) Compare the results of the signed-rank test to those obtained when the sign test was used. Do you reach the same conclusion?

7. We are interested in examining the effects of the transition from fetal to postnatal circulation among premature infants. For each of 14 healthy newborns, respiratory rate is measured at two different times: once when the child is less than 15 days old, and again when he or she is more than 25 days old [227]. The data are presented below.
              Respiratory Rate
              (breaths/minute)
    Subject   Time 1   Time 2
       1        62       46
       2        35       42
       3        38       40
       4        80       42
       5        48       36
       6        48       46
       7        68       45
       8        26       40
       9        48       42
      10        27       40
      11        43       46
      12        67       31
      13        52       44
      14        88       48
   (a) Using the sign test, evaluate the null hypothesis that the median difference in respiratory rates for the two times is equal to 0. Conduct the test at the 0.05 level of significance. What do you conclude?
   (b) Evaluate the same hypothesis using the Wilcoxon signed-rank test. What do you conclude?

8. The data below are taken from a study that compares adolescents who have bulimia to adolescents without bulimia who have similar body compositions and levels of physical activity. The data consist of measures of daily caloric intake for random samples of 23 bulimic adolescents and 15 adolescents without bulimia [72].
         Daily Caloric Intake (kcal/kg)
            Bulimic            Nonbulimic
    15.9   18.9   20.7       25.1    30.6
    16.0   19.6   22.4       25.2    33.2
    16.5   21.5   23.1       25.6    33.7
    17.0   21.6   23.8       28.0    36.6
    17.6   22.9   24.5       28.7    37.1
    18.1   23.6   25.3       29.2    37.4
    18.4   24.1   25.7       30.9    40.8
    18.9   24.5              30.6
   (a) Test the null hypothesis that the median daily caloric intake of the population of individuals suffering from bulimia is equal to the median caloric intake of the population which does not have bulimia. Conduct a two-sided test at the 0.05 level of significance.
   (b) Do you believe that adolescents with bulimia require a lower daily caloric intake than do nonbulimic adolescents? Explain.

9. Nineteen individuals with asthma were enrolled in a study investigating the respiratory effects of sulfur dioxide. During the study, two measurements were obtained for each subject. The first was the increase in specific airway resistance – a measure of bronchoconstriction – from the time when the individual was at rest until after he or she had been exercising for 5 minutes in room air; the second was the increase in specific airway resistance for the same subject after he or she had undergone a similar exercise test conducted in an atmosphere of 0.25 ppm sulfur dioxide [48]. The measurements are contained in the dataset sar.dta.
   (a) Construct a histogram of the differences in increase in specific airway resistance for the two conditions.
   (b) In the sample of 19 subjects, what is the median difference in increase in specific airway resistance?
   (c) At the α = 0.05 level of significance, test the null hypothesis that the median difference in increase in specific airway resistance for the two occasions is equal to 0. What is the p-value? Do you reject or fail to reject the null hypothesis?
   (d) What do you conclude?
   (e) Do you feel that it would have been appropriate to use the paired t-test to evaluate these data? Why or why not?
10. The characteristics of low birth weight children dying of sudden infant death syndrome were examined for both females and males. The ages at time of death for samples of 11 female infants and 16 males are provided in the dataset sids.dta [228].
   (a) Construct box plots of age at time of death by sex.
   (b) Do the box plots suggest that there is a difference in age for female and male infants who die from sudden infant death syndrome?
   (c) Test the null hypothesis that the median ages at death are identical for the two populations, females and males. What do you conclude?
   (d) Do you feel that it would have been appropriate to use the two-sample t test to analyze these data? Why or why not?

11. A study was conducted to evaluate the effectiveness of a work site health promotion program in reducing the prevalence of cigarette smoking. Thirty-two work sites were randomly assigned to either implement the health program or to make no changes for a period of two years. The promotion program consisted of health education classes combined with a payroll-based incentive system. The data collected during the study are saved in the dataset work_program [229]. For each work site, smoking prevalence at the start of the study is saved under the variable name baseline, and smoking prevalence at the end of the two-year period under the name followup. The variable group contains the value 1 for the work sites that implemented the health program and 2 for the sites that did not.
   (a) For the work sites that implemented the health promotion program, test the null hypothesis that the median difference in smoking prevalence over the two-year period is equal to 0 at the 0.05 level of significance. What do you conclude?
   (b) Test the same null hypothesis for the sites that did not make any changes. What do you conclude?
   (c) Evaluate the null hypothesis that the median difference in smoking prevalence over the two-year period for work sites that implemented the health program is equal to the median difference for sites that did not implement the health program. Again conduct the test at the 0.05 level of significance. What do you conclude?
   (d) Do you believe that the health promotion program was effective in reducing the prevalence of smoking? Explain.

12. A study was conducted to investigate whether females who do not have health insurance coverage are less likely to be screened for breast cancer than those who do, and whether their disease is more advanced at the time of diagnosis [230]. The medical records for a sample of individuals who were privately insured and for a sample who were uninsured were examined. The stage of breast cancer at diagnosis was assigned a number between 1 and 5, where 1 denotes the least advanced disease and 5 the most advanced. The relevant observations are saved in the dataset called insurance; the stage of disease is saved under the variable name stage, and an indicator of group status – which takes the value 1 for females who were uninsured and 0 for those who were privately insured – under the name group.
   (a) Could the two-sample t test be used to analyze these data? Why or why not?
   (b) Test the null hypothesis that the median stage of breast cancer for females with private insurance is identical to the median stage of cancer for those who are not insured.
   (c) Do these data suggest that uninsured females have more advanced disease at the time of diagnosis of breast cancer than those who are insured? Explain.
13. The nursing home occupancy rates for each state in the United States in 2015 are saved in the dataset nursing_home [15]. For each state, the variable occupancy is the percentage of beds occupied (number of nursing home residents per 100 nursing home beds), and region contains the region of the United States for that state, defined as Northeast, Southeast, Midwest, Southwest, and West.
   (a) Construct box plots of the nursing home occupancy rates by region of the United States.
   (b) What test would you use to evaluate the null hypothesis that median nursing home occupancy rates are the same in each region of the country? Explain your choice.
   (c) What is the probability distribution of the test statistic for this technique?
   (d) Carry out the test. What is the p-value? Do you reject or fail to reject the null hypothesis at the 0.05 level of significance?
   (e) What do you conclude?
   (f) If you wish to characterize differences in nursing home occupancy rates in different regions of the United States, is there another step that needs to be taken in this analysis? Explain.

14. The dataset lowbwt contains measurements for a sample of 100 low birth weight infants born in two teaching hospitals in Boston, Massachusetts [81]. The values of Apgar score – an index of neonatal asphyxia or oxygen deprivation – recorded five minutes after birth are saved under the variable name apgar5. The Apgar score is an ordinal random variable that takes values between 0 and 10. Indicators of sex, where 1 represents a male and 0 a female, are saved under the name sex.
   (a) Construct box plots displaying five-minute Apgar scores for males and females separately.
   (b) Test the null hypothesis that among low birth weight infants, the median Apgar score for males is equal to the median score for females. What do you conclude?
14 Inference on Proportions
CONTENTS
14.1 Normal Approximation to the Binomial Distribution
14.2 Sampling Distribution of a Proportion
14.3 Confidence Intervals
14.4 Hypothesis Testing
14.5 Sample Size Estimation for One Proportion
14.6 Comparison of Two Proportions
14.7 Sample Size Estimation for Two Proportions
14.8 Further Applications
14.9 Review Exercises
In previous chapters we apply the techniques of statistical inference to continuous or measured data. In particular, we investigate the properties of population means and, in the case of nonparametric methods, medians. We now extend the methodology of statistical inference to include enumerated data or counts. The basic underlying principles remain the same, and the normal distribution again plays a key role. When studying counts, we are usually concerned with the proportion of times that an event occurs rather than the number of times.

In the mid-19th century, for example, the Vienna General Hospital – the Allgemeines Krankenhaus of the University of Vienna – had two obstetrical divisions [231]. Every year, approximately 3500 babies were delivered in each division. There were two major differences between them, however. In the first division, all deliveries were supervised by obstetricians and medical students; in the second, they were overseen by midwives. Furthermore, the proportion of women who died of puerperal fever – an infection developing during childbirth – was between 0.17 and 0.23 in the first division, and about 0.017 in the second division. Ignaz Semmelweis, the assistant to the professor of obstetrics, was convinced that this tenfold difference in the proportion of deaths was not due to chance alone. His research led him to conclude that the discrepancy existed because, in addition to delivering babies, the obstetricians and students dissected several cadavers per day. Since the germ theory of disease had not yet been proposed, proper hygiene was not practiced; individuals went freely from dissections to deliveries without taking sanitary precautions. Believing that this practice was the root of the problem, Semmelweis changed the procedure. He insisted that obstetricians wash their hands in a chlorine solution before being allowed to attend a delivery. In the subsequent year, the proportions of women who died were 0.012 in the first division and 0.013 in the second.

Unfortunately, Semmelweis was ahead of his time. His conclusions were not generally accepted. In fact, his discovery caused him to lose his position. Would such a discrepancy in proportions be ignored today, or would we accept that the two divisions are in fact different? To address this issue, we investigate the variability in proportions.
14.1 Normal Approximation to the Binomial Distribution
The binomial distribution provides a foundation for the analysis of proportions. Recall that if we have a sequence of n independent Bernoulli trials – each of which results in one of two mutually exclusive and exhaustive outcomes, "success" or "failure" – where each trial has a constant probability of success p, then the total number of successes X is a binomial random variable. The probability distribution of X, represented by the formula

$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n-x},$$

can be used to make statements about the possible outcomes of the random variable. In particular, we can use this expression to compute the probabilities associated with specified outcomes x.

Suppose we select a random sample of 60 individuals from the population of adults in the United States. As we saw in Chapter 7, the probability that a member of this population currently smokes cigarettes, cigars, or a pipe is 0.14 [159]; therefore, the total number of smokers in the sample is a binomial random variable with parameters n = 60 and p = 0.14. For a given sample of size 60, what is the probability that 10 or fewer of its members smoke? Using the additive rule of probability,

$$P(X \le 10) = P(X = 0) + P(X = 1) + \cdots + P(X = 10) = 0.788,$$

where the individual terms are calculated as

$$P(X = 0) = \binom{60}{0}(0.14)^0(1 - 0.14)^{60},$$

$$P(X = 1) = \binom{60}{1}(0.14)^1(1 - 0.14)^{59},$$

$$P(X = 2) = \binom{60}{2}(0.14)^2(1 - 0.14)^{58},$$
and so on. We could also calculate the probabilities associated with the outcomes of a binomial random variable X using an approximate procedure based on the normal distribution. This can be more convenient when working with large samples, if we do not have a computer to perform calculations for us. In Chapter 7 we saw that as the sample size n gets larger, the shape of a binomial distribution increasingly resembles that of a normal distribution. Furthermore, the mean number of successes per sample is np, and the variance is np(1 − p). Therefore, if n is sufficiently large, we can approximate the distribution of a binomial random variable X by a normal distribution with the same mean and variance. This is illustrated in Figure 14.1. A widely used criterion states that n is "sufficiently large" if both np and n(1 − p) are greater than 5. (Some people believe that this condition is not conservative enough, and prefer that both np and n(1 − p) be greater than 10.) In this case,

$$Z = \frac{X - np}{\sqrt{np(1 - p)}}$$

is approximately normal with mean 0 and standard deviation 1.
FIGURE 14.1 Normal approximation to a binomial distribution with n = 60 and p = 0.14

We would like to find the proportion of samples of size 60 in which at most 10 individuals are current smokers using the normal approximation to the binomial distribution. Since np = 60(0.14) = 8.4 and n(1 − p) = 60(0.86) = 51.6 are both greater than 5, we note that

$$P(X \le 10) = P\left(\frac{X - np}{\sqrt{np(1-p)}} \le \frac{10 - np}{\sqrt{np(1-p)}}\right) = P\left(Z \le \frac{10 - (60)(0.14)}{\sqrt{60(0.14)(0.86)}}\right) = P(Z \le 0.60).$$
The area under the standard normal curve that lies to the left of z = 0.60 is 0.726; this is the probability that at most 10 of the individuals in the sample smoke. Note that in this instance – even though np is greater than 5 – the normal approximation provides only a rough estimate (92%) of the exact binomial probability 0.788. Looking at Figure 14.1, we can see why this would be true; the discrete binomial distribution is being approximated by a continuous normal distribution.

It has been shown that a better approximation to the binomial distribution can be obtained by adding 0.5 to x if we are interested in the probability that X is less than x, and subtracting 0.5 if we are calculating the probability that X is greater than x. If we wish to find P(X ≤ 10), for instance, we would replace this quantity by P(X ≤ 10 + 0.5) = P(X ≤ 10.5); similarly, to compute P(X ≥ 10), we would replace it by P(X ≥ 10 − 0.5) = P(X ≥ 9.5). The 0.5 appearing in these expressions is known as a continuity correction. Return to the problem of determining the proportion of samples of size 60 in which at most 10 individuals are current smokers. Applying the continuity correction, we find that

$$P(X \le 10.5) = P\left(Z \le \frac{10.5 - (60)(0.14)}{\sqrt{60(0.14)(0.86)}}\right) = P(Z \le 0.78).$$
The area under the standard normal curve to the left of z = 0.78 is 0.782. In this example, the use of a continuity correction results in a better approximation of the exact binomial probability 0.788.
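All of these quantities are simple to verify with software; a brief R sketch:

    n <- 60
    p <- 0.14
    pbinom(10, size = n, prob = p)                   # exact binomial: 0.788

    z  <- (10   - n * p) / sqrt(n * p * (1 - p))     # 0.60, uncorrected
    zc <- (10.5 - n * p) / sqrt(n * p * (1 - p))     # 0.78, continuity corrected
    pnorm(z)                                         # 0.726
    pnorm(zc)                                        # 0.782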
14.2 Sampling Distribution of a Proportion
As we have noted, we are usually interested in estimating the proportion of times that a particular event occurs in a given population rather than the number of times it happens. For example, we might wish to make a statement about the proportion of patients who survive five years after being diagnosed with non-small cell lung cancer based on a random sample drawn from this population. If the sample is of size n and x of its members are alive five years after diagnosis, we can estimate the population proportion p by

$$\hat{p} = \frac{x}{n}.$$

The sample proportion p̂ (p-hat) is the maximum likelihood estimator of p, the value of the parameter that is most likely to have produced the observed sample data. Before we use p̂ to make inference about p, however, we should first investigate some of its properties.

In the population of patients who have been diagnosed with non-small cell lung cancer, we could represent five-year survival for an individual patient by 1 and death within five years by 0. The mean of these values is equal to the proportion of 1s in the population, or p. The standard deviation is √(p(1 − p)). Suppose that we randomly select a sample of size n from the population and denote the proportion of 1s in the sample by p̂_1. We could then select a second sample of size n and denote the proportion of 1s in this new sample by p̂_2. If we were to continue this procedure indefinitely, we would end up with a set of values consisting entirely of sample proportions. Treating each proportion in the series as a unique observation, their collective probability distribution is a sampling distribution of proportions for samples of size n. According to the central limit theorem, the distribution of sample proportions has the following properties:

1. The mean of the sampling distribution is the population proportion p.
2. The standard deviation of the distribution of sample proportions is √(p(1 − p)/n). As in the case of the mean, this quantity is called the standard error.
3. The shape of the sampling distribution is approximately normal, provided that n is sufficiently large.

Because the distribution of sample proportions is approximately normal with mean p and standard deviation √(p(1 − p)/n), we know that

$$Z = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$$

is a standard normal random variable with mean 0 and standard deviation 1. As a result, we can use tables of the standard normal distribution to make inference about the value of a population proportion.

Going back to our original example, consider five-year survival among patients diagnosed with non-small cell lung cancer. The mean proportion of individuals surviving five years is p = 0.24, and the standard deviation is √(p(1 − p)) = √(0.24(1 − 0.24)) = 0.43 [232]. If we select a random sample of 100 patients from this population, what is the probability that at least 30% of them survive five years? Before we apply the central limit theorem, we first verify that np and n(1 − p) are both greater than 5. Since np = 100(0.24) = 24 and n(1 − p) = 100(0.76) = 76, we assume that the
distribution of sample proportions is approximately normal with mean p = 0.24 and standard error √(p(1 − p)/n) = √(0.24(1 − 0.24)/100) = 0.043. Therefore,

$$P(\hat{p} \ge 0.30) = P\left(\frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \ge \frac{0.30 - p}{\sqrt{p(1-p)/n}}\right) = P\left(Z \ge \frac{0.30 - 0.24}{0.043}\right) = P(Z \ge 1.40).$$
Consulting Table A.3 in the Statistical Tables, we see that the area under the standard normal curve to the right of z = 1.40 is 0.081. Therefore, based on this approximation, the probability that at least 30% of a sample of 100 individuals diagnosed with non-small cell lung cancer will survive for five years is 0.081. Using the binomial distribution, the exact probability would be calculated as 0.101. (A continuity correction could have been applied in this example – just as it was used in the previous section – to obtain an estimate closer to the exact probability. When n is reasonably large, the effect of the correction should be negligible.)

Summary: Sampling Distribution of p̂

Term                              Notation
Population proportion             p
Sample proportion                 p̂ = x/n
Sampling distribution             p̂ is N(p, √(p(1 − p)/n)) for large n
Mean of p̂                         p
Standard error of p̂               √(p(1 − p)/n)
Standard normal random variable   Z = (p̂ − p)/√(p(1 − p)/n) is N(0, 1)
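As a quick check of this example, the approximate and exact calculations can both be done in a few lines of R:

    p  <- 0.24
    n  <- 100
    se <- sqrt(p * (1 - p) / n)                          # 0.043
    pnorm(0.30, mean = p, sd = se, lower.tail = FALSE)   # approximately 0.08

    # Exact version: P(p-hat >= 0.30) = P(X >= 30) for X ~ Binomial(100, 0.24)
    pbinom(29, size = n, prob = p, lower.tail = FALSE)   # approximately 0.101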
14.3 Confidence Intervals
Consider five-year survival for the population of individuals under the age of 40 who have been diagnosed with non-small cell lung cancer, where the true proportion surviving is represented by p. We wish to estimate this five-year survival probability. In a randomly selected sample of 2666 patients, only 682 survive five years [233]. Therefore,

$$\hat{p} = \frac{x}{n} = \frac{682}{2666} = 0.256$$

is a point estimate for p. We know that with a second random sample of size 2666, we would likely end up with a different point estimate. Therefore, in order to quantify the precision of the estimate, we also need to construct a confidence interval for p.
To calculate a confidence interval for a population proportion, we begin with the results of the preceding section. The central limit theorem tells us that, as long as the sample size n is large enough, the sample proportion p̂ follows a normal distribution with mean p and standard error √(p(1 − p)/n). Therefore,

$$Z = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$$

is a normal random variable with mean 0 and standard deviation 1. We know that for the standard normal distribution, 95% of the possible outcomes lie between −1.96 and 1.96, which can be written as

$$P\left(-1.96 \le \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}} \le 1.96\right) = 0.95.$$

A 95% confidence interval for p would then be the range of values for which this probability statement is true. One approach to finding a confidence interval would be to first rearrange terms so that p is in the center of the inequality, to obtain

$$P\left(\hat{p} - 1.96\sqrt{\frac{p(1 - p)}{n}} \le p \le \hat{p} + 1.96\sqrt{\frac{p(1 - p)}{n}}\right) = 0.95.$$

A 95% confidence interval would therefore take the form

$$\left(\hat{p} - 1.96\sqrt{\frac{p(1 - p)}{n}}, \; \hat{p} + 1.96\sqrt{\frac{p(1 - p)}{n}}\right).$$

The problem is that the standard error in this expression is a function of p, the value of which is not known. Estimating p by the sample proportion p̂, an approximate 95% confidence interval can be written as

$$\left(\hat{p} - 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \; \hat{p} + 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right).$$

The advantage of this interval – known as the Wald confidence interval – is that it is straightforward to calculate by hand. It has been shown, however, that it is inaccurate across a wide range of values for n and p, meaning we cannot be 95% confident that the interval contains the population proportion p [234]. This can be true even when the sample size n is very large. Furthermore, this formula can result in a lower bound for the confidence interval that is less than 0, or an upper bound greater than 1 – values which are impossible for a proportion. Therefore, we do not recommend its use.

Returning to the probability statement

$$P\left(-1.96 \le \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}} \le 1.96\right) = 0.95,$$

we note that the solution of this equation is complicated by the square root term in the denominator. If we square all three terms of the inequality to make everything positive and solve the resulting quadratic equation for p, we obtain the Wilson confidence interval [235], which has a 95% chance of containing the true population proportion p before a sample is selected. Because the calculations are fairly complicated we do not present them here (but see Review Exercise 3), and instead rely on a statistical software package to generate the confidence interval for us. This will be demonstrated in the Further Applications section of this chapter.
For the lung cancer data, since np̂ = 2666(0.256) = 682 and n(1 − p̂) = 2666(0.744) = 1984, the sample size is large enough to justify use of the normal approximation to the binomial distribution. The 95% Wilson confidence interval for p is (0.239, 0.273). While 0.256 is our best guess for the value of the population parameter, the interval provides a range of reasonable values for p, all of which are compatible with the sample data. We are 95% confident that the interval contains the true proportion of individuals under the age of 40 who survive five years after a diagnosis of non-small cell lung cancer. Equivalently, if we use the Wilson method to construct many different confidence intervals for p based on the data obtained from repeated random samples of size 2666, we expect that approximately 95% of these intervals will contain the true value of p, and approximately 5% will not.

Another approach for constructing a confidence interval for p uses the binomial distribution itself, rather than a normal approximation. This method is particularly appropriate for small samples where the use of the normal approximation cannot be justified, and provides an exact rather than approximate interval. Because the computations involved are onerous to perform by hand, we do not present them here [236]; once again we will rely on statistical software, and revisit this method in the Further Applications section. We note here that the term "exact confidence interval" is something of a misnomer. Although it uses the exact distribution of the statistic – the binomial distribution – the method cannot always produce an interval with level of confidence exactly equal to 1 − α; rather, the level of confidence is at least 1 − α, and can sometimes be considerably higher.
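In R, for instance, the Wilson interval is produced by prop.test (with its default continuity correction turned off) and the exact interval by binom.test; a brief sketch for the lung cancer data:

    # Wilson score interval
    prop.test(682, 2666, correct = FALSE)$conf.int    # roughly (0.239, 0.273)

    # Exact (Clopper-Pearson) interval based on the binomial distribution
    binom.test(682, 2666)$conf.int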
14.4 Hypothesis Testing
As noted in the previous section, the distribution of five-year survival for individuals under the age of 40 who have been diagnosed with non-small cell lung cancer has an unknown population proportion p. However, suppose we do know that the proportion of patients surviving five years among those who are over 40 years of age at the time of diagnosis is 20.2% [233]. Is it plausible that the proportion surviving in the under-40 population is 0.202 as well? To determine whether this is the case, we conduct a statistical test of hypothesis.

We begin by making a claim about the value of the population proportion p. If we wish to test whether the proportion of lung cancer patients surviving at least five years after diagnosis is the same among persons under 40 years of age as it is among those over the age of 40, the null hypothesis for the one-sample test is

H_0: p = p_0 = 0.202.

For a two-sided test conducted at the 0.05 level of significance, the alternative hypothesis would be

H_A: p ≠ 0.202.

We next draw a random sample of dichotomous observations from the underlying population, and use this information to find the probability of obtaining a sample proportion as extreme as or more extreme than the observed p̂, given that the null hypothesis is true and p = p_0. In order to use the normal approximation to the binomial distribution, we would need to verify that np_0 and n(1 − p_0) are both greater than 5. We could then use the test statistic

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}.$$
If the null hypothesis is true, this ratio is normally distributed with mean 0 and standard deviation 1. Depending on its magnitude and the resulting p-value, we either reject or do not reject H0 .
For the sample of 2666 persons under the age of 40 who have been diagnosed with non-small cell lung cancer, p̂ = 0.256. Therefore, the test statistic is

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{0.256 - 0.202}{\sqrt{0.202(1 - 0.202)/2666}} = 6.94.$$
Looking at Table A.3, the p-value of the test – the probability of obtaining an outcome from a standard normal random variable that is either greater than 6.94 or less than −6.94 – must be < 0.001. We come to this conclusion because the largest value in the table is 3.49, which has a corresponding p-value < 0.001 (remember that a p-value can never truly be 0, even though the probability is listed as 0.000 in the table). Since 6.94 is larger than 3.49, it must have an even smaller p-value. We could use statistical software such as Stata or R to calculate a more precise p-value if we wanted one. Regardless, since the p-value is smaller than the level of significance 0.05, we reject the null hypothesis. The data from the sample provide evidence that the proportion of non-small cell lung cancer patients surviving five years among individuals under the age of 40 differs from 0.202, the proportion surviving among those over the age of 40. Since the 95% confidence interval for p does not contain the value 0.202, the confidence interval would have led us to the same conclusion.

Just as we were able to construct a confidence interval for a population proportion using the binomial distribution itself rather than the normal approximation, an exact test of hypothesis is possible as well [236]. We will demonstrate this more computationally intensive method in the Further Applications section of this chapter.

Summary: One-Sample, Two-Sided Hypothesis Test for a Proportion

Null hypothesis                  H_0: p = p_0 or H_0: p − p_0 = 0
Alternative hypothesis           H_A: p ≠ p_0 or H_A: p − p_0 ≠ 0
Test statistic                   z = (p̂ − p_0)/√(p_0(1 − p_0)/n)
                                 where
                                 n = sample size
Distribution of test statistic   Standard normal
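A short R sketch of this test, using the normal approximation and, in the final comment, the exact binomial version:

    phat <- 682 / 2666
    p0   <- 0.202
    n    <- 2666
    z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)   # 6.94
    2 * pnorm(-abs(z))                           # two-sided p-value, far below 0.001

    # Exact version of the test:
    # binom.test(682, 2666, p = 0.202)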
14.5 Sample Size Estimation for One Proportion
When designing a study, investigators often wish to determine the sample size necessary to provide a specified power for the hypothesis test they plan to conduct. Recall that the power of a test is the probability of rejecting the null hypothesis when it is false. When dealing with proportions,
power calculations are a little more complex than they were for tests based on means; however, the reasoning is quite similar. Suppose we plan to test the null hypothesis

H_0: p ≤ 0.202

against the alternative

H_A: p > 0.202

at the α = 0.01 level of significance. The parameter p is the proportion of non-small cell lung cancer patients under the age of 40 at the time of diagnosis who survive at least five years. Although we previously conducted a two-sided test, we are now concerned only with values of p that are greater than 0.202. If the true population proportion is as large as 0.250, we want to risk only a 10% chance of failing to reject the null hypothesis. Therefore, β is equal to 0.10, and the power of the test is 0.90. How large a sample would be required?

Since α = 0.01 and we are conducting a one-sided test, we begin by noting that H_0 would be rejected if the test statistic is greater than or equal to 2.32, the 99th percentile of the standard normal distribution. Therefore, we set

$$z = 2.32 = \frac{\hat{p} - 0.202}{\sqrt{0.202(1 - 0.202)/n}}.$$

Solving for p̂,

$$\hat{p} = 0.202 + 2.32\sqrt{\frac{0.202(1 - 0.202)}{n}}.$$

We would reject the null hypothesis if the sample proportion p̂ is greater than this value. We now focus on the desired power of the test. If the true proportion of patients surviving for five years is 0.250, we want to reject the null hypothesis with probability 1 − β = 0.90. The value of z that corresponds to β = 0.10 is z = −1.28; therefore,

$$z = -1.28 = \frac{\hat{p} - 0.250}{\sqrt{0.250(1 - 0.250)/n}}.$$

Again solving for p̂,

$$\hat{p} = 0.250 - 1.28\sqrt{\frac{0.250(1 - 0.250)}{n}}.$$

Setting the two expressions for p̂ equal to each other and solving for the sample size n,

$$n = \left[\frac{2.32\sqrt{0.202(1 - 0.202)} + 1.28\sqrt{0.250(1 - 0.250)}}{0.250 - 0.202}\right]^2 = 958.1.$$

Rounding up, a sample of size 959 would be required to have 90% power. In general, if the probability of making a type I error is α and the probability of making a type II error is β, the sample size for a one-sided, one-sample test is

$$n = \left[\frac{z_{1-\alpha}\sqrt{p_0(1 - p_0)} + z_{1-\beta}\sqrt{p_1(1 - p_1)}}{p_1 - p_0}\right]^2.$$

Note that p_0 is the value of the population proportion when the null hypothesis is true, and p_1 is its value if the alternative hypothesis is true. The magnitudes of these proportions – along with the values of α and β – determine the necessary sample size n. If we are interested in conducting a two-sided test, we must make an adjustment to the preceding formula. In this case, the null hypothesis would be rejected for z ≥ z_{1−α/2} and also for z ≤ −z_{1−α/2}. As a result, the required sample size would be

$$n = \left[\frac{z_{1-\alpha/2}\sqrt{p_0(1 - p_0)} + z_{1-\beta}\sqrt{p_1(1 - p_1)}}{p_1 - p_0}\right]^2.$$

Summary: Sample Size Estimation for One-Sample Test on a Proportion

Significance level   α
Power                1 − β

Test                 One-sided
Sample size          n = [(z_{1−α}√(p_0(1−p_0)) + z_{1−β}√(p_1(1−p_1)))/(p_1 − p_0)]²

Test                 Two-sided
Sample size          n = [(z_{1−α/2}√(p_0(1−p_0)) + z_{1−β}√(p_1(1−p_1)))/(p_1 − p_0)]²

Always round n up to the nearest integer.
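The general formula translates directly into a small function. A sketch in R, with the caveat that using unrounded normal quantiles gives a slightly larger n than the hand calculation above:

    # Sample size for a one-sided test of H0: p <= p0 against p1 > p0
    n_one_prop <- function(p0, p1, alpha, power) {
      za <- qnorm(1 - alpha)     # 2.33 when alpha = 0.01
      zb <- qnorm(power)         # 1.28 when power = 0.90
      n  <- ((za * sqrt(p0 * (1 - p0)) + zb * sqrt(p1 * (1 - p1))) /
             (p1 - p0))^2
      ceiling(n)                 # always round up
    }

    n_one_prop(0.202, 0.250, alpha = 0.01, power = 0.90)
    # returns 963; the text's 959 uses z values rounded to 2.32 and 1.28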
14.6 Comparison of Two Proportions
Just as we did for the mean, we can again generalize the procedure of hypothesis testing to accommodate the comparison of two proportions. Most often we are interested in testing the null hypothesis that the proportions from two independent populations are identical, or

H_0: p_1 = p_2,

against the alternative

H_A: p_1 ≠ p_2.

To conduct the test, we draw a random sample of size n_1 from the population with mean p_1. If there are x_1 successes in the sample, then

$$\hat{p}_1 = \frac{x_1}{n_1}.$$
Similarly, we select a sample of size n_2 from the population with mean p_2, and

$$\hat{p}_2 = \frac{x_2}{n_2}.$$
In order to determine whether the observed difference in sample proportions p̂_1 − p̂_2 is too large to be attributed to chance alone, we calculate the probability of obtaining a pair of sample proportions as discrepant as or more discrepant than those observed, given that the null hypothesis is true. If this probability is sufficiently small, we reject H_0 and conclude that the two population proportions are different. As always, we must specify a level of significance α before conducting the test.

If the null hypothesis is true and the population proportions p_1 and p_2 are in fact equal, the data from the two samples can be combined to estimate this common parameter; in particular,

$$\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{x_1 + x_2}{n_1 + n_2}.$$
The quantity p̂ is a weighted average of the two sample proportions p̂_1 and p̂_2. Under the null hypothesis, the estimator of the standard error of the difference p̂_1 − p̂_2 takes the form √(p̂(1 − p̂)[(1/n_1) + (1/n_2)]). Thus, the appropriate test statistic is

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\hat{p}(1 - \hat{p})[(1/n_1) + (1/n_2)]}}.$$
If n_1 and n_2 are sufficiently large, this statistic has a normal distribution with mean 0 and standard deviation 1. A commonly used criterion is that each of the quantities n_1 p̂, n_1(1 − p̂), n_2 p̂, and n_2(1 − p̂) be greater than 5. If these conditions are satisfied, we compare the value of the test statistic to the critical values in Table A.3 to find the p-value of the test. Based on the magnitude of the p-value, we either reject or do not reject the null hypothesis.

In a study investigating morbidity and mortality among pediatric victims of motor vehicle accidents, two random samples were selected, one from the population of children who were wearing a seat belt at the time of the accident, and the other from the population who were not [237]. We would like to test the null hypothesis that the proportions of children who die as a result of the motor vehicle accident are identical in the two populations. To do this, we conduct a two-sided test at the 0.05 level of significance. In a sample of 123 children who were wearing a seat belt at the time of the accident, 3 died. Therefore,

$$\hat{p}_1 = \frac{x_1}{n_1} = \frac{3}{123} = 0.024.$$

In a sample of 290 children who were not wearing a seat belt, 13 died, and

$$\hat{p}_2 = \frac{x_2}{n_2} = \frac{13}{290} = 0.045.$$
Is this discrepancy in sample proportions, a difference of 0.045 − 0.024 = 0.021, or 2.1%, too large to be attributed to chance or sampling variability alone?

Just as when comparing two means, an investigator will sometimes begin an analysis which compares two proportions by first constructing separate confidence intervals for p_1 and p_2, as in Figure 14.2. In general, if the two intervals do not overlap, this suggests that the population proportions are different. As with means, however, this is not a formal test of hypothesis. In our example, there is considerable overlap between the confidence intervals.
FIGURE 14.2
95% Wilson confidence intervals for the proportions of children dying as the result of motor vehicle accidents for those wearing versus not wearing a seat belt at the time of the accident

If the population proportions p1 and p2 are in fact equal, then their common value p is estimated by

    p̂ = (x1 + x2)/(n1 + n2) = (3 + 13)/(123 + 290) = 0.039.
Substituting the values of p̂1, p̂2, and p̂ into the expression for the test statistic, we find that

    z = [(p̂1 − p̂2) − (p1 − p2)] / √[p̂(1 − p̂)((1/n1) + (1/n2))]
      = [(0.024 − 0.045) − 0] / √[(0.039)(1 − 0.039)((1/123) + (1/290))]
      = −1.01.
According to Table A.3, the p-value of the two-sided test is 2(0.156) = 0.312. Therefore, we cannot reject the null hypothesis. The samples collected in this study do not provide evidence that the proportions of children dying differ between those who were wearing seat belts and those who were not. Note that the method presented here is merely one technique for comparing proportions from two independent populations. We will consider alternative methods in Chapter 15.
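The same test can be carried out in R. The sketch below uses only base R; its value of z differs slightly from the hand calculation above because the text rounds p̂1 and p̂2 to three decimal places before substituting them into the formula.

    # Two-sample z test comparing death proportions, belted vs. unbelted
    x <- c(3, 13); n <- c(123, 290)
    p_hat  <- x / n                         # 0.0244 and 0.0448
    p_pool <- sum(x) / sum(n)               # pooled estimate under H0
    z <- (p_hat[1] - p_hat[2]) /
         sqrt(p_pool * (1 - p_pool) * (1/n[1] + 1/n[2]))
    z                                       # about -0.98
    2 * pnorm(-abs(z))                      # two-sided p-value, about 0.33
    # prop.test(x, n, correct = FALSE) yields the same two-sided p-value,
    # reported as a chi-square statistic equal to z^2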
Summary: Two-Sample, Two-Sided Hypothesis Test for Two Proportions, Independent Samples

    Null hypothesis           H0 : p1 = p2, or H0 : p1 − p2 = 0
    Alternative hypothesis    HA : p1 ≠ p2, or HA : p1 − p2 ≠ 0
    Test statistic            z = [(p̂1 − p̂2) − (p1 − p2)] / √[p̂(1 − p̂)((1/n1) + (1/n2))]
                              where n1, n2 = sizes of samples 1 and 2, and
                              p̂ = (n1 p̂1 + n2 p̂2)/(n1 + n2)
    Distribution of test statistic    Standard normal

14.7 Sample Size Estimation for Two Proportions
In the study comparing the proportions of children dying as a result of motor vehicle accidents for those wearing versus not wearing seat belts, suppose the investigators had believed that a 5% difference in mortality would be clinically important. What sample size would they have needed to have 90% power to detect a difference of this magnitude? If the planned analysis is a test of the null hypothesis H0 : p1 = p2 versus the alternative hypothesis HA : p1 ≠ p2 conducted at the α level of significance, the sample size necessary to achieve power of 1 − β would be

    n = [z_{1−α/2} √(2p̄(1 − p̄)) + z_{1−β} √(p1(1 − p1) + p2(1 − p2))]² / (p1 − p2)²

where p̄ = (p1 + p2)/2. This equation tells us the number of subjects needed for each of the two samples, so the total sample size required is 2n. For the study of seat belt use among children involved in motor vehicle accidents, if the investigators were interested in detecting a 5% difference in mortality we might specify this as p1 = 0.02 versus p2 = 0.07.
We want to have 90% power to detect a difference of this magnitude using a two-sided test conducted at the 0.05 level of significance. Therefore,

    p̄ = (0.02 + 0.07)/2 = 0.045

and

    n = [1.96 √(2(0.045)(1 − 0.045)) + 1.28 √(0.02(1 − 0.02) + 0.07(1 − 0.07))]² / (0.02 − 0.07)²
= 358.8. The investigators would have needed 359 subjects in each of the two groups – 718 subjects in total – to ensure 90% power to detect this 5% difference in mortality.
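The calculation can be reproduced with the base R function power.prop.test, whose output also matches the packaged result shown later in Table 14.11 for a different design. A minimal sketch:

    # Sample size per group: p1 = 0.02, p2 = 0.07, two-sided alpha = 0.05,
    # 90% power
    power.prop.test(p1 = 0.02, p2 = 0.07, sig.level = 0.05, power = 0.90)
    # reports n of about 359.2 per group; the text's 358.8 comes from
    # rounding the normal quantiles to 1.96 and 1.28. Round up and double
    # the result for the total sample size.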
Summary: Sample Size Estimation for Two-Sample Test on Proportions

    Test                  Two-sided
    Significance level    α
    Power                 1 − β
    Sample size           n = [z_{1−α/2} √(2p̄(1 − p̄)) + z_{1−β} √(p1(1 − p1) + p2(1 − p2))]² / (p1 − p2)²
Always round n up to the nearest integer.
14.8 Further Applications
Suppose we are interested in investigating the cognitive abilities of extremely low birth weight children, those who weighed less than 1500 grams at birth. Many of these children exhibit normal growth patterns during the first year of life; a small group, however, does not. These children suffer from a condition called perinatal growth failure, which prevents them from developing properly. One indicator of perinatal growth failure is that during the first several months of life, the infant has a head circumference measurement that is far below normal. We would like to examine the relationship between perinatal growth failure and subsequent cognitive ability once the child reaches school age [238]. In particular, we want to estimate the proportion of children suffering from this condition who, when they reach age 8 years, have full scale intelligence quotient (IQ) scores that are below 70. In the general population, IQ scores are scaled to have mean 100 and standard deviation 15; a score less than 70 – which is 2 standard deviations below the mean – suggests a deficit in functioning.
TABLE 14.1
Stata output for the confidence interval for a proportion, Wilson and binomial exact methods

                                        ------ Wilson ------
      iq |  Obs   Proportion   Std. Err.   [95% Conf. Interval]
    -----+------------------------------------------------------
         |   33    .2424242     .0746009    .1283171    .4102462

                                     -- Binomial Exact --
      iq |  Obs   Proportion   Std. Err.   [95% Conf. Interval]
    -----+------------------------------------------------------
         |   33    .2424242     .0746009    .1109233    .4225893
TABLE 14.2
R output for the confidence interval for a proportion, Wilson and binomial exact methods

    Wilson
          est    lwr.ci    upr.ci
    0.2424242 0.1283171 0.4102462

    Exact
          est    lwr.ci    upr.ci
    0.2424242 0.1109233 0.4225893
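Both intervals in Table 14.2 can be reproduced with base R functions, using the 8 of 33 children described below; prop.test() with correct = FALSE inverts the score test and therefore returns the Wilson interval, while binom.test() returns the exact (Clopper-Pearson) interval.

    # 95% confidence intervals for p based on 8 of 33 children
    prop.test(8, 33, correct = FALSE)$conf.int   # Wilson: 0.128 to 0.410
    binom.test(8, 33)$conf.int                   # exact:  0.111 to 0.423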
To estimate the proportion of children with IQs in this range, a random sample of 33 infants with perinatal growth failure is chosen. At the age of 8 years, 8 children have scores below 70. Therefore,

    p̂ = x/n = 8/33 = 0.242
is a point estimate for p. In addition to this point estimate, we also wish to construct a 95% confidence interval for p which incorporates the sampling variability present in our estimate. As shown in Tables 14.1 and 14.2, the 95% Wilson confidence interval is (0.128, 0.410), and the binomial exact interval is (0.111, 0.423). While the exact interval is based on probabilities calculated using the binomial distribution, the Wilson interval applies the normal approximation. Each of these intervals can be assumed to provide a range of reasonable values for p, all of which are compatible with the sample data. Before reporting the Wilson confidence interval, we should verify that the normal approximation to the binomial distribution is valid. Since np̂ = 33(0.242) = 8 and n(1 − p̂) = 33(0.758) = 25 are both greater than 5, we assume that it is. Note, however, that the Wilson and exact confidence intervals are not as close as we might expect if the sample size were larger.

Although we do not know the true value of p for the population of children with perinatal growth failure, we do know that, in the general population of children who exhibited normal growth in the perinatal period, 2.3% have full scale IQ scores below 70 when they reach school age. (Note that an IQ score of 70 has a z-score equal to −2.00.) We would like to know whether this is also true for the children who suffered from perinatal growth failure. Since we are concerned with deviations that could occur in either direction, we conduct a two-sided test of the null hypothesis

    H0 : p = 0.023

against the alternative

    HA : p ≠ 0.023,

at the 0.05 level of significance.
TABLE 14.3
Stata output for a one-sample hypothesis test for a proportion, binomial exact method

         N   Observed k   Expected k   Assumed p   Observed p
    ------------------------------------------------------------
        33            8         .759     0.02300      0.24242

    Pr(k >= 8)                = 0.000001   (one-sided test)
    Pr(k = 8 or more extreme) = 0.000001   (two-sided test)
Based on the random sample of 33 infants with perinatal growth failure, p̂ = 0.242. If the true population proportion is 0.023, what is the probability of selecting a sample with an observed proportion as high as 0.242, or even more extreme than this? To answer this question, we could calculate the test statistic

    z = (p̂ − p0) / √(p0(1 − p0)/n)
      = (0.242 − 0.023) / √(0.023(1 − 0.023)/33)
      = 8.39.
Referring to Table A.3, the p-value of the test is less than 0.001. We reject the null hypothesis at the 0.05 level of significance and conclude that, among children who experienced perinatal growth failure, the proportion having full scale IQ scores below 70 is not equal to 0.023, and is in fact larger. The binomial exact test for the null hypothesis H0 : p = 0.023 is shown in Tables 14.3 and 14.4. The p-value is calculated by using the binomial distribution to determine the probability of observing 8 children with IQ scores below 70 out of 33 sampled, or some number even more extreme than this. Here we are able to see that the p-value is much less than 0.001. Suppose that, before performing their study, the investigators had decided that if the proportion of infants with perinatal growth failure who had a full scale IQ score below 70 at 8 years of age was as high as 10%, they would want 90% power to detect a difference of this magnitude. They still planned to test the null hypothesis H0 : p = 0.023 against the alternative
    HA : p ≠ 0.023
at the α = 0.05 level of significance. What sample size would they have needed? In order to perform the calculation, we note that p0 = 0.023 is the value of the population proportion if the null hypothesis is true, while p1 = 0.10 is the value of interest if the null hypothesis is false.
TABLE 14.4
R output for a one-sample hypothesis test for a proportion, binomial exact method

        Exact binomial test

    data:  8 and 33
    number of successes = 8, number of trials = 33, p-value = 6.498e-07
    alternative hypothesis: true probability of success is not equal to 0.023
    95 percent confidence interval:
     0.1109233 0.4225893
    sample estimates:
    probability of success
                 0.2424242
TABLE 14.5
Stata output for a sample size calculation for one proportion

    Estimated sample size for a one-sample proportion test
    Score z test
    Ho: p = p0  versus  Ha: p != p0

    Study parameters:

            alpha =   0.0500
            power =   0.9000
            delta =   0.0770
               p0 =   0.0230
               pa =   0.1000

    Estimated sample size:

                N =       78
Since the test is being conducted at the α = 0.05 level of significance and the investigators want 90% power, z_{1−α/2} = 1.96 and z_{1−β} = 1.28. Therefore, we calculate

    n = [z_{1−α/2} √(p0(1 − p0)) + z_{1−β} √(p1(1 − p1))]² / (p1 − p0)²
      = [1.96 √(0.023(1 − 0.023)) + 1.28 √(0.10(1 − 0.10))]² / (0.10 − 0.023)²
= 77.5. Rounding up, a sample of size 78 would be required to achieve 90% power. Note that these calculations are typically performed using a statistical package, as shown in Tables 14.5 and 14.6.
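As a check on the packaged routines shown in Tables 14.5 and 14.6, the formula can also be evaluated directly in base R (a minimal sketch):

    # Sample size for H0: p = 0.023 versus p1 = 0.10, two-sided
    # alpha = 0.05, power = 0.90
    p0 <- 0.023; p1 <- 0.10
    n <- (qnorm(0.975) * sqrt(p0 * (1 - p0)) +
          qnorm(0.90)  * sqrt(p1 * (1 - p1)))^2 / (p1 - p0)^2
    ceiling(n)   # 78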
TABLE 14.6
R output for a sample size calculation for one proportion

    propTestN(.1, .023, alpha = 0.05, power = 0.9, sample.type = "one.sample")
    78
TABLE 14.7
R output for a sample size calculation for one proportion

    sample.size.prop(.05, P = 0.2, level = 0.95)

    sample.size.prop object: Sample size for proportion estimate
    Without finite population correction: N=Inf, precision e=0.05
    and expected proportion P=0.2

    Sample size needed: 246
As we saw when discussing sample size calculations for means, sometimes the goal of a study is to estimate a population parameter with a certain level of precision, not to conduct a test of hypothesis. Here, we might want to determine the sample size necessary to construct a 95% confidence interval for p with a specified length. When deriving a confidence interval for a proportion in Section 14.3, we saw that a 95% interval takes the form

    (p̂ − 1.96 √(p(1 − p)/n) , p̂ + 1.96 √(p(1 − p)/n)).

This interval is centered around the point estimate p̂, and has length

    L = 2(1.96) √(p(1 − p)/n).

Because the length of the confidence interval depends on the population proportion p, we must postulate the value of p in order to calculate sample size. Once we have specified both the length L and the postulated p, we can rearrange this formula to solve for n, or use a statistical package to do the calculation for us. The value of n which results is the minimum number of subjects needed to ensure that the 95% confidence interval has length L. Note that this calculation is based on the normal approximation to the binomial distribution, and may not be accurate for very low or very high values of p.

Suppose we want to estimate the proportion of infants with perinatal growth failure who have a full scale IQ score below 70 at age 8 years. We assume that the population proportion is approximately 0.20, and wish to estimate the true proportion to within ±0.05. In this case, the length of the 95% confidence interval would be 2(0.05) = 0.10. The R output in Table 14.7 tells us that we would need 246 children to ensure a 95% confidence interval of this length.
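Solving the length formula for n gives n = (1.96)² p(1 − p)/e², where e = L/2 is the desired margin of error; a quick base R check of the 246 reported in Table 14.7:

    # Sample size so a 95% CI for p has margin of error e = 0.05,
    # postulating p = 0.20
    p <- 0.20; e <- 0.05
    ceiling(qnorm(0.975)^2 * p * (1 - p) / e^2)   # 246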
In a study evaluating the exacerbation of asthma symptoms by medications used to treat pain and fever, children 1 to 5 years of age with mild persistent asthma were randomized to receive either acetaminophen or ibuprofen as needed over a 48-week period [239]. We are interested in determining whether the proportion of children who experience at least one episode of asthma exacerbation requiring treatment with glucocorticoids over the course of the study period is the same for each of the two drugs. Consequently, we will test the null hypothesis

    H0 : p1 = p2

against the two-sided alternative

    HA : p1 ≠ p2.
In a random sample of n1 = 150 children receiving acetaminophen, 74 experienced at least one episode of asthma exacerbation; in a sample of n2 = 150 children treated with ibuprofen, 70 experienced at least one episode. Therefore,

    p̂1 = x1/n1 = 74/150 = 0.493

and

    p̂2 = x2/n2 = 70/150 = 0.467.
If the two proportions p1 and p2 are identical, an estimate of their common value p is

    p̂ = (x1 + x2)/(n1 + n2) = (74 + 70)/(150 + 150) = 0.480.
Since n1 p̂, n1(1 − p̂), n2 p̂, and n2(1 − p̂) are all greater than 5, the test statistic is

    z = [(p̂1 − p̂2) − (p1 − p2)] / √[p̂(1 − p̂)((1/n1) + (1/n2))]
      = [(0.493 − 0.467) − 0] / √[(0.480)(1 − 0.480)((1/150) + (1/150))]
      = 0.64.

Looking at Table A.3, the probability that the outcome of a standard normal random variable is either greater than 0.64 or less than −0.64 is 0.522; this is the p-value of the test. Therefore, we fail to reject the null hypothesis that p1 = p2 at the 0.05 level of significance. We do not have evidence of a difference in the proportions of children experiencing asthma exacerbation for those treated with acetaminophen versus those receiving ibuprofen.

The strength of the association between type of drug and occurrence of asthma exacerbation could also be quantified by means of the risk difference, p1 − p2. The risk difference measures the difference in risk between the two groups in absolute terms. If the proportion of subjects with the outcome is identical in the two groups, the risk difference will be equal to 0. A point estimate for the risk difference in episodes of asthma exacerbation is

    p̂1 − p̂2 = 0.493 − 0.467 = 0.026.
This tells us that over the 48-week follow-up period, the proportion of children treated with acetaminophen who experienced at least one episode of asthma exacerbation was 2.6 percentage points higher than the proportion of children treated with ibuprofen. In order to calculate a 95% confidence interval for the risk difference, an estimate of its standard error is required. In practice more than one formula for the standard error of a risk difference exists, and each statistical package may employ a different one. Some packages also generate an exact confidence interval. When sample sizes are large, the resulting confidence intervals will be very similar. In Table 14.8, Stata displays the 95% confidence interval of the risk difference as (−0.086, 0.140), and Table 14.9 shows a similar result from R, (−0.093, 0.146). Since these intervals contain the value 0, the data do not exclude the possibility that the proportions of children experiencing asthma exacerbation are identical in the two groups. As expected, this is the same conclusion reached with the hypothesis test.
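One common choice for the standard error is the Wald formula √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2); the sketch below uses it and reproduces the Stata interval in Table 14.8 (the R interval in Table 14.9 is slightly wider because it applies a continuity correction).

    # 95% Wald confidence interval for the risk difference p1 - p2
    p1 <- 74/150; p2 <- 70/150
    rd <- p1 - p2
    se <- sqrt(p1 * (1 - p1)/150 + p2 * (1 - p2)/150)
    rd + c(-1, 1) * qnorm(0.975) * se   # about -0.086 to 0.140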
TABLE 14.8
Stata output containing confidence intervals for the difference and ratio of proportions

                     |   Exposed   Unexposed |     Total
    -----------------+-----------------------+----------
               Cases |        74          70 |       144
            Noncases |        76          80 |       156
    -----------------+-----------------------+----------
               Total |       150         150 |       300
                     |                       |
                Risk |  .4933333    .4666667 |       .48
                     |                       |
                     |     Point estimate    |  [95% Conf. Interval]
                     |-----------------------+----------------------
     Risk difference |        .0266667       |  -.0863611   .1396944
          Risk ratio |        1.057143       |   .8351338    1.33817
     Attr. frac. ex. |        .0540541       |  -.1974129   .2527108
     Attr. frac. pop |        .0277778       |
                     +----------------------------------------------
                        chi2(1) = 0.21    Pr>chi2 = 0.6439
TABLE 14.9
R output containing confidence intervals for the difference and ratio of proportions

    Epidemiological 2x2 Table Analysis

    Input Matrix:
                 Disease Present (Cases)  Disease Absent (Controls)
    Risk Present                      74                          76
    Risk Absent                       70                          80

    Pearson Chi-Squared Statistic (Includes Yates' Continuity Correction): 0.12
    Associated p.value for H0: There is no association between exposure and
    outcome vs. HA: There is an association : 0.729
    p.value using Fisher's Exact Test (1 DF) : 0.729

    Estimate of Odds Ratio: 1.113
    95% Confidence Limits for true Odds Ratio are: [0.707, 1.751]

    Estimate of Relative Risk (Cohort, Col1): 1.057
    95% Confidence Limits for true Relative Risk are: [0.835, 1.338]

    Estimate of Risk Difference (p1 - p2) in Cohort Studies: 0.027
    95% Confidence Limits for Risk Difference: [-0.093, 0.146]

    Note: Above Confidence Intervals employ a continuity correction.
TABLE 14.10
Stata output displaying a sample size calculation for comparing two proportions

    Estimated sample sizes for a two-sample proportions test
    Pearson's chi-squared test
    Ho: p2 = p1  versus  Ha: p2 != p1

    Study parameters:

            alpha =   0.0500
            power =   0.8000
            delta =  -0.1000  (difference)
               p1 =   0.4900
               p2 =   0.3900

    Estimated sample sizes:

                N =      772
      N per group =      386
The magnitude of the association between type of drug and occurrence of asthma exacerbation can also be quantified using the risk ratio, p1/p2. The risk ratio measures the difference in risk between the two groups in relative rather than absolute terms. If the proportion of subjects with the outcome is identical in the two groups, the risk ratio takes the value 1. A point estimate for the risk ratio of population proportions of children experiencing at least one episode of asthma exacerbation is

    p̂1/p̂2 = 0.493/0.467 = 1.06.
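The confidence interval for the risk ratio reported in Tables 14.8 and 14.9 is computed in the log scale; using the standard error √((1 − p̂1)/x1 + (1 − p̂2)/x2), one common formula, the calculation is:

    # 95% confidence interval for the risk ratio, via the log scale
    x1 <- 74; x2 <- 70; n1 <- n2 <- 150
    p1 <- x1/n1; p2 <- x2/n2
    se_log <- sqrt((1 - p1)/x1 + (1 - p2)/x2)
    exp(log(p1/p2) + c(-1, 1) * qnorm(0.975) * se_log)   # 0.835 to 1.338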
Over the 48-week follow-up period, the proportion of children with asthma exacerbation among those treated with acetaminophen was 6% higher than the proportion in the group treated with ibuprofen. Stata and R both report a 95% confidence interval for the risk ratio as (0.84, 1.34). Since this interval contains the value 1, once again the data do not exclude the possibility that the proportions of children experiencing asthma exacerbation are identical in the two groups.

Suppose that after finding no evidence of a difference between acetaminophen and ibuprofen, the investigators decide to begin a new study comparing acetaminophen to a third drug. Based on the information from the original study, they assume that 49% of children treated with acetaminophen will experience at least one episode of asthma exacerbation, and they postulate that the new drug will reduce this percentage to 39%. They would like to have 80% power to detect this 10% difference if it exists, and plan to use a two-sided test of proportions conducted at the 0.05 level of significance. What sample size would they need? Tables 14.10 and 14.11 contain the relevant output from Stata and R, respectively, using all of the assumptions above. The investigators would need 386 patients per group – or 772 total – to have 80% power to detect this 10% difference.
TABLE 14.11
R output displaying a sample size calculation for comparing two proportions

         Two-sample comparison of proportions power calculation

                  n = 385.6118
                 p1 = 0.49
                 p2 = 0.39
          sig.level = 0.05
              power = 0.8
        alternative = two.sided

    NOTE: n is number in *each* group
14.9 Review Exercises
1. When might it be appropriate to use the normal approximation to the binomial distribution? How is it used?

2. Explain the difference between the risk difference and the risk ratio as measures of the association between a binary exposure variable and a binary outcome.

3. One way to construct a 95% confidence interval for a population proportion p is to begin with the probability statement

       P(−1.96 ≤ (p̂ − p)/√(p(1 − p)/n) ≤ 1.96) = 0.95,

   and solve for p.
   (a) Rewrite the probability statement by squaring all three terms inside the inequality above.
   (b) Note that (−1.96)² = (1.96)², and therefore the expression inside the inequality statement can be expressed as a quadratic equation for the variable p. Write out this quadratic equation.
   (c) Solve the quadratic equation for p. Based on this solution, write the expression for the Wilson confidence interval for p.

4. In the United States, the probability that a 40-year-old female will develop coronary heart disease during her lifetime is 0.32 [240]. Suppose that you select a random sample of size 40 from this population.
   (a) For the sample of size 40, what is the probability that 5 or fewer of the females will develop coronary heart disease during their lifetime? Compute the exact binomial probability.
   (b) Using the normal approximation to the binomial distribution, estimate the probability that 5 or fewer females will develop coronary heart disease.
   (c) Do these two methods provide consistent results?
   (d) Are the results consistent if the sample size is 20 rather than 40?
   (e) Are the results consistent if the sample size is only 10?

5. Incidence is defined as the proportion of individuals at risk for some condition who go on to develop that condition during a specified period of time. In the United States, the incidence of homelessness over a one-year period for military veterans being treated at a specialty mental health clinic is 0.056 [241].
   (a) If you were to select repeated samples of size 100 from this population, what could you say about the distribution of sample proportions estimating incidence? List three properties.
   (b) Among the samples of size 100, what fraction has a sample proportion of 0.06 or lower?
   (c) What fraction has a sample proportion of 0.05 or higher?
   (d) What value is the 90th percentile of the distribution of sample proportions?
6. In a random sample of 746 individuals being treated in Veterans Affairs primary care clinics, 86 were determined to have post-traumatic stress disorder (ptsd) by diagnostic interview [242].
   (a) What is a point estimate for p, the proportion of individuals with ptsd among the population being treated in Veterans Affairs primary care clinics?
   (b) Construct and interpret a 95% confidence interval for the population proportion.
   (c) Construct a 99% confidence interval for p. Is this interval longer or shorter than the 95% confidence interval? Explain.
   (d) Suppose that a prior study had reported the prevalence of ptsd among patients seen in primary care clinics in the general population to be 7%. You would like to know whether the proportion of individuals being treated in Veterans Affairs primary care clinics who have ptsd is the same. What are the null and alternative hypotheses of the appropriate test?
   (e) Conduct the test at the 0.01 level of significance, using the normal approximation to the binomial distribution.
   (f) What is the p-value? Interpret this p-value in words.
   (g) Do you reject or fail to reject the null hypothesis? What do you conclude?
   (h) Now conduct the test using the exact binomial method of hypothesis testing. Do you reach the same conclusion?

7. In a French study investigating the effectiveness of the drug mifepristone (RU 486) for terminating early pregnancy, 488 women were administered mifepristone followed 48 hours later by a single dose of a second drug, misoprostol. In 473 of these women, the pregnancy was terminated and the conceptus completely expelled [243].
   (a) Estimate the proportion of successfully terminated early pregnancies among women using the described treatment regimen.
   (b) Construct a 95% confidence interval for the true population proportion p.
   (c) Interpret this confidence interval.
   (d) Construct a 90% confidence interval for p. How does the 90% confidence interval compare to the 95% interval?

8. You plan to conduct a study to estimate the prevalence of electronic cigarette use in the past 30 days by high school students in grade 10. You postulate that prevalence will be around 20%.
   (a) If you wish to estimate prevalence to within ±5% using a 95% confidence interval – meaning that the length of the confidence interval will be 10% – what sample size would you need?
   (b) If you want to estimate prevalence to within ±2%, what sample size would you need?

9. You plan to conduct a study to compare the prevalence of electronic cigarette use in the past 30 days by high school students in grade 10 to the prevalence in grade 12. Based on a previous study, you assume that the prevalence of e-cigarette use among students in grade 12 is 25%. You intend to conduct a two-sided, one-sample test of proportions at the 0.05 level of significance.
   (a) If the prevalence of e-cigarette use among 10th graders is postulated to be 20%, what sample size would be required to have 80% power?
   (b) If the prevalence of e-cigarette use among 10th graders is postulated to be 15%, what sample size would be required to have 80% power?
   (c) If the prevalence of e-cigarette use among 10th graders is postulated to be 15% but you plan to perform your test at the 0.01 level of significance, what sample size would be required to have 80% power?

10. You plan to conduct a study to compare the prevalence of electronic cigarette use in the past 30 days by high school students in grade 10 to the prevalence in grade 12. You do not have any information about prevalence in either group. You intend to conduct a two-sided, two-sample test of proportions at the 0.05 level of significance, and wish to have 90% power for your test.
   (a) If the prevalence of e-cigarette use is postulated to be 20% for 10th graders and 25% for 12th graders, what sample size would be required? Assume that you want to enroll equal numbers of 10th and 12th graders.
   (b) If the prevalence of e-cigarette use is postulated to be 15% for 10th graders and 25% for 12th graders, what sample size would be required?
   (c) If the prevalence of e-cigarette use is postulated to be 15% for 10th graders and 25% for 12th graders, and you are satisfied with 80% power rather than 90%, what sample size would be required?

11. Suppose you are interested in determining whether the advice given by a physician during a routine physical examination is effective in encouraging patients to stop smoking. In a study of current smokers, one group of patients was given a brief talk about the hazards of smoking and was encouraged to quit [244]. A second group received no advice pertaining to smoking. All patients were given a follow-up examination. In the sample of 114 patients who had received the advice, 11 reported that they had quit smoking. In the sample of 96 patients who had not, 7 had quit smoking.
   (a) Estimate and interpret the true difference in population proportions p1 − p2.
   (b) Use a statistical package to construct a 95% confidence interval for this difference.
   (c) At the 0.05 level of significance, test the null hypothesis that the proportions of patients who quit smoking are identical for those who received advice and those who did not.
   (d) Do you believe that the advice given by physicians is effective? Why or why not?

12. A study was conducted to investigate the use of community-based treatment programs among Medicaid beneficiaries suffering from severe mental illness [245]. The study involved assigning a sample of 311 patients to a prepaid medical plan and a sample of 310 patients to the traditional Medicaid program. After a specified period of time, the number of persons in each group who had visited a community crisis center in the previous three months was determined. Among the individuals assigned to the prepaid plan, 13 had visited a crisis center; among those receiving traditional Medicaid, 22 had visited a center.
   (a) For each group, estimate the proportion of patients who had visited a community crisis center in the previous three months.
   (b) Estimate the risk ratio for visiting a community crisis center for individuals receiving traditional Medicaid versus a prepaid plan.
   (c) Construct a 95% confidence interval for the risk ratio.
   (d) At the 0.05 level of significance, test the null hypothesis that the proportions are identical in the two populations.
   (e) What do you conclude?
13. Suppose you are interested in investigating factors that affect the prevalence of tuberculosis among intravenous drug users. In a group of 97 individuals who admit to sharing needles, 24.7% had positive tuberculin skin test results; among 161 drug users who deny sharing needles, 17.4% had positive test results [246].
   (a) Assuming that the population proportions of positive skin test results are in fact equal, estimate their common value p.
   (b) Test the null hypothesis that the proportions of intravenous drug users who have positive tuberculin skin test results are identical for those who share needles and those who do not.
   (c) What is the probability distribution of the test statistic?
   (d) What is the p-value? What do you conclude?
   (e) Construct a 95% confidence interval for the true difference in proportions.

14. Patients with craniopharyngioma – a rare, noncancerous type of brain tumor – are at increased risk of cardiovascular mortality [247]. A study was conducted to estimate the prevalence of metabolic syndrome (MetS), a known risk factor for cardiovascular disease, in a combined Dutch-Swedish population of craniopharyngioma patients. To be diagnosed with MetS, an individual needed to exhibit abnormal levels on at least three of the following five criteria: body mass index, fasting glucose, triglycerides, HDL cholesterol, and blood pressure. Indicators of metabolic syndrome for a sample of patients are contained in the dataset metabolic.
   (a) In this population, what is a point estimate for the proportion of individuals with metabolic syndrome?
   (b) Is the sample size sufficiently large to apply the normal approximation to the binomial distribution? If so, construct and interpret a 95% Wilson confidence interval for the population proportion.
   (c) Calculate a 95% exact confidence interval for p.
   (d) How do the two confidence intervals compare?
   (e) In the Dutch population, the prevalence of MetS among individuals with craniopharyngioma has been reported to be 0.29. You are interested in evaluating whether the prevalence is the same in this combined Dutch-Swedish population. What are the null and alternative hypotheses of the appropriate test?
   (f) Conduct the exact test at the 0.05 level of significance. What is the p-value?
   (g) What do you conclude?

15. A randomized study compared death or severe neurodevelopmental impairment at age 2 years for preterm infants receiving either the drug erythropoietin (a neuroprotective agent) or placebo within 24 hours of birth [248]. Data are contained in the dataset erythro. Treatment group is saved under the variable name treatment, and outcome at age 2 years under the name outcome.
   (a) Calculate point estimates for the outcome in each of the two treatment groups separately.
   (b) Estimate and interpret the risk ratio, comparing patients who got erythropoietin to patients who got placebo.
   (c) Construct a 95% confidence interval for the risk ratio. What does this confidence interval tell you?
   (d) At the 0.05 level of significance, test the null hypothesis that the proportions of preterm infants who either die or experience neurodevelopmental impairment at age 2 years are the same in the two treatment groups. What is the p-value?
   (e) What do you conclude?

16. Intimate partner violence (ipv) toward a woman either before or during her pregnancy has been documented as a risk factor for the health of both the mother and her unborn child. A study conducted in the postnatal wards of a public hospital in Bangladesh examined the relationship between experience of ipv by a woman and the birth weight of the infant [80]. Data are contained in the dataset ipv.
   (a) The investigators defined low birth weight as < 2.5 kilograms and normal birth weight as ≥ 2.5 kilograms; this information is saved under the variable name low_bwt. Estimate the proportion of low birth weight infants in the population of newborns of mothers giving birth at a public hospital in Bangladesh.
   (b) Construct a 95% confidence interval for the proportion of low birth weight infants.
   (c) A binary variable indicating whether a woman experienced physical intimate partner violence during her pregnancy is saved as ipv_p. Estimate the proportions of low birth weight infants for women who experienced physical intimate partner violence during pregnancy, and for those who did not.
   (d) At the 0.05 level of significance, test the null hypothesis that the proportions of low birth weight infants are identical for those whose mothers experienced physical intimate partner violence and those whose mothers did not. What is the value of the test statistic? What is its probability distribution?
   (e) What is the p-value of the test?
   (f) What do you conclude?
   (g) Estimate and interpret the risk difference of giving birth to a low birth weight child for women who experienced physical intimate partner violence versus those who did not. Construct a 95% confidence interval for the risk difference.

17. The data set lowbwt contains information for a sample of 100 low birth weight infants born in two teaching hospitals in Boston, Massachusetts [81]. Indicators of a maternal diagnosis of preeclampsia during the pregnancy – a condition characterized by high blood pressure and other potentially serious complications – are saved under the variable name preeclampsia. The value 1 represents a diagnosis of preeclampsia and 0 means no such diagnosis.
   (a) Estimate the proportion of low birth weight infants whose mothers experienced preeclampsia during pregnancy.
   (b) Construct a 95% confidence interval for the true population proportion p.
   (c) Interpret the 95% confidence interval.
   (d) What assumptions did you make when constructing the confidence interval?
15 Contingency Tables
CONTENTS
15.1  Chi-Square Test
      15.1.1  2 × 2 Tables
      15.1.2  r × c Tables
15.2  McNemar's Test
15.3  Odds Ratio
15.4  Berkson's Fallacy
15.5  Further Applications
15.6  Review Exercises
In the preceding chapter, we use the normal approximation to the binomial distribution to conduct a test of hypothesis comparing two independent proportions. However, we could have achieved the same result using an alternative technique. When working with categorical random variables, we often arrange the counts in a tabular format known as a contingency table. The rows of the table represent the outcomes of one variable, and the columns represent the outcomes of the other. The entries in the table are the counts that correspond to a particular combination of categories.
15.1 Chi-Square Test
Suppose we are interested in hypothesis testing procedures which allow us to evaluate the association between two nominal variables arranged as a contingency table. For example, we might wish to evaluate whether there is a relationship between diagnosis of hypertension, categorized as yes or no (2 categories), and a person's marital status, categorized as never married, married, divorced, or widowed (4 categories). This would result in a table with 2 rows and 4 columns. Before we discuss this, however, we begin with the simpler case where both variables of interest are dichotomous.
15.1.1 2 × 2 Tables

Consider the 2 × 2 table below which displays the results of a study which investigated the effectiveness of bicycle safety helmets in preventing head injury [249]. The data consist of a sample of 793 individuals who were involved in bicycle accidents during a specified one-year period. Of the 793 individuals who were involved in bicycle accidents, 147 were wearing safety helmets at the time of the incident and 646 were not. Among those wearing helmets, 17 suffered head injuries requiring the attention of a doctor, whereas the remaining 130 did not. Among the individuals not wearing safety helmets, 218 sustained serious head injuries, and 428 did not. The entries in the contingency table – 17, 130, 218, and 428 – are the observed counts within each combination of categories.
                        Wearing Helmet
    Head Injury       Yes      No     Total
       Yes             17     218       235
       No             130     428       558
       Total          147     646       793
To examine the effectiveness of bicycle safety helmets, we wish to know whether there is an association between the incidence of head injury and the use of helmets among individuals who have been involved in accidents. To determine this, we test the null hypothesis

    H0 : The proportion of persons sustaining head injuries among the population of individuals wearing a safety helmet at the time of the accident is equal to the proportion of persons sustaining head injuries among those not wearing a helmet

against the alternative

    HA : The proportions of persons sustaining head injuries are not equal in the two populations.

We conduct the test at the α = 0.05 level of significance. The first step in carrying out the test is to calculate the expected count for each cell of the contingency table, given that H0 is true. Under the null hypothesis, the proportions of individuals experiencing head injuries among those wearing helmets and those not wearing helmets are identical; therefore, we can ignore the two separate categories and treat all 793 individuals as a single homogeneous sample. In this sample, the overall proportion of persons sustaining head injuries is

    235/793 = 29.6%,

and the proportion not experiencing head injuries is

    558/793 = 70.4%.
As a result, of the 147 individuals wearing safety helmets at the time of the accident, we would expect that 29.6%, or

    147(0.296) = 147(235/793) = 43.6,

suffer head injuries, and 70.4%, or

    147(0.704) = 147(558/793) = 103.4,
do not. Similarly, among the 646 bicyclists not wearing safety helmets, we would expect that 29.6%, or

    646(0.296) = 646(235/793) = 191.4,

sustain head injuries, whereas 70.4%, or

    646(0.704) = 646(558/793) = 454.6,

do not. In general, the expected count for a given cell in the table is equal to the row total multiplied by the column total divided by the table total. To minimize roundoff error, we usually compute expected counts to a fraction of a person. Thus, for the four categories in the original table, the expected counts are:

                        Wearing Helmet
    Head Injury       Yes       No      Total
       Yes            43.6    191.4     235.0
       No            103.4    454.6     558.0
       Total         147.0    646.0     793.0
Note that the row totals and column totals are the same as those in the table of observed counts. In general, if a 2 × 2 table of observed frequencies for a sample of size n can be represented as follows,

                         Variable 2
    Variable 1        Yes      No     Total
       Yes             a        b      a+b
       No              c        d      c+d
       Total          a+c      b+d      n
then the corresponding table of expected counts is:

                             Variable 2
    Variable 1           Yes              No           Total
       Yes          (a+b)(a+c)/n     (a+b)(b+d)/n       a+b
       No           (c+d)(a+c)/n     (c+d)(b+d)/n       c+d
       Total            a+c              b+d              n
As noted above, the row and column totals in the table of expected counts are identical to those in the observed table. These marginal totals have been held fixed by design; we calculate the cell entries that would have been expected given that there is no association between the row and column classifications and that the number of individuals within each group does not change.

The chi-square test compares the observed frequencies in each category of the contingency table (represented by O) with the expected frequencies given that the null hypothesis is true (denoted by E). It is used to determine whether the deviations between the observed and the expected counts, O − E, are too large to be attributed to chance or sampling variability alone. Since there is more than one cell in the table, these deviations must be combined in some way. To perform the test for the counts in a contingency table with r rows and c columns, we calculate the sum

    X² = Σ_{i=1}^{rc} (Oi − Ei)²/Ei,

where rc is the number of cells in the table. The probability distribution of this sum is approximated by a chi-square (χ²) distribution with (r − 1)(c − 1) degrees of freedom. A 2 × 2 table has (2 − 1)(2 − 1) = 1 degree of freedom; a 3 × 4 table has (3 − 1)(4 − 1) = 6 degrees of freedom.
To ensure that the sample size is large enough to make this approximation valid, no cell in the table should have an expected count less than 1, and no more than 20% of the cells should have an expected count less than 5 [250].

Like the F distribution, the chi-square distribution is not symmetric. A chi-square random variable cannot be negative. It assumes values from zero to infinity, and is skewed to the right. As is true for all probability distributions, however, the total area beneath the curve is equal to 1. Also, as with the t and F distributions, there is a different chi-square distribution for each possible value of the degrees of freedom. The distributions with small degrees of freedom are highly skewed. As the number of degrees of freedom increases, the distributions become less skewed and more symmetric. This is illustrated in Figure 15.1.

FIGURE 15.1
Chi-square distributions with 1, 3, and 6 degrees of freedom

Table A.6 in Statistical Tables is a condensed table of areas for the chi-square distribution with various degrees of freedom. For a particular value of df, the entry in the table is the outcome of χ²_df that cuts off the specified area in the upper tail of the distribution. Given a chi-square distribution with 1 degree of freedom, for instance, χ²₁ = 3.84 cuts off the upper 5% of the area under the curve; it is therefore the 95th percentile of this distribution.

Since all the expected counts for the data pertaining to bicycle safety helmets are greater than 5, we can proceed with the chi-square test. The next step is to calculate the sum

    X² = Σ_{i=1}^{4} (Oi − Ei)²/Ei.
For a 2 × 2 contingency table, this test statistic has an approximate chi-square distribution with (2 − 1)(2 − 1) = 1 degree of freedom. Note that we are using discrete observations to estimate χ2 , a continuous probability distribution. The approximation is quite good for tables with many degrees of freedom, but may not be valid for 2 × 2 tables with df = 1. Therefore, we sometimes apply a continuity correction in this situation. When we are working with a 2 × 2 table, the corrected test
statistic is

    X² = Σ_{i=1}^{4} (|Oi − Ei| − 0.5)²/Ei,
where |Oi − Ei| is the absolute value of the difference between Oi and Ei. The term −0.5 in the numerator is often referred to as the Yates correction. The effect of this term is to decrease the value of the test statistic, and thus increase the corresponding p-value. Although the Yates correction is used extensively, some investigators question its validity, believing that it results in an overly conservative test which may fail to reject a false null hypothesis [251]. As long as n is reasonably large, however, the effect of the correction factor is negligible.

Substituting the observed and expected counts for the bicycle safety helmet data into the continuity corrected equation, we find that

    X² = (|17 − 43.6| − 0.5)²/43.6 + (|130 − 103.4| − 0.5)²/103.4
       + (|218 − 191.4| − 0.5)²/191.4 + (|428 − 454.6| − 0.5)²/454.6
       = 15.62 + 6.59 + 3.56 + 1.50
       = 27.27.

For a χ² distribution with 1 degree of freedom, Table A.6 tells us that the p-value is less than 0.001. Although we are considering only one tail of the chi-square distribution, the test is two-sided; large outcomes of (Oi − Ei)² can result when the observed value is larger than the expected value, and also when it is smaller. Since p < α, we reject the null hypothesis and conclude that the proportions of individuals suffering head injuries are not identical in the two populations. Among persons involved in bicycle accidents, the use of a safety helmet reduces the incidence of head injury.

If we represent a 2 × 2 table in the general format shown below,

                         Variable 2
    Variable 1        Yes      No     Total
       Yes             a        b      a+b
       No              c        d      c+d
       Total          a+c      b+d      n
the test statistic X² can also be expressed as

    X² = n[|ad − bc| − (n/2)]² / [(a + c)(b + d)(a + b)(c + d)].
Because there is no need to compute a table of expected counts, this form of the statistic is more convenient computationally if performing the calculations by hand. However, it does not have the same straightforward interpretation as a measure of the deviations between observed and expected counts.
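In R, chisq.test() carries out this test directly; for a 2 × 2 table the Yates correction is applied by default, so the minimal sketch below reproduces the hand calculation.

    # Chi-square test for the bicycle helmet data
    helmet <- matrix(c(17, 130, 218, 428), nrow = 2,
                     dimnames = list(head_injury = c("yes", "no"),
                                     helmet = c("yes", "no")))
    chisq.test(helmet)           # X-squared = 27.27, df = 1, p < 0.001
    chisq.test(helmet)$expected  # 43.6, 191.4, 103.4, 454.6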
Summary: Hypothesis Test for a 2 × 2 Contingency Table, Independent Samples

    Null hypothesis           H0 : There is no association between the exposure and the outcome
    Alternative hypothesis    HA : There is an association between the exposure and the outcome
    Test                      Chi-square test, 2 × 2 table
    Test statistic            X² = Σ_{i=1}^{4} (Oi − Ei)²/Ei
                              where Oi is the observed frequency in cell i and
                              Ei is the expected frequency in cell i
    Distribution of test statistic    Chi-square with 1 degree of freedom
15.1.2 r × c Tables

In the case of a 2 × 2 table, the chi-square test for independent proportions is equivalent to the hypothesis test which uses the normal approximation to the binomial distribution presented in Section 14.5. The chi-square test is used more often in practice, however. In addition to being relatively easy to compute, it has the advantage that it can be generalized to accommodate the comparison of three or more proportions. In this situation, we arrange the data as an r × c contingency table, where r is the number of rows and c the number of columns.

Consider the following data taken from a study that investigates the accuracy of death certificates [252]. In two hospitals – a community hospital, labeled A, and a university hospital, labeled B – the causes of death listed on the death certificates were compared to autopsy results. The data are displayed in a 2 × 3 contingency table below.
                   Death Certificate Status
    Hospital    Confirmed     No       Recoding   Total
                Accurate      Change
       A           157         18         54        229
       B           268         44         34        346
       Total       425         62         88        575
Of the 575 death certificates considered, 425 were confirmed to be accurate, 62 either lacked information or contained inaccuracies but did not require recoding of the underlying cause of death, and 88 were incorrect and required recoding. We would like to determine whether the results of the study suggest different practices in completing death certificates at the two hospitals. To do this, we test the null hypothesis H0 : Within each category of certificate status, the proportions of death certificates in Hospital A are identical,
against the alternative hypothesis that the proportions are not the same. Another way to formulate the null hypothesis is to say that there is no association between hospital and death certificate status. Here, the alternative hypothesis would be that there is an association. We use the chi-square test and set the significance level at α = 0.05.

The first step in carrying out the test is to calculate the expected count for each cell in the contingency table. Recall that the expected count is equal to the row total multiplied by the column total divided by the table total. For instance, among death certificates confirmed to be accurate, we expect that

    229 × 425/575 = 169.3

would be in Hospital A, whereas

    346 × 425/575 = 255.7

would be in Hospital B. For all six categories in the table, the expected counts are as follows:

                   Death Certificate Status
    Hospital    Confirmed     No       Recoding   Total
                Accurate      Change
       A          169.3       24.7       35.0      229.0
       B          255.7       37.3       53.0      346.0
       Total      425.0       62.0       88.0      575.0
Since all expected counts are greater than 5, we calculate the sum

    X² = Σ_{i=1}^{6} (Oi − Ei)²/Ei.

We are working with a 2 × 3 contingency table, and consequently do not include the Yates continuity correction. Therefore,

    X² = (157 − 169.3)²/169.3 + (18 − 24.7)²/24.7 + (54 − 35.0)²/35.0
       + (268 − 255.7)²/255.7 + (44 − 37.3)²/37.3 + (34 − 53.0)²/53.0
       = 0.89 + 1.82 + 10.31 + 0.59 + 1.20 + 6.81
       = 21.62.
For a χ2 distribution with (2 − 1)(3 − 1) = 2 degrees of freedom, p < 0.001. Since p is less than α, we reject the null hypothesis and conclude that the proportions of death certificates in Hospital A are not identical for the three different categories of certificate status; equivalently, there is an association between hospital and death certificate status. Referring back to the original contingency table, we note that Hospital A has 68.6% of its certificates confirmed to be accurate, 7.9% inaccurate but not needing to be changed, and 23.6% requiring recoding, while Hospital B has 77.5% of certificates confirmed to be accurate, 12.7% inaccurate but not needing to be changed, and 9.8% requiring recoding. It appears that Hospital A, the community hospital, contains a larger proportion of death certificates that are incorrect and require recoding than Hospital B.
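The same chisq.test() call handles an r × c table, where no continuity correction is used; Fisher's exact test, discussed next, is also one function call away. A minimal sketch:

    # Chi-square test for the 2 x 3 death certificate data
    status <- matrix(c(157, 268, 18, 44, 54, 34), nrow = 2,
                     dimnames = list(hospital = c("A", "B"),
                                     certificate = c("accurate", "no change",
                                                     "recoding")))
    chisq.test(status)   # X-squared = 21.62, df = 2, p < 0.001
    fisher.test(status)  # exact analogue, useful for small samples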
As mentioned previously, the chi-square test is based on an approximation that works best when the sample size n is large. However, we can also conduct a test of hypothesis comparing two or more proportions using an alternative technique that allows us to compute the exact probability of the occurrence of the observed frequencies in the contingency table, given that there is no association between the row and column classifications and that the marginal totals remain fixed. This technique, known as Fisher's exact test, is especially useful when the sample size is small. It involves enumerating all possible tables with the same row and column totals and summing the probabilities of the observed table and all "more extreme" tables to determine a p-value. Because the computations involved can be arduous – just like they are for the binomial exact test – the details are not presented in this text [236]. However, a number of statistical software packages perform Fisher's exact test in addition to the chi-square test, as we shall see in the Further Applications section.

Summary: Hypothesis Test for an r × c Contingency Table, Independent Samples
    Null hypothesis           H0 : There is no association between the exposure and the outcome
    Alternative hypothesis    HA : There is an association between the exposure and the outcome
    Test                      Chi-square test, r × c table
    Test statistic            X² = Σ_{i=1}^{rc} (Oi − Ei)²/Ei
                              where Oi is the observed frequency in cell i and
                              Ei is the expected frequency in cell i
    Distribution of test statistic    Chi-square with (r − 1)(c − 1) degrees of freedom

15.2 McNemar's Test
Assuming we are still interested in comparing proportions, what can we do if the data are paired rather than independent? As with continuous data, the distinguishing characteristic of paired samples for counts is that each observation in the first group has a corresponding observation in the second group. If we are working with a 2 × 2 table, we can use McNemar’s test. Consider the following information taken from a study that investigates acute myocardial infarction among members of the Navajo Nation [253]. In the study, 144 victims of acute myocardial infarction were age- and sex-matched with 144 individuals free of heart disease. The members of each pair were then asked whether they had ever been diagnosed with diabetes. The results are presented in the following. Among the 144 individuals who had experienced acute myocardial infarction, 46 were diagnosed with diabetes; among those who were free of heart disease, only 25 suffered from diabetes. We would like to know what these data tell us about the proportions of diabetics in the two groups defined by presence or absence of heart disease.
                            MI
    Diabetes         Yes      No     Total
       Yes            46       25       71
       No             98      119      217
       Total         144      144      288
The preceding table has the same format as the 2 × 2 contingency table used in the comparison of proportions from independent samples; therefore, we might be tempted to apply the chi-square test or Fisher’s exact test. Note, however, that we have a total of 288 observations but only 144 pairs. Each matched pair in the study provides two responses: one for the individual who suffered a myocardial infarction, and one for the individual who did not. Since the chi-square test disregards the paired nature of the data, it is inappropriate in this situation. We must take the pairing into account in our analysis. Suppose that we take another look at the raw data from the study, but this time classify them in the following manner:
                              No MI
    MI               Diabetes   No Diabetes   Total
       Diabetes          9           37          46
       No Diabetes      16           82          98
       Total            25          119         144
Of the 46 members of the Navajo Nation who had experienced acute myocardial infarction and who were diabetic, 9 were matched with individuals who had diabetes and 37 with individuals who did not. Of the 98 victims of myocardial infarction who did not suffer from diabetes, 16 were paired with diabetics and 82 were not. Each entry in the table corresponds to the combination of responses for a matched pair rather than an individual person. We can now conduct a proper paired analysis. However, we must first change the statement of the null hypothesis to reflect the new format of the data. Instead of testing whether the proportion of diabetics among individuals who have experienced acute myocardial infarction is equal to the proportion of diabetics among those who have not, we test H0 : There are equal numbers of pairs in which the victim of acute myocardial infarction is a diabetic and the matched individual free of heart disease is not, and in which the person without heart disease is a diabetic but the individual who has experienced an infarction is not, or, more concisely, H0 : There is no association between diabetes and the occurrence of acute myocardial infarction. The alternative hypothesis is that an association exists. We conduct this test at the α = 0.05 level of significance. The concordant pairs – or the pairs of responses in which either two diabetics or two nondiabetics are matched – provide no information for testing a null hypothesis about differences in diabetic status. Therefore, we discard these data and focus only on the discordant pairs, or the pairs of responses in which a person who has diabetes is paired with an individual who does not.
Let r represent the number of pairs in which the victim of acute myocardial infarction suffers from diabetes and the individual free of heart disease does not, and s the number of pairs in which the person without heart disease is a diabetic but the individual who had an infarction is not. If the null hypothesis is true, r and s should be approximately equal. Therefore, if the difference between r and s is large, we would want to reject the null hypothesis of no association. To conduct McNemar's test, we calculate the statistic

    X² = [|r − s| − 1]² / (r + s).
This ratio has an approximate chi-square distribution with 1 degree of freedom. The term −1 in the numerator is a continuity correction; again, we are using discrete counts to estimate the continuous χ² distribution. For the data investigating the relationship between diabetes and myocardial infarction, r = 37 and s = 16. Therefore,

    X² = [|37 − 16| − 1]² / (37 + 16)
       = [21 − 1]² / 53
       = 7.55.
For a chi-square distribution with 1 degree of freedom, 0.001 < p < 0.01. Since p is less than α, we reject the null hypothesis. For the Navajo population, we conclude that there is a difference between individuals who experience infarction and those who do not; victims of acute myocardial infarction are more likely to suffer from diabetes than the individuals free from heart disease who have been matched on age and sex.
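Base R's mcnemar.test() applies the same continuity correction by default, so it reproduces the statistic computed above:

    # McNemar's test for the paired diabetes data; the off-diagonal
    # counts are the discordant pairs r = 37 and s = 16
    pairs <- matrix(c(9, 16, 37, 82), nrow = 2,
                    dimnames = list(MI = c("diabetes", "no diabetes"),
                                    no_MI = c("diabetes", "no diabetes")))
    mcnemar.test(pairs)   # McNemar's chi-squared = 7.55, df = 1, p = 0.006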
Summary: Hypothesis Test for a 2 × 2 Contingency Table, Paired Samples

Null hypothesis                  H0: There is no association between the exposure and the outcome
Alternative hypothesis           HA: There is an association between the exposure and the outcome
Test                             McNemar's test
Test statistic                   X² = [|r − s| − 1]²/(r + s), where r and s are the numbers of discordant pairs
Distribution of test statistic   Chi-square with 1 degree of freedom

15.3 Odds Ratio
Although the chi-square test allows us to determine whether an association exists between two independent nominal random variables, and McNemar’s test does the same for paired dichotomous
variables, neither test provides a measure of the strength of the association. In Chapter 14 we introduced the risk difference and the risk ratio as two measures for quantifying the magnitude of the effect. For a 2 × 2 table displaying information on two independent dichotomous variables, another such measure is the odds ratio.
In Chapter 5 we state that if an event occurs with probability p, the odds of the event are p/(1 − p) to 1. If an event occurs with probability 1/2, for instance, the odds of the event are (1/2)/(1 − 1/2) = (1/2)/(1/2) = 1 to 1. Conversely, if the odds of an event are a to b, the probability of the event is a/(a + b). As we also saw, if we have two dichotomous random variables representing a disease and an exposure, the odds ratio is defined as the odds of disease among exposed individuals divided by the odds of disease among the unexposed, or

OR = [P(disease | exposed)/(1 − P(disease | exposed))] / [P(disease | unexposed)/(1 − P(disease | unexposed))].

Alternatively, the odds ratio may be defined as the odds of exposure among diseased individuals divided by the odds of exposure among the nondiseased, or

OR = [P(exposure | diseased)/(1 − P(exposure | diseased))] / [P(exposure | nondiseased)/(1 − P(exposure | nondiseased))].

These two expressions for the odds ratio are mathematically equivalent. Suppose that our data – which consist of a sample of n individuals – are again arranged in the form of a 2 × 2 contingency table:
                  Exposed    Unexposed    Total
   Disease           a           b         a+b
   No Disease        c           d         c+d
   Total            a+c         b+d         n
In this case, we would estimate that

P(disease | exposed) = a/(a + c)

and

P(disease | unexposed) = b/(b + d);

therefore,

1 − P(disease | exposed) = 1 − a/(a + c) = c/(a + c)

and

1 − P(disease | unexposed) = 1 − b/(b + d) = d/(b + d).

Using these results, we can express an estimator of the odds ratio, denoted OR̂, as

OR̂ = {[a/(a + c)]/[c/(a + c)]} / {[b/(b + d)]/[d/(b + d)]} = (a/c)/(b/d) = ad/bc.
This estimator is simply the cross-product ratio of the entries in the 2 × 2 table.
Consider the following data, taken from a study that attempts to determine whether the use of electronic fetal monitoring during labor affects the frequency of cæsarean section deliveries [254]. Cæsarean delivery can be thought of as the "disease," and electronic monitoring as the "exposure." Of the 5824 infants included in the study, 2850 were electronically monitored during labor and 2974 were not. The outcomes are as follows:
                            Exposure
   Cæsarean Delivery     Yes        No      Total
   Yes                   358        229       587
   No                   2492       2745      5237
   Total                2850       2974      5824
The odds of being delivered by cæsarean section in the group that was monitored relative to the group that was not monitored are estimated by

OR̂ = (358)(2745)/[(229)(2492)] = 1.72.
These data suggest that the odds of being delivered by cæsarean section for fetuses electronically monitored during labor are 1.72 times the odds of cæsarean delivery for fetuses not monitored. Therefore, there appears to be a moderately strong association between the use of electronic fetal monitoring and the eventual method of delivery. This does not imply, however, that electronic monitoring somehow causes a cæsarean delivery; it is possible that the fetuses at higher risk are the ones that are monitored.
The cross-product ratio is simply a point estimate of the strength of association between two dichotomous random variables. To gauge the uncertainty in this estimate, we must calculate a confidence interval as well. The width of the interval reflects the amount of variability in our estimator OR̂. Recall that the derivation of the expression for a 95% confidence interval for a mean µ,

(x̄ − 1.96 σ/√n , x̄ + 1.96 σ/√n),

relied on the assumption that the underlying population values were normally distributed. When computing a confidence interval for the odds ratio, we must make the same assumption of underlying normality. A problem arises, however, in that the probability distribution of the odds ratio is skewed to the right. Although it cannot take on negative values, the odds ratio can assume any positive value between 0 and infinity. In contrast, the probability distribution of the natural logarithm of the odds ratio is more symmetric and approximately normal. Therefore, when calculating a confidence interval for the odds ratio, we typically work in the log scale. To ensure that the sample size is large enough, the expected value of each entry in the contingency table must be at least 5.
The expression for a 95% confidence interval for the natural logarithm of the odds ratio is

(ln(OR̂) − 1.96 ŝe[ln(OR̂)] , ln(OR̂) + 1.96 ŝe[ln(OR̂)]).

To compute a confidence interval for ln(OR), we must first know the standard error of this quantity. For a 2 × 2 table arranged in the following manner,

                  Exposed    Unexposed    Total
   Disease           a           b         a+b
   No Disease        c           d         c+d
   Total            a+c         b+d         n

the standard error of ln(OR̂) is estimated by

ŝe[ln(OR̂)] = √(1/a + 1/b + 1/c + 1/d).
If any one of the entries in the contingency table is equal to 0, the standard error is undefined. In this case, adding 0.5 to each of the values a, b, c, and d will correct the situation and still provide a reasonable estimate; thus, the modified estimate of the standard error is

ŝe[ln(OR̂)] = √[1/(a + 0.5) + 1/(b + 0.5) + 1/(c + 0.5) + 1/(d + 0.5)].

The appropriate estimate for the standard error can be substituted into the expression for the confidence interval above. To find a 95% confidence interval for the odds ratio itself, we take the antilogarithm of the upper and lower limits of the interval for ln(OR) to get

(e^(ln(OR̂) − 1.96 ŝe[ln(OR̂)]) , e^(ln(OR̂) + 1.96 ŝe[ln(OR̂)])).
For the data examining the relationship between electronic fetal monitoring status during labor and the eventual method of delivery, the log of the estimated odds ratio is

ln(OR̂) = ln(1.72) = 0.542,

and the estimated standard error of ln(OR̂) is

ŝe[ln(OR̂)] = √(1/a + 1/b + 1/c + 1/d)
            = √(1/358 + 1/229 + 1/2492 + 1/2745)
            = 0.089.

Therefore, a 95% confidence interval for the log of the odds ratio is

(0.542 − 1.96(0.089) , 0.542 + 1.96(0.089))

or

(0.368 , 0.716),

and a 95% confidence interval for the odds ratio itself is

(e^0.368 , e^0.716)

or

(1.44 , 2.05).
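The interval above can be reproduced with a few lines of R; this is a sketch only, and the object names are ours:

    # Fetal monitoring data: a = 358, b = 229, c = 2492, d = 2745
    a <- 358; b <- 229; cc <- 2492; d <- 2745

    or_hat <- (a * d) / (b * cc)                  # cross-product ratio, 1.72
    se_log <- sqrt(1/a + 1/b + 1/cc + 1/d)        # estimated se of ln(OR), 0.089
    exp(log(or_hat) + c(-1, 1) * 1.96 * se_log)   # 95% CI, (1.44, 2.05)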
We are 95% confident that the odds of delivery by cæsarean section are between 1.44 and 2.05 times higher for fetuses monitored during labor than for those not monitored. Note that this interval does not contain the value 1. An odds ratio of 1 would imply that fetuses that are monitored and those that are not monitored have identical odds of cæsarean delivery.
An odds ratio can also be calculated to estimate the strength of association between two paired dichotomous random variables. Using the notation introduced in Section 15.2, r and s represent the numbers of each type of discordant pair in the study. For the investigation of the relationship between acute myocardial infarction and diabetes among members of the Navajo Nation, for example, r represents the number of pairs in which the subject who had suffered acute myocardial infarction was diabetic and the matched individual free of heart disease was not, and s the number of pairs in which the person without heart disease was diabetic and the subject who had an infarction was not.
In this case, the odds ratio of suffering from diabetes for individuals who have experienced an acute myocardial infarction versus those who have not is estimated by

OR̂ = r/s = 37/16 = 2.31.
By noting that the estimated standard error of ln(OR̂) for paired dichotomous data is equal to √[(r + s)/(rs)], we can also calculate a confidence interval for the true population odds ratio. A 95% interval for the natural logarithm of the odds ratio is

ln(OR̂) ± 1.96 ŝe[ln(OR̂)].

Since

ln(OR̂) = ln(2.31) = 0.837

and

ŝe[ln(OR̂)] = √[(r + s)/(rs)] = √[(37 + 16)/(37 × 16)] = 0.299,

the 95% confidence interval for ln(OR) would be

(0.837 − 1.96(0.299) , 0.837 + 1.96(0.299))

or

(0.251 , 1.423).

A 95% confidence interval for the odds ratio itself is then

(e^0.251 , e^1.423)

or

(1.29 , 4.15).
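A parallel sketch for the paired case, again with variable names of our own choosing:

    r <- 37; s <- 16                              # discordant pair counts

    or_hat <- r / s                               # 2.31
    se_log <- sqrt((r + s) / (r * s))             # 0.299
    exp(log(or_hat) + c(-1, 1) * 1.96 * se_log)   # 95% CI, (1.29, 4.15)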
Note that this interval does not contain the value 1.

Summary: Odds Ratio

                      Independent variables                    Paired variables
Odds ratio            OR̂ = ad/bc                               OR̂ = r/s
Standard error        ŝe[ln(OR̂)] = √(1/a + 1/b + 1/c + 1/d)    ŝe[ln(OR̂)] = √[(r + s)/(rs)]
CI lower bound        e^(ln(OR̂) − z_(1−α/2) ŝe[ln(OR̂)])        e^(ln(OR̂) − z_(1−α/2) ŝe[ln(OR̂)])
CI upper bound        e^(ln(OR̂) + z_(1−α/2) ŝe[ln(OR̂)])        e^(ln(OR̂) + z_(1−α/2) ŝe[ln(OR̂)])
15.4 Berkson's Fallacy
Although the odds ratio is a useful measure of the strength of association between two dichotomous random variables, it provides a valid estimate of the magnitude of the effect only if the sample of observations on which it is based is random. This point is sometimes forgotten. A restricted sample, such as a sample of patients from a single hospital, is usually much easier to obtain; this restricted sample is then used to make inference about the population as a whole. In one study, the investigators surveyed 2784 individuals – 257 of whom were hospitalized – and determined whether each subject suffered from either a disease of the circulatory system or a respiratory illness, or both [255]. If they had limited their questioning to the 257 hospitalized patients only, the results would have been as follows:

                           Circulatory Disease
   Respiratory Disease      Yes       No      Total
   Yes                        7        29        36
   No                        13       208       221
   Total                     20       237       257
An estimate of the odds ratio of having respiratory illness among individuals who suffer from a disease of the circulatory system versus those who do not is

OR̂ = (7)(208)/[(29)(13)] = 3.86.
The chi-square test of the null hypothesis that there is no association between the two diseases yields

X² = (|7 − 2.8| − 0.5)²/2.8 + (|29 − 33.2| − 0.5)²/33.2 + (|13 − 17.2| − 0.5)²/17.2 + (|208 − 203.8| − 0.5)²/203.8
   = 4.89 + 0.41 + 0.80 + 0.07
   = 6.17.

For a chi-square distribution with 1 degree of freedom, 0.01 < p < 0.025. Therefore, we reject the null hypothesis at the 0.05 level of significance and conclude that individuals who have a disease of the circulatory system are more likely to suffer from respiratory illness than individuals who do not.
Now consider the entire sample of 2784 individuals, which consists of both hospitalized and nonhospitalized subjects.

                           Circulatory Disease
   Respiratory Disease      Yes       No      Total
   Yes                       22       171       193
   No                       202      2389      2591
   Total                    224      2560      2784
For these data, the estimate of the odds ratio is

OR̂ = (22)(2389)/[(171)(202)] = 1.52.
This value is much lower than the estimate of the odds ratio calculated for the hospitalized patients only. In addition, the value of the chi-square test statistic is

X² = (|22 − 15.53| − 0.5)²/15.53 + (|171 − 177.47| − 0.5)²/177.47 + (|202 − 208.47| − 0.5)²/208.47 + (|2389 − 2382.53| − 0.5)²/2382.53
   = 2.29 + 0.20 + 0.17 + 0.01
   = 2.67.

For a chi-square distribution with 1 degree of freedom, Table A.8 tells us that p > 0.10. We can no longer reject the null hypothesis that there is no association between respiratory and circulatory diseases at the 0.05 level of significance.
Why do the conclusions drawn from these two samples differ so drastically? To answer this question, we must consider the rates of hospitalization that occur within each of the four disease subgroups. Among the 22 individuals suffering from both circulatory and respiratory disease, 7 are hospitalized. Therefore, the rate of hospitalization for this subgroup is

7/22 = 31.8%.

The rate of hospitalization among subjects with respiratory illness alone is

13/202 = 6.4%.

Among individuals with circulatory disease only, the rate is

29/171 = 17.0%,

and among persons suffering from neither disease, the rate of hospitalization is

208/2389 = 8.7%.

Thus, individuals with both circulatory and respiratory diseases are much more likely to be hospitalized than individuals in any of the three other subgroups. Also, subjects with circulatory disease are more likely to be hospitalized than those with respiratory illness. If we sample only patients who are hospitalized, therefore, our conclusions about the relationship between these two diseases will be biased: we are more likely to select an individual who is suffering from both illnesses than a person in any of the other subgroups, and more likely to select a person with circulatory disease than one with respiratory problems. As a result, we observe an association that does not actually exist. This kind of spurious relationship among variables – which is evident only because of the way in which the sample was selected – is known as Berkson's fallacy.
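The contrast between the two samples can be seen directly by computing the cross-product ratio for each table. A brief R sketch, with a helper function of our own devising:

    cross_ratio <- function(a, b, cc, d) (a * d) / (b * cc)

    cross_ratio(7, 29, 13, 208)       # hospitalized patients only: 3.86
    cross_ratio(22, 171, 202, 2389)   # entire sample: 1.52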
15.5 Further Applications
When we are presented with independent samples of nominal data that have been grouped into categories, the chi-square test can be used to determine whether the proportions of some event of
interest are identical in the various groups. For example, consider the following data taken from a study investigating an outbreak of gastroenteritis – an inflammation of the membranes of the stomach and small intestine – following a lunch served in the cafeteria of a United States high school. Among a sample of 263 students who bought lunch in the school cafeteria on the day in question, 225 ate prepared sandwiches and 38 did not [256]. The numbers of cases of gastroenteritis in each group are displayed below.

                        Ate Sandwich
   Health Status      Yes        No      Total
   Ill                109         4        113
   Not Ill            116        34        150
   Total              225        38        263
Among the students who ate prepared sandwiches, 48.4% became ill; among those who did not, 10.5% became ill. We would like to test the null hypothesis that there is no association between the consumption of a sandwich and the onset of gastroenteritis, or

H0: The proportion of students becoming ill among those who ate the sandwiches is equal to the proportion becoming ill among those who did not eat the sandwiches,

at the α = 0.05 level of significance. The alternative hypothesis is that an association does exist.
To begin, we must calculate the expected count for each cell of the 2 × 2 table. Under the null hypothesis, the proportions of students developing gastroenteritis are identical in the two groups. Treating all 263 students as a single sample, the overall proportion of students becoming ill is

113/263 = 43.0%,

and the proportion of students not becoming ill is

150/263 = 57.0%.

Therefore, among the 225 students who ate the sandwiches, we would expect that 43%, or

225 (113/263) = 96.7,

become ill and that 57%, or

225 (150/263) = 128.3,

do not. Similarly, among the 38 students who did not eat the sandwiches, we expect that

38 (113/263) = 16.3

become ill and

38 (150/263) = 21.7
TABLE 15.1
Stata output displaying the chi-square test and Fisher's exact test

    Health    |    Ate Sandwich     |
    Status    |     Yes       No    |    Total
  ------------+---------------------+---------
    Ill       |     109        4    |      113
    Not Ill   |     116       34    |      150
  ------------+---------------------+---------
    Total     |     225       38    |      263

    Pearson chi2(1)        = 19.0742    Pr = 0.000
    Fisher's exact         =                 0.000
    1-sided Fisher's exact =                 0.000
do not. Thus, the table of expected counts is as follows:

                        Ate Sandwich
   Health Status      Yes        No      Total
   Ill                96.7      16.3     113.0
   Not Ill           128.3      21.7     150.0
   Total             225.0      38.0     263.0
Since the expected count in each cell of the table is greater than 5, we proceed with the chi-square test by calculating the statistic

X² = Σ_{i=1}^{4} (|O_i − E_i| − 0.5)²/E_i
   = (|109 − 96.7| − 0.5)²/96.7 + (|4 − 16.3| − 0.5)²/16.3 + (|116 − 128.3| − 0.5)²/128.3 + (|34 − 21.7| − 0.5)²/21.7
   = 1.44 + 8.54 + 1.09 + 6.42
   = 17.49.

For a chi-square distribution with 1 degree of freedom, p < 0.001. Since p is less than α, we reject H0 and conclude that the proportions of students developing gastroenteritis are not identical in the two groups. Among students eating lunch at the high school cafeteria on the day in question, the consumption of a prepared sandwich was associated with an increased risk of illness.
Most computer packages are able to perform the chi-square test. The relevant output from Stata is contained in Table 15.1, and the output from R in Table 15.2. In addition to the p-values for the chi-square test, the p-values for Fisher's exact test are shown as well. For both tests, the p-value is the probability of obtaining the observed number of successes, or a value even more extreme than that – meaning further from the condition specified by the null hypothesis – given that the null hypothesis is true.
Now consider the situation in which we have paired samples of dichotomous data. The following information comes from a study that examines changes in smoking status over a two-year period [257].
TABLE 15.2
R output displaying the chi-square test and Fisher's exact test

              Ate sandwich   Did not eat sandwich
    Ill                109                      4
    Not ill            116                     34

    Number of cases in table: 263
    Number of factors: 2
    Test for independence of all factors:
      Chisq = 19.074, df = 1, p-value = 1.257e-05
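Output along the lines of Tables 15.1 and 15.2 can be generated with standard R commands; a minimal sketch, where the matrix name and labels are ours:

    sandwich <- matrix(c(109, 4,
                         116, 34),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(status = c("Ill", "Not ill"),
                                       ate = c("Yes", "No")))

    chisq.test(sandwich)                   # with Yates correction: X-squared = 17.49
    chisq.test(sandwich, correct = FALSE)  # uncorrected: 19.074, as in Tables 15.1 and 15.2
    fisher.test(sandwich)                  # Fisher's exact test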
In 1980, a sample of 2110 adults over the age of 18 were asked to identify themselves as smokers or nonsmokers. Two years later, the same 2110 individuals were again asked whether they were currently smokers or nonsmokers. Of the 717 individuals who smoked in 1980, 620 were still smoking in 1982, and 97 had stopped. Of the 1393 nonsmokers in 1980, 1317 remained nonsmokers in 1982, and 76 had begun to smoke. Each entry in the table corresponds to the paired response of a single individual.

                           Second Survey
   First Survey       Smoker    Nonsmoker    Total
   Smoker                620           97      717
   Nonsmoker              76         1317     1393
   Total                 696         1414     2110
We would like to test the null hypothesis that there is no association between smoking status and year, or, more formally,

H0: Among the individuals who changed their smoking status between 1980 and 1982, equal numbers switched from being smokers to nonsmokers and from being nonsmokers to smokers.

The alternative hypothesis is that there is an association, or that there was a tendency for smoking status to change in one direction versus the other. In order to reach a conclusion, we conduct McNemar's test at the α = 0.05 level of significance. Note that we have r = 97 pairs in which a smoker becomes a nonsmoker and s = 76 pairs in which a nonsmoker becomes a smoker. To evaluate H0, we calculate the test statistic

X² = [|r − s| − 1]²/(r + s) = [|97 − 76| − 1]²/(97 + 76) = 2.31.
For a chi-square distribution with 1 degree of freedom, p > 0.10. Since p is greater than α, we cannot reject the null hypothesis. The sample does not provide evidence that there is an association between smoking status and year. Stata output for McNemar’s test is contained in Table 15.3, and the corresponding R output in Table 15.4. Note that for this particular test, Stata ignores the variable names; the labels “Cases” and “Controls” refer to each of the two paired samples – 1980 and 1982 for our data – and “Exposed” and “Unexposed” classify the status of each subject.
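The R output in Table 15.4 can be reproduced as follows; the data name smoke matches the output shown, while the dimension labels are ours:

    smoke <- matrix(c(620, 97,
                      76, 1317),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(first = c("Smoker", "Nonsmoker"),
                                    second = c("Smoker", "Nonsmoker")))

    mcnemar.test(smoke, correct = FALSE)   # chi-squared = 2.5491
    mcnemar.test(smoke, correct = TRUE)    # with continuity correction: 2.3121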
TABLE 15.3
Stata output displaying McNemar's test

                 |       Controls        |
    Cases        |  Exposed   Unexposed  |   Total
  ---------------+-----------------------+---------
    Exposed      |      620          97  |     717
    Unexposed    |       76        1317  |    1393
  ---------------+-----------------------+---------
    Total        |      696        1414  |    2110

    McNemar's chi2(1) = 2.55    Prob > chi2 = 0.1104
    Exact McNemar significance probability = 0.1281
TABLE 15.4
R output displaying McNemar's test

    McNemar's Chi-squared test
    data: smoke
    McNemar's chi-squared = 2.5491, df = 1, p-value = 0.1104

    Exact test
    McNemar's Chi-squared test with continuity correction
    data: smoke
    McNemar's chi-squared = 2.3121, df = 1, p-value = 0.1284
In both of the previous examples, we determined whether an association exists between two dichotomous random variables, but did not directly measure the strength of that association. For a 2 × 2 contingency table, one way to quantify the association is to use an odds ratio. For the data from the study that examined the outbreak of gastroenteritis in a United States high school, the odds of becoming ill among those who ate prepared sandwiches relative to those who did not are estimated by

OR̂ = (109)(34)/[(4)(116)] = 7.99.

The odds of becoming ill for students who ate prepared sandwiches are 7.99 times the odds for students who did not. There appears to be a strong association between eating the sandwiches and a student's health.
To measure the uncertainty in this point estimate, we can calculate a confidence interval for the true population odds ratio. Note that the logarithm of OR̂ is

ln(OR̂) = ln(7.99) = 2.078,

and that the estimated standard error of ln(OR̂) is

ŝe[ln(OR̂)] = √(1/109 + 1/4 + 1/116 + 1/34) = 0.545.

Therefore, a 95% confidence interval for the log of the odds ratio is

(2.078 − 1.96(0.545) , 2.078 + 1.96(0.545))

or

(1.010 , 3.146),

and a 95% confidence interval for the odds ratio itself is

(e^1.010 , e^3.146)

or

(2.75 , 23.24).
=
97 76
=
1.28.
To calculate a 95% confidence interval for the true population odds ratio, first note that the logarithm L is of OR L ln(OR) = ln(1.28) = 0.244, L is and then that the estimated standard error of ln(OR) s 97 + 76 L sDe[ln(OR)] = 97(76)
=
0.153.
Therefore, a 95% confidence interval for the log of the odds ratio is (0.244 − 1.96(0.153) , 0.244 + 1.96(0.153))
ISTUDY
372 or
Principles of Biostatistics (−0.056 , 0.544),
and a 95% confidence interval for the odds ratio itself is (e−0.056 , e0.544 ) or
(0.95 , 1.72).
This interval does contain the value 1. Recall that using McNemar’s test, we were unable to reject the null hypothesis of no association between year and smoking status.
ISTUDY
373
Contingency Tables
15.6
Review Exercises
1. How does the chi-square test statistic use the observed frequencies in a contingency table to determine whether an association exists between two nominal random variables? 2. Describe the properties of the chi-square probability distribution. 3. How does the null hypothesis evaluated using McNemar’s test differ from that evaluated by the chi-square test? 4. How can the use of a restricted rather than a random sample affect the results of an analysis? 5. Consider the chi-square distribution with 2 degrees of freedom. (a) What proportion of the area under the curve lies to the right of χ2 = 9.21? (b) What proportion of the area under the curve lies to the right of χ2 = 7.38? (c) What value of χ2 cuts off the upper 10% of the distribution? 6. Consider the chi-square distribution with 17 degrees of freedom. (a) What proportion of the area under the curve lies to the right of χ2 = 33.41? (b) What proportion of the area under the curve lies to the left of χ2 = 27.59? (c) What value of χ2 cuts off the upper 10% of the distribution? 7. The following data come from a study designed to investigate drinking problems among college students [258]. A group of students were asked whether they had ever driven an automobile while drinking. Four years later, after the legal drinking age had been raised, a different group of college students were asked the same question. Drove While Drinking
Survey First
Second
Yes No
1250 1387
1991 1666
2241 3053
Total
2637
2657
5294
Total
(a) Use the chi-square test to evaluate the null hypothesis that the population proportions of students who drove while drinking are the same in the two calendar years. (b) What is the probability distribution of the test statistic? (c) What do you conclude about the behavior of college students? (d) Again test the null hypothesis that the proportions of students who drove while drinking are identical for the two calendar years. This time, use the method based on the normal approximation to the binomial distribution that was presented in Section 14.6. Do you reach the same conclusion? (e) Construct a 95% confidence interval for the true difference in population proportions. (f) Does the 95% confidence interval contain the value 0? Would you have expected that it would?
8. A study was conducted to evaluate the relative efficacy of supplementation with calcium versus calcitriol in the treatment of postmenopausal osteoporosis [259]. Calcitriol is an agent that has the ability to increase gastrointestinal absorption of calcium. A number of patients withdrew from this study prematurely due to the adverse effects of treatment, which include thirst, skin problems, and neurologic symptoms. The relevant data appear below.

                        Withdrawal
      Treatment       Yes        No      Total
      Calcitriol       27       287        314
      Calcium          20       288        308
      Total            47       575        622
   (a) Compute the sample proportion of subjects who withdrew from the study in each treatment group.
   (b) Test the null hypothesis that there is no association between treatment group and withdrawal from the study at the 0.05 level of significance. What is the p-value of the test?
   (c) Do you reject or fail to reject the null hypothesis? What do you conclude?

9. In a survey conducted in Italy, physicians with different specialties were questioned regarding the surgical treatment of early breast cancer [260]. In particular, they were asked whether they would recommend radical surgery regardless of a patient's age (R), conservative surgery only for younger patients (CR), or conservative surgery regardless of age (C). The results of this survey are presented below.
   (a) At the 0.05 level of significance, test the null hypothesis that there is no association between physician specialty and recommended treatment. First use the chi-square test, and then Fisher's exact test.
   (b) What do you conclude?

                        Recommended Surgery
      Specialty          R      CR       C     Total
      Internal           6      22      42        70
      Surgery           23      61     127       211
      Radiotherapy       2       3      54        59
      Oncology           1      12      43        56
      Gynecology         1      12      31        44
      Total             33     110     297       440
10. The following table compiles data from six studies designed to investigate the accuracy of death certificates. The causes of death listed on 5373 death certificates were compared to autopsy results. Of those considered, 3726 certificates were confirmed to be accurate, 783 either lacked information or contained inaccuracies but did not require recoding of the underlying cause of death, and 864 were incorrect and required recoding [261].
                              Death Certificate Status
      Hospital            Confirmed    No        Recoding    Total
                          Accurate     Change
      1955–1965               2040       367         327      2734
      1970                     149        60          48       257
      1970–1971                288        25          70       383
      1975–1977                703       197         252      1152
      1977–1978                425        62          88       575
      1980                     121        72          79       272
      Total                   3726       783         864      5373
   (a) Do you believe that the results are homogeneous or consistent across studies? Explain.
   (b) It should be noted that autopsies are not performed at random; in fact, many are done because the cause of death listed on the certificate is uncertain. What problem might arise if you attempt to use the results of these studies to make inference about the population as a whole?

11. Suppose that you are interested in investigating the association between retirement status and heart disease. One concern might be the age of the subjects: an older person is more likely to be retired, and also more likely to have heart disease. In one study, therefore, 127 victims of cardiac arrest were matched on a number of characteristics that included age with 127 healthy control subjects; retirement status was then ascertained for each subject [262].
                        Cardiac Arrest
      Healthy        Retired    Not Retired    Total
      Retired             27             12       39
      Not Retired         20             68       88
      Total               47             80      127
   (a) Test the null hypothesis that there is no association between retirement status and cardiac arrest.
   (b) What do you conclude?
   (c) Estimate the odds ratio of being retired for healthy individuals versus those who have experienced cardiac arrest.
   (d) Construct a 95% confidence interval for the true population odds ratio. Does this interval contain the value 1? What does this suggest?

12. In response to a study suggesting a link between lack of circumcision and cervical cancer, an investigation was conducted to assess the accuracy of reported circumcision status [263]. Before registering at a cancer institute, all patients were asked to fill out a questionnaire. For males, the data requested included circumcision status. This information was confirmed by interview. Subsequently, all patients received a complete physical examination during which the physician noted whether the man was circumcised or not. The data collected over a two-month period are presented in the following table.
                        Patient Statement
      Examination      Yes        No      Total
      Yes               37        47         84
      No                19        89        108
      Total             56       136        192
   At the 0.05 level of significance, does there appear to be an association between the results of the physical examination and the patient's own response? If so, what is the relationship?

13. In an effort to evaluate the health effects of air pollutants containing sulfur, individuals living near a pulp mill in Finland were questioned about various symptoms following a strong emission released in one particular month [264]. The same subjects were questioned again four months later during a low exposure period. A summary of the responses related to the occurrence of headaches is presented below.

                         High Exposure
      Low Exposure      Yes       No     Total
      Yes                 2        2         4
      No                  8       33        41
      Total              10       35        45
   (a) Test the null hypothesis that there is no association between exposure to air pollutants containing sulfur and the occurrence of headaches.
   (b) What do you conclude?

14. In a study of the risk factors for invasive cervical cancer that was conducted in Germany, the following data were collected relating smoking status to the presence or absence of cervical cancer [265].

                    Smoker    Nonsmoker    Total
      Cancer           108          117      225
      No Cancer        163          268      431
      Total            271          385      656
   (a) Estimate the odds ratio of invasive cervical cancer for smokers versus nonsmokers.
   (b) Construct a 95% confidence interval for the population odds ratio.
   (c) Test the null hypothesis that there is no association between smoking status and the presence of cervical cancer at the 0.05 level of significance. What do you conclude?

15. In France, a study was conducted to investigate potential risk factors for ectopic pregnancy [266]. Of the 279 women who experienced ectopic pregnancy, 28 had suffered from pelvic inflammatory disease. Of the 279 women who did not, 6 had suffered from pelvic inflammatory disease.
   (a) Construct a 2 × 2 contingency table for these data.
   (b) Estimate the odds ratio of experiencing ectopic pregnancy for women who have suffered from pelvic inflammatory disease versus women who have not.
   (c) Calculate a 99% confidence interval for the population odds ratio.
   (d) Would a 95% confidence interval be longer or shorter than the 99% interval? Explain.

16. The data below were taken from a study investigating the associations between spontaneous abortion and various risk factors including alcohol use [267].

      Alcohol Use          Number of       Spontaneous
      (drinks per week)    Pregnancies     Abortions
      0                        33164           6793
      1–2                       9099           2068
      3–6                       3069            776
      7–20                      1527            456
      21+                        287             98
   (a) For each level of alcohol consumption, estimate the probability that a woman who becomes pregnant will undergo a spontaneous abortion.
   (b) For each category of alcohol use, estimate the relative odds of experiencing a spontaneous abortion for women who consume some amount of alcohol versus those who do not consume any.
   (c) In each case, calculate a 95% confidence interval for the odds ratio.
   (d) Based on these data, what do you conclude?

17. In a study of hiv infection among women entering the New York State prison system, 475 inmates were cross-classified with respect to hiv seropositivity and their histories of intravenous drug use [268]. These data are saved in a dataset called prison. The indicators of seropositivity are saved under the variable name hiv, and those of intravenous drug use under ivdu.
   (a) Among women who have used drugs intravenously, what proportion are hiv positive? Among women who have not used drugs intravenously, what proportion are hiv positive?
   (b) At the 0.05 level of significance, test the null hypothesis that there is no association between history of intravenous drug use and hiv seropositivity. What is the p-value?
   (c) What do you conclude?
   (d) Estimate the odds ratio for being hiv positive for women who have used intravenous drugs versus those who have not.
   (e) Construct a 95% confidence interval for this odds ratio.

18. A study was conducted to determine whether geographic variations in the use of medical and surgical services could be explained in part by differences in the appropriateness with which physicians use these services [269]. One concern might be that a high rate of inappropriate use of a service is associated with high overall use within a particular region. For the procedure coronary angiography, three geographic areas were studied: a high-use site (Site 1), a low-use urban site (Site 2), and a low-use rural site (Site 3).
   Within each geographical region, each use of this procedure was classified as appropriate, equivocal, or inappropriate by a panel of expert physicians. The information is saved in a dataset called angiography. Site number is saved under the variable name site, and level of appropriateness under appropriate.
   (a) At the 0.05 level of significance, test the null hypothesis that there is no association between geographic region and the appropriateness of use of coronary angiography. What is the p-value?
   (b) What is the probability distribution of the test statistic?
   (c) What do you conclude?
   (d) If you conclude that there is an association between geographic region and level of appropriateness, how do the three sites differ?
19. Two different questionnaire formats designed to measure alcohol consumption – one encompassing all types of food in the diet and the other specifically targeting alcohol use – were compared for males and females between 50 and 65 years of age living in a particular community [270]. For each of the alcoholic beverages beer, liquor, red wine, and white wine, each subject was classified as either a nondrinker (never or less than one drink per month) or a drinker (one or more drinks per month) according to each of the questionnaires. The relevant information pertaining to beer consumption is saved in the dataset alcohol. Categories for the generic questionnaire are saved under the name generic_quest, and those for the questionnaire targeting alcohol use are saved under alcohol_quest.
   (a) Test the null hypothesis that there is no association between drinking status and type of questionnaire.
   (b) What do you conclude?

20. Since the role of rescue breathing in cardiopulmonary resuscitation was uncertain, a randomized study was conducted to compare survival to hospital discharge among individuals experiencing out-of-hospital cardiac arrest when bystanders were instructed to perform chest compression plus rescue breathing versus chest compression alone [271]. Data from this study are saved in the dataset cpr. Dispatcher instructions are saved under the name procedure, and survival to hospital discharge under the name survival.
   (a) Among individuals receiving chest compression plus rescue breathing, what proportion survived to hospital discharge? Among those receiving chest compression only, what proportion survived to hospital discharge?
   (b) At the 0.05 level of significance, test the null hypothesis that the proportions of patients surviving to hospital discharge are the same for individuals receiving chest compression plus rescue breathing versus chest compression alone. What is the p-value?
   (c) What do you conclude?
   (d) Estimate and interpret the risk ratio for survival to hospital discharge for individuals receiving chest compression plus rescue breathing versus chest compression alone.
   (e) Calculate a 95% confidence interval for the risk ratio.

21. Intimate partner violence (ipv) toward a woman either before or during her pregnancy has been documented as a risk factor for the health of both the mother and her unborn child. A study conducted in the postnatal wards of a public hospital in Bangladesh examined the relationship between experience of ipv by a woman and the birth weight of the infant [80].
   Data are contained in the dataset ipv. Low birth weight was defined as < 2.5 kilograms, and normal birth weight as ≥ 2.5 kilograms; this information is saved under the variable name low_bwt. A binary variable indicating whether a woman experienced physical intimate partner violence during her pregnancy is saved as ipv_p.
   (a) In Chapter 14 Review Exercise 16, the method for comparing two proportions introduced in Section 14.6 was used to test the null hypothesis that the proportions of low birth weight infants are identical for those whose mothers experienced physical intimate partner violence and those whose mothers did not. Use the chi-square test to evaluate the same null hypothesis.
   (b) What is the value of the test statistic? What is its probability distribution?
   (c) What is the p-value of the test?
   (d) What do you conclude?
16 Correlation
CONTENTS
16.1 Two-Way Scatter Plot
16.2 Pearson Correlation Coefficient
16.3 Spearman Rank Correlation Coefficient
16.4 Further Applications
16.5 Review Exercises
In the preceding chapters, we discuss measures of the strength of association between two dichotomous random variables. We now begin to investigate the relationships that can exist among continuous variables. One statistical technique often used to measure such an association is known as correlation analysis. Correlation is defined as the quantification of the degree to which two continuous random variables are related, provided that the relationship is linear.
16.1 Two-Way Scatter Plot
Suppose that we are interested in a pair of continuous random variables, each of which is measured on the same set of persons, hospitals, countries, or other units of study. For example, we might wish to investigate the relationship between the percentage of children who have been immunized against the infectious diseases diphtheria, pertussis, and tetanus (dpt) in a given country, and the corresponding mortality rate for children under 5 years of age. The United Nations Children's Fund considers the under-5 mortality rate to be one of the most important indicators of the level of well-being for a population of children. The data for a sample of 20 countries from around the world are shown in Table 16.1 [272]. If X represents the percentage of children receiving all three of the required doses of the dpt vaccine, and Y represents the under-5 mortality rate, we have a pair of outcomes (x_i, y_i) for each nation in the sample. The first country on the list, Bolivia, has a dpt immunization percentage of 83% and an under-5 mortality rate of 27 per 1000 live births; therefore, this country is represented by the data point (83, 27).
Before we conduct any type of analysis of these data, we should always create a two-way scatter plot. If we place the outcomes of the X variable along the horizontal axis and the outcomes of the Y variable along the vertical axis, each point on the graph represents a combination of values (x_i, y_i). We can often determine whether a relationship exists between x and y – the outcomes of the random variables X and Y – simply by examining the graph. As an example, the data from Table 16.1 are plotted in Figure 16.1. The percentage of children immunized against dpt is shown on the horizontal axis and the under-5 mortality rate on the vertical axis. Not surprisingly, the mortality rate tends to decrease as the percentage of children immunized increases.
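A plot like Figure 16.1 can be drawn with base R graphics; the following is a minimal sketch using the values from Table 16.1, with vector names of our own choosing:

    immunized <- c(83, 83, 92, 91, 99, 95, 72, 91, 96, 97,
                   99, 89, 95, 99, 88, 95, 97, 81, 98, 94)
    mortality <- c(27, 14, 28, 5, 9, 21, 55, 2, 4, 48,
                   4, 37, 3, 2, 13, 4, 7, 44, 11, 4)

    plot(immunized, mortality,
         xlab = "Percentage immunized against dpt",
         ylab = "Under-5 mortality rate per 1000 live births")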
TABLE 16.1
Percentage of children immunized against dpt and under-5 mortality rate for a sample of 20 countries, 2018

   Country               Percentage    Mortality Rate per
                         Immunized     1000 Live Births
   Bolivia                   83               27
   Brazil                    83               14
   Cambodia                  92               28
   Canada                    91                5
   China                     99                9
   Egypt                     95               21
   Ethiopia                  72               55
   Finland                   91                2
   France                    96                4
   Ghana                     97               48
   Greece                    99                4
   India                     89               37
   Italy                     95                3
   Japan                     99                2
   Mexico                    88               13
   Poland                    95                4
   Russian Federation        97                7
   Senegal                   81               44
   Turkey                    98               11
   United Kingdom            94                4

16.2 Pearson Correlation Coefficient
In the underlying population from which the sample of points (x_i, y_i) is selected, the correlation between the random variables X and Y is denoted by the Greek letter ρ (rho). The correlation ρ quantifies the strength of the linear relationship between the outcomes x and y. It can be thought of as the average of the product of the standard normal deviates of X and Y; in particular,

ρ = average[ ((X − µ_x)/σ_x) ((Y − µ_y)/σ_y) ].

The estimator of ρ is known as the Pearson correlation coefficient, or simply the correlation coefficient. The correlation coefficient is denoted by r and is calculated as

r = [1/(n − 1)] Σ_{i=1}^{n} ((x_i − x̄)/s_x)((y_i − ȳ)/s_y)
  = [1/(n − 1)] [Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ)] / (s_x s_y).
FIGURE 16.1 Under-5 mortality rate versus percentage of children immunized against dpt for 20 countries, 2018

In this formula, x̄ and ȳ are the sample means of the x and y values, respectively, and s_x and s_y are the sample standard deviations. An equivalent formula for r is

r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √[Σ_{i=1}^{n} (x_i − x̄)² Σ_{i=1}^{n} (y_i − ȳ)²].

The correlation coefficient is a dimensionless number, meaning that it has no units of measurement. The maximum value that r can achieve is 1, and its minimum value is −1. Therefore, for any given set of observations, −1 ≤ r ≤ 1. The values r = 1 and r = −1 occur when there is an exact linear relationship between x and y; if we were to plot all pairs of outcomes (x, y), the points would lie on a straight line. Examples of perfect correlation are illustrated in Figures 16.2 (a) and (b). As the relationship between x and y deviates from perfect linearity, r moves away from 1 or −1 and closer to 0. If y tends to increase in magnitude as x increases, r is greater than 0 and x and y are said to be positively correlated. If y decreases as x increases, r is less than 0 and the two variables are negatively correlated. If r = 0, as it does for the samples of points pictured in Figures 16.2 (c) and (d), there is no linear relationship between x and y and the variables are uncorrelated. However, as can be seen in Figure 16.2 (d), a nonlinear relationship may exist.
Consider for example the association between age and the rate of fatal motor vehicle accidents in the United States in 2014–2015, shown in Figure 16.3 [273]. We can see that the rate is lowest for individuals between the ages of 30 and 69, and higher for both younger people and older ones. A "U-shaped" relationship of this type would not be well quantified by a correlation coefficient, since the association is not linear.
For the data in Table 16.1, the mean percentage of children immunized against dpt is

x̄ = (1/20) Σ_{i=1}^{20} x_i = 91.7%,
FIGURE 16.2 Scatter plots displaying possible relationships between X and Y : (a) perfect positive correlation, (b) perfect negative correlation, (c) no correlation, (d) no correlation
FIGURE 16.3 Rate of fatal motor vehicle crashes per 100 million miles driven versus age of driver, 2014–2015
and the mean value of the under-5 mortality rate is

ȳ = (1/20) Σ_{i=1}^{20} y_i = 17.1 per 1000 live births.

The correlation coefficient is

r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √[Σ_{i=1}^{n} (x_i − x̄)² Σ_{i=1}^{n} (y_i − ȳ)²]
  = Σ (x_i − 91.7)(y_i − 17.1) / √[Σ (x_i − 91.7)² Σ (y_i − 17.1)²]
  = −0.63.
Based on this sample, there appears to be a moderately strong linear relationship between the percentage of children immunized against dpt in a specified country and its under-5 mortality rate. Since r is negative, mortality rate decreases in magnitude as percent immunization increases. Care must be taken when interpreting this relationship, however. An effective immunization program might be the primary reason for the decrease in mortality, or it might be a ramification of a successful comprehensive health care system that is itself the cause of the decrease. The correlation coefficient merely tells us that a linear relationship exists between two variables; it does not specify whether the relationship is cause-and-effect.
One additional word of caution: In this example, countries are the units of study, not individual people. We do not know whether the children who were immunized against dpt are the same individuals who survived to the age of 5 years, or not. When using the correlation coefficient, we are not able to make inference at a more granular level than that at which the data were collected. In the example above, we used immunization and mortality rates collected for each country, and therefore cannot draw conclusions about the particular people who live in those countries. The error in reasoning that can result when we use aggregate data to make inference about individuals is called an ecological fallacy.
Just as we made inference about a population mean µ based on the sample mean x̄, we would also like to be able to draw conclusions about the unknown population correlation ρ using the sample correlation coefficient r. Most frequently, we are interested in determining whether any linear relationship exists between the random variables X and Y. This can be accomplished by testing the null hypothesis that there is no such correlation in the underlying population,

H0: ρ = ρ0 = 0,

versus the alternative hypothesis

HA: ρ ≠ 0.

The procedure is similar to other tests of hypotheses we have encountered; it involves calculating a test statistic which is then used to find the probability of obtaining a sample correlation coefficient as extreme as or more extreme than the observed value r, given that the null hypothesis is true. Since the estimated standard error of r may be expressed as

ŝe(r) = √[(1 − r²)/(n − 2)],
the test statistic takes the form

t = (r − ρ0)/ŝe(r) = (r − 0)/√[(1 − r²)/(n − 2)] = r √[(n − 2)/(1 − r²)].
If we assume that the pairs of observations (x_i, y_i) were obtained randomly and that both X and Y are normally distributed, this test statistic has a t distribution with n − 2 degrees of freedom, given that the null hypothesis is true.
Suppose we want to know whether a linear relationship exists between percent immunization against dpt and the under-5 mortality rate in the population of countries around the world. We conduct a two-sided test of the null hypothesis of no linear association at the α = 0.05 level of significance. Recall that we previously found r to be equal to −0.63. Therefore,

t = r √[(n − 2)/(1 − r²)] = −0.63 √[(20 − 2)/(1 − (−0.63)²)] = −3.44.

Referring to Table A.4, we observe that for a t distribution with 18 degrees of freedom, 2(0.0005) < p < 2(0.005), so 0.001 < p < 0.01. Given that ρ = 0, the probability of observing a sample correlation coefficient as far from 0 as r = −0.63, or even more extreme than this, is quite small. We reject the null hypothesis at the 0.05 level of significance. Based on this sample, there is evidence that the true population correlation ρ is different from 0. Under-5 mortality rate decreases as the percentage of children immunized increases. (Note, however, that neither percentage of children immunized against dpt nor under-5 mortality rate is normally distributed; the percentage immunized is skewed to the left, and the mortality rate skewed to the right. Therefore, the hypothesis testing procedure performed above cannot be assumed to be accurate for these data.)
The testing procedure described is valid only for the special case in which the hypothesized value of the population correlation is equal to 0. If ρ is equal to some other value ρ0, then the sampling distribution of r is skewed, and the test statistic no longer follows a t distribution. Methods for testing the more general hypothesis H0: ρ = ρ0 are available [174], but are beyond the scope of this text.
The Pearson correlation coefficient r has several limitations. First, it quantifies only the strength of the linear relationship between two variables. If X and Y have a nonlinear relationship – as they do in Figures 16.2 (d) and 16.3 – it will not provide a valid measure of this association. Second, care must be taken when the data contain any outliers, or pairs of observations that lie considerably outside the range of the other data points. The sample correlation coefficient is highly sensitive to extreme values and, if one or more are present, often gives misleading results. Third, the estimated correlation should never be extrapolated beyond the observed ranges of the variables; the relationship between X and Y may change outside of this region. Finally, it must be kept in mind that a high correlation between two variables does not in itself imply a cause-and-effect relationship.
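Both the coefficient and the test above are available directly in R; a brief sketch, assuming the immunized and mortality vectors defined in the earlier sketch:

    cor(immunized, mortality)        # Pearson r, about -0.63
    cor.test(immunized, mortality)   # t = -3.44 on 18 df, p about 0.003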
Summary: Pearson Correlation Coefficient

Pearson correlation coefficient    r = [1/(n − 1)] Σ_{i=1}^{n} ((x_i − x̄)/s_x)((y_i − ȳ)/s_y)
Null hypothesis                    H0: ρ = ρ0 = 0
Alternative hypothesis             HA: ρ = ρ0 ≠ 0
Test statistic                     t = r √[(n − 2)/(1 − r²)]
Distribution of test statistic     t distribution with n − 2 degrees of freedom

16.3 Spearman Rank Correlation Coefficient
Like other parametric techniques which assume that the populations we are sampling from are normally distributed, the Pearson correlation coefficient is sensitive to outlying values. We may be interested in calculating a measure of association that is more robust. One approach is to rank the two sets of outcomes x and y separately and calculate a rank correlation coefficient. This procedure – which results in a quantity known as the Spearman rank correlation coefficient – may be classified among the nonparametric methods presented in Chapter 13.
The Spearman rank correlation coefficient, denoted r_s, is simply the Pearson correlation r calculated for the ranked values of x and y. Therefore,

r_s = Σ_{i=1}^{n} (x_ri − x̄_r)(y_ri − ȳ_r) / √[Σ_{i=1}^{n} (x_ri − x̄_r)² Σ_{i=1}^{n} (y_ri − ȳ_r)²],

where x_ri and y_ri are the ranks associated with the ith subject rather than the actual observations. An equivalent method for computing r_s is provided by the formula

r_s = 1 − 6 Σ_{i=1}^{n} d_i² / [n(n² − 1)].

Here, n is the number of data points in the sample and d_i is the difference between the rank of x_i and the rank of y_i. Like the Pearson correlation coefficient, the Spearman rank correlation ranges in value from −1 to 1. Values of r_s close to the extremes indicate a high degree of correlation between x and y; values near 0 imply a lack of linear association between the two variables.
Suppose that we were to rank the percentages of children immunized against dpt and the under-5 mortality rates presented in Table 16.1 from smallest to largest, separately for each variable, assigning average ranks to tied observations. The results are shown in Table 16.2, along with the difference in ranks for each country, and the squares of these differences. Using the computational formula for r_s, the Spearman rank correlation coefficient is

r_s = 1 − 6 Σ d_i² / [n(n² − 1)] = 1 − 6(1974.5)/[20(399)] = −0.49.
TABLE 16.2
Ranked percentages of children immunized against dpt and under-5 mortality rates for 20 countries, 2018

   Country               Percentage    Rank    Mortality    Rank      d_i       d_i²
                         Immunized             Rate*
   Ethiopia                  72          1        55         20      −19       361
   Senegal                   81          2        44         18      −16       256
   Bolivia                   83          3.5      27         15      −11.5     132.25
   Brazil                    83          3.5      14         13       −9.5      90.25
   Mexico                    88          5        13         12       −7        49
   India                     89          6        37         17      −11       121
   Canada                    91          7.5       5          8       −0.5       0.25
   Finland                   91          7.5       2          1.5      6        36
   Cambodia                  92          9        28         16       −7        49
   United Kingdom            94         10         4          5.5      4.5      20.25
   Egypt                     95         12        21         14       −2         4
   Italy                     95         12         3          3        9        81
   Poland                    95         12         4          5.5      6.5      42.25
   France                    96         14         4          5.5      8.5      72.25
   Ghana                     97         15.5      48         19       −3.5      12.25
   Russian Federation        97         15.5       7          9        6.5      42.25
   Turkey                    98         17        11         11        6        36
   China                     99         19         9         10        9        81
   Greece                    99         19         4          5.5     13.5     182.25
   Japan                     99         19         2          1.5     17.5     306.25
                                                             Total:           1974.5

* Per 1000 live births
This value is somewhat smaller in magnitude than the Pearson correlation coefficient – perhaps r is inflated due to the nonnormality of the data – but it still suggests a moderate negative relationship between the percentage of children immunized against dpt and the under-5 mortality rate.
The Spearman rank correlation coefficient may also be thought of as a measure of the concordance of the ranks for the outcomes x and y. If the 20 measurements of percent immunization against dpt and under-5 mortality rate in Table 16.2 happened to be ranked in the same order for each variable – meaning that the country with the ith largest percent immunization also has the ith largest mortality rate for all values of i – then each difference d_i would be equal to 0, and

r_s = 1 − 6(0)/[n(n² − 1)] = 1.

If the ranking of the first variable is the inverse of the ranking of the second – so that the country with the largest percentage of children immunized against dpt has the smallest under-5 mortality rate, and so forth – it can be shown that r_s = −1. When there is no linear correspondence between the two sets of ranks, then r_s = 0. If the sample size n is not too small – in particular, if it is greater than or equal to 10 – and if we can assume that pairs of ranks (x_ri, y_ri) are chosen randomly, then we may test the null hypothesis
that the unknown population correlation is equal to 0,

H0: ρ = 0,

using the same procedure we used for the Pearson correlation. For the ranked data of Table 16.2, the test statistic is

t_s = r_s √[(n − 2)/(1 − r_s²)] = −0.49 √[(20 − 2)/(1 − (−0.49)²)] = −2.38.

For a t distribution with 18 degrees of freedom, 0.02 < p < 0.05. Therefore, we reject H0 at the 0.05 level and conclude that the true population correlation is less than 0. This testing procedure does not require that X and Y be normally distributed.
Like other nonparametric techniques, the Spearman rank correlation coefficient has advantages and disadvantages. It is much less sensitive to outlying values than the Pearson correlation coefficient. In addition, it can be used when one or both of the relevant variables are discrete or ordinal. Because it relies on ranks rather than actual observations, however, the nonparametric method does not use all the information known about a distribution.
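The rank correlation and its test are also built into R; another brief sketch, again assuming the immunized and mortality vectors from the sketch in Section 16.1:

    cor(immunized, mortality, method = "spearman")       # r_s, about -0.49
    cor.test(immunized, mortality, method = "spearman")  # test of H0: rho = 0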
Summary: Spearman Rank Correlation Coefficient

Spearman rank correlation coefficient    r_s = Σ_{i=1}^{n} (x_ri − x̄_r)(y_ri − ȳ_r) / √[Σ_{i=1}^{n} (x_ri − x̄_r)² Σ_{i=1}^{n} (y_ri − ȳ_r)²]
Null hypothesis                          H0: ρ = ρ0 = 0
Alternative hypothesis                   HA: ρ = ρ0 ≠ 0
Test statistic                           t_s = r_s √[(n − 2)/(1 − r_s²)]
Distribution of test statistic           t distribution with n − 2 degrees of freedom

16.4 Further Applications
In Chapter 2, we use a two-way scatter plot to investigate the association between two measures of lung function – forced vital capacity (fvc) and forced expiratory volume in one second (fev1), both measured in liters – for a sample of 19 asthmatic individuals who participated in a study investigating the physical effects of sulfur dioxide [48]. The measurements for these subjects are displayed using a two-way scatter plot in Figure 16.4 and listed in Table 16.3. Note that higher values of fvc appear to be associated with higher fev1. We would like to determine whether there is any evidence of a linear relationship between these two quantities.
FIGURE 16.4
Forced vital capacity versus forced expiratory volume in one second for 19 asthmatic subjects

Let x represent the observed values of forced expiratory volume in one second, and y the observed forced vital capacities. The mean fev1 is

x̄ = (1/19) Σᵢ₌₁¹⁹ xᵢ = 3.78 liters,

and the mean fvc is

ȳ = (1/19) Σᵢ₌₁¹⁹ yᵢ = 4.92 liters.
The Pearson correlation coefficient, which quantifies the degree to which these outcomes are linearly related, is

r = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √{[Σᵢ₌₁ⁿ (xᵢ − x̄)²][Σᵢ₌₁ⁿ (yᵢ − ȳ)²]}
  = Σᵢ₌₁¹⁹ (xᵢ − 3.78)(yᵢ − 4.92) / √{[Σᵢ₌₁¹⁹ (xᵢ − 3.78)²][Σᵢ₌₁¹⁹ (yᵢ − 4.92)²]}
  = 0.86.
A value of −1 or 1 would imply that an exact linear relationship exists. If r were equal to 0, that would mean there is no linear relationship at all. Based on the sample of 19 asthmatic individuals, there appears to be a strong relationship between forced vital capacity and forced expiratory volume in one second for patients with asthma. Since r is positive, fvc increases as fev1 increases. We might also wish to use the information in the sample to test the null hypothesis that there is no linear relationship between forced vital capacity and forced expiratory volume in one second in
TABLE 16.3
Forced vital capacity (fvc) and forced expiratory volume in one second (fev1) for 19 asthmatic subjects

Subject   FVC (l)   FEV1 (l)
   1        5.8       4.7
   2        5.1       4.3
   3        4.1       3.5
   4        5.4       4.0
   5        4.4       3.2
   6        5.3       4.7
   7        6.8       4.3
   8        5.3       4.7
   9        6.2       5.2
  10        6.2       4.2
  11        4.6       3.5
  12        4.0       3.2
  13        3.9       2.6
  14        2.8       2.0
  15        5.3       4.0
  16        4.5       3.9
  17        3.8       3.0
  18        5.8       4.5
  19        4.1       2.4
the underlying population of asthmatic patients, or H₀: ρ = 0, using a two-sided test conducted at the 0.05 level of significance. Since r = 0.86,

t = r √[(n − 2)/(1 − r²)] = 0.86 √[(19 − 2)/(1 − (0.86)²)] = 6.95.

For a t distribution with 17 degrees of freedom, p < 0.001. We reject H₀ and conclude that the true population correlation is not equal to 0; since r is positive, forced vital capacity increases in magnitude as forced expiratory volume in one second increases. (Note that this hypothesis testing procedure assumes that both X and Y are normally distributed.)

If we are interested in calculating a more robust measure of association between two variables, we can order the sets of outcomes x and y from smallest to largest and compute the rank correlation instead. The Spearman rank correlation is simply the Pearson r calculated using ranks rather than the actual observations. The ranked measures of fvc and fev1 are shown in Table 16.4, along with the differences in ranks for each asthmatic subject and the squares of these differences. The Spearman
TABLE 16.4
Ranked forced vital capacity (fvc) and forced expiratory volume in one second (fev1) for 19 asthmatic subjects

Subject   FVC   Rank   FEV1   Rank     dᵢ     dᵢ²
  14      2.8    1      2.0    1        0      0
  17      3.8    2      3.0    4       −2      4
  13      3.9    3      2.6    3        0      0
  12      4.0    4      3.2    5.5     −1.5    2.25
  19      4.1    5.5    2.4    2        3.5   12.25
   3      4.1    5.5    3.5    7.5     −2      4
   5      4.4    7      3.2    5.5      1.5    2.25
  16      4.5    8      3.9    9       −1      1
  11      4.6    9      3.5    7.5      1.5    2.25
   2      5.1   10      4.3   13.5     −3.5   12.25
   6      5.3   12      4.7   17       −5     25
  15      5.3   12      4.0   10.5      1.5    2.25
   8      5.3   12      4.7   17       −5     25
   4      5.4   14      4.0   10.5      3.5   12.25
   1      5.8   15.5    4.7   17       −1.5    2.25
  18      5.8   15.5    4.5   15        0.5    0.25
  10      6.2   17.5    4.2   12        5.5   30.25
   9      6.2   17.5    5.2   19       −1.5    2.25
   7      6.8   19      4.3   13.5      5.5   30.25
                                      Total: 167.75
rank correlation for the data is

r_s = 1 − 6 Σᵢ₌₁ⁿ dᵢ² / [n(n² − 1)] = 1 − 6(167.75)/[19(360)] = 0.85.
This value is nearly identical in magnitude to the Pearson correlation coefficient; it also signifies a strong positive relationship between forced vital capacity and forced expiratory volume in one second. To test the null hypothesis that the unknown population correlation is equal to 0, or H₀: ρ = 0, using the Spearman rank correlation, we calculate

t_s = r_s √[(n − 2)/(1 − r_s²)] = 0.85 √[(19 − 2)/(1 − (0.85)²)] = 6.65.
TABLE 16.5
Stata output for the Pearson correlation coefficient

                 |      fvc      fev
    -------------+------------------
             fvc |   1.0000
                 |
                 |       19
             fev |   0.8643   1.0000
                 |   0.0000
                 |       19       19
TABLE 16.6
R output for the Pearson correlation coefficient

            Pearson's product-moment correlation

    data:  data$fev and data$fvc
    t = 7.0859, df = 17, p-value = 1.827e-06
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     0.6751854 0.9468261
    sample estimates:
          cor
    0.8643268
For a t distribution with 17 degrees of freedom, p < 0.001. We again reject the null hypothesis that the population correlation is equal to 0, and conclude that as fvc increases, fev1 also increases. This testing procedure does not require that X and Y be normally distributed. Finding either the Pearson correlation coefficient or the Spearman rank correlation for two random variables involves quite a bit of computation; however, most computer packages will calculate both correlations for us. Table 16.5 shows the Stata output for the Pearson correlation coefficient. The number underneath the correlation 0.86 is the p-value for the test of the null hypothesis H0 : ρ = 0. Here, 0.0000 means that p < 0.0001. Table 16.6 shows similar output from R. Tables 16.7 and 16.8 display the Stata and R output for the Spearman rank correlation.
TABLE 16.7
Stata output for the Spearman rank correlation coefficient

    Number of obs  =      19
    Spearman's rho =  0.8499

    Test of H0: fvc and fev are independent
        Prob > |t| =  0.0000
TABLE 16.8
R output for the Spearman rank correlation coefficient

            Spearman's rank correlation rho

    data:  data$fev and data$fvc
    S = 171.13, p-value = 4.1e-06
    alternative hypothesis: true rho is not equal to 0
    sample estimates:
          rho
    0.8498897
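As a rough sketch of how output like this is produced in R – assuming, as in the tables above, a data frame named data with columns fev and fvc – both correlations and their tests come from a single function, cor.test():

    # Pearson correlation and test of H0: rho = 0
    cor.test(data$fev, data$fvc, method = "pearson")

    # Spearman rank correlation and its test
    cor.test(data$fev, data$fvc, method = "spearman")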
16.5 Review Exercises
1. When investigating the relationship between two continuous random variables, why is it important to start by creating a scatter plot of the data?

2. What are the strengths and limitations of the Pearson correlation coefficient?

3. How does the Spearman rank correlation differ from the Pearson correlation?

4. If a test of hypothesis indicates that the correlation between two random variables is not significantly different from 0, does this necessarily imply that the variables are independent? Explain.

5. In a study conducted in Italy, 10 patients with hypertriglyceridemia were placed on a low-fat, high-carbohydrate diet. Before the start of the diet, cholesterol and triglyceride measurements were recorded for each subject [274].

   Patient   Cholesterol Level (mmol/l)   Triglyceride Level (mmol/l)
      1               5.12                          2.30
      2               6.18                          2.54
      3               6.77                          2.95
      4               6.65                          3.77
      5               6.36                          4.18
      6               5.90                          5.31
      7               5.48                          5.53
      8               6.02                          8.83
      9              10.34                          9.48
     10               8.51                         14.20
   (a) Construct a two-way scatter plot for these data.
   (b) Does there appear to be any evidence of a linear relationship between cholesterol and triglyceride levels prior to the diet?
   (c) Compute r, the Pearson correlation coefficient.
   (d) At the 0.05 level of significance, test the null hypothesis that the population correlation ρ is equal to 0. What do you conclude?
   (e) Calculate r_s, the Spearman rank correlation coefficient.
   (f) How does the value of r_s compare to r?
   (g) Using r_s, again test the null hypothesis that the population correlation is equal to 0 at the 0.05 level of significance. What do you conclude?

6. Thirty-five patients with ischemic heart disease, a suppression of blood flow to the heart, took part in a series of tests designed to evaluate the perception of pain. In one part of the study, the patients exercised until they experienced angina or chest pain; time until the onset of angina and the duration of the attack were recorded. The observations are saved in the dataset ischemic_hd [275]. Time to angina in seconds is saved under the variable name time; the duration of angina, also in seconds, is saved under the name duration.
   (a) Create a two-way scatter plot of duration of angina versus time to angina.
   (b) In the population of patients with ischemic heart disease, does there appear to be any evidence of a linear relationship between time to angina and the duration of the attack?
   (c) Does the duration of angina tend to increase or decrease as time to angina increases?
   (d) Construct histograms for both time to angina and duration of angina. Do these measurements appear to be normally distributed?
   (e) Compute the Spearman rank correlation.
   (f) Test the null hypothesis H₀: ρ = 0 at the 0.05 level of significance. Do you reject the null hypothesis? What do you conclude?
   (g) Using r_s, again test the null hypothesis that the population correlation is equal to 0. What do you conclude?

7. Suppose you are interested in determining whether a relationship exists between the fluoride content in a public water supply and the dental caries experience of children drinking this water. Data from a study examining 7257 children in 21 cities are saved in the dataset water [276]. The fluoride content of the public water supply in each city, measured in parts per million, is saved under the variable name fluoride. The number of dental caries per 100 children examined is saved under the name caries. The total dental caries experience is obtained by summing the numbers of filled teeth, teeth with untreated dental caries, teeth requiring extraction, and missing teeth.

   (a) Construct a two-way scatter plot of number of dental caries per 100 children versus fluoride content in the public water supply.
   (b) Calculate the correlation between the number of dental caries and fluoride content.
   (c) Test the null hypothesis that the true population correlation ρ is equal to 0. What do you conclude?
   (d) For the 21 cities in the study, the highest fluoride content in a given water supply is 2.6 ppm. If you were to increase the fluoride content of the water to more than 4 ppm, do you believe that the number of dental caries per 100 children would decrease? Explain.

8. One of the functions of the Federation of State Medical Boards is to collect data summarizing disciplinary actions taken against nonfederal physicians by medical licensing boards. "Serious actions" include license revocations, suspensions, and probations. For each of the years 2007 through 2011, the number of serious actions per 1000 doctors was ranked by state from highest to lowest. The ranks are contained in a dataset called actions [277]; the ranks for 2007 are saved under the variable name rank2007, those for 2008 under rank2008, and so on.

   (a) Which states have the highest rates of serious actions in each of the five years 2007 through 2011? Which states have the lowest rates?
   (b) Construct a two-way scatter plot for the ranks of disciplinary actions in 2008 versus the ranks in 2007.
   (c) Does there appear to be a relationship between these two quantities?
   (d) Calculate the correlation for the two sets of ranks.
   (e) Is this correlation significantly different from 0? What do you conclude?
   (f) Calculate the correlations for the ranks in 2007 and those in 2009, 2010, and 2011. What happens to the magnitude of the correlation as the years being compared get further apart?
   (g) Are each of these three correlations significantly different from 0?
   (h) Do you believe that all states are equally strict in taking disciplinary action against physicians? Explain.

9. The dataset maternal_health contains information on maternal health factors for a sample of 20 countries [272]. The variable attendant is the percentage of births attended by a skilled health care worker, typically a doctor, nurse, or midwife, and maternal_mr is the maternal mortality ratio, expressed as the number of deaths from pregnancy-related causes per 100,000 live births.

   (a) Construct a two-way scatter plot of maternal mortality ratio versus the percentage of births attended by a skilled health care worker.
   (b) Does there appear to be a linear relationship between these two maternal health indicators?
   (c) Calculate the Pearson correlation coefficient and the Spearman rank correlation coefficient. Which measure do you prefer? Why?
   (d) Test the null hypothesis that the true population correlation is equal to 0. What is the p-value?
   (e) Do you reject or fail to reject the null hypothesis? What do you conclude?
   (f) Based on this analysis, are you able to conclude that women who are attended by a skilled health care worker when they are giving birth are less likely to die from pregnancy-related causes? Explain.

10. The dataset lowbwt contains information collected for a sample of 100 low birth weight infants born in two teaching hospitals in Boston, Massachusetts [81]. Measurements of systolic blood pressure are saved under the variable name sbp, and values of the Apgar score recorded five minutes after birth – an index of neonatal asphyxia or oxygen deprivation – are saved under the name apgar5. Apgar score is an ordinal random variable that takes values between 0 and 10.

   (a) Calculate the Spearman rank correlation coefficient for systolic blood pressure and five-minute Apgar score for this population of low birth weight infants.
   (b) Does Apgar score tend to increase or decrease as systolic blood pressure increases?
   (c) Test the null hypothesis H₀: ρ = 0 at the 0.05 level of significance. Do you reject or fail to reject the null hypothesis? What do you conclude?

11. The Bayley Scales of Infant Development yield scores on two indices – the Psychomotor Development Index (pdi) and the Mental Development Index (mdi) – which can be used to assess a child's level of functioning in each of these areas at approximately one year of age. As part of a study assessing the development and neurologic status of children who underwent reparative heart surgery during the first three months of life, the Bayley Scales were administered to a sample of one-year-old infants born with congenital heart disease. The data are contained in the dataset bayley [189]. pdi scores are saved under the variable name pdi, while mdi scores are saved under mdi.
   (a) Construct a two-way scatter plot of pdi versus mdi.
   (b) Does there appear to be any evidence of a linear relationship between these two measures?
   (c) Compute r, the Pearson correlation coefficient.
   (d) At the 0.05 level of significance, test the null hypothesis that the population correlation ρ is equal to 0. What do you conclude?
17 Simple Linear Regression
CONTENTS
17.1 Regression Concepts
17.2 The Model
     17.2.1 Population Regression Line
     17.2.2 Method of Least Squares
     17.2.3 Inference for Regression Coefficients
     17.2.4 Inference for Predicted Values
17.3 Evaluation of the Model
     17.3.1 Coefficient of Determination
     17.3.2 Residual Plots
     17.3.3 Transformations
17.4 Further Applications
17.5 Review Exercises
Like correlation analysis, simple linear regression is a technique that is used to explore the nature of the relationship between two continuous random variables. The primary difference between these two analytical methods is that regression enables us to investigate the change in one variable, called the response or outcome, which corresponds to a given change in the other, the explanatory variable. Correlation analysis makes no such distinction; the two variables involved are treated symmetrically. The ultimate objective of regression analysis is to predict or estimate the value of the response that is associated with a fixed value of the explanatory variable. An example of a situation in which regression analysis might be preferred to correlation is illustrated by the pediatric growth charts in Figures 17.1 and 17.2 [278]. Among children of both sexes, head circumference appears to increase linearly between the ages of 4 and 18 years. Rather than quantifying the strength of this association, we might be interested in predicting the change in head circumference that corresponds to a one year increase in age. In this case, head circumference is the response, and age is the explanatory variable. An understanding of their relationship helps parents and pediatricians to monitor growth and detect possible cases of macrocephaly and microcephaly.
FIGURE 17.1
Head circumference versus age for boys (Source: United States Head Circumference Growth Reference Charts, Journal of Pediatrics, 2010 [278])

FIGURE 17.2
Head circumference versus age for girls (Source: United States Head Circumference Growth Reference Charts, Journal of Pediatrics, 2010 [278])

17.1 Regression Concepts

Suppose that we are interested in the probability distribution of a continuous random variable Y. The outcomes of Y, denoted y, are the head circumference measurements in cm for the population of low birth weight infants – defined as those weighing less than 1500 grams – born in two teaching hospitals in Boston, Massachusetts [81]. We are told that the mean head circumference for the infants in this population is

µ_y = 27.0 cm,
and that the standard deviation is

σ_y = 2.5 cm.

Since the distribution of measurements is roughly normal, we are able to say that approximately 95% of the infants have head circumferences that measure between

µ_y − 1.96 σ_y = 27.0 − (1.96)(2.5) = 22.1 cm

and

µ_y + 1.96 σ_y = 27.0 + (1.96)(2.5) = 31.9 cm.
Suppose we also know that the head circumferences of newborn infants increase with gestational age, and that for each specified age x the distribution of measurements is approximately normal. For example, the head circumferences of infants whose gestational age is 26 weeks are normally distributed with mean µ_y|26 = 24.0 cm and standard deviation σ_y|26 = 1.6 cm. Similarly, the head circumferences of infants whose gestational age is 29 weeks are approximately normal with mean µ_y|29 = 26.5 cm and standard deviation σ_y|29 = 1.6 cm, whereas the measurements for infants whose gestational age is 32 weeks are normal with mean µ_y|32 = 29.0 cm and standard deviation σ_y|32 = 1.6 cm.
For each value of gestational age x, the standard deviation σ_y|x is constant and is less than σ_y. In fact, it can be shown that

σ²_y|x = (1 − ρ²) σ²_y,

where ρ is the correlation between X and Y in the underlying population [279]. If X and Y have no linear relationship, then ρ = 0 and

σ²_y|x = (1 − 0) σ²_y = σ²_y.

For the random variables head circumference and gestational age, σ_y = 2.5 cm and σ_y|x = 1.6 cm. Therefore,

(1.6)² = (1 − ρ²)(2.5)²,

and

ρ = √[1 − (1.6)²/(2.5)²] = √0.5904 = ±0.77.
There is a fairly strong correlation between head circumference and gestational age in the underlying population of low birth weight infants. Using this method of calculation, however, we cannot determine whether the correlation is positive or negative. Because the standard deviation of the distribution of head circumference measurements for infants of a specified gestational age (σ y | x = 1.6 cm) is smaller than the standard deviation for infants of all ages combined (σ y = 2.5 cm), working with a single value of gestational age allows
us to be more precise in our descriptions. For example, we can say that approximately 95% of the values of head circumference for the population of infants whose gestational age is 26 weeks lie between

µ_y|26 − 1.96 σ_y|26 = 24.0 − (1.96)(1.6) = 20.9 cm

and

µ_y|26 + 1.96 σ_y|26 = 24.0 + (1.96)(1.6) = 27.1 cm.

Similarly, roughly 95% of the infants whose gestational age is 29 weeks have head circumferences between

µ_y|29 − 1.96 σ_y|29 = 26.5 − (1.96)(1.6) = 23.4 cm

and

µ_y|29 + 1.96 σ_y|29 = 26.5 + (1.96)(1.6) = 29.6 cm,

whereas 95% of the infants whose gestational age is 32 weeks have measurements between

µ_y|32 − 1.96 σ_y|32 = 29.0 − (1.96)(1.6) = 25.9 cm

and

µ_y|32 + 1.96 σ_y|32 = 29.0 + (1.96)(1.6) = 32.1 cm.
In summary, the respective intervals are as follows:

Gestational Age (weeks)   Interval Containing 95% of the Observations
         26                            (20.9, 27.1)
         29                            (23.4, 29.6)
         32                            (25.9, 32.1)

Each of these intervals is constructed to enclose 95% of the population head circumference values for infants of a particular gestational age. None is as wide as (22.1, 31.9), the interval computed for the entire population of low birth weight infants. In addition, the intervals shift to the right – containing higher head circumference values – as gestational age increases.
17.2 The Model

We are now ready to take what we know about the probability distributions of head circumference Y for individual values of gestational age X, and use this information to model Y across the entire range of X values.
17.2.1 Population Regression Line

As noted in the preceding section, mean head circumference increases as gestational age increases. Based on the means plotted in Figure 17.3, the relationship is linear. One way to quantify this relationship is to fit a model of the form

µ_y|x = β₀ + β₁x,
FIGURE 17.3
Population regression line of mean head circumference versus gestational age for low birth weight infants, µ_y|x = 2.3 + 0.83x

where µ_y|x is the mean head circumference of low birth weight infants whose gestational age is x weeks. This model – known as the population regression line – is the equation of a straight line. The parameters β₀ and β₁ are constants called the coefficients of the equation; β₀ is the y-intercept of the line and β₁ is its slope. The y-intercept is the mean value of the response y when x is equal to 0, or µ_y|0. The slope is the change in the mean value of y that corresponds to a one unit increase in x. If β₁ is positive, µ_y|x increases in magnitude as x increases. If β₁ is negative, µ_y|x decreases as x increases.

Even if the relationship between mean head circumference and gestational age is a perfect straight line as implied by this model, the relationship between individual values of head circumference and age is not. As previously noted, the distribution of head circumference measurements for all low birth weight infants of a particular gestational age x is approximately normal with mean µ_y|x and standard deviation σ_y|x. The scatter around the mean is a result of the natural variation among children; we would not expect all low birth weight infants whose gestational age is 29 weeks to have exactly the same head circumference. To accommodate this scatter, we actually fit a model of the form

y = β₀ + β₁x + ε,

where ε is the distance a particular outcome y lies from the population regression line

µ_y|x = β₀ + β₁x.

If ε is positive, then the value y is greater than µ_y|x and y lies above the line. If ε is negative, y is less than µ_y|x and y lies below the line. (The term ε in the model is often called the error, but it does not mean that we are making a mistake. As previously noted, we do not expect all observations to lie on the line.)
FIGURE 17.4
Normality of the outcomes Y for given values of X

In simple linear regression, the coefficients of the population regression line are estimated using a random sample of observations (xᵢ, yᵢ). Before we attempt to fit such a line, however, we must make a few assumptions:

1. For a specified value of x, which is considered to have been measured without error, the distribution of the y values is normal with mean µ_y|x and standard deviation σ_y|x. This concept is illustrated in Figure 17.4.

2. The relationship between µ_y|x and x is described by the straight line µ_y|x = β₀ + β₁x.

3. For any specified value of x, σ_y|x – the standard deviation of the outcomes y – does not change. This assumption of constant variability across all values of x is known as homoscedasticity. It is analogous to the assumption of equal variances in the two-sample t test or the one-way analysis of variance.

4. The outcomes y are independent.
17.2.2 Method of Least Squares

Consider Figure 17.5, the two-way scatter plot of head circumference versus gestational age for a sample of 100 low birth weight infants born in Boston, Massachusetts. The explanatory variable is displayed on the horizontal axis and the response or outcome appears on the vertical axis. The data points themselves vary widely, but the overall pattern suggests that head circumference increases as gestational age increases. In previous chapters we attempted to estimate a population parameter – such as a mean or a proportion – based on the observations in a randomly chosen sample; similarly, we estimate the
FIGURE 17.5
Head circumference versus gestational age for a sample of 100 low birth weight infants

coefficients of a population regression line using a single sample of measurements. Suppose that we were to draw an arbitrary line through the scatter of points in Figure 17.5. One such line is shown in Figure 17.6. Lines sketched by two different individuals are unlikely to be identical, even when both persons are attempting to depict the same trend. The question then arises as to which line best describes the relationship between mean head circumference and gestational age. What is needed is a more objective procedure for estimating the line.

(A word of explanation regarding Figure 17.5: Although the graph contains information for 100 infants, there appear to be far fewer data points in the scatter plot. Since the values of the continuous measurements head circumference and gestational age are each rounded to the nearest integer in this dataset, many infants end up with identical values of these variables; consequently, some data points are plotted on top of others. To make this more transparent, different size circles are used. The larger the circle, the more subjects are represented by that data point. For instance, there is only one infant with gestational age 23 weeks and head circumference 21 cm, hence the small circle at that point. In contrast, the largest circle at gestational age 29 weeks and head circumference 27 cm represents 9 unique infants.)

One mathematical technique for fitting a straight line to a set of points (xᵢ, yᵢ) is known as the method of least squares. Observe that each of the 100 data points representing measurements of head circumference and gestational age lies some vertical distance from the arbitrary line drawn in Figure 17.6; we label this distance eᵢ. If yᵢ is the observed outcome of Y for a particular value xᵢ, and ŷᵢ (y-hat) is the corresponding point on the fitted line, then

eᵢ = yᵢ − ŷᵢ.

The distance eᵢ is called the residual associated with yᵢ and is an estimate of the term ε in the model; the residual is the distance between yᵢ and the line. If all the residuals for a set of data points are equal to 0, this implies that each point (xᵢ, yᵢ) lies directly on the fitted line. The points are as close to the line as they can be; there is no natural variation in the response. Since this is not typically the case, however, we choose a criterion for fitting a line that makes the residuals as small as possible.
FIGURE 17.6
Arbitrary line depicting a relationship between head circumference and gestational age

The sum of the squares of the residuals,

Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²,

is called the residual sum of squares. The least squares regression line is constructed so that the residual sum of squares is minimized. The process of fitting the least squares line represented by

ŷ = β̂₀ + β̂₁x

involves finding β̂₀ and β̂₁, the estimates of the population regression coefficients β₀ and β₁. Using calculus to minimize the error sum of squares

Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)²,

we find that

β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²

and

β̂₀ = ȳ − β̂₁x̄.

These equations yield the slope and the y-intercept for the fitted least squares line. In the expression for β̂₁, the numerator is the sum of the cross-products of deviations around the mean for x and y; the denominator is the sum of squared deviations around the mean for x alone. The equation for β̂₀ is expressed in terms of the estimated slope β̂₁. Once we know β̂₀ and β̂₁, we are able to substitute various values of x into the equation for the line, solve for the corresponding values of ŷ – the
estimated mean of Y for that value of x – and plot these points to draw the least squares regression line.

FIGURE 17.7
Least squares regression of head circumference on gestational age, ŷ = 3.9143 + 0.7801x

The least squares line fitted to the 100 measurements of head circumference and gestational age is

ŷ = 3.9143 + 0.7801x.

This line – which is plotted in Figure 17.7 – has a residual sum of squares that is smaller than the sum for any other line that could be drawn through the scatter of points. The y-intercept of the fitted line is 3.9143 cm. Theoretically, this is the mean value of head circumference that corresponds to a gestational age of 0 weeks. In this example, however, an age of 0 weeks does not make sense, and we will therefore not attempt to interpret the y-intercept. The slope of the line is 0.7801 cm/week, implying that for each one week increase in gestational age, an infant's head circumference increases by 0.7801 cm on average.
Summary: Simple Linear Regression Model

Population regression line
    µ_y|x = β₀ + β₁x
    where µ_y|x is the mean value of an outcome y for a given value of x, β₀ is the y-intercept of the line, and β₁ is the slope of the line

Fitted regression line
    ŷ = β̂₀ + β̂₁x
17.2.3 Inference for Regression Coefficients

We would like to be able to use the least squares regression line

ŷ = β̂₀ + β̂₁x

to make inference about the population regression line

µ_y|x = β₀ + β₁x.

We can begin by saying that β̂₀ is a point estimate of the population y-intercept β₀ and β̂₁ is a point estimate of the slope β₁. If we were to select repeated samples of size n from the underlying population of paired outcomes (x, y) and calculate a least squares line for each set of observations, the estimated values of β₀ and β₁ would vary from sample to sample. We need the standard errors of these estimators – just as we needed σ/√n, the standard error of the sample mean X̄ – to be able to construct confidence intervals and conduct tests of hypotheses. It can be shown that

se(β̂₁) = σ_y|x / √[Σᵢ₌₁ⁿ (xᵢ − x̄)²]

and

se(β̂₀) = σ_y|x √[1/n + x̄²/Σᵢ₌₁ⁿ (xᵢ − x̄)²].
The standard errors of the estimated coefficients β̂₀ and β̂₁ both depend on σ_y|x, the standard deviation of the y values for a given x. In practice, this value is usually unknown. As a result, we must estimate σ_y|x by the sample standard deviation s_y|x, where

s_y|x = √[Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − 2)].

Note that this formula involves the sum of the squared deviations of the actual observations yᵢ from the fitted values ŷᵢ. This sum of squared residuals is the quantity that was minimized when we fit the least squares line. The estimate s_y|x is often called the standard deviation from regression. For the least squares regression of head circumference on gestational age,

s_y|x = 1.5904.

This estimate can be used to compute

ŝe(β̂₁) = s_y|x / √[Σᵢ₌₁ⁿ (xᵢ − x̄)²] = 0.0631

and

ŝe(β̂₀) = s_y|x √[1/n + x̄²/Σᵢ₌₁ⁿ (xᵢ − x̄)²] = 1.8291.
The slope is usually the more important coefficient in the linear regression equation; it quantifies the average change in y that corresponds to each one unit change in x. We can test the null hypothesis that the population slope is equal to β₁₀, or

H₀: β₁ = β₁₀,

against the alternative

H_A: β₁ ≠ β₁₀
by finding p, the probability of observing an estimated slope as extreme as or more extreme than the observed β̂₁, given that β₁₀ is the true population value. The test is carried out by calculating the statistic

t = (β̂₁ − β₁₀) / ŝe(β̂₁).

If the null hypothesis is true, this ratio has a t distribution with n − 2 degrees of freedom. Using Table A.4, we can find the probability p. We then compare p to α – the significance level of the test – to determine whether we should reject or not reject H₀.

Most frequently we are interested in determining whether the slope β₁ is equal to 0, so that the null hypothesis is H₀: β₁ = 0. If the population slope is equal to 0, then

µ_y|x = β₀ + (0)x = β₀.

There is no linear relationship between x and y. In this case, the mean value of y is the same regardless of the value of x; it is always equal to β₀. For the head circumference and gestational age data, this would imply that the mean value of head circumference is the same for infants of all gestational ages.

It can be shown that a test of the null hypothesis H₀: β₁ = 0 is mathematically equivalent to the test of H₀: ρ = 0, where ρ is the correlation between head circumference and gestational age in the underlying population of low birth weight infants. In fact,

β̂₁ = r (s_y / s_x),

where s_x and s_y are the standard deviations of the x and y values, respectively [279]. Both null hypotheses claim that y does not change as x increases.

To conduct a two-sided test of the null hypothesis that the true slope relating head circumference to gestational age is equal to β₁₀ = 0 at the 0.05 level of significance, we calculate

t = (β̂₁ − β₁₀) / ŝe(β̂₁) = (0.7801 − 0) / 0.0631 = 12.36.

For a t distribution with 100 − 2 = 98 degrees of freedom, p < 0.001. Therefore, we reject the null hypothesis that the slope β₁ is equal to 0 in favor of the alternative hypothesis that it is not. In the underlying population of low birth weight infants, there is a statistically significant linear relationship between head circumference and gestational age. Because the slope is positive, head circumference increases as gestational age increases.

In addition to conducting a test of hypothesis for β₁, we can also calculate a confidence interval for the true population slope. For a t distribution with 98 degrees of freedom, approximately 95% of the observations fall between −1.98 and 1.98. Therefore,

(β̂₁ − 1.98 ŝe(β̂₁), β̂₁ + 1.98 ŝe(β̂₁))
is a 95% confidence interval for β₁. Since we previously found that ŝe(β̂₁) = 0.0631, the interval is

(0.7801 − 1.98(0.0631), 0.7801 + 1.98(0.0631))

or

(0.6564, 0.9038).
While 0.7801 is the point estimate for β₁ – our best guess at the value of β₁, based on the data in our sample – we are 95% confident that these limits cover the true population slope.

If we are interested in testing whether the population intercept is equal to a specified value β₀₀, we can use calculations that are analogous to those for the slope. We compute the test statistic

t = (β̂₀ − β₀₀) / ŝe(β̂₀)

and compare this value to the t distribution with n − 2 degrees of freedom. We can also construct a confidence interval for the true population intercept β₀, just as we calculated an interval for the slope β₁. However, if the observed data points tend to be far from the intercept – as they are for the head circumference and gestational age data, where the smallest value of gestational age is x = 23 weeks – there is very little practical value in making inference about the y-intercept. As we have already noted, a gestational age of 0 weeks does not make any sense. In fact, it is dangerous to extrapolate the fitted line beyond the range of the observed values x. The relationship between X and Y might be quite different outside this range.

Summary: Inference for Regression Coefficients

Intercept (β₀)
    Null hypothesis: H₀: β₀ = β₀₀
    Alternative hypothesis: H_A: β₀ ≠ β₀₀
    Test statistic: t = (β̂₀ − β₀₀) / ŝe(β̂₀), where ŝe(β̂₀) = s_y|x √[1/n + x̄²/Σᵢ₌₁ⁿ (xᵢ − x̄)²]
    Distribution of test statistic: t distribution with n − 2 degrees of freedom

Slope (β₁)
    Null hypothesis: H₀: β₁ = β₁₀
    Alternative hypothesis: H_A: β₁ ≠ β₁₀
    Test statistic: t = (β̂₁ − β₁₀) / ŝe(β̂₁), where ŝe(β̂₁) = s_y|x / √[Σᵢ₌₁ⁿ (xᵢ − x̄)²]
    Distribution of test statistic: t distribution with n − 2 degrees of freedom

In both cases, s_y|x = √[Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − 2)].
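The slope calculations in the summary box can be carried out by hand in a few lines. The sketch below – an illustration, not the text's own code – assumes the least_squares() helper from the earlier sketch and paired numeric vectors x and y.

    # t statistic and 95% confidence interval for the slope (testing H0: beta1 = 0)
    slope_inference <- function(x, y) {
      n    <- length(x)
      est  <- least_squares(x, y)                 # from the earlier sketch
      yhat <- est["intercept"] + est["slope"] * x
      syx  <- sqrt(sum((y - yhat)^2) / (n - 2))   # standard deviation from regression
      se1  <- syx / sqrt(sum((x - mean(x))^2))    # estimated se of the slope
      t    <- est["slope"] / se1
      tq   <- qt(0.975, df = n - 2)               # t quantile, e.g. ~1.98 for 98 df
      list(t = unname(t),
           ci = unname(est["slope"] + c(-1, 1) * tq * se1))
    }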
17.2.4 Inference for Predicted Values

In addition to making inference about the population slope and intercept, we might also be interested in using the least squares regression line to estimate the mean value of y corresponding to a particular
value of x, and to construct a 95% confidence interval for this mean. If we have a sample of 100 observations, for instance, the confidence interval will take the form

(ŷ − 1.98 ŝe(ŷ), ŷ + 1.98 ŝe(ŷ)),

where ŷ is the predicted mean of the normally distributed outcomes, and the standard error of ŷ is estimated by

ŝe(ŷ) = s_y|x √[1/n + (x − x̄)²/Σᵢ₌₁ⁿ (xᵢ − x̄)²].

Note the term (x − x̄)² in the formula for the standard error. This quantity takes the value 0 when x is equal to x̄, and gets larger as x moves farther and farther away. As a result, if x is near x̄, the confidence interval is relatively narrow. It grows wider as x moves away from x̄. Intuitively, we are more confident about the mean value of the response when we are closer to the mean value of the explanatory variable.

Return once again to the head circumference and gestational age data. When x = 29 weeks,

ŷ = β̂₀ + β̂₁x = 3.9143 + (0.7801)(29) = 26.54 cm.

The value 26.54 cm is a point estimate for the mean value of y when x is equal to 29. The estimated standard error of ŷ is ŝe(ŷ) = 0.159 cm. Therefore, a 95% confidence interval for the mean value of y is

(26.54 − 1.98(0.159), 26.54 + 1.98(0.159))

or

(26.23, 26.85).
The curved lines in Figure 17.8 represent the 95% confidence limits on the mean value of y for each observed value of x, from 23 weeks to 35 weeks. As we move further away from x = 29, which is very close to x̄, the confidence limits gradually become wider.

Sometimes, instead of predicting the mean value of y for a given value of x, we instead wish to predict an individual value of y for a new member of the population. The predicted individual value is denoted by ỹ, or y-tilde, and is identical to the predicted mean ŷ; in particular,

ỹ = β̂₀ + β̂₁x = ŷ.

The standard error of ỹ, however, is not the same as the standard error of ŷ. When computing ŝe(ŷ), we were interested only in the variability of the estimated mean of the y values. When considering an individual y, we have an extra source of variability to account for – the dispersion of the y values themselves around that mean. Recall that for a given value of x, the outcomes y are normally distributed with standard deviation σ_y|x. Therefore, we would expect the expression for the standard error of ỹ to incorporate an extra term involving σ_y|x – or its estimator s_y|x – which is not included in the expression for the standard error of ŷ. In fact,

ŝe(ỹ) = √[s²_y|x + ŝe(ŷ)²] = s_y|x √[1 + 1/n + (x − x̄)²/Σᵢ₌₁ⁿ (xᵢ − x̄)²].
FIGURE 17.8
The 95% confidence limits on the predicted mean of y for a given value of x

Once again, the term (x − x̄)² implies that the standard error is smallest when x is equal to x̄, and gets larger as x moves away from x̄. If we have a sample of 100 observations, a 95% prediction interval for an individual outcome y takes the form

(ỹ − 1.98 ŝe(ỹ), ỹ + 1.98 ŝe(ỹ)).

Because of the extra source of variability, the limits on a predicted individual value of y are wider than the limits on the predicted mean of y for the same value of x.

Suppose that a new child is selected from the underlying population of low birth weight infants. If this newborn has a gestational age of 29 weeks, then

ỹ = β̂₀ + β̂₁x = 3.9143 + (0.7801)(29) = 26.54 cm.

The standard error of ỹ is estimated as

ŝe(ỹ) = √[s²_y|x + ŝe(ŷ)²] = √[(1.5904)² + (0.159)²] = 1.598 cm.

Therefore, a 95% prediction interval for an individual new value of head circumference is

(26.54 − 1.98(1.598), 26.54 + 1.98(1.598))

or

(23.38, 29.70).
FIGURE 17.9
The 95% confidence limits on an individual predicted y for a given value of x

The curved lines in Figure 17.9 are the 95% limits on an individual value of y for each observed value of x from 23 to 35 weeks. Note that these bands are considerably farther from the least squares regression line than the 95% confidence limits around the mean value of y.
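In practice, these confidence and prediction intervals are rarely computed by hand. As a rough sketch – assuming the 100 observations sit in a data frame named lowbwt, with hypothetical column names headcirc and gestage – R's predict() function produces both kinds of intervals:

    fit <- lm(headcirc ~ gestage, data = lowbwt)
    new <- data.frame(gestage = 29)
    predict(fit, new, interval = "confidence")   # 95% CI for the mean response
    predict(fit, new, interval = "prediction")   # 95% PI for a new individual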
17.3 Evaluation of the Model

After generating a least squares regression line represented by

ŷ = β̂₀ + β̂₁x,

we might wonder how well this model actually fits the observed data. Is it a good model? There are several methods available to help evaluate the fit of a linear regression model.
17.3.1 Coefficient of Determination

One way to get a sense of the fit is to compute the coefficient of determination. The coefficient of determination is represented by R², and is the square of the Pearson correlation coefficient r between Y and X; consequently, r² = R². Since r can assume any value in the range −1 to 1, R² must lie between 0 and 1. If R² = 1, all of the data points in the sample fall directly on the least squares line. If R² = 0, there is no linear relationship between x and y.

The coefficient of determination can be interpreted as the proportion of the variability among the observed values of y that is explained by the linear regression of y on x. This interpretation derives
from the relationship between σ_y, the standard deviation of the outcomes of the response variable Y, and σ_y|x, the standard deviation of y for a specified value of the explanatory variable X, that was presented in Section 17.1:

σ²_y|x = (1 − ρ²) σ²_y.

Recall that ρ is the correlation between X and Y in the underlying population. If we replace σ_y and σ_y|x by their estimators – the sample standard deviations s_y and s_y|x – and ρ by the Pearson correlation coefficient r, we have

s²_y|x = (1 − r²) s²_y = (1 − R²) s²_y.

Solving this equation for R²,

R² = 1 − s²_y|x/s²_y = (s²_y − s²_y|x)/s²_y.

Since s²_y|x is the variation in the y values that still remains after accounting for the linear relationship between y and x, s²_y − s²_y|x must be the variation in y that is explained by this relationship. Thus, R² is the proportion of the total observed variability among the y values that is explained by the linear regression of y on x.

For the regression of head circumference on gestational age, the coefficient of determination can be shown to be R² = 0.6095. This value implies a moderately strong linear relationship between gestational age and head circumference; in particular, 60.95% of the variability among the observed values of head circumference is explained by the linear relationship between head circumference and gestational age. The remaining

100 − 60.95 = 39.05%

of the variation is not explained by this relationship.
Summary: Coefficient of Determination

Coefficient of determination
    The proportion of variability among the observed values of y that is explained by the linear regression of y on x:
    R² = r², 0 ≤ R² ≤ 1,
    where r is the Pearson correlation coefficient

R² = 1
    All data points fall on the least squares line; perfect linear relationship between x and y

R² = 0
    No linear relationship between x and y

17.3.2 Residual Plots
Another strategy for evaluating how well the least squares regression line fits the observed data in the sample used to construct it – focusing in particular on whether the assumptions of the linear model are met – is to generate a two-way scatter plot of the residuals against the fitted or predicted
values of the response variable. For example, one particular child in the sample of 100 low birth weight infants has gestational age 29 weeks and head circumference 27 cm. The child's predicted head circumference, given that xᵢ = 29 weeks, is

ŷᵢ = β̂₀ + β̂₁xᵢ = 3.9143 + (0.7801)(29) = 26.54 cm.

The residual associated with this observation is

eᵢ = yᵢ − ŷᵢ = 27.0 − 26.54 = 0.46.

Therefore, the point (26.54, 0.46) would be included on the graph. Figure 17.10 is a scatter plot of the points (ŷᵢ, eᵢ) for all 100 observations in the sample of low birth weight infants.

FIGURE 17.10
Residuals versus fitted values of head circumference

A plot of the residuals serves three purposes. First, it can help us to detect outlying observations in the sample. In Figure 17.10, one residual in particular is somewhat larger than the others; this point is associated with a child whose gestational age is 31 weeks and whose head circumference is 35 cm. We would predict the infant's head circumference to be only

ŷ = 3.9143 + 0.7801(31) = 28.10 cm.
The method of least squares can be very sensitive to such outliers in the data, especially if they correspond to relatively large or relatively small values of x. When it is believed that an outlier is the result of an error in measuring or recording a particular observation, removal of this point improves the fit of the regression line. However, care must be taken not to throw away unusual data points that are in fact valid; these observations might be the most interesting ones in the data set.

A plot of the residuals can also suggest a failure in the assumption of homoscedasticity. Recall that homoscedasticity means that the standard deviation of the outcomes y, or σ_y|x, is constant across all values of x. If the range of the magnitudes of the residuals either increases or decreases as ŷ gets larger – producing a fan-shaped scatter such as the one in Figure 17.11 – this implies
that σ_y|x does not take the same value for all values of x. In this case, simple linear regression is not the appropriate technique for modeling the relationship between x and y. No such pattern is evident in Figure 17.10, the residual plot for the sample of head circumference and gestational age measurements for 100 low birth weight infants. Thus, the assumption of homoscedasticity does not appear to have been violated. (While it is not an issue for this dataset, we should note that it can be difficult to evaluate this and other assumptions based on a residual plot if the number of data points is small.)

FIGURE 17.11
Violation of the assumption of homoscedasticity

Finally, if the residuals do not exhibit a random scatter but instead follow a distinct trend – eᵢ increases as ŷᵢ increases, for example – this would suggest that the true relationship between x and y might not be linear. In this situation, a transformation of x or y or both might be appropriate. When transforming a variable, we simply measure it on a different scale. In many ways, it is analogous to measuring a variable in different units; height can be measured in either inches or centimeters, for example. Often, a curvilinear relationship between two variables can be converted into a more straightforward linear one by an appropriate transformation. If this is possible, we can use simple linear regression to fit a model to the transformed data.
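A residual-versus-fitted diagnostic plot like Figure 17.10 takes only a couple of lines in R; the sketch below assumes the fitted model fit from the earlier sketches.

    # Residuals against fitted values; look for outliers, fan shapes, or trends
    plot(fitted(fit), resid(fit),
         xlab = "Fitted value", ylab = "Residual")
    abline(h = 0, lty = 2)   # horizontal reference line at zero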
17.3.3 Transformations

Consider Figure 17.12. This graph is a two-way scatter plot of crude birth rate per 1000 population versus gross domestic product (gdp) per capita for 241 countries around the world [280]. The gdp is expressed in United States dollars. Note that birth rate decreases as gdp increases. The relationship, however, is not a linear one. Instead, birth rate drops off rapidly at first; when the gdp per capita reaches approximately $15,000, it begins to level off. Consequently, if we wish to describe the relationship between birth rate and gdp, we cannot use simple linear regression without applying some type of transformation first.
FIGURE 17.12
Birth rate per 1000 population versus gdp per capita for 241 countries, 2015

When the relationship between x and y is not linear, we begin by looking at transformations of the form xᵖ or yᵖ, where

p = ..., −3, −2, −1, −1/2, ln, 1/2, 1, 2, 3, ....

Note that "ln" refers to the natural logarithm of x or y rather than an exponent. Thus, possible transformations might be ln(y), x^(1/2) = √x, or x². The circle of powers – or the ladder of powers, as it is sometimes called – provides a general guideline for choosing a transformation. The strategy is illustrated in Figure 17.13. If the plotted data resemble the pattern in Quadrant I, for instance, an appropriate transformation would be either "up" on x or "up" on y. In other words, either x or y would be raised to a power greater than p = 1; the more curvature in the data, the higher the value of p needed to achieve linearity. We might try replacing x by x², for example. If a two-way scatter plot suggests that the relationship between y and x² is linear, we would fit a model of the form

ŷ = β̂₀ + β̂₁x²

instead of the usual

ŷ = β̂₀ + β̂₁x.
ISTUDY
418
Principles of Biostatistics
FIGURE 17.13
The circle of powers
FIGURE 17.14
Birth rate per 1000 population versus the natural logarithm of gdp per capita
relationship between birth rate and gdp itself. Therefore, we could fit a simple linear regression model of the form

ŷ = β̂₀ + β̂₁ ln(x).

Although the units are unfamiliar – gdp is measured in ln(US dollars) instead of dollars – this transformation allows us to apply a method that would otherwise be inappropriate.

FIGURE 17.15
Length versus gestational age for a sample of 100 low birth weight infants
17.4 Further Applications
Suppose that we are now interested in the relationship between a child's length and his or her gestational age for the population of low birth weight infants, again defined as those weighing less than 1500 grams. We begin our analysis by constructing a two-way scatter plot of length versus gestational age for the sample of 100 low birth weight infants born in Boston, Massachusetts. The plot is shown in Figure 17.15. The points on the graph exhibit a great deal of scatter; however, it is clear that length increases as gestational age increases. Furthermore, the relationship appears to be a linear one. To estimate the true population regression line

µ_y|x = β₀ + β₁x,

where µ_y|x is the mean length for low birth weight infants of the specified gestational age x, β₀ is the y-intercept of the line, and β₁ is its slope, we apply the method of least squares to fit the model

ŷ = β̂₀ + β̂₁x.
TABLE 17.1
Stata output for the simple linear regression of length on gestational age

          Source |       SS       df       MS           Number of obs =    100
    -------------+------------------------------        F(1, 98)      =  82.13
           Model |  575.73916     1  575.73916          Prob > F      = 0.0000
        Residual |  687.02084    98  7.01041674         R-squared     = 0.4559
    -------------+------------------------------        Adj R-squared = 0.4504
           Total |    1262.76    99  12.7551515         Root MSE      = 2.6477

    ---------------------------------------------------------------------
      length |   Coef.   Std. Err.     t    P>|t|   [95% Conf. Interval]
    ---------+-----------------------------------------------------------
     gestage | .9516035   .1050062   9.06   0.000   .7432221    1.159985
       _cons | 9.328174   3.045163   3.06   0.003   3.285148     15.3712
    ---------------------------------------------------------------------
Rather than use calculus to minimize the sum of the squared residuals

Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)²,

we will use the computer to do the calculations for us. Table 17.1 shows the relevant output from Stata, and Table 17.2 the output from R. The top portion of Table 17.1 displays an analysis of variance table on the left and additional information about the model on the right; we will return to some of this information later on. The bottom portion contains the estimated coefficients for the least squares regression line. The column on the far left lists the names of the response (length) and explanatory (gestage) variables; the label "_cons" refers to the y-intercept or constant term in the equation. The point estimates for the regression coefficients, β̂₀ and β̂₁, appear in the second column. Rounding these values to four decimal places (which R does automatically – see the central portion of Table 17.2 labeled Coefficients), the fitted least squares regression line for this sample of 100 low birth weight infants is

ŷ = 9.3282 + 0.9516x.

The y-intercept of 9.3282 cm is the estimated mean value of length that corresponds to an x value of 0; in this example, however, a gestational age of 0 weeks does not make sense and the intercept cannot be interpreted. The slope of the line indicates that for each one week increase in gestational age, an infant's length increases by 0.9516 cm on average.

Both the Stata and R output display the estimated standard errors of β̂₀ and β̂₁. Suppose that we wish to test the null hypothesis that the population slope is equal to 0, or H₀: β₁ = 0. The appropriate test statistic is

t = (β̂₁ − 0)/ŝe(β̂₁) = (0.9516 − 0)/0.1050 = 9.062.
TABLE 17.2
R output displaying the simple linear regression of length on gestational age

    Residuals:
         Min       1Q   Median       3Q      Max
    -13.1183  -1.1183   0.2931   1.4100   4.1721

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   9.3282     3.0452   3.063  0.00283 **
    gestage       0.9516     0.1050   9.062 1.31e-14 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 2.648 on 98 degrees of freedom
    Multiple R-squared: 0.4559, Adjusted R-squared: 0.4504
    F-statistic: 82.13 on 1 and 98 DF, p-value: 1.311e-14
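A sketch of the commands that would produce output like Tables 17.1 and 17.2, assuming the 100 observations are in a data frame named lowbwt with the variables length and gestage shown in the output (the data frame name is an assumption, not from the text):

    # Stata equivalent:  regress length gestage
    fit_len <- lm(length ~ gestage, data = lowbwt)
    summary(fit_len)    # coefficients, R-squared, residual standard error
    confint(fit_len)    # 95% confidence intervals for beta0 and beta1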
This test statistic is also provided in the output, along with the p-value that corresponds to a two-sided test. Since p is less than 0.001, we reject the null hypothesis that β₁ is equal to 0. (Note that Stata reports the p-value as 0.000, but it is not really equal to 0. The p-value is just too small to be displayed in the Stata output, where only three decimal places are shown. The R output tells us that the p-value is 0.0000000000000131.) Based on this sample of low birth weight infants, length increases as gestational age increases.

Since the data set contains measurements for 100 infants, and we know that for a t distribution with 100 − 2 = 98 degrees of freedom, 95% of the observations fall between −1.98 and 1.98, a 95% confidence interval for the population slope takes the form

(β̂₁ − 1.98 ŝe(β̂₁), β̂₁ + 1.98 ŝe(β̂₁)).

If we substitute the values of β̂₁ and ŝe(β̂₁) from the table, the 95% confidence interval is

(0.9516 − 1.98(0.1050), 0.9516 + 1.98(0.1050))

or

(0.7432, 1.1600).
These confidence limits are provided in the Stata output in Table 17.1.

We might also be interested in using the least squares regression line to estimate the mean value of length corresponding to a specified value of gestational age, and to construct a 95% confidence interval for this mean. If we have a sample of 100 observations, the confidence interval takes the form

(ŷ − 1.98 ŝe(ŷ), ŷ + 1.98 ŝe(ŷ)).

For the 100 low birth weight infants in the sample – each with a particular gestational age x – we can use the computer to obtain the corresponding predicted length ŷ, as well as its estimated standard error ŝe(ŷ). The data for the first 10 infants in the sample are shown in Table 17.3. When x is equal to 29 weeks, we observe that

ŷ = 9.3282 + 0.9516(29) = 36.93 cm.
TABLE 17.3
Predicted values of length (cm) and estimated standard errors for the first 10 infants in the sample

Gestational Age (wk)   Predicted Length ŷ   Standard Error of ŷ
        29                  36.925                0.2650
        31                  38.828                0.3452
        33                  40.731                0.5063
        31                  38.828                0.3452
        30                  37.876                0.2893
        25                  33.118                0.4868
        27                  35.021                0.3309
        29                  36.925                0.2650
        28                  35.973                0.2808
        29                  36.925                0.2650
The estimated standard error of ŷ is 0.265 cm. Therefore, a 95% confidence interval for the mean value of length is

(36.93 − 1.98(0.265), 36.93 + 1.98(0.265))

or

(36.41, 37.45).
Analogous confidence intervals can be calculated for each observed value of x from 23 weeks to 35 weeks. The confidence intervals become wider as we move further away from x̄, the mean of the x values.

The predicted individual value of y for a new member of the population is identical to the predicted mean of y; for an infant whose gestational age is 29 weeks,

ỹ = 9.3282 + 0.9516(29) = 36.93 cm.
Its standard error, however, is not the same. In addition to the variability of the estimated mean value of y, it also incorporates the variation of the y values around that mean, the standard deviation from regression s_y|x. The standard deviation from regression is shown in the top portion of the Stata output in Table 17.1, beside the label Root MSE. In R, it is called the residual standard error. Note that s_y|x = 2.6477, and the estimated standard error of ỹ is

ŝe(ỹ) = √(s²_y|x + ŝe(ŷ)²) = √((2.6477)² + (0.265)²) = 2.661.

Therefore, a 95% prediction interval for the individual new value of length is

(36.93 − 1.98(2.661), 36.93 + 1.98(2.661))

or

(31.66, 42.20).
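Both intervals can be obtained in R with predict(), reusing the model object from the earlier sketch; again, the exact t quantile is used in place of 1.98.

new_infant <- data.frame(gestage = 29)

predict(model, new_infant, interval = "confidence")  # CI for the mean length
predict(model, new_infant, interval = "prediction")  # PI for a new individual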
TABLE 17.4
Residuals (measured in cm) for the first 10 infants in the sample

Length (cm)   Predicted Length ŷ   Residual
    41             36.925            4.075
    40             38.828            1.172
    38             40.731           −2.731
    38             38.828           −0.828
    38             37.876            0.124
    32             33.118           −1.118
    33             35.021           −2.021
    38             36.925            1.075
    30             35.973           −5.973
    34             36.925           −2.925
Because of the extra source of variability, this interval is quite a bit wider than the 95% confidence interval for the predicted mean value of y.

After generating the least squares regression line, we might wish to have some idea about how well this model fits the observed data. One way to evaluate the fit is to examine the coefficient of determination. In Table 17.1, the value of R² is displayed in the top portion of the output on the right-hand side. In Table 17.2, it is on the bottom left. For the simple linear regression of length on gestational age, the coefficient of determination is

R² = 0.4559.

This means that approximately 45.6% of the variability among the observed values of length is explained by the linear relationship between length and gestational age. The remaining 54.4% is not explained by this relationship. The line of output directly below R², labeled adjusted R² (Adj R-squared), will be discussed in the next chapter.

A second technique for evaluating the fit of the least squares regression line to the sample data involves looking at a two-way scatter plot of the residuals versus the predicted values of length. The residuals are obtained by subtracting the fitted values ŷᵢ from the actual observations yᵢ; the calculations may be performed using a computer package. Table 17.4 shows the observed and predicted values of length, along with the differences between them, for the first 10 infants in the sample. Figure 17.16 is a scatter plot of the points (ŷᵢ, eᵢ) for all 100 low birth weight infants. Looking at the residual plot, we see that there is one point with a particularly low (negative) residual that appears to be an outlier. We might try removing this point, fitting a new line, and then comparing the two models to see how much of an effect the point has on the estimated regression coefficients. However, there is no evidence that the assumption of homoscedasticity has been violated, or that a transformation of either the response or the explanatory variable is necessary.
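A residual plot like Figure 17.16 can be drawn in R with a few lines, again using the model object fitted earlier:

plot(fitted(model), residuals(model),
     xlab = "Fitted value of length (cm)",
     ylab = "Residual (cm)")
abline(h = 0, lty = 2)  # horizontal reference line at zero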
FIGURE 17.16 Residuals versus fitted values of length for a sample of 100 low birth weight infants
17.5 Review Exercises
1. What is the main distinction between correlation analysis and simple linear regression?

2. What assumptions do you make when using the method of least squares to estimate a population regression line?

3. Explain the least squares criterion for obtaining estimates of regression coefficients for a linear model.

4. Why is it dangerous to extrapolate an estimated linear regression line outside the range of the observed data values?

5. Given a specified value of the explanatory variable in a simple linear regression model, how does a confidence interval constructed for the mean of the response differ from a prediction interval constructed for a new, individual value of the response? Explain.

6. Why might you need to consider transforming either the response or the explanatory variable when fitting a simple linear regression model? How is the circle of powers used in this situation?

7. For a given sample of data, how can a two-way scatter plot of the residuals versus the fitted values of the response be used to evaluate the fit of a least squares regression line?

8. Figure 17.17 displays a two-way scatter plot of cbvr – the response of cerebral blood volume in the brain to changes in carbon dioxide tension in the arteries – versus gestational age for a sample of 17 newborn infants [281]. The graph also shows the fitted least squares regression line for these data. The investigators who constructed the model determined that the slope of the line β₁ is significantly greater than 0.

(a) Suppose that you are interested in only those infants who are born prematurely. If you were to eliminate the four data points corresponding to newborns whose gestational age is 38 weeks or greater, would you still believe that there is a significant increase in cbvr as gestational age increases?

(b) In an earlier study, the same investigators found no obvious relationship between cbvr and gestational age in newborn infants; gestational age was not useful in predicting cbvr. Would this information cause you to modify your answer above?

9. Oxygen uptake at maximum level of exercise (vo2 max) is considered to be the best available index for quantifying an individual's exercise capacity. Equations for predicting vo2 max based on a subject's age were derived separately for males and females using simple linear regression analysis [282]. If x represents age, the equations are:

Males:    ŷ = 4.2 − 0.032x
Females:  ŷ = 2.6 − 0.014x

(a) Based on these equations, does vo2 max increase or decrease as age increases? Explain.

(b) What is the predicted vo2 max for a 30-year-old male? For a 50-year-old male?
FIGURE 17.17
Response of cerebral blood volume (cbvr) versus gestational age for a sample of 17 newborns

(c) What is the predicted vo2 max for a 30-year-old female? For a 50-year-old female?

10. Measurements of length and weight for a sample of 20 low birth weight infants are contained in the dataset twenty [81]. The length measurements are saved under the variable name length, and the corresponding birth weights under weight.

(a) Construct a two-way scatter plot of birth weight versus length for the 20 infants in the sample. Without doing any calculations, sketch your best guess for the least squares regression line directly on the scatter plot.

(b) Now compute the true least squares regression line. Draw this line on the scatter plot. Does the actual least squares line concur with your guess? Based on the two-way scatter plot, it is clear that one point lies outside the range of the remainder of the data. To illustrate the effect that the outlier has on the model, remove this point from the data set.

(c) Compute the new least squares regression line based on the sample of size 19, and sketch this line on the original scatter plot. How does the least squares line change? In particular, comment on the values of the slope and the intercept.

(d) Compare the coefficients of determination (R²) and the standard deviations from regression (s_y|x) for the two least squares regression lines. Explain how these values changed when you removed the outlier from the original data set. Why did they change?

11. In the 11 years before the passage of the Federal Coal Mine Health and Safety Act of 1969, the fatality rates for underground miners varied little. After the implementation of that act, however, fatality rates decreased steadily until 1979. The fatality rates for the years 1970 through 1981 are provided below [283]; for computational purposes, calendar
years have been converted to a scale beginning at 1. This information is contained in the dataset miner. Values of the response, fatality rate, are saved under the name rate, and values of the explanatory variable calendar year under the name year.

Calendar Year   Year   Fatality Rate per 1000 Employees
    1970          1              2.419
    1971          2              1.732
    1972          3              1.361
    1973          4              1.108
    1974          5              0.996
    1975          6              0.952
    1976          7              0.904
    1977          8              0.792
    1978          9              0.701
    1979         10              0.890
    1980         11              0.799
    1981         12              1.084
(a) Construct a two-way scatter plot of fatality rate versus year. What does this plot suggest about the relationship between these two variables?

(b) To model the trend in fatality rates, fit the least squares regression line

ŷ = β̂₀ + β̂₁x

where x represents year. Using both the coefficient of determination R² and a plot of the residuals versus the fitted values of fatality rate, comment on the fit of the model to the observed data.

(c) Now transform the explanatory variable x to ln(x). Create a scatter plot of fatality rate versus the natural logarithm of year.

(d) Fit the least squares model

ŷ = β̂₀ + β̂₁ ln(x).

Use the coefficient of determination and a plot of the residuals versus the fitted values of fatality rate to compare the fit of this model to the model constructed in part (b).

(e) Transform x to 1/x. Construct a two-way scatter plot of fatality rate versus the reciprocal of year.

(f) Fit the least squares model

ŷ = β̂₀ + β̂₁(1/x).

Using the coefficient of determination and a plot of the residuals, comment on the fit of this model and compare it to the previous ones.

(g) Which of the three models appears to fit the data best? Defend your selection.

12. A study was conducted to examine relationships among sociodemographic, anthropometric and behavioral factors and lipid levels in a rural African population. The dataset lipid
contains information for a sample of 1859 adults living in northern Ghana [284]. Measurements of total cholesterol are saved under the variable name total_cholesterol, subcutaneous abdominal fat measurements under the name sc_fat, and body mass index under bmi.

(a) Construct a two-way scatter plot of total cholesterol versus subcutaneous abdominal fat. Does the graph suggest anything about the nature of the relationship between these variables?

(b) Using total cholesterol as the response and subcutaneous fat as the explanatory variable, compute the least squares regression line. Interpret the estimated slope and y-intercept of the line. What do they mean in words?

(c) At the 0.05 level of significance, test the null hypothesis that the true population slope β₁ is equal to 0. What is the probability distribution of the test statistic? What is the p-value of the test? What do you conclude?

(d) What is the estimated mean total cholesterol for adults in rural northern Ghana with subcutaneous abdominal fat 0.75 cm? What is the estimated mean for adults with subcutaneous abdominal fat 1.75 cm?

(e) What is the coefficient of determination for this model? Interpret this value.

(f) Construct a plot of the residuals versus the fitted values of total cholesterol. What does the residual plot tell you about the fit of the model to the observed data?

(g) Now construct a two-way scatter plot of total cholesterol versus body mass index. What does the graph suggest about the relationship between these variables?

(h) Using total cholesterol as the response and body mass index as the explanatory variable, compute the least squares regression line. Interpret the estimated slope and y-intercept of the line in words.

(i) At the 0.05 level of significance, test the null hypothesis that the true population slope β₁ is equal to 0. What do you conclude?

(j) What is the coefficient of determination for this model? Compared to subcutaneous abdominal fat, does the linear relationship with body mass index explain more or less of the variability in total cholesterol?

(k) Construct a plot of the residuals versus the fitted values of total cholesterol for the model with explanatory variable body mass index. Is there any evidence that model assumptions were violated?
13. The Gapminder Foundation is an organization dedicated to educating the public by using data to dispel common myths about the so-called developing world. The organization attempts to show how actual trends in health and economics contradict the narratives that emanate from sensationalist media coverage of catastrophes, tragedies, and other unfortunate events. The dataset gapminder contains health outcomes for 185 countries for calendar year 2000 [285]. Life expectancy in years for each country is saved under the variable name life_expectancy, and fertility rate, defined as the total number of children that would be born to each woman residing in the country if she were to survive to the end of her child-bearing years, under the variable name fertility.

(a) Construct a two-way scatter plot of life expectancy versus fertility. What does the graph suggest about the relationship between these variables?

(b) Using life expectancy as the response and fertility as the explanatory variable, compute the least squares regression line. Interpret the estimated slope and y-intercept of the line. What do they mean in words?
(c) At the 0.05 level of significance, test the null hypothesis that the true slope β₁ is equal to 0. What do you conclude?

(d) What is the estimated mean life expectancy for a country where the average number of children born per woman is 6?

(e) What percentage of the variability in life expectancy is explained by its linear relationship with fertility?

14. The dataset lowbwt contains information for the sample of 100 low birth weight infants born in Boston, Massachusetts [81]. Measurements of systolic blood pressure are saved under the variable name sbp, and values of gestational age under the name gestage.

(a) Construct a two-way scatter plot of systolic blood pressure versus gestational age. Does the graph suggest anything about the nature of the relationship between these variables?

(b) Using systolic blood pressure as the response and gestational age as the explanatory variable, compute the least squares regression line. Interpret the estimated slope and y-intercept of the line. What do they mean in words?

(c) At the 0.05 level of significance, test the null hypothesis that the true population slope β₁ is equal to 0. What do you conclude?

(d) What is the estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks?

(e) Construct a 95% confidence interval for the true mean value of systolic blood pressure when x = 31 weeks.

(f) Suppose that you randomly select a new child from the population of low birth weight infants with gestational age 31 weeks. What is the predicted systolic blood pressure for this child?

(g) Construct a 95% prediction interval for this new value of systolic blood pressure.

(h) What is the coefficient of determination for this model? What does it mean?

(i) How is R² related to the Pearson correlation coefficient r?

(j) Construct a plot of the residuals versus the fitted values of systolic blood pressure. What does the residual plot tell you about the fit of the model to the observed data?
18 Multiple Linear Regression
CONTENTS
18.1 The Model
     18.1.1 Least Squares Regression Equation
     18.1.2 Inference for Regression Coefficients
     18.1.3 Indicator Variables
     18.1.4 Interaction Terms
18.2 Model Selection
18.3 Evaluation of the Model
18.4 Further Applications
18.5 Review Exercises
In the preceding chapter, we see how simple linear regression can be used to explore the nature of the relationship between two continuous random variables. In particular, it allows us to predict the value of a response or outcome that corresponds to a given value of an explanatory variable. If knowing the value of a single explanatory variable improves our ability to predict the response, however, we might suspect that additional explanatory variables could be used to our advantage. To investigate the more complicated relationship among a number of different variables, we use a natural extension of simple linear regression known as multiple linear regression, or multivariable linear regression.
18.1 The Model
Using multiple linear regression analysis, we estimate the population equation

μ_y|x₁,x₂,...,x_q = β₀ + β₁x₁ + β₂x₂ + · · · + β_q x_q

where x₁, x₂, . . . and x_q are the outcomes of q distinct explanatory variables X₁, X₂, . . . and X_q, and μ_y|x₁,x₂,...,x_q is the mean value of y when the explanatory variables assume these particular values. The parameters β₀, β₁, β₂, . . . and β_q are constants, the coefficients of the equation. Mathematically, the y-intercept β₀ is the mean value of the response y when all explanatory variables take the value 0, or μ_y|0,0,...,0. The slope βᵢ is the change in the mean value of y that corresponds to a one unit increase in xᵢ, given that all other explanatory variables in the model remain constant. To accommodate the natural variation in measures of the response – we do not expect all subjects with the same values of the explanatory variables to have exactly the same outcome – we actually fit a model of the form

y = β₀ + β₁x₁ + β₂x₂ + · · · + β_q x_q + ε.

The coefficients of the population regression equation are estimated using a random sample of observations represented as (x₁ᵢ, x₂ᵢ, . . . , x_qᵢ, yᵢ). However, just as we have to make a number of
assumptions for the model with a single explanatory variable, we must make an analogous set of assumptions for the more complex multiple regression model. These assumptions are as follows:

• For specified values x₁, x₂, . . . , x_q, all of which are considered to be measured without error, the distribution of the y values is normal with mean μ_y|x₁,x₂,...,x_q and standard deviation σ_y|x₁,x₂,...,x_q.

• The relationship between μ_y|x₁,x₂,...,x_q and x₁, x₂, . . . and x_q is represented by the equation μ_y|x₁,x₂,...,x_q = β₀ + β₁x₁ + β₂x₂ + · · · + β_q x_q.

• For any set of values x₁, x₂, . . . and x_q, σ_y|x₁,x₂,...,x_q is constant. As in simple linear regression, this property is referred to as homoscedasticity.

• The outcomes y are independent.
18.1.1 Least Squares Regression Equation
To estimate the population regression equation, we use the method of least squares to fit the model

ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + · · · + β̂_q x_q.

This technique requires that we minimize the sum of the squares of the residuals, in this case

Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁x₁ᵢ − β̂₂x₂ᵢ − · · · − β̂_q x_qᵢ)².
Recall that yᵢ is the observed outcome of the response Y for particular values x₁ᵢ, x₂ᵢ, . . . and x_qᵢ, while ŷᵢ is the corresponding value from the fitted equation. When a single explanatory variable was involved, the fitted model was simply a straight line. With two explanatory variables, the model represents a plane in three-dimensional space; with three or more variables, it is a hyperplane in higher dimensional space. Although the calculations are more complicated than they were for models with a single explanatory variable, they do not present a problem as long as a computer is available.

In Chapter 17, we find a significant linear relationship between head circumference and gestational age for low birth weight infants. The fitted least squares regression line is

ŷ = 3.9143 + 0.7801x.
We might wonder whether head circumference also depends on the birth weight of an infant. Figure 18.1 is a two-way scatter plot of head circumference versus birth weight for a sample of 100 low birth weight infants born in Boston, Massachusetts [81]. The graph suggests that head circumference increases as birth weight increases. Given that we have already accounted for gestational age, does birth weight further improve our ability to predict the head circumference of a child? Suppose that we let x₁ represent gestational age and x₂ designate birth weight. The fitted least squares regression equation is

ŷ = 8.3080 + 0.4487x₁ + 0.0047x₂.
The intercept of 8.3080 cm is, in theory, the mean value of head circumference for low birth weight infants with gestational age 0 weeks and birth weight 0 grams. In this example, neither an age of 0 nor a weight of 0 makes sense, so we will not attempt to interpret the y-intercept. The estimated coefficient of gestational age is not what it was when age was the only explanatory variable in the model; its value has decreased from 0.7801 to 0.4487 cm/week. This implies that, given that a child’s birth weight remains constant, each one week increase in gestational age corresponds to a 0.4487 cm
increase in head circumference, on average. Equivalently, given two infants with the same birth weight but such that the gestational age of the first child is one week greater than the gestational age of the second, the first child would have a head circumference approximately 0.4487 cm larger. Similarly, the coefficient of birth weight indicates that if a child's gestational age does not change, each one gram increase in birth weight results in a 0.0047 cm increase in head circumference, on average.

Note that in the fitted regression equation, the estimated coefficient of gestational age is much larger than the coefficient of birth weight; in fact, it is approximately 100 times larger. Does this mean that gestational age is more important than birth weight for explaining an infant's head circumference? It does not. We must keep the units of measurement in mind. The estimated coefficient β̂₁ is telling us the mean change in head circumference for each one week increase in gestational age, while β̂₂ is the mean change in head circumference for each one gram increase in birth weight (in each case, assuming that the other explanatory variable remains constant). Since the units of measurement for the coefficients are different – one is measured in cm/week and the other in cm/gm – it is meaningless to compare the two coefficients directly. In fact, if we had recorded birth weight in kilograms rather than grams, the coefficient of birth weight would increase by a factor of 1000.

FIGURE 18.1
Head circumference versus birth weight for a sample of 100 low birth weight infants
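As a rough R sketch, this two-predictor model can be fit as shown below. The variable names headcirc and birthwt are assumptions for illustration; only length and gestage are named explicitly in the text for the lowbwt dataset.

# Multiple linear regression of head circumference on gestational age
# and birth weight (headcirc and birthwt are hypothetical variable names)
model2 <- lm(headcirc ~ gestage + birthwt, data = lowbwt)
coef(model2)  # estimated intercept and slopes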
Summary: Multiple Linear Regression Model

Population regression line:
μ_y|x₁,x₂,...,x_q = β₀ + β₁x₁ + β₂x₂ + · · · + β_q x_q,
where μ_y|x₁,x₂,...,x_q is the mean value of outcome y for x₁, x₂, . . . , x_q, and β₀, β₁, β₂, . . . , β_q are the coefficients of the model

Fitted regression line:
ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + · · · + β̂_q x_q
18.1.2 Inference for Regression Coefficients

Just as when applying simple linear regression analysis, we would like to be able to use the least squares regression model

ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + · · · + β̂_q x_q

to make inference about the population regression equation

μ_y|x₁,x₂,...,x_q = β₀ + β₁x₁ + β₂x₂ + · · · + β_q x_q.
The regression coefficients β̂₀ through β̂_q are estimated using a sample of observations drawn from the underlying population; their values would change if a different sample were selected. Therefore, we need the standard errors of these estimators to be able to make inference about the true population parameters. Tests of hypotheses for the population intercept and slopes can be carried out just as they were for the model containing a single explanatory variable, with two differences. First, when testing the null hypothesis

H₀: βᵢ = βᵢ₀

against the alternative

Hₐ: βᵢ ≠ βᵢ₀,

we assume that the values of all other explanatory variables xⱼ, j ≠ i, remain constant. Second, if the null hypothesis is true, the test statistic

t = (β̂ᵢ − βᵢ₀)/ŝe(β̂ᵢ)

does not follow a t distribution with n − 2 degrees of freedom. Instead, it has a t distribution with n − q − 1 degrees of freedom, where q is the number of explanatory variables in the model. For the model containing both gestational age and birth weight, q is equal to 2 and the appropriate t distribution has 100 − 2 − 1 = 97 degrees of freedom. This t distribution is used to find p, the probability of observing an estimated slope as extreme as or more extreme than β̂ᵢ, given that the true population slope is βᵢ₀. For the 100 low birth weight infants born in Boston, it can be shown that

ŝe(β̂₀) = 1.5789,
ŝe(β̂₁) = 0.0672,

and

ŝe(β̂₂) = 0.00063.
To conduct a two-sided test of the null hypothesis that β₁ – the true slope relating head circumference to gestational age, assuming that the value of birth weight remains constant – is equal to 0, we calculate the test statistic

t = (β̂₁ − β₁₀)/ŝe(β̂₁) = (0.4487 − 0)/0.0672 = 6.68.

For a t distribution with 97 degrees of freedom, p < 0.001. Therefore, we reject the null hypothesis at the 0.05 level of significance and conclude that β₁ is greater than 0. Similarly, to test the null hypothesis

H₀: β₂ = 0
against the alternative
Hₐ: β₂ ≠ 0,

assuming that gestational age remains constant, we calculate

t = (β̂₂ − β₂₀)/ŝe(β̂₂) = (0.0047 − 0)/0.00063 = 7.47.
Once again p < 0.001, and we conclude that β₂ is significantly greater than 0. Therefore, head circumference increases as either gestational age or birth weight increases. We must bear in mind, however, that multiple tests of hypothesis based on the same set of data are generally not independent. If each individual test is conducted at the α level of significance, the overall probability of making a type I error – or rejecting a null hypothesis that is true – is in fact larger than α.

In addition to conducting tests of hypotheses, we can also calculate confidence intervals for the population regression coefficients. Furthermore, we can construct a confidence interval for the predicted mean value of Y and a prediction interval for a predicted individual y corresponding to a given set of values for the explanatory variables. In all cases, the procedures are analogous to those used when a single explanatory variable was involved.

Summary: Inference for Regression Coefficients

Coefficient:                      βᵢ
Null hypothesis:                  H₀: βᵢ = βᵢ₀
Alternative hypothesis:           Hₐ: βᵢ ≠ βᵢ₀
Test statistic:                   t = (β̂ᵢ − βᵢ₀)/ŝe(β̂ᵢ)
Distribution of test statistic:   t distribution with n − q − 1 degrees of freedom
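In R, these test statistics and p-values are reported by summary(); the fragment below, a sketch based on the hypothetical model2 object above, also recomputes the test for gestational age by hand using a t distribution with n − q − 1 = 97 degrees of freedom.

summary(model2)$coefficients   # estimates, standard errors, t values, p-values

# Recomputing the test for the coefficient of gestational age by hand
t_stat <- (0.4487 - 0) / 0.0672
p_val  <- 2 * pt(-abs(t_stat), df = 100 - 2 - 1)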
18.1.3 Indicator Variables

All the explanatory variables we have considered up to this point have been measured on a continuous scale. However, regression analysis can be generalized to incorporate discrete or nominal explanatory variables as well. For example, we might wonder whether an expectant mother's diagnosis of preeclampsia during pregnancy – a condition characterized by high blood pressure and other potentially serious complications – affects the head circumference of her child. The diagnosis of preeclampsia is a dichotomous random variable; a woman either had it or she did not. We would like to be able to quantify the effect of preeclampsia on head circumference by comparing infants whose mothers suffered from this condition to infants whose mothers did not. Since the explanatory variables in a regression analysis must assume numerical values, we designate the presence of preeclampsia during pregnancy by 1 and its absence by 0. These numbers do not represent any actual measurements; they simply identify the categories of the dichotomous random variable. Because its values do not have any quantitative meaning, an explanatory variable of this sort is called an indicator variable.
Suppose we add the indicator variable preeclampsia to the regression equation that already contains gestational age. For the sake of simplicity, we ignore birth weight for the moment. The fitted least squares regression model is

ŷ = 1.4956 + 0.8740x₁ − 1.4123x₃,

where x₁ represents gestational age and x₃ represents preeclampsia. The coefficient of preeclampsia is negative, indicating that mean head circumference decreases as the value of preeclampsia increases from 0 to 1. A test of the null hypothesis

H₀: β₃ = 0

against the alternative

Hₐ: β₃ ≠ 0,

assuming that gestational age does not change, results in a test statistic of t = −3.48 and p = 0.001. Therefore, we reject the null hypothesis at the 0.05 level of significance and conclude that β₃ is less than 0. Given two infants with identical gestational ages, head circumference would be smaller on average for the child whose mother experienced preeclampsia during pregnancy than for the child whose mother did not.

In order to better understand a regression model containing one continuous explanatory variable and one dichotomous explanatory variable, we can think about the least squares regression equation fitted to the sample of 100 low birth weight infants as two different models, corresponding to the two possible values of the dichotomous random variable preeclampsia. When x₃ = 1, for instance, indicating that a woman was diagnosed with preeclampsia during pregnancy,

ŷ = 1.4956 + 0.8740x₁ − 1.4123(1) = 0.0833 + 0.8740x₁.

When x₃ = 0,

ŷ = 1.4956 + 0.8740x₁ − 1.4123(0) = 1.4956 + 0.8740x₁.
The two lines are plotted in Figure 18.2. Note that the equations for the infants whose mothers were diagnosed with preeclampsia and those whose mothers were not have identical slopes. In either group, a one week increase in gestational age is associated with a 0.8740 cm increase in head circumference, on average. This is the consequence of fitting a single regression model to the two different groups of infants. Since one line lies entirely above the other – as determined by the different y-intercepts – the equations also suggest that across all values of gestational age, children whose mothers were not diagnosed with preeclampsia have larger head circumference measurements than children whose mothers were diagnosed with preeclampsia.
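A sketch of this model in R, with preeclamp as the 0/1 indicator (the variable name used in the output tables later in this chapter) and headcirc again a hypothetical name for the response:

model3 <- lm(headcirc ~ gestage + preeclamp, data = lowbwt)
summary(model3)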
18.1.4 Interaction Terms
In some situations, one explanatory variable has a different relationship with the response depending on the value of a second explanatory variable. As an example, a one week increase in gestational age might have a different effect on a child's head circumference depending on whether the infant's mother had experienced preeclampsia during pregnancy or not. To model a relationship of this kind, we create an interaction term. In linear regression, an interaction term is generated by multiplying together the outcomes of two explanatory variables xᵢ and xⱼ to create a third variable xᵢxⱼ, which is then included in the model.

Suppose we wish to add an interaction between gestational age and preeclampsia to the regression model that already contains these two variables individually. We would multiply the outcomes for gestational age, x₁, by the outcomes for preeclampsia, x₃, to create a new variable x₁x₃. (Because x₃
FIGURE 18.2
Fitted least squares regression lines for different levels of preeclampsia

can assume only two possible values, x₁x₃ would be equal to 0 when x₃ = 0 and equal to x₁ when x₃ = 1.) In the population of low birth weight infants, this new explanatory variable would have a corresponding slope β₁₃. Based on the sample of 100 infants, the fitted least squares model is

ŷ = 1.7629 + 0.8646x₁ − 2.8150x₃ + 0.0462x₁x₃.

Testing the null hypothesis

H₀: β₁₃ = 0

versus the alternative

Hₐ: β₁₃ ≠ 0,
we are unable to reject H₀ at the 0.05 level of significance. We conclude that this sample does not provide evidence that gestational age has a different effect on head circumference depending on whether a mother experienced preeclampsia during pregnancy or not. Because the interaction term is not statistically significant, we would not want to retain it in the regression model. If it had achieved significance, however, we might again wish to evaluate the separate models corresponding to the two possible values of the dichotomous random variable preeclampsia. When x₃ = 1, the least squares equation would be

ŷ = 1.7629 + 0.8646x₁ − 2.8150(1) + 0.0462x₁(1) = −1.0521 + 0.9108x₁.

When x₃ = 0,

ŷ = 1.7629 + 0.8646x₁ − 2.8150(0) + 0.0462x₁(0) = 1.7629 + 0.8646x₁.
FIGURE 18.3
Fitted least squares regression lines for different levels of preeclampsia, interaction term included

These two lines are plotted in Figure 18.3; note that they have different intercepts and different slopes. In the range of interest, however, one line still lies completely above the other. This implies that, across all relevant gestational ages, infants whose mothers did not experience preeclampsia have larger head circumference measurements on average than infants whose mothers were diagnosed with this condition.
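In an R formula, the interaction model can be specified compactly; gestage * preeclamp expands to both main effects plus their product. This sketch keeps the hypothetical headcirc response from before.

model4 <- lm(headcirc ~ gestage * preeclamp, data = lowbwt)
summary(model4)  # the gestage:preeclamp row tests the interaction slope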
18.2 Model Selection
As a general rule, we prefer to include in a multivariable regression model only those explanatory variables that help us to predict or to explain the observed variability in the response y, the coefficients of which can be accurately estimated – what we call a parsimonious model. Consequently, if we are presented with a number of potential explanatory variables, how do we decide which ones to retain in the model and which to leave out? This decision is usually made based on a combination of statistical and nonstatistical considerations. Initially, we should have some prior knowledge as to which variables might be important. To study the full effect of each of these explanatory variables, however, it would be necessary to perform a separate regression analysis for each possible combination of the variables. The resulting models would then be evaluated according to some statistical criteria. This strategy for finding the “best” regression equation is known as the all possible models approach. While it is the most thorough method, it is also extremely time consuming. If we have a large number of potential explanatory variables, the procedure may not be feasible. As a result, we frequently resort to one of several alternative approaches for choosing a regression model. The two most commonly used procedures are known as forward selection and backward elimination.
Forward selection proceeds by introducing variables into the model one at a time. The model is evaluated at each step, and the process continues until some specified statistical criterion is achieved. For example, we might begin by including the single explanatory variable that yields the largest coefficient of determination, and thus explains the greatest proportion of the observed variability in y. We next put into the equation the variable that increases R² the most, assuming that the first variable remains in the model and that the increase in R² is statistically significant. The procedure continues until we reach a point where none of the remaining variables explains a significant amount of the additional variability in y. (A different statistical criterion might be chosen to find the best model. Rather than selecting variables that increase the coefficient of determination, for instance, we might choose those which increase the adjusted R², or which decrease the standard deviation from regression.)

Backward elimination begins by including all explanatory variables in the model. Variables are dropped one at a time, beginning with the one that reduces R² by the least amount and thus explains the smallest proportion of the observed variability in y, given the other variables in the model. If the decrease in R² is not statistically significant, the variable is left out of the model permanently. The equation is evaluated at each step, and the procedure is repeated until each of the variables remaining in the model explains a significant portion of the observed variation in the response.

When features of both the forward selection and backward elimination techniques are used together, the method is called stepwise selection. We begin as if we were using the forward selection procedure, introducing variables into the model one at a time. As each new explanatory variable is entered into the equation, however, all previously entered variables are checked to ensure that they maintain their statistical significance. Consequently, a variable entered into the model in one step might be dropped out again at a later step. Note that it is possible that we could end up with different final models, depending on which procedure is applied; see the sketch following this discussion for how automated selection can be carried out in R.
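As a rough illustration of automated selection, base R provides step(). Note that step() ranks candidate models by AIC rather than by the R²-based significance criteria described above, so it only approximates those procedures; the candidate variables here reuse the hypothetical names from earlier sketches.

full <- lm(headcirc ~ gestage + birthwt + preeclamp, data = lowbwt)
step(full, direction = "backward")   # or direction = "both" for stepwise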
Sometimes, rather than building a model containing explanatory variables which help to predict a response or outcome, we are interested in examining the relationship between a single explanatory variable and the response, taking into account the effects of one or more confounding variables. Suppose that the explanatory variable of interest is binary, representing the presence or absence of an exposure or treatment. If a simple linear regression model indicates that the mean value of the response is higher in the exposed group than in the unexposed group, it is possible that this difference is due to the exposure itself. But it is also possible that the two groups being compared differ with respect to another factor known to be associated with the response of interest, such as age, or sex, or socioeconomic position. This confounder could obscure the true relationship between exposure status and response, making the relationship appear either stronger or weaker than it really is. In this situation, we could fit a multivariable linear regression model representing the explanatory variable of interest by X₁ and the confounder by X₂. (If there is more than one confounding variable, we would use X₂, X₃, . . . , X_q.) The estimated coefficient β̂₁ would then be interpreted as the relationship between X₁ and the response, holding all other explanatory variables constant. This is what we mean when we say that we have "adjusted for" the confounding variables. In order to interpret the coefficient of the exposure variable X₁ in this way, we would want to leave the confounders in the linear regression model regardless of their statistical significance.

Irrespective of the strategy we choose to fit a particular model, we should always check for the presence of collinearity. Collinearity occurs when two or more of the explanatory variables are correlated to the extent that they convey essentially the same information about the observed variation in y. One symptom of collinearity is the instability of the estimated coefficients and their standard errors. In particular, the standard errors often become very large; this implies that there is a great deal of sampling variability in the estimated coefficients.

In the regression model that contains gestational age, preeclampsia, and the interaction between the two, preeclampsia and the gestational age–preeclampsia interaction are highly correlated. In fact, the Pearson correlation coefficient quantifying the linear relationship between these two variables
TABLE 18.1
Comparison of models with and without an interaction term

                              Interaction Term   Interaction Term
                              Not Included       Included
Coefficient of preeclampsia       −1.412            −2.815
Standard error                     0.406             4.985
Test statistic                    −3.477            −0.565
p-value                            0.001             0.574
R²                                 0.653             0.653
Adjusted R²                        0.646             0.642
is r = 0.997. This model and the model that did not include the interaction term are contrasted in Table 18.1. When the interaction term is included in the equation, the estimated coefficient of preeclampsia doubles in magnitude. In addition, its standard error increases by a factor of 12. In the model without the interaction term, the coefficient of preeclampsia is significantly different from 0 at the 0.05 level; when the interaction term is present, it no longer achieves statistical significance. The coefficient of determination does not change when the interaction is included; it remains 65.3%. Furthermore, the adjusted R² decreases slightly. These facts taken together indicate that the inclusion of the gestational age–preeclampsia interaction term in the regression model does not explain any additional variability in the observed values of head circumference, beyond that which is explained by gestational age and preeclampsia alone. The information supplied by this term is redundant.
18.3 Evaluation of the Model
Using techniques such as the coefficient of determination and a plot of the residuals, we are able to assess how well a particular least squares model fits the observed data. For example, it can be shown that the model containing gestational age and birth weight explains 75.2% of the variation in the observed head circumference measurements; the model containing gestational age alone explained 60.9%. This increase in R² suggests that adding the explanatory variable birth weight to the model improves our ability to predict head circumference for the population of low birth weight infants.

We must be careful when comparing coefficients of determination from two different models. The inclusion of an additional variable in a model can never cause R² to decrease; knowledge of both gestational age and birth weight, for example, can never explain less of the observed variability in head circumference than knowledge of gestational age alone. To get around this problem, we can use a second measure, called the adjusted R², that compensates for the added complexity of a model. The adjusted R² increases when the inclusion of a variable improves our ability to predict the response and decreases when it does not. Consequently, the adjusted R² allows us to make a more judicious comparison between models that contain different numbers of explanatory variables. Like the coefficient of determination, the adjusted R² is an estimator of the population correlation ρ squared; unlike R², however, it cannot be directly interpreted as the proportion of the variability among the observed values of y that is explained by the linear regression model.
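Both quantities are part of the output of summary() in R; a brief sketch, reusing the hypothetical two-predictor model object from earlier:

fit <- summary(model2)
fit$r.squared       # coefficient of determination
fit$adj.r.squared   # adjusted R-squared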
FIGURE 18.4
Residuals versus fitted values of head circumference

Figure 18.4 displays a scatter plot of the residuals from the model containing both gestational age and birth weight versus the fitted values of head circumference from the same model for the sample of 100 low birth weight infants. There is one residual with a particularly large value that could be considered an outlier; this point corresponds to a child with gestational age 31 weeks, birth weight 900 grams, and head circumference 35 cm. We would predict the infant's head circumference to be only

ŷ = 8.3080 + 0.4487(31) + 0.0047(900) = 26.5 cm.

Note that this outlier was also evident in Figure 18.1, the scatter plot of head circumference versus birth weight. We might try removing the point, refitting the least squares model, and determining how much of an effect this outlier has on the estimated coefficients. (If we were to do this, we would find that the point has only a small effect on the values of β̂₀, β̂₁, and β̂₂. The relationships between head circumference and both gestational age and birth weight remain statistically significant, with p < 0.001 in each case.) There is no evidence that the assumption of homoscedasticity has been violated – note the absence of a fan-shaped scatter – or that a transformation of either the response or one of the explanatory variables is necessary.

In Section 18.1, we noted that it is not possible to determine which explanatory variable in a multivariable linear regression model is most important for explaining the outcome by comparing their estimated coefficients. A different way to think about this, however, would be to compare changes in coefficients of determination. Once we have a final multivariable model, we would first calculate the R² of that full model. Then, we could remove each of the explanatory variables one at a time, recalculate the R² for each of these reduced models, and compare the values of the coefficients of determination with and without each explanatory variable. The relative importance of the explanatory variables can then be ordered by the magnitudes of these reductions in R², from largest to smallest.
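One way to sketch this comparison in R, dropping each explanatory variable in turn from the assumed two-variable model:

r2_full  <- summary(lm(headcirc ~ gestage + birthwt, data = lowbwt))$r.squared
r2_no_ga <- summary(lm(headcirc ~ birthwt, data = lowbwt))$r.squared
r2_no_bw <- summary(lm(headcirc ~ gestage, data = lowbwt))$r.squared

r2_full - c(gestage = r2_no_ga, birthwt = r2_no_bw)  # reductions in R-squared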
18.4 Further Applications
In Chapter 17, we used gestational age to help predict length for a sample of 100 low birth weight infants – defined as those weighing less than 1500 grams – born in Boston, Massachusetts. We found that a significant linear relationship exists between these two variables. In particular, length increases as gestational age increases. We now wish to determine whether the length of an infant also depends on the age of its mother.

To begin the analysis, we create a two-way scatter plot of length versus mother's age for the 100 infants in the sample. The plot is displayed in Figure 18.5. Based on the graph, and disregarding the one outlying value, length does not appear to either increase or decrease as mother's age increases. Given that we have already accounted for gestational age, does the inclusion of mother's age in the regression model further improve our ability to predict a child's length? To estimate the true population regression of length on gestational age and mother's age,

μ_y|x₁,x₂ = β₀ + β₁x₁ + β₂x₂,

we fit a least squares model of the form

ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂.
Table 18.2 shows the relevant output from Stata, and Table 18.3 displays output from R. The "Coef." column in the Stata output and the corresponding "Estimate" column in R contain information related to the estimated coefficients β̂₀, β̂₁, and β̂₂. As noted, the fitted least squares regression equation is

ŷ = 9.09 + 0.936x₁ + 0.0247x₂.

The coefficient of gestational age is 0.936, implying that if mother's age remains constant, each one week increase in gestational age corresponds to a 0.936 cm increase in length, on average. Similarly, the coefficient of mother's age suggests that if gestational age remains constant, a one year increase in mother's age would result in an approximate 0.0247 cm increase in length. A test of the null hypothesis H₀: β₁ = 0 is rejected at the 0.05 level of significance, while H₀: β₂ = 0 is not. Consequently, based on this sample of low birth weight infants, we conclude that length increases as gestational age increases, but that length does not vary with mother's age. The relationships between the response length and the two explanatory variables are each adjusted for the other.

Recall from Chapter 17 that gestational age alone explains 45.6% of the variability in the observed values of length; gestational age and mother's age together explain 45.8%. The adjusted coefficient of determination, shown in the output, has actually decreased slightly from 45.0% to 44.6%. This lack of change in R², combined with our failure to reject the null hypothesis that the coefficient of mother's age is equal to 0, demonstrates that adding this explanatory variable to the model does not improve our ability to predict length for the low birth weight infants in this population.

We now wish to investigate whether an expectant mother's diagnosis of preeclampsia during pregnancy affects the length of her child. To do this, we add an indicator variable representing preeclampsia status – where a diagnosis of preeclampsia is represented by 1 and no diagnosis by 0 – to the model already containing gestational age. As shown in both the Stata output in Table 18.4 and the R output in Table 18.5, the least squares regression equation for this model is

ŷ = 6.284 + 1.070x₁ − 1.777x₃.
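The R output in Tables 18.3 and 18.5 can be reproduced with calls along these lines; gestage, momage, and preeclamp are the variable names shown in the accompanying Stata output.

summary(lm(length ~ gestage + momage,    data = lowbwt))  # Table 18.3
summary(lm(length ~ gestage + preeclamp, data = lowbwt))  # Table 18.5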
FIGURE 18.5 Length versus mother’s age for a sample of 100 low birth weight infants
TABLE 18.2
Stata output displaying the regression of length on gestational age and mother's age

      Source |       SS           df       MS        Number of obs =    100
-------------+----------------------------------    F(2, 97)      =  40.91
       Model |  577.752198         2  288.876099    Prob > F      = 0.0000
    Residual |  685.007802        97  7.06193611    R-squared     = 0.4575
-------------+----------------------------------    Adj R-squared = 0.4463
       Total |     1262.76        99  12.7551515    Root MSE      = 2.6574

------------------------------------------------------------------------------
      length |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestage |   .9360867   .1093252     8.56   0.000     .7191064    1.153067
      momage |   .0247236   .0463071     0.53   0.595    -.0671832    .1166305
       _cons |   9.090871   3.088481     2.94   0.004     2.961091    15.22065
------------------------------------------------------------------------------
TABLE 18.3
R output displaying the regression of length on gestational age and mother's age

Residuals:
     Min       1Q   Median       3Q      Max
-12.9628  -1.2408   0.3321   1.5156   4.2229

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.09087    3.08848   2.943  0.00406 **
gestage      0.93609    0.10933   8.562 1.69e-13 ***
momage       0.02472    0.04631   0.534  0.59463
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.657 on 97 degrees of freedom
Multiple R-squared: 0.4575, Adjusted R-squared: 0.4463
F-statistic: 40.91 on 2 and 97 DF, p-value: 1.31e-13
TABLE 18.4
Stata output displaying the regression of length on gestational age and preeclampsia

      Source |       SS           df       MS        Number of obs =    100
-------------+----------------------------------    F(2, 97)      =  46.67
       Model |  619.253622         2  309.626811    Prob > F      = 0.0000
    Residual |  643.506378        97  6.63408638    R-squared     = 0.4904
-------------+----------------------------------    Adj R-squared = 0.4799
       Total |     1262.76        99  12.7551515    Root MSE      = 2.5757

------------------------------------------------------------------------------
      length |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestage |   1.069883   .1121039     9.54   0.000     .8473879    1.292378
   preeclamp |  -1.777381   .6939918    -2.56   0.012    -3.154763   -.3999996
       _cons |   6.284326   3.191824     1.97   0.052    -.0505613    12.61921
------------------------------------------------------------------------------
TABLE 18.5
R output displaying the regression of length on gestational age and preeclampsia

Residuals:
     Min       1Q   Median       3Q      Max
-13.0314  -1.3109   0.6192   1.6891   3.8288

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.2843     3.1918   1.969   0.0518 .
gestage       1.0699     0.1121   9.544  1.3e-15 ***
preeclamp    -1.7774     0.6940  -2.561   0.0120 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.576 on 97 degrees of freedom
Multiple R-squared: 0.4904, Adjusted R-squared: 0.4799
F-statistic: 46.67 on 2 and 97 DF, p-value: 6.321e-15
The coefficient of preeclampsia, where preeclampsia status is represented by x₃, is negative and is significantly different from 0 at the 0.05 level (p = 0.012). Given two infants with identical gestational ages, the child whose mother experienced preeclampsia would tend to be 1.78 cm shorter, on average, than the child whose mother had not. Also note that the coefficient of determination has increased from 45.6% for gestational age alone to 49.0% for the model containing both gestational age and preeclampsia; the adjusted R² has also increased from 45.0% to 48.0%.

To better understand this least squares model, we could examine two different models corresponding to the two possible values of the dichotomous random variable preeclampsia. When x₃ = 1, indicating that a mother did experience preeclampsia during pregnancy,

ŷ = 6.284 + 1.070x₁ − 1.777(1) = 4.507 + 1.070x₁.

When x₃ = 0,

ŷ = 6.284 + 1.070x₁ − 1.777(0) = 6.284 + 1.070x₁.
The two lines are plotted in Figure 18.6. Note that the lines have identical slopes; for either group, a one week increase in gestational age corresponds to a 1.07 cm increase in length, on average.

To determine whether an increase in gestational age has a different relationship with length for infants whose mothers were diagnosed with preeclampsia versus infants whose mothers were not, we could add to the model an additional variable which is the interaction between gestational age and preeclampsia. The interaction term is created by multiplying together the outcomes of the two random variables representing gestational age and preeclampsia status. The Stata output corresponding to this model is presented in Table 18.6, and the R output in Table 18.7. Based on the sample of 100 low birth weight infants, the fitted least squares model is

ŷ = 6.608 + 1.058x₁ − 3.477x₃ + 0.0559x₁x₃.
We are unable to reject the null hypothesis that β₁₃, the coefficient of the interaction term, is equal to 0 (p = 0.84). The adjusted R² has decreased from 48.0% to 47.5%. Furthermore, the high correlation between preeclampsia and the gestational age–preeclampsia interaction – the Pearson correlation coefficient is equal to 0.997 – has introduced collinearity into the model. Note that the standard error of the estimated coefficient of preeclampsia is approximately 12 times larger than it was in the
FIGURE 18.6 Fitted least squares regression lines for different levels of preeclampsia
TABLE 18.6
Stata output displaying the linear regression of length on gestational age, preeclampsia, and their interaction

      Source |       SS           df       MS        Number of obs =    100
-------------+----------------------------------    F(3, 96)      =  30.82
       Model |  619.522097         3  206.507366    Prob > F      = 0.0000
    Residual |  643.237903        96  6.70039483    R-squared     = 0.4906
-------------+----------------------------------    Adj R-squared = 0.4747
       Total |     1262.76        99  12.7551515    Root MSE      = 2.5885

------------------------------------------------------------------------------
      length |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gestage |   1.058458    .126295     8.38   0.000     .8077647    1.309152
   preeclamp |  -3.477085   8.519838    -0.41   0.684    -20.38883    13.43466
     gesttox |   .0559409   .2794651     0.20   0.842    -.4987929    .6106747
       _cons |   6.608269   3.592847     1.84   0.069    -.5234757    13.74001
------------------------------------------------------------------------------
TABLE 18.7
R output displaying the linear regression of length on gestational age, preeclampsia, and their interaction

Residuals:
    Min      1Q  Median      3Q     Max
-13.070  -1.241   0.638   1.696   3.813

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        6.60827    3.59285   1.839    0.069 .
gestage            1.05846    0.12630   8.381 4.41e-13 ***
preeclamp         -3.47708    8.51984  -0.408    0.684
gestage:preeclamp  0.05594    0.27947   0.200    0.842
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.589 on 96 degrees of freedom
Multiple R-squared: 0.4906, Adjusted R-squared: 0.4747
F-statistic: 30.82 on 3 and 96 DF, p-value: 4.839e-14
model that did not contain the interaction term. Therefore, we conclude that there is no evidence that gestational age has a different effect on length depending on whether a mother experienced preeclampsia during pregnancy or not.

Returning to the model that contains gestational age and preeclampsia status but not their interaction term, a plot of the residuals is displayed in Figure 18.7. There appears to be one outlier in the set of observations. We might consider dropping this data point, refitting the least squares equation, and comparing the two models to determine how much of an effect the point has on the estimated coefficients. However, the assumption of homoscedasticity has not been violated, and a transformation of variables does not appear to be necessary.

There is another type of explanatory variable that we have not yet considered. Suppose we have a measurement that is categorical, but not dichotomous. The values could be either nominal or ordinal. For example, instead of preeclampsia diagnosis yes or no, we might have the three categories no preeclampsia, mild preeclampsia, and severe preeclampsia. Or, rather than having gestational age as a continuous variable, it might have been classified into the four categories < 28 weeks, 28 to 29 weeks, 30 to 31 weeks, and ≥ 32 weeks. How would we include an explanatory variable of this type in a regression model?

When we have a categorical explanatory variable with more than two classes, we begin by choosing one category to be the "reference group." The reference group is the category against which each of the other classes will be compared. We then create a separate indicator variable for each of the categories that is not the reference group. Each of these indicator variables takes the value 1 for subjects in the category, and 0 for subjects not in the category. For example, suppose that among the four classes for gestational age, we choose < 28 weeks as the reference group. In this case, we must create three separate indicator variables, one for each of the remaining categories. These new indicator variables are listed below. Note that the indicator variable named ga_28_29 represents infants with gestational ages 28–29 weeks. It takes the value 1 for all subjects in this category, and 0 otherwise. Similarly, the indicator variable ga_30_31 represents infants with gestational ages 30–31 weeks, taking the value 1 for individuals in this class and 0 otherwise. The reference group does not
get its own indicator variable; it is the category for which all three of the other indicators take the value 0.

Gestational Age Category   ga_28_29   ga_30_31   ga_ge32
< 28 weeks                     0          0          0
28–29 weeks                    1          0          0
30–31 weeks                    0          1          0
≥ 32 weeks                     0          0          1

FIGURE 18.7 Residuals versus fitted values of length

TABLE 18.8 Stata output displaying the linear regression of length on categories of gestational age

Number of obs   =      100
F(3, 96)        =    19.52
Prob > F        =   0.0000
R-squared       =   0.3789
Adj R-squared   =   0.3595
Root MSE        =   2.8583

------------------------------------------------------------------------
    length |    Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-----------+------------------------------------------------------------
  ga_28_29 |  3.232258   .7320423    4.42   0.000     1.779166    4.68535
  ga_30_31 |     4.575   .7827865    5.84   0.000     3.021181   6.128819
   ga_ge32 |  6.133333    .903884    6.79   0.000     4.339138   7.927529
     _cons |      33.8   .5218577   64.77   0.000     32.76412   34.83588
------------------------------------------------------------------------
TABLE 18.9
R output displaying the linear regression of length on categories of gestational age

Residuals:
     Min       1Q   Median       3Q      Max
-13.8000  -1.4812   0.4125   1.9677   5.2000

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  33.8000     0.5219  64.769  < 2e-16 ***
ga_28_29      3.2323     0.7320   4.415 2.64e-05 ***
ga_30_31      4.5750     0.7828   5.845 6.96e-08 ***
ga_ge32       6.1333     0.9039   6.786 9.51e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.858 on 96 degrees of freedom
Multiple R-squared: 0.3789,    Adjusted R-squared: 0.3595
F-statistic: 19.52 on 3 and 96 DF,  p-value: 5.824e-10
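Indicator variables like these can be generated directly from the continuous gestational age measurement. The following is a minimal R sketch, assuming the lowbwt data have been read into a data frame with the variables length and gestage; the name ga_cat for the derived factor is illustrative, not part of the dataset.

# Create the three indicator variables, using < 28 weeks as the
# reference group (the group for which all three indicators are 0).
lowbwt$ga_28_29 <- as.numeric(lowbwt$gestage >= 28 & lowbwt$gestage <= 29)
lowbwt$ga_30_31 <- as.numeric(lowbwt$gestage >= 30 & lowbwt$gestage <= 31)
lowbwt$ga_ge32  <- as.numeric(lowbwt$gestage >= 32)

summary(lm(length ~ ga_28_29 + ga_30_31 + ga_ge32, data = lowbwt))

# Equivalently, let R construct the indicators from a factor; the first
# level of the factor becomes the reference group automatically.
lowbwt$ga_cat <- cut(lowbwt$gestage, breaks = c(-Inf, 27, 29, 31, Inf),
                     labels = c("<28", "28-29", "30-31", ">=32"))
summary(lm(length ~ ga_cat, data = lowbwt))

Either formulation fits the same model; the factor version is usually less error-prone when the number of categories is large.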
into categories of equal width and create an indicator variable for each class, we would expect the increments between the coefficients for successive categories to be approximately equal if the relationship is truly linear. If this is not the case, the relationship is not linear and we may need to consider a transformation.
18.5 Review Exercises
1. What assumptions do you make when using the method of least squares to estimate a population regression equation containing two or more explanatory variables?

2. Given a multivariable linear regression model with a total of q distinct explanatory variables, how would you make inference about a single coefficient βj?

3. Explain how the coefficient of determination R² and adjusted R² can be used to help evaluate the fit of a multiple linear regression model.

4. What is the function of an interaction term in a regression model? How is an interaction term created?

5. If you are performing an analysis with a single response and several potential explanatory variables, how would you decide which variables to include in a multivariable linear regression model and which to leave out?

6. How can collinearity between two explanatory variables affect the estimated coefficients in a regression model?

7. In a study designed to examine the effects of adding oats to the typical American diet, individuals were randomly divided into two different groups. Twice a day, the first group substituted oats for other foods containing carbohydrates. The members of the second group did not make any changes to their diet. One outcome of interest is the serum cholesterol level of each individual eight weeks after the start of the study. Explanatory variables that might be associated with this response include diet group, serum cholesterol level at the start of the study, body mass index (measured in kg/m²), and sex. The estimated coefficients and standard errors from a multiple linear regression model constructed based on a sample of size 76 and containing these four explanatory variables are displayed below [286].

Variable               Coefficient   Standard Error
Diet group                  −11.25             4.33
Baseline cholesterol          0.85             0.07
Body mass index               0.23             0.65
Sex                          −3.02             4.42
(a) Conduct tests of the null hypotheses that each of the four coefficients in the population regression equation is equal to 0. At the 0.05 level of significance, which of the explanatory variables are associated with serum cholesterol level eight weeks after the start of the study, adjusting for the others?
(b) What is the probability distribution of each of the test statistics calculated in part (a)?
(c) If an individual's body mass index were to increase by 1 kg/m² while the values of all other explanatory variables remained constant, what would happen to his or her serum cholesterol level?
(d) If an individual's body mass index were to increase by 10 kg/m² while the values of all other explanatory variables remained constant, what would happen to his or her serum cholesterol level?
(e) The indicator variable sex is coded so that 1 represents a male and 0 a female. Who is more likely to have a higher serum cholesterol level eight weeks after the start of the study, a male or a female? How much higher would it be, on average?

8. The dataset lipid contains information for a sample of 1859 adults living in northern Ghana, collected as part of a study examining the relationships among sociodemographic, anthropometric, and behavioral factors, and lipid levels in a rural African population [284]. Measurements of total cholesterol are saved under the variable name total_cholesterol, and measurements of subcutaneous abdominal fat under the name sc_fat.
(a) We previously found a significant linear relationship between these two variables (Chapter 17, Review Exercise 12), where total cholesterol was the response and subcutaneous abdominal fat the explanatory variable. Rerun this simple linear regression model and interpret the slope.
(b) In this dataset, is there an association between subcutaneous abdominal fat and body mass index? Measurements of body mass index are saved under the variable name bmi. Explore this relationship using correlation analysis. If there is an association, do individuals with higher measurements of subcutaneous fat have higher or lower body mass indices?
(c) We might believe that body mass index is a confounder in the relationship between subcutaneous abdominal fat and total cholesterol. Run a linear regression model which adjusts for body mass index, and interpret the slope coefficient associated with the explanatory variable subcutaneous fat.
(d) At the 0.05 level of significance, test the null hypothesis that the true population slope β1 associated with subcutaneous abdominal fat is equal to 0. What do you conclude?
(e) In addition to body mass index, sex and age might also be confounders in the relationship between subcutaneous abdominal fat and total cholesterol. In this dataset, age is provided as a categorical variable with classes 40–44 years, 45–49 years, 50–54 years, and 55–60 years (variable age_category). There are also three indicator variables representing the categories 45–49, 50–54, and 55–60, using age 40–44 years as the reference group. Run a linear regression model with response total cholesterol and explanatory variable subcutaneous abdominal fat, which adjusts for body mass index, sex, and age group. Interpret the slope coefficient associated with subcutaneous fat.
(f) At the 0.05 level of significance, test the null hypothesis that the true population slope β1 associated with subcutaneous abdominal fat is equal to 0. What do you conclude?

9. For the population of low birth weight infants, a significant linear relationship was found to exist between systolic blood pressure and gestational age (Chapter 17, Review Exercise 14). Recall that the relevant data are in the file lowbwt [81]. The measurements of systolic blood pressure are saved under the variable name sbp, and the corresponding gestational ages under gestage. Also in the data set is apgar5, the five-minute apgar score for each infant. (The apgar score is an indicator of a child's general state of health five minutes after it is born. Although it is actually an ordinal measurement, it is often treated as if it were continuous.)
(a) Construct a two-way scatter plot of systolic blood pressure versus five-minute apgar score. Does there appear to be a linear relationship between these two variables?
(b) Using systolic blood pressure as the response and gestational age and apgar score as the explanatory variables, fit the least squares model

    ŷ = β̂0 + β̂1x1 + β̂2x2.

Interpret β̂1, the estimated coefficient of gestational age. What does it mean in words? Similarly, interpret β̂2, the estimated coefficient of five-minute apgar score.
(c) What is the estimated mean systolic blood pressure for the population of low birth weight infants whose gestational age is 31 weeks and whose five-minute apgar score is 7?
(d) Test the null hypothesis H0 : β2 = 0 at the 0.05 level of significance. What is the p-value? What do you conclude?
(e) Comment on the magnitude of R². Does the inclusion of five-minute apgar score in the model already containing gestational age improve your ability to predict systolic blood pressure?
(f) Construct a plot of the residuals versus the fitted values of systolic blood pressure. What does this plot tell you about the fit of the model to the observed data?
(g) The data set lowbwt also contains sex, a dichotomous random variable. Add the indicator variable sex – where 1 represents a male and 0 a female – to the model that contains gestational age. Given two infants with identical gestational ages, one male and the other female, which would tend to have the higher systolic blood pressure? How much higher, on average?
(h) Construct a two-way scatter plot of systolic blood pressure versus gestational age. On the graph, draw the two separate least squares regression lines corresponding to males and to females. Is the sex difference in systolic blood pressure at each value of gestational age significantly different from 0?
(i) Add to the model a third explanatory variable that is the interaction between gestational age and sex. Does gestational age have a different effect on systolic blood pressure depending on the sex of the infant?
(j) Would you choose to include sex and the gestational age–sex interaction term in the regression model simultaneously? Why or why not?

10. The Bayley Scales of Infant Development produce two scores – the Psychomotor Development Index (pdi) and the Mental Development Index (mdi) – which can be used to assess a child's level of functioning. As part of a study examining the development and neurologic status of children who underwent reparative heart surgery during the first three months of life, the Bayley Scales were administered to a sample of one-year-old infants born with congenital heart disease. Prior to heart surgery, the children had been randomized to one of two different treatment groups, called "circulatory arrest" and "low-flow bypass," which differed in the specific way in which the operation was performed. The data for this study are saved in the data set bayley [189]. PDI scores are saved under the variable name pdi, MDI scores under mdi, and indicators of treatment group under trtment. For the treatment group variable, 0 represents circulatory arrest and 1 is low-flow bypass.
(a) In Chapter 11, the two-sample t test was used to compare mean PDI and MDI scores for infants assigned to the circulatory arrest and low-flow bypass treatment groups.
These analyses could also be performed using linear regression. Fit two simple linear regression models – one with PDI score as the response and the other with MDI score – that both have the indicator of treatment group as the explanatory variable.
(b) Who is more likely to have a higher PDI score, a child assigned to the circulatory arrest treatment group or one assigned to the low-flow bypass group? How much higher would the score be, on average?
(c) Who is more likely to have a higher MDI score? How much higher, on average?
(d) Is the treatment group difference in either PDI or MDI scores statistically significant at the 0.05 level? What do you conclude?
(e) How do the results based on the linear regression model compare to those obtained using the two-sample t test?
19 Logistic Regression
CONTENTS
19.1 The Model ............................................. 455
     19.1.1 Logistic Function .............................. 457
     19.1.2 Fitted Equation ................................ 458
19.2 Indicator Variables ................................... 460
19.3 Multiple Logistic Regression .......................... 464
19.4 Simpson's Paradox ..................................... 466
19.5 Interaction Terms ..................................... 467
19.6 Model Selection ....................................... 468
19.7 Further Applications .................................. 469
19.8 Review Exercises ...................................... 474
When studying linear regression, we estimate a population regression equation

µy|x1,x2,...,xq = β0 + β1x1 + β2x2 + · · · + βqxq

by fitting a model of the form

y = β0 + β1x1 + β2x2 + · · · + βqxq + ε.
The response Y is continuous, and is assumed to follow a normal distribution. We are concerned with predicting or estimating the mean value of the response corresponding to a given set of values for the explanatory variables. There are many situations, however, in which the response of interest is dichotomous rather than continuous. Examples of variables that assume only two possible values are disease status (disease is either present or absent) and survival following surgery (a patient is either alive or dead). In general, the value 1 is used to represent a "success," or the outcome we are most interested in, and 0 represents a "failure." The mean of the dichotomous random variable Y, designated p, is the proportion of times that Y takes the value 1. Equivalently,

p = P(Y = 1) = P("success").
Just as we estimate the mean value of the response when Y is continuous, we would like to be able to estimate the probability p associated with a dichotomous response for various values of an explanatory variable. To do this, we use a technique known as logistic regression.
19.1 The Model
Among marathon runners, hyponatremia – defined as a decrease in blood sodium concentration to a value less than or equal to 135 millimoles per liter – can cause life-threatening illness and, in extreme cases, death. In a sample of 488 adults who completed the Boston Marathon and who are considered to be representative of the larger population of runners who complete marathons, 62 were diagnosed with hyponatremia [287]. Let Y be a dichotomous random variable for which the value 1 represents a diagnosis of hyponatremia in a runner and 0 no such diagnosis. We would estimate the probability that a runner develops hyponatremia by the sample proportion

p̂ = 62/488 = 0.127.

FIGURE 19.1 Diagnosis of hyponatremia versus weight gain during the race for a sample of marathon runners
Overall, 12.7% of runners in the sample are diagnosed with this condition. We might suspect there are certain factors which affect the likelihood that a particular individual will develop hyponatremia. If we could classify a runner according to these characteristics, it might be possible to calculate a more informative estimate of their probability of developing hyponatremia. Since it was hypothesized that excessive fluid consumption during the race might be associated with development of hyponatremia, for example, one factor of interest is a runner’s change in weight from the beginning of the marathon to its end. If the response Y were continuous, we would begin an analysis by constructing a scatter plot of the response versus the continuous explanatory variable. A graph of hyponatremia versus weight gain in pounds is displayed in Figure 19.1 for the 455 individuals for whom weight gain was measured. Note that all points lie on one of two parallel lines, depending on whether Y takes the value 0 or 1. There does appear to be a tendency for individuals who develop hyponatremia to have higher weight gain, on average, while those who do not develop hyponatremia have lower weight gain. There is a lot of overlap, however, and the nature of this relationship is not clear from the graph. Since the two-way scatter plot is not particularly helpful, we might instead explore whether an association exists between a diagnosis of hyponatremia and weight gain during the race by arbitrarily dividing the runners who had their weight gain recorded into three groups with similar numbers of people in each category: those losing at least 1.36 pounds, those losing less than 1.36 pounds, but
not gaining more than 0.01 pounds, and those gaining 0.01 pounds or more. We could then estimate the probability that an individual will develop hyponatremia in each of these subgroups individually.

Weight Gain (pounds)   Number of Runners   Number with Hyponatremia      p̂
≤ −1.36                      159                      6                0.038
> −1.36, < 0.01              161                     13                0.081
≥ 0.01                       135                     38                0.281
The estimated probability of hyponatremia increases as the amount of weight gain increases, from a low of 0.038 for runners who lose the most weight to a high of 0.281 for those who gain weight. Since there does appear to be a relationship between these two variables, we would like to be able to use a runner’s weight gain during the race to help us predict the likelihood that they will develop hyponatremia.
19.1.1 Logistic Function

Our first strategy might be to fit a model of the form

p = β0 + β1x,
where x represents weight gain. This is simply the standard linear regression model in which y – the outcome of a continuous, normally distributed random variable – has been replaced by p. As before, β0 is the intercept of the line and β1 is its slope. On inspection, however, this model is not feasible. Since p is a probability, it is restricted to taking values between 0 and 1. The term β0 + β1x, in contrast, could easily yield a value that lies outside this range. We might try to solve this problem by fitting the model

p = e^(β0 + β1x).
This equation guarantees that the estimate of p is positive. We would soon realize, however, that this model is also unsuitable. Although the term e^(β0 + β1x) cannot produce a negative estimate of p, it can result in a value that is greater than 1. To accommodate this additional constraint, we consider a model of the form

p = e^(β0 + β1x) / (1 + e^(β0 + β1x)).

The expression on the right, called a logistic function, cannot yield a value that is either negative or greater than 1. Consequently, it restricts the estimated value of p to the required range. Recall that if an event occurs with probability p, the odds in favor of the event are p/(1 − p) to 1. Thus, if a success occurs with probability

p = e^(β0 + β1x) / (1 + e^(β0 + β1x)),

the odds in favor of success are

p/(1 − p) = [e^(β0 + β1x)/(1 + e^(β0 + β1x))] / [1/(1 + e^(β0 + β1x))] = e^(β0 + β1x).

Taking the natural logarithm of each side of this equation,

ln[p/(1 − p)] = ln[e^(β0 + β1x)] = β0 + β1x.
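To make the constraint concrete, here is a small R sketch of the logistic function; nothing in it is specific to the marathon data, and the function name logistic is our own.

# The logistic function maps any value of b0 + b1*x into (0, 1).
logistic <- function(x, b0, b1) {
  exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
}

x <- seq(-5, 5, by = 0.1)
p <- logistic(x, b0 = 0, b1 = 1)   # same as the built-in plogis(0 + 1 * x)
range(p)                           # strictly between 0 and 1

# The log odds recover the linear term b0 + b1*x:
all.equal(log(p / (1 - p)), 0 + 1 * x)   # TRUE, up to rounding

This is exactly the equivalence derived above: the logistic model for p is a linear model for the log odds.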
Thus, modeling the probability p with a logistic function is equivalent to fitting a linear regression model in which the continuous response y has been replaced by the logarithm of the odds of success for a dichotomous random variable. Instead of assuming that the relationship between p and x is linear, we assume that the relationship between ln[p/(1 − p)] and x is linear. The technique of fitting a model of this form is known as logistic regression.
19.1.2 Fitted Equation
In order to use a marathon runner's weight gain during the race to help us predict the probability that they will develop hyponatremia, we fit the model

ln[p̂/(1 − p̂)] = β̂0 + β̂1x.

Although we divided weight gain into three categories when initially exploring its relationship with the outcome, we now use the original continuous measurement as the explanatory variable for the logistic regression model. As in linear regression, β̂0 and β̂1 are estimates of the population coefficients. However, we do not apply the method of least squares – which assumes that the response is continuous and normally distributed – to fit a logistic model. Instead, we use maximum likelihood estimation [288]. Recall that this technique uses the information in a sample to find the parameter estimates that are most likely to have produced the observed data. For the sample of runners, the estimated logistic regression equation is

ln[p̂/(1 − p̂)] = −1.8849 + 0.7284x.

The intercept β̂0 = −1.8849 is the estimated log odds of hyponatremia for a runner with weight gain equal to 0 pounds, or no change in weight at all. The coefficient of the explanatory variable β̂1 = 0.7284 implies that for each one pound increase in weight gain, the log odds that the runner develops hyponatremia increase by 0.7284 on average. When the log odds increase, the odds of the outcome increase, and the probability p increases as well. In order to determine whether this relationship is statistically significant, we test the null hypothesis that there is no relationship between p and x,

H0 : β1 = 0,

against the alternative

HA : β1 ≠ 0.
If the null hypothesis is true, the probability of being diagnosed with hyponatremia is the same regardless of the amount of weight gain. In order to conduct the test, we need to know the estimated standard error of β̂1. Then, if H0 is true and the sample size is sufficiently large, the test statistic

z = β̂1 / ŝe(β̂1)
can be assumed to follow a standard normal distribution. Using a statistical package such as Stata or R, we find that the standard error of β̂1 is 0.1103, and the test statistic is therefore

z = 0.7284/0.1103 = 6.60.
The probability of observing a test statistic as extreme as or more extreme than 6.60 given that the null hypothesis is true is smaller than 0.001. Since this p-value is less than the significance level 0.05, we reject the null hypothesis. We conclude that in the underlying population of marathon runners, the probability of being diagnosed with hyponatremia increases as the amount of weight gained during the race increases. In addition to conducting a test of hypothesis for β1, we can also calculate a confidence interval for this population regression coefficient. If the sample size is large enough to assume normality of the test statistic, then

(β̂1 − 1.96 ŝe(β̂1), β̂1 + 1.96 ŝe(β̂1))

is an approximate 95% confidence interval for β1. Since we previously noted that ŝe(β̂1) = 0.1103, the interval is

(0.7284 − 1.96(0.1103), 0.7284 + 1.96(0.1103))

or

(0.5122, 0.9447).
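In practice these quantities come directly from fitted model output. The following is a minimal R sketch; the data frame marathon and the variable names hypo (the 0/1 hyponatremia indicator) and wtgain are hypothetical stand-ins for the study data, not names from the source.

# Fit the logistic regression by maximum likelihood.
fit <- glm(hypo ~ wtgain, family = binomial, data = marathon)
summary(fit)    # coefficients, standard errors, Wald z statistics

b1  <- coef(fit)["wtgain"]               # about 0.7284
se1 <- sqrt(diag(vcov(fit)))["wtgain"]   # about 0.1103
b1 / se1                                 # Wald z statistic, about 6.60
b1 + c(-1.96, 1.96) * se1                # approximate 95% CI for beta1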
While 0.7284 is the point estimate for β1 – our best guess at its value, given the sample of data we have – we are 95% confident that these limits contain the true value. In order to estimate the probability that a runner with a particular weight gain develops hyponatremia, we simply substitute the appropriate value of x into the equation above. To estimate the probability that an individual gaining 1 pound is diagnosed with hyponatremia, for example, we substitute the value x = 1 to find

ln[p̂/(1 − p̂)] = −1.8849 + 0.7284(1) = −1.1565.

Taking the antilogarithm of each side of the equation,

p̂/(1 − p̂) = e^(−1.1565) = 0.3146.

Finally, solving for p̂,

p̂ = 0.3146/(1 + 0.3146) = 0.239.

The estimated probability that a runner who gains 1 pound during the race develops hyponatremia is 0.239. Using similar calculations, we find that the estimated probability of being diagnosed with hyponatremia for a runner who loses 1 pound (or equivalently, gains −1 pound) is

p̂ = 0.068,

whereas the probability for a runner whose weight does not change at all – a gain of 0 pounds – is

p̂ = 0.132.

If we calculate the estimated probability p̂ for each observed value of weight gain x in the dataset and plot p̂ versus x, the result would be the curve in Figure 19.2. According to the logistic regression model, the estimated value of p increases as weight gain increases. As previously noted, however, the relationship between p and x is not linear.
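These back-transformations are one line in R, since plogis() computes e^u/(1 + e^u); the coefficients below are simply the fitted values quoted above.

round(plogis(-1.8849 + 0.7284 * c(-1, 0, 1)), 3)
# 0.068 0.132 0.239  -- predicted probabilities for x = -1, 0, 1 pounds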
FIGURE 19.2 Logistic regression of hyponatremia on weight gain: ln[p̂/(1 − p̂)] = −1.8849 + 0.7284x

Summary: Simple Logistic Regression Model

Logistic function:
    p = e^(β0 + β1x) / (1 + e^(β0 + β1x))
    where p is the population probability of success

Population regression line:
    ln[p/(1 − p)] = ln[e^(β0 + β1x)] = β0 + β1x
    where β0 and β1 are the coefficients of the model

Fitted equation:
    ln[p̂/(1 − p̂)] = β̂0 + β̂1x

19.2 Indicator Variables
Like the linear regression model, the logistic regression model can include categorical explanatory variables in addition to continuous ones. Suppose we are interested in the relationship between hyponatremia and sex, categorized as female or male. We could begin by noting that the proportion of females with the outcome is 37/166 = 0.223, while the proportion of males with the outcome is 25/322 = 0.078. In the sample of marathon runners, the estimated probability of hyponatremia is
higher for females than it is for males. Suppose we now fit the model

ln[p̂/(1 − p̂)] = β̂0 + β̂1x
where x is the dichotomous random variable indicating sex. If female sex is represented by 1 and male sex by 0, the equation estimated from the sample is

ln[p̂/(1 − p̂)] = −2.4749 + 1.2260x.

The coefficient of sex is positive, implying that the log odds of developing hyponatremia – and thus the probability p itself – is higher for females than for males. It can be shown that if an explanatory variable x is dichotomous, its estimated coefficient β̂1 has a special interpretation. In this case, the antilogarithm of β̂1 – the exponentiated value of β̂1 – is the estimated odds ratio of the response for the two possible levels of x. For example, the odds ratio of developing hyponatremia for female runners versus male runners is

OR̂ = e^(β̂1) = e^(1.2260) = 3.41.
This tells us that the odds of developing hyponatremia for females are 3.41 times the odds for males. The same results could have been obtained after arranging the sample data as a 2 × 2 contingency table. Among 166 female runners, 37 developed hyponatremia; among 322 male runners, 25 developed hyponatremia.

                        Sex
Hyponatremia   Female   Male   Total
Yes                37     25      62
No                129    297     426
Total             166    322     488
The odds ratio estimated by computing the cross-product of the entries in the contingency table,

OR̂ = [(37)(297)] / [(25)(129)] = 3.41,

is identical to that obtained from the logistic regression model. A confidence interval for the odds ratio can be calculated from the model by computing a confidence interval for the coefficient β1, and then taking the antilogarithm of its upper and lower limits. If ŝe(β̂1) = 0.2795, a 95% confidence interval for β1 is

(1.2260 − 1.96(0.2795), 1.2260 + 1.96(0.2795))

or

(0.6781, 1.7738).

A 95% confidence interval for e^(β1), the odds ratio, is

(e^(0.6781), e^(1.7738))

or

(1.97, 5.89).
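The equivalence is easy to verify numerically. A brief R sketch, using only the counts and estimates quoted above:

# Odds ratio from the 2 x 2 table (cross-product ratio):
(37 * 297) / (25 * 129)           # 3.41

# The same estimate, and its 95% CI, from the model coefficient:
b1  <- 1.2260
se1 <- 0.2795
exp(b1)                           # 3.41
exp(b1 + c(-1.96, 1.96) * se1)    # (1.97, 5.89)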
TABLE 19.1
Logistic regression of hyponatremia on weight gain category

Variable                 Coefficient   Standard Error   Test Statistic   p-value
Weight gain category 2        0.8064           0.5068             1.59     0.112
Weight gain category 3        2.3016           0.4581             5.02   < 0.001
Intercept                    −3.2387           0.4162            −7.78   < 0.001
We are 95% confident that these limits cover the true population odds ratio for female runners versus male runners. Note that the confidence interval does not contain the value 1. Therefore, the sample provides evidence that the probability of developing hyponatremia is different for females and males, and, because the estimated odds ratio is greater than 1, the probability is higher for females. Categorical explanatory variables are not limited to binary measurements. Suppose we consider the three categories of weight gain during the race investigated in Section 19.1. Here we must choose one category to be the reference group – the category against which each of the other groups will be compared – and create a separate indicator variable for each of the other categories. Each indicator takes the value 1 for subjects in the category, and 0 otherwise. Suppose that we choose a weight loss of at least 1.36 pounds as the reference group. In this case we would create two indicator variables, one for each of the other categories. These indicator variables are shown in the table below.

Weight Gain (pounds)   Weight Gain Category 2   Weight Gain Category 3
≤ −1.36                          0                        0
−1.35 to 0.00                    1                        0
≥ 0.01                           0                        1
Now that we have created these indicator variables, we put them both into the logistic regression model at the same time. The results are shown in Table 19.1. Because the coefficients of both weight gain categories are positive, we know that the probability of hyponatremia is higher for runners who lose no more than 1.35 pounds during the race relative to those who lose at least 1.36 pounds, and higher for runners who gain weight relative to those who lose at least 1.36 pounds. Furthermore, the odds ratio of being diagnosed with hyponatremia for runners who lose no more than 1.35 pounds versus those who lose 1.36 pounds or more is

OR̂ = e^(β̂1) = e^(0.8064) = 2.24,

and the odds ratio of hyponatremia for runners who gain weight versus those who lose 1.36 pounds or more is

OR̂ = e^(β̂2) = e^(2.3016) = 9.99.

Again we see that the odds of hyponatremia increase as the amount of weight gained increases, just as we saw when weight gain was treated as a continuous variable. Note that an odds ratio interpretation can be applied to continuous explanatory variables as well as categorical ones. For continuous weight gain during the race, for example, the estimated coefficient β̂1 is 0.7284. This means that for each 1 pound increase in weight gain, the log odds that
a runner develops hyponatremia increase by 0.7284, on average. If we exponentiate this coefficient to get e^(0.7284) = 2.07, we can interpret this value as the odds ratio associated with each 1 pound increase in weight gain. Therefore, the odds of developing hyponatremia for an individual gaining 2 pounds during the race are 2.07 times the odds for a runner gaining 1 pound, and the odds of hyponatremia for an individual gaining 1 pound are 2.07 times the odds of a runner gaining 0 pounds. As noted in Chapter 18, indicator variables can be used to explore the nature of the relationship between a continuous explanatory variable and a binary response to help determine whether the relationship is truly linear. Initially we created three categories of weight gain to begin our investigation of its association with the diagnosis of hyponatremia. More generally, we might divide a continuous explanatory variable into five or ten categories of equal width and create an indicator variable for each class (omitting the one chosen to be the reference category). We would expect the increments between the coefficients of successive categories to be approximately equal if the relationship is truly linear. If this is not the case, we may need to consider a transformation of the continuous explanatory variable. Another way to evaluate whether the relationship is linear is to calculate the log odds of the outcome within each quintile or decile of the explanatory variable, and plot the log odds versus the group midpoint. In Figure 19.3, for example, the observed log odds of hyponatremia within each quintile of weight gain is plotted against the midpoint of that quintile. The resulting graph – where the points lie quite close to a straight line – justifies the use of weight gain during the race as a continuous explanatory variable for predicting hyponatremia.

FIGURE 19.3 Observed log odds of hyponatremia within each quintile of weight gain versus quintile midpoints
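A plot like Figure 19.3 can be sketched in a few lines of R. As before, the data frame marathon and the variables hypo and wtgain are hypothetical names, and the group medians below stand in for the quintile midpoints.

# Divide weight gain into quintiles.
cuts <- quantile(marathon$wtgain, probs = seq(0, 1, 0.2), na.rm = TRUE)
grp  <- cut(marathon$wtgain, breaks = cuts, include.lowest = TRUE)

# Observed proportion, log odds, and a central value per quintile.
p_obs <- tapply(marathon$hypo, grp, mean)
logit <- log(p_obs / (1 - p_obs))
mid   <- tapply(marathon$wtgain, grp, median)

plot(mid, logit)   # points near a straight line support a linear logit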
TABLE 19.2
Logistic regression of hyponatremia on weight gain and female sex

Variable      Coefficient   Standard Error   Test Statistic   p-value
Weight gain        0.7026           0.1138             6.17   < 0.001
Female sex         0.8695           0.3109             2.80     0.005
Intercept         −2.3009           0.2346            −9.81   < 0.001

19.3 Multiple Logistic Regression
Now that we have seen that both weight gain during the race and female sex are associated with the probability that a marathon runner will be diagnosed with hyponatremia, we might wonder whether including these two explanatory variables in the same model will improve our ability to predict p. In other words, given that we have already accounted for weight gain, does knowing a runner's sex further improve our ability to predict whether they will experience hyponatremia? To model the probability p as a function of the two explanatory variables, we fit a model of the form

ln[p̂/(1 − p̂)] = β̂0 + β̂1x1 + β̂2x2

where x1 designates weight gain and x2 represents sex. The estimated logistic regression equation based on the sample of runners is

ln[p̂/(1 − p̂)] = −2.3009 + 0.7026x1 + 0.8695x2.

As we see in Table 19.2, the coefficients of both weight gain and sex have decreased somewhat now that another explanatory variable has been added to the model. However, both are still significantly different from 0 at the 0.05 level. The coefficient of weight gain tells us that, holding sex constant, a one pound increase in weight gain is associated with a 0.7026 increase in the log odds of hyponatremia. We can also say that the odds ratio for hyponatremia associated with a 1 pound increase in weight gain is

OR̂ = e^(0.7026) = 2.02,

adjusting for sex. Furthermore, holding weight gain constant, a female runner has a log odds of hyponatremia which is 0.8695 higher than for a male. The odds ratio for hyponatremia for females versus males is

OR̂ = e^(0.8695) = 2.39,

adjusting for weight gain. To estimate the probability that a male runner who gains one pound during the race will develop hyponatremia, we substitute the values x1 = 1 pound and x2 = 0 into the estimated equation to find

ln[p̂/(1 − p̂)] = −2.3009 + 0.7026(1) + 0.8695(0) = −1.5983.
Taking the antilogarithm of each side results in

p̂/(1 − p̂) = e^(−1.5983) = 0.2022,

and solving for p̂ we have

p̂ = 0.2022/(1 + 0.2022) = 0.2022/1.2022 = 0.168.

The estimated probability is 0.168. For a female runner who loses one pound,

ln[p̂/(1 − p̂)] = −2.3009 + 0.7026(−1) + 0.8695(1) = −2.1340.

Therefore,

p̂/(1 − p̂) = e^(−2.1340) = 0.1184,

and

p̂ = 0.1184/1.1184 = 0.106.

While evaluation of the logistic regression model is beyond the scope of this text [288], we note that one way to judge the goodness-of-fit of a model to the observed sample data is to stratify the sample into subgroups – as we do below, looking at categories of weight gain and sex – and compare the observed proportion of cases with hyponatremia within each of the subgroups to the predicted probability of hyponatremia based on the model.

Weight Gain       Observed Proportion      Predicted Proportion
(pounds)          with Hyponatremia        with Hyponatremia
                  Males     Females        Males     Females
≤ −1.36           0.015     0.154          0.031     0.074
−1.35 to 0.00     0.044     0.129          0.051     0.120
≥ 0.01            0.233     0.339          0.196     0.382
Here we note that the model seems to work quite well, generating predicted probabilities of hyponatremia which are fairly close to the observed values.

Summary: Multiple Logistic Regression Model

Logistic function:
    p = e^(β0 + β1x1 + β2x2 + · · · + βqxq) / (1 + e^(β0 + β1x1 + β2x2 + · · · + βqxq))
    where p is the population probability of success

Population regression line:
    ln[p/(1 − p)] = β0 + β1x1 + β2x2 + · · · + βqxq
    where β0, β1, β2, . . . , βq are the model coefficients

Fitted equation:
    ln[p̂/(1 − p̂)] = β̂0 + β̂1x1 + β̂2x2 + · · · + β̂qxq
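The multiple model is fit exactly like the simple one. A hedged R sketch follows, again using the hypothetical names marathon, hypo, wtgain, and female (coded 1 for female, 0 for male, matching x2 above).

fit2 <- glm(hypo ~ wtgain + female, family = binomial, data = marathon)
exp(coef(fit2))    # adjusted odds ratios, about 2.02 and 2.39

# Predicted probability for a male runner who gains 1 pound:
predict(fit2, newdata = data.frame(wtgain = 1, female = 0),
        type = "response")    # about 0.168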
19.4 Simpson's Paradox
Now consider the data from a study investigating the relationship between current smoking status and the presence of aortic stenosis, a narrowing or stricture of the aorta that impedes the flow of blood to the body [289].

                       Smoker
Aortic Stenosis   Yes    No   Total
Yes                51    54     105
No                 43    67     110
Total              94   121     215
Using aortic stenosis as the dichotomous response variable with 1 representing presence of disease and 0 no disease, and smoking status as the single dichotomous explanatory variable where 1 denotes a smoker and 0 a nonsmoker, the estimated logistic regression equation is

ln[p̂/(1 − p̂)] = −0.2157 + 0.3863x.

Therefore, the estimated odds ratio of aortic stenosis for smokers versus nonsmokers is

OR̂ = e^(0.3863) = 1.47.
The odds of developing aortic stenosis for smokers are 1.47 times the odds for nonsmokers. Since biological sex is known to be associated with both smoking status and aortic stenosis, we suspect that it could influence the observed relationship between them. Therefore, we might begin an analysis by examining the association in males and females separately, using a stratified analysis.

                       Males
                       Smoker
Aortic Stenosis   Yes    No   Total
Yes                37    25      62
No                 24    20      44
Total              61    45     106
For males, the estimated logistic regression equation is

ln[p̂/(1 − p̂)] = 0.2231 + 0.2097x

and the odds ratio of aortic stenosis for smokers relative to nonsmokers is

OR̂ = e^(0.2097) = 1.23.

                      Females
                       Smoker
Aortic Stenosis   Yes    No   Total
Yes                14    29      43
No                 19    47      66
Total              33    76     109
TABLE 19.3
Logistic regression of aortic stenosis on smoking status and sex

Variable         Coefficient   Standard Error   Test Statistic   p-value
Smoking status        0.1946           0.2903             0.67     0.503
Male sex              0.7199           0.2881             2.50     0.012
Intercept            −0.4882           0.2159            −2.26     0.024
For females, the estimated logistic regression equation is

ln[p̂/(1 − p̂)] = −0.4829 + 0.1775x

and the odds ratio of aortic stenosis for smokers relative to nonsmokers is

OR̂ = e^(0.1775) = 1.19.
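The paradox can be reproduced directly from the counts in the three contingency tables above; a brief R sketch:

# Crude odds ratio, males and females combined:
(51 * 67) / (54 * 43)     # 1.47

# Stratum-specific odds ratios:
(37 * 20) / (25 * 24)     # males:   1.23
(14 * 47) / (29 * 19)     # females: 1.19

# Both stratified estimates are smaller than the crude estimate because
# sex is associated with both smoking and aortic stenosis.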
We observe a similar trend in each subgroup of the population; for both males and females, the odds of developing aortic stenosis are higher among smokers than they are among nonsmokers. Note, however, that both odds ratios are lower than the odds ratio estimated when males and females are combined. If the effect of sex is ignored, the strength of the association between smoking and aortic stenosis appears greater than it is for either males or females alone. This phenomenon is an example of Simpson's paradox. Simpson's paradox occurs when the magnitude or direction of the relationship between two variables is influenced by the presence of a third factor. In this case, sex is a confounder in the relationship between exposure and disease; males are more likely to smoke than females (58% versus 30%), and are also more likely to have aortic stenosis (58% versus 39%). Therefore, failure to account for the effect of sex causes the magnitude of the association to appear greater than it actually is. When evaluating the relationship between smoking status and aortic stenosis, we can adjust for sex by including it as a second explanatory variable in the logistic regression model. As seen in Table 19.3, representing smoking status by x1 and sex by x2 – and coding the indicator of sex to take the value 1 for a male and 0 for a female – the estimated model is

ln[p̂/(1 − p̂)] = −0.4882 + 0.1946x1 + 0.7199x2.

The odds of aortic stenosis among smokers relative to nonsmokers, adjusting for sex, are

OR̂ = e^(0.1946) = 1.21.

19.5 Interaction Terms
Just as we saw for linear regression models, interaction terms can be used in logistic regression models to explore whether one explanatory variable has a different relationship with the response
TABLE 19.4
Logistic regression of hyponatremia on weight gain, female sex, and the weight gain × sex interaction

Variable            Coefficient   Standard Error   Test Statistic   p-value
Weight gain              0.6868           0.1503             4.57   < 0.001
Female sex               0.8550           0.3241             2.64     0.008
Weight gain × sex        0.0366           0.2301             0.16     0.874
Intercept               −2.2954           0.2359            −9.73   < 0.001
depending on the value of a second explanatory variable. Returning to the example exploring risk factors for hyponatremia among marathon runners, a 1 pound increase in weight gain might have a different relationship with the outcome for females versus males. An interaction term is generated by multiplying together the outcomes of two explanatory variables xi and xj to create a new variable xixj, which is then included in the model. Suppose we wish to add the interaction between weight gain during the race and sex to the regression model that already contains these two variables individually. We multiply the outcomes of weight gain, x1, by the outcomes of sex, x2, to create a new variable x1x2. (Because x2 can assume only two possible values, the interaction term would be equal to 0 when x2 = 0 and equal to x1 when x2 = 1.) In the population of marathon runners, this new explanatory variable would have coefficient β12. Based on the sample of 488 runners, the fitted logistic regression model is

ln[p̂/(1 − p̂)] = −2.2954 + 0.6868x1 + 0.8550x2 + 0.0366x1x2,

as seen in Table 19.4. Testing the null hypothesis H0 : β12 = 0 versus the alternative HA : β12 ≠ 0,
we are unable to reject the null hypothesis at the 0.05 level of significance (p = 0.874). We conclude that there is no evidence of an interaction; the sample does not provide evidence that weight gain has a different relationship with diagnosis of hyponatremia for female runners versus male runners. We might also wish to know whether the relationship between smoking status and aortic stenosis is different for males and females. Here we would add the interaction between smoking status and sex to the logistic regression model. We multiply the outcomes of smoking status, x 1 , by the outcomes of sex, x 2 , to again create a new variable x 1 x 2 . This explanatory variable would have coefficient β12 . Based on the model in Table 19.5, we are unable to reject the null hypothesis that β12 = 0. The data do not provide evidence that the relationship between aortic stenosis and smoking status differs for males and females. (This is not surprising; recall that the estimated odds ratios were 1.23 for males and 1.19 for females.)
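In R, an interaction can be added either with the * formula operator or by constructing the product variable explicitly; the names below are the same hypothetical ones used in the earlier sketches.

# x1 * x2 in a formula expands to x1 + x2 + x1:x2.
fit3 <- glm(hypo ~ wtgain * female, family = binomial, data = marathon)
summary(fit3)   # the wtgain:female row gives the Wald test of H0: b12 = 0

# Equivalent, with the product built by hand:
marathon$wt_female <- marathon$wtgain * marathon$female
fit3b <- glm(hypo ~ wtgain + female + wt_female,
             family = binomial, data = marathon)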
19.6 Model Selection
The process of deciding which explanatory variables to include in a multivariable logistic regression model parallels that used for linear regression. We generally prefer parsimonious models containing
TABLE 19.5
Logistic regression of aortic stenosis on smoking status and sex

Variable         Coefficient   Standard Error   Test Statistic   p-value
Smoking status        0.1775           0.4241             0.42     0.676
Male sex              0.7060           0.3818             1.85     0.064
Smoking × sex         0.0323           0.5818             0.06     0.956
Intercept            −0.4829           0.2361            −2.04     0.041
only those explanatory variables that help us to predict the probability of the outcome. To create a model, we might consider the all possible models approach, forward selection, backward elimination, or stepwise selection. Regardless of the approach used, if the goal is predicting the probability of the outcome, all explanatory variables in the final model will be statistically significant at some specified level. Alternatively, if we need to adjust for one or more confounding variables when examining the relationship between an explanatory variable and a response, we would include these confounders in the model regardless of their statistical significance.
19.7 Further Applications
Suppose we are interested in identifying factors that influence the probability that a low birth weight infant will experience a germinal matrix hemorrhage, a particular type of hemorrhage in the brain. Germinal matrix hemorrhage is a dichotomous random variable that takes the value 1 if this outcome occurs and 0 if it does not. We use the sample of 100 low birth weight infants born in Boston, Massachusetts, to estimate the probability of a hemorrhage [81]. In the group as a whole, 15 infants experienced the outcome, so p̂ = 15/100 = 0.15. We would like to determine whether the head circumference of an infant is associated with the probability that he or she will suffer a brain hemorrhage. Because the response is dichotomous, we fit a logistic regression model of the form

ln[p̂/(1 − p̂)] = β̂0 + β̂1x1

where x1 represents head circumference. Table 19.6 shows the relevant output from Stata, and Table 19.7 contains output from R. The fitted equation is

ln[p̂/(1 − p̂)] = 1.193 − 0.1117x1.

The coefficient β̂1 = −0.1117 implies that for each 1 cm increase in head circumference, the log odds of experiencing a hemorrhage decrease by 0.1117 on average. The odds ratio associated with each 1 cm increase in head circumference is

OR̂ = e^(−0.1117) = 0.894.
TABLE 19.6
Stata output displaying the logistic regression of germinal matrix hemorrhage on head circumference

Logistic regression                     Number of obs  =      100
                                        LR chi2(1)     =     0.97
                                        Prob > chi2    =   0.3258
Log likelihood = -41.788106             Pseudo R2      =   0.0114

------------------------------------------------------------------------
     gmh |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
---------+--------------------------------------------------------------
headcirc | -.1117081   .1152569   -0.97   0.332    -.3376075    .1141913
   _cons |  1.192854    3.00632    0.40   0.692    -4.699425    7.085133
------------------------------------------------------------------------
TABLE 19.7
R output displaying the logistic regression of germinal matrix hemorrhage on head circumference

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7408  -0.6065  -0.5472  -0.4929   2.1764

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.1929     3.0063   0.397    0.692
headcirc     -0.1117     0.1153  -0.969    0.332

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 84.542  on 99  degrees of freedom
Residual deviance: 83.576  on 98  degrees of freedom
AIC: 87.576
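A sketch of the R call that would produce output of this form, assuming the lowbwt data frame uses the variable names described in the review exercises (the 0/1 outcome grmhem and the explanatory variable headcirc):

fit_gmh <- glm(grmhem ~ headcirc, family = binomial, data = lowbwt)
summary(fit_gmh)

# Predicted probability of hemorrhage for a 28 cm head circumference:
predict(fit_gmh, newdata = data.frame(headcirc = 28),
        type = "response")    # about 0.126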
TABLE 19.8
Predicted probabilities of experiencing a germinal matrix hemorrhage for the first ten infants in the sample

Head Circumference (cm)   Hemorrhage   Predicted p̂
         29                    0          0.114
         23                    1          0.202
         28                    0          0.126
         27                    0          0.139
         26                    0          0.153
         26                    1          0.153
         27                    0          0.139
         28                    0          0.126
         28                    0          0.126
         26                    0          0.153
We see that the odds of germinal matrix hemorrhage appear to decrease as head circumference increases. However, the null hypothesis H0 : β1 = 0 fails to be rejected at the 0.05 level of significance (p = 0.33). Therefore, this sample does not provide evidence that the probability of a brain hemorrhage differs depending on a child's head circumference. If we now wish to estimate the probability that an infant with a particular head circumference will suffer a germinal matrix hemorrhage – keeping in mind that any differences in probabilities for various values of head circumference are not statistically significant at the 0.05 level – we substitute the appropriate value of x1 into the estimated equation and solve for p̂. Given a child whose head circumference is 28 cm, for instance,

ln[p̂/(1 − p̂)] = 1.193 − 0.1117(28) = −1.9346.

Therefore,

p̂/(1 − p̂) = e^(−1.9346) = 0.1445,

and

p̂ = 0.1445/(1 + 0.1445) = 0.1445/1.1445 = 0.126.

The predicted probabilities of experiencing a hemorrhage for the first ten infants in the sample are listed in Table 19.8. Note that the calculated probabilities decrease slightly as head circumference increases, although as previously noted these differences are not statistically significant. We now attempt to determine whether the sex of an infant is associated with the probability that they experience a germinal matrix hemorrhage. To do this, we fit a logistic regression model of the form

ln[p̂/(1 − p̂)] = β̂0 + β̂2x2
TABLE 19.9
Stata output displaying the logistic regression of germinal matrix hemorrhage on sex

Logistic regression                     Number of obs  =      100
                                        LR chi2(2)     =     3.00
                                        Prob > chi2    =   0.2226
Log likelihood = -40.768541             Pseudo R2      =   0.0355

------------------------------------------------------------------------
     gmh |     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
---------+--------------------------------------------------------------
     sex | -.8938179   .6230019   -1.43   0.151    -2.114879    .3272433
   _cons | -1.408767     .33635   -4.19   0.000    -2.068001   -.7495334
------------------------------------------------------------------------

------------------------------------------------------------------------
  grmhem | Odds Ratio  Std. Err.      z    P>|z|    [95% Conf. Interval]
---------+--------------------------------------------------------------
     sex |  .4090909   .2548644   -1.43   0.151     .1206479    1.387139
   _cons |  .2444444   .0822189   -4.19   0.000     .1264383     .472587
------------------------------------------------------------------------
where x2 takes the value 1 for a male and 0 for a female. Table 19.9 shows two different versions of output from Stata, one with the estimated coefficient β̂2, and the other with the estimated odds ratio. Table 19.10 contains the output from R. The fitted equation is

ln[p̂/(1 − p̂)] = −1.4088 − 0.8938x2.

Because the estimated coefficient of sex is negative, the log odds of experiencing a hemorrhage is lower for males than for females, with odds ratio

OR̂ = e^(−0.8938) = 0.41.
Males appear to be less likely to suffer a hemorrhage than females. However, a test of the null hypothesis H0 : β2 = 0 results in a p-value of 0.15. At the 0.05 level of significance, the probability of a hemorrhage cannot be said to vary by sex.
TABLE 19.10
R output displaying the logistic regression of germinal matrix hemorrhage on sex

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.6613  -0.6613  -0.4366  -0.4366   2.1899

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.4088     0.3363  -4.188 2.81e-05 ***
sex          -0.8938     0.6230  -1.435    0.151
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 84.542  on 99  degrees of freedom
Residual deviance: 82.294  on 98  degrees of freedom
AIC: 86.294
19.8 Review Exercises
1. When the response variable of interest is dichotomous rather than continuous, why is it not appropriate to fit a linear regression model using the probability of "success" as the outcome?

2. What is the logistic function?

3. How does logistic regression differ from linear regression?

4. How can a logistic regression model be used to estimate the odds ratio of some outcome event for an exposed group versus an unexposed group?

5. Explain Simpson's paradox.

6. In a study investigating maternal risk factors for congenital syphilis, syphilis is a dichotomous response variable where 1 represents the presence of disease in a newborn and 0 its absence [290]. The estimated coefficients from a logistic regression model containing the explanatory variables cocaine or crack use, marital status, number of prenatal visits to a doctor, alcohol use, and level of education are listed below. The estimated intercept is not shown.

Variable                    Coefficient β̂
Cocaine/crack use                1.354
Marital status                   0.779
Number of prenatal visits       −0.098
Alcohol use                      0.723
Level of education               0.298

(a) As an expectant mother's number of prenatal visits to the doctor increases, does the probability that her child will be born with congenital syphilis increase or decrease? Explain.
(b) Marital status is a dichotomous random variable for which the value 1 indicates that a woman is unmarried and 0 that she is married. What is the odds ratio for the outcome that a newborn will suffer from congenital syphilis for unmarried versus married mothers, adjusting for the other risk factors in the model?
(c) Cocaine or crack use is also a dichotomous random variable; the value 1 indicates that a woman used drugs during her pregnancy and 0 that she did not. What is the estimated odds ratio that a child will be born with congenital syphilis for women who used cocaine or crack versus those who did not, adjusting for the other risk factors in the model?
(d) The estimated coefficient of cocaine or crack use has standard error 0.162. Construct a 95% confidence interval for the population odds ratio. What can you say about the relationship between drug use and the presence of congenital syphilis in a newborn, adjusting for the other risk factors in the model?
7. A study was conducted to investigate intravenous drug use among high school students in the United States [291]. Drug use is characterized as a dichotomous random variable where 1 indicates that a student has injected drugs within the past year and 0 that they have not. Factors that might be related to drug use are instruction about the human immunodeficiency virus (hiv) in school, age of the student, sex, and general knowledge about hiv, including the various modes of transmission and ways to reduce risk. The estimated coefficients and standard errors from a logistic regression model containing all of these explanatory variables as well as the interaction between instruction and sex are displayed below.

Variable                  Coefficient   Standard Error
Intercept                     −1.183            0.859
hiv instruction                0.039            0.421
Age                           −0.164            0.092
Sex male                       1.212            0.423
hiv knowledge                 −0.187            0.048
hiv instruction × sex         −0.663            0.512
(a) Conduct tests of the null hypotheses that each of the coefficients in the population regression equation is equal to 0, adjusting for the others. At the 0.05 level of significance, which of the explanatory variables are associated with the probability of intravenous drug use in the past year?
(b) As a student becomes older, does the probability that they have used intravenous drugs in the past year increase or decrease?
(c) The dichotomous random variable sex is coded so that 1 represents a male and 0 a female. What are the estimated odds of injecting drugs for males relative to females?
(d) Does hiv instruction have a different relationship with the probability of intravenous drug use for males versus females? Explain.

8. A study was conducted to examine the association between consumption of caffeinated coffee and nonfatal myocardial infarction among males between the ages of 21 and 54 years. The dataset coffee contains information on a sample of 2496 adult males [292]. A binary variable indicating whether a study participant drinks caffeinated coffee is saved under the variable name coffee, with the value 1 indicating that he does drink coffee and 0 that he does not. The response occurrence of myocardial infarction is saved under the name mi, with 1 indicating that this event did occur and 0 that it did not.
(a) Fit a logistic regression model of the form

    ln[p̂/(1 − p̂)] = β̂0 + β̂1x1

where the response is occurrence of myocardial infarction, and x1 represents coffee consumption. Interpret the odds ratio of myocardial infarction for males who drink caffeinated coffee versus those who do not.
(b) Suppose you are concerned that smoking status is a confounder in the relationship between coffee consumption and myocardial infarction. The variable smoke takes the value 1 if a male is a self-reported smoker and 0 otherwise. Fit separate logistic regression models for smokers and nonsmokers. Within each subgroup, interpret the
odds ratio of myocardial infarction for males who drink coffee versus those who do not.
(c) Using the entire sample, fit a logistic regression model to examine the relationship between coffee consumption and the occurrence of myocardial infarction, adjusting for smoking status. Interpret the odds ratio associated with coffee consumption.
(d) At the 0.05 level of significance, test the null hypothesis that the odds ratio associated with coffee consumption is equal to 1. What do you conclude?
(e) Construct a 95% confidence interval for the odds ratio of occurrence of myocardial infarction for coffee drinkers versus nondrinkers, adjusting for smoking status.

9. A group of children 5 years of age and younger who were free of respiratory problems were enrolled in a cohort study examining the relationship between parental smoking and the subsequent development of asthma. The relevant data are contained in the dataset asthma [293]. The variable smoke takes the value 1 if a child's mother smoked ≥ 1/2 pack of cigarettes per day, and 0 if the mother smoked < 1/2 pack per day. The variable asthma takes the value 1 for a diagnosis of asthma before the age of 12 years, and 0 otherwise.
(a) The relationship between maternal cigarette smoking status and a diagnosis of asthma before the age of 12 years was initially examined for boys and girls separately. The variable sex takes the value 1 for boys, and 0 for girls. Estimate the odds ratio for asthma for boys whose mothers smoke ≥ 1/2 pack of cigarettes per day versus those whose mothers smoke less.
(b) Estimate the odds ratio for asthma for girls whose mothers smoke ≥ 1/2 pack of cigarettes per day versus those whose mothers smoke less.
(c) For the cohort of children as a whole, estimate the odds ratio for a diagnosis of asthma among those whose mothers smoke ≥ 1/2 pack of cigarettes per day versus those whose mothers smoke less, adjusting for sex. Construct a 95% confidence interval. What do you conclude about this association?
(d) If you were concerned that the odds ratio for a diagnosis of asthma among children whose mothers smoke ≥ 1/2 pack of cigarettes per day versus those whose mothers smoke less is actually different for boys and girls, how would you evaluate this? Explain.
10. Intimate partner violence (ipv) toward a woman either before or during her pregnancy has been documented as a risk factor for the health of both the mother and her unborn child. A study conducted in the postnatal wards of a public hospital in Bangladesh examined the relationship between experience of ipv by a woman and the birth weight of the infant [80]. Data are contained in the dataset ipv. Low birth weight was defined as < 2.5 kilograms, and normal birth weight as ≥ 2.5 kilograms; this information is saved under the variable name low_bwt. A binary variable indicating whether a woman experienced physical intimate partner violence during her pregnancy is saved as ipv_p.

(a) Using low birth weight as the response, fit a logistic regression model of the form

ln[ p̂/(1 − p̂) ] = β̂0 + β̂1 x1

where x1 represents physical ipv. Estimate and interpret β̂1.
(b) What is the odds ratio of giving birth to a low birth weight infant for mothers who experienced physical ipv versus those who did not?
(c) Test the null hypothesis that the true population odds ratio is equal to 1. What is the value of the test statistic? What is its probability distribution?
(d) What is the p-value of the test?
(e) Do you reject or fail to reject the null hypothesis? What do you conclude?
(f) What is the predicted probability of a low birth weight infant for a mother who experienced physical ipv? For a mother who did not?
(g) Estimate a logistic regression model with outcome low birth weight and the single continuous explanatory variable mother’s age. Interpret the odds ratio associated with mother’s age.
(h) Test the null hypothesis that the population odds ratio for low birth weight associated with a 1 year increase in mother’s age is equal to 1. What is the p-value for this test? What do you conclude?
(i) What is the predicted probability of having a low birth weight infant for an expectant mother who is 22 years of age?
(j) Is the relationship between physical ipv and low birth weight statistically significant after adjusting for mother’s age?

11. The dataset lowbwt contains information for the sample of 100 low birth weight infants born in Boston, Massachusetts [81]. The variable grmhem is a dichotomous random variable indicating whether an infant experienced a germinal matrix hemorrhage. The value 1 designates that a hemorrhage occurred and 0 that it did not. The infants’ five-minute apgar scores are saved under the name apgar5, and indicators of preeclampsia – where 1 represents a diagnosis of preeclampsia during pregnancy for the child’s mother and 0 no such diagnosis – under the variable name preeclampsia.

(a) Using germinal matrix hemorrhage as the response, fit a logistic regression model of the form

ln[ p̂/(1 − p̂) ] = β̂0 + β̂1 x1

where x1 is five-minute apgar score. Interpret β̂1, the estimated coefficient of apgar score.
(b) If a particular child has a five-minute apgar score of 3, what is the predicted probability that this child will experience a brain hemorrhage? What is the probability if the child’s score is 7?
(c) At the 0.05 level of significance, test the null hypothesis that the population parameter β1 is equal to 0. What do you conclude?
(d) Now fit the regression model

ln[ p̂/(1 − p̂) ] = β̂0 + β̂2 x2

where x2 represents preeclampsia status. Interpret β̂2, the estimated coefficient of preeclampsia.
(e) For a child whose mother was diagnosed with preeclampsia during pregnancy, what is the predicted probability of experiencing a germinal matrix hemorrhage? What is the probability for a child whose mother was not diagnosed?
(f) What are the estimated odds of suffering a germinal matrix hemorrhage for children whose mothers were diagnosed with preeclampsia relative to children whose mothers were not?
(g) Construct a 95% confidence interval for the population odds ratio associated with preeclampsia status. Does this interval contain the value 1? What does this tell you?
12. A randomized study was conducted to compare survival to hospital discharge among individuals experiencing out-of-hospital cardiac arrest when bystanders were instructed to perform chest compression plus rescue breathing versus chest compression alone [271]. Data from this study are saved in the dataset cpr. Dispatcher instructions are saved under the name procedure, and survival to hospital discharge under the name survival.

(a) Construct a 2 × 2 table of survival to hospital discharge by dispatcher instructions. Among individuals receiving chest compression plus rescue breathing, what proportion survived to hospital discharge? Among those receiving chest compression only, what proportion survived to hospital discharge?
(b) Using the table, estimate the odds ratio of survival to discharge for individuals receiving chest compression plus rescue breathing versus chest compression alone.
(c) Now fit a logistic regression model with response survival to hospital discharge and binary explanatory variable dispatcher instructions. Using the model, estimate the odds ratio of survival to discharge for individuals receiving chest compression plus rescue breathing versus chest compression alone. Do you get the same value?
(d) At the 0.05 level of significance, test the null hypothesis that the odds ratio of surviving to hospital discharge for individuals receiving chest compression plus rescue breathing versus chest compression alone is equal to 1. What is the p-value?
(e) What do you conclude?
(f) Calculate a 95% confidence interval for the odds ratio.
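In practice, all of the models in these exercises can be fit with standard software. As one illustration – a minimal sketch of ours, not code from the text – the model in Exercise 12 might be fit in R as follows, assuming the cpr data have been read into a data frame named cpr with the binary variables survival and procedure described above:

# Fit the logistic regression of survival to discharge on dispatcher
# instructions, then examine the coefficients on the odds ratio scale.
fit <- glm(survival ~ procedure, family = binomial, data = cpr)
summary(fit)                # Wald tests for each coefficient

exp(coef(fit))              # estimated odds ratio for procedure
exp(confint.default(fit))   # Wald-based 95% confidence intervals,
                            # exponentiated to the odds ratio scale

The same pattern – glm() with family = binomial, followed by exponentiation of the estimated coefficients – applies to the other exercises in this set.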
20 Survival Analysis
CONTENTS
20.1 Life Table Method . . . . . . . . . . . . . . . . . . . . . . . . . 481
20.2 Product-Limit Method . . . . . . . . . . . . . . . . . . . . . . . 487
20.3 Log-Rank Test . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
20.4 Cox Proportional Hazards Model . . . . . . . . . . . . . . . . . . 495
20.5 Further Applications . . . . . . . . . . . . . . . . . . . . . . . 496
20.6 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 505
In Chapter 4 we use the life table to consider the problem of quantifying how long individuals in a population survive as that population ages. More generally, the variable we are interested in may be the length of time from an initial start point until the occurrence of some specified event. This is often the time from birth until death, but might also be the time from transplant surgery until the new organ fails, or the time from start of maintenance therapy for a patient whose cancer has gone into remission until the relapse of disease. The time interval between the start point and the subsequent event – often called a failure event – is known as the survival time. The analysis of time to event data generally focuses on estimating the probability that an individual will survive beyond a given length of time.

One common occurrence when working with survival data is that not all individuals are observed until their respective times of failure. Because the time interval between the start point and the subsequent failure event can be quite long, the data may be analyzed before the failure event has occurred in all study subjects. Not everyone has died, or has experienced organ failure, or has had their cancer return. Others who either move away before the study is complete or who refuse to participate any longer are said to be lost to follow-up. The incomplete observation of time to failure is known as censoring. The presence of censored observations distinguishes the analysis of survival data from other analyses of continuous measurements.

A distribution of survival times can be characterized by a survival function, represented by S(t). S(t) is defined as the probability that an individual survives beyond time t. Equivalently, for a given t, S(t) specifies the proportion of individuals who have not yet failed at that time. If T is a random variable representing survival time, then S(t) = P(T > t). The graph of S(t) versus t is called a survival curve.

Survival curves have been used for many years. A study published in 1938 investigated the effects of tobacco on human longevity among White males over the age of 30 [294]. Three categories of males were considered: nonusers of tobacco, moderate smokers, and heavy smokers. The results of the study are presented in Figure 20.1. As is evident from the graph, the smoking of tobacco is associated with a shorter duration of life; in addition, longevity is affected by the amount of tobacco used. It seems that these results were ignored not only at the time of publication, but in subsequent years as well.
FIGURE 20.1 Survival curves for three categories of White males: nonusers of tobacco, moderate smokers, and heavy smokers, 1938
Summary: Survival Function

Term                 Notation
Survival function    S(t) = P(T > t) is the probability an individual survives beyond time t
Survival curve       Graph of S(t) versus t

20.1 Life Table Method
In Chapter 4 we describe the period life table as a means of quantifying the life expectancy of a population. The period life table is created based on observing a cross-section of the population over a short period of time. The life table method groups survival times for members of a population into intervals of fixed length, often one year. Using slightly different notation than in Chapter 4 – we replace x by t – the first three columns of the table enumerate: age, the time interval starting at age t; mortality rate, the proportion of individuals alive at the beginning of the interval who fail prior to the end of the interval (qt), also known as the hazard function; and persons alive, the number of individuals alive at the beginning of the age interval (lt). Then, if l0 is the number of people alive at time 0 and lt the number still alive at time t, the next column of the table is the proportion of individuals who have not yet failed at time t, that is,

S(t) = lt / l0 .

Table 20.1 contains a portion of the United States life table for 2016 shown in Chapter 4 [113]. The fourth column of the table is the survival function at time t, where

S(t) = lt / 100,000 .
The corresponding survival curve is plotted in Figure 20.2. Table 20.1 is an example of a period, or current, life table. It is constructed from data gathered over a relatively short period of time within each age interval. However, the persons represented in one age interval are not the same as those followed in each subsequent interval. The life table method can also be applied to a sample of individuals drawn from a population. Ideally, we would prefer to work with a cohort life table, which tracks a group of people longitudinally over their entire lifetimes. This method is not practical for large population studies, since it would involve following a sizable group of individuals for over 100 years. It is often used, however, in smaller studies in which patients are enrolled sequentially and followed for shorter periods of time. Furthermore, these methods are often applied to samples rather than entire populations; inference is then made based on what is observed in the sample.

Consider the data presented in Table 20.2. A total of 12 hemophiliacs, all 40 years of age or younger at the time of hiv seroconversion, were followed from the time of primary aids diagnosis between 1984 and 1989 until death [295]. In all cases, transmission of hiv had occurred through infected blood products. We would actually prefer that our starting point be the time at which an individual contracted aids rather than the time of diagnosis, but this information was not known. For most of the patients, treatment was not available. What are we able to infer about the survival of the population of hemophiliacs diagnosed with aids in the mid to late 1980s on the basis of this sample of 12 individuals drawn from that population?
TABLE 20.1
United States life table for individuals less than 30 years of age, 2016

Age (years)   Mortality Rate (qt)   Persons Alive (lt)   S(t)
0–1           0.005864              100,000              1.0000
1–2           0.000396               99,414              0.9941
2–3           0.000262               99,374              0.9937
3–4           0.000197               99,348              0.9935
4–5           0.000158               99,329              0.9933
5–6           0.000151               99,313              0.9931
6–7           0.000135               99,298              0.9930
7–8           0.000121               99,285              0.9929
8–9           0.000108               99,273              0.9927
9–10          0.000095               99,262              0.9926
10–11         0.000089               99,252              0.9925
11–12         0.000095               99,244              0.9924
12–13         0.000122               99,234              0.9923
13–14         0.000175               99,222              0.9922
14–15         0.000249               99,205              0.9921
15–16         0.000328               99,180              0.9918
16–17         0.000410               99,147              0.9915
17–18         0.000502               99,107              0.9911
18–19         0.000603               99,057              0.9906
19–20         0.000706               98,997              0.9900
20–21         0.000814               98,927              0.9893
21–22         0.000914               98,847              0.9885
22–23         0.000994               98,757              0.9876
23–24         0.001048               98,658              0.9866
24–25         0.001083               98,555              0.9856
25–26         0.001112               98,448              0.9845
26–27         0.001143               98,339              0.9834
27–28         0.001177               98,226              0.9823
28–29         0.001216               98,111              0.9811
29–30         0.001260               97,992              0.9799
FIGURE 20.2 Survival curve for the United States population, 2016
TABLE 20.2
Interval from primary aids diagnosis until death for a sample of 12 hemophiliac patients at most 40 years of age at hiv seroconversion

Patient Number   Survival (months)
 1                 2
 2                 3
 3                 6
 4                 6
 5                 7
 6                10
 7                15
 8                15
 9                16
10                27
11                30
12                32
Using the life table method, we could summarize the data for the 12 patients as in Table 20.3. Note that the first column contains survival time after diagnosis rather than age. A survival time of t months means that an individual survived until time t and then died immediately after. Since 1 out of the 12 individuals in the initial cohort died at 2 months, the proportion of patients dying in the interval 2–3 months is estimated as

q2 = 1/12 = 0.0833.

One of the remaining 11 individuals died at 3 months; consequently,

q3 = 1/11 = 0.0909.

Similarly, 2 of the remaining 10 patients died at six months, and

q6 = 2/10 = 0.2000.

Recall from Chapter 4 that qt is also called the hazard function. In time intervals not containing a death, such as 0–1 months and 1–2 months, the estimated hazard function is equal to 0. The fourth column of Table 20.3 contains the proportion of individuals who do not fail during a given interval. In the interval 2–3 months, for example, the proportion of patients who died is q2 = 0.0833, and thus the proportion who survived is

1 − q2 = 1 − 0.0833 = 0.9167.

In time intervals not containing a death, the estimated proportion of patients who do not fail is 1. The proportions of individuals who do not fail in each interval can be used to estimate the survival function. Note that since no one in the sample died at time 0 months, the estimate of S(0) = P(T > 0) is Ŝ(0) = 1. Subsequent values of Ŝ(t) can be calculated using the multiplicative rule of probability. For example, let A be the event that a patient is alive during the interval 0–1 months, which has probability Ŝ(0), and let B be the event that the person survives at time 1 month given that they were alive up to that point, which has probability 1 − q1. The event that the patient survives longer than 1 month can then be represented by A ∩ B. The multiplicative rule of probability states that

Ŝ(1) = P(T > 1) = P(A ∩ B) = P(A) P(B | A) = Ŝ(0)(1 − q1) = (1.0000)(1.0000) = 1.0000.

Similarly, the probability that a patient survives longer than 2 months is the probability of being alive during the interval 1–2 months multiplied by the probability of not failing at time 2 months, given that they were alive up until this point, or

Ŝ(2) = Ŝ(1)(1 − q2) = (1.0000)(0.9167) = 0.9167.

The probability of living longer than 3 months is estimated by

Ŝ(3) = Ŝ(2)(1 − q3) = (0.9167)(0.9091) = 0.8333,

and so on.
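The recursive arithmetic above is easy to reproduce in software. The following R lines are a minimal sketch of the life table calculation for these 12 survival times – our own illustration, not code from the text. Because the times are whole months, the deaths and numbers at risk can be tallied at the start of each one-month interval:

months <- c(2, 3, 6, 6, 7, 10, 15, 15, 16, 27, 30, 32)  # survival times
t0 <- 0:32                                           # start of each interval
deaths  <- sapply(t0, function(t) sum(months == t))  # failures at time t
at_risk <- sapply(t0, function(t) sum(months >= t))  # alive entering time t
q <- deaths / at_risk           # interval mortality rates (hazard function)
S <- cumprod(1 - q)             # multiplicative rule: estimated S(t)
data.frame(interval = paste(t0, t0 + 1, sep = "-"),
           q = round(q, 4), S = round(S, 4))         # reproduces Table 20.3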
TABLE 20.3
Life table method of estimating S(t) for hemophiliac patients at most 40 years of age at hiv seroconversion

Months Since    Mortality   Persons    Survival
Diagnosis t     Rate qt     Alive lt   Rate 1 − qt   Ŝ(t)
0–1             0.0000      12         1.0000        1.0000
1–2             0.0000      12         1.0000        1.0000
2–3             0.0833      12         0.9167        0.9167
3–4             0.0909      11         0.9091        0.8333
4–5             0.0000      10         1.0000        0.8333
5–6             0.0000      10         1.0000        0.8333
6–7             0.2000      10         0.8000        0.6667
7–8             0.1250       8         0.8750        0.5833
8–9             0.0000       7         1.0000        0.5833
9–10            0.0000       7         1.0000        0.5833
10–11           0.1429       7         0.8571        0.5000
11–12           0.0000       6         1.0000        0.5000
12–13           0.0000       6         1.0000        0.5000
13–14           0.0000       6         1.0000        0.5000
14–15           0.0000       6         1.0000        0.5000
15–16           0.3333       6         0.6667        0.3333
16–17           0.2500       4         0.7500        0.2500
17–18           0.0000       3         1.0000        0.2500
18–19           0.0000       3         1.0000        0.2500
19–20           0.0000       3         1.0000        0.2500
20–21           0.0000       3         1.0000        0.2500
21–22           0.0000       3         1.0000        0.2500
22–23           0.0000       3         1.0000        0.2500
23–24           0.0000       3         1.0000        0.2500
24–25           0.0000       3         1.0000        0.2500
25–26           0.0000       3         1.0000        0.2500
26–27           0.0000       3         1.0000        0.2500
27–28           0.3333       3         0.6667        0.1667
28–29           0.0000       2         1.0000        0.1667
29–30           0.0000       2         1.0000        0.1667
30–31           0.5000       2         0.5000        0.0833
31–32           0.0000       1         1.0000        0.0833
32–33           1.0000       1         0.0000        0.0000
FIGURE 20.3 Survival curve for hemophiliac patients at most 40 years of age at hiv seroconversion, estimated using the life table method

After 32 months, all the patients in the sample have died; as a result, Ŝ(32) = 0. At this point, there are no individuals remaining in the sample who have not yet failed. A survival curve can be approximated by plotting the survival function Ŝ(t) generated using the life table method versus the time point representing the start of each interval, and connecting the points with straight lines. Using this method, a survival curve for hemophiliacs at most 40 years of age at hiv seroconversion in the mid to late 1980s is shown in Figure 20.3. (We must keep in mind that this curve was estimated based on a very small sample of 12 patients at the start of the aids epidemic. Today, survival for patients diagnosed with hiv/aids is much better.)

Summary: Life Table Method

Notation   Definition
t          Age; the time interval beginning at age t
qt         Proportion of individuals alive at the beginning of the interval t to t + 1 who die within the interval; mortality rate; hazard function
lt         Number of individuals alive at the beginning of the age interval
S(t)       Proportion of individuals who have not failed at time t; S(t) = lt/l0, where l0 is the number of people alive at time 0
1 − qt     Proportion of individuals alive at the beginning of the interval t to t + 1 who survive the interval; survival rate
TABLE 20.4
Product-limit method of estimating the survival function S(t) for hemophiliac patients at most 40 years of age at hiv seroconversion

Survival Time (months)   qt       1 − qt   Ŝ(t)
 0                       0.0000   1.0000   1.0000
 2                       0.0833   0.9167   0.9167
 3                       0.0909   0.9091   0.8333
 6                       0.2000   0.8000   0.6667
 7                       0.1250   0.8750   0.5833
10                       0.1429   0.8571   0.5000
15                       0.3333   0.6667   0.3333
16                       0.2500   0.7500   0.2500
27                       0.3333   0.6667   0.1667
30                       0.5000   0.5000   0.0833
32                       1.0000   0.0000   0.0000

20.2 Product-Limit Method
When we use the life table method, the estimated survival function Ŝ(t) changes only during the time intervals in which at least one death occurs. For smaller datasets, such as the sample of 12 hemophiliac patients diagnosed with hiv/aids, there can be many intervals without a single death. In these instances, it is not efficient to present the survival function in this way. The product-limit method of estimating a survival function, also called the Kaplan-Meier method, is a nonparametric technique that uses the exact survival time for each individual in a sample instead of grouping the times into intervals. Table 20.4 displays the product-limit estimate of the survival function for the sample of 12 hemophiliacs under the age of 40 at the time of hiv seroconversion. Instead of time intervals, the first column of the table contains the exact times at which at least one failure occurred; patients died 2 months after diagnosis, 3 months after diagnosis, 6 months after diagnosis, and so on. The patient with the longest survival died 32 months after primary aids diagnosis. The second column of the table lists the proportions of patients alive just prior to each time t who fail at that time, and the third column the proportions of individuals who do not fail at t. Using the multiplicative rule of probability, the proportions of individuals who survive beyond each time t can be estimated; the technique is the same as it was for the life table method. The survival curve corresponding to the survival function in Table 20.4 is plotted in Figure 20.4. When the product-limit method is used, Ŝ(t) is assumed to remain the same over the time periods between deaths. It changes only when a subject fails.

Keep in mind that Ŝ(t) was calculated using the data in a single sample of observations drawn from the underlying population. If we were to select a second sample of 12 hemophiliacs and calculate another survival function using the product-limit method, the results would differ from those in Figure 20.4. Ŝ(t) is an estimate of the true population survival function for all hemophiliacs diagnosed with hiv/aids in the mid to late 1980s who were at most 40 years of age at hiv seroconversion. To quantify the sampling variability in this estimate, we must calculate the standard error of Ŝ(t) at each time t and use this standard error to construct confidence bands around the survival curve [296].
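These calculations, including the standard errors and confidence bands, are typically obtained from statistical software rather than by hand. A minimal sketch using the survival package in R follows; the data entry is ours, and since no observations are censored here, every status indicator is set to 1:

library(survival)
months <- c(2, 3, 6, 6, 7, 10, 15, 15, 16, 27, 30, 32)
status <- rep(1, 12)             # 1 = death observed for every patient
km <- survfit(Surv(months, status) ~ 1, conf.type = "log-log")
summary(km)    # product-limit estimates with standard errors and
               # 95% confidence bands at each failure time
plot(km)       # survival curve with confidence bands, as in Figure 20.5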
FIGURE 20.4 Survival curve for hemophiliac patients at most 40 years of age at hiv seroconversion, estimated using the product-limit method

Figure 20.5 displays 95% confidence bands for the product-limit estimate.

The product-limit method for estimating a survival curve can be generalized to account for the partial information about survival times available from censored observations. Suppose that when the data for the 12 hemophiliac aids patients were analyzed, the individuals with the second and sixth longest survival times had not yet died. Instead, they were still alive after 3 and 10 months of follow-up, respectively. In Table 20.5, these censored observations are designated by a plus (+) sign. The product-limit estimate of the survival function incorporating the censored times is calculated in Table 20.6, and plotted in Figure 20.6. Each small x on the graph denotes a censored survival time. Note that Ŝ(t) does not change from its previous value if the observation at time t is censored; however, this observation is not used to calculate the probability of failure at any subsequent time point. At time 3, for instance, a patient is censored but no one dies. Therefore,

q3 = 0/11 = 0,

and Ŝ(3) = Ŝ(2). At time 6, one individual out of 12 died at 2 months and another was censored at 3 months; only 10 individuals remain at risk, and, since 2 of these die,

q6 = 2/10 = 0.2000.
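Only the status indicator changes when censored observations are included. A sketch reproducing Table 20.6, with the two censored times flagged by a 0 (again assuming the survival package has been loaded):

library(survival)
months <- c(2, 3, 6, 6, 7, 10, 15, 15, 16, 27, 30, 32)
status <- c(1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1)  # 0 = censored (3+, 10+)
km <- survfit(Surv(months, status) ~ 1)
summary(km)    # estimates drop only at observed deaths; the censored
               # times 3+ and 10+ reduce the number at risk afterward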
FIGURE 20.5 Survival curve for hemophiliac patients at most 40 years of age at hiv seroconversion, with 95% confidence bands
TABLE 20.5
Interval from primary aids diagnosis until death for a sample of 12 hemophiliac patients at most 40 years of age at hiv seroconversion, censored observations included

Patient Number   Survival (months)
 1                 2
 2                 3+
 3                 6
 4                 6
 5                 7
 6                10+
 7                15
 8                15
 9                16
10                27
11                30
12                32
TABLE 20.6
Product-limit method of estimating S(t) for hemophiliac patients at most 40 years of age at hiv seroconversion, censored observations included

Time   qt       1 − qt   Ŝ(t)
 0     0.0000   1.0000   1.0000
 2     0.0833   0.9167   0.9167
 3     0.0000   1.0000   0.9167
 6     0.2000   0.8000   0.7333
 7     0.1250   0.8750   0.6417
10     0.0000   1.0000   0.6417
15     0.3333   0.6667   0.4278
16     0.2500   0.7500   0.3208
27     0.3333   0.6667   0.2139
30     0.5000   0.5000   0.1069
32     1.0000   0.0000   0.0000
FIGURE 20.6 Survival curve for hemophiliac patients at most 40 years of age at hiv seroconversion, censored observations included
TABLE 20.7
Interval from primary aids diagnosis until death for a sample of 21 hemophiliac patients, stratified by age at hiv seroconversion

Age ≤ 40 Years              Age > 40 Years
Patient   Survival          Patient   Survival
Number    (months)          Number    (months)
 1           2                1           1
 2           3                2           1
 3           6                3           1
 4           6                4           1
 5           7                5           2
 6          10                6           3
 7          15                7           3
 8          15                8           9
 9          16                9          22
10          27
11          30
12          32

20.3 Log-Rank Test
Instead of simply describing the survival times for a single group of subjects, we often want to compare the distributions of survival times for two or more different populations. Our goal would be to determine whether survival differs systematically between the groups. Recall the data for the 12 hemophiliacs – all 40 years of age or younger at the time of hiv seroconversion – presented in Table 20.2. We might wish to compare this distribution of survival times from primary aids diagnosis until death to the distribution of survival times for another group of hemophiliacs who were all over age 40 at the time of seroconversion. Survival times for the two groups are listed in Table 20.7, and the product-limit estimates of the survival curves are plotted in Figure 20.7. Survival for patients undergoing hiv seroconversion at an earlier age is represented by the upper curve in the figure, and survival for patients undergoing seroconversion at a later age by the lower curve. At any point in time following aids diagnosis, the estimated probability of survival beyond that time is higher for individuals who were younger at seroconversion. We would of course expect some sampling variability in these estimates. Therefore we ask – is the difference between the two curves greater than might be expected by chance alone? One of a number of different methods available for testing the null hypothesis that two or more distributions of survival times are identical is a nonparametric technique called the log-rank test. The idea behind the log-rank test is that we construct a separate 2 × 2 contingency table displaying group status (in this example, age category at seroconversion ≤ 40 years or > 40 years) versus survival status for each time point t at which a death occurs. When t is equal to 1 month, for example, none of the 12 patients who were younger than 40 years at seroconversion die, but 4 of the 9 older patients do. Therefore, the 2 × 2 table for t = 1 month is:
FIGURE 20.7 Survival curves for two groups of hemophiliac patients, stratified by age at hiv seroconversion
                Failure
Group        Yes      No     Total
Age ≤ 40       0      12        12
Age > 40       4       5         9
Total          4      17        21
Similarly, when t is equal to 2 months, one of the 12 younger patients and one of the 5 remaining older patients die. Consequently, the 2 × 2 table for t = 2 months is:

                Failure
Group        Yes      No     Total
Age ≤ 40       1      11        12
Age > 40       1       4         5
Total          2      15        17
Once the entire sequence of 2 × 2 tables has been generated, the information contained in the tables is accumulated into a test statistic that compares the observed number of failures at each time t to the expected number of failures given that the distributions of survival times for the two age groups are identical [297]. If the null hypothesis is true, the test statistic has an approximate chi-square distribution with 1 degree of freedom. (In general, if k groups are being compared, the test statistic has a chi-square distribution with k − 1 degrees of freedom.) For the two groups of hemophiliacs diagnosed with hiv/aids in the mid to late 1980s, a log-rank test of the null hypothesis

H0 : S≤40(t) = S>40(t)

against the alternative hypothesis

HA : S≤40(t) ≠ S>40(t)

results in a p-value of 0.025. This is the probability of finding a difference in survival as great or greater than that observed, given that the null hypothesis is true and the survival curves are identical. Since p is less than α = 0.05, we reject H0 and conclude that patients experiencing seroconversion at an earlier age lived longer after primary aids diagnosis than individuals undergoing seroconversion at a later age.
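This test is implemented by the survdiff function in the R survival package. The following is a sketch using the survival times of Table 20.7, in which every failure time is observed; the data entry and group labels are our own transcription, not code from the text:

library(survival)
months <- c(2, 3, 6, 6, 7, 10, 15, 15, 16, 27, 30, 32,  # age <= 40
            1, 1, 1, 1, 2, 3, 3, 9, 22)                 # age > 40
group  <- rep(c("age<=40", "age>40"), times = c(12, 9))
status <- rep(1, 21)                # all deaths observed, none censored
survdiff(Surv(months, status) ~ group)  # log-rank chi-square with 1 df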
FIGURE 20.8 Survival curves for moderate risk breast cancer patients in two treatment groups

As another example, consider the data from a clinical trial comparing two different treatment regimens for moderate risk breast cancer [298]. We wish to compare the distributions of survival times after diagnosis for women receiving treatment A versus women receiving treatment B, and determine whether either treatment prolongs survival relative to the other. Plots of the two product-limit survival curves, with time since breast cancer diagnosis on the horizontal axis, are displayed in Figure 20.8. There does not appear to be a difference in survival for patients in the two treatment groups; note the great deal of overlap in the curves. Furthermore, a log-rank test of the null hypothesis

H0 : SA(t) = SB(t)

results in a p-value of 0.88. We are unable to reject H0 at the 0.05 level of significance. The data do not provide evidence of a difference in survival for women in the two treatment groups. However, when the individuals enrolled in the clinical trial are separated into two distinct subpopulations – premenopausal women versus postmenopausal women – treatment does appear to have an effect on survival. As shown in Figure 20.9, treatment A improves survival for premenopausal women. The product-limit survival curve for women receiving treatment A lies above the survival curve for those receiving treatment B; at any point in time following diagnosis, the estimated probability of survival is higher for women receiving treatment A. The log-rank test p-value is 0.052.
FIGURE 20.9 Survival curves for premenopausal, moderate risk breast cancer patients in two treatment groups

In contrast, Figure 20.10 suggests that treatment B is more effective in prolonging survival for postmenopausal women. In this case, the survival curve for women receiving treatment B lies above the curve for those receiving treatment A. The log-rank p-value for this comparison is 0.086. Since the treatment effects go in opposite directions in the two subpopulations, they cancel each other out when the groups are combined. This serves as a reminder that care must be taken to account not only for important confounding variables, but also for important interactions.
Summary: Log-Rank Test

Log-rank test                 A nonparametric technique for testing whether two or more distributions of survival times are identical
Null hypothesis               H0 : SA(t) = SB(t), where A and B represent two independent groups
Alternative hypothesis        HA : SA(t) ≠ SB(t)
Test statistic distribution   Chi-square distribution with 1 degree of freedom
FIGURE 20.10 Survival curves for postmenopausal, moderate risk breast cancer patients in two treatment groups
20.4 Cox Proportional Hazards Model
As we have seen, the log-rank test can be used to compare survival times in two or more independent groups defined by some risk factor of interest. Subjects can have variable amounts of follow-up, and some survival times may be censored. However, the log-rank test is not practical if there is more than one categorical risk factor related to survival, or if the risk factor of interest is a continuous measurement. In these instances, a regression model can be used to examine the relationships between time to the outcome event and one or more explanatory variables. One regression model commonly used for a time to event outcome is the Cox proportional hazards model. Unlike a logistic regression model, the Cox proportional hazards model takes into account not just whether an event occurs, but when it occurs. For a single explanatory variable X, the model takes the form

ln[ h(t)/h0(t) ] = β̂1 x1 .

The function h(t) is the hazard function. Instead of the probability that an individual fails in a particular interval of time given that they are alive at the beginning of that interval (qt), as we saw with the life table, here it is defined as the instantaneous rate of failure at time t given that the individual has survived up until time t. Rather than having discrete intervals of time, here time is considered to be continuous, and the instantaneous rate of failure is allowed to vary over time. The “baseline hazard” h0(t) is the hazard when the explanatory variable X takes the value 0. For a dichotomous explanatory variable, h0(t) is the hazard for subjects who do not have the characteristic of interest. As with other regression models, explanatory variables can be categorical or continuous. Coefficients can be interpreted in a manner analogous to the logistic regression model.
TABLE 20.8
Time to recurrence of brain metastasis for a sample of 23 patients treated with radiotherapy

Patient   Recurrence        Patient   Recurrence
Number    (weeks)           Number    (weeks)
 1           2               13          14
 2           2               14          14
 3           2               15          18
 4           3               16          19
 5           4               17          20
 6           5               18          22
 7           5               19          22
 8           6               20          31
 9           7               21          33
10           8               22          39
11           9               23         195
12          10
If the explanatory variable is dichotomous, we can exponentiate the estimated coefficient β̂1 to obtain a hazard ratio. The hazard ratio is the ratio of the hazards for two study subjects, one of whom has the risk factor of interest while the other does not. The hazard ratio can be interpreted as the instantaneous relative risk of failure at time t, given that both individuals have survived up until time t. For a continuous explanatory variable, the estimated hazard ratio is the relative risk of failure associated with a one unit increase in X.

As noted above, the hazard for an individual study participant can vary over time. The hazard ratio, however, cannot. The hazard ratio for those who have a risk factor versus those who do not takes the same value at all times t. It is the same at 5 days of follow-up, at 5 months, and at 5 years. In fact, this is what is meant by “proportional hazards.” It is important to verify that the proportional hazards assumption is reasonable before using this model; there are a variety of techniques available to do this [297].
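In R, the model is fit with the coxph function in the survival package, and the proportional hazards assumption can be examined with cox.zph, which tests for a trend over time in the scaled Schoenfeld residuals. This is a sketch under the assumption of a data frame dat containing a follow-up time time, an event indicator status, and a dichotomous risk factor x; the names are ours:

library(survival)
fit <- coxph(Surv(time, status) ~ x, data = dat)
summary(fit)    # exp(coef) is the estimated hazard ratio for x,
                # reported with a 95% confidence interval
cox.zph(fit)    # a small p-value here suggests the hazard ratio is
                # not constant over time, i.e., nonproportional hazards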
20.5 Further Applications
Suppose we are interested in studying patients with systemic cancer who subsequently develop metastasis in the brain; the ultimate goal is to prolong their lives by controlling the disease. A sample of 23 such patients, all of whom were treated with radiotherapy, was followed from the first day of treatment until recurrence of the original brain tumor [299]. Recurrence is defined as the reappearance of a metastasis at exactly the same site, or, in the case of patients whose tumor never completely disappeared, enlargement of the original lesion. Times to recurrence for the 23 patients are presented in Table 20.8. What can we infer about the reappearance of brain metastasis based on the information in this sample?

We could begin our analysis by summarizing the recurrence time data using the life table method. For intervals of length two weeks, we first determine the proportion of patients who experienced a recurrence within each interval. The results are presented in the second column of Table 20.9.
TABLE 20.9
Life table method of estimating S(t) for patients with brain metastasis treated with radiotherapy

Weeks Since    Mortality   Persons    Survival
Treatment t    Rate qt     Alive lt   Rate 1 − qt   Ŝ(t)
0–2            0.0000      23         1.0000        1.0000
2–4            0.1739      23         0.8261        0.8261
4–6            0.1579      19         0.8421        0.6957
6–8            0.1250      16         0.8750        0.6087
8–10           0.1429      14         0.8571        0.5217
10–12          0.0833      12         0.9167        0.4783
12–14          0.0000      11         1.0000        0.4783
14–16          0.1818      11         0.8182        0.3913
16–18          0.0000       9         1.0000        0.3913
18–20          0.2222       9         0.7778        0.3043
20–22          0.1429       7         0.8571        0.2609
22–24          0.3333       6         0.6667        0.1739
24–26          0.0000       4         1.0000        0.1739
26–28          0.0000       4         1.0000        0.1739
28–30          0.0000       4         1.0000        0.1739
30–32          0.2500       4         0.7500        0.1304
32–34          0.3333       3         0.6667        0.0870
34–36          0.0000       2         1.0000        0.0870
36–38          0.0000       2         1.0000        0.0870
38–40          0.5000       2         0.5000        0.0435
40+            1.0000       1         0.0000        0.0000
Since 4 out of 23 individuals had a recurrence of the original brain metastasis at least two weeks but not more than four weeks after the start of treatment, the proportion of individuals failing in the interval 2–4 is estimated as

q2 = 4/23 = 0.1739.

Similarly, 3 of the remaining 19 patients had a recurrence between four and six weeks after treatment; therefore,

q4 = 3/19 = 0.1579.

In time intervals not containing a failure, the estimated proportion of individuals experiencing a recurrence is equal to 0. The fourth column of Table 20.9 contains the proportion of patients who did not fail during a given interval, 1 − qt. These proportions of individuals who did not experience a recurrence can be used to estimate the survival function S(t). Since none of the patients in the sample failed at time 0, the estimate of S(0) = P(T > 0) is

Ŝ(0) = 1.
Subsequent values of Ŝ(t) are calculated using the multiplicative rule of probability. The probability that an individual has not experienced a recurrence during the interval 0–2 is S(0), and the probability that the patient does not fail in the interval 2–4, given that they did not fail prior to this, is 1 − q2. Therefore, the probability that an individual does not experience a recurrence until more than 2 weeks after the start of treatment is estimated by

Ŝ(2) = Ŝ(0)(1 − q2) = (1.0000)(0.8261) = 0.8261.

Similarly, the probability that a patient does not fail until more than 4 weeks after the start of treatment is estimated by

Ŝ(4) = Ŝ(2)(1 − q4) = (0.8261)(0.8421) = 0.6957.

By the last interval in the table – the only one that is not of length two weeks – every patient in the study has experienced a recurrence of the original metastasis. Consequently,

Ŝ(40) = 0.
Since we are dealing with a relatively small group of patients, we might prefer to estimate the survival function using the product-limit method. The product-limit method of estimating S(t) is a nonparametric technique that uses the exact recurrence time for each individual instead of grouping the times into intervals. In this case, the three patients who experienced a tumor recurrence two weeks after their initial treatment would not be grouped with the individual who failed three weeks after the start of treatment. Table 20.10 displays the product-limit estimate of S(t) for the sample of 23 patients treated for brain metastasis. The first column of the table contains the exact times at which the failures occur rather than time intervals. The second column lists the proportions of patients who had not failed prior to time t who experience a recurrence at that time, and the third column contains the proportions of individuals who do not fail at t. Using the multiplicative rule of probability, these proportions are used to estimate the survival function S(t). The corresponding survival curve is plotted in Figure 20.11.

The product-limit method for estimating a survival curve can be modified to take into account the partial information about recurrence times that is available from censored observations. In Table 20.11, censored survival times for the sample of 23 patients treated with radiotherapy are denoted by a plus (+) sign. These patients either died before experiencing a recurrence of their original brain metastasis, or remained tumor-free at the end of the follow-up period. The product-limit estimate of the survival function is calculated in Table 20.12, and the corresponding survival curve is plotted in Figure 20.12. When an observation is censored, it is not used to calculate the probability of failure at any subsequent time point. (Also, note that if the longest survival time in a sample is censored, the curve does not drop down to the horizontal axis to indicate an estimated survival probability equal to 0.) Here, at time 2 weeks, 3 patients are censored but no one experiences a tumor recurrence. Therefore,

q2 = 0/23 = 0.

At time 3, one of the remaining 20 patients experiences a recurrence, and

q3 = 1/20 = 0.0500.
Rather than make inference about survival in a single population, we often want to compare the distributions of survival times for two or more different populations. For example, we might wish to compare the times to recurrence of brain metastasis for patients treated with radiotherapy alone versus those undergoing surgical removal of the tumor and subsequent radiotherapy. Survival times for both groups are presented in Table 20.13, and the corresponding product-limit survival curves are plotted in Figure 20.13.
TABLE 20.10
Product-limit method of estimating S(t) for patients with brain metastasis treated with radiotherapy

Time   qt       1 − qt   Ŝ(t)
  0    0.0000   1.0000   1.0000
  2    0.1304   0.8696   0.8696
  3    0.0500   0.9500   0.8261
  4    0.0526   0.9474   0.7826
  5    0.1111   0.8889   0.6957
  6    0.0625   0.9375   0.6522
  7    0.0667   0.9333   0.6087
  8    0.0714   0.9286   0.5652
  9    0.0769   0.9231   0.5217
 10    0.0833   0.9167   0.4783
 14    0.1818   0.8182   0.3913
 18    0.1111   0.8889   0.3478
 19    0.1250   0.8750   0.3043
 20    0.1429   0.8571   0.2609
 22    0.3333   0.6667   0.1739
 31    0.2500   0.7500   0.1304
 33    0.3333   0.6667   0.0870
 39    0.5000   0.5000   0.0435
195    1.0000   0.0000   0.0000
TABLE 20.11
Time to recurrence of brain metastasis for a sample of 23 patients treated with radiotherapy, censored observations included

Patient   Recurrence        Patient   Recurrence
Number    (weeks)           Number    (weeks)
 1           2+              13          14
 2           2+              14          14+
 3           2+              15          18+
 4           3               16          19+
 5           4               17          20
 6           5               18          22
 7           5+              19          22+
 8           6               20          31+
 9           7               21          33
10           8               22          39
11           9+              23         195+
12          10
FIGURE 20.11 Survival curve for patients with brain metastasis treated with radiotherapy
FIGURE 20.12 Survival curve for patients with brain metastasis treated with radiotherapy, censored observations included
TABLE 20.12
Product-limit method of estimating S(t) for patients with brain metastasis treated with radiotherapy, censored observations included

Time   qt       1 − qt   Ŝ(t)
  0    0.0000   1.0000   1.0000
  2    0.0000   1.0000   1.0000
  3    0.0500   0.9500   0.9500
  4    0.0526   0.9474   0.9000
  5    0.0556   0.9444   0.8500
  6    0.0625   0.9375   0.7969
  7    0.0667   0.9333   0.7437
  8    0.0714   0.9286   0.6906
  9    0.0000   1.0000   0.6906
 10    0.0833   0.9167   0.6331
 14    0.0909   0.9091   0.5755
 18    0.0000   1.0000   0.5755
 19    0.0000   1.0000   0.5755
 20    0.1429   0.8571   0.4933
 22    0.1667   0.8333   0.4111
 31    0.0000   1.0000   0.4111
 33    0.3333   0.6667   0.2741
 39    0.5000   0.5000   0.1370
195    0.0000   1.0000   0.1370
Based on the curves, it appears that individuals treated with both surgery and postoperative radiotherapy have fewer recurrences of brain metastases, and the recurrences that do happen take place at a later time. The log-rank test can be used to evaluate the null hypothesis that the distributions of recurrence times are identical in the two treatment groups. The test statistic compares the observed number of recurrences at each time to the expected number given that H0 is true. Although the calculations are somewhat complicated, they do not present a problem as long as a computer is available. The output from Stata is presented in Table 20.14, and the output from R in Table 20.15. For each treatment group – where Group 1 contains patients treated with radiotherapy alone and Group 2 contains those treated with both surgery and radiotherapy – the table displays the observed and expected numbers of events, or, in this case, recurrences of brain tumor. It also shows the test statistic and its corresponding p-value. Since p = 0.0001, we reject the null hypothesis and conclude that surgical removal of the brain metastasis followed by radiotherapy results in a longer time to recurrence of the original tumor than radiotherapy alone.

We could also examine the relationship between treatment group and time to recurrence using the Cox proportional hazards model. Table 20.16 displays the relevant Stata output, and Table 20.17 the output from R. The hazard ratio of 7.3 indicates that a patient treated with radiotherapy alone has a greater instantaneous relative risk of failure than a patient treated with surgery and radiotherapy at each time t after treatment, given that both individuals have survived up to time t. The 95% confidence interval for the hazard ratio is (2.4, 22.0), which does not contain the value 1.
TABLE 20.13
Time to recurrence of brain metastasis for a sample of 48 patients, stratified by treatment

Radiotherapy Alone          Surgery/Radiotherapy
Patient   Recurrence        Patient   Recurrence
Number    (weeks)           Number    (weeks)
 1           2+              1           2+
 2           2+              2           2+
 3           2+              3           6+
 4           3               4           6+
 5           4               5           6+
 6           5               6          10+
 7           5+              7          14+
 8           6               8          21+
 9           7               9          23
10           8              10          29+
11           9+             11          32+
12          10              12          34+
13          14              13          34+
14          14+             14          37
15          18+             15          37+
16          19+             16          42+
17          20              17          51
18          22              18          57
19          22+             19          59
20          31+             20          63+
21          33              21          66+
22          39              22          71+
23         195+             23          71+
                            24          73+
                            25          85+
FIGURE 20.13 Survival curves for patients with brain metastasis, stratified by treatment

TABLE 20.14
Stata output displaying the log-rank test

Log-rank test for equality of survivor functions

          |   Events    Events
treatment |  observed  expected
----------+--------------------
        1 |       12      4.90
        2 |        5     12.10
----------+--------------------
    Total |       17     17.00

            chi2(1) =  15.78
            Pr>chi2 = 0.0001
TABLE 20.15
R output displaying the log-rank test

             N Observed Expected (O-E)^2/E (O-E)^2/V
treatment=1 23       12      4.9     10.28      15.8
treatment=2 25        5     12.1      4.16      15.8

 Chisq= 15.8  on 1 degrees of freedom, p= 7e-05
TABLE 20.16
Stata output displaying the Cox proportional hazards model for outcome time to recurrence and explanatory variable treatment group

No. of subjects =          48               Number of obs   =        48
No. of failures =          17
Time at risk    =        1421
                                            LR chi2(1)      =     14.11
Log likelihood  =  -45.706603               Prob > chi2     =    0.0002

-----------------------------------------------------------------------
        _t | Haz. Ratio   Std. Err.     z    P>|z|  [95% Conf. Interval]
-----------+-----------------------------------------------------------
 treatment |   7.316884   4.106386   3.55   0.000   2.435647   21.98052
-----------------------------------------------------------------------
TABLE 20.17
R output displaying the Cox proportional hazards model for outcome time to recurrence and explanatory variable treatment group

            coef exp(coef) se(coef)     z        p
treatment 1.9902    7.3169   0.5612 3.546 0.000391

Likelihood ratio test=14.11  on 1 df, p=0.0001727
n= 48, number of events= 17
Furthermore, if we test the null hypothesis that the hazard ratio is equal to 1, we reject H0 with p < 0.001. We conclude that the risk of recurrence is greater among patients treated with radiotherapy alone, the same conclusion that was reached using the log-rank test. This interpretation of course assumes that the proportional hazards assumption is satisfied.
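Output like that shown in Tables 20.15 and 20.17 could be produced by calls along the following lines. This is a sketch which assumes the data of Table 20.13 have been assembled into a data frame brain with recurrence or censoring time weeks, an event indicator recur (1 = recurrence observed, 0 = censored), and the group variable treatment; the variable names are ours:

library(survival)
survdiff(Surv(weeks, recur) ~ treatment, data = brain)  # log-rank test
fit <- coxph(Surv(weeks, recur) ~ treatment, data = brain)
summary(fit)   # hazard ratio exp(coef), its 95% confidence interval,
               # and the likelihood ratio test statistic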
20.6 Review Exercises
1. What is a survival function?

2. What are censored observations? How do these observations occur?

3. How does the life table method of estimating a survival curve differ from the product-limit method?

4. Suppose that you are interested in examining the survival times of individuals who receive bone marrow transplants for nonneoplastic disease [300]. Presented below are the survival times in months for 8 such patients. Assume that none of the observations are censored.

3.0  4.5  5.0  10.0  15.5  18.5  25.0  34.0

(a) What is the median survival time for these patients?
(b) For fixed intervals of length two weeks, use the life table method to estimate the survival function S(t).
(c) Is this life table cross-sectional or longitudinal?
(d) Construct a survival curve based on the life table estimate of S(t).
(e) Use the product-limit method to estimate the survival function.
(f) Construct a survival curve based on the product-limit estimate.

5. Displayed below are the survival times in months since diagnosis for 10 hiv/aids patients suffering from concomitant esophageal candidiasis, an infection due to candida yeast, and cytomegalovirus, a herpes infection that can cause serious illness [301]. Censored observations are denoted by a plus (+) sign.

Patient Number   Survival (months)
 1                 0.5
 2                 1
 3                 1
 4                 1
 5                 2
 6                 5+
 7                 8+
 8                 9
 9                10+
10                12+
(a) How many deaths were observed in this sample of patients?
(b) Use the product-limit method to estimate the survival function S(t).
(c) What is Ŝ(1), the estimated probability of survival at 1 month? What is the estimated probability of survival at 5 months? At 6 months?
(d) Construct a survival curve based on the product-limit estimate.
6. In a Danish study evaluating the relationship between the measles, mumps, rubella (mmr) vaccine and autism in children, the hazard ratio was reported as 0.93, with 95% confidence interval (0.85, 1.02) [302]. The hazard ratio compares children who received the mmr vaccine to those who did not. Interpret the reported results.

7. In the 1980s, a study was conducted to examine the effects of the drug ganciclovir on hiv/aids patients suffering from disseminated cytomegalovirus infection [303]. Two groups of patients were followed; 18 were treated with the drug, and 11 were not. The results of this study are contained in the dataset cytomegalo. Survival times in months after diagnosis are saved under the variable name time, and indicators of censoring status – where 1 designates that a death occurred and 0 that an observation was censored – under the name death. Values of treatment group, where 1 indicates that a patient took the drug and 0 that he or she did not, are saved under the name group.

(a) How many deaths occurred in each treatment group?
(b) Use the product-limit method to estimate the survival function for each treatment group.
(c) Construct survival curves for the two treatment groups based on the product-limit estimate of S(t).
(d) Does it appear that the individuals in one group survive longer than those in the other group?
(e) Use the log-rank test to evaluate the null hypothesis that the distributions of survival times are identical in the two groups. What do you conclude?

8. In a study of bladder cancer, tumors were removed from the bladders of 86 patients [304]. Subsequently, the individuals were assigned to be treated with either a placebo or with the drug thiotepa. Time to the first recurrence of tumor in months is saved under the variable name time in the dataset bladder. Treatment status is saved under the name group; the value 1 represents treatment with the drug and 0 with placebo. Indicators of censoring status – where 1 designates that a tumor did recur and 0 that it did not and that the observation was censored – are saved under the name recurrence.

(a) Use the product-limit method to estimate the survival function in each treatment group.
(b) Construct survival curves based on the product-limit estimates.
(c) Does it appear that the individuals in one group have a longer time to first recurrence of tumor than those in the other group?
(d) Test the null hypothesis that the distributions of recurrence times are identical in the two treatment groups. What do you conclude?
(e) The variable number is an indicator of the number of tumors initially removed from the bladder; 1 indicates that a patient had a single tumor, and 2 that the individual had two or more tumors. For patients treated with the placebo, test the null hypothesis that the distributions of recurrence times are identical for individuals who had one tumor and for those who had two or more tumors. What do you conclude?

9. A study conducted in Pakistan examined risk factors for mortality among patients admitted to the hospital with a diagnosis of heart failure and left ventricular systolic dysfunction [305]. Time from diagnosis to death or last follow-up in days is saved under the variable name time in the dataset heartfail. The variable death takes the value 1 if a patient died and 0 if the observation was censored. An indicator of whether each
individual had been diagnosed with diabetes prior to their admission is saved under the name diabetes; the value 1 indicates that a patient did have diabetes and 0 that they did not.

(a) Use the product-limit method to estimate the survival function for individuals diagnosed with diabetes and those not diagnosed with diabetes.
(b) Fit a Cox proportional hazards regression model to evaluate the relationship between time to death from heart failure and prior diagnosis of diabetes. Interpret the estimated hazard ratio for diabetes.
(c) Do these data suggest that a diagnosis of diabetes is a significant risk factor for mortality among patients diagnosed with heart failure? Explain.
(d) Fit a Cox proportional hazards regression model to evaluate the relationship between time to death from heart failure and prior diagnosis of diabetes, adjusting for patient age at hospital admission and sex. Does your interpretation of the relationship between diabetes and mortality change after adjusting for these potential confounders?
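The analyses requested in these exercises follow the same sequence of commands illustrated in Section 20.5. As a sketch for Exercise 9 – assuming the heartfail data are loaded, and with hypothetical names age and sex for the adjustment variables, since the exercise does not specify them:

library(survival)
km <- survfit(Surv(time, death) ~ diabetes, data = heartfail)
plot(km)                                 # product-limit curves by group
coxph(Surv(time, death) ~ diabetes, data = heartfail)   # part (b)
coxph(Surv(time, death) ~ diabetes + age + sex,         # part (d);
      data = heartfail)                  # age and sex names are assumed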
21 Sampling Theory
CONTENTS
21.1 Sampling Designs . . . . . . . . . . . . . . . . . . . . . . . . . . 511
     21.1.1 Simple Random Sampling . . . . . . . . . . . . . . . . . . . 512
     21.1.2 Systematic Sampling . . . . . . . . . . . . . . . . . . . . 514
     21.1.3 Stratified Sampling . . . . . . . . . . . . . . . . . . . . 515
     21.1.4 Cluster Sampling . . . . . . . . . . . . . . . . . . . . . . 519
     21.1.5 Ratio Estimator . . . . . . . . . . . . . . . . . . . . . . 521
     21.1.6 Two-Stage Cluster Sampling . . . . . . . . . . . . . . . . . 523
     21.1.7 Design Effect . . . . . . . . . . . . . . . . . . . . . . . 526
     21.1.8 Nonprobability Sampling . . . . . . . . . . . . . . . . . . 527
21.2 Sources of Bias . . . . . . . . . . . . . . . . . . . . . . . . . . 528
21.3 Further Applications . . . . . . . . . . . . . . . . . . . . . . . 530
21.4 Review Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 535
By a small sample we may judge of the whole piece.

Miguel de Cervantes
Don Quixote: The Ingenious Gentleman of La Mancha (1605)

Inference is one of the fundamental goals of statistics. When making inference, we attempt to describe some characteristic of a population using the information contained in a sample of observations drawn from that population. Up to this point, when estimating a mean we use models that assume the size of the underlying population – such as the population of serum cholesterol level measurements for all adult males in the United States – is infinite, with mean µ and standard deviation σ. From this population we select a random sample of size n, and then rely on the central limit theorem to justify that the distribution of the sample mean x̄ is approximately normal with mean µ and standard deviation σ/√n. In order for the conclusions drawn to be valid, however, the sample must be representative of the population. How we obtain the sample is critical, and this chapter provides further details on some of the important issues regarding sampling theory.

Suppose that, instead of being infinite, the underlying population we wish to describe is finite and consists of N subjects or elements. Even though finite, if N is large, it may still not be feasible to evaluate all elements of the population; in such instances, we would again like to make inference about a specified population characteristic using the information contained in a sample. If we select an observation from an infinite population, this action does not change the population. However, if we select an observation from a finite population, the population from which we sample the next observation is different; it is now of size N − 1, instead of N. How does this impact our inference? Another slight change when sampling from a finite population is that we can continue to talk about the mean of the population, µ, but we can also talk about its total, defined as Nµ. Of the two, the latter is often the quantity of greater interest to demographers. For example, when planning health requirements at the regional level, it is the total population of the region that matters.
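To make the distinction concrete, the following R lines – an illustration of ours, not part of the text – contrast estimating the mean with estimating the total when n subjects are drawn without replacement from a finite population of size N:

set.seed(1)
N <- 10000                   # finite population size
y <- rgamma(N, shape = 2)    # a hypothetical measurement for each member
n <- 100
s <- sample(N, n)            # simple random sample without replacement
ybar  <- mean(y[s])          # estimate of the population mean
total <- N * ybar            # estimate of the population total N*mu
# sampling without replacement from a finite population shrinks the
# variance of ybar by the finite population correction (1 - n/N)
var_ybar <- (1 - n/N) * var(y[s]) / n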
If we visit the website of the National Center for Health Statistics (nchs), we find that it has a “Stat of the Day.” On June 28, 2020, the statistic reported was:

The number of lightning deaths in the United States has dropped by more than 50% in 20 years, from 64 in 1999 to 23 in 2018. Source: National Vital Statistics System, 2018
We expect such nuggets from the nchs, since it is the repository for all United States vital statistics. Returning to the site, we also find the following: The charge of the nchs is to provide statistical information that will guide actions and policies to improve the health of the American people. As the Nation’s principal health statistics agency, nchs leads the way with accurate, relevant, and timely data.

To this end, the nchs not only collects and stores vital statistics, it also carries out important surveys of the United States population. These surveys measure characteristics of samples of the population, often over time, to monitor the health of the population as a whole. A physician measures characteristics of a patient to better guide that patient’s care. In public health the same principles hold; we need to measure the characteristics of the ‘patient,’ except now the patient is a community of people rather than a single person. We might wish to improve the nutritional status of all residents of a health district, for example. It may not be possible to measure each member of the population as often as we would like, so we turn to surveys to provide the information we will use to infer the nutritional status of the populace in question.

One such survey, the National Health Interview Survey (nhis), has monitored the health of the United States since 1957. Nhis data on a broad range of health topics are collected through personal household interviews. The statistics gathered are too numerous to list here, but they are described on the nhis website [306]. Another survey, the National Survey of Family Growth, collects information on families, fertility, and health from a national sample of males and females aged 15–49 in the household population of the United States [307]. Many other surveys have been and are currently being carried out by the nchs [308]. Surveys are also conducted by other governmental agencies, such as the American Community Survey [32]. This survey is a valuable source of information about the population of the United States, and is conducted annually by the Census Bureau. It includes demographic, social, economic, and housing information based on a sample of approximately 3 million individuals. Other countries also carry out health surveys. One such program, funded by usaid and targeted at the economically developing nations of Africa, Asia, and Latin America, is the Demographic and Health Surveys program [309]. This organization has collected and disseminated health data through more than 400 surveys in over 90 countries.

What all these surveys have in common is that they utilize probability samples; whether a particular person is included in the sample is decided by a random device. Probability samples allow us not only to make inference about an important characteristic of a population, but also to calculate a measure of uncertainty associated with that inference. The simple random sample described below is one example of such a sampling scheme, but there are other ways to obtain a probability sample. Each has its own advantages and limitations, and we discuss some of these in this chapter. From a statistical perspective, the accuracy and precision associated with a survey design are of primary importance. But the cost of the survey and the time it takes to execute are also important, and must be considered when choosing a design. These are the driving considerations in choosing among the various sampling designs, and we investigate these criteria in this chapter.
Note that, to allow the standard errors of the various designs to be compared without cluttering the design descriptions, we summarize the formulas in boxes at the end of each subsection.
FIGURE 21.1 Steps in sampling from a finite population
21.1 Sampling Designs
The individual elements in a population being studied are called study units or sampling units; each element might be a person, a family, a city, a hospital, a clinic, an object, or anything else that is the unit of analysis in a population. For example, suppose we wish to determine the average amount of alcohol consumed each week by 15- to 17-year-old teenagers living in the Commonwealth of Massachusetts. In this case, the study units would be teenagers between the ages of 15 and 17 residing in Massachusetts at a particular time. Not all health surveys have people as the study unit. For example, the previously mentioned Demographic and Health Surveys are nationally-representative household surveys. Here, the sampling unit is a household.

The population we would like to describe is called the target population. In the preceding example, the target population is all 15- to 17-year-old teenagers living in Massachusetts. In many situations, the target population is not accessible. If we are using school records to select a sample of teenagers, for instance, those who do not attend high school would have no chance of being included. After we account for practical constraints, the group from which we can actually sample is known as the study population. This concept is illustrated in Figure 21.1. A list of the elements in the study population is called the sampling frame. The study population must always be described when reporting the results of a survey; it is critical to the interpretation of any inference drawn. Note that a random sample, although representative of the study population from which it is selected, may not be representative of the target population. If the two groups differ in some important way – perhaps the study population is younger than the target population – the selected sample is said to be biased. For example, if we check hospital records for a particular condition, we might not see as many individuals with a mild manifestation of that condition as we should, since patients with milder manifestations would not go to the hospital as often as those with more serious symptoms of the disease. Selection bias is a systematic tendency to exclude certain members of the target population, and yet claim that the inference applies to the entire target population.
FIGURE 21.2 Choosing a random sample of households in Nigeria (photo courtesy of Professor Joseph Valadez)

As epitomized in Figure 21.1, most real samples are imperfect. There is a step between being chosen by the random device to be a member of the sample and actually being measured within the sample; in the figure, these nonresponders are highlighted in peach. Furthermore, even when individuals are in the sample, some of their information may be missing; examples of this are highlighted in yellow. How we handle the nonresponders and the missing data seriously impacts our inference, and should be a component of the narrative when describing the analysis.
21.1.1 Simple Random Sampling
The simplest and yet conceptually most powerful random sample that can be drawn from the study population is a simple random sample. With simple random sampling, each study unit in the population has an equal chance of being included in the sample, and, at the same time, every sample of the same size n has an equal chance of being drawn. Study units are independently selected one at a time until the desired sample size is achieved. Figure 21.2 shows an instance of choosing a random sample in the field, with the aid of a table of random numbers. In a simple random sample taken from a finite population of size N, the probability that a particular unit is chosen is n/N. This quantity, called the sampling fraction of the study population, depends only on the sample size n and the population size N. It does not depend on any characteristic of the sampling unit, including the characteristics under study. Inclusion depends only on the individual’s address or position in the sampling frame. The concept of a simple random sample is ubiquitous, possibly because it seems so easy to obtain. Even its name implies this. In reality, however, it is difficult to obtain such a sample. Furthermore, once a sample has been selected, it is usually impossible to prove that it was randomly chosen. The contrary is easier to prove. As a case in point, Title 28 of the U.S. Code §1861 states [310]: It is the policy of the United States that all litigants in Federal courts entitled to trial by jury shall have the right to grand and petit juries selected at random from a fair cross section of the community in the district or division wherein the court convenes.
The state of Connecticut’s federal jury office chooses its random samples from a database it maintains for this purpose [311]. Unfortunately, for nearly three years, the computer program read the “d” at the end of “Hartford,” a city in Connecticut, to mean “deceased,” and thus no one from Hartford was ever chosen to be on a jury. For some unspecified reason, anyone living in New Britain, Connecticut, was also excluded from the jury list. The result of excluding these two cities was that 63% of the voting-age African-American population and 68% of the voting-age Hispanic population in the jury district were excluded – fractions much larger than for other racial/ethnic groups. As a result, “the jury pool for the district court in Connecticut became significantly whiter” [311].

Assuming it can be obtained, an advantage of a simple random sample is that it allows us to estimate not only parameters of the study population, but also their associated standard errors. For example, we can estimate the population mean µ using the sample mean x̄. Note that in the calculation of the sample mean, each observation has equal weight, 1/n. When the study population of size N has mean µ and standard deviation σ, a finite version of the central limit theorem states that the distribution of the sample mean over all possible samples has mean µ and standard deviation √(1 − n/N) (σ/√n), also called the standard error of the sample mean. The latter quantity is associated with the accuracy of our inference about the mean. Without a random sample, we would not know the standard error.

The standard error of the sample mean for a finite population differs from that for an infinite population by a factor of √(1 − n/N), due to the fact that there are only a finite number of distinct samples possible. The square of this quantity, 1 − (n/N), is called the finite population correction factor. This factor applies when we are taking repeated samples from the same population. If it is to be used as an estimator when sampling from different populations, then it is advisable to revert to the formula for an infinite population. For fixed sample size n and very large population size N, n/N is close to 0. In this case, the finite population correction factor is approximately 1, and we return to the familiar situation in which the standard deviation of the sample mean is σ/√n. If the entire population is included in the sample – something that can only happen with a finite population – then n/N is equal to 1, and the standard deviation is 0. This reflects that when the entire population is evaluated, there is no sampling variability in the mean; we get the same sample every time.

Suppose we are interested in the vaccination coverage of a village with 150 children. With the ith child we can associate a measure x_i, which we define as x_i = 0 if the ith child is not vaccinated, and x_i = 1 if the ith child is vaccinated, for each of the children i = 1, 2, . . . , 150. To monitor the village vaccination coverage, we can take a simple random sample of 20 children and check their vaccination status. Let S be the set of subscripts of those sampled. So, for example, if the 20 children chosen at random are child 4, child 17, child 31, etc., then S = {4, 17, 31, . . .}. If we define

x̄ = (1/20) Σ_{i∈S} x_i ,

then x̄ is the proportion of the children in the sample who are vaccinated. If we consider the entire population of children,

µ = (1/150) Σ_{i=1}^{150} x_i

is the proportion of the population vaccinated. If the sample is a simple random sample, then x̄ is an unbiased estimator of µ. What this means, for a finite population, is that if we select many samples of size n and each time obtain a sample mean x̄, the average of these sample means is µ.
Summary: Simple Random Sample

Population size: N
Sample size: n
Sample values: x_1, x_2, . . . , x_n
Sample mean: x̄ = Σ_{i=1}^{n} x_i / n
Weight: 1/n
Sample variance: s² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1)
Standard error for x̄: √( (1 − n/N) s²/n )
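To make the summary concrete, here is a minimal Python sketch (not from the text) of estimating a proportion from a simple random sample with the finite population correction, using the village vaccination setting above. The 80% coverage used to simulate the population is purely hypothetical.

```python
# Simple random sample of n = 20 children from a village of N = 150,
# with the finite population correction applied to the standard error.
import random

random.seed(1)
N, n = 150, 20
population = [1 if random.random() < 0.8 else 0 for _ in range(N)]  # hypothetical 80% coverage

sample = random.sample(population, n)            # SRS without replacement
x_bar = sum(sample) / n                          # sample mean (coverage estimate)

s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)   # sample variance
se = ((1 - n / N) * s2 / n) ** 0.5               # standard error with fpc

print(f"estimated coverage = {x_bar:.2f}, standard error = {se:.3f}")
```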
21.1.2 Systematic Sampling

If a sampling frame of the N elements in a finite population is available, systematic sampling can be performed. Systematic sampling shares some similarities with simple random sampling, but can be easier to apply in practice. If a sample of size n is desired, the sampling fraction for the population is n/N. This is equivalent to a sampling fraction of 1/(N/n), which means we should sample 1 unit from among every N/n. For simplicity, assume this to be an integer k = N/n. The initial sample unit is randomly selected from the first k units on the list. We then move down the list, selecting every kth consecutive unit. To select a sample of n 15- to 17-year-old teenagers living in Massachusetts, for example, we would first randomly select a number between 1 and k = N/n. Suppose we choose 4. We would then obtain information about the 4th individual on the list, as well as persons 4 + k, 4 + 2k, 4 + 3k, and so on.

Ideally, a sampling frame should be a complete list of all members of the target population. In reality, however, this is rarely the case. In some situations it is impossible to devise a sampling frame for the study population. Suppose we are interested in the individuals who will use a particular health care clinic over the next year; in this case, we cannot finish the list until the year is complete. But even when a sampling frame is not available, systematic random sampling can sometimes be applied. We still wish to sample a fraction of 1 in N/n, so if the population size N is unknown, it must be estimated. In this case, the initial study unit is randomly selected from the first k units that become available, for some k. After this, each kth consecutive unit is chosen. For example, we might sample the 3rd person entering a health care clinic on a certain day, and every 10th person thereafter. The sampling frame is compiled as the study progresses.

If the sampling skip k = N/n is not an integer, we can modify the algorithm to maintain a random sample. As for integer k, we would start by generating a uniform number between 1 and k. If this number is u, we would then generate the numbers u, u + k, u + 2k, . . . , u + (n − 1)k. The final step is to turn these n numbers into integers by discarding their fractional parts.

Unlike simple random sampling, systematic sampling requires the selection of only a single random number – a random number between 1 and k. Also, it distributes the sample evenly over the entire sampling frame. Bias may arise if there is some type of periodic or cyclic sequence to the list. For example, if k is an even integer, and if the numbers on the list are sequential, then those in the sample will all be even or odd, depending on the first number chosen. This may mean we choose houses on one side of the street only, but such patterns are rare. If we can assume that the list is randomly ordered, then each individual has an equal chance of being chosen. However, this does not mean we have a simple random sample. Not every sample of size n has an equal chance of being chosen. To see this, consider the first step, where we choose a number between 1 and k. There are only k possible samples we can choose – far fewer than the number of possible simple random
samples. This is an example of a cluster sampling design with a single cluster. Cluster designs are addressed in Section 21.1.4. Unfortunately, a systematic sample does not give us a measure of the standard error of the mean. To overcome this shortcoming, a systematic sample is sometimes (incorrectly) treated as a simple random sample [312]. In this case, the sample variance and associated standard error are usually calculated using the formulas for a simple random sample.

Summary: Systematic Sample

Population size: N
Sampling skip: k = N/n
Starting index: u, uniform on 1 to k
Start index string: u, u + k, u + 2k, . . . , u + (n − 1)k
Integer part of index: ⌊u⌋, ⌊u + k⌋, ⌊u + 2k⌋, . . . , ⌊u + (n − 1)k⌋
Sample mean: x̄ = Σ_{i=1}^{n} x_i / n
Sample variance: Does not exist
Standard error for x̄: Does not exist
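The fractional-skip algorithm is easy to implement. Below is a minimal Python sketch under the assumptions above; note that it uses 0-based positions in the sampling frame rather than the 1-based indexing of the text.

```python
# Systematic sampling when the skip k = N/n is not an integer: draw u
# uniformly from [0, k), form u, u+k, ..., u+(n-1)k, and keep the integer
# parts as 0-based positions in the frame.
import random

def systematic_indices(N, n, rng=random):
    k = N / n                                   # sampling skip, possibly fractional
    u = rng.uniform(0, k)                       # single random starting point
    return [int(u + j * k) for j in range(n)]   # discard fractional parts

random.seed(2)
print(systematic_indices(N=100, n=10))          # ten positions, one per block of 10
```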
21.1.3 Stratified Sampling
In simple random sampling we do not consider any information not contained in the sample. It is possible that a particular subgroup of the population is not represented in the sample purely by chance, and we would not know that. When sampling teenagers in Massachusetts, for instance, our sample might not have any 17-year-old males, simply by chance. To know that we have this avoidable, biased perspective, we need to know that 17-year-old males exist in Massachusetts. This information is not available from the sample that has no 17-year-old males. But if we have reliable, external knowledge that this group exists, and its relative size, we can accommodate it in our survey. Indeed, if we have information about the relative sizes of subgroups, we can avoid the problem by selecting a stratified random sample – one in which all strata are represented in appropriate relative sizes. As we see below, using this external information usually results in a more accurate survey.

To illustrate the concept of stratification, consider that during the Great Depression, neighborhoods across the United States were classified by the Home Owners’ Loan Corporation (holc), an agency of the Federal Government [313]. The holc was tasked with refinancing home mortgages that were in default to prevent their foreclosure. Neighborhoods were graded and color coded, primarily as a function of their racial/ethnic makeup and wealth status. The example in Figure 21.3 demonstrates the classification of towns surrounding Boston in 1938. The classes are, in increasing order of mortgage lending risk: grade 1, dark green; grade 2, blue; grade 3, yellow; and grade 4, red. There is a lot of information included in this labeling, even to this day, as this social policy has had such a lasting effect. In general, since variability is the bane of biostatistics – and there is much less variability within a neighborhood of one color than there is across neighborhoods of different colors – it would seem ill-advised to ignore this information if our ultimate aim is to make inference about the population as a whole. Theory agrees with our intuition.

FIGURE 21.3 Map of the Boston area in 1938, showing holc classifications

Another example of stratification is given in Figure 21.4, where we see the 9 provinces of South Africa and their estimated total fertility rates (tfr) for 2016–2021 [314]. The tfr is defined to be the average number of children born to women during their childbearing years, here taken to be ages 15 to 49 years. The rates vary across provinces, so knowing all provincial numbers is of use to policy makers and other stakeholders. If we wish to aggregate further and get a single tfr to represent the
entire nation, we can of course use the country-wide tfr, but we can also calculate it from the nine provincial numbers. If we accept the population numbers and proportions shown in Table 21.1 as accurate, we can use the formula from Section 3.2 to define the country average:

X̄ = Σ_{h=1}^{H} (N_h/N) X̄_h
  = (0.1025 × 2.88) + (0.0484 × 2.34) + (0.2781 × 1.91) + (0.1932 × 2.70) + (0.0978 × 2.87) + (0.0779 × 2.54) + (0.0203 × 2.67) + (0.0640 × 2.65) + (0.1178 × 2.00)
  = 2.40.

We can monitor this overall statistic over time to track national trends or acute abnormalities. The provincial averages can also be used to check for spatial trends or irregularities. There are other stratification schemes that might be of interest. A popular one is to classify each household as being either rural (roughly 1/3 of South Africa) or urban. As another example, our strata might consist of various combinations of sex and age: 15-year-old males, 15-year-old females, 16-year-old males, 16-year-old females, 17-year-old males, and 17-year-old females. The options are limitless.

Formally, to select a stratified random sample, we first divide the population into H strata, with the hth stratum having size N_h. We then treat each of these strata as a smaller population, make inference within each stratum, then combine across the strata. As an example, start with Table 21.1, which shows the estimated number of women between the ages of 15 and 49 in each of the 9 provinces in
FIGURE 21.4 South African provincial estimated average total fertility rate (tfr), 2016–2021
TABLE 21.1 Mid-year 2019 female population (ages 15–49 years) estimates by province, tfr for 2016–2021, and a random sample of 1000 South African women

Province        h   Population Estimate N_h   Percent of Total   tfr 2016–2021   Sample Size
Eastern Cape    1   1,626,473                 10.25              2.88            107
Free State      2   767,216                   4.84               2.34            41
Gauteng         3   4,411,677                 27.81              1.91            273
KwaZulu-Natal   4   3,065,144                 19.32              2.70            207
Limpopo         5   1,552,179                 9.78               2.87            83
Mpumalanga      6   1,234,948                 7.79               2.54            71
Northern Cape   7   321,860                   2.03               2.67            17
North West      8   1,015,626                 6.40               2.65            67
Western Cape    9   1,868,672                 11.78              2.00            134
South Africa        15,863,795                100.00             2.40            1000
South Africa. From this population we obtain a stratified sample of size n by choosing a separate simple random sample of size n_h from stratum h, for each of the H strata, such that

n = Σ_{h=1}^{H} n_h .

This method ensures that each stratum is represented in the overall sample – assuming each n_h > 0. Note that, in contrast to a simple random sample, not all members of the population necessarily have an equal chance of being in the sample. But within a stratum we have a random sample; thus, within a stratum, each member of the population has an equal chance of being in that sample. However, we have flexibility in the choice of the stratum-specific sample sizes, the n_h. So, for example, we may “oversample” small subgroups of a population to provide sufficient numbers for more in-depth stratum-specific analyses.

Since we have a simple random sample within each individual stratum, we can estimate each stratum-specific population mean by using the sample mean within that stratum. We then estimate the overall population mean by a weighted average of the stratum-specific means. Mimicking the formula above, the estimator is

x̄ = Σ_{h=1}^{H} (N_h/N) x̄_h ,

where x̄_h is the sample mean from stratum h. If we had to do repeated sampling on a weekly basis, for example, our weights would not vary from week to week; only the sample means within the strata would vary. Removing this source of variability in the weights should lead to a less variable, or more precise, estimator.

To amplify on this point, we simulated a simple random sample of 1000 women from the population of 15,863,795 described in Column 3 of Table 21.1. Their distribution amongst the provinces is shown in Column 4 of the same table. The actual sample we drew, showing the provinces each woman came from, is the last column of Table 21.1. We only show the numbers from each province and not the tfr. In a simple random sample, these are the empirical weights (Column 6 divided by 1000) we would use to calculate the average of the random sample from their sample provincial averages (not shown). These weights would vary from sample to sample, and none would be as accurate as the actual population weights in Column 4. In contrast, the accurate weights in Column 4 are the ones we would use when calculating the stratified average. The variance associated with this estimator is

s² = (1/n) Σ_{h=1}^{H} (n_h/N) s_h² ,

and the square root of this term can be used as the standard error associated with the mean estimator defined above. Note that the variances between study units in different strata do not make a contribution to the standard error. Therefore, we are able to make the overall variance (or standard deviation) smaller by choosing subgroups where the study units within a particular stratum are as homogeneous as possible – decreasing the within-stratum variances – and the units in distinct strata are as different as possible – increasing the between-strata variances. If the stratum-specific sample sizes are chosen properly, the estimated mean of a population obtained from a stratified sample has a smaller variance – is more precise – than the mean of a simple random sample of the same size n.

Summary: Stratified Random Sample

In stratified sampling we take a simple random sample from each individual stratum. We can make separate inferences for each stratum, and can also combine the stratum-specific results according to the optimal weighting scheme to obtain results for the population as a whole, across all strata.

Strata: h = 1, 2, . . . , H
Population size: N = Σ_{h=1}^{H} N_h
Sample size: n = Σ_{h=1}^{H} n_h
Sample values, stratum h: x_{1h}, x_{2h}, . . . , x_{n_h h}
Sample mean, stratum h: x̄_h = Σ_{i=1}^{n_h} x_{ih} / n_h
Sample mean: x̄ = Σ_{h=1}^{H} N_h x̄_h / N
Sample variance, stratum h: s_h² = Σ_{i=1}^{n_h} (x_{ih} − x̄_h)² / (n_h − 1)
Standard error for x̄: √( Σ_{h=1}^{H} (1 − n_h/N_h)(N_h/N)² s_h² / n_h )
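As a check on the arithmetic, here is a short Python sketch reproducing the weighted national tfr from the Table 21.1 figures.

```python
# Stratified (weighted) national tfr: combine the provincial values with the
# known population weights N_h / N from Table 21.1; reproduces the 2.40
# computed in the text.
weights = [0.1025, 0.0484, 0.2781, 0.1932, 0.0978,
           0.0779, 0.0203, 0.0640, 0.1178]      # N_h / N for the 9 provinces
tfr     = [2.88, 2.34, 1.91, 2.70, 2.87,
           2.54, 2.67, 2.65, 2.00]              # provincial tfr, 2016-2021

x_bar = sum(w * x for w, x in zip(weights, tfr))
print(f"stratified national tfr = {x_bar:.2f}")  # 2.40
```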
21.1.4 Cluster Sampling
We now arrive at one of those historical misfortunes that seem designed to confuse us unnecessarily. Consider changing the label from stratum to cluster, while retaining the same definitions as in the previous section – what we have been calling a stratum, we now call a cluster. This gives us a head start on talking about cluster sampling. Cluster sampling can be used if the population is formed of natural clusters, or if a sampling frame of the study population is difficult to compile. The clusters are called the primary sampling units (psu); within a cluster, each study unit is called a secondary sampling unit (ssu). This is where cluster sampling diverges from stratified sampling. Whereas in stratified sampling we choose a random sample in each of the strata, in cluster sampling we first select a random sample of clusters, and then look at all study units within the selected clusters. In stratified sampling, chance enters via the sampling performed within each stratum. In cluster sampling, it occurs in the choice of which clusters to enumerate.

When it is less expensive to obtain an ssu than it is to reach a psu, cluster sampling becomes less expensive than simple random sampling. Consider the travel cost required to obtain a sample when the psu are villages, and the ssu is a household within a village. With cluster sampling we might need only visit a few villages, whereas with simple random sampling we might potentially need to go to a different village for each observation.

Return your attention to South Africa, and consider the provinces as the psu and the hospitals within the provinces as the ssu. Table 21.2 contains data we can use to illustrate cluster sampling [315]. Suppose we do not have all the information in the table, and wish to estimate the number of hospital beds in South Africa. (The number is given as 102,229, but in this exercise we act as if we do not know this.) Rather than go to every province and count all the beds, we decide that we can only afford to study 3 of the 9 provinces. We can simulate an example and use the data in Table 21.2 to explain the procedure. The first step is to decide which provinces will be in our sample. We do this by taking a simple random sample of size 3 from the integers 1 through 9. Suppose we get 1, 4, and 8. So we choose our clusters to be:
TABLE 21.2 Distribution of hospital beds in public and private hospitals in the provinces in South Africa

Province        h   Population   Public       Per 10⁵     Private      Per 10⁵    Total        Per 10⁵
                                 hosp. beds   uninsured   hosp. beds   insured    hosp. beds   total
Eastern Cape    1   6,786,900    10,833       179.34      1684         217.65     12,517       184.43
Free State      2   2,786,800    3717         162.66      2325         403.49     6042         216.80
Gauteng         3   12,914,800   14,855       155.44      14,326       417.02     29,181       225.95
KwaZulu-Natal   4   10,694,400   18,087       192.19      4802         359.21     22,889       214.03
Limpopo         5   5,630,500    7241         139.79      576          117.59     7817         138.83
Mpumalanga      6   4,999,300    4792         110.18      1382         207.85     6174         123.50
Northern Cape   7   1,166,700    1654         166.79      361          200.92     2015         172.71
North West      8   3,676,300    3412         106.68      1465         290.87     4877         132.66
Western Cape    9   6,116,300    6326         138.67      4391         281.54     10,717       175.22
South Africa        54,772,000   70,917       154.14      31,312       357.30     102,229      186.64
Eastern Cape (12,517 beds), KwaZulu-Natal (22,889 beds), and North West (4,877 beds). We can treat these numbers as a simple random sample of size 3 psu, and proceed to apply our usual simple random sampling methods. For example, the value of an unbiased estimator of the average number of hospital beds in a province in South Africa is

x̄ = (12,517 + 22,889 + 4,877)/3 = 13,427.67.

Because this is a simulated exercise, we know the true answer; the average number of beds per province in South Africa is reported to be µ = 102,229/9 = 11,358.8. On the basis of sampling just 3 provinces, we are off by 2069. The sample standard deviation is

s = √{ [ (12,517 − 13,427.7)² + (22,889 − 13,427.7)² + (4,877 − 13,427.7)² ] / 2 } = 9040.5,

and the estimated standard error of the mean is

se = √( (1 − n/N) s²/n ) = √2 (9040.5)/3 = 4261.7.
This large standard error is due to the fact that we took a sample of only 3 psu. It also reflects the variability in the provincial numbers. One can improve on this standard error by choosing more provinces, of course, but also by incorporating more information into our analysis. We next show one way to do this.
Summary: Cluster Sample

In cluster sampling we first take a random sample of n psu, which serve as the sampling units. For each psu chosen, we then take a census of all ssu within that psu.

Population level:
psu_i: i = 1, . . . , N
Number of ssu in psu_i: M_i
Total number of ssu: M = Σ_{i=1}^{N} M_i
Measure of jth ssu in ith psu: x_{ij}
Mean in ith psu: µ_i = Σ_{j=1}^{M_i} x_{ij} / M_i
Overall mean: µ = Σ_{i=1}^{N} M_i µ_i / M
Overall variance of means: σ² = Σ_{i=1}^{N} (µ_i − µ)² / (N − 1)

Sample level:
Clusters sampled: i_1, i_2, . . . , i_n
Sampled psu mean: µ̄ = Σ_{j=1}^{n} µ_{i_j} / n
Sample variance: s² = Σ_{j=1}^{n} (µ_{i_j} − µ̄)² / (n − 1)
Standard error for µ̄: √( (1 − n/N) s²/n )
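A minimal Python sketch of the calculation above, reproducing the numbers in the text when the simulated draw selects provinces 1, 4, and 8.

```python
# Cluster sample of 3 of the 9 provinces (psu): treat their bed totals as a
# simple random sample of size 3 and apply the finite population correction.
beds = {1: 12517, 2: 6042, 3: 29181, 4: 22889, 5: 7817,
        6: 6174, 7: 2015, 8: 4877, 9: 10717}   # total beds, Table 21.2

chosen = [1, 4, 8]                              # the simulated random draw
values = [beds[i] for i in chosen]
N, n = 9, len(values)

x_bar = sum(values) / n
s2 = sum((x - x_bar) ** 2 for x in values) / (n - 1)
se = ((1 - n / N) * s2 / n) ** 0.5

print(f"mean = {x_bar:.2f}, se = {se:.1f}")     # 13427.67, 4261.7
```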
21.1.5 Ratio Estimator
As previously described, to save us from counting each hospital bed in South Africa to determine the total number of beds, we can instead randomly choose 3 provinces, find the average number of hospital beds in these provinces, and assume that this average represents the number of beds in an “average” province. The final step is then to multiply the province average by 9 – the number of provinces – to estimate the total beds in the country. This calculation is mathematically equivalent to taking the total number of beds in the 3 chosen provinces and multiplying that number by 3. This logic would be sound and lead to a good estimator if it were true that the 3 provinces in our sample actually contained one-third of all the hospital beds in the country. Of course, we do not really know what proportion of the total beds is contained in the sample; we just know that we chose the provinces randomly, so our estimator of the total is unbiased.

Suppose we have some additional, relevant information. For instance, we might have the second column in Table 21.2, which contains the population size of each province. First, it is not unreasonable to have such census knowledge from some outside source, and secondly, it should prove useful. Consider

(6,786,900 + 10,694,400 + 3,676,300)/54,772,000 = 38.63%.

The 3 provinces represent one-third (3/9 = 33.33%) of the provinces in the country, but 38.63%, more than one-third, of the total population. In order to extrapolate to the provinces where we did not measure the number of beds, which of these two fractions – 33.33% or 38.63% – would serve us better?
FIGURE 21.5 Number of hospital beds versus population size in the 9 South African provinces, 2014, with the least squares linear regression line superimposed

We are interested in the number of beds, not the number of individuals in the population, but the two should be related. To confirm this intuition, we can use the information in Table 21.2 to create Figure 21.5, plotting the total number of beds in each province versus the population size of that province. (Reminder – ordinarily, we do not have all the information in Table 21.2. It is only for the sake of this discussion that we do.) The scatter plot confirms the relationship between these two quantities, which is also evident in the last column, Column 9, of Table 21.2. In Figure 21.5 we have also superimposed the least squares linear regression line; in fact, the Pearson correlation coefficient between the number of beds in a province and its population size is 0.98. The population variable in the third column of Table 21.2 is an example of an auxiliary variable to go along with the outcome variable in the eighth column.

The idea behind a ratio estimator is that we measure two related variables. One is measured on every psu, such as the population per province. The other is measured only on some psu, a sample of them. For example, in these sampled psu we can also count the number of hospital beds. This gives us, in these psu, the number of beds per person in the population. We can then extend this ratio, number of beds per person, to the population as a whole. This is called a ratio estimator. The technique was proposed by Laplace to estimate the population of France at the beginning of the 19th century without needing to perform a complete census [316]: Parishes throughout France kept good records of all baptisms. Laplace chose 30 departements to “cover” France. He estimated the population of those departements by sampling a number of communes within the chosen departements. Then, from the birth numbers, he calculated the proportion of all births in France that occurred in those communes.

Applying the same technique here, we get as the ratio estimator:

population in cluster / total population = beds in cluster / total beds

(6,786,900 + 10,694,400 + 3,676,300)/54,772,000 = 0.3863 = (12,517 + 22,889 + 4,877)/total beds = 40,283/total beds.

Therefore, we estimate the total number of beds to be:

40,283/0.3863 = 104,279.
Equivalently, we estimate the average number of hospital beds per province to be 104,279/9 = 11,586.5. Recall from Table 21.2 that the actual mean is 11,358.8. In this example, the ratio estimate represents an improvement over the cluster sampling estimate, which was 13,427.67. In general, because it utilizes more information, the ratio estimator tends to be more precise than the cluster estimator of the same size. The more relevant the information – in this example, we can ask how well the points in Figure 21.5 are represented by a straight line – the greater the improvement in the estimate.

It is important to note that the ratio estimator is not an unbiased estimator of a population parameter. This can be rectified if, rather than sampling clusters at random, as we did above, we instead sample proportionate to cluster size. This is called probability proportionate to size (pps) sampling. We leave the interested reader to pursue this further in a more specialized text, such as the excellent [317].

Summary: Ratio Estimator

Let Y be an auxiliary variable available for each member of the population. If Y is closely related to X, the outcome variable of interest, we can use Y to improve the estimator of the population mean of X.

Population size: N
Sample size: n
Auxiliary variable: y_1, . . . , y_n, y_{n+1}, . . . , y_N
Population mean of Y: µ_y = Σ_{i=1}^{N} y_i / N
Sample mean of Y: ȳ = Σ_{i=1}^{n} y_i / n
Sampled variable: x_1, . . . , x_n
Sample mean of X: x̄ = Σ_{i=1}^{n} x_i / n
Sample ratio: b = x̄/ȳ
Ratio estimator: µ̂_x = b µ_y
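The same arithmetic in a short Python sketch; the rounded ratio 0.3863 is used, as in the text.

```python
# Ratio estimator of total hospital beds: the sampled provinces contain
# 38.63% of the national population, so the 40,283 beds counted there are
# scaled up by that fraction.
pop_sampled  = 6_786_900 + 10_694_400 + 3_676_300
pop_total    = 54_772_000
beds_sampled = 12_517 + 22_889 + 4_877

ratio = round(pop_sampled / pop_total, 4)       # 0.3863
total_beds = beds_sampled / ratio               # about 104,279
print(f"ratio = {ratio:.4f}, estimated total beds = {total_beds:.0f}")
```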
21.1.6 Two-Stage Cluster Sampling
In cluster sampling, the primary sampling units are the clusters themselves. Within the sampled clusters we take a census – measuring everyone – to obtain the cluster parameters of interest.
The logical question is then, why not replace the census within the chosen clusters by simple random samples, and base our inference on these samples? An alternative way to reach the same question is to start with stratified sampling, but instead of sampling each stratum, we sample only some of the strata. Both of these questions lead us to the two-stage cluster design, where we first sample clusters (strata), and then sample individual units within the chosen clusters. Random sampling enters into the design twice, once at each of the two stages.

One of the primary applications of two-stage cluster sampling is area sampling, where the clusters are counties, townships, city blocks, or other well-defined geographic sections of a population. A prominent example is the who Expanded Programme on Immunization (epi), whose roots can be traced to the successful who Smallpox Eradication Programme in the early 1970s. A typical instance of its impact is shown in Figure 21.6 [318].

FIGURE 21.6 Smallpox incidence in West and Central Africa, 1967–1970; shaded areas represent the range between the highest and lowest incidences reported in 1962–1966

The experience gained from the eradication program-related surveys was adapted for infant immunization coverage, and became the epi 30x7 cluster survey methodology, where 30 villages are chosen at random within an area and 7 people are sampled in each of the villages [319]. Since then, the epi cluster survey methodology has been used in hundreds of surveys to excellent effect [320].

Two-stage cluster sampling is usually more economical than other types of sampling; it saves both time and money. Unfortunately, what we gain on the swings, we lose on the roundabouts. When estimating a population parameter with samples of the same size, cluster sampling typically produces a larger standard error than simple random sampling. For example, a single simple random sample of size 210 would result in a smaller standard error than a two-stage 30x7 sample, even though the size is also 210.

To explore how to measure the accuracy of our inference, we introduce some notation. We first sample n psu, and denote the subscripts of the clusters chosen by i_1, . . . , i_n. In the second stage, we select simple random samples – possibly of different sizes – within each chosen psu. In psu_i, the sample is of size m_i, and the observations are x_{ij_1}, x_{ij_2}, . . . , x_{ij_{m_i}}. Then, for each psu_i where
i = i_1, . . . , i_n, the sample mean is

x̄_i = Σ_{k=1}^{m_i} x_{ij_k} / m_i .
For example, to monitor vaccine coverage, we can set x_{ij_k} = 1 or 0, depending on whether the j_k th child in the ith psu is or is not vaccinated. We can use these sample means to estimate the mean of the ssu within psu_i – to estimate the coverage in that cluster. We can also combine these sample means with appropriate weights to provide an estimator of the overall population mean (coverage):

x̄ = Σ_{k=1}^{n} M_{i_k} x̄_{i_k} / (n M̄) ,

where M̄ = Σ_{k=1}^{n} M_{i_k} / n is the average psu size in the sample.
So far in this section, our estimators do not present anything novel. The difficulty enters when estimating variances and standard errors. To calculate the standard error of the estimator of the population mean, we must consider the two sources of randomness due to sampling: first when determining the choice of the psu, and second when determining the choice of the ssu. As a result, the standard error has two components. For the first component, within each sampled psu_i we consider the variability of the sampled ssu,

s_i² = Σ_{k=1}^{m_i} (x_{ij_k} − x̄_i)² / (m_i − 1).

We can average these variances over the different sized psu, and divide by M² to find this component of the variance of the sample mean,

[N / (n M²)] Σ_{k=1}^{n} M_{i_k} (M_{i_k} − m_{i_k}) s_{i_k}² / m_{i_k} .

We use the finite population correction factor to accommodate possibly sizable sampling fractions in individual psu. The second component of the standard error measures the variability among the means of the psu. Consider the variability among the psu totals,

s_m² = [1/(n − 1)] Σ_{k=1}^{n} M_{i_k}² (x̄_{i_k} − x̄)² .

Weight this to get the component contribution

[N (N − n) / (n M²)] s_m² .

Now the variance associated with the unbiased estimator of the population mean can be estimated with an unbiased estimator which is the sum of these two components,

[N / (n M²)] { (N − n) s_m² + Σ_{k=1}^{n} M_{i_k} (M_{i_k} − m_{i_k}) s_{i_k}² / m_{i_k} } .

The standard error can be estimated by the square root of this variance estimator. This decomposition of the standard error helps us determine how to improve the precision of our inference. The obvious strategy of making the m_i large and as close as possible to their respective M_i, and also making n large and close to N, needs to be tempered by the fact that the s_i and s_m might also increase at the same time. We explore this further in the next section.
Summary: Two-Stage Cluster Design

The population is segmented into nonoverlapping clusters, called psu, and a sample of these is chosen. Then, within these selected clusters, we choose random samples of ssu.

Population level:
psu_i: i = 1, . . . , N
Number of ssu in psu_i: M_i
Population size: M = Σ_{i=1}^{N} M_i
Measure of jth ssu in psu_i: x_{ij}, j = 1, . . . , M_i
Mean in psu_i: µ_i = Σ_{j=1}^{M_i} x_{ij} / M_i
Population mean: µ = Σ_{i=1}^{N} M_i µ_i / M = Σ_{i=1}^{N} Σ_{j=1}^{M_i} x_{ij} / M
Variance in psu_i: σ_i² = Σ_{j=1}^{M_i} (x_{ij} − µ_i)² / (M_i − 1)
Overall variance of means: σ_m² = Σ_{i=1}^{N} (µ_i − µ)² / (N − 1)

Sample level:
Sample mean in psu_i: x̄_i = Σ_{k=1}^{m_i} x_{ij_k} / m_i
Average ssu size in sample: M̄ = Σ_{k=1}^{n} M_{i_k} / n
Population mean estimator: x̄ = Σ_{k=1}^{n} M_{i_k} x̄_{i_k} / (n M̄)
Sample variance in psu_i: s_i² = Σ_{k=1}^{m_i} (x_{ij_k} − x̄_i)² / (m_i − 1)
Sample variance of means: s_m² = Σ_{k=1}^{n} M_{i_k}² (x̄_{i_k} − x̄)² / (n − 1)
Variance of mean estimator: [N / (n M²)] { (N − n) s_m² + Σ_{k=1}^{n} M_{i_k} (M_{i_k} − m_{i_k}) s_{i_k}² / m_{i_k} }
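A minimal Python sketch of these formulas, under the notation of the box above; all the input values are hypothetical, chosen only to exercise the function.

```python
# Two-stage cluster design: big_N and big_M are the numbers of psu and ssu in
# the population; each sampled psu k supplies its size M_k, subsample size
# m_k, sample mean xbar_k, and within-psu variance s2_k.
def two_stage_variance(big_N, big_M, M, m, xbar, s2):
    n = len(M)
    Mbar = sum(M) / n                                         # average psu size in sample
    x = sum(Mk * xk for Mk, xk in zip(M, xbar)) / (n * Mbar)  # overall mean estimate
    s2_m = sum(Mk**2 * (xk - x)**2 for Mk, xk in zip(M, xbar)) / (n - 1)
    within = sum(Mk * (Mk - mk) * s2k / mk
                 for Mk, mk, s2k in zip(M, m, s2))
    var = big_N / (n * big_M**2) * ((big_N - n) * s2_m + within)
    return x, var

x, var = two_stage_variance(big_N=30, big_M=6000,
                            M=[180, 220, 200], m=[7, 7, 7],
                            xbar=[0.71, 0.86, 0.57], s2=[0.21, 0.13, 0.25])
print(f"mean = {x:.3f}, se = {var ** 0.5:.3f}")
```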
21.1.7 Design Effect

Since with two-stage cluster sampling we study a limited number of randomly chosen clusters, it would be advantageous if each of these clusters mirrored the population as a whole. In the extreme, when each cluster is truly just a smaller version of the population, there would not be any variability between the psu means – as estimated by s_m² above – and we would need very few clusters in our sample. Furthermore, the variability within a psu should be constant across psu, and that would be the only source of variability. Indeed, in this extreme case, we are reduced to a simple random sample of size m = Σ_{j=1}^{n} m_{i_j}, and the standard error for estimating any population parameter would be approximately that of a simple random sample. At the other extreme, if all members within a stratum are identical to each other, there would be no variability within a psu, and the s_i² would all be equal to 0. The variability would be entirely due to the differences between the psu means, and the sample is equivalent to a simple random sample of size n.

From these two extremes, we can think of a simple random sample as a reference point against which to contrast other designs. In practice, we evaluate a design by taking the ratio of the variance
of its mean estimator to the variance of the mean estimator for a simple random sample of the same size. This is called the design effect, denoted Deff. The design effect is a means of quantifying the uncertainty introduced by the design. Since the variance of the estimator for the mean of a simple random sample is inversely proportional to the sample size n, the design effect also informs us about a closely related concept, the effective sample size, neff = n/Deff. A given survey of size n with design effect Deff has the same standard error as a simple random sample of size neff. So, if Deff > 1, we would need a larger sample than a simple random sample in order to achieve the same accuracy. If Deff < 1, we need a smaller one. We do not often have Deff < 1, except perhaps when using a stratified design where the strata are homogeneous. More often we encounter designs with Deff > 1.

It would be more proper to label the design effect Deff(x), to acknowledge that it is not only a property of the design, but also depends on which variable we are measuring. Rare is the survey where only a single variable is measured. That means we have as many design effects in a survey as we have variables. Common practice is to associate the design effect with the design, and we continue to follow this tradition, albeit an imperfect one.

The design effect is useful as a summary statistic when designing a study, or more specifically, when deciding the required sample size for a study. A typical design sequence might be: first determine the sample size n required for the desired accuracy if using a simple random sample – for example, when obtaining a confidence interval or testing a hypothesis. Then, to achieve the same accuracy using a design with design effect Deff, the sample size required is nDeff. A complication arises, for example, if we want to obtain confidence intervals for different parameters, each associated with a different design effect. One option is to use the design effects associated with the most important variables in a survey and choose the maximum of these. This will guarantee the appropriate sample size for all the important variables.

Note that some authors use the square root of the design effect and confusingly refer to that as the design effect. Others appropriately rename it as

Deft = √Deff .

Deft cannot be used as a divisor of the sample size to obtain the effective sample size, but it is generally smaller than Deff. This makes the price we pay for clustering seem smaller than it actually is. Another related quantity is the intraclass correlation coefficient (icc). Mathematically, with a cluster design where each cluster is of size m,

Deff = 1 + (m − 1) icc.
In practice the cluster sizes vary. If that variance is not large, we can devise an approximation by replacing m with an average cluster size, even the arithmetic mean. The icc, like a correlation coefficient, is a measure of the correlation between elements within the same cluster. One might expect this correlation to be high when dealing with infectious diseases, for example, resulting in the need for a much larger sample size for a cluster design than for a simple random sample; see [321]. When designing a study, we need to determine, via the design effect or the icc, the effective sample size. Here we face the usual circularity: to design the study, we must use estimates of quantities that the study itself is meant to produce. The usual advice is to either do a small pilot study to guide the design of the study, or use results from previous surveys.
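A short Python sketch tying these quantities together; the icc value is hypothetical.

```python
# Design effect, icc, and effective sample size for a cluster design with
# clusters of (roughly) equal size m.
icc = 0.05                              # hypothetical intraclass correlation
m = 7                                   # ssu measured per cluster
n = 210                                 # total sample size (30 clusters x 7)

deff = 1 + (m - 1) * icc                # design effect
n_eff = n / deff                        # effective sample size
print(f"Deff = {deff:.2f}, effective sample size = {n_eff:.0f}")  # 1.30, 162
```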
21.1.8 Nonprobability Sampling
All of the sampling designs described in this section result in probability samples. Because the probability of being included in the sample is known for each subject in a population, valid and
reliable inferences can be made. This cannot be said of nonprobability samples, in which the probability that an individual subject is included is not known. Examples of nonprobability samples include convenience samples and samples made up of volunteers. These types of samples are prone to bias, and cannot be assumed to be representative of any target population. In general, the choice of a sampling strategy depends on a number of factors, including the objectives of the study and the available resources [317]. The costs and benefits of the various methods should be weighed carefully. In practice, an investigator often combines two or more different sampling strategies.
21.2 Sources of Bias
No matter what the sampling scheme, when we are choosing a sample, selection bias is not the only potential source of error. A second source of bias is nonresponse. In situations where the units of study are people, there are typically individuals who cannot be reached, or who cannot or will not provide the information requested. Bias is present if these nonrespondents differ systematically from the individuals who do respond. Consider the results of the following study in which a sample of 5574 psychiatrists practicing in the United States were surveyed. A total of 1442, or only 26%, returned the questionnaire [322]. Of the 1057 males responding to the question, 7.1% admitted to having sexual contact with one or more patients. Of the 257 females, 3.1% confessed to this practice. How might nonresponse affect these estimates? Given that the Hippocratic oath expressly forbids sexual contact between physicians and their patients, it seems unlikely that there were claims of sexual contact that did not actually occur. It also seems probable that there were psychiatrists who did in fact have sexual contact with a patient and subsequently declined to return the survey. Therefore, it is likely that the percentages calculated from the survey data underestimate the true population proportions, by how much we do not know. A further potential source of bias results from the fact that a respondent may choose to lie rather than reveal something that is sensitive or incriminating; this might be true of the psychiatrists mentioned above, for example. Another situation in which an individual might not tell the truth is in a study investigating substance abuse patterns during pregnancy. In some states, a woman who confesses to using cocaine during her pregnancy runs the risk of having her child taken away from her. This risk might provide sufficient incentive to lie. In other circumstances, a person may lie even if the consequences are not as dire. For decades, public opinion polls have consistently reported that 40% of Americans attend a worship service at a church or synagogue at least once a week. This percentage is far higher than in most other Western nations. A study conducted in 1993 checked the attendance figures at religious services in a selected county in Ohio as well as a number of other churches around the country, and found that true attendance was closer to 20% [323]. Followup studies have suggested that it is often the most committed members of a religious group who exaggerate their involvement; even if they did not attend a service during the week in question, they feel they can answer in the affirmative because they usually do attend. One way to minimize the problem of lying in sample surveys is to apply the technique of randomized response. By introducing an extra degree of uncertainty into the data, we can mask the responses of specific individuals while still making inference about the population as a whole. If it works, randomized response reduces the motivation to lie. For example, suppose that the quantity we wish to estimate is the population prevalence of some characteristic, represented by π. An example might be the proportion of psychiatrists having sexual contact with one or more patients. A random sample of individuals from the population are questioned as to whether they possess this characteristic or not. Rather than being told to answer
the question in a straightforward manner, a certain anonymous proportion of the respondents – represented by a, where 0 < a < 1 – is instructed to answer “yes” under all circumstances. The remaining individuals are asked to tell the truth. Therefore, in a sample of size n, approximately na persons will always give an affirmative answer; the other n(1 − a) will reply truthfully. Of these n(1 − a), n(1 − a)π will say “yes,” and n(1 − a)(1 − π) will say “no.” If n* is the total number of respondents answering “yes,” then, on average,

n* = na + n(1 − a)π.

If we subtract na from each side of this equation and divide by n(1 − a), the population prevalence π may be estimated as

π̂ = [ (n*/n) − a ] / (1 − a).

As an example, a study conducted in New York City compared telephone responses obtained by direct questioning to those obtained through the use of randomized response. The study investigated the use of four different drugs: cocaine, heroin, PCP, and LSD. Each individual questioned was asked to have three coins available; he or she was to toss the coins before being asked a question and to respond according to the outcome of the toss. The rules were a little more complicated than those described above. If all three coins were heads, the respondent was instructed to answer in the affirmative; if all three were tails, then he or she had to answer “no.” If the coins were a mixture of both heads and tails, the respondent was asked to tell the truth. Therefore, the proportion of individuals always replying “yes” was

a_1 = 1/2 × 1/2 × 1/2 = 1/8.

Similarly, the proportion always providing a negative response was

a_2 = 1/2 × 1/2 × 1/2 = 1/8.

The remaining

1 − 1/8 − 1/8 = 6/8 = 3/4

were instructed to tell the truth. In a sample of size n, approximately (3/4)nπ of these would reply “yes,” and (3/4)n(1 − π) would reply “no.” If n* is the total number of individuals answering “yes,” then

n* = (1/8)n + (3/4)nπ.

Consequently,

π̂ = (8n* − n) / (6n)

would be the estimate of the proportion using a particular drug. For three of the four drugs in question, the proportions of individuals acknowledging use were higher when the answers were obtained by means of randomized response than they were when direct questioning was used; for cocaine the percentage increased from 11% to 21%, and for heroin it went from 3% to 10% [324]. This suggests that some individuals were not being completely truthful when questioned directly.

The primary advantage of randomized response is that it reduces the proportion of individuals providing untruthful answers. Although it is impossible to identify individual responses, aggregate information can still be obtained. However, since this technique introduces an extra source of uncertainty into the analysis, the estimator π̂ has a larger variance than it would in the situation in which no masking device is used and everyone is assumed to answer the question honestly. What we lose in precision, however, we gain in accuracy.
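A minimal Python simulation (not part of the study) of the three-coin scheme, showing that the estimator recovers a hypothetical prevalence.

```python
# Three-coin randomized response: 1/8 of respondents always say yes, 1/8
# always say no, and the rest answer truthfully; the estimator
# pi_hat = (8 n_star - n) / (6 n) recovers the prevalence.
import random

random.seed(3)
pi_true, n = 0.10, 100_000                  # hypothetical prevalence and sample size
n_star = 0
for _ in range(n):
    heads = sum(random.random() < 0.5 for _ in range(3))  # number of heads
    if heads == 3:                          # all heads: forced "yes"
        n_star += 1
    elif heads > 0:                         # mixed: truthful answer
        n_star += random.random() < pi_true
    # all tails (heads == 0): forced "no", nothing added

pi_hat = (8 * n_star - n) / (6 * n)
print(f"estimated prevalence = {pi_hat:.3f}")   # close to 0.10
```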
21.3 Further Applications
In previous chapters we studied a sample of low birth weight infants born in two teaching hospitals in Boston, Massachusetts. To illustrate some of the practical issues of sampling, we now treat these children as if they constituted a finite population of size N = 100. We might wish to describe a particular characteristic of this population – their average gestational age, perhaps. The 100 measures of gestational age for the infants are displayed in Table 21.3 [81]. The true mean for the population is µ = 28.9 weeks. Suppose we do not know this, and we also do not have the resources to obtain the necessary information for each child. Instead, we must estimate the mean by obtaining information about a representative fraction of the newborns. How would we proceed?

To select a simple random sample of size n, we choose study units independently from a list of the population elements – known as the sampling frame – until we achieve the desired sample size. Suppose we wish to draw a sample of size n = 10. One way to go about this would be to write the integers from 1 to 100 on slips of paper. After mixing them up thoroughly, we would select 10 different numbers. If we were working with a very large population this method would be impractical; instead, we could use a computer to generate the random numbers. In either case, each study unit has an equal chance of being selected. The probability that a particular unit is chosen is

n/N = 10/100 = 0.10.
The ratio n/N is the sampling fraction of the population. Suppose that we follow this procedure for drawing a simple random sample and select the following set of numbers:

93  11  28  6  90  51  10  22  36  48

Returning to the population of low birth weight infants, we determine the gestational age of each of the newborns chosen; the appropriate values are marked in Table 21.3 and are listed below:

32  26  28  25  24  23  29  30  27  28

Note that the observations selected in a simple random sample need not be distributed evenly over the entire sampling frame. Using this random sample, we would estimate the population mean as

x̄ = (32 + 26 + 28 + 25 + 24 + 23 + 29 + 30 + 27 + 28)/10 = 27.2 weeks.

This value is a little smaller than the true population mean of 28.9 weeks.
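The draw can also be carried out by computer. Below is a minimal Python sketch; the list encodes the ages of Table 21.3 in ID order, and the seed is arbitrary, so a different seed gives a different sample and a different estimate.

```python
import random

# Gestational ages for the N = 100 infants, in ID order (Table 21.3)
ages = [29, 31, 33, 31, 30, 25, 27, 29, 28, 29,
        26, 30, 29, 29, 29, 29, 29, 33, 33, 29,
        28, 30, 27, 33, 32, 28, 29, 28, 29, 30,
        31, 30, 31, 29, 27, 27, 27, 32, 31, 28,
        30, 29, 28, 31, 27, 25, 30, 28, 28, 25,
        23, 27, 28, 27, 27, 26, 25, 23, 26, 24,
        29, 29, 27, 30, 30, 32, 33, 27, 31, 26,
        27, 27, 35, 28, 30, 31, 30, 27, 25, 25,
        26, 29, 29, 34, 30, 29, 33, 30, 29, 24,
        33, 25, 32, 31, 31, 31, 29, 32, 33, 28]

random.seed(1)                      # arbitrary seed, for reproducibility
sample = random.sample(ages, 10)    # simple random sample of size n = 10
print(sum(sample) / len(sample))    # estimate of the population mean
```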
As an alternative to the procedure described above, we might prefer to apply the technique of systematic sampling. When a complete list of the N elements in a population is available, systematic sampling is easier to carry out; it requires the selection of only a single random number. As noted above, the desired sampling fraction for the population of low birth weight infants is 0.10, or 1 in 10. Therefore, we would begin by randomly selecting the initial study unit from the first 10 units on the list. Suppose that we write the integers from 1 to 10 on slips of paper and randomly choose the number 5. In addition to identifying the gestational age of the 5th infant on the list, we would determine the gestational age of every 10th consecutive child – the 15th, the 25th, and so on. The appropriate ages are displayed below.
TABLE 21.3
Measures of gestational age for a population of 100 low birth weight infants

 ID  Age     ID  Age     ID  Age     ID  Age
  1   29     26   28     51   23     76   31
  2   31     27   29     52   27     77   30
  3   33     28   28     53   28     78   27
  4   31     29   29     54   27     79   25
  5   30     30   30     55   27     80   25
  6   25     31   31     56   26     81   26
  7   27     32   30     57   25     82   29
  8   29     33   31     58   23     83   29
  9   28     34   29     59   26     84   34
 10   29     35   27     60   24     85   30
 11   26     36   27     61   29     86   29
 12   30     37   27     62   29     87   33
 13   29     38   32     63   27     88   30
 14   29     39   31     64   30     89   29
 15   29     40   28     65   30     90   24
 16   29     41   30     66   32     91   33
 17   29     42   29     67   33     92   25
 18   33     43   28     68   27     93   32
 19   33     44   31     69   31     94   31
 20   29     45   27     70   26     95   31
 21   28     46   25     71   27     96   31
 22   30     47   30     72   27     97   29
 23   27     48   28     73   35     98   32
 24   33     49   28     74   28     99   33
 25   32     50   25     75   30    100   28
 ID     5   15   25   35   45   55   65   75   85   95
 Age   30   29   32   27   27   27   30   30   30   31
These observations are evenly distributed over the sampling frame. As long as the population list is randomly ordered – and we have no reason to believe that it is not – a systematic sample can be treated as a simple random sample. In this case, therefore, we would estimate the population mean as

x̄ = (30 + 29 + 32 + 27 + 27 + 27 + 30 + 30 + 30 + 31)/10 = 29.3 weeks.

This time, our estimate is slightly larger than the true population mean.
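Because a systematic sample requires only a single random number, it is especially easy to program. A minimal sketch follows; the helper name is ours, and `population` would hold the 100 ages in ID order, as in the earlier sketch.

```python
import random

def systematic_sample(population, k):
    """Draw a 1-in-k systematic sample: randomly choose a starting
    position among the first k units, then take every k-th unit."""
    start = random.randrange(k)       # 0-based start, between 0 and k - 1
    return population[start::k]

# With k = 10 and a start of 4 (the 5th unit), this returns the ages
# of infants 5, 15, 25, ..., 95 from the list of 100 gestational ages.
```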
If we feel it is important to include representative numbers of male and female infants in our sample – we might think that sex is associated with gestational age – we could select a stratified random sample. To do this, we must first divide the population of low birth weight infants into two distinct subgroups of 44 boys and 56 girls. The 100 values of gestational age for the population, sorted by sex, are displayed in Table 21.4. Even though we are working with two separate subpopulations, we would still like to have an overall sampling fraction of 1/10. Therefore, we should select a simple random sample of size

44 × (1/10) = 4.4 ≈ 4

from the group of males and a sample of size

56 × (1/10) = 5.6 ≈ 6
from the group of females. Using simple random sampling, we choose observations 2, 85, 61, and 54 for the males. For the females, we select 51, 14, 33, 25, 62, and 74. These observations are marked in Table 21.4. Thus, we see that the stratum-specific sample means are

x̄_males = (31 + 30 + 29 + 27)/4 = 29.3 weeks

and

x̄_females = (23 + 29 + 31 + 32 + 29 + 28)/6 = 28.7 weeks.
TABLE 21.4
Measures of gestational age for a population of 100 low birth weight infants, stratified by sex

          Males                    Females
 ID  Age     ID  Age     ID  Age     ID  Age
  1   29     72   27      3   33     49   28
  2   31     75   30      4   31     50   25
  6   25     76   31      5   30     51   23
  7   27     77   30      8   29     55   27
 15   29     85   30      9   28     57   25
 16   29     86   29     10   29     58   23
 21   28     87   33     11   26     59   26
 23   27     88   30     12   30     60   24
 24   33     89   29     13   29     62   29
 26   28     90   24     14   29     65   30
 28   28     91   33     17   29     66   32
 31   31     92   25     18   33     67   33
 34   29     95   31     19   33     68   27
 37   27     96   31     20   29     69   31
 39   31     97   29     22   30     70   26
 41   30     98   32     25   32     73   35
 42   29               27   29     74   28
 43   28               29   29     78   27
 47   30               30   30     79   25
 48   28               32   30     80   25
 52   27               33   31     81   26
 53   28               35   27     82   29
 54   27               36   27     83   29
 56   26               38   32     84   34
 61   29               40   28     93   32
 63   27               44   31     96   31
 64   30               45   27     99   33
 71   27               46   25    100   28
The true population mean is estimated as a weighted average of these quantities; therefore,

x̄ = [4(29.3) + 6(28.7)]/10 = 28.9 weeks.
By chance, this value is identical to the true population mean µ.
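The stratified calculation is easy to verify by computer. Here is a minimal Python sketch using the sampled ages marked in Table 21.4; because nothing is rounded until the final step, the result matches µ exactly.

```python
# Ages of the sampled infants (Table 21.4)
males = [31, 30, 29, 27]              # infants 2, 85, 61, 54
females = [23, 29, 31, 32, 29, 28]    # infants 51, 14, 33, 25, 62, 74

x_males = sum(males) / len(males)         # 29.25 weeks
x_females = sum(females) / len(females)   # 28.67 weeks

# Weighted average of the stratum-specific means
n = len(males) + len(females)
x_bar = (len(males) * x_males + len(females) * x_females) / n
print(round(x_bar, 1))                    # 28.9 weeks
```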
21.4 Review Exercises
1. When conducting a survey, how does the study population relate to the target population? What is the sampling frame?

2. How does the finite version of the central limit theorem differ from the more commonly used version in which the underlying population is assumed to be infinite?

3. When might you prefer to use systematic sampling rather than simple random sampling? When would you prefer stratified sampling?

4. Explain the difference between cluster sampling and two-stage cluster sampling.

5. Explain the concept of a ratio estimator.

6. If the design effect of a proposed survey design takes a value greater than 1, what does this imply?

7. How can nonresponse result in a biased sample? What could you do to attempt to minimize nonresponse?

8. A study was conducted to examine the effects of maternal marijuana and cocaine use on fetal growth. Drug exposure was assessed in two different ways: the mothers were questioned directly during an interview, and a urinalysis was performed [325].
   (a) Suppose that it is necessary to rely entirely on the information provided by the mothers. How might nonresponse affect the results of the study?
   (b) An alternative strategy might be to interview only those women who agree to be questioned. Do you feel that this method would provide a representative sample of the underlying population of expectant mothers? Why or why not?

9. Each year, the United States Department of Agriculture uses the revenue collected from excise taxes to estimate the number of cigarettes consumed in this country. Over the 11-year period 1974 to 1985, however, repeated surveys of smoking practices accounted for only about 72% of the total consumption [326].
   (a) How would you explain this discrepancy in the estimates of cigarette consumption?
   (b) Which source are you more likely to believe, the excise tax revenue or the surveys of smoking practices?

10. Suppose that you are interested in conducting your own survey to estimate the proportion of psychiatrists who have had sexual contact with one or more patients. How would you carry out this study? Justify your method of data collection. Include a discussion of how you would attempt to minimize bias.

11. The data set lowbwt contains information describing 100 low birth weight infants born in Boston, Massachusetts [81]. Assume that these infants constitute a finite population. Their measures of systolic blood pressure are saved under the variable name sbp; the
mean systolic blood pressure is µ = 47.1 mm Hg. Suppose that we do not know the true population mean and wish to estimate it using a sample of 20 newborns.
   (a) What is the sampling fraction of the population?
   (b) Select a simple random sample and use it to estimate the true mean systolic blood pressure for this population of low birth weight infants.
   (c) Draw a systematic sample from the same population and again estimate the mean systolic blood pressure.
   (d) Suppose you believe that a diagnosis of toxemia in an expectant mother might affect the systolic blood pressure of her child. Divide the population of low birth weight infants into two groups: those whose mothers were diagnosed with toxemia, and those whose mothers were not. Select a stratified random sample of size 20. Use these blood pressures to estimate the true population mean.
   (e) What are the sampling fractions in each of the two strata?
   (f) Could cluster sampling be applied in this problem? If so, how?
22 Study Design
CONTENTS
22.1 Randomized Studies
     22.1.1 Control Groups
     22.1.2 Randomization
     22.1.3 Blinding
     22.1.4 Intention to Treat
     22.1.5 Crossover Trial
     22.1.6 Equipoise
22.2 Observational Studies
     22.2.1 Cross-Sectional Studies
     22.2.2 Longitudinal Studies
     22.2.3 Case-Control Studies
     22.2.4 Cohort Studies
     22.2.5 Consequences of Design Flaws
22.3 Big Data
22.4 Review Exercises
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

R.A. Fisher
Presidential Address to the First Indian Statistical Congress (1938)

In Chapter 21 we discuss sampling theory, and the ways in which surveys can be used to quantify population characteristics. Comparative studies – sometimes called analytical studies – are used to help understand the relationships between outcomes and explanatory variables measured for a population. Explanatory variables might be inherent subject characteristics, medical treatments, environmental factors, or other exposures. Study design is the creation of a plan to collect data to answer an important scientific question about the association between an outcome and one or more explanatory variables. For example, we might want to know if a particular health policy will lower infant mortality, or whether a new drug will improve survival in patients diagnosed with breast cancer.

When designing a study to answer a research question, the goal is to determine the effect of the exposure in the population by comparing how often the outcome occurs in the presence of the exposure to how often it occurs in its absence. We can think of this as being analogous to a scientific experiment conducted in the laboratory. If an investigator wishes to study the effect of heat on a particular liquid, for example, they would do this by taking two identical samples of the liquid and applying heat to one but not the other. If the two samples remain identical, the conclusion would be that heat had no effect; on the other hand, if the two samples are no longer the same, the conclusion would be that heat caused the difference.
When studying humans, comparisons get more complicated. We cannot find two identical humans, perform some action on one but not the other, and draw a conclusion that can be generalized to a larger group of people. That we cannot find two identical humans does not diminish the importance of the concept. For practical purposes we identify two groups of people who – although not identical – are on average as similar as possible with respect to all important characteristics except exposure status. Then, if the two groups remain comparable after one is exposed and the other is not, we conclude that the exposure had no effect. If the groups are different, however, the conclusion is that the difference is a result of the exposure. How we then generalize these results depends on how the groups were chosen.

There are a number of different ways in which information about exposures and outcomes can be collected in order to answer a research question. Assuming that only ethical studies are considered [327], the particular design chosen depends on the nature of the question being asked, the goal of the research, and the resources available. Each design has its own inherent strengths and weaknesses. Most study designs can be placed into one of two broad categories: randomized studies, and observational studies.
22.1 Randomized Studies
In a randomized study, subjects are assigned to one exposure group or another by the investigator, using a random mechanism. Subjects are then followed prospectively over time, and information about one or more outcomes is collected. Because the exposure is assigned, it is often referred to as a treatment or an intervention. A randomized study design aims to eliminate the effects of confounders when examining the relationship between the treatment and the outcome by distributing these confounding factors equally between the two treatment groups. If older people are more likely to develop the outcome than younger people, for instance, and if older people are also more likely to receive a particular intervention, then it would be difficult to determine whether a higher incidence of outcomes in the intervention group is due to the intervention itself or to the older age of those who receive it. In a randomized study, subjects of various ages would be randomly distributed between the groups which receive and do not receive the intervention. Then, if a difference in outcomes is observed, the difference is attributed to the intervention itself.

Randomized studies are often used to evaluate the safety and efficacy of drugs, vaccines, and other therapeutic or medical procedures. In these instances, when the study involves human subjects, the design is called a randomized clinical trial. These studies are often considered to be the standard in establishing causal relationships between exposures and outcomes. One example of a randomized clinical trial is introduced in Section 11.2 [202]. Adults who underwent liver transplantation were recruited two months after surgery and randomized to one of two treatment groups: a combined intervention of exercise and dietary counseling, or usual care without additional intervention. Six months after surgery, patients completed the Medical Outcomes Study Short Form (sf-36) questionnaire. One outcome of the study was the mental component summary score of the sf-36, a continuous measurement where higher scores indicate better health-related quality of life.

Another example of a randomized clinical trial demonstrates that these studies are not restricted to two treatment arms. To support an effort to stockpile doses of smallpox vaccine in the United States in case their use becomes necessary, a study was conducted to examine the success of a vaccine manufactured in the 1950s and frozen for several decades [328]. The undiluted preparation of the vaccine was compared to a 1:5 dilution (1 part vaccine to 5 parts sterile water diluent), and also a 1:10 dilution. Eligible volunteers were randomly assigned to one of the three dilution strengths. The outcomes of the study were vaccine success, defined as presence of a vesicle or pustule at the inoculation site 6 to 11 days post vaccination, as well as local and systemic reactions to vaccination.
22.1.1 Control Groups

One distinguishing characteristic of a randomized study is the inclusion of a control group, defined as a group of subjects who do not receive the exposure or intervention of interest. Suppose that in the trial of adults undergoing liver transplantation all participants received the combined intervention of exercise and dietary counseling. If six months after transplant we measure the mental component summary scores of these individuals, what are we able to say about the effect of counseling on quality of life? Perhaps the mental component summary scores of these patients would have been exactly the same with no counseling at all. Without a control group of subjects who do not receive the intervention – in the clinical trial, some transplant patients received usual care only, with no exercise or dietary counseling – we would not be able to draw any conclusion about the effect of the exposure. Exercise and dietary counseling might have a positive impact on quality of life, or a negative one, or no impact at all. We would not be able to say.

In some randomized studies members of the control group, to which the intervention group is compared, may not receive any active treatment; in others, they may receive the standard or conventional therapy. In the clinical trial evaluating the combined intervention of exercise and dietary counseling, outcomes in this exposure group were compared to outcomes for individuals who received usual post-transplant care, without additional counseling. In the study investigating the success of frozen smallpox vaccine, 1:5 and 1:10 dilutions of the vaccine were compared to an undiluted preparation.
22.1.2 Randomization

Randomization is a process which uses probability to assign subjects to an exposure group as they are enrolled in a trial. Subsequent to placing the patient on the study, the investigators have no input regarding exposure group assignment for any particular individual. In most such studies, randomization is performed so that study participants have an equal chance of being assigned to the intervention and control groups. Randomization is often carried out using a computer algorithm, but can also be accomplished by tossing a coin, or by using consecutive entries in a random number table where, for example, even integers are used to indicate assignment to the intervention group and odd integers to the control group.

Ideally, randomization is carried out by an individual who is not involved in either the treatment of study subjects or in the assessment of outcomes. This is known as allocation concealment. Allocation concealment guarantees that the implementation of the randomization process is free from manipulation or bias. Traditionally, allocation concealment relied on sequentially numbered, sealed, opaque envelopes containing the randomization results. Today it is more common to use computer applications where investigators can access treatment assignments as subjects are enrolled.

The primary benefit of randomization is that when the sample size of a study is large, both measured and unmeasured confounders will be equally distributed between the intervention and control groups, with no systematic differences. The two groups should be comparable in all respects except for exposure status. Then, if the outcome of interest differs between the two groups, we can conclude that the difference is a result of the exposure.

In many instances, the preferred approach to randomization is simple randomization of individual subjects. As subjects are enrolled, each has a 50% chance of being assigned to the intervention group, and a 50% chance of being assigned to the control group. In smaller studies, however, equivalence of the distribution of confounders between treatment groups is not guaranteed. Special procedures are sometimes used to either balance the number of participants in each group throughout the course of enrollment, or to distribute baseline factors known to influence the outcome equally between the groups.

With block randomization, randomization is performed in small blocks of prespecified size. For example, if a block size of 8 is used, randomization proceeds until four individuals have been assigned
to one of the two groups; after that, subjects are automatically assigned to the other group until the block of size 8 is completed. In this case, after each consecutive block of 8 subjects, the number of participants in each treatment group will be perfectly balanced. Blocking was used in the study comparing different dilution strengths of a frozen smallpox vaccine. Volunteers were randomized in blocks of size 6; for each consecutive group of 6 subjects, two were assigned to each of the three vaccine dilutions. If there is a concern that the treatment assignment of participants at the end of the block could be anticipated, the size of the blocks used can be varied randomly. This is called permuted block randomization. An advantage of block randomization is that if interim analyses are performed at prespecified time points during the study, the treatment groups should be balanced in size at each of these times. Similarly, if the study is terminated early, we would expect the sample sizes in each group to be approximately equal.

Stratified randomization ensures that an important confounder such as age or sex is more evenly distributed between treatment groups than might happen by chance alone. Using this technique, randomization is performed separately within strata, for subjects with and without the confounder. (Recall the discussion of stratified sampling in Chapter 21.) As an example, in the study of patients undergoing liver transplantation, randomization was stratified by whether or not the subject had been diagnosed with hepatitis C. In small studies, stratified randomization can improve statistical power. It has less benefit in large studies, where random assignment is more likely to ensure an even distribution of baseline characteristics across treatment groups.

When reporting the results of a randomized clinical trial, it is customary to include a table summarizing baseline patient characteristics for each exposure group. This allows us to check whether the randomization has worked.
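To illustrate, here is one way block randomization might be implemented – a minimal Python sketch, with the function name, block size, and arm labels chosen by us for illustration.

```python
import random

def block_randomize(n_subjects, block_size=8, arms=("intervention", "control")):
    """Assign subjects to treatment arms using block randomization:
    within each block, every arm appears equally often, in random order."""
    per_arm = block_size // len(arms)     # e.g. 4 per arm when block_size = 8
    assignments = []
    while len(assignments) < n_subjects:
        block = list(arms) * per_arm      # a balanced block
        random.shuffle(block)             # randomize order within the block
        assignments.extend(block)
    return assignments[:n_subjects]

random.seed(7)                            # arbitrary seed
print(block_randomize(16))   # every 8 consecutive assignments contain 4 of each arm
```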
22.1.3 Blinding
Blinding refers to the situation in which participants, investigators, persons who measure outcomes, and other individuals involved in the study are unaware of the exposure group to which a subject is assigned. The origin of the term “blinding” goes back to 1784, when a committee in Paris – chaired by Benjamin Franklin – actually placed a blindfold on a participant as part of a study. Regrettably, the label has persisted to this day [329].

Randomization reduces the impact of confounding factors at the time of randomization, but it does not affect differences that develop between the groups during follow-up. For example, a subject might behave differently if they know they received the intervention being studied, or if they know that they did not. Even if they do not change their behavior, their perception of their health status might differ. This recognized phenomenon is known as the Hawthorne effect. And it is not just the trial participants who can be influenced by knowledge of exposure assignment. An investigator might have expectations about how a particular intervention will work, and if they are aware of treatment assignment this could have an effect – whether consciously or unconsciously – on the way in which patients are treated. Perhaps a clinician will be more likely to prescribe additional medications in patients they know are not receiving the active treatment, for example.

In order to achieve blinding, some randomized studies – especially trials involving drugs or other medications – use a placebo in the control group, when it is ethical to do so. A placebo is an inert substance administered in the same way as the treatment, so that neither the patients themselves nor the investigators know who is receiving the active treatment and who is not. Some randomized clinical trials are single blind, meaning that the study subjects are blinded to exposure assignment, but the investigators are not. The trial of smallpox vaccine was double blind, meaning that neither the study participants nor the investigators knew which vaccine strength was received by each subject. Although blinding is desirable, there are studies in which the interventions differ in a way that makes concealment of treatment assignment infeasible. The trial of patients undergoing liver transplantation is one such study; some subjects received exercise and dietary counseling and others did not, and both the participants and their physicians knew to which group
each individual had been randomized. In studies where it is impossible to blind the investigators, outcome assessment should be determined by someone without knowledge of the treatment assigned in order to prevent bias.
22.1.4 Intention to Treat
In a randomized trial, even if a study subject does not receive the treatment to which they were assigned, in the primary analysis they should always be analyzed according to their random group assignment. It does not matter what happened after randomization. It is possible that a subject used an alternative treatment, or that they did not adhere to the assigned protocol. Nevertheless, analysis according to the randomized group avoids the possibility of bias due to measured and unmeasured confounders, preserving the benefits of randomization.
22.1.5 Crossover Trial
Sometimes, rather than randomly assigning exposure status to two unique groups of subjects, the same individuals receive both treatments – or the active treatment and the control – in succession. Here, the order of treatment allocation is randomized. This is known as a crossover trial. In a crossover trial, the outcome is measured for each participant under two different conditions. Each subject is then compared to themselves, and effectively serves as their own control.

The study of the effect of carbon monoxide exposure on adult males with coronary artery disease introduced in Section 11.4 is an example of a crossover trial [207]. On one day a subject underwent an exercise test on a treadmill until they experienced angina, was exposed to plain room air for one hour, and then performed a second exercise test until they again experienced angina. The outcome recorded was the percent increase in time to angina between the first and second tests. On another day the same subject performed the same sequence of events, but rather than being exposed to plain air between the exercise tests, they were exposed to a mixture of air and carbon monoxide. The order of the two exposures – the day with plain air, and the day with carbon monoxide – was randomly assigned for each individual.

A crossover trial is often more efficient than a study which has independent subjects in each of two parallel exposure groups, meaning that it generally requires a smaller sample size to detect a difference between groups. Because treatments are applied in succession, however, the crossover trial might require a longer total time to complete the study. In crossover trials, a problem can arise if residual effects of the first treatment extend into the second treatment period. Therefore, these trials often include a washout period between the two interventions during which no treatment is received. This is done to minimize the carryover effect of the first treatment into the second treatment period. In general, crossover trials are best for the study of interventions with only short-term effects.
22.1.6 Equipoise
Special ethical concerns can arise in randomized studies. In fact, the ethical basis for a randomized clinical trial is that there is uncertainty or controversy within the medical community as to whether the intervention will be superior to the control or not [330]. If a treatment is believed to be safe and effective, it should not be withheld from those who might benefit from its use. On the other hand, if a treatment is potentially harmful, it should not be used unless the benefits outweigh the risks. When randomizing patients to treatment groups, there must not be data suggesting a difference between the two arms with respect to either safety or efficacy. Neither of the treatment options should be known
to be inferior to the other; information about their risks and benefits must be balanced. This state of balance is called equipoise.

The primary advantage of randomized studies is that they provide the strongest empirical evidence of a causal relationship between an exposure and an outcome [331]. Randomization minimizes the effects of confounding by balancing the distribution of risk factors between the treatment groups; as a result, this study design most closely resembles the scientific experiments conducted in a laboratory. Blinding reduces the possibility that the apparent effects of the exposure are due to different behavior within the treatment arms, or to biased assessment of the outcomes.

Randomized studies are not without disadvantages, however. Randomized clinical trials tend to be resource intensive; they are generally expensive, and can take a long time to complete. They can sometimes be impractical, or even unethical; to investigate the effects of illicit drug use by pregnant women on their unborn children, for example, women cannot be randomized to drug use. Randomized clinical trials may be difficult to justify if a practice has become established and is accepted within the medical community, even if that practice has no data to support its use. Finally, even in a study where confounding and bias have been eliminated, there are still issues of generalizability. If an intervention is found to be effective in males over the age of 50, that does not necessarily mean it will be equally effective in younger males, or in females, or in children. We must use caution when attempting to extrapolate beyond the data we have observed.
22.2 Observational Studies
In an observational study, subjects are again classified as to whether or not they have a particular exposure. Unlike a randomized study, however, exposure status in an observational study is not assigned by the investigator. Instead, the exposure occurs in some natural way, and its presence or absence is merely observed.

When exposure status is not randomly assigned, it is possible that the groups being compared could differ in other important ways. One group might be older than the other, for example, or have more severe disease. The factors that differ between groups might be confounders in the analysis – variables that are associated with both the exposure status and the outcome. If so, these confounders could make the observed relationship between exposure and outcome appear either stronger or weaker than it really is. The most common types of observational study designs are described below.
22.2.1 Cross-Sectional Studies
In a cross-sectional study, a random sample of individuals is selected from the population of interest, and the exposure and outcome are measured at the same point in time. The sample is selected without knowledge of either exposure or outcome status. This design provides a snapshot of the prevalence of the outcome at a particular point in time, overall and by exposure status. For example, a survey might determine whether subjects have been diagnosed with emphysema (the outcome), based on whether or not they are current smokers (the exposure).

Random sampling of the population being studied is particularly important in cross-sectional studies. Selection bias resulting from non-random sampling may distort the observed relationship between exposure and outcome.

An advantage of cross-sectional studies is that they can examine more than one exposure and more than one outcome simultaneously. They are relatively quick and inexpensive compared to other study designs. A disadvantage is that while they can be used to investigate associations, cross-sectional
studies cannot be used to infer causality. Temporality cannot be established; we typically do not know whether the exposure came before or after the outcome for any particular study subject.
22.2.2 Longitudinal Studies
In a longitudinal study, measurements are made on the same study subjects before and after some intervention. This design differs from a crossover trial in that there is no randomization involved; all subjects receive the same intervention, and measurements are made both before and after that event. Over a short time period patient characteristics other than intervention status should not change, minimizing the effect of potential confounders. Furthermore, temporality between the exposure variable and the outcome is established. Over a longer time period, however, the investigator would not have control over other risk factors that might change, making it difficult to attribute a difference in outcome directly to the intervention.
22.2.3 Case-Control Studies
In a case-control study, the investigator begins by identifying groups of subjects with and without the outcome of interest. Those with the outcome are called the cases, and those without the outcome are the controls. Exposure status is then determined, and is compared for the two outcome groups. In this design it is essential to have a clear, consistent definition of what constitutes a case subject, so that the cases represent a homogeneous group. Often, all available subjects meeting the case definition over a specified time period are chosen. It is also important to consider whether the cases selected are representative of all individuals meeting the case definition; issues which could affect this include access to healthcare and patient survival.

In a case-control study, the controls must be representative of the population from which the cases were selected, and are often sampled from the same source population. A subject is eligible to be a control if that person would have been selected as a case had they experienced the outcome. Selection of appropriate control subjects can be difficult; if the controls are either more or less likely than the cases to have been exposed for reasons unrelated to the outcome, the estimated association between exposure and outcome will be biased. Potential biases need to be considered carefully for each research question.

In Chapter 5 we introduced a case-control study investigating the association between oral contraceptive use and breast cancer [130]. The researchers began by identifying a group of female nurses between the ages of 30 and 55, all of whom were members of the American Nurses’ Association, and had a confirmed diagnosis of breast cancer. For each case, 10 control patients who had never been diagnosed with any form of cancer were randomly selected from the same population of female nurses. All individuals were then asked about prior oral contraceptive use.

If there are characteristics so important that an imbalance between groups would alter the conclusions being drawn, then controls might be matched to cases on one or more important confounders. If this is done, a paired analysis must be performed to account for the matching. A problem can arise if not all cases are able to be matched. Cases which are not matched to a control must be excluded from the analysis, which can reduce generalizability of the conclusions.

Case-control studies can be very efficient for studying rare outcomes. They are convenient if there is more than one exposure of interest, and are often less expensive than other study designs. One challenge is that it can be difficult to choose an appropriate control group. In addition, recall bias can be an issue if subjects are asked to remember prior exposure status; individuals who have experienced an adverse health outcome and are classified as cases might be more likely to remember prior exposures than those who are controls. Also, because the investigator selects the ratio of cases to controls, the prevalence of the outcome in the underlying population cannot be determined from a case-control study.
22.2.4 Cohort Studies
In a cohort study, the investigator begins by identifying groups of subjects with and without some exposure or risk factor of interest. Members of both groups must be free of the outcome at the start of the study, but eligible to develop the outcome in the future. Once the exposure groups are defined, these groups are followed for a defined period of time to determine who develops the outcome of interest and who does not. This design is similar in structure to a randomized clinical trial, except that exposure status is not assigned by the investigator.

A cohort study can be either prospective or retrospective, depending on the timing of the research with respect to development of the outcome. In a retrospective cohort study, outcome status is known when the study is begun. Information is collected through medical records, or by asking participants to recall past exposures and outcomes. Study participants are not followed in real time. In a prospective cohort study, the outcome has not yet occurred at the time the study begins. Exposure status is determined, and participants are followed for a period of time during which outcome information is collected. Retrospective cohort studies are generally quicker and less expensive than prospective studies, but temporality can be more easily demonstrated in a prospective design.

In Chapter 5 we introduced a cohort study that examined risk factors for breast cancer among females participating in the first National Health and Nutrition Examination Survey in the 1980s [128]. Exposure status for parous females was determined at baseline; a female was considered to be “exposed” if she first gave birth at age 25 or older, and “unexposed” if she gave birth at a younger age. These individuals were then followed for a median of 10 years, and diagnoses of breast cancer were recorded as the outcome. Breast cancer diagnosis was confirmed by hospital records.

Cohort studies are convenient if there is more than one outcome of interest. Due to the selective inclusion of subjects with the exposure of interest, they are particularly well suited for the study of rare exposures. Furthermore, with this design it is known that the exposure precedes the outcome, making it possible to infer causality. However, cohort studies can also have weaknesses, particularly if the follow-up period for the study is long. Subjects might leave the study or be lost to follow-up. They might die before reaching the endpoint of interest. If the reasons for loss to follow-up are associated with outcome status, the validity of the study results can be affected. In some instances, it is also possible for exposure status to change over the course of a study, or for other events to occur during that time, making it difficult to say that the exposure caused the outcome. Prospective cohort studies in particular can be expensive if many subjects are followed for a long period of time. Both prospective and retrospective cohort studies can be inefficient if the outcome is rare.
22.2.5 Consequences of Design Flaws
Designing a study can be a challenging undertaking. The choice of a particular study design is influenced not only by the research question, but also by what is feasible. Whatever design is chosen, it is imperative that a study produce valid inferences; if it does not, the research was a waste of time and resources. Any study – whether experimental or observational – has both strengths and weaknesses which must be taken into consideration. If a study has fundamental design flaws, there is often nothing that can be done analytically to correct the problem. The only option is to repeat the study with a more appropriate design.
22.3 Big Data
Throughout this text we have demonstrated how to use the methods introduced by applying them to datasets of manageable size – datasets with tens to thousands of observations. We have not used big
data to showcase these methods. The term big data refers to massive amounts of data that continue to grow over time. Examples include information contained in electronic health records, observations from wearables such as Fitbits, and measurements made during medical imaging. All the methods introduced in this book can be used to analyze big data, with a bit of caution. It is often the case that big data do not come from a random sample. This contradicts the random sampling assumption of the statistical models presented in the text, and thus diminishes the validity of inference from these models. Furthermore, the extremely large sample size yields tremendous power to detect even very small deviations from the null hypothesis. However, even if statistically significant, a small difference may not be clinically relevant. In this case, confidence intervals and effect sizes should be used for inference rather than p-values.
22.4 Review Exercises
1. What is the purpose of a control group in a research study?

2. What is the purpose of randomization in a research study?

3. When might an investigator wish to use block randomization? When might stratified randomization be considered?

4. Can a cross-sectional study be used to infer that a particular exposure causes an outcome? Explain.

5. List two strengths and two weaknesses of a case-control study.

6. List two strengths and two weaknesses of a cohort study.

7. In Chapter 1, you were asked to design a study aimed at investigating an issue you believe might influence the health of the world. Now that you have completed the text, revise your initial design. What elements would you keep? What elements would you change, and why?
Bibliography
[1] Who, unicef, unfpa, World Bank Group and the United Nations Population Division, “Maternal Mortality: Levels and Trends 2000–2017,” 2019, https://www.who.int/reproductivehealth/publications/maternal-mortality-2000-2017/en.
[2] Martin N, Montagne R, “US Has the Worst Rate of Maternal Deaths in the Developed World,” Propublica, May 12, 2017, https://www.npr.org/2017/05/12/528098789/us-has-the-worst-rate-of-maternal-deaths-in-the-developed-world.
[3] Gbd 2015 Maternal Mortality Collaborators, “Global, Regional, and National Levels of Maternal Mortality, 1990–2015: A Systematic Analysis for the Global Burden of Disease Study 2015,” The Lancet, Volume 388, 10053, 1775–1812, October 8, 2016.
[4] “Ten Great Public Health Achievements – United States, 1900–1999,” Morbidity and Mortality Weekly Report, Volume 48, Number 12, April 2, 1999, 241–243.
[5] “Ten Great Public Health Achievements – United States, 2001–2010,” Morbidity and Mortality Weekly Report, Volume 60, Number 19, May 20, 2011, 619–623.
[6] “Ten Great Public Health Achievements – Worldwide, 2001–2010,” Morbidity and Mortality Weekly Report, Volume 60, Number 24, June 24, 2011, 814–818.
[7] O’Neil C, Weapons of Math Destruction: How Big Data Increases Inequity and Threatens Democracy, New York: Crown Random House, 2016.
[8] Mervosh S, “Nearly 40,000 People Died From Guns in US Last Year, Highest in 50 Years,” The New York Times, December 18, 2018, https://www.nytimes.com/2018/12/18/us/gun-deaths.html.
[9] Katsiyannis A, Whitford DK, Ennis RP, “Historical Examination of United States Intentional Mass School Shootings in the 20th and 21st Centuries: Implications for Students, Schools, and Society,” Journal of Child and Family Studies, Volume 27, 2018, 2562.
[10] Nemeth JM, Fraga Rizo C, “Estimating the Prevalence of Human Trafficking: Progress Made and Future Directions,” American Journal of Public Health, Volume 109, Number 10, 2019, 1318–1319.
[11] “Female Genital Mutilation/Cutting: A Global Concern,” unicef, 2016, https://www.unicef.org/media/files/FGMC_2016_brochure_final_UNICEF_SPREAD.pdf.
[12] Henney JE, Gayle HD, “Time to Reevaluate US Mifepristone Restrictions,” The New England Journal of Medicine, Volume 381, 2019, 597–598.
[13] The Flat Earth Society, 2013–2021, https://www.tfes.org.
[14] Richard A, Oppel RA, Gebeloff R, Lai KKR, Wright W, Smith M, “The Fullest Look Yet at the Racial Inequity of Coronavirus,” The New York Times, July 5, 2020.
[15] National Center for Health Statistics, Health, United States, 2016: With Chartbook on Long-Term Trends in Health, 2017.
[16] Carter-Lome M, “Artfully Small Examples of the Real Thing,” The Journal of Antiques and Collectibles, Sturbridge, Massachusetts, November 13, 2019.
[17] Henrich J, Heine SJ, Norenzayan A, “Beyond weird: Towards a Broad-Based Behavioral Science,” Behavioral and Brain Sciences, Volume 33, 2010, 111–135.
[18] Bavdekar SB, “Pediatric Clinical Trials,” Perspectives in Clinical Research, Volume 4, Number 1, 2013, 89–99.
[19] Field MJ, Behrman RE, editors, Institute of Medicine (US) Committee on Clinical Research Involving Children, “Ethical Conduct of Clinical Research Involving Children,” The Necessity and Challenges of Clinical Research Involving Children, Washington, DC: National Academies Press, Volume 2, 2004, https://www.ncbi.nlm.nih.gov/books/NBK25553.
[20] Liu KA, Mager NA, “Women’s Involvement in Clinical Trials: Historical Perspective and Future Implications,” Pharmacy Practice (Granada), Volume 14, Number 1, 2016, 708, doi:10.18549/PharmPract.2016.01.708.
[21] Oh SS, Galanter J, Thakur N, Pino-Yanes M, Barcelo NE, White MJ, de Bruin DM, Greenblatt RM, Bibbins-Domingo K, Wu AHB, Borrell LN, Gunter C, Powe NR, Burchard EG, “Diversity in Clinical and Biomedical Research: A Promise Yet to Be Fulfilled,” PLoS Medicine, Volume 12, Number 12, December 15, 2015, e1001918.
[22] “Clinical Trials Have Far Too Little Racial and Ethnic Diversity: It’s Unethical and Risky to Ignore Racial and Ethnic Minorities,” Scientific American, September 1, 2018.
[23] Popejoy AB, Fullerton SM, “Genomics is Failing on Diversity,” Nature, Volume 538, Number 7624, October 13, 2016, 161–164.
[24] Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ, “Clinical Use of Current Polygenic Risk Scores May Exacerbate Health Disparities,” Nature Genetics, Volume 51, 2019, 584–591.
[25] Chen EH, Shofer FS, Dean AJ, Hollander JE, Baxt WG, Robey JL, Sease KL, Mills AM, “Gender Disparity in Analgesic Treatment of Emergency Department Patients with Acute Abdominal Pain,” Academic Emergency Medicine, Volume 15, Number 5, 2008, 414–418.
[26] “Basic Information about Lead in Drinking Water,” Environmental Protection Agency, Drinking Water Requirements for Lead, 2021, https://www.epa.gov/groundwater-and-drinking-water/basic-information-about-lead-drinking-water#regs.
[27] Lead and Copper Rule, Environmental Protection Agency, Drinking Water Requirements for Lead, 2021, https://www.epa.gov/dwreginfo/lead-and-copper-rule.
[28] Brush M, “Expert Says Michigan Officials Changed a Flint Lead Report to Avoid Federal Action,” National Public Radio, November 5, 2015, https://www.michiganradio.org/post/expert-says-michigan-officials-changed-flint-lead-report-avoid-federal-action.
[29] Michigan Civil Rights Commission, The Flint Water Crisis: Systemic Racism Through the Lens of Flint, 2017.
[30] Leyden L, “In Echo of Flint, Michigan, Water Crisis Now Hits Newark,” The New York Times, October 30, 2018.
[31] Rich JT, Neely JG, Paniello RC, Voelker CCJ, Nussenbaum B, Wang EW, “A Practical Guide to Understanding Kaplan-Meier Curves,” Otolaryngology – Head and Neck Surgery, Volume 143, Number 3, 2010, 331–336.
[32] American Community Survey, https://www.census.gov/programs-surveys/acs.
[33] Surveys and Data Collection Systems, National Center for Health Statistics, https://www.cdc.gov/nchs/surveys.html.
[34] Michaels D, “Extent of covid-19 Deaths Failed to Be Captured by Most Countries,” Wall Street Journal, May 28, 2020.
[35] Bernstein L, “US Reports 66,000 More Deaths Than Expected So Far This Year,” The Washington Post, April 29, 2020.
[36] Rossen LM, Branum AM, Ahmad FB, Sutton P, Anderson RN, “Excess Deaths Associated with covid-19, by Age and Race and Ethnicity – United States, January 26 – October 3, 2020,” Morbidity and Mortality Weekly Report, Volume 69, 2020, 1522–1527.
[37] Yule GU, The Function of Statistical Methods in Scientific Investigation, London: Medical Research Council Industrial Fatigue Research Board, 1924.
[38] United Nations Population Fund, “Family Planning: Saving Children, Improving Lives,” New York: Jones & Janello, 1990.
[39] Centers for Disease Control and Prevention, Hiv/aids Surveillance Report, Volume 5, Number 4, 1994.
[40] Oken MM, Creech RH, Tormey DC, Horton J, Davis TE, McFadden ET, Carbone PP, “Toxicity and Response Criteria of the Eastern Cooperative Oncology Group,” American Journal of Clinical Oncology, Volume 5, December 1982, 649–655.
[41] Heron M, National Center for Health Statistics, National Vital Statistics Report, Volume 67, Number 6, July 26, 2018.
[42] Wang TW, Kenemer B, Tynan MA, Singh T, King B, “Consumption of Combustible and Smokeless Tobacco – United States, 2000–2015,” Morbidity and Mortality Weekly Report, Volume 65, Number 48, December 2016.
[43] Fulwood R, Kalsbeek W, Rifkind B, Russell-Briefel R, Muesing R, LaRosa J, Lippel K, “Total Serum Cholesterol Levels of Adults 20–74 Years of Age: United States, 1976–1980,” Vital and Health Statistics, Series 11, Number 236, May 1986.
[44] “Longterm Health Conditions,” National Health Survey: First Results 2014–15, Australian Bureau of Statistics, August 2015, https://www.abs.gov.au/statistics/health/health-conditions-and-risks/national-health-survey-first-results.
[45] Spear ME, Charting Statistics, New York: McGraw-Hill, 1952.
[46] National Center for Health Statistics, Centers for Disease Control and Prevention Compressed Mortality File, 1999–2016, cdc wonder Online Database, June 2017, https://wonder.cdc.gov.
[47] Tukey JW, Exploratory Data Analysis, Reading, Massachusetts: Addison-Wesley, 1977.
[48] Bethel RA, Sheppard D, Geffroy B, Tam E, Nadel JA, Boushey HA, “Effect of 0.25 ppm Sulphur Dioxide on Airway Resistance in Freely Breathing, Heavily Exercising, Asthmatic Subjects,” American Review of Respiratory Disease, Volume 31, April 1985, 659–661.
[49] Adams DA, Thomas KR, Jajosky RA, Foster L, Baroi G, Sharp P, Onweh DH, Schley AW, Anderson WJ, for the Nationally Notifiable Infectious Conditions Group, “Summary of Notifiable Infectious Diseases and Conditions — United States, 2015,” Morbidity and Mortality Weekly Report, Volume 64, Number 53, August 2017.
[50] “National Health Expenditure Data: Historical,” Centers for Medicare & Medicaid Services, December 11, 2018, www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/NationalHealthExpendData.
[51] “National Health Expenditure Trends, 1975 to 2018,” Canadian Institute for Health Information, Ottawa, Ontario: cihi, 2018.
[52] Tufte ER, The Visual Display of Quantitative Information, Cheshire, Connecticut: Graphics Press, 1983.
[53] “Causes of Death 2013,” Statistics South Africa, Pretoria, South Africa: 2016, http://www.statssa.gov.za.
[54] Koenig JQ, Covert DS, Hanley QS, Van Belle G, Pierson WE, “Prior Exposure to Ozone Potentiates Subsequent Response to Sulfur Dioxide in Adolescent Asthmatic Subjects,” American Review of Respiratory Disease, Volume 141, February 1990, 377–380.
[55] Prial FJ, “Wine Talk,” The New York Times, December 25, 1991, 29.
[56] Framingham Heart Study Teaching Dataset, Biologic Specimen and Data Repository Information Coordinating Center, National Heart, Lung, and Blood Institute, https://biolincc.nhlbi.gov/teaching.
[57] McGuill D, “Does the United States Have a Lower Death Rate From Mass Shootings Than European Countries?,” March 9, 2018, https://www.snopes.com.
[58] “Ten Leading Causes of Injury Death by Age Group, Highlighting Unintentional Injury Deaths, United States – 2017,” National Center for Health Statistics, Centers for Disease Control and Prevention Injury Statistics Query and Reporting System, 2019, https://www.cdc.gov/injury/wisqars/.
[59] Natality Public Use Data 2007–2016, cdc wonder Online Database, February 2018, https://wonder.cdc.gov.
[60] “World Health Statistics 2018: Monitoring Health for the sdgs, Sustainable Development Goals,” Global Health Observatory Data, Geneva: World Health Organization, 2018.
[61] The Health Consequences of Smoking: 50 Years of Progress: A Report of the Surgeon General, Atlanta, Georgia: Centers for Disease Control and Prevention, 2014, https://www.cdc.gov/tobacco/data_statistics/sgr/50th-anniversary/index.htm.
[62] Molfino NA, Nannini LJ, Martelli AN, Slutsky AS, “Respiratory Arrest in Near-Fatal Asthma,” The New England Journal of Medicine, Volume 324, January 31, 1991, 285–288.
[63] Nelson C, “Office Visits to Cardiovascular Disease Specialists, 1985,” Vital and Health Statistics, Advance Data Report Number 171, June 23, 1989.
[64] “Summary of Notifiable Diseases, United States, 1989,” Morbidity and Mortality Weekly Report, Volume 38, October 5, 1990.
[65] Kuntz T, “Killings, Legal and Otherwise, Around the US,” The New York Times, December 4, 1994, 3.
[66] Wagenknecht LE, Burke GL, Perkins LL, Haley NJ, Friedman GD, “Misclassification of Smoking Status in the cardia Study: A Comparison of Self-Report with Serum Cotinine Levels,” American Journal of Public Health, Volume 82, January 1992, 33–36.
[67] Yassi A, Cheang M, Tenenbein M, Bawden G, Spiegel J, Redekop T, “An Analysis of Occupational Blood Lead Trends in Manitoba, 1979 Through 1987,” American Journal of Public Health, Volume 81, June 1991, 736–740.
[68] Wynder EL, Graham EA, “Tobacco Smoking as a Possible Etiologic Factor in Bronchiogenic Carcinoma: A Study of 684 Proved Cases,” Journal of the American Medical Association, Volume 143, 329–336.
[69] Ochmann S, Roser M, “Smallpox,” Our World in Data, 2019, https://ourworldindata.org.
[70] Pomeroy SL, Holmes SJ, Dodge PR, Feigin RD, “Seizures and Other Neurologic Sequelae of Bacterial Meningitis in Children,” The New England Journal of Medicine, Volume 323, December 13, 1990, 1651–1656.
[71] Jacobus CH, Holick MF, Shao Q, Chen TC, Holm IA, Kolodny JM, Fuleihan GEH, Seely EW, “Hypervitaminosis D Associated with Drinking Milk,” The New England Journal of Medicine, Volume 326, April 30, 1992, 1173–1177.
[72] Gwirtsman HE, Kaye WH, Obarzanek E, George DT, Jimerson DC, Ebert MH, “Decreased Caloric Intake in Normal-Weight Patients with Bulimia: Comparison with Female Volunteers,” American Journal of Clinical Nutrition, Volume 49, January 1989, 86–92.
[73] “Infant Mortality Rate,” The State of the World’s Children, United Nations Children’s Fund, 2013, https://undata.org.
[74] “Live Births by Month of Birth,” Demographic Statistics Database, United Nations Statistics Division, February 2019, https://undata.org.
[75] Fulwood R, Johnson CL, Bryner JD, Gunter EW, McGrath CR, “Hematological and Nutritional Biochemistry Reference Data for Persons 6 Months–74 Years of Age: United States, 1976–1980,” Vital and Health Statistics, Series 11, Number 232, December 1982.
[76] Bansal A, Garg C, Pakhare A, Gupta S, “Selfies: A Boon or Bane,” Journal of Family Medicine and Primary Care, Volume 8, 2018, 828–831.
[77] “Low Birthweight: Country, Regional and Global Estimates,” United Nations Children’s Fund and World Health Organization, New York: UNICEF, 2004.
[78] Health, United States, 2016: With Chartbook on Long-term Trends in Health, Hyattsville, Maryland: National Center for Health Statistics, May 2017, 1217–1232.
[79] Kaiserman MJ, Rickert WS, “Carcinogens in Tobacco Smoke: Benzo[a]pyrene from Canadian Cigarettes and Cigarette Tobacco,” American Journal of Public Health, Volume 82, July 1992, 1023–1026.
[80] Ferdos J, Rahman M, “Maternal Experience of Intimate Partner Violence and Low Birth Weight of Children: A Hospital-Based Study in Bangladesh,” PLoS ONE, 2017, 12:e0187138.
[81] Leviton A, Fenton T, Kuban KCK, Pagano M, “Labor and Delivery Characteristics and the Risk of Germinal Matrix Hemorrhage in Low Birth Weight Infants,” Journal of Child Neurology, Volume 6, October 1991, 35–40.
[82] Kochanek KD, Murphy SL, Xu J, Arias E, “Deaths: Final Data for 2017,” National Vital Statistics Reports, Volume 68, Number 6, June 24, 2019.
[83] Scholl L, Seth P, Kariisa M, Wilson N, Baldwin G, “Drug and Opioid-Involved Overdose Deaths – United States, 2013–2017,” Morbidity and Mortality Weekly Reports, ePub: December 21, 2018.
[84] “Child Mortality – Number of Deaths,” Global Health Observatory, World Health Organization, https://www.who.int/healthinfo/mortality_data/en/.
[85] “Perinatal Mortality Rates by Country, 2000,” World Health Organization (2006) Neonatal and Perinatal Mortality: Country, Regional and Global Estimates, Geneva: World Health Organization, 2006, https://apps.who.int/iris/handle/10665/43444.
[86] “Under-Five Mortality,” Global Health Observatory, World Health Organization, https://www.who.int/gho/child_health/mortality/mortality_under_five_text/en/.
[87] “Health Indicators, 2000,” Statistics Canada, Statistique Canada, Catalogue Number 82-221-XIE, https://www150.statcan.gc.ca/n1/pub/82-221-x/4149077-eng.htm.
[88] “Health,” The World Bank, 2020, https://data.worldbank.org/topic/8.
[89] “2018 Global Reference List of 100 Core Health Indicators (plus health-related SDGs),” World Health Organization, Geneva: World Health Organization, 2018, https://apps.who.int/iris/handle/10665/259951.
[90] Xu JQ, Murphy SL, Kochanek KD, Arias E, “Mortality in the United States, 2018,” National Center for Health Statistics Data Brief, Number 355, Hyattsville, MD: National Center for Health Statistics, 2020.
[91] Dimick JB, Staiger DO, Birkmeyer JD, “Ranking Hospitals on Surgical Mortality: The Importance of Reliability Adjustment,” Health Services Research, Volume 45, Number 6, 2020, 1614–1629.
[92] “A Look at the 1940 Census,” United States Census Bureau, https://www.census.gov/newsroom/cspan/1940census/CSPAN_1940slides.pdf.
[93] Vital Statistics of the United States, 1940, Washington: United States Government Printing Office, 1943, https://www.cdc.gov/nchs/data/vsus/vsus_1940_1.pdf.
[94] Murphy S, Xu J, Kochanek KD, “Deaths: Final Data for 2010,” National Vital Statistics Reports, Volume 61, Number 4, May 8, 2013, https://www.cdc.gov/nchs/data/nvsr/nvsr61/nvsr61_04.pdf.
[95] Shalala DE, “HHS Policy for Changing the Population Standard for Age Adjusting Death Rates, 8/26/98,” United States Department of Health and Human Services, 1998, https://aspe.hhs.gov/hhs-policy-changing-population-standard-age-adjusting-death-rates-82698.
[96] Krieger N, Williams DR, “Changing to the 2000 Standard Million: Are Declining Racial/Ethnic and Socioeconomic Inequalities in Health Real Progress or Statistical Illusion?,” American Journal of Public Health, August 2001, Volume 91, Number 8, 1209–1213.
[97] “Cancer Facts & Figures 2020,” American Cancer Society, Atlanta, Georgia: American Cancer Society, https://www.cancer.org/research/cancer-facts-statistics/all-cancer-facts-figures/cancer-facts-figures-2020.html.
[98] Tate J, Jenkins J, Rich S, “Fatal Force,” The Washington Post, http://www.washingtonpost.com/graphics/investigations/police-shootings-database.
[99] Roser M, Ritchie H, “Cancer,” Our World in Data, 2015, https://ourworldindata.org/cancer.
[100] Centers for Disease Control and Prevention, “National Diabetes Statistics Report, 2020,” Atlanta, Georgia: Centers for Disease Control and Prevention, United States Department of Health and Human Services, 2020.
[101] Ely DM, Driscoll AK, “Infant Mortality in the United States, 2018: Data from the Period Linked Birth/Infant Death File,” National Vital Statistics Reports, Volume 69, Number 7, 2020.
[102] Ely DM, Driscoll AK, “Infant Mortality in the United States, 2017: Data From the Period Linked Birth/Infant Death File,” National Vital Statistics Reports, Volume 68, Number 10, 2019.
[103] Massachusetts Department of Public Health, Registry of Vital Records and Statistics, “Massachusetts Births 2017,” November 2019, https://www.mass.gov/doc/2017-birth-report-updated.
[104] Massachusetts Department of Public Health, Registry of Vital Records and Statistics, “Massachusetts Deaths 2017,” October 2019, https://www.mass.gov/doc/2017-death-report-updated.
[105] Ely DM, Gregory EEW, Drake P, “Infant Mortality by Maternal Prepregnancy Body Mass Index: United States, 2017–2018,” National Vital Statistics Reports, Volume 69, Number 9, 2020.
[106] Foreman J, “Making Age Obsolete: Scientists See Falling Barriers to Human Longevity,” The Boston Globe, September 27, 1992, 1, 28–29.
[107] Dell AJ, Kahn AJ, “Geographical Maldistribution of Surgical Resources in South Africa: A Review of the Number of Hospitals, Hospital Beds and Surgical Beds,” South African Medical Journal, Volume 107, Number 12, 2017, 1099–1105.
[108] Meier P, “Polio Trial: An Early Efficient Clinical Trial,” Statistics in Medicine, Volume 9, Number 1/2, January–February 1990, 13–16.
[109] National Center for Health Statistics, Vital Statistics Rates in the United States, 1900–1940, Chapters I–IV, reprinted 1972.
[110] Gardner JF, Wiedemann T, The Roman Household: A Sourcebook, New York: Routledge, Taylor and Francis Group, 2013.
[111] Graunt J, Natural and Political Observations Mentioned in a Following Index, and Made Upon the Bills of Mortality, Fifth Edition, Much Enlarged, 1676.
[112] Halley E, “An Estimate of the Degrees of the Mortality of Mankind, Drawn from Curious Tables of the Births and Funerals at the City of Breslaw; with an Attempt to Ascertain the Price of Annuities upon Lives,” Philosophical Transactions, Volume 17, 1693, 596–610.
[113] Arias E, Xu J, Kochanek KD, “United States Life Tables, 2016,” National Vital Statistics Reports, Volume 68, Number 4, May 7, 2019.
[114] Arias E, “United States Life Tables, 2006,” National Vital Statistics Reports, Volume 58, Number 21, June 28, 2010.
[115] Vandenbroucke JP, “Survival and Expectation of Life from the 1400s to the Present: A Study of the Knighthood Order of the Golden Fleece,” American Journal of Epidemiology, Volume 122, December 1985, 1007–1015.
[116] Organization for Economic Cooperation and Development, “Life Expectancy at Birth (Indicator),” oecd Data, 2020, doi: 10.1787/27e0fc9d-en.
[117] Holden C, “Why Do Women Live Longer Than Men?,” Science, Volume 238, 1987, 158–160.
[118] Eskes T, Haanen C, “Why Do Women Live Longer Than Men?,” European Journal of Obstetrics & Gynecology and Reproductive Biology, Volume 133, Number 2, August 2007, 126–133.
[119] Xirocostas ZA, Everingham SE, Moles AT, “The Sex with the Reduced Sex Chromosome Dies Earlier: A Comparison Across the Tree of Life,” Biology Letters, Volume 16, Number 3, 2020, 20190867.
[120] Woolhandler S, Himmelstein DU, Ahmed S, Bailey Z, Bassett MT, “Public Policy and Health in the Trump Era,” The Lancet, February 11, 2021, https://doi.org/10.1016/S0140-6736(20)32545-9.
[121] Wilkins R, Adams OB, “Health Expectancy in Canada, Late 1970s: Demographic, Regional, and Social Dimensions,” American Journal of Public Health, Volume 73, September 1983, 1073–1080.
[122] Arias E, Xu J, “United States Life Tables, 2017,” National Vital Statistics Reports, Volume 68, Number 7, June 24, 2019.
[123] National Academies of Sciences, Engineering, and Medicine, The Growing Gap in Life Expectancy by Income: Implications for Federal Programs and Policy Responses, Washington, DC: The National Academies Press, 2015, doi.org/10.17226/19015.
[124] Medawar P, The Strange Case of the Spotted Mice, and Other Classic Essays on Science, Oxford: Oxford University Press, 1996.
[125] Martin JA, Hamilton BE, Osterman MJK, Driscoll AK, “Births: Final Data for 2018,” National Center for Health Statistics, National Vital Statistics Reports, Volume 68, Number 13, November 27, 2019.
[126] Howden LM, Meyer JA, “Age and Sex Composition: 2010,” United States Census Bureau, May 2011.
[127] Plassman BL, Langa KM, Fisher GG, Heeringa SG, Weir DR, Ofstedal MB, Burke JR, Hurd MD, Potter GG, Rodgers WL, Steffens DC, Willis RJ, Wallace RB, “Prevalence of Dementia in the United States: The Aging, Demographics, and Memory Study,” Neuroepidemiology, Volume 29, 2007, 125–132.
[128] Carter CL, Jones DY, Schatzkin A, Brinton LA, “A Prospective Study of Reproductive, Familial, and Socioeconomic Risk Factors for Breast Cancer Using NHANES I Data,” Public Health Reports, Volume 104, January–February 1989, 45–49.
[129] Garfinkel L, Silverberg E, “Lung Cancer and Smoking Trends in the United States Over the Past 25 Years,” Ca—A Cancer Journal for Clinicians, Volume 41, May/June 1991, 137–145.
[130] Hennekens CH, Speizer FE, Lipnick RJ, Rosner B, Bain C, Belanger C, Stampfer MJ, Willett W, Peto R, “A Case-Control Study of Oral Contraceptive Use and Breast Cancer,” Journal of the National Cancer Institute, Volume 72, January 1984, 39–42.
[131] Colditz GA, Hankinson SE, Hunter DJ, Willett WC, Manson JE, Stampfer MJ, Hennekens C, Rosner B, Speizer FE, “The Use of Estrogens and Progestins and the Risk of Breast Cancer in Postmenopausal Women,” The New England Journal of Medicine, Volume 332, June 15, 1995, 1589–1593.
[132] American Cancer Society, Breast Cancer Facts & Figures 2017–2018, Atlanta, Georgia: American Cancer Society, 2017.
[133] American Cancer Society, Colorectal Cancer Facts & Figures 2017–2019, Atlanta, Georgia: American Cancer Society, 2017.
[134] Morales M, Joseph E, “Blacks and Latinos are Overwhelmingly Ticketed by nypd for Social Distancing Violations,” CNN, May 9, 2020, https://www.cnn.com/2020/05/08/us/social-distancing-stats-nyc/index.html.
[135] Greer S, Naidoo M, Hinterland K, Archer A, Lundy de la Cruz N, Crossa A, Gould LH, “Health of Latinos in New York City,” NYC Health, 2017, 1–32.
[136] Freeman WJ, Weiss AJ, Heslin KC, “Overview of U.S. Hospital Stays in 2016: Variation by Geographic Region,” Healthcare Cost and Utilization Project (hcup) Statistical Briefs, December 18, 2018.
[137] Berchick ER, Barnett JC, Upton RD, “Health Insurance Coverage in the United States, 2018,” United States Census Bureau, November 8, 2019.
[138] Sundaram A, Vaughan B, Kost K, Bankole A, Finer L, Singh S, Trussell J, “Contraceptive Failure in the United States: Estimates from the 2006–2010 National Survey of Family Growth,” Perspectives on Sexual and Reproductive Health, Volume 49, Number 1, February 28, 2017, 7–16.
[139] Margolis PA, Greenberg RA, Keyes LL, LaVange LM, Chapman RS, Denny FW, Bauman KE, Boat BW, “Lower Respiratory Illness in Infants and Low Socioeconomic Status,” American Journal of Public Health, Volume 82, August 1992, 1119–1126.
[140] “US and World Population Clock,” United States Census Bureau, May 25, 2020, https://www.census.gov/popclock.
[141] “Coronavirus Resource Center: covid-19 Case Tracker,” Johns Hopkins University & Medicine, May 2020, https://coronavirus.jhu.edu/map.html.
[142] Maxim LD, Niebol R, Utell MJ, “Screening Tests: A Review with Examples,” Inhalation Toxicology, Volume 26, Number 13, 2014, 811–820.
[143] Mayrand MH, Duarte-Franco E, Rodrigues I, Walter SD, Hanley J, Ferenczy A, Ratnam S, Coutlee F, Franco EL, “Human Papillomavirus DNA versus Papanicolaou Screening Tests for Cervical Cancer,” The New England Journal of Medicine, Volume 357, Number 16, October 18, 2007, 1579–1588.
[144] Kumar N, Bhargava SK, Agrawal CS, George K, Karki P, Baral D, “Chest Radiographs and Their Reliability in the Diagnosis of Tuberculosis,” Journal of Nepal Medical Association, Volume 44, Number 160, 2005, 138–142.
[145] Skerrett PJ, “Doctor Groups List Top Overused, Misused Tests, Treatments, and Procedures,” Harvard Health Publishing, Harvard Medical School, April 5, 2012, www.health.harvard.edu/blog/doctor-groups-list-top-overused-misused-tests-treatments-and-procedures-201204054570.
[146] DeLong ER, Vernon WB, Bollinger RR, “Sensitivity and Specificity of a Monitoring Test,” Biometrics, Volume 41, December 1985, 947–958.
[147] World Health Organization, “Preventive Chemotherapy in Human Helminthiasis: Coordinated Use of Anthelminthic Drugs in Control Interventions: A Manual for Health Professionals and Programme Managers,” ISBN 978 92 4 154710 9, 2006.
[148] Novick LF, Glebatis DM, Stricof RL, MacCubbin PA, Lessner L, Berns DS, “Newborn Seroprevalence Study: Methods and Results,” American Journal of Public Health, Volume 81, May 1991, 15–21.
[149] Gurevich R, “When to Take a Pregnancy Test: Understanding How Pregnancy Tests Really Work,” https://www.verywellfamily.com/when-is-the-best-time-to-take-an-early-pregnancy-test-1960163.
[150] Ransohoff DF, Feinstein AR, “Problems of Spectrum and Bias in Evaluating the Efficacy of Diagnostic Tests,” The New England Journal of Medicine, Volume 299, Number 17, October 26, 1978, 926–930.
[151] Hajian-Tilaki KO, Gholizadehpasha AR, Bozorgzadeh S, Hajian-Tilaki E, “Body Mass Index and Waist Circumference are Predictor Biomarkers of Breast Cancer Risk in Iranian Women,” Medical Oncology, Volume 28, 2011, 1296–1301.
[152] Kwon C, Farrell PM, “The Magnitude and Challenge of False-Positive Newborn Screening Test Results,” Archives of Pediatrics and Adolescent Medicine, Volume 154, Number 7, 2000, 714–718.
[153] “Screening Mammography Sensitivity, Specificity, and False Negative Rate,” Breast Cancer Surveillance Consortium, March 23, 2017, https://www.bcsc-research.org/statistics/screening-performance-benchmarks/screening-sens-spec-false-negative.
[154] Katz JN, Larson MG, Fossel AH, Liang MH, “Validation of a Surveillance Case Definition of Carpal Tunnel Syndrome,” American Journal of Public Health, Volume 81, February 1991, 189–193.
[155] Begg CB, McNeil BJ, “Assessment of Radiologic Tests: Control of Bias and Other Design Considerations,” Radiology, Volume 167, May 1988, 565–569.
[156] Schroder FH, Kranse R, “Verification Bias and the Prostate-Specific Antigen Test – Is There a Case for a Lower Threshold for Biopsy,” The New England Journal of Medicine, Volume 349, Number 4, July 24, 2003, 393–395.
[157] Bortheiry AL, Malerbi DA, Franco LJ, “The roc Curve in the Evaluation of Fasting Capillary Blood Glucose as a Screening Test for Diabetes and Igt,” Diabetes Care, Volume 17, November 1994, 1269–1272.
[158] Current Population Survey, “Households by Size, 1960 to Present,” United States Census Bureau, November 2018, https://census.gov.
[159] Centers for Disease Control and Prevention, “Current Cigarette Smoking Among Adults in the United States,” Fast Facts and Fact Sheets: Smoking & Tobacco Use, February 4, 2019, https://www.cdc.gov.
[160] Wilson R, Crouch EAC, “Risk Assessment and Comparisons: An Introduction,” Science, Volume 236, April 17, 1987, 267–270.
[161] National Center for Health Statistics, Drizd T, Dannenberg AL, Engel A, “Blood Pressure Levels in Persons 18–74 Years of Age in 1976–1980, and Trends in Blood Pressure From 1960 to 1980 in the United States,” Vital and Health Statistics, Series 11, Number 234, July 1986.
[162] Castelli WP, Anderson K, “Antihypertensive Treatment and Plasma Lipoprotein Levels: The Associations in Data from a Population Study,” American Journal of Medicine Supplement, Volume 80, February 14, 1986, 23–32.
[163] Tye L, “Many States Tackling Issue of aids-Infected Health Care Workers,” The Boston Globe, May 27, 1991, 29–30.
[164] Centers for Disease Control, “Summary of Notifiable Diseases, United States, 1989,” Morbidity and Mortality Weekly Report, Volume 39, October 5, 1990.
[165] Najjar MF, Rowland M, “Anthropometric Reference Data and Prevalence of Overweight: United States, 1976–1980,” Vital and Health Statistics, Series 11, Number 238, October 1987.
[166] Woodwell D, “Office Visits to Pediatric Specialists, 1989,” Vital and Health Statistics, Advance Data Report Number 208, January 17, 1992.
[167] National Center for Health Statistics, Martin JA, Hamilton BE, Osterman MJK, Driscoll AK, Drake P, “Births: Final Data for 2016,” National Vital Statistics Reports, Volume 67, Number 1, 2018.
[168] American Diabetes Association, “Overall Numbers, Diabetes and Prediabetes,” Statistics About Diabetes, March 22, 2018, https://www.diabetes.org.
[169] Youth Risk Behavior Surveillance System, “Adolescent and School Health, 2017,” National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention, August 22, 2018.
[170] Gibbons RD, Clark DC, Fawcett J, “A Statistical Method for Evaluating Suicide Clusters and Implementing Cluster Surveillance,” American Journal of Epidemiology Supplement, Volume 132, July 1990, 183–191.
[171] Arias E, Xu J, “United States Life Tables, 2015,” National Vital Statistics Reports, Volume 67, Number 7, November 13, 2018.
[172] MacMahon SW, MacDonald GJ, “A Population at Risk: Prevalence of High Cholesterol Levels in Hypertensive Patients in the Framingham Study,” American Journal of Medicine Supplement, Volume 80, February 14, 1986, 40–47.
[173] Lindgren BW, Statistical Theory, New York: Macmillan, 1976.
[174] Snedecor GW, Cochran WG, Statistical Methods, Ames, Iowa: Iowa State University Press, 1980.
[175] Scully RE, McNeely BU, Mark EJ, “Case Record of the Massachusetts General Hospital: Weekly Clinicopathological Exercises,” The New England Journal of Medicine, Volume 314, January 2, 1986, 39–49.
[176] Ostro BD, Lipsett MJ, Wiener MB, Selner JC, “Asthmatic Responses to Airborne Acid Aerosols,” American Journal of Public Health, Volume 81, June 1991, 694–702.
[177] Wilcox AJ, Skjærven R, “Birth Weight and Perinatal Mortality: The Effects of Gestational Age,” American Journal of Public Health, Volume 82, March 1992, 378–382.
[178] Loenen HMJA, Eshuis H, Lowik MRH, Schouten EG, Hulshof KFAM, Odink J, Kok FJ, “Serum Uric Acid Correlates in Elderly Men and Women with Special Reference to Body Composition and Dietary Intake (Dutch Nutrition Surveillance System),” Journal of Clinical Epidemiology, Volume 43, Number 12, 1990, 1297–1303.
[179] Kaplan NM, “Strategies to Reduce Risk Factors in Hypertensive Patients Who Smoke,” American Heart Journal, Volume 115, January 1988, 288–294.
[180] Clark M, Royal J, Seeler R, “Interaction of Iron Deficiency and Lead and the Hematologic Findings in Children with Severe Lead Poisoning,” Pediatrics, Volume 81, February 1988, 247–253.
[181] Tsou VM, Young RM, Hart MH, Vanderhoof JA, “Elevated Plasma Aluminum Levels in Normal Infants Receiving Antacids Containing Aluminum,” Pediatrics, Volume 87, February 1991, 148–151.
[182] Streissguth AP, Aase JM, Clarren K, Randels SP, LaDue RA, Smith DF, “Fetal Alcohol Syndrome in Adolescents and Adults,” Journal of the American Medical Association, Volume 265, April 17, 1991, 1961–1967.
[183] Tirosh E, Elhasid R, Kamah SCB, Cohen A, “Predictive Value of Placebo Methylphenidate,” Pediatric Neurology, Volume 9, Number 2, 1993, 131–133.
[184] Klein BEK, Klein R, Moss SE, “Blood Pressure in a Population of Diabetic Persons Diagnosed After 30 Years of Age,” American Journal of Public Health, Volume 74, April 1984, 336–339.
[185] Ahmed T, Garrigo J, Danta I, “Preventing Bronchoconstriction in Exercise-Induced Asthma with Inhaled Heparin,” The New England Journal of Medicine, Volume 329, July 8, 1993, 90–95.
[186] Longaker MT, Golbus MS, Filly RA, Rosen MA, Chang SW, Harrison MR, “Maternal Outcome After Open Fetal Surgery,” Journal of the American Medical Association, Volume 265, February 13, 1991, 737–741.
[187] Saudek CD, Selam JL, Pitt HA, Waxman K, Rubio M, Jeandidier N, Turner D, Fischell RE, Charles MA, “A Preliminary Trial of the Programmable Implantable Medication System for Insulin Delivery,” The New England Journal of Medicine, Volume 321, August 31, 1989, 574–579.
[188] Campbell S, Nash D, “The cdc’s Gun Injury Data Is Becoming Even More Unreliable,” The Trace, March 11, 2019.
[189] Bellinger DC, Jonas RA, Rappaport LA, Wypij D, Wernovsky G, Kuban KCK, Barnes PD, Holmes GL, Hickey PR, Strand RD, Walsh AZ, Helmers SL, Constantinou JE, Carrazana EJ, Mayer JE, Hanley FL, Castaneda AR, Ware JH, Newburger JW, “Developmental and Neurologic Status of Children After Heart Surgery with Hypothermic Circulatory Arrest or Low-Flow Cardiopulmonary Bypass,” The New England Journal of Medicine, Volume 332, March 2, 1995, 549–555.
[190] Gauvreau K, Pagano M, “Why 5%?,” Nutrition, Volume 10, 1994, 93–94.
[191] Wasserstein R, on behalf of the American Statistical Association, “ASA Statement on Statistical Significance and P-Values,” The American Statistician, Volume 70, 2016, 129–133.
[192] “Firm Admits Using Rival’s Drug in Tests,” The Boston Globe, July 1, 1989, 41.
[193] Davidson JW, Lytle MH, After the Fact: The Art of Historical Detection, Third Edition, Volume 1, New York: McGraw-Hill, Inc., 1992, 26.
[194] “Child Abuse – and Trial Abuse,” The New York Times, January 20, 1990, 24.
[195] Appel BR, Guirguis G, Kim I, Garbin O, Fracchia M, Flessel CP, Kizer KW, Book SA, Warriner TE, “Benzene, Benzo(a)Pyrene, and Lead in Smoke from Tobacco Products Other Than Cigarettes,” American Journal of Public Health, Volume 80, May 1990, 560–564.
[196] Rohrbach BW, Harkess JR, Ewing SA, Kudlac J, McKee GL, Istre GR, “Epidemiologic and Clinical Characteristics of Persons with Serologic Evidence of E. canis Infection,” American Journal of Public Health, Volume 80, April 1990, 442–445.
[197] Feskens EJM, Kromhout D, “Cardiovascular Risk Factors and the 25 Year Incidence of Diabetes Mellitus in Middle-Aged Men,” American Journal of Epidemiology, Volume 130, December 1989, 1101–1108.
[198] Meade TW, Cooper JA, Peart WS, “Plasma Renin Activity and Ischemic Heart Disease,” The New England Journal of Medicine, Volume 329, August 26, 1993, 616–619.
[199] Burkholz H, The fda Follies, New York: Basic Books, 1994, 107–113.
[200] Packard FR, The Life and Times of Ambroise Paré, 1510–1590, New York: Paul B. Hoeber, 1921.
[201] Rhodes J, Curran TJ, Camil L, Rabideau N, Fulton DR, Gauthier NS, Gauvreau K, Jenkins KJ, “Impact of Cardiac Rehabilitation on the Exercise Function of Children with Serious Congenital Heart Disease,” Pediatrics, Volume 116, 2005, 1339–1345.
[202] Krasnoff JB, Vintro AQ, Ascher NL, Bass NM, Paul SM, Dodd MJ, Painter PL, “A Randomized Trial of Exercise and Dietary Counseling After Liver Transplantation,” American Journal of Transplantation, Volume 6, 2006, 1896–1905.
[203] Markowski CA, Markowski EP, “Conditions for the Effectiveness of a Preliminary Test of Variance,” The American Statistician, Volume 44, November 1990, 322–326.
[204] Moser BK, Stevens GR, “Homogeneity of Variance in the Two-Sample Means Test,” The American Statistician, Volume 46, February 1992, 19–21.
[205] Satterthwaite FW, “An Approximate Distribution of Estimates of Variance Components,” Biometrics Bulletin, Volume 2, December 1946, 110–114.
[206] shep Cooperative Research Group, “Prevention of Stroke by Antihypertensive Drug Treatment in Older Persons with Isolated Systolic Hypertension: Final Results of the Systolic Hypertension in the Elderly Program (shep),” Journal of the American Medical Association, Volume 265, June 26, 1991, 3255–3264.
[207] Allred EN, Bleecker ER, Chaitman BR, Dahms TE, Gottlieb SO, Hackney JD, Hayes D, Pagano M, Selvester RH, Walden SM, Warren J, “Acute Effects of Carbon Monoxide Exposure on Individuals with Coronary Artery Disease,” Health Effects Institute Research Report Number 25, November 1989.
[208] Kien CL, Liechty EA, Mullett MD, “Effects of Lactose Intake on Nutritional Status in Premature Infants,” Journal of Pediatrics, Volume 116, March 1990, 446–449.
[209] Wolff MS, Toniolo PG, Lee EW, Rivera M, Dubin N, “Blood Levels of Organochlorine Residues and Risk of Breast Cancer,” Journal of the National Cancer Institute, Volume 85, April 21, 1993, 648–652.
[210] DiGiusto E, Eckhard I, “Some Properties of Saliva Cotinine Measurements in Indicating Exposure to Tobacco Smoking,” American Journal of Public Health, Volume 76, October 1986, 1245–1246.
[211] Venkataraman PS, Duke JC, “Bone Mineral Content of Healthy, Full-term Neonates: Effect of Race, Gender, and Maternal Cigarette Smoking,” American Journal of Diseases of Children, Volume 145, November 1991, 1310–1312.
[212] Schiff E, Barkai G, Ben-Baruch G, Mashiach S, “Low-Dose Aspirin Does Not Influence the Clinical Course of Women with Mild Pregnancy-Induced Hypertension,” Obstetrics and Gynecology, Volume 76, November 1990, 742–744.
[213] Tramo MJ, Loftus WC, Green RL, Stukel TA, Weaver JB, Gazzaniga MS, “Brain Size, Head Size, and Intelligence Quotient in Monozygotic Twins,” Neurology, Volume 50, Number 5, May 1998, 1246–1252.
[214] Anderson JW, Spencer DB, Hamilton CC, Smith SF, Tietyen J, Bryant CA, Oeltgen P, “Oat-Bran Cereal Lowers Serum Total and LDL Cholesterol in Hypercholesterolemic Men,” American Journal of Clinical Nutrition, Volume 52, September 1990, 495–499.
[215] Stoline MR, “The Status of Multiple Comparisons: Simultaneous Estimation of All Pairwise Comparisons in One-Way anova Designs,” The American Statistician, Volume 35, Number 3, 1981, 134–141.
[216] Wood PD, Stefanick ML, Dreon DM, Frey-Hewitt B, Garay SC, Williams PT, Superko HR, Fortmann SP, Albers JJ, Vranizan KM, Ellsworth NM, Terry RB, Haskell WL, “Changes in Plasma Lipids and Lipoproteins in Overweight Men During Weight Loss Through Dieting as Compared with Exercise,” The New England Journal of Medicine, Volume 319, November 3, 1988, 1173–1179.
[217] Chase HP, Garg SK, Marshall G, Berg CL, Harris S, Jackson WE, Hamman RE, “Cigarette Smoking Increases the Risk of Albuminuria Among Subjects with Type I Diabetes,” Journal of the American Medical Association, Volume 265, February 6, 1991, 614–617.
[218] Fowkes FGR, Housley E, Riemersma RA, Macintyre CCA, Cawood EHH, Prescott RJ, Ruckley CV, “Smoking, Lipids, Glucose Intolerance, and Blood Pressure as Risk Factors for Peripheral Atherosclerosis Compared with Ischemic Heart Disease in the Edinburgh Artery Study,” American Journal of Epidemiology, Volume 135, February 15, 1992, 331–340.
[219] Wheeler JRC, Fadel H, D’Aunno TA, “Ownership and Performance of Outpatient Substance Abuse Treatment Centers,” American Journal of Public Health, Volume 82, May 1992, 711–718.
[220] Spicher V, Roulet M, Schutz Y, “Assessment of Total Energy Expenditure in Free-Living Patients with Cystic Fibrosis,” Journal of Pediatrics, Volume 118, June 1991, 865–872.
[221] Knowles MR, Church NL, Waltner WE, Yankaskas JR, Gilligan P, King M, Edwards LJ, Helms RW, Boucher RC, “A Pilot Study of Aerosolized Amiloride for the Treatment of Lung Disease in Cystic Fibrosis,” The New England Journal of Medicine, Volume 322, April 26, 1990, 1189–1194.
[222] Hollander M, Wolfe DA, Nonparametric Statistical Methods, New York: Wiley, 1973.
[223] Wrona RM, “A Clinical Epidemiologic Study of Hyperphenylalaninemia,” American Journal of Public Health, Volume 69, July 1979, 673–679.
[224] Burch KD, Covitz W, Lovett EJ, Howell C, Kanto WP, “The Significance of Ductal Shunting During Extracorporeal Membrane Oxygenation,” Journal of Pediatric Surgery, Volume 24, September 1989, 855–859.
[225] Morrison NJ, Abboud RT, Ramadan F, Miller RR, Gibson NN, Evans KG, Nelems B, Müller NL, “Comparison of Single Breath Carbon Monoxide Diffusing Capacity and Pressure-Volume Curves in Detecting Emphysema,” American Review of Respiratory Disease, Volume 139, May 1989, 1179–1187.
[226] Abdelrahman I, Elmasry M, Olofsson P, Steinvall I, Fredrikson M, Sjoberg F, “Division of Overall Duration of Stay into Operative Stay and Postoperative Stay Improves the Overall Estimate as a Measure of Quality of Outcome in Burn Care,” plos one, Volume 12, March 2017, e0174579.
[227] Lee LA, Kimball TR, Daniels SR, Khoury P, Meyer RA, “Left Ventricular Mechanics in the Preterm Infant and their Effect on the Measurement of Cardiac Performance,” The Journal of Pediatrics, Volume 120, January 1992, 114–119.
[228] Walker AM, Jick H, Perera DR, Thompson RS, Knauss TA, “Diphtheria–Tetanus–Pertussis Immunization and Sudden Infant Death Syndrome,” American Journal of Public Health, Volume 77, August 1987, 945–951.
[229] Jeffery RW, Forster JL, French SA, Kelder SH, Lando HA, McGovern PG, Jacobs DR, Baxter JE, “The Healthy Worker Project: A Work-Site Intervention for Weight Control and Smoking Cessation,” American Journal of Public Health, Volume 83, March 1993, 395–401.
[230] Ayanian JZ, Kohler BA, Abe T, Epstein AM, “The Relation Between Health Insurance Coverage and Clinical Outcomes Among Women with Breast Cancer,” The New England Journal of Medicine, Volume 329, July 29, 1993, 326–331.
[231] Nuland SB, Doctors: The Biography of Medicine, New York: Vintage Books, 1989.
[232] “Lung Cancer Survival Rates,” American Cancer Society, January 9, 2020, https://cancer.org/cancer/lung-cancer/detection-diagnosis-staging.
[233] Liu M, Cai X, Yu W, Lv C, Fu X, “Clinical Significance of Age at Diagnosis Among Young Non-Small Cell Lung Cancer Patients Under 40 Years Old: A Population-Based Study,” Oncotarget, Volume 6, 2015, 44963–44970.
[234] Brown LD, Cai TT, DasGupta A, “Interval Estimation for a Binomial Proportion,” Statistical Science, Volume 16, 2001, 101–133.
[235] Wilson EB, “Probable Inference, the Law of Succession, and Statistical Inference,” Journal of the American Statistical Association, Volume 22, 1927, 209–212.
[236] Rosner B, Fundamentals of Biostatistics, Seventh Edition, Boston, Massachusetts: Brooks/Cole, Cengage Learning, 2011.
[237] Osberg JS, DiScala C, “Morbidity Among Pediatric Motor Vehicle Crash Victims: The Effectiveness of Seat Belts,” American Journal of Public Health, Volume 82, March 1992, 422–425.
[238] Hack M, Breslau N, Weissman B, Aram D, Klein N, Borawski E, “Effect of Very Low Birth Weight and Subnormal Head Size on Cognitive Abilities at School Age,” The New England Journal of Medicine, Volume 325, July 25, 1991, 231–237.
[239] Sheehan WJ, Mauger DT, Paul IM, Moy JN, Boehmer SJ, Szefler SJ, Fitzpatrick AM, Jackson DJ, Bacharier LB, Cabana MD, “Acetaminophen versus Ibuprofen in Young Children with Mild Persistent Asthma,” The New England Journal of Medicine, Volume 375, August 18, 2016, 619–630.
[240] Sanchis-Gomar F, Perez-Quilis C, Leischik R, Lucia A, “Epidemiology of Coronary Heart Disease and Acute Coronary Syndrome,” Annals of Translational Medicine, Volume 4, July 2016, 256.
[241] Tsai J, Hoff RA, Harpaz-Rotem I, “One-Year Incidence and Predictors of Homelessness Among 300,000 US Veterans Seen in Specialty Mental Health Care,” Psychological Services, Volume 14, 2017, 203–207.
[242] Magruder KM, Frueh BC, Knapp RG, Davis L, Hamner MB, Martin RH, Gold PB, Arana GW, “Prevalence of Posttraumatic Stress Disorder in Veterans Affairs Primary Care Clinics,” General Hospital Psychiatry, Volume 27, 2005, 169–179.
[243] Peyron R, Aubeny E, Targosz V, Silvestre L, Renault M, Elkik F, Leclerc P, Ulmann A, Baulieu E, “Early Termination of Pregnancy with Mifepristone (RU 486) and the Orally Active Prostaglandin Misoprostol,” The New England Journal of Medicine, Volume 328, May 27, 1993, 1509–1513.
[244] Folsom AR, Grim RH, “Stop Smoking Advice by Physicians: A Feasible Approach?,” American Journal of Public Health, Volume 77, July 1987, 849–850.
[245] Christianson JB, Lurie N, Finch M, Moscovice IS, Hartley D, “Use of Community-Based Mental Health Programs by hmos: Evidence from a Medicaid Demonstration,” American Journal of Public Health, Volume 82, June 1992, 790–796.
[246] Graham NMH, Nelson KE, Solomon L, Bonds M, Rizzo RT, Scavotto J, Astemborski J, Vlahov D, “Prevalence of Tuberculin Positivity and Skin Test Anergy in HIV-1-Seropositive and -Seronegative Intravenous Drug Users,” Journal of the American Medical Association, Volume 267, January 15, 1992, 369–373.
[247] Wijnen M, Olsson DS, van den Heuvel-Eibrink MM, Hammarstrand C, Janssen JAMJL, van der Lely AJ, Johannsson G, Neggers SJCMM, “The Metabolic Syndrome and its Components in 178 Patients Treated for Craniopharyngioma after 16 Years of Follow-up,” European Journal of Endocrinology, Volume 178, 2018, 11–22.
[248] Juul SE, Comstock BA, Wadhawan R, Mayock DE, Courtney SE, Robinson T, Ahmad KA, Bendel-Stenzel E, Baserga M, LaGamma EF, “A Randomized Trial of Erythropoietin for Neuroprotection in Preterm Infants,” The New England Journal of Medicine, Volume 382, January 16, 2020, 233–243.
[249] Thompson RS, Rivara FP, Thompson DC, “A Case-Control Study of the Effectiveness of Bicycle Safety Helmets,” The New England Journal of Medicine, Volume 320, May 25, 1989, 1361–1367.
[250] Cochran WG, “Some Methods for Strengthening the Common χ2 Test,” Biometrics, Volume 10, December 1954, 417–451.
[251] Grizzle JE, “Continuity Correction in the χ2 Test for 2 × 2 Tables,” The American Statistician, Volume 21, October 1967, 28–32.
[252] Schottenfeld D, Eaton M, Sommers SC, Alonso DR, Wilkinson C, “The Autopsy as a Measure of Accuracy of the Death Certificate,” Bulletin of the New York Academy of Medicine, Volume 58, December 1982, 778–794.
[253] Coulehan JL, Lerner G, Helzlsouer K, Welty TK, McLaughlin J, “Acute Myocardial Infarction Among Navajo Indians, 1976–1983,” American Journal of Public Health, Volume 76, April 1986, 412–414.
[254] McCusker J, Harris DR, Hosmer DW, “Association of Electronic Fetal Monitoring During Labor with Cæsarean Section Rate and with Neonatal Morbidity and Mortality,” American Journal of Public Health, Volume 78, September 1988, 1170–1174.
[255] Roberts RS, Spitzer WO, Delmore T, Sackett DL, “An Empirical Demonstration of Berkson’s Bias,” Journal of Chronic Diseases, Volume 31, February 1978, 119–128.
[256] Gross TP, Conde JG, Gary GW, Harting D, Goeller D, Israel E, “An Outbreak of Acute Infectious Nonbacterial Gastroenteritis in a High School in Maryland,” Public Health Reports, Volume 104, March–April 1989, 164–169.
[257] Kirscht JP, Brock BM, Hawthorne VM, “Cigarette Smoking and Changes in Smoking Among a Cohort of Michigan Adults, 1980–82,” American Journal of Public Health, Volume 77, April 1987, 501–502.
[258] Engs RC, Hanson DJ, “University Students’ Drinking Patterns and Problems: Examining the Effects of Raising the Purchase Age,” Public Health Reports, Volume 103, November–December 1988, 667–673.
[259] Tilyard MW, Spears GFS, Thomson J, Dovey S, “Treatment of Postmenopausal Osteoporosis with Calcitriol or Calcium,” The New England Journal of Medicine, Volume 326, February 6, 1992, 357–362.
[260] Liberati A, Apolone G, Nicolucci A, Confalonieri C, Fossati R, Grilli R, Torri V, Mosconi P, Alexanian A, “The Role of Attitudes, Beliefs, and Personal Characteristics of Italian Physicians in the Surgical Treatment of Early Breast Cancer,” American Journal of Public Health, Volume 81, January 1991, 38–42.
[261] Kircher T, Nelson J, Burdo H, “The Autopsy as a Measure of Accuracy of the Death Certificate,” The New England Journal of Medicine, Volume 313, November 14, 1985, 1263–1269.
[262] Siscovick DS, Strogatz DS, Weiss NS, Rennert G, “Retirement and Primary Cardiac Arrest in Males,” American Journal of Public Health, Volume 80, February 1990, 207–208.
[263] Lilienfeld AM, Graham S, “Validity of Determining Circumcision Status by Questionnaire as Related to Biological Studies of Cancer of the Cervix,” Journal of the National Cancer Institute, Volume 21, October 1958, 713–720.
[264] Haahtela T, Marttila O, Vilkka V, Jappinen P, Jaakkola JJK, “The South Karelia Air Pollution Study: Acute Health Effects of Malodorous Sulphur Air Pollutants Released by a Pulp Mill,” American Journal of Public Health, Volume 82, April 1992, 603–605.
[265] Nischan P, Ebeling K, Schindler C, “Smoking and Invasive Cervical Cancer Risk: Results from a Case-Control Study,” American Journal of Epidemiology, Volume 128, July 1988, 74–77.
[266] Coste J, Job-Spira N, Fernandez H, Papiernik E, Spira A, “Risk Factors for Ectopic Pregnancy: A Case-Control Study in France, with Special Focus on Infectious Factors,” American Journal of Epidemiology, Volume 133, May 1, 1991, 839–849.
[267] Armstrong BG, McDonald AD, Sloan M, “Cigarette, Alcohol, and Coffee Consumption and Spontaneous Abortion,” American Journal of Public Health, Volume 82, January 1992, 85–87.
[268] Smith PF, Mikl J, Truman BI, Lessner L, Lehman JS, Stevens RW, Lord EA, Broaddus RK, Morse DL, “HIV Infection Among Women Entering the New York State Correctional System,” American Journal of Public Health Supplement, Volume 81, May 1991, 35–40.
[269] Chassin MR, Kosecoff J, Park RE, Winslow CM, Kahn KL, Merrick NJ, Keesey J, Fink A, Solomon DH, Brook RH, “Does Inappropriate Use Explain Geographic Variations in the Use of Health Care Services?,” Journal of the American Medical Association, Volume 258, November 13, 1987, 2533–2537.
[270] King A, “Enhancing the Self-Report of Alcohol Consumption in the Community: Two Questionnaire Formats,” American Journal of Public Health, Volume 84, February 1994, 294–296.
[271] Rea TD, Fahrenbruch C, Culley L, Donohoe RT, Hambly C, Innes J, Bloomingdale M, Subido C, Romines S, Eisenberg MS, “Cpr with Chest Compression Alone or with Rescue Breathing,” The New England Journal of Medicine, Volume 363, July 29, 2010, 423–432.
[272] “Children, Food, and Nutrition: Growing Well in a Changing World,” The State of the World’s Children 2019, New York: unicef, 2019.
[273] Tefft BC, “Rates of Motor Vehicle Crashes, Injuries and Deaths in Relation to Driver Age, United States, 2014–2015,” AAA Foundation for Traffic Safety, June 2017.
[274] Cominacini L, Zocca I, Garbin U, Davoli A, Compri R, Brunetti L, Bosello O, “Long-Term Effect of a Low-Fat, High-Carbohydrate Diet on Plasma Lipids of Patients Affected by Familial Endogenous Hypertriglyceridemia,” American Journal of Clinical Nutrition, Volume 48, July 1988, 57–65.
[275] Miller PF, Sheps DS, Bragdon EE, Herbst MC, Dalton JL, Hinderliter AL, Koch GG, Maixner W, Ekelund LG, “Aging and Pain Perception in Ischemic Heart Disease,” American Heart Journal, Volume 120, July 1990, 22–30.
[276] Dean HT, Arnold FA, Elvove E, “Domestic Water and Dental Caries,” Public Health Reports, Volume 57, August 7, 1942, 1155–1179.
[277] Wolfe SM, Williams C, Zaslow A, “Public Citizen Health Research Group Ranking of the Rate of State Medical Boards’ Serious Disciplinary Actions, 2009–2011,” Public Citizen, May 17, 2012.
[278] Rollins JD, Collins JS, Holden KR, “United States Head Circumference Growth Reference Charts,” Journal of Pediatrics, Volume 156, 2010, 907–913.
[279] Kleinbaum DG, Kupper LL, Muller KE, Applied Regression Analysis and Other Multivariable Methods, Boston, Massachusetts: PWS-Kent, 1988.
[280] State of the World’s Children 2019. Children, Food, and Nutrition: Growing Well in a Changing World, New York: unicef, 2019.
[281] Wyatt JS, Edwards AD, Cope M, Delpy DT, McCormick DC, Potter A, Reynolds EOR, “Response of Cerebral Blood Volume to Changes in Arterial Carbon Dioxide Tension in Preterm and Term Infants,” Pediatric Research, Volume 29, June 1991, 553–557.
[282] Ahmadian HR, Sclafani JJ, Emmons EE, Morris MJ, Leclerc KM, Slim AM, “Comparison of Predicted Exercise Capacity Equations and the Effect of Actual versus Ideal Body Weight among Subjects Undergoing Cardiopulmonary Exercise Testing,” Cardiology Research and Practice, Volume 2013, ID 940170.
[283] Weeks JL, Fox M, “Fatality Rates and Regulatory Policies in Bituminous Coal Mining, United States, 1959–1981,” American Journal of Public Health, Volume 73, November 1983, 1278–1280.
[284] Agongo G, Nonterah EA, Debpuur C, Amenga-Elego L, Ali S, Oduro A, Crowther NJ, Ramsey M, “The Burden of Dyslipidaemia and Factors Associated with Lipid Levels Among Adults in Rural Northern Ghana: An AWI-Gen Sub-Study,” plos one, Volume 13, 2018, e0206326.
[285] Gapminder, Data, 2005, www.gapminder.org/data.
[286] Van Horn L, Moag-Stahlberg A, Liu K, Ballew C, Ruth K, Hughes R, Stamler J, “Effects on Serum Lipids of Adding Instant Oats to Usual American Diets,” American Journal of Public Health, Volume 81, February 1991, 183–188.
[287] Almond CSD, Shin AY, Fortescue EB, Mannix RC, Wypij D, Binstadt BA, Duncan CN, Olson DP, Salerno AE, Newburger JW, Greenes DS, “Hyponatremia among Runners in the Boston Marathon,” The New England Journal of Medicine, Volume 352, April 14, 2005, 1550–1556.
[288] Hosmer DW, Lemeshow S, Applied Logistic Regression, New York: Wiley, 1989.
[289] Hoagland PM, Cook EF, Flatley M, Walker C, Goldman L, “Case-Control Analysis of Risk Factors for the Presence of Aortic Stenosis in Adults (Age 50 Years or Older),” American Journal of Cardiology, Volume 55, March 1, 1985, 744–747.
[290] Zweig MS, Singh T, Htoo M, Schultz S, “The Association Between Congenital Syphilis and Cocaine/Crack Use in New York City: A Case-Control Study,” American Journal of Public Health, Volume 81, October 1991, 1316–1318.
[291] Holtzman D, Anderson JE, Kann L, Arday SL, Truman BI, Kohbe LJ, “hiv Instruction, hiv Knowledge, and Drug Injection Among High School Students in the United States,” American Journal of Public Health, Volume 81, December 1991, 1596–1601.
[292] Rosenberg L, Palmer JR, Kelly JP, Kaufman DW, Shapiro S, “Coffee Drinking and Nonfatal Myocardial Infarction in Men Under 55 Years of Age,” American Journal of Epidemiology, Volume 128, September 1988, 570–578.
[293] Martinez FD, Cline M, Burrows B, “Increased Incidence of Asthma in Children of Smoking Mothers,” Pediatrics, Volume 89, January 1992, 21–26.
[294] Pearl R, “Tobacco Smoking and Longevity,” Science, Volume 87, March 4, 1938, 216–217.
[295] Ragni MV, Kingsley LA, “Cumulative Risk for aids and Other hiv Outcomes in a Cohort of Hemophiliacs in Western Pennsylvania,” Journal of Acquired Immune Deficiency Syndromes, Volume 3, July 1990, 708–713.
[296] Brown BW, “Estimation in Survival Analysis: Parametric Models, Product-Limit and Life Table Methods,” Statistics in Medical Research, New York: Wiley, 1982.
[297] Hosmer DW, Lemeshow S, Applied Survival Analysis: Regression Modeling of Time to Event Data, New York: John Wiley, 1999.
[298] Shapiro CL, Henderson IC, Gelman RS, Harris JR, Canellos GP, Frei E, “A Randomized Trial of Cyclophosphamide, Methotrexate, and Fluorouracil Versus Methotrexate, Fluorouracil Adjuvant Chemotherapy in Moderate Risk Breast Cancer Patients,” Proceedings of the American Association for Cancer Research, Volume 31, March 1990, 185.
[299] Patchell RA, Tibbs PA, Walsh JW, Dempsey RJ, Maruyama Y, Kryscio RJ, Markesbery WR, MacDonald JS, Young B, “A Randomized Trial of Surgery in the Treatment of Single Metastases to the Brain,” The New England Journal of Medicine, Volume 322, February 22, 1990, 494–500.
[300] Ash RC, Casper JT, Chitambar CR, Hansen R, Bunin N, Truitt RL, Lawton C, Murray K, Hunter J, Baxter-Lowe LA, Gottschall JL, Oldham K, Anderson T, Camitta B, Menitove J, “Successful Allogeneic Transplantation of T-Cell Depleted Bone Marrow from Closely hla-Matched Unrelated Donors,” The New England Journal of Medicine, Volume 322, February 22, 1990, 485–494.
[301] Laine L, Bonacini M, Sattler F, Young T, Sherrod A, “Cytomegalovirus and Candida Esophagitis in Patients with aids,” Journal of Acquired Immune Deficiency Syndromes, Volume 5, June 1992, 605–609.
[302] Hviid A, Hansen JV, Frisch M, Melbye M, “Measles, Mumps, Rubella Vaccination and Autism: A Nationwide Cohort Study,” Annals of Internal Medicine, Volume 170, 2019, 513–520.
[303] Kotler DP, “Cytomegalovirus Colitis and Wasting,” Journal of Acquired Immune Deficiency Syndromes, Volume 4, Supplement 1, 1991, S36–S41.
[304] Wei LJ, Lin DY, Weissfeld L, “Regression Analysis of Multivariate Incomplete Failure Time Data by Modeling Marginal Distributions,” Journal of the American Statistical Association, Volume 84, December 1989, 1065–1073.
[305] Ahmad T, Munir A, Bhatti SH, Aftab M, Raza MA, “Survival Analysis of Heart Failure Patients: A Case Study,” PLoS ONE, Volume 12, 2017, e0181001.
[306] National Health Interview Survey, National Center for Health Statistics, Centers for Disease Control and Prevention, 2021, https://www.cdc.gov/nchs/nhis.
[307] Nchs Fact Sheets, National Center for Health Statistics, Centers for Disease Control and Prevention, 2021, https://www.cdc.gov/nchs/about/fact_sheets.htm.
[308] “Data Collection Systems,” National Center for Health Statistics, Centers for Disease Control and Prevention, 2021, https://www.cdc.gov/nchs/index.htm.
[309] The Demographic and Health Surveys Program, https://dhsprogram.com.
[310] Judiciary and Judicial Procedure, Part V, “Chapter 121: Juries, Trials by Jury,” Legal Information Institute, United States Government Publishing Office, 2012, https://www.govinfo.gov/content/pkg/USCODE-2012-title28-partV-chap121.html.
[311] Chernoff NW, “No Records, No Right: Discovery & the Fair Cross-Section Guarantee,” Iowa Law Review, Volume 101, 2016, 1719–1786.
[312] Rao RU, Samarasekera SD, Nagodavithana KC, Punchihewa MW, Ranasinghe USB, Weil GJ, “Systematic Sampling of Adults as a Sensitive Means of Detecting Persistence of Lymphatic Filariasis Following Mass Drug Administration in Sri Lanka,” PLoS Neglected Tropical Diseases, Volume 13, April 22, 2019, https://doi.org/10.1371/journal.pntd.0007365.
[313] Nelson RK, Winling L, Marciano R, Connolly N, “Mapping Inequality,” American Panorama, ed. Nelson RK, Ayers EL, https://dsl.richmond.edu/panorama/redlining.
[314] Statistical Release P0302 Mid-Year Population Estimates, Statistics South Africa, Republic of South Africa, 2019.
[315] Dell A, Kahn D, “Geographical Maldistribution of Surgical Resources in South Africa: A Review of the Number of Hospitals, Hospital Beds and Surgical Beds,” South African Medical Journal, Volume 107, 2017, 1099–1105.
[316] Laplace PS, “Sur les Naissances, les Mariages et les Morts,” Mémoires de l’Académie Royale des Sciences de Paris, 1783, 693–702.
[317] Lohr SL, Sampling Design and Analysis, Second Edition, New York: Chapman and Hall/CRC, 2019.
[318] Henderson RE, Davis H, Eddins D, Foege W, “Assessment of Vaccination Coverage, Vaccination Scar Rates, and Smallpox Scarring in Five Areas of West Africa,” Bulletin of the World Health Organization, Volume 48, 1973, 183–194.
[319] Henderson RE, Sundaresan T, “Cluster Sampling to Assess Immunization Coverage: A Review of Experience with a Simplified Sampling Method,” Bulletin of the World Health Organization, Volume 60, 1982, 253–260.
[320] “The Immunization Programme that Saved Millions of Lives,” Bulletin of the World Health Organization, Volume 92, 2014, 314–315.
[321] Otte MJ, Gumm ID, “Intra-Cluster Correlation Coefficients of 20 Infections Calculated from the Results of Cluster-Sample Surveys,” Preventive Veterinary Medicine, Volume 31, 1997, 147–150.
[322] Gatrell N, Herman J, Olarte S, Feldstein M, Localio R, “Psychiatrist–Patient Sexual Contact: Results of a National Survey, Prevalence,” American Journal of Psychiatry, Volume 143, September 1986, 1126–1131.
[323] Owen K, “Honesty No Longer Assumed Where Religion is Concerned,” The Owensboro Messenger–Inquirer, January 16, 1999.
[324] Weissman AN, Steer RA, Lipton DS, “Estimating Illicit Drug Use Through Telephone Interviews and the Randomized Response Technique,” Drug and Alcohol Dependence, Volume 18, 1986, 225–233.
[325] Zuckerman B, Frank DA, Hingson R, Amaro H, Levenson SM, Kayne H, Parker S, Vinci R, Aboagye K, Fried LE, Cabral H, Timperi R, Bauchner H, “Effects of Maternal Marijuana and Cocaine Use on Fetal Growth,” The New England Journal of Medicine, Volume 320, June 1989, 762–768.
[326] Hatziandreu EJ, Pierce JP, Fiore MC, Grise V, Novotny TE, Davis RM, “The Reliability of Self-Reported Cigarette Consumption in the United States,” American Journal of Public Health, Volume 79, August 1989, 1020–1023.
[327] Silverman H, “Ethical Issues during the Conduct of Clinical Trials,” Proceedings of the American Thoracic Society, Volume 4, 2007, 180–184.
[328] Talbot TR, Stapleton JT, Brady RC, Winokur PL, Bernstein DI, Germanson T, Yoder SM, Rock MT, Crowe JE, Edwards KM, “Vaccination Success Rate and Reaction Profile with Diluted and Undiluted Smallpox Vaccine,” Journal of the American Medical Association, Volume 292, September 2004, 1205–1212.
[329] Best M, Neuhauser D, Slavin L, “Evaluating Mesmerism, Paris, 1784: The Controversy Over the Blinded Placebo Controlled Trials has not Stopped,” Quality and Safety in Health Care, Volume 12, 2003.
[330] Rosner F, “The Ethics of Randomized Clinical Trials,” American Journal of Medicine, Volume 82, 1987, 283–290.
[331] Collins R, Bowman L, Landray M, Peto R, “The Magic of Randomization versus the Myth of Real-World Evidence,” The New England Journal of Medicine, Volume 382, February 13, 2020, 674–678.
Glossary
abbreviated life table: A life table with age intervals longer than one year, often five years in length.
additive rule of probability: For mutually exclusive events, the rule stating that the probability of the union of the events is equal to the sum of their individual probabilities.
adjacent values: In a box plot, the most extreme observations that are not more than 1.5 times the height of the box (the interquartile range) beyond either quartile.
adjusted rate: A rate that has been adjusted so that it reflects a particular distribution of a confounder.
age-specific rate: A rate calculated for a subgroup of a specific age rather than the population as a whole.
all possible models: A regression model selection technique in which models with all possible combinations of the explanatory variables are fitted and the best model is chosen.
allocation concealment: When the randomization for an experimental study is carried out by an individual who is not involved in either the treatment of study subjects or the assessment of outcomes, ensuring that the implementation of the randomization process is free from manipulation.
alternative hypothesis: A statement about one or more population parameters which is the complement of the null hypothesis; the conclusion reached when the null hypothesis is rejected.
analysis of variance: A statistical procedure for testing the equality of group means which can be used when there are more than two groups.
average: See mean.
backward elimination: A regression model selection technique where initially all explanatory variables are included, and variables are removed one at a time until a final model is reached.
bar chart: A graph used with nominal or ordinal data to display the frequency or relative frequency of measurements within each category or class, by a bar of appropriate length.
Bayes’ theorem: A formula for calculating a conditional probability, the probability of an event given a new piece of information; a worked sketch follows this group of entries.
bell-shaped curve: A description in plain terms of the shape of the normal probability distribution.
Berkson’s fallacy: An error in inference resulting from the use of a biased sample.
Bernoulli random variable: A random variable that takes on one of only two possible values.
bias: A general term for the inaccuracy that results from systematic distortion or error, as opposed to random error; the error related to the way in which the target population and sampled population differ.
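To make Bayes’ theorem concrete, here is a minimal Python sketch of the standard screening-test calculation: the probability of disease given a positive test result. It is illustrative only; the function name and the sensitivity, specificity, and prevalence values are our own assumptions, not figures from the text.

    def positive_predictive_value(sensitivity, specificity, prevalence):
        """Bayes' theorem for a screening test:
        P(D | +) = P(+|D) P(D) / [P(+|D) P(D) + P(+|not D) P(not D)].
        All three arguments are probabilities between 0 and 1."""
        true_positives = sensitivity * prevalence
        false_positives = (1 - specificity) * (1 - prevalence)
        return true_positives / (true_positives + false_positives)

    # Illustrative values: 95% sensitivity, 90% specificity, 1% prevalence.
    print(positive_predictive_value(0.95, 0.90, 0.01))  # roughly 0.088

Note that even a fairly accurate test yields a low positive predictive value when the condition is rare; the prior probability, here the prevalence, drives the result.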
bias of an estimator: The average difference between the value of an estimator and the value of the parameter it is estimating.
big data: A large volume of data which often arrives rapidly and in a less structured form.
bimodal distribution: A set of measurements with two modes.
binary data: See dichotomous data.
binomial distribution: The probability distribution which describes the number of successes among a fixed number of independent Bernoulli variables (trials), each with the same probability of success; a short sketch follows this group of entries.
biostatistics: The study of statistics dealing with the biological and health sciences, the three tenets being the study of variability, inference, and probability.
blinding: The situation in which participants, investigators, and other individuals involved in a research study are unaware of the exposure group to which a particular subject is assigned.
block randomization: Randomization performed in small blocks of prespecified size, such that after each consecutive block has been completed, the number of participants in each treatment group is perfectly balanced.
Bonferroni correction: A modification made to the level of significance for individual pairwise hypothesis tests in order to control the overall probability of making a type I error.
box plot: A graph used with discrete or continuous data which uses a single axis to display the distribution of measurements by plotting selected summaries including the minimum, 25th percentile, median, 75th percentile, and the maximum.
case-control study: An observational study design where risk factors are compared for subjects with and without a specified outcome condition.
categorical data: Measurements that are either nominal or ordinal.
censoring: In time to event data, failure to observe the outcome event, often due to insufficient observation time.
central limit theorem: A theorem which states that, for large samples, the probability distribution of the sample mean is approximately normal, for most classes of original measurements.
chi-square distribution: A sampling distribution of positive statistics that are often associated with evaluating model fit, such as when testing the independence of row and column classifications in a contingency table.
chi-square test: A statistical procedure used to evaluate the association between an observed and a hypothesized tabulation.
circle of powers: A general guide for choosing the appropriate power transformation to achieve linearity in a regression model.
clinical trial: An ethical experimental study involving human subjects.
cluster sampling: A sampling process where a population is first stratified into subgroups, and then some of these subgroups are sampled.
coefficient: In a regression model, a parameter which relates an explanatory variable to the response.
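As a sketch of the binomial distribution, the following Python fragment evaluates the probability mass function directly from the combination formula; the choice of n, p, and x is arbitrary and for illustration only.

    from math import comb

    def binomial_pmf(x, n, p):
        """P(X = x) = C(n, x) * p**x * (1 - p)**(n - x), the probability
        of exactly x successes among n independent Bernoulli trials,
        each with success probability p."""
        return comb(n, x) * p**x * (1 - p)**(n - x)

    # Illustrative: exactly 2 successes in 10 trials with p = 0.3.
    print(binomial_pmf(2, 10, 0.3))  # approximately 0.233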
coefficient of determination: A measure of the fit of a linear regression model which can be interpreted as the proportion of the variability in the response that is accounted for by its relationships with the explanatory variables.
cohort life table: A life table constructed based on longitudinal information collected for each individual in a population.
cohort study: An observational study design where subjects are followed over time, and outcomes are compared for subjects with and without a particular risk factor or exposure.
collinearity: A situation where two or more of the explanatory variables in a regression model are highly correlated with each other.
combination: The number of ways in which x distinct objects can be selected from a total of n distinct objects without regard to order.
comparative study: A research study conducted to facilitate an understanding of the relationships between outcomes and explanatory variables measured for a population.
complement: For a defined outcome of interest, the event that includes all other possible outcomes which are not the outcome of interest.
conditional probability: The probability of one event given that a second event has already occurred.
confidence interval: A range of values calculated from sample data which, prior to calculation, has a specified probability of containing the true value of a population parameter; this range provides information about the precision with which the parameter is estimated. A short sketch follows this group of entries.
confounder: When studying the relationship between an outcome and an explanatory variable, a third variable which impacts or confuses the relationship and thus should not be ignored.
consistency: The property of an estimator where, as the sample size increases, the estimate of the population parameter approaches its true value.
contingency table: A table used to display frequencies for two categorical variables, where one variable defines the rows of the table and the other the columns.
continuity correction: A small adjustment to a calculation used to obtain a better approximation, especially when using a continuous probability distribution to approximate the distribution of a discrete random variable.
continuous data: Data that represent measurable quantities which are not restricted to certain specified values (such as integers); measurements can assume any value on the real line.
continuous random variable: A random variable that can take on any value within a specified interval or continuum.
control group: In a research study, a group of subjects who do not receive the exposure or intervention of interest; these subjects are then compared to those who do receive the exposure.
correlation: The quantification of the degree to which two variables are related.
correlation coefficient: Typically refers to the Pearson correlation coefficient.
Cox proportional hazards model: A regression model for survival data which relates the hazard function to one or more explanatory variables.
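As a sketch of a confidence interval in practice, the following Python fragment computes a large-sample 95% confidence interval for a mean, x-bar plus or minus 1.96 standard errors. The data are invented for illustration, and for small samples a t distribution multiplier would replace 1.96.

    from math import sqrt
    from statistics import mean, stdev

    def normal_ci_for_mean(sample, z=1.96):
        """Approximate 95% confidence interval for a population mean,
        using the large-sample normal approximation:
        x_bar +/- z * s / sqrt(n)."""
        n = len(sample)
        x_bar = mean(sample)
        standard_error = stdev(sample) / sqrt(n)
        return (x_bar - z * standard_error, x_bar + z * standard_error)

    # Illustrative data only:
    print(normal_ci_for_mean([98.2, 98.6, 98.9, 97.8, 98.4, 98.7, 98.1, 98.5]))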
critical value: The value(s) that a test statistic must exceed in order to reject the null hypothesis.
crossover trial: A study design in which two or more treatments are applied to the same subject consecutively, possibly in different orders.
cross-sectional data: Data where measurements are collected at a single point in time for each study subject.
cross-sectional study: An observational study where risk factors and outcomes are measured at the same point in time.
crude rate: The number of occurrences of a particular outcome over a given period of time divided by the number of individuals at risk for the outcome during that time period; a rate for an entire population that is neither specific to a subgroup, nor adjusted for a confounder.
cumulative frequency polygon: A frequency polygon where the vertical axis displays cumulative relative frequencies rather than counts or relative frequencies.
cumulative relative frequency: For an interval or category in a table, the percentage of the total number of observations that take a value less than or equal to the upper limit of that interval.
current life table: See period life table.
death rate: See mortality rate.
degrees of freedom: A parameter, associated with some sampling distributions, which depends on the sample size.
demographic data: Data used to describe the characteristics of a population of people, including its composition by sex, age, and other social factors.
dependent variable: See response variable.
descriptive statistics: Tables, graphs, and numerical summary measures used to organize and summarize a set of data.
diagnostic test: A test used to confirm the diagnosis of a medical condition in a group of individuals who may exhibit symptoms of the condition.
dichotomous data: A nominal measurement that takes on one of only two distinct values, such as 0 or 1.
direct standardization: A method for computing the rates we would expect to find in the populations under study if they all had the same composition according to a confounder; a short sketch follows this group of entries.
discrete data: Measurements where both ordering and magnitude are important; the numbers represent actual measurable quantities but are restricted to taking on specified values (often integers or counts).
discrete random variable: A random variable that can assume only a finite or countable number of values.
disjoint: See mutually exclusive.
distribution-free methods: See nonparametric test.
ecological fallacy: The error in reasoning that comes from inferring individual characteristics from group, or aggregate, characteristics.
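To illustrate direct standardization, here is a minimal Python sketch using two hypothetical age strata; the stratum-specific rates and the standard population counts are invented for illustration.

    def directly_standardized_rate(stratum_rates, standard_population):
        """Direct standardization: a weighted average of stratum-specific
        rates, with the weights taken from a chosen standard population."""
        total = sum(standard_population)
        weighted_sum = sum(rate * count for rate, count
                           in zip(stratum_rates, standard_population))
        return weighted_sum / total

    # Illustrative: rates per 1000 in two age strata, and the sizes of
    # those strata in the standard population.
    rates = [2.0, 15.0]
    standard = [70_000, 30_000]
    print(directly_standardized_rate(rates, standard))  # 5.9 per 1000

Because every population being compared is weighted by the same standard, the resulting adjusted rates are free of confounding by the stratifying variable.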
empirical probability: A probability calculated from a finite amount of data.
equipoise: When conducting a randomized clinical trial, a state of balance where neither treatment option is known to be inferior to the other.
estimation: The process of using information from a sample to draw conclusions about the value of a population parameter.
estimator: A statistic used to estimate the value of a population parameter using the information contained in a sample.
event: The basic element to which probability can be applied; a single outcome from an experiment.
exact test: A test based on the exact distribution of the test statistic.
exhaustive: A set of events comprising all possible outcomes, therefore taking up the entire sample space.
explanatory variable: In regression, a variable that is used to help explain the variability in the response variable.
factorial: The product of a positive integer n and all the positive integers which are smaller.
false negative: A negative screening or diagnostic test result in an individual who has the condition.
false positive: A positive screening or diagnostic test result in an individual who does not have the condition.
F distribution: A sampling distribution for the weighted ratio of two independent sample variances from normal populations.
Fisher’s exact test: A nonparametric test for evaluating associations in contingency tables, for which we can calculate an exact distribution.
forward selection: A regression model selection technique which begins with no explanatory variables in the model; variables are then introduced one at a time until a final model is reached.
frequency distribution: A list of values or categories of values in a dataset along with the numerical counts that correspond to each one.
frequency polygon: A graph similar to a histogram where the horizontal midpoints of the top of each bar are joined by straight lines to provide a smoother representation of the frequency distribution.
frequentist definition: The probability of an event is defined as the proportion of times the event occurs in repeated trials or experiments.
F-test: A hypothesis testing procedure for comparing two variances.
Gaussian distribution: See normal distribution.
graph: A pictorial representation of numerical data.
hazard function: The proportion of individuals alive at the beginning of a time interval who die within the interval, or, for continuous time, the instantaneous rate of failure at time t for an individual who has survived up to time t.
hazard ratio: The ratio of two hazard functions.
histogram: A graph used with discrete or continuous data to display the frequency or relative frequency of measurements within defined categories.
homoscedasticity: A situation in which the variance (standard deviation) of a measurement is the same across subgroups.
hypothesis test: A statistical procedure used to evaluate whether observed sample data is consistent with a proposed null hypothesis; an approach to statistical inference which leads to a decision to either reject or not reject the null hypothesis.
independent events: Events where the probability that one event happens is not impacted by knowledge about whether the other event has happened.
indicator variable: A dichotomous variable that indicates the existence or nonexistence of a condition, usually as an explanatory variable in a regression model.
indirect standardization: A method for computing the rates we would expect to find in the populations under study if they all had the same subgroup specific rates according to a confounder.
infant mortality rate: The proportion of infants born alive in a particular year who do not survive to their first birthday.
inference: The process of drawing conclusions about an entire population, including those not sampled, based on the information contained in a sample.
intention to treat: In a randomized trial or experiment, an analysis where each study subject remains in the group to which they were assigned, even if they do not ultimately receive that treatment.
interaction: The situation where one explanatory variable has a different relationship with the response depending on the value of a second explanatory variable.
interquartile range: A numerical summary measure of variation for a set of measurements defined as the difference between the 25th and 75th percentiles.
intersection: For two or more individual events, the event that they all happen simultaneously.
interval estimation: Use of sample data to calculate a range of reasonable values to estimate a population parameter.
Kaplan-Meier method: See product-limit method.
Kruskal-Wallis test: A nonparametric hypothesis testing procedure for three or more independent groups; this test is a nonparametric alternative to the one-way analysis of variance.
ladder of powers: See circle of powers.
level of confidence: The probability that a confidence interval contains the parameter it is estimating in repeated sampling.
life expectancy: The mean survival beyond a certain age.
life table: A statistical tool which summarizes the mortality experience of a population over time.
linear regression: A statistical technique relating the conditional means of a continuous response (dependent) variable to one or more explanatory (independent) variables.
line graph: A two-way scatter plot where each value on the horizontal axis has a single corresponding measurement on the vertical axis, and adjacent points are connected by straight lines.
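The interquartile range defined above is computed directly from the 25th and 75th percentiles; a minimal Python sketch with hypothetical data:

    import statistics

    data = [3, 5, 7, 8, 12, 13, 14, 18, 21]       # hypothetical measurements
    q1, q2, q3 = statistics.quantiles(data, n=4)  # 25th, 50th, 75th percentiles
    print(q3 - q1)                                # interquartile range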
logistic function: A nonnegative, S-shaped, monotonically increasing function which can be used to model a probability.
logistic regression: A regression technique used when the response variable is dichotomous.
log-rank test: A hypothesis testing method to evaluate the difference between two survival curves.
longitudinal data: Data where multiple measurements are taken over time on the same study subject.
Mann-Whitney test: See Wilcoxon rank sum test.
maximum likelihood estimation: A method for calculating an estimator which maximizes the likelihood function; it produces the value of the population parameter most likely to have produced the observed sample data.
McNemar’s test: A hypothesis testing procedure for comparing proportions when the data are paired.
mean: A numerical summary measure of central tendency defined as the sum of a set of measurements divided by the number of measurements.
measure of central tendency: A numerical summary measure that characterizes the center of a set of data, or the point about which the observations tend to cluster.
measure of variability: A numerical summary measure that characterizes the amount of variability or dispersion in a set of data.
median: A numerical summary measure of central tendency defined as the 50th percentile of a set of measurements.
method of least squares: A technique for fitting a linear regression model which minimizes the sum of the squares of the residuals.
mode: A numerical summary measure defined as the value that occurs most frequently in a set of measurements.
model selection: Statistical and nonstatistical considerations used to determine which explanatory variables should be included in a regression model.
mortality rate: The number of deaths occurring during some time period divided by the total population during that time period.
multiple comparisons procedure: A procedure for performing pairwise comparisons of the means of three or more populations, usually while controlling the overall probability of making a type I error.
multiplicative method: A calculation characterized by multiplication over consecutive time intervals.
multiplicative rule of probability: A rule for calculating the probability that two events occur simultaneously.
mutually exclusive events: Events that cannot happen simultaneously.
natural experiment: A study in which individuals’ exposure to either experimental or control conditions is determined by nature and not by design.
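As an illustration of the method of least squares entry above, the closed-form estimates of the slope and y-intercept in simple linear regression can be computed as follows; the data values here are hypothetical.

    x = [1, 2, 3, 4, 5]                       # hypothetical explanatory values
    y = [2.1, 3.9, 6.2, 7.8, 10.1]            # hypothetical responses
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx                         # minimizes the sum of squared residuals
    intercept = ybar - slope * xbar           # y-intercept
    print(slope, intercept)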
negative likelihood ratio: A quantification of the informative benefit of a negative screening test, defined as specificity divided by 1 minus sensitivity.
negative predictive value: The probability that an individual who tests negative for a disease does not actually have the disease.
nominal data: Measurements where the values fall into unordered categories or classes; numbers used to represent the categories are merely labels.
nonparametric test: A hypothesis testing procedure where the underlying probability distributions are not assumed to be known.
normal distribution: The most commonly occurring continuous probability distribution. It has bell-shaped probability density. This distribution is the approximate sampling distribution of the sample mean of a large simple random sample from a population with finite variance.
null event: An event that can never occur.
null hypothesis: The existing state of affairs formulated as a hypothesis about the value of a population parameter.
numerical summary measure: A single number capturing an important characteristic of a set of measurements, such as the center of the values, or the amount of variability among the values.
observational study: A research study where subjects are classified as to whether or not they have a particular exposure, but are not assigned an exposure status by the investigator; the exposure occurs in some natural way, and its presence or absence is merely observed.
odds: The probability that an event will happen divided by the probability that the event will not happen; the probability of an event divided by 1 minus the probability of the event.
odds ratio: The odds of an event in one group divided by the odds of the event in another group.
one-sample t-test: A hypothesis test that evaluates the distance of the sample mean from the hypothesized mean, measured in units of the sample standard deviation; used when the population standard deviation is not known.
one-sample z-test: A hypothesis test that evaluates the distance of the sample mean from the hypothesized mean, measured in units of the standard deviation; used when the population standard deviation is known.
one-sided confidence interval: A confidence interval with either a lower bound or an upper bound, but not both.
one-sided test of hypothesis: A test of hypothesis in which the alternative hypothesis is bounded either from above or from below, but not both.
one-way analysis of variance: A hypothesis testing procedure used to compare the means of two or more independent populations.
ordinal data: Measurements where the values fall into ordered categories or classes; numbers used to represent the categories are merely labels but must preserve the order of the categories.
outcome variable: See response variable.
outlier: A measurement that lies considerably outside the range of the other data values.
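To make the odds and odds ratio entries concrete, here is a minimal sketch using hypothetical 2 x 2 table counts; it is an illustration only.

    # Hypothetical counts: exposed group has a events and b non-events,
    # unexposed group has c events and d non-events.
    a, b, c, d = 30, 70, 15, 85
    odds_exposed = a / b                  # odds of the event when exposed
    odds_unexposed = c / d                # odds of the event when unexposed
    print(odds_exposed / odds_unexposed)  # odds ratio = (a*d)/(b*c), about 2.43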
paired data: Data where each measurement in one group has a corresponding measurement in a second group.
paired t-test: A hypothesis testing procedure used to evaluate the mean difference for paired data.
parameter: A number that summarizes some characteristic of a probability distribution.
parametric test: A hypothesis testing procedure where the underlying probability distributions are assumed to be known, and only the values of certain population parameters are not known.
parsimonious model: A regression model which includes only those explanatory variables that help to predict the response, the coefficients of which can be accurately estimated, and no others; motivated by Occam’s Razor, or the law of parsimony in problem solving, that the simplest explanation is usually the correct one.
Pearson correlation coefficient: A measure of the strength of the linear relationship between two continuous random variables.
percentile: One of the points which divides a set of measurements into 100 equal parts; the pth percentile is a value which is greater than or equal to p% of the observations and less than or equal to the remaining (100 − p)%.
perinatal mortality rate: Number of infant deaths under age 7 days plus fetal deaths at 28 weeks of gestation or more, divided by the total number of live births, in a particular year.
period life table: A life table constructed based on cross-sectional data rather than longitudinal information for each individual.
permuted block randomization: Block randomization where the size of the blocks is varied randomly.
person-year: A measure of life experience that takes into account both the number of people in a study and how much time each person contributes to the study.
placebo: In a clinical trial, a substance administered in the control group that has no therapeutic effect.
point estimation: Use of sample data to calculate a single number to estimate a population parameter.
Poisson distribution: A probability distribution for modeling discrete events which occur infrequently in time or space.
population at risk: The denominator when calculating a rate or proportion; all individuals eligible to be part of the numerator.
population mean: The mean or average value assumed by a random variable.
population regression line: The line relating the conditional means of the response variable to the explanatory variable in the underlying population being studied.
population variance: The variability of the values of a random variable around the population mean.
positive likelihood ratio: A quantification of the informative benefit of a positive screening test, defined as sensitivity divided by 1 minus specificity.
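The Poisson distribution entry above corresponds to the probability mass function P(X = k) = e^(-µ) µ^k / k!; a minimal Python sketch:

    import math

    def poisson_pmf(k, mu):
        # P(X = k) for a Poisson random variable with mean mu
        return math.exp(-mu) * mu ** k / math.factorial(k)

    print(round(poisson_pmf(2, 1.0), 4))   # 0.1839, as in Table A.2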
positive predictive value: The probability that an individual has the disease being tested for given a positive test result.
power: The probability of rejecting the null hypothesis when it is false; the likelihood that a study will detect a deviation from the null hypothesis given that one exists.
power curve: A graph which displays the power of a hypothesis test as a function of the hypothesized value of the population parameter under the alternative hypothesis.
precision medicine: The search for treatment strategies tailored to individual characteristics, such as genetics, environment, and lifestyle.
prevalence: The proportion of a population that share a particular condition at a specific point in time.
probability: A measure of the uncertainty of an event happening that takes values from zero (impossible) to one (certain to happen); the proportion of times the event occurs in a large number of identical trials.
probability density: A smooth curve used to represent the probability distribution of a continuous random variable.
probability distribution: A description of the behavior of a random variable that quantifies the probability that the random variable takes on a particular value or range of values.
product-limit method: A method for calculating a life table or survival curve from longitudinal data which accounts for the cumulative effect of the mortality experiences at all preceding time points.
proportion: A ratio in which all individuals included in the numerator are also in the denominator; a fraction.
p-value: The probability of obtaining a sample statistic as extreme or more extreme than the one observed, given that the null hypothesis is true.
quartile: One of the three values that divide a set of measurements into four equal parts or quarters; the 25th, 50th, and 75th percentiles.
randomization: A process which uses probability to assign subjects to an exposure group as they are enrolled in an experimental study.
randomized clinical trial: An ethical experimental study involving randomizing human subjects to various treatments or procedures.
randomized study: A research study where the investigator uses a random mechanism to assign subjects to one exposure group or another.
random sample: A sample from a population resulting from a process of measuring independent and identically distributed random variables.
random variable: Any characteristic that can be measured or categorized and which can assume a number of different values, where any particular outcome is determined by chance.
range: A numerical summary measure of variation for a set of measurements defined as the difference between the largest and smallest values.
ranked data: Measurements that are arranged from highest to lowest (or lowest to highest) according to magnitude, and then assigned numbers corresponding to each observation’s place in the sequence.
rate: The number of occurrences of a particular outcome over a given period of time divided by the size of the population generating the outcomes in that time period.
receiver-operating characteristic curve: See roc curve.
relative frequency: The proportion of the total number of observations within a category or class rather than the absolute number.
relative odds: See odds ratio.
relative risk: See risk ratio.
residual: The vertical distance of a point from a regression line; the difference between the observed value and the value predicted by the model.
residual plot: A plot of the residuals against the independent or explanatory variable; it is used to examine important model diagnostic information.
response variable: In regression, the variable whose values are the outcome of the study.
risk difference: The probability of an event in one group minus the probability of the event in another group; a measure of the absolute difference in risk between two groups.
risk ratio: The probability of an event in one group divided by the probability of the event in another group; a measure of the relative difference in risk between two groups.
robust estimator: An estimator that is not overly influenced by outliers.
roc curve: Line graph of the probability of a true positive screening test result versus the probability of a false positive test result for a range of different cutoff points.
sample size estimation: The process of estimating the sample size needed to achieve a specified power for a hypothesis test.
sample space: All of the outcomes which could possibly occur.
sampling distribution: The probability distribution of an estimator of a population parameter.
sampling variability: The observed variability in an estimator when a random sampling procedure is repeated.
scatter plot: See two-way scatter plot.
screening test: A test used to detect a condition in a group of individuals who do not exhibit any symptoms of the condition.
self-pairing: A technique where measurements are taken on a single subject at two distinct points in time.
sensitivity: The probability of a positive screening or diagnostic test result in an individual who does have the disease.
significance level: The probability value used as the cutoff to define statistical significance.
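The sensitivity, screening test, and roc curve entries above are all built from the same 2 x 2 screening table; the following minimal sketch, with hypothetical counts, shows the basic calculations.

    # Hypothetical screening results
    tp, fn = 90, 10       # diseased individuals: true positives, false negatives
    fp, tn = 45, 855      # disease-free individuals: false positives, true negatives

    sensitivity = tp / (tp + fn)   # P(test positive | disease)
    specificity = tn / (tn + fp)   # P(test negative | no disease)
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    print(sensitivity, specificity, ppv, npv)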
sign test: A nonparametric hypothesis testing procedure for paired data that is based on the signs (plus or minus) of the differences within each pair.
simple random sample: A sample of n items selected from a population where each sample of size n has an equal probability of being chosen.
Simpson’s paradox: The difference in the measured association between two variables when a confounder is accounted for and when it is not.
simulation: The process of using a computer to repeatedly model and aggregate the data from an experiment or procedure according to a specified probability distribution.
skewed data: A set of measurements where the distribution is not symmetric.
slope: In a simple linear regression model, the mean change in the response that corresponds to a one-unit increase in the explanatory variable.
Spearman rank correlation coefficient: A correlation coefficient calculated for ranked measurements, rather than the actual measured values.
specificity: The probability of a negative screening or diagnostic test result in an individual who does not have the disease.
spectrum bias: A phenomenon that occurs when the subjects being tested to establish the properties of a screening or diagnostic test do not reflect the full spectrum of future or intended subjects of the test.
standard deviation: A numerical summary measure of variation which is the positive square root of the variance.
standard error: The standard deviation of the sampling distribution of an estimator.
standardization: When two or more factors impact an outcome, the practice of fixing all but one factor to a standard, in order to study the effect of the remaining factor on the outcome.
standardized mortality ratio: The ratio of the observed number of deaths to the number of deaths expected when applying standard age-specific mortality rates.
standard normal deviate: The outcome of a standard normal random variable.
standard normal distribution: The normal distribution with mean 0 and variance (or standard deviation) 1.
stationary population: A population whose size and characteristics do not vary over time.
statistical inference: See inference.
statistical package: A series of programs designed to analyze numerical data.
statistical significance: When the p-value of a hypothesis test is below a prespecified significance level, often set at 0.05.
statistics: The science related to the collection, organization, analysis, and interpretation of numerical data sampled from a population, in order to describe that population.
stepwise selection: A regression model selection technique which combines forward selection and backward elimination, adding and removing explanatory variables until a final model is reached.
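As an illustration of the standardized mortality ratio defined above, a minimal sketch with hypothetical age-specific rates, population counts, and an observed death total:

    # Hypothetical standard age-specific mortality rates and the study
    # population's age distribution
    standard_rates = [0.001, 0.005, 0.020]
    population = [20000, 10000, 2500]
    expected = sum(r * n for r, n in zip(standard_rates, population))  # 120
    observed = 132                       # hypothetical observed deaths
    print(observed / expected)           # standardized mortality ratio = 1.1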
stratified analysis: Analyses repeated for each separate stratum of a population.
stratified randomization: A situation where randomization is performed separately within strata defined by a confounder.
stratified sampling: The technique of first stratifying a population into nonoverlapping subgroups, and then sampling from each of the subgroups.
Student’s t distribution: The probability distribution of the distance between the sample mean and the population mean, divided by the estimated standard error, when sampling from a normal distribution.
study design: The creation of a plan to collect data to answer an important scientific question about the association between an outcome and one or more explanatory variables.
survival analysis: Statistical techniques used to study the probability of surviving beyond a certain time.
survival curve: A graphical representation of the life table for a population, or equivalently, a graphical representation of a survival function.
survival function: The probability of surviving beyond a certain time; a distribution of survival times.
survival time: The time from start of follow-up until the occurrence of a specified event.
symmetric data: A set of measurements where the distribution of values has the same shape on either side of the 50th percentile; the pth and (100 − p)th percentiles are the same distance from the 50th percentile.
table: A descriptive statistical technique for organizing measurements into categories and presenting the corresponding counts within each category in columns and/or rows.
t distribution: See Student’s t distribution.
test statistic: A quantity calculated from observed data used to measure the ‘distance’ of the observations from the null hypothesis.
total probability rule: The decomposition of the sample space into mutually exclusive and exhaustive events such that for another specified event to occur, it must happen together with one and only one of the mutually exclusive events.
transformation: A mathematical operation performed on a variable to make it better satisfy the assumptions of the statistical technique being applied.
true negative: A negative screening or diagnostic test result when the tested individual does not have the condition.
true positive: A positive screening or diagnostic test result when the tested individual has the condition.
two-sample t-test: A hypothesis testing procedure used to compare the means of two independent populations.
two-sided confidence interval: A confidence interval with both a finite lower bound and a finite upper bound.
two-sided test of hypothesis: Testing when the alternative hypothesis is not bounded from above or from below.
two-way scatter plot: A graph used to depict the relationship between two different, usually continuous measurements, where one is displayed on the horizontal axis and the other on the vertical axis; each point on the graph represents a pair of values.
type I error: The error made when a true null hypothesis is rejected.
type II error: The error made when a false null hypothesis fails to be rejected.
under-1 mortality rate: See infant mortality rate.
under-5 mortality rate: The proportion of children born alive in a particular year who do not survive to their fifth birthday.
unimodal distribution: A set of measurements with only one mode.
union: For two or more individual events, the event that at least one of them happens.
variable: Any characteristic that can be measured or categorized.
variance: A numerical summary measure of variation which quantifies how different a set of measurements are from each other by computing half of the average squared distance between the measurements.
variance, pooled estimate: A variance estimator which uses data from independent samples drawn from populations with a common variance.
Venn diagram: A graph or figure used to depict the relationships among events.
vital statistics: Data that describe the life of a population, including events such as births, deaths, marriages, and occurrences of disease.
Wilcoxon rank sum test: A nonparametric hypothesis testing procedure for two independent groups; this test is a nonparametric alternative to the two-sample t-test.
Wilcoxon signed-rank test: A nonparametric hypothesis testing procedure for paired data that is based on both the signs (plus or minus) of the differences within each pair and the magnitudes of the differences; this test is a nonparametric alternative to the paired t-test.
Yates’ correction: A continuity correction sometimes used in the calculation of the chi-square test statistic for 2 × 2 tables.
y-intercept: In a simple linear regression model, the mean value of the response when the explanatory variable is equal to 0.
z-score: The outcome of a standard normal random variable; the distance of a particular outcome of a normal random variable from the mean, measured in units of the standard deviation.
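Finally, the variance, standard deviation, and z-score entries can be illustrated with a short sketch using hypothetical data:

    import math

    data = [4, 8, 6, 5, 3, 7, 9, 6]                     # hypothetical values
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / (n - 1)  # sample variance
    sd = math.sqrt(var)                                 # standard deviation
    print([(x - mean) / sd for x in data])              # z-scores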
Statistical Tables
TABLE A.1 Binomial probabilities
Each entry is P(X = k) for a binomial random variable with parameters n and p. Within each block, the row labeled with a value of p lists these probabilities for the indicated values of k, in order.

n = 2, k = 0-2
  p = 0.05: 0.9025 0.0950 0.0025
  p = 0.10: 0.8100 0.1800 0.0100
  p = 0.15: 0.7225 0.2550 0.0225
  p = 0.20: 0.6400 0.3200 0.0400
  p = 0.25: 0.5625 0.3750 0.0625
  p = 0.30: 0.4900 0.4200 0.0900
  p = 0.35: 0.4225 0.4550 0.1225
  p = 0.40: 0.3600 0.4800 0.1600
  p = 0.45: 0.3025 0.4950 0.2025
  p = 0.50: 0.2500 0.5000 0.2500

n = 3, k = 0-3
  p = 0.05: 0.8574 0.1354 0.0071 0.0001
  p = 0.10: 0.7290 0.2430 0.0270 0.0010
  p = 0.15: 0.6141 0.3251 0.0574 0.0034
  p = 0.20: 0.5120 0.3840 0.0960 0.0080
  p = 0.25: 0.4219 0.4219 0.1406 0.0156
  p = 0.30: 0.3430 0.4410 0.1890 0.0270
  p = 0.35: 0.2746 0.4436 0.2389 0.0429
  p = 0.40: 0.2160 0.4320 0.2880 0.0640
  p = 0.45: 0.1664 0.4084 0.3341 0.0911
  p = 0.50: 0.1250 0.3750 0.3750 0.1250

n = 4, k = 0-4
  p = 0.05: 0.8145 0.1715 0.0135 0.0005 0.0000
  p = 0.10: 0.6561 0.2916 0.0486 0.0036 0.0001
  p = 0.15: 0.5220 0.3685 0.0975 0.0115 0.0005
  p = 0.20: 0.4096 0.4096 0.1536 0.0256 0.0016
  p = 0.25: 0.3164 0.4219 0.2109 0.0469 0.0039
  p = 0.30: 0.2401 0.4116 0.2646 0.0756 0.0081
  p = 0.35: 0.1785 0.3845 0.3105 0.1115 0.0150
  p = 0.40: 0.1296 0.3456 0.3456 0.1536 0.0256
  p = 0.45: 0.0915 0.2995 0.3675 0.2005 0.0410
  p = 0.50: 0.0625 0.2500 0.3750 0.2500 0.0625

n = 5, k = 0-5
  p = 0.05: 0.7738 0.2036 0.0214 0.0011 0.0000 0.0000
  p = 0.10: 0.5905 0.3280 0.0729 0.0081 0.0004 0.0000
  p = 0.15: 0.4437 0.3915 0.1382 0.0244 0.0022 0.0001
  p = 0.20: 0.3277 0.4096 0.2048 0.0512 0.0064 0.0003
  p = 0.25: 0.2373 0.3955 0.2637 0.0879 0.0146 0.0010
  p = 0.30: 0.1681 0.3602 0.3087 0.1323 0.0283 0.0024
  p = 0.35: 0.1160 0.3124 0.3364 0.1811 0.0488 0.0053
  p = 0.40: 0.0778 0.2592 0.3456 0.2304 0.0768 0.0102
  p = 0.45: 0.0503 0.2059 0.3369 0.2757 0.1128 0.0185
  p = 0.50: 0.0312 0.1562 0.3125 0.3125 0.1562 0.0313

n = 6, k = 0-6
  p = 0.05: 0.7351 0.2321 0.0305 0.0021 0.0001 0.0000 0.0000
  p = 0.10: 0.5314 0.3543 0.0984 0.0146 0.0012 0.0001 0.0000
  p = 0.15: 0.3771 0.3993 0.1762 0.0415 0.0055 0.0004 0.0000
  p = 0.20: 0.2621 0.3932 0.2458 0.0819 0.0154 0.0015 0.0001
  p = 0.25: 0.1780 0.3560 0.2966 0.1318 0.0330 0.0044 0.0002
  p = 0.30: 0.1176 0.3025 0.3241 0.1852 0.0595 0.0102 0.0007
  p = 0.35: 0.0754 0.2437 0.3280 0.2355 0.0951 0.0205 0.0018
  p = 0.40: 0.0467 0.1866 0.3110 0.2765 0.1382 0.0369 0.0041
  p = 0.45: 0.0277 0.1359 0.2780 0.3032 0.1861 0.0609 0.0083
  p = 0.50: 0.0156 0.0938 0.2344 0.3125 0.2344 0.0938 0.0156

n = 7, k = 0-7
  p = 0.05: 0.6983 0.2573 0.0406 0.0036 0.0002 0.0000 0.0000 0.0000
  p = 0.10: 0.4783 0.3720 0.1240 0.0230 0.0026 0.0002 0.0000 0.0000
  p = 0.15: 0.3206 0.3960 0.2097 0.0617 0.0109 0.0012 0.0001 0.0000
  p = 0.20: 0.2097 0.3670 0.2753 0.1147 0.0287 0.0043 0.0004 0.0000
  p = 0.25: 0.1335 0.3115 0.3115 0.1730 0.0577 0.0115 0.0013 0.0001
  p = 0.30: 0.0824 0.2471 0.3177 0.2269 0.0972 0.0250 0.0036 0.0002
  p = 0.35: 0.0490 0.1848 0.2985 0.2679 0.1442 0.0466 0.0084 0.0006
  p = 0.40: 0.0280 0.1306 0.2613 0.2903 0.1935 0.0774 0.0172 0.0016
  p = 0.45: 0.0152 0.0872 0.2140 0.2918 0.2388 0.1172 0.0320 0.0037
  p = 0.50: 0.0078 0.0547 0.1641 0.2734 0.2734 0.1641 0.0547 0.0078

n = 8, k = 0-8
  p = 0.05: 0.6634 0.2793 0.0515 0.0054 0.0004 0.0000 0.0000 0.0000 0.0000
  p = 0.10: 0.4305 0.3826 0.1488 0.0331 0.0046 0.0004 0.0000 0.0000 0.0000
  p = 0.15: 0.2725 0.3847 0.2376 0.0839 0.0185 0.0026 0.0002 0.0000 0.0000
  p = 0.20: 0.1678 0.3355 0.2936 0.1468 0.0459 0.0092 0.0011 0.0001 0.0000
  p = 0.25: 0.1001 0.2670 0.3115 0.2076 0.0865 0.0231 0.0038 0.0004 0.0000
  p = 0.30: 0.0576 0.1977 0.2965 0.2541 0.1361 0.0467 0.0100 0.0012 0.0001
  p = 0.35: 0.0319 0.1373 0.2587 0.2786 0.1875 0.0808 0.0217 0.0033 0.0002
  p = 0.40: 0.0168 0.0896 0.2090 0.2787 0.2322 0.1239 0.0413 0.0079 0.0007
  p = 0.45: 0.0084 0.0548 0.1569 0.2568 0.2627 0.1719 0.0703 0.0164 0.0017
  p = 0.50: 0.0039 0.0312 0.1094 0.2188 0.2734 0.2188 0.1094 0.0312 0.0039

n = 9, k = 0-9
  p = 0.05: 0.6302 0.2985 0.0629 0.0077 0.0006 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.10: 0.3874 0.3874 0.1722 0.0446 0.0074 0.0008 0.0001 0.0000 0.0000 0.0000
  p = 0.15: 0.2316 0.3679 0.2597 0.1069 0.0283 0.0050 0.0006 0.0000 0.0000 0.0000
  p = 0.20: 0.1342 0.3020 0.3020 0.1762 0.0661 0.0165 0.0028 0.0003 0.0000 0.0000
  p = 0.25: 0.0751 0.2253 0.3003 0.2336 0.1168 0.0389 0.0087 0.0012 0.0001 0.0000
  p = 0.30: 0.0404 0.1556 0.2668 0.2668 0.1715 0.0735 0.0210 0.0039 0.0004 0.0000
  p = 0.35: 0.0207 0.1004 0.2162 0.2716 0.2194 0.1181 0.0424 0.0098 0.0013 0.0001
  p = 0.40: 0.0101 0.0605 0.1612 0.2508 0.2508 0.1672 0.0743 0.0212 0.0035 0.0003
  p = 0.45: 0.0046 0.0339 0.1110 0.2119 0.2600 0.2128 0.1160 0.0407 0.0083 0.0008
  p = 0.50: 0.0020 0.0176 0.0703 0.1641 0.2461 0.2461 0.1641 0.0703 0.0176 0.0020

n = 10, k = 0-10
  p = 0.05: 0.5987 0.3151 0.0746 0.0105 0.0010 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.10: 0.3487 0.3874 0.1937 0.0574 0.0112 0.0015 0.0001 0.0000 0.0000 0.0000 0.0000
  p = 0.15: 0.1969 0.3474 0.2759 0.1298 0.0401 0.0085 0.0012 0.0001 0.0000 0.0000 0.0000
  p = 0.20: 0.1074 0.2684 0.3020 0.2013 0.0881 0.0264 0.0055 0.0008 0.0001 0.0000 0.0000
  p = 0.25: 0.0563 0.1877 0.2816 0.2503 0.1460 0.0584 0.0162 0.0031 0.0004 0.0000 0.0000
  p = 0.30: 0.0282 0.1211 0.2335 0.2668 0.2001 0.1029 0.0368 0.0090 0.0014 0.0001 0.0000
  p = 0.35: 0.0135 0.0725 0.1757 0.2522 0.2377 0.1536 0.0689 0.0212 0.0043 0.0005 0.0000
  p = 0.40: 0.0060 0.0403 0.1209 0.2150 0.2508 0.2007 0.1115 0.0425 0.0106 0.0016 0.0001
  p = 0.45: 0.0025 0.0207 0.0763 0.1665 0.2384 0.2340 0.1596 0.0746 0.0229 0.0042 0.0003
  p = 0.50: 0.0010 0.0098 0.0439 0.1172 0.2051 0.2461 0.2051 0.1172 0.0439 0.0098 0.0010

n = 11, k = 0-4
  p = 0.05: 0.5688 0.3293 0.0867 0.0137 0.0014
  p = 0.10: 0.3138 0.3835 0.2131 0.0710 0.0158
  p = 0.15: 0.1673 0.3248 0.2866 0.1517 0.0536
  p = 0.20: 0.0859 0.2362 0.2953 0.2215 0.1107
  p = 0.25: 0.0422 0.1549 0.2581 0.2581 0.1721
  p = 0.30: 0.0198 0.0932 0.1998 0.2568 0.2201
  p = 0.35: 0.0088 0.0518 0.1395 0.2254 0.2428
  p = 0.40: 0.0036 0.0266 0.0887 0.1774 0.2365
  p = 0.45: 0.0014 0.0125 0.0513 0.1259 0.2060
  p = 0.50: 0.0005 0.0054 0.0269 0.0806 0.1611

n = 11, k = 5-11
  p = 0.05: 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.10: 0.0025 0.0003 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.15: 0.0132 0.0023 0.0003 0.0000 0.0000 0.0000 0.0000
  p = 0.20: 0.0388 0.0097 0.0017 0.0002 0.0000 0.0000 0.0000
  p = 0.25: 0.0803 0.0268 0.0064 0.0011 0.0001 0.0000 0.0000
  p = 0.30: 0.1321 0.0566 0.0173 0.0037 0.0005 0.0000 0.0000
  p = 0.35: 0.1830 0.0985 0.0379 0.0102 0.0018 0.0002 0.0000
  p = 0.40: 0.2207 0.1471 0.0701 0.0234 0.0052 0.0007 0.0000
  p = 0.45: 0.2360 0.1931 0.1128 0.0462 0.0126 0.0021 0.0002
  p = 0.50: 0.2256 0.2256 0.1611 0.0806 0.0269 0.0054 0.0005

n = 12, k = 0-12
  p = 0.05: 0.5404 0.3413 0.0988 0.0173 0.0021 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.10: 0.2824 0.3766 0.2301 0.0852 0.0213 0.0038 0.0005 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.15: 0.1422 0.3012 0.2924 0.1720 0.0683 0.0193 0.0040 0.0006 0.0001 0.0000 0.0000 0.0000 0.0000
  p = 0.20: 0.0687 0.2062 0.2835 0.2362 0.1329 0.0532 0.0155 0.0033 0.0005 0.0001 0.0000 0.0000 0.0000
  p = 0.25: 0.0317 0.1267 0.2323 0.2581 0.1936 0.1032 0.0401 0.0115 0.0024 0.0004 0.0000 0.0000 0.0000
  p = 0.30: 0.0138 0.0712 0.1678 0.2397 0.2311 0.1585 0.0792 0.0291 0.0078 0.0015 0.0002 0.0000 0.0000
  p = 0.35: 0.0057 0.0368 0.1088 0.1954 0.2367 0.2039 0.1281 0.0591 0.0199 0.0048 0.0008 0.0001 0.0000
  p = 0.40: 0.0022 0.0174 0.0639 0.1419 0.2128 0.2270 0.1766 0.1009 0.0420 0.0125 0.0025 0.0003 0.0000
  p = 0.45: 0.0008 0.0075 0.0339 0.0923 0.1700 0.2225 0.2124 0.1489 0.0762 0.0277 0.0068 0.0010 0.0001
  p = 0.50: 0.0002 0.0029 0.0161 0.0537 0.1208 0.1934 0.2256 0.1934 0.1208 0.0537 0.0161 0.0029 0.0002

n = 13, k = 0-13
  p = 0.05: 0.5133 0.3512 0.1109 0.0214 0.0028 0.0003 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.10: 0.2542 0.3672 0.2448 0.0997 0.0277 0.0055 0.0008 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.15: 0.1209 0.2774 0.2937 0.1900 0.0838 0.0266 0.0063 0.0011 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.20: 0.0550 0.1787 0.2680 0.2457 0.1535 0.0691 0.0230 0.0058 0.0011 0.0001 0.0000 0.0000 0.0000 0.0000
  p = 0.25: 0.0238 0.1029 0.2059 0.2517 0.2097 0.1258 0.0559 0.0186 0.0047 0.0009 0.0001 0.0000 0.0000 0.0000
  p = 0.30: 0.0097 0.0540 0.1388 0.2181 0.2337 0.1803 0.1030 0.0442 0.0142 0.0034 0.0006 0.0001 0.0000 0.0000
  p = 0.35: 0.0037 0.0259 0.0836 0.1651 0.2222 0.2154 0.1546 0.0833 0.0336 0.0101 0.0022 0.0003 0.0000 0.0000
  p = 0.40: 0.0013 0.0113 0.0453 0.1107 0.1845 0.2214 0.1968 0.1312 0.0656 0.0243 0.0065 0.0012 0.0001 0.0000
  p = 0.45: 0.0004 0.0045 0.0220 0.0660 0.1350 0.1989 0.2169 0.1775 0.1089 0.0495 0.0162 0.0036 0.0005 0.0000
  p = 0.50: 0.0001 0.0016 0.0095 0.0349 0.0873 0.1571 0.2095 0.2095 0.1571 0.0873 0.0349 0.0095 0.0016 0.0001

n = 14, k = 0-14
  p = 0.05: 0.4877 0.3593 0.1229 0.0259 0.0037 0.0004 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.10: 0.2288 0.3559 0.2570 0.1142 0.0349 0.0078 0.0013 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.15: 0.1028 0.2539 0.2912 0.2056 0.0998 0.0352 0.0093 0.0019 0.0003 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.20: 0.0440 0.1539 0.2501 0.2501 0.1720 0.0860 0.0322 0.0092 0.0020 0.0003 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.25: 0.0178 0.0832 0.1802 0.2402 0.2202 0.1468 0.0734 0.0280 0.0082 0.0018 0.0003 0.0000 0.0000 0.0000 0.0000
  p = 0.30: 0.0068 0.0407 0.1134 0.1943 0.2290 0.1963 0.1262 0.0618 0.0232 0.0066 0.0014 0.0002 0.0000 0.0000 0.0000
  p = 0.35: 0.0024 0.0181 0.0634 0.1366 0.2022 0.2178 0.1759 0.1082 0.0510 0.0183 0.0049 0.0010 0.0001 0.0000 0.0000
  p = 0.40: 0.0008 0.0073 0.0317 0.0845 0.1549 0.2066 0.2066 0.1574 0.0918 0.0408 0.0136 0.0033 0.0005 0.0001 0.0000
  p = 0.45: 0.0002 0.0027 0.0141 0.0462 0.1040 0.1701 0.2088 0.1952 0.1398 0.0762 0.0312 0.0093 0.0019 0.0002 0.0000
  p = 0.50: 0.0001 0.0009 0.0056 0.0222 0.0611 0.1222 0.1833 0.2095 0.1833 0.1222 0.0611 0.0222 0.0056 0.0009 0.0001

n = 15, k = 0-14 (the k = 15 row does not appear in this copy)
  p = 0.05: 0.4633 0.3658 0.1348 0.0307 0.0049 0.0006 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.10: 0.2059 0.3432 0.2669 0.1285 0.0428 0.0105 0.0019 0.0003 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.15: 0.0874 0.2312 0.2856 0.2184 0.1156 0.0449 0.0132 0.0030 0.0005 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.20: 0.0352 0.1319 0.2309 0.2501 0.1876 0.1032 0.0430 0.0138 0.0035 0.0007 0.0001 0.0000 0.0000 0.0000 0.0000
  p = 0.25: 0.0134 0.0668 0.1559 0.2252 0.2252 0.1651 0.0917 0.0393 0.0131 0.0034 0.0007 0.0001 0.0000 0.0000 0.0000
  p = 0.30: 0.0047 0.0305 0.0916 0.1700 0.2186 0.2061 0.1472 0.0811 0.0348 0.0116 0.0030 0.0006 0.0001 0.0000 0.0000
  p = 0.35: 0.0016 0.0126 0.0476 0.1110 0.1792 0.2123 0.1906 0.1319 0.0710 0.0298 0.0096 0.0024 0.0004 0.0001 0.0000
  p = 0.40: 0.0005 0.0047 0.0219 0.0634 0.1268 0.1859 0.2066 0.1771 0.1181 0.0612 0.0245 0.0074 0.0016 0.0003 0.0000
  p = 0.45: 0.0001 0.0016 0.0090 0.0318 0.0780 0.1404 0.1914 0.2013 0.1647 0.1048 0.0515 0.0191 0.0052 0.0010 0.0001
  p = 0.50: 0.0000 0.0005 0.0032 0.0139 0.0417 0.0916 0.1527 0.1964 0.1964 0.1527 0.0916 0.0417 0.0139 0.0032 0.0005

n = 16, k = 0-5
  p = 0.05: 0.4401 0.3706 0.1463 0.0359 0.0061 0.0008
  p = 0.10: 0.1853 0.3294 0.2745 0.1423 0.0514 0.0137
  p = 0.15: 0.0743 0.2097 0.2775 0.2285 0.1311 0.0555
  p = 0.20: 0.0281 0.1126 0.2111 0.2463 0.2001 0.1201
  p = 0.25: 0.0100 0.0535 0.1336 0.2079 0.2252 0.1802
  p = 0.30: 0.0033 0.0228 0.0732 0.1465 0.2040 0.2099
  p = 0.35: 0.0010 0.0087 0.0353 0.0888 0.1553 0.2008
  p = 0.40: 0.0003 0.0030 0.0150 0.0468 0.1014 0.1623
  p = 0.45: 0.0001 0.0009 0.0056 0.0215 0.0572 0.1123
  p = 0.50: 0.0000 0.0002 0.0018 0.0085 0.0278 0.0667

n = 16, k = 6-16
  p = 0.05: 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.10: 0.0028 0.0004 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.15: 0.0180 0.0045 0.0009 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.20: 0.0550 0.0197 0.0055 0.0012 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.25: 0.1101 0.0524 0.0197 0.0058 0.0014 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.30: 0.1649 0.1010 0.0487 0.0185 0.0056 0.0013 0.0002 0.0000 0.0000 0.0000 0.0000
  p = 0.35: 0.1982 0.1524 0.0923 0.0442 0.0167 0.0049 0.0011 0.0002 0.0000 0.0000 0.0000
  p = 0.40: 0.1983 0.1889 0.1417 0.0840 0.0392 0.0142 0.0040 0.0008 0.0001 0.0000 0.0000
  p = 0.45: 0.1684 0.1969 0.1812 0.1318 0.0755 0.0337 0.0115 0.0029 0.0005 0.0001 0.0000
  p = 0.50: 0.1222 0.1746 0.1964 0.1746 0.1222 0.0667 0.0278 0.0085 0.0018 0.0002 0.0000

n = 17, k = 0-17
  p = 0.05: 0.4181 0.3741 0.1575 0.0415 0.0076 0.0010 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.10: 0.1668 0.3150 0.2800 0.1556 0.0605 0.0175 0.0039 0.0007 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.15: 0.0631 0.1893 0.2673 0.2359 0.1457 0.0668 0.0236 0.0065 0.0014 0.0003 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.20: 0.0225 0.0957 0.1914 0.2393 0.2093 0.1361 0.0680 0.0267 0.0084 0.0021 0.0004 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.25: 0.0075 0.0426 0.1136 0.1893 0.2209 0.1914 0.1276 0.0668 0.0279 0.0093 0.0025 0.0005 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.30: 0.0023 0.0169 0.0581 0.1245 0.1868 0.2081 0.1784 0.1201 0.0644 0.0276 0.0095 0.0026 0.0006 0.0001 0.0000 0.0000 0.0000 0.0000
  p = 0.35: 0.0007 0.0060 0.0260 0.0701 0.1320 0.1849 0.1991 0.1685 0.1134 0.0611 0.0263 0.0090 0.0024 0.0005 0.0001 0.0000 0.0000 0.0000
  p = 0.40: 0.0002 0.0019 0.0102 0.0341 0.0796 0.1379 0.1839 0.1927 0.1606 0.1070 0.0571 0.0242 0.0081 0.0021 0.0004 0.0001 0.0000 0.0000
  p = 0.45: 0.0000 0.0005 0.0035 0.0144 0.0411 0.0875 0.1432 0.1841 0.1883 0.1540 0.1008 0.0525 0.0215 0.0068 0.0016 0.0003 0.0000 0.0000
  p = 0.50: 0.0000 0.0001 0.0010 0.0052 0.0182 0.0472 0.0944 0.1484 0.1855 0.1855 0.1484 0.0944 0.0472 0.0182 0.0052 0.0010 0.0001 0.0000

n = 18, k = 0-6
  p = 0.05: 0.3972 0.3763 0.1683 0.0473 0.0093 0.0014 0.0002
  p = 0.10: 0.1501 0.3002 0.2835 0.1680 0.0700 0.0218 0.0052
  p = 0.15: 0.0536 0.1704 0.2556 0.2406 0.1592 0.0787 0.0301
  p = 0.20: 0.0180 0.0811 0.1723 0.2297 0.2153 0.1507 0.0816
  p = 0.25: 0.0056 0.0338 0.0958 0.1704 0.2130 0.1988 0.1436
  p = 0.30: 0.0016 0.0126 0.0458 0.1046 0.1681 0.2017 0.1873
  p = 0.35: 0.0004 0.0042 0.0190 0.0547 0.1104 0.1664 0.1941
  p = 0.40: 0.0001 0.0012 0.0069 0.0246 0.0614 0.1146 0.1655
  p = 0.45: 0.0000 0.0003 0.0022 0.0095 0.0291 0.0666 0.1181
  p = 0.50: 0.0000 0.0001 0.0006 0.0031 0.0117 0.0327 0.0708

n = 18, k = 7-18
  p = 0.05: 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.10: 0.0010 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.15: 0.0091 0.0022 0.0004 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.20: 0.0350 0.0120 0.0033 0.0008 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.25: 0.0820 0.0376 0.0139 0.0042 0.0010 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.30: 0.1376 0.0811 0.0386 0.0149 0.0046 0.0012 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.35: 0.1792 0.1327 0.0794 0.0385 0.0151 0.0047 0.0012 0.0002 0.0000 0.0000 0.0000 0.0000
  p = 0.40: 0.1892 0.1734 0.1284 0.0771 0.0374 0.0145 0.0045 0.0011 0.0002 0.0000 0.0000 0.0000
  p = 0.45: 0.1657 0.1864 0.1694 0.1248 0.0742 0.0354 0.0134 0.0039 0.0009 0.0001 0.0000 0.0000
  p = 0.50: 0.1214 0.1669 0.1855 0.1669 0.1214 0.0708 0.0327 0.0117 0.0031 0.0006 0.0001 0.0000

n = 19, k = 0-19
  p = 0.05: 0.3774 0.3774 0.1787 0.0533 0.0112 0.0018 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.10: 0.1351 0.2852 0.2852 0.1796 0.0798 0.0266 0.0069 0.0014 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.15: 0.0456 0.1529 0.2428 0.2428 0.1714 0.0907 0.0374 0.0122 0.0032 0.0007 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.20: 0.0144 0.0685 0.1540 0.2182 0.2182 0.1636 0.0955 0.0443 0.0166 0.0051 0.0013 0.0003 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.25: 0.0042 0.0268 0.0803 0.1517 0.2023 0.2023 0.1574 0.0974 0.0487 0.0198 0.0066 0.0018 0.0004 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.30: 0.0011 0.0093 0.0358 0.0869 0.1491 0.1916 0.1916 0.1525 0.0981 0.0514 0.0220 0.0077 0.0022 0.0005 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.35: 0.0003 0.0029 0.0138 0.0422 0.0909 0.1468 0.1844 0.1844 0.1489 0.0980 0.0528 0.0233 0.0083 0.0024 0.0006 0.0001 0.0000 0.0000 0.0000 0.0000
  p = 0.40: 0.0001 0.0008 0.0046 0.0175 0.0467 0.0933 0.1451 0.1797 0.1797 0.1464 0.0976 0.0532 0.0237 0.0085 0.0024 0.0005 0.0001 0.0000 0.0000 0.0000
  p = 0.45: 0.0000 0.0002 0.0013 0.0062 0.0203 0.0497 0.0949 0.1443 0.1771 0.1771 0.1449 0.0970 0.0529 0.0233 0.0082 0.0022 0.0005 0.0001 0.0000 0.0000
  p = 0.50: 0.0000 0.0000 0.0003 0.0018 0.0074 0.0222 0.0518 0.0961 0.1442 0.1762 0.1762 0.1442 0.0961 0.0518 0.0222 0.0074 0.0018 0.0003 0.0000 0.0000

n = 20, k = 0-2
  p = 0.05: 0.3585 0.3774 0.1887
  p = 0.10: 0.1216 0.2702 0.2852
  p = 0.15: 0.0388 0.1368 0.2293
  p = 0.20: 0.0115 0.0576 0.1369
  p = 0.25: 0.0032 0.0211 0.0669
  p = 0.30: 0.0008 0.0068 0.0278
  p = 0.35: 0.0002 0.0020 0.0100
  p = 0.40: 0.0000 0.0005 0.0031
  p = 0.45: 0.0000 0.0001 0.0008
  p = 0.50: 0.0000 0.0000 0.0002

n = 20, k = 3-20
  p = 0.05: 0.0596 0.0133 0.0022 0.0003 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.10: 0.1901 0.0898 0.0319 0.0089 0.0020 0.0004 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.15: 0.2428 0.1821 0.1028 0.0454 0.0160 0.0046 0.0011 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.20: 0.2054 0.2182 0.1746 0.1091 0.0545 0.0222 0.0074 0.0020 0.0005 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.25: 0.1339 0.1897 0.2023 0.1686 0.1124 0.0609 0.0271 0.0099 0.0030 0.0008 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.30: 0.0716 0.1304 0.1789 0.1916 0.1643 0.1144 0.0654 0.0308 0.0120 0.0039 0.0010 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.35: 0.0323 0.0738 0.1272 0.1712 0.1844 0.1614 0.1158 0.0686 0.0336 0.0136 0.0045 0.0012 0.0003 0.0000 0.0000 0.0000 0.0000 0.0000
  p = 0.40: 0.0123 0.0350 0.0746 0.1244 0.1659 0.1797 0.1597 0.1171 0.0710 0.0355 0.0146 0.0049 0.0013 0.0003 0.0000 0.0000 0.0000 0.0000
  p = 0.45: 0.0040 0.0139 0.0365 0.0746 0.1221 0.1623 0.1771 0.1593 0.1185 0.0727 0.0366 0.0150 0.0049 0.0013 0.0002 0.0000 0.0000 0.0000
  p = 0.50: 0.0011 0.0046 0.0148 0.0370 0.0739 0.1201 0.1602 0.1762 0.1602 0.1201 0.0739 0.0370 0.0148 0.0046 0.0011 0.0002 0.0000 0.0000
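The entries of Table A.1 can be reproduced from the binomial probability formula P(X = k) = C(n, k) p^k (1 - p)^(n - k); a minimal Python sketch, offered as an illustration rather than code from the text:

    import math

    def binomial_pmf(k, n, p):
        # P(X = k) for a binomial random variable with parameters n and p
        return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

    print(round(binomial_pmf(1, 2, 0.05), 4))   # 0.0950, as in the n = 2 block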
TABLE A.2 Poisson probabilities
Each entry is P(X = k) for a Poisson random variable with mean µ. Within each block, the row labeled with a value of µ lists these probabilities for the indicated values of k, in order.

µ = 0.5 to 5.0, k = 0-16
  µ = 0.5: 0.6065 0.3033 0.0758 0.0126 0.0016 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 1.0: 0.3679 0.3679 0.1839 0.0613 0.0153 0.0031 0.0005 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 1.5: 0.2231 0.3347 0.2510 0.1255 0.0471 0.0141 0.0035 0.0008 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 2.0: 0.1353 0.2707 0.2707 0.1804 0.0902 0.0361 0.0120 0.0034 0.0009 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 2.5: 0.0821 0.2052 0.2565 0.2138 0.1336 0.0668 0.0278 0.0099 0.0031 0.0009 0.0002 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 3.0: 0.0498 0.1494 0.2240 0.2240 0.1680 0.1008 0.0504 0.0216 0.0081 0.0027 0.0008 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000
  µ = 3.5: 0.0302 0.1057 0.1850 0.2158 0.1888 0.1322 0.0771 0.0385 0.0169 0.0066 0.0023 0.0007 0.0002 0.0001 0.0000 0.0000 0.0000
  µ = 4.0: 0.0183 0.0733 0.1465 0.1954 0.1954 0.1563 0.1042 0.0595 0.0298 0.0132 0.0053 0.0019 0.0006 0.0002 0.0001 0.0000 0.0000
  µ = 4.5: 0.0111 0.0500 0.1125 0.1687 0.1898 0.1708 0.1281 0.0824 0.0463 0.0232 0.0104 0.0043 0.0016 0.0006 0.0002 0.0001 0.0000
  µ = 5.0: 0.0067 0.0337 0.0842 0.1404 0.1755 0.1755 0.1462 0.1044 0.0653 0.0363 0.0181 0.0082 0.0034 0.0013 0.0005 0.0002 0.0000

µ = 5.5 to 10.0, k = 0-12
  µ = 5.5: 0.0041 0.0225 0.0618 0.1133 0.1558 0.1714 0.1571 0.1234 0.0849 0.0519 0.0285 0.0143 0.0065
  µ = 6.0: 0.0025 0.0149 0.0446 0.0892 0.1339 0.1606 0.1606 0.1377 0.1033 0.0688 0.0413 0.0225 0.0113
  µ = 6.5: 0.0015 0.0098 0.0318 0.0688 0.1118 0.1454 0.1575 0.1462 0.1188 0.0858 0.0558 0.0330 0.0179
  µ = 7.0: 0.0009 0.0064 0.0223 0.0521 0.0912 0.1277 0.1490 0.1490 0.1304 0.1014 0.0710 0.0452 0.0263
  µ = 7.5: 0.0006 0.0041 0.0156 0.0389 0.0729 0.1094 0.1367 0.1465 0.1373 0.1144 0.0858 0.0585 0.0366
  µ = 8.0: 0.0003 0.0027 0.0107 0.0286 0.0573 0.0916 0.1221 0.1396 0.1396 0.1241 0.0993 0.0722 0.0481
  µ = 8.5: 0.0002 0.0017 0.0074 0.0208 0.0443 0.0752 0.1066 0.1294 0.1375 0.1299 0.1104 0.0853 0.0604
  µ = 9.0: 0.0001 0.0011 0.0050 0.0150 0.0337 0.0607 0.0911 0.1171 0.1318 0.1318 0.1186 0.0970 0.0728
  µ = 9.5: 0.0001 0.0007 0.0034 0.0107 0.0254 0.0483 0.0764 0.1037 0.1232 0.1300 0.1235 0.1067 0.0844
  µ = 10.0: 0.0000 0.0005 0.0023 0.0076 0.0189 0.0378 0.0631 0.0901 0.1126 0.1251 0.1251 0.1137 0.0948

µ = 5.5 to 10.0, k = 13-25
  µ = 5.5: 0.0028 0.0011 0.0004 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 6.0: 0.0052 0.0022 0.0009 0.0003 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 6.5: 0.0089 0.0041 0.0018 0.0007 0.0003 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 7.0: 0.0142 0.0071 0.0033 0.0014 0.0006 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 7.5: 0.0211 0.0113 0.0057 0.0026 0.0012 0.0005 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 8.0: 0.0296 0.0169 0.0090 0.0045 0.0021 0.0009 0.0004 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000
  µ = 8.5: 0.0395 0.0240 0.0136 0.0072 0.0036 0.0017 0.0008 0.0003 0.0001 0.0001 0.0000 0.0000 0.0000
  µ = 9.0: 0.0504 0.0324 0.0194 0.0109 0.0058 0.0029 0.0014 0.0006 0.0003 0.0001 0.0000 0.0000 0.0000
  µ = 9.5: 0.0617 0.0419 0.0265 0.0157 0.0088 0.0046 0.0023 0.0011 0.0005 0.0002 0.0001 0.0000 0.0000
  µ = 10.0: 0.0729 0.0521 0.0347 0.0217 0.0128 0.0071 0.0037 0.0019 0.0009 0.0004 0.0002 0.0001 0.0000

µ = 10.5 to 15.0, k = 0-16
  µ = 10.5: 0.0000 0.0003 0.0015 0.0053 0.0139 0.0293 0.0513 0.0769 0.1009 0.1177 0.1236 0.1180 0.1032 0.0834 0.0625 0.0438 0.0287
  µ = 11.0: 0.0000 0.0002 0.0010 0.0037 0.0102 0.0224 0.0411 0.0646 0.0888 0.1085 0.1194 0.1194 0.1094 0.0926 0.0728 0.0534 0.0367
  µ = 11.5: 0.0000 0.0001 0.0007 0.0026 0.0074 0.0170 0.0325 0.0535 0.0769 0.0982 0.1129 0.1181 0.1131 0.1001 0.0822 0.0630 0.0453
  µ = 12.0: 0.0000 0.0001 0.0004 0.0018 0.0053 0.0127 0.0255 0.0437 0.0655 0.0874 0.1048 0.1144 0.1144 0.1056 0.0905 0.0724 0.0543
  µ = 12.5: 0.0000 0.0000 0.0003 0.0012 0.0038 0.0095 0.0197 0.0353 0.0551 0.0765 0.0956 0.1087 0.1132 0.1089 0.0972 0.0810 0.0633
  µ = 13.0: 0.0000 0.0000 0.0002 0.0008 0.0027 0.0070 0.0152 0.0281 0.0457 0.0661 0.0859 0.1015 0.1099 0.1099 0.1021 0.0885 0.0719
  µ = 13.5: 0.0000 0.0000 0.0001 0.0006 0.0019 0.0051 0.0115 0.0222 0.0375 0.0563 0.0760 0.0932 0.1049 0.1089 0.1050 0.0945 0.0798
  µ = 14.0: 0.0000 0.0000 0.0001 0.0004 0.0013 0.0037 0.0087 0.0174 0.0304 0.0473 0.0663 0.0844 0.0984 0.1060 0.1060 0.0989 0.0866
  µ = 14.5: 0.0000 0.0000 0.0001 0.0003 0.0009 0.0027 0.0065 0.0135 0.0244 0.0394 0.0571 0.0753 0.0910 0.1014 0.1051 0.1016 0.0920
  µ = 15.0: 0.0000 0.0000 0.0000 0.0002 0.0006 0.0019 0.0048 0.0104 0.0194 0.0324 0.0486 0.0663 0.0829 0.0956 0.1024 0.1024 0.0960

µ = 10.5 to 15.0, k = 17-33
  µ = 10.5: 0.0177 0.0104 0.0057 0.0030 0.0015 0.0007 0.0003 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 11.0: 0.0237 0.0145 0.0084 0.0046 0.0024 0.0012 0.0006 0.0003 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 11.5: 0.0306 0.0196 0.0119 0.0068 0.0037 0.0020 0.0010 0.0005 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 12.0: 0.0383 0.0255 0.0161 0.0097 0.0055 0.0030 0.0016 0.0008 0.0004 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 12.5: 0.0465 0.0323 0.0213 0.0133 0.0079 0.0045 0.0024 0.0013 0.0006 0.0003 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 13.0: 0.0550 0.0397 0.0272 0.0177 0.0109 0.0065 0.0037 0.0020 0.0010 0.0005 0.0002 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000
  µ = 13.5: 0.0633 0.0475 0.0337 0.0228 0.0146 0.0090 0.0053 0.0030 0.0016 0.0008 0.0004 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000
  µ = 14.0: 0.0713 0.0554 0.0409 0.0286 0.0191 0.0121 0.0074 0.0043 0.0024 0.0013 0.0007 0.0003 0.0002 0.0001 0.0000 0.0000 0.0000
  µ = 14.5: 0.0785 0.0632 0.0483 0.0350 0.0242 0.0159 0.0100 0.0061 0.0035 0.0020 0.0011 0.0005 0.0003 0.0001 0.0001 0.0000 0.0000
  µ = 15.0: 0.0847 0.0706 0.0557 0.0418 0.0299 0.0204 0.0133 0.0083 0.0050 0.0029 0.0016 0.0009 0.0004 0.0002 0.0001 0.0001 0.0000

µ = 15.5 to 20.0, k = 0-12
  µ = 15.5: 0.0000 0.0000 0.0000 0.0001 0.0004 0.0014 0.0036 0.0079 0.0153 0.0264 0.0409 0.0577 0.0745
  µ = 16.0: 0.0000 0.0000 0.0000 0.0001 0.0003 0.0010 0.0026 0.0060 0.0120 0.0213 0.0341 0.0496 0.0661
  µ = 16.5: 0.0000 0.0000 0.0000 0.0001 0.0002 0.0007 0.0019 0.0045 0.0093 0.0171 0.0281 0.0422 0.0580
  µ = 17.0: 0.0000 0.0000 0.0000 0.0000 0.0001 0.0005 0.0014 0.0034 0.0072 0.0135 0.0230 0.0355 0.0504
  µ = 17.5: 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0010 0.0025 0.0055 0.0107 0.0186 0.0297 0.0432
  µ = 18.0: 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0007 0.0019 0.0042 0.0083 0.0150 0.0245 0.0368
  µ = 18.5: 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0005 0.0014 0.0031 0.0065 0.0120 0.0201 0.0310
  µ = 19.0: 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0004 0.0010 0.0024 0.0050 0.0095 0.0164 0.0259
  µ = 19.5: 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0007 0.0018 0.0038 0.0074 0.0132 0.0214
  µ = 20.0: 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0005 0.0013 0.0029 0.0058 0.0106 0.0176

µ = 15.5 to 20.0, k = 13-40
  µ = 15.5: 0.0888 0.0983 0.1016 0.0984 0.0897 0.0773 0.0630 0.0489 0.0361 0.0254 0.0171 0.0111 0.0069 0.0041 0.0023 0.0013 0.0007 0.0004 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 16.0: 0.0814 0.0930 0.0992 0.0992 0.0934 0.0830 0.0699 0.0559 0.0426 0.0310 0.0216 0.0144 0.0092 0.0057 0.0034 0.0019 0.0011 0.0006 0.0003 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 16.5: 0.0736 0.0868 0.0955 0.0985 0.0956 0.0876 0.0761 0.0628 0.0493 0.0370 0.0265 0.0182 0.0120 0.0076 0.0047 0.0028 0.0016 0.0009 0.0005 0.0002 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 17.0: 0.0658 0.0800 0.0906 0.0963 0.0963 0.0909 0.0814 0.0692 0.0560 0.0433 0.0320 0.0226 0.0154 0.0101 0.0063 0.0038 0.0023 0.0013 0.0007 0.0004 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 17.5: 0.0582 0.0728 0.0849 0.0929 0.0956 0.0929 0.0856 0.0749 0.0624 0.0496 0.0378 0.0275 0.0193 0.0130 0.0084 0.0053 0.0032 0.0019 0.0010 0.0006 0.0003 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
  µ = 18.0: 0.0509 0.0655 0.0786 0.0884 0.0936 0.0936 0.0887 0.0798 0.0684 0.0560 0.0438 0.0328 0.0237 0.0164 0.0109 0.0070 0.0044 0.0026 0.0015 0.0009 0.0005 0.0002 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000
  µ = 18.5: 0.0441 0.0583 0.0719 0.0831 0.0904 0.0930 0.0905 0.0837 0.0738 0.0620 0.0499 0.0385 0.0285 0.0202 0.0139 0.0092 0.0058 0.0036 0.0022 0.0012 0.0007 0.0004 0.0002 0.0001 0.0001 0.0000 0.0000 0.0000
  µ = 19.0: 0.0378 0.0514 0.0650 0.0772 0.0863 0.0911 0.0911 0.0866 0.0783 0.0676 0.0559 0.0442 0.0336 0.0246 0.0173 0.0117 0.0077 0.0049 0.0030 0.0018 0.0010 0.0006 0.0003 0.0002 0.0001 0.0000 0.0000 0.0000
  µ = 19.5: 0.0322 0.0448 0.0582 0.0710 0.0814 0.0882 0.0905 0.0883 0.0820 0.0727 0.0616 0.0500 0.0390 0.0293 0.0211 0.0147 0.0099 0.0064 0.0040 0.0025 0.0015 0.0008 0.0005 0.0003 0.0001 0.0001 0.0000 0.0000
  µ = 20.0: 0.0271 0.0387 0.0516 0.0646 0.0760 0.0844 0.0888 0.0888 0.0846 0.0769 0.0669 0.0557 0.0446 0.0343 0.0254 0.0181 0.0125 0.0083 0.0054 0.0034 0.0020 0.0012 0.0007 0.0004 0.0002 0.0001 0.0001 0.0000
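Cumulative Poisson probabilities, which are often needed alongside Table A.2, can be accumulated from successive terms of the probability mass function; a minimal sketch:

    import math

    def poisson_cdf(k, mu):
        # P(X <= k) for a Poisson random variable with mean mu
        term = math.exp(-mu)              # P(X = 0)
        total = term
        for i in range(1, k + 1):
            term *= mu / i                # P(X = i) from P(X = i - 1)
            total += term
        return total

    print(round(poisson_cdf(2, 1.0), 4))  # 0.9197 = 0.3679 + 0.3679 + 0.1839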
TABLE A.3 Areas in the upper tail of the standard normal distribution
Rows give z to one decimal place; columns give the second decimal. Each entry is the area under the standard normal curve to the right of z.

  z    0.00  0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09
 0.0  0.500 0.496 0.492 0.488 0.484 0.480 0.476 0.472 0.468 0.464
 0.1  0.460 0.456 0.452 0.448 0.444 0.440 0.436 0.433 0.429 0.425
 0.2  0.421 0.417 0.413 0.409 0.405 0.401 0.397 0.394 0.390 0.386
 0.3  0.382 0.378 0.374 0.371 0.367 0.363 0.359 0.356 0.352 0.348
 0.4  0.345 0.341 0.337 0.334 0.330 0.326 0.323 0.319 0.316 0.312
 0.5  0.309 0.305 0.302 0.298 0.295 0.291 0.288 0.284 0.281 0.278
 0.6  0.274 0.271 0.268 0.264 0.261 0.258 0.255 0.251 0.248 0.245
 0.7  0.242 0.239 0.236 0.233 0.230 0.227 0.224 0.221 0.218 0.215
 0.8  0.212 0.209 0.206 0.203 0.200 0.198 0.195 0.192 0.189 0.187
 0.9  0.184 0.181 0.179 0.176 0.174 0.171 0.169 0.166 0.164 0.161
 1.0  0.159 0.156 0.154 0.152 0.149 0.147 0.145 0.142 0.140 0.138
 1.1  0.136 0.133 0.131 0.129 0.127 0.125 0.123 0.121 0.119 0.117
 1.2  0.115 0.113 0.111 0.109 0.107 0.106 0.104 0.102 0.100 0.099
 1.3  0.097 0.095 0.093 0.092 0.090 0.089 0.087 0.085 0.084 0.082
 1.4  0.081 0.079 0.078 0.076 0.075 0.074 0.072 0.071 0.069 0.068
 1.5  0.067 0.066 0.064 0.063 0.062 0.061 0.059 0.058 0.057 0.056
 1.6  0.055 0.054 0.053 0.052 0.051 0.049 0.048 0.047 0.046 0.046
 1.7  0.045 0.044 0.043 0.042 0.041 0.040 0.039 0.038 0.038 0.037
 1.8  0.036 0.035 0.034 0.034 0.033 0.032 0.031 0.031 0.030 0.029
 1.9  0.029 0.028 0.027 0.027 0.026 0.026 0.025 0.024 0.024 0.023
 2.0  0.023 0.022 0.022 0.021 0.021 0.020 0.020 0.019 0.019 0.018
 2.1  0.018 0.017 0.017 0.017 0.016 0.016 0.015 0.015 0.015 0.014
 2.2  0.014 0.014 0.013 0.013 0.013 0.012 0.012 0.012 0.011 0.011
 2.3  0.011 0.010 0.010 0.010 0.010 0.009 0.009 0.009 0.009 0.008
 2.4  0.008 0.008 0.008 0.008 0.007 0.007 0.007 0.007 0.007 0.006
 2.5  0.006 0.006 0.006 0.006 0.006 0.005 0.005 0.005 0.005 0.005
 2.6  0.005 0.005 0.004 0.004 0.004 0.004 0.004 0.004 0.004 0.004
 2.7  0.003 0.003 0.003 0.003 0.003 0.003 0.003 0.003 0.003 0.003
 2.8  0.003 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.002
 2.9  0.002 0.002 0.002 0.002 0.002 0.002 0.002 0.001 0.001 0.001
 3.0  0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
 3.1  0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
 3.2  0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001
 3.3  0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 3.4  0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
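Upper-tail areas of the standard normal distribution, as in Table A.3, can be computed from the complementary error function in the Python standard library; a minimal sketch:

    import math

    def upper_tail(z):
        # P(Z > z) for a standard normal random variable
        return 0.5 * math.erfc(z / math.sqrt(2))

    print(round(upper_tail(1.96), 3))   # 0.025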
TABLE A.4 Percentiles of the t distribution
Entries are the values of t that cut off the indicated area in the upper tail.

                       Area in Upper Tail
  df     0.10    0.05   0.025    0.01   0.005   0.0005
   1    3.078   6.314  12.706  31.821  63.657  636.619
   2    1.886   2.920   4.303   6.965   9.925   31.599
   3    1.638   2.353   3.182   4.541   5.841   12.924
   4    1.533   2.132   2.776   3.747   4.604    8.610
   5    1.476   2.015   2.571   3.365   4.032    6.869
   6    1.440   1.943   2.447   3.143   3.707    5.959
   7    1.415   1.895   2.365   2.998   3.499    5.408
   8    1.397   1.860   2.306   2.896   3.355    5.041
   9    1.383   1.833   2.262   2.821   3.250    4.781
  10    1.372   1.812   2.228   2.764   3.169    4.587
  11    1.363   1.796   2.201   2.718   3.106    4.437
  12    1.356   1.782   2.179   2.681   3.055    4.318
  13    1.350   1.771   2.160   2.650   3.012    4.221
  14    1.345   1.761   2.145   2.624   2.977    4.140
  15    1.341   1.753   2.131   2.602   2.947    4.073
  16    1.337   1.746   2.120   2.583   2.921    4.015
  17    1.333   1.740   2.110   2.567   2.898    3.965
  18    1.330   1.734   2.101   2.552   2.878    3.922
  19    1.328   1.729   2.093   2.539   2.861    3.883
  20    1.325   1.725   2.086   2.528   2.845    3.850
  21    1.323   1.721   2.080   2.518   2.831    3.819
  22    1.321   1.717   2.074   2.508   2.819    3.792
  23    1.319   1.714   2.069   2.500   2.807    3.768
  24    1.318   1.711   2.064   2.492   2.797    3.745
  25    1.316   1.708   2.060   2.485   2.787    3.725
  26    1.315   1.706   2.056   2.479   2.779    3.707
  27    1.314   1.703   2.052   2.473   2.771    3.690
  28    1.313   1.701   2.048   2.467   2.763    3.674
  29    1.311   1.699   2.045   2.462   2.756    3.659
  30    1.310   1.697   2.042   2.457   2.750    3.646
  40    1.303   1.684   2.021   2.423   2.704    3.551
  50    1.299   1.676   2.009   2.403   2.678    3.496
  60    1.296   1.671   2.000   2.390   2.660    3.460
  70    1.294   1.667   1.994   2.381   2.648    3.435
  80    1.292   1.664   1.990   2.374   2.639    3.416
  90    1.291   1.662   1.987   2.368   2.632    3.402
 100    1.290   1.660   1.984   2.364   2.626    3.390
 110    1.289   1.659   1.982   2.361   2.621    3.381
 120    1.289   1.658   1.980   2.358   2.617    3.373
  ∞     1.282   1.645   1.960   2.327   2.576    3.291
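Percentiles of the t distribution, as in Table A.4, have no simple closed form; assuming the SciPy library is available, they can be computed numerically (a sketch, not code from the text):

    from scipy.stats import t

    df, upper_area = 10, 0.025
    print(round(t.ppf(1 - upper_area, df), 3))   # 2.228, as in the table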
TABLE A.5 Percentiles of the F distribution

Denominator  Area in           Numerator Degrees of Freedom (df)
df           Upper Tail     1      2      3      4      5      6      7      8     12     24      ∞
2            0.100       8.53   9.00   9.16   9.24   9.29   9.33   9.35   9.37   9.41   9.45   9.49
             0.050      18.51  19.00  19.16  19.25  19.30  19.33  19.35  19.37  19.41  19.45  19.50
             0.025      38.51  39.00  39.17  39.25  39.30  39.33  39.36  39.37  39.41  39.46  39.50
             0.010      98.50  99.00  99.17  99.25  99.30  99.33  99.36  99.37  99.42  99.46  99.50
             0.005      198.5  199.0  199.2  199.3  199.3  199.3  199.4  199.4  199.4  199.5  199.5
             0.001      998.5  999.0  999.2  999.3  999.3  999.3  999.4  999.4  999.4  999.5  999.5
3            0.100       5.54   5.46   5.39   5.34   5.31   5.28   5.27   5.25   5.22   5.18   5.13
             0.050      10.13   9.55   9.28   9.12   9.01   8.94   8.89   8.85   8.74   8.64   8.53
             0.025      17.44  16.04  15.44  15.10  14.88  14.73  14.62  14.54  14.34  14.12  13.90
             0.010      34.12  30.82  29.46  28.71  28.24  27.91  27.67  27.49  27.05  26.60  26.13
             0.005      55.55  49.80  47.47  46.19  45.39  44.84  44.43  44.13  43.39  42.62  41.83
             0.001      167.0  148.5  141.1  137.1  134.6  132.9  131.6  130.6  128.3  125.9  123.5
4            0.100       4.54   4.32   4.19   4.11   4.05   4.01   3.98   3.95   3.90   3.83   3.76
             0.050       7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   5.91   5.77   5.63
             0.025      12.22  10.65   9.98   9.60   9.36   9.20   9.07   8.98   8.75   8.51   8.26
             0.010      21.20  18.00  16.69  15.98  15.52  15.21  14.98  14.80  14.37  13.93  13.46
             0.005      31.33  26.28  24.26  23.15  22.46  21.97  21.62  21.35  20.70  20.03  19.32
             0.001      74.14  61.25  56.18  53.44  51.71  50.53  49.66  49.00  47.41  45.77  44.05
5            0.100       4.06   3.78   3.62   3.52   3.45   3.40   3.37   3.34   3.27   3.19   3.10
             0.050       6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.68   4.53   4.36
             0.025      10.01   8.43   7.76   7.39   7.15   6.98   6.85   6.76   6.52   6.28   6.02
             0.010      16.26  13.27  12.06  11.39  10.97  10.67  10.46  10.29   9.89   9.47   9.02
             0.005      22.78  18.31  16.53  15.56  14.94  14.51  14.20  13.96  13.38  12.78  12.14
             0.001      47.18  37.12  33.20  31.09  29.75  28.83  28.16  27.65  26.42  25.13  23.79
6            0.100       3.78   3.46   3.29   3.18   3.11   3.05   3.01   2.98   2.90   2.82   2.72
             0.050       5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.00   3.84   3.67
             0.025       8.81   7.26   6.60   6.23   5.99   5.82   5.70   5.60   5.37   5.12   4.85
             0.010      13.75  10.92   9.78   9.15   8.75   8.47   8.26   8.10   7.72   7.31   6.88
             0.005      18.63  14.54  12.92  12.03  11.46  11.07  10.79  10.57  10.03   9.47   8.88
             0.001      35.51  27.00  23.70  21.92  20.80  20.03  19.46  19.03  17.99  16.90  15.75
7            0.100       3.59   3.26   3.07   2.96   2.88   2.83   2.78   2.75   2.67   2.58   2.47
             0.050       5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.57   3.41   3.23
             0.025       8.07   6.54   5.89   5.52   5.29   5.12   4.99   4.90   4.67   4.41   4.14
             0.010      12.25   9.55   8.45   7.85   7.46   7.19   6.99   6.84   6.47   6.07   5.65
             0.005      16.24  12.40  10.88  10.05   9.52   9.16   8.89   8.68   8.18   7.64   7.08
             0.001      29.25  21.69  18.77  17.20  16.21  15.52  15.02  14.63  13.71  12.73  11.70
8            0.100       3.46   3.11   2.92   2.81   2.73   2.67   2.62   2.59   2.50   2.40   2.29
             0.050       5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.28   3.12   2.93
             0.025       7.57   6.06   5.42   5.05   4.82   4.65   4.53   4.43   4.20   3.95   3.67
             0.010      11.26   8.65   7.59   7.01   6.63   6.37   6.18   6.03   5.67   5.28   4.86
             0.005      14.69  11.04   9.60   8.81   8.30   7.95   7.69   7.50   7.01   6.50   5.95
             0.001      25.41  18.49  15.83  14.39  13.48  12.86  12.40  12.05  11.19  10.30   9.33
9            0.100       3.36   3.01   2.81   2.69   2.61   2.55   2.51   2.47   2.38   2.28   2.16
             0.050       5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.07   2.90   2.71
             0.025       7.21   5.71   5.08   4.72   4.48   4.32   4.20   4.10   3.87   3.61   3.33
             0.010      10.56   8.02   6.99   6.42   6.06   5.80   5.61   5.47   5.11   4.73   4.31
             0.005      13.61  10.11   8.72   7.96   7.47   7.13   6.88   6.69   6.23   5.73   5.19
             0.001      22.86  16.39  13.90  12.56  11.71  11.13  10.70  10.37   9.57   8.72   7.81
TABLE A.5 (continued) Percentiles of the F distribution

Denominator  Area in           Numerator Degrees of Freedom (df)
df           Upper Tail     1      2      3      4      5      6      7      8     12     24      ∞
10           0.100       3.29   2.92   2.73   2.61   2.52   2.46   2.41   2.38   2.28   2.18   2.06
             0.050       4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   2.91   2.74   2.54
             0.025       6.94   5.46   4.83   4.47   4.24   4.07   3.95   3.85   3.62   3.37   3.08
             0.010      10.04   7.56   6.55   5.99   5.64   5.39   5.20   5.06   4.71   4.33   3.91
             0.005      12.83   9.43   8.08   7.34   6.87   6.54   6.30   6.12   5.66   5.17   4.64
             0.001      21.04  14.91  12.55  11.28  10.48   9.93   9.52   9.20   8.45   7.64   6.76
12           0.100       3.18   2.81   2.61   2.48   2.39   2.33   2.28   2.24   2.15   2.04   1.90
             0.050       4.75   3.89   3.49   3.26   3.11   3.00   2.91   2.85   2.69   2.51   2.30
             0.025       6.55   5.10   4.47   4.12   3.89   3.73   3.61   3.51   3.28   3.02   2.72
             0.010       9.33   6.93   5.95   5.41   5.06   4.82   4.64   4.50   4.16   3.78   3.36
             0.005      11.75   8.51   7.23   6.52   6.07   5.76   5.52   5.35   4.91   4.43   3.90
             0.001      18.64  12.97  10.80   9.63   8.89   8.38   8.00   7.71   7.00   6.25   5.42
14           0.100       3.10   2.73   2.52   2.39   2.31   2.24   2.19   2.15   2.05   1.94   1.80
             0.050       4.60   3.74   3.34   3.11   2.96   2.85   2.76   2.70   2.53   2.35   2.13
             0.025       6.30   4.86   4.24   3.89   3.66   3.50   3.38   3.29   3.05   2.79   2.49
             0.010       8.86   6.51   5.56   5.04   4.69   4.46   4.28   4.14   3.80   3.43   3.00
             0.005      11.06   7.92   6.68   6.00   5.56   5.26   5.03   4.86   4.43   3.96   3.44
             0.001      17.14  11.78   9.73   8.62   7.92   7.44   7.08   6.80   6.13   5.41   4.60
16           0.100       3.05   2.67   2.46   2.33   2.24   2.18   2.13   2.09   1.99   1.87   1.72
             0.050       4.49   3.63   3.24   3.01   2.85   2.74   2.66   2.59   2.42   2.24   2.01
             0.025       6.12   4.69   4.08   3.73   3.50   3.34   3.22   3.12   2.89   2.63   2.32
             0.010       8.53   6.23   5.29   4.77   4.44   4.20   4.03   3.89   3.55   3.18   2.75
             0.005      10.58   7.51   6.30   5.64   5.21   4.91   4.69   4.52   4.10   3.64   3.11
             0.001      16.12  10.97   9.01   7.94   7.27   6.80   6.46   6.19   5.55   4.85   4.06
18           0.100       3.01   2.62   2.42   2.29   2.20   2.13   2.08   2.04   1.93   1.81   1.66
             0.050       4.41   3.55   3.16   2.93   2.77   2.66   2.58   2.51   2.34   2.15   1.92
             0.025       5.98   4.56   3.95   3.61   3.38   3.22   3.10   3.01   2.77   2.50   2.19
             0.010       8.29   6.01   5.09   4.58   4.25   4.01   3.84   3.71   3.37   3.00   2.57
             0.005      10.22   7.21   6.03   5.37   4.96   4.66   4.44   4.28   3.86   3.40   2.87
             0.001      15.38  10.39   8.49   7.46   6.81   6.35   6.02   5.76   5.13   4.45   3.67
20           0.100       2.97   2.59   2.38   2.25   2.16   2.09   2.04   2.00   1.89   1.77   1.61
             0.050       4.35   3.49   3.10   2.87   2.71   2.60   2.51   2.45   2.28   2.08   1.84
             0.025       5.87   4.46   3.86   3.51   3.29   3.13   3.01   2.91   2.68   2.41   2.09
             0.010       8.10   5.85   4.94   4.43   4.10   3.87   3.70   3.56   3.23   2.86   2.42
             0.005       9.94   6.99   5.82   5.17   4.76   4.47   4.26   4.09   3.68   3.22   2.69
             0.001      14.82   9.95   8.10   7.10   6.46   6.02   5.69   5.44   4.82   4.15   3.38
30           0.100       2.88   2.49   2.28   2.14   2.05   1.98   1.93   1.88   1.77   1.64   1.46
             0.050       4.17   3.32   2.92   2.69   2.53   2.42   2.33   2.27   2.09   1.89   1.62
             0.025       5.57   4.18   3.59   3.25   3.03   2.87   2.75   2.65   2.41   2.14   1.79
             0.010       7.56   5.39   4.51   4.02   3.70   3.47   3.30   3.17   2.84   2.47   2.01
             0.005       9.18   6.35   5.24   4.62   4.23   3.95   3.74   3.58   3.18   2.73   2.18
             0.001      13.29   8.77   7.05   6.12   5.53   5.12   4.82   4.58   4.00   3.36   2.59
40           0.100       2.84   2.44   2.23   2.09   2.00   1.93   1.87   1.83   1.71   1.57   1.38
             0.050       4.08   3.23   2.84   2.61   2.45   2.34   2.25   2.18   2.00   1.79   1.51
             0.025       5.42   4.05   3.46   3.13   2.90   2.74   2.62   2.53   2.29   2.01   1.64
             0.010       7.31   5.18   4.31   3.83   3.51   3.29   3.12   2.99   2.66   2.29   1.80
             0.005       8.83   6.07   4.98   4.37   3.99   3.71   3.51   3.35   2.95   2.50   1.93
             0.001      12.61   8.25   6.59   5.70   5.13   4.73   4.44   4.21   3.64   3.01   2.23
TABLE A.5 (continued) Percentiles of the F distribution

Denominator  Area in           Numerator Degrees of Freedom (df)
df           Upper Tail     1      2      3      4      5      6      7      8     12     24      ∞
60           0.100       2.79   2.39   2.18   2.04   1.95   1.87   1.82   1.77   1.66   1.51   1.29
             0.050       4.00   3.15   2.76   2.53   2.37   2.25   2.17   2.10   1.92   1.70   1.39
             0.025       5.29   3.93   3.34   3.01   2.79   2.63   2.51   2.41   2.17   1.88   1.48
             0.010       7.08   4.98   4.13   3.65   3.34   3.12   2.95   2.82   2.50   2.12   1.60
             0.005       8.49   5.79   4.73   4.14   3.76   3.49   3.29   3.13   2.74   2.29   1.69
             0.001      11.97   7.77   6.17   5.31   4.76   4.37   4.09   3.86   3.32   2.69   1.89
80           0.100       2.77   2.37   2.15   2.02   1.92   1.85   1.79   1.75   1.63   1.48   1.24
             0.050       3.96   3.11   2.72   2.49   2.33   2.21   2.13   2.06   1.88   1.65   1.32
             0.025       5.22   3.86   3.28   2.95   2.73   2.57   2.45   2.35   2.11   1.82   1.40
             0.010       6.96   4.88   4.04   3.56   3.26   3.04   2.87   2.74   2.42   2.03   1.49
             0.005       8.33   5.67   4.61   4.03   3.65   3.39   3.19   3.03   2.64   2.19   1.56
             0.001      11.67   7.54   5.97   5.12   4.58   4.20   3.92   3.70   3.16   2.54   1.72
100          0.100       2.76   2.36   2.14   2.00   1.91   1.83   1.78   1.73   1.61   1.46   1.21
             0.050       3.94   3.09   2.70   2.46   2.31   2.19   2.10   2.03   1.85   1.63   1.28
             0.025       5.18   3.83   3.25   2.92   2.70   2.54   2.42   2.32   2.08   1.78   1.35
             0.010       6.90   4.82   3.98   3.51   3.21   2.99   2.82   2.69   2.37   1.98   1.43
             0.005       8.24   5.59   4.54   3.96   3.59   3.33   3.13   2.97   2.58   2.13   1.49
             0.001      11.50   7.41   5.86   5.02   4.48   4.11   3.83   3.61   3.07   2.46   1.62
120          0.100       2.75   2.35   2.13   1.99   1.90   1.82   1.77   1.72   1.60   1.45   1.19
             0.050       3.92   3.07   2.68   2.45   2.29   2.18   2.09   2.02   1.83   1.61   1.25
             0.025       5.15   3.80   3.23   2.89   2.67   2.52   2.39   2.30   2.05   1.76   1.31
             0.010       6.85   4.79   3.95   3.48   3.17   2.96   2.79   2.66   2.34   1.95   1.38
             0.005       8.18   5.54   4.50   3.92   3.55   3.28   3.09   2.93   2.54   2.09   1.43
             0.001      11.38   7.32   5.78   4.95   4.42   4.04   3.77   3.55   3.02   2.40   1.54
∞            0.100       2.71   2.30   2.08   1.94   1.85   1.77   1.72   1.67   1.55   1.38   1.00
             0.050       3.84   3.00   2.60   2.37   2.21   2.10   2.01   1.94   1.75   1.52   1.00
             0.025       5.02   3.69   3.12   2.79   2.57   2.41   2.29   2.19   1.94   1.64   1.00
             0.010       6.63   4.61   3.78   3.32   3.02   2.80   2.64   2.51   2.18   1.79   1.00
             0.005       7.88   5.30   4.28   3.72   3.35   3.09   2.90   2.74   2.36   1.90   1.00
             0.001      10.83   6.91   5.42   4.62   4.10   3.74   3.47   3.27   2.74   2.13   1.00
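Entries of Table A.5 can be generated the same way, by inverting the upper tail of the F distribution. A minimal sketch, assuming SciPy, where dfn and dfd denote the numerator and denominator degrees of freedom:

    # Critical values of the F distribution; f.isf inverts the upper tail.
    from scipy.stats import f

    dfn, dfd = 4, 10
    for area in (0.100, 0.050, 0.025, 0.010, 0.005, 0.001):
        print(f"F({dfn}, {dfd}), upper-tail area {area}: {f.isf(area, dfn, dfd):.2f}")

This reproduces the numerator df 4 column for denominator df 10: 2.61, 3.48, 4.47, 5.99, 7.34, 11.28.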
TABLE A.6 Percentiles of the chi-square distribution

        Area in Upper Tail
df      0.100   0.050   0.025   0.010   0.001
1        2.71    3.84    5.02    6.63   10.83
2        4.61    5.99    7.38    9.21   13.82
3        6.25    7.81    9.35   11.34   16.27
4        7.78    9.49   11.14   13.28   18.47
5        9.24   11.07   12.83   15.09   20.52
6       10.64   12.59   14.45   16.81   22.46
7       12.02   14.07   16.01   18.48   24.32
8       13.36   15.51   17.53   20.09   26.12
9       14.68   16.92   19.02   21.67   27.88
10      15.99   18.31   20.48   23.21   29.59
11      17.28   19.68   21.92   24.72   31.26
12      18.55   21.03   23.34   26.22   32.91
13      19.81   22.36   24.74   27.69   34.53
14      21.06   23.68   26.12   29.14   36.12
15      22.31   25.00   27.49   30.58   37.70
16      23.54   26.30   28.85   32.00   39.25
17      24.77   27.59   30.19   33.41   40.79
18      25.99   28.87   31.53   34.81   42.31
19      27.20   30.14   32.85   36.19   43.82
20      28.41   31.41   34.17   37.57   45.31
21      29.62   32.67   35.48   38.93   46.80
22      30.81   33.92   36.78   40.29   48.27
23      32.01   35.17   38.08   41.64   49.73
24      33.20   36.42   39.36   42.98   51.18
25      34.38   37.65   40.65   44.31   52.62
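As with the preceding tables, these chi-square percentiles can be computed directly rather than tabulated. A minimal sketch, assuming SciPy:

    # Critical values of the chi-square distribution with area 0.05 in the
    # upper tail; chi2.isf inverts the upper tail.
    from scipy.stats import chi2

    for df in (1, 5, 10, 25):
        print(f"df = {df}: {chi2.isf(0.05, df):.2f}")

This reproduces the 0.050 column entries 3.84, 11.07, 18.31, and 37.65.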
Index
abbreviated life table, 97
additive rule of probability, 114
adjacent values, 30
adjusted rate, 75
all possible models, 438
allocation concealment, 539
alternative hypothesis, 228
analysis of variance, 279
average, 34
backward elimination, 439
bar chart, 24
Bayes’ theorem, 137
bell-shaped curve, 172
Berkson’s fallacy, 365
Bernoulli random variable, 161, 324
bias, 191, 511
big data, 2, 544
bimodal distribution, 37
binary data, 17
binomial distribution, 162, 324
binomial distribution, normal approximation to, 324
biostatistics, 1
blinding, 540
block randomization, 539
Bonferroni correction, 287
box plot, 29
case-control study, 123, 543
categorical data, 17
censoring, 479
central limit theorem, 9, 192
chi-square distribution, 308, 353
chi-square test, 351, 353
chi-square test, 2 × 2 table, 351
chi-square test, r × c table, 356
circle of powers, 417
clinical trial, 1, 538
cluster sampling, 11, 519
coefficient of determination, 413, 440
coefficient of determination, adjusted, 440
coefficient, regression model, 403
cohort study, 121, 544
collinearity, 439
combination, 164
comparative study, 537
complement, 111
concordant pair, 359
conditional probability, 115
confidence interval, 9, 209
confidence interval, difference in means, independent samples, 262, 265
confidence interval, difference in means, paired samples, 257
confidence interval, mean, 209, 210
confidence interval, odds ratio, 362
confidence interval, one-sided, 213
confidence interval, proportion, 327
confidence interval, proportion, binomial exact, 329
confidence interval, proportion, Wilson, 328
confidence interval, two-sided, 209
confounder, 72, 439, 467
consistency, 201
continuity correction, 325, 354
continuous data, 19
continuous random variable, 159
control group, 539
correlation, 10, 381
correlation coefficient, 382
correlation coefficient, Pearson, 382
correlation coefficient, Spearman rank, 387
Cox proportional hazards model, 495
critical value, 231
cross-sectional data, 89
cross-sectional study, 542
crossover study, 220
crossover trial, 541
crude rate, 70
cumulative frequency polygon, 27
cumulative relative frequency, 22
death rate, 68
degrees of freedom, 216
demographic data, 67
descriptive statistics, 4, 17
design effect, 527
diagnostic test, 6, 135
dichotomous data, 17
direct standardization, 74
discordant pair, 359
discrete data, 18
discrete random variable, 159
disjoint, 114
distribution-free method, 297
ecological fallacy, 385
effective sample size, 527
empirical probability, 160
empirical rule, 42
equipoise, 541
error, 403
estimation, 191, 209
estimator, 191
event, 111
exact test, 300
exhaustive, 118
explanatory variable, 399
F distribution, 283
F-test, 283
factorial, 164
false negative, 136, 179
false positive, 136, 180
finite population correction factor, 513
Fisher’s exact test, 358
forward selection, 439
frequency distribution, 20
frequency polygon, 26
frequentist definition, 112
Gaussian distribution, 172
graph, 24
hazard function, 101, 484, 495
hazard ratio, 496
histogram, 24
homoscedasticity, 404, 432
hypothesis test, 9, 227
hypothesis test, correlation coefficient, Pearson, 385
hypothesis test, correlation coefficient, Spearman, 388
hypothesis test, difference in means, equal variances, 259
hypothesis test, difference in means, independent samples, 258
hypothesis test, difference in means, paired samples, 254
hypothesis test, difference in means, unequal variances, 263
hypothesis test, difference in proportions, 332
hypothesis test, mean, 227
hypothesis test, one-sided, 233
hypothesis test, proportion, 329
hypothesis test, two-sided, 230
independent events, 117
indicator variable, 435, 460
indirect standardization, 77
infant mortality rate, 68
inference, 6, 111, 191
intention to treat, 541
interaction, 436, 467
interquartile range, 38
intersection, 111
interval estimation, 209
intraclass correlation coefficient, 527
Kaplan-Meier method, 487
Kruskal-Wallis test, 307
ladder of powers, 417
least squares method, 405
level of confidence, 211
life expectancy, 97
life table, 5, 89
life table, cohort, 89, 481
life table, current, 481
life table, period, 89, 481
line graph, 31
linear regression, 10, 399
linear regression, multiple, 431
linear regression, simple, 399
log-rank test, 491
logistic function, 457
logistic regression, 10, 455
logistic regression, multiple, 464
longitudinal data, 89
longitudinal study, 543
loss to follow-up, 479
Mann-Whitney test, 304
maximum likelihood estimation, 191, 326
McNemar’s test, 358, 360
mean, 34
measure of central tendency, 34
measure of variability, 38
median, 36
method of least squares, 404
mode, 37
model evaluation, 413, 440
model selection, 438, 468
model selection, all possible models, 438
model selection, backward elimination, 439
model selection, forward selection, 439
model selection, stepwise selection, 439
mortality rate, 68
multiple comparisons procedures, 286
multiplicative method, 92
multiplicative rule of probability, 116
mutually exclusive, 114
natural experiment, 69
negative correlation, 383
negative likelihood ratio, 144
negative predictive value, 139
nominal data, 17
nonparametric test, 297
nonresponse, 512, 528
normal distribution, 172
null event, 114
null hypothesis, 227
numerical summary measure, 34
observational study, 542
odds, 122, 361
odds ratio, 122, 360
one-sample t-test, 230
one-sample z-test, 230
one-way analysis of variance, 279, 282
one-way analysis of variance, between-groups variability, 283
one-way analysis of variance, within-groups variability, 283
ordinal data, 17
outlier, 30, 386
p-value, 229
paired data, 254
paired t-test, 255
parameter, 162
parametric test, 297
parsimonious model, 438
Pearson correlation coefficient, 382
percentile, 27
perinatal mortality rate, 69
permuted block randomization, 540
person-year, 99
placebo, 86, 540
point estimation, 209
Poisson distribution, 168
population at risk, 68
population mean, 160
population regression line, 403
population standard deviation, 160
population variance, 160
positive correlation, 383
positive likelihood ratio, 144
positive predictive value, 139
posterior probability, 139
power, 238
power curve, 238
PPS sampling, 523
precision medicine, 8
prevalence, 118, 138, 147
primary sampling unit, 519
prior probability, 139
probability, 6, 111
probability density, 172
probability distribution, 159
probability proportionate to size sampling, 523
probability sample, 510
product limit method, 487
proportion, 68
quartile, 29
R2, 413, 440
random sample, 192
random variable, 159
randomization, 539
randomized clinical trial, 538
randomized response, 528
randomized study, 538
range, 38
ranked data, 18
rate, 67
rate, age-specific, 70
rate, crude, 70
ratio estimator, 522
receiver-operating characteristic curve, 146
relative frequency, 22
relative odds, 122
relative risk, 121
residual, 405
residual plot, 414, 440
residual sum of squares, 406
response variable, 399
risk difference, 341
risk ratio, 121, 341
robust estimator, 37
ROC curve, 146
sample size estimation, one mean, 212, 241
sample size estimation, one proportion, 330
sample size estimation, two means, 266
sample size estimation, two proportions, 335
sample space, 112
sampling distribution, mean, 192
sampling distribution, proportion, 326
sampling fraction, 512
sampling frame, 511
sampling units, 511
sampling variability, 199
scatter plot, 30
screening test, 6, 135
secondary sampling unit, 519
self-pairing, 254
sensitivity, 136
sign test, 297
significance level, 228
simple random sample, 7, 11, 512
Simpson’s paradox, 466
simulation, 199
skewed data, 29, 37
slope, 403, 431
specificity, 136
spectrum bias, 150
standard deviation, 41
standard deviation from regression, 408
standard error, 192, 326
standard normal deviate, 176
standard normal distribution, 173
standardization, 74
standardized mortality ratio, 77
stationary population, 93, 104
statistical inference, 191
statistical package, 54
statistical significance, 228
statistics, 1
stepwise selection, 439
stochastic ordering, 24
stratified random sample, 515
stratified analysis, 466
stratified randomization, 540
stratified sampling, 11
Student’s t distribution, 215, 216
study design, 537
study population, 511
study units, 511
survey, 510
survival analysis, 479
survival curve, 91, 479
survival curve, Kaplan-Meier, 487
survival curve, product-limit, 487
survival function, 479
survival time, 479
symmetric data, 29, 37
systematic random sample, 514
t distribution, 215
t-test, independent samples, 258
t-test, paired samples, 254
table, 20
target population, 511
test statistic, 230, 231
theoretical probability distribution, 160
total probability rule, 119
transformation, 416
true positive, 136
two-sample t-test, 258
two-stage cluster design, 524
two-way scatter plot, 30, 381
type I error, 234
type II error, 235
under-1 mortality rate, 68
under-5 mortality rate, 69
unimodal distribution, 37
union, 111
variability, 4
variable, 159
variance, 39
variance, pooled estimate, 260
Venn diagram, 112
vital statistics, 67
Welch’s t-test, 263
Wilcoxon rank sum test, 304
Wilcoxon signed-rank test, 301
y-intercept, 403, 431
Yates’ correction, 355
z-score, 176