Methods and Applications of Statistics in Clinical Trials, Volume 2: Planning, Analysis, and Inferential Methods 9781118304761, 2013034342, 1871871891, 1118304764, 9781118595978, 1118595971

This comprehensive book features both new and established material on the key statistical principles and concepts for de


English Pages 963 Year 2014


Methods and Applications of Statistics in Clinical Trials Volume 2

WILEY SERIES IN METHODS AND APPLICATIONS OF STATISTICS

Advisory Editor
N. Balakrishnan
McMaster University, Canada

The Wiley Series in Methods and Applications of Statistics is a unique grouping of research that features classic contributions from Wiley's Encyclopedia of Statistical Sciences, Second Edition (ESS, 2e) alongside newly written articles that explore various problems of interest and their intrinsic connection to statistics. The goal of this collection is to encompass an encyclopedic scope of coverage within individual books that unify the most important and interesting applications of statistics within a specific field of study. Each book in the series successfully upholds the goals of ESS, 2e by combining established literature and newly developed contributions written by leading academics, researchers, and practitioners in a comprehensive and accessible format. The result is a succinct reference that unveils modern, cutting-edge approaches to acquiring, analyzing, and presenting data across diverse subject areas.

WILEY SERIES IN METHODS AND APPLICATIONS OF STATISTICS
Balakrishnan • Methods and Applications of Statistics in the Life and Health Sciences
Balakrishnan • Methods and Applications of Statistics in Business, Finance, and Management Science
Balakrishnan • Methods and Applications of Statistics in Engineering, Quality Control, and the Physical Sciences
Balakrishnan • Methods and Applications of Statistics in the Social and Behavioral Sciences
Balakrishnan • Methods and Applications of Statistics in the Atmospheric and Earth Sciences
Balakrishnan • Methods and Applications of Statistics in Clinical Trials, Volume 1: Concepts, Principles, Trials, and Designs
Balakrishnan • Methods and Applications of Statistics in Clinical Trials, Volume 2: Planning, Analysis, and Inferential Methods

Methods and Applications of Statistics in Clinical Trials
Volume 2
Planning, Analysis, and Inferential Methods

Edited by
N. Balakrishnan
McMaster University
Department of Mathematics and Statistics
Hamilton, Ontario, Canada

WILEY

Copyright © 2014 by John Wiley & Sons, Inc. All rights reserved. Published by John Wiley & Sons, Inc., Hoboken, New Jersey. All rights reserved. Published simultaneously in Canada. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. 
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic format. For information about Wiley products, visit our web site at www.wiley.com. Library of Congress Cataloging-in-Publication Data: Methods and applications of statistics in clinical trials vol 2/ [edited by] N. Balakrishnan. p . ; cm. — (Methods and applications of statistics) Includes bibliographical references and index. ISBN 978-1-118-30476-1 (cloth) I. Balakrishnan, N., 1956- editor of compilation. II. Series: Wiley series in methods and applications of statistics. [DNLM: 1. Clinical Trials as Topic. 2. Statistics as Topic. QV 771.4] R853.C55 610.72'4—dc23 2013034342 Printed in the United States of America. 10 9 8 7 6 5 4 3 2 1

Contents

Contributors
Preface

1 Analysis of Over- and Underdispersed Data
  1.1 Introduction
  1.2 Overdispersed Binomial and Count Models
  1.3 Other Approaches to Account for Overdispersion
  1.4 Underdispersion
  1.5 Software Notes
  References

2 Analysis of Variance (ANOVA)
  2.1 Introduction
  2.2 Factors, Levels, Effects, and Cells
  2.3 Cell Means Model
  2.4 One-Way Classification
  2.5 Parameter Estimation
  2.6 The R(.) Notation—Partitioning Sum of Squares
  2.7 ANOVA—Hypothesis of Equal Means
  2.8 Multiple Comparisons
  2.9 Two-Way Crossed Classification
  2.10 Balanced and Unbalanced Data
  2.11 Interaction Between Rows and Columns
  2.12 Analysis of Variance Table
  References

3 Assessment of Health-Related Quality of Life
  3.1 Introduction
  3.2 Choice of HRQOL Instruments
  3.3 Establishment of Clear Objectives in HRQOL Assessments
  3.4 Methods for HRQOL Assessment
  3.5 HRQOL as the Primary End Point
  3.6 Interpretation of HRQOL Results
  3.7 Examples
  3.8 Conclusion
  References
  Further Reading

4 Bandit Processes and Response-Adaptive Clinical Trials: The Art of Exploration Versus Exploitation
  4.1 Introduction
  4.2 Exploration Versus Exploitation with Complete Observations
  4.3 Exploration Versus Exploitation with Censored Observations
  4.4 Conclusion
  References

5 Bayesian Dose-Finding Designs in Healthy Volunteers
  5.1 Introduction
  5.2 A Bayesian Decision-Theoretic Design
  5.3 An Example of Dose Escalation in Healthy Volunteer Studies
  5.4 Discussion
  References

6 Bootstrap
  6.1 Introduction
  6.2 Plug-In Principle
  6.3 Monte Carlo Sampling—The "Second Bootstrap Principle"
  6.4 Bias and Standard Error
  6.5 Examples
  6.6 Model Stability
  6.7 Accuracy of Bootstrap Distributions
  6.8 Bootstrap Confidence Intervals
  6.9 Hypothesis Testing
  6.10 Planning Clinical Trials
  6.11 How Many Bootstrap Samples Are Needed
  6.12 Additional References
  References

7 Conditional Power in Clinical Trial Monitoring
  7.1 Introduction
  7.2 Conditional Power
  7.3 Weight-Averaged Conditional Power or Bayesian Predictive Power
  7.4 Conditional Power of a Different Kind: Discordance Probability
  7.5 Analysis of a Randomized Trial
  7.6 Conditional Power: Pros and Cons
  References

8 Cost-Effectiveness Analysis
  8.1 Introduction
  8.2 Definitions and Design Issues
  8.3 Cost and Effectiveness Data
  8.4 The Analysis of Costs and Outcomes
  8.5 Robustness and Generalizability in Cost-Effectiveness Analysis
  References
  Further Reading

9 Cox-Type Proportional Hazards Models
  9.1 Introduction
  9.2 Cox Model for Univariate Failure Time Data Analysis
  9.3 Marginal Models for Multivariate Failure Time Data Analysis
  9.4 Practical Issues in Using the Cox Model
  9.5 Examples
  9.6 Extensions
  9.7 Softwares and Codes
  References
  Further Reading

10 Empirical Likelihood Methods in Clinical Experiments
  10.1 Introduction
  10.2 Classical EL: Several Ingredients for Theoretical Evaluations
  10.3 The Relationship Between Empirical Likelihood and Bootstrap Methodologies
  10.4 Bayes Methods Based on Empirical Likelihoods
  10.5 Mixtures of Likelihoods
  10.6 An Example: ROC Curve Analyses Based on Empirical Likelihoods
  10.7 Applications of Empirical Likelihood Methodology in Clinical Trials or Other Data Analyses
  10.8 Concluding Remarks
  Appendix
  References

11 Frailty Models
  11.1 Introduction
  11.2 Univariate Frailty Models
  11.3 Multivariate Frailty Models
  11.4 Software
  References

12 Futility Analysis
  12.1 Introduction
  12.2 Common Statistical Approaches to Futility Monitoring
  12.3 Examples
  12.4 Discussion
  References
  Further Reading

13 Imaging Science in Medicine I: Overview
  13.1 Introduction
  13.2 Advances in Medical Imaging
  13.3 Evolutionary Developments in Imaging
  13.4 Conclusion
  References

14 Imaging Science in Medicine, II: Basics of X-Ray Imaging
  14.1 Introduction to Medical Imaging: Different Ways of Creating Visible Contrast Among Tissues
  14.2 What the Body Does to the X-Ray Beam: Subject Contrast From Differential Attenuation of the X-Ray Beam by Various Tissues
  14.3 What the X-Ray Beam Does to the Body: Known Medical Benefits Versus Possible Radiogenic Risks
  14.4 Capturing the Visual Image: Analog (20th Century) X-Ray Image Receptors

15 Imaging Science in Medicine, III: Digital (21st Century) X-Ray Imaging
  15.1 The Computer in Medical Imaging
  15.2 The Digital Planar X-Ray Modalities: Computed Radiography and Digital Radiography and Fluoroscopy
  15.3 Digital Fluoroscopy and Digital Subtraction Angiography
  15.4 Digital Tomosynthesis: Planar Imaging in Three Dimensions
  15.5 Computed Tomography: Superior Contrast in Three-Dimensional X-Ray Attenuation Maps

16 Intention-to-Treat Analysis
  16.1 Introduction
  16.2 Missing Information
  16.3 The Intention-to-Treat Design
  16.4 Efficiency of the Intent-to-Treat Analysis
  16.5 Compliance-Adjusted Analyses
  16.6 Conclusion
  References
  Further Reading

17 Interim Analyses
  17.1 Introduction
  17.2 Opportunities and Dangers of Interim Analyses
  17.3 The Development of Techniques for Conducting Interim Analyses
  17.4 Methodology for Interim Analyses
  17.5 An Example: Statistics for Lamivudine
  17.6 Interim Analyses in Practice
  17.7 Conclusions
  References

18 Interrater Reliability
  18.1 Definition
  18.2 The Importance of Reliability in Clinical Trials
  18.3 How Large a Reliability Coefficient Is Large Enough?
  18.4 Design and Analysis of Reliability Studies
  18.5 Estimate of the Reliability Coefficient—Parametric
  18.6 Estimation of the Reliability Coefficient—Nonparametric
  18.7 Estimation of the Reliability Coefficient—Binary
  18.8 Estimation of the Reliability Coefficient—Categorical
  18.9 Strategies to Increase Reliability (Spearman-Brown Projection)
  18.10 Other Types of Reliabilities
  References

19 Intrarater Reliability
  19.1 Introduction
  19.2 Intrarater Reliability for Continuous Scores
  19.3 Nominal Scale Score Data
  19.4 Ordinal and Interval Score Data
  19.5 Concluding Remarks
  References
  Further Reading

20 Kaplan-Meier Plot
  20.1 Introduction
  20.2 Estimation of Survival Function
  20.3 Additional Topics
  References

21 Logistic Regression
  21.1 Introduction
  21.2 Fitting the Logistic Regression Model
  21.3 The Multiple Logistic Regression Model
  21.4 Fitting the Multiple Logistic Regression Model
  21.5 Example
  21.6 Testing for the Significance of the Model
  21.7 Interpretation of the Coefficients of the Logistic Regression Model
  21.8 Dichotomous Independent Variable
  21.9 Polytomous Independent Variable
  21.10 Continuous Independent Variable
  21.11 Multivariate Case
  References

22 Metadata
  22.1 Introduction
  22.2 History/Background
  22.3 Data Set Metadata
  22.4 Analysis Results Metadata
  22.5 Regulatory Submission Metadata
  References

23 Microarray
  23.1 Introduction
  23.2 What is a Microarray?
  23.3 Other Array Technologies
  23.4 Define Objectives of the Study
  23.5 Experimental Design for Microarray
  23.6 Data Extraction
  23.7 Microarray Informatics
  23.8 Statistical Analysis
  23.9 Annotation
  23.10 Pathway, GO, and Class-Level Analysis Tools
  23.11 Validation of Microarray Experiments
  23.12 Conclusions
  References

24 Multi-Armed Bandits, Gittins Index, and Its Calculation
  24.1 Introduction
  24.2 Mathematical Formulation of Multi-Armed Bandits
  24.3 Off-Line Algorithms for Computing Gittins Index
  24.4 On-Line Algorithms for Computing Gittins Index
  24.5 Computing Gittins Index for the Bernoulli Sampling Process
  24.6 Conclusion
  References

25 Multiple Comparisons
  25.1 Introduction
  25.2 Strong and Weak Control of the FWE
  25.3 Criteria for Deciding Whether Adjustment is Necessary
  25.4 Implicit Multiplicity: Two-Tailed Testing
  25.5 Specific Multiple Comparison Procedures
  References

26 Multiple Evaluators
  26.1 Introduction
  26.2 Agreement for Continuous Data
  26.3 Agreement for Categorical Data
  26.4 Summary and Discussion
  References

27 Noncompartmental Analysis
  27.1 Introduction
  27.2 Terminology
  27.3 Objectives and Features of Noncompartmental Analysis
  27.4 Comparison of Noncompartmental and Compartmental Models
  27.5 Assumptions of NCA and Its Reported Descriptive Statistics
  27.6 Calculation Formulas for NCA
  27.7 Guidelines for Performance of NCA Based on Numerical Integration
  27.8 Conclusions and Perspectives
  References
  Further Reading

28 Nonparametric ROC Analysis for Diagnostic Trials
  28.1 Introduction
  28.2 Different Aspects of Study Design
  28.3 Nonparametric Models and Hypotheses
  28.4 Point Estimator
  28.5 Asymptotic Distribution and Variance Estimator
  28.6 Derivation of the Confidence Interval
  28.7 Statistical Tests
  28.8 Adaptations for Cluster Data
  28.9 Results of a Diagnostic Study
  28.10 Summary and Final Remarks
  References

29 Optimal Biological Dose for Molecularly Targeted Therapies
  29.1 Introduction
  29.2 Phase I Dose-Finding Designs for Cytotoxic Agents
  29.3 Phase I Dose-Finding Designs for Molecularly Targeted Agents
  29.4 Discussion
  References
  Further Reading

30 Over- and Underdispersion Models
  30.1 Introduction
  30.2 Count Dispersion Models
  30.3 Count Explanatory Models
  30.4 Summary and Final Remarks
  References

31 Permutation Tests in Clinical Trials
  31.1 Randomization Inference—Introduction
  31.2 Permutation Tests—How They Work
  31.3 Normal Approximation to Permutation Tests
  31.4 Analyze as You Randomize
  31.5 Interpretation of Permutation Analysis Results
  31.6 Summary
  References

32 Pharmacoepidemiology, Overview
  32.1 Introduction
  32.2 The Case-Crossover Design
  32.3 Confounding Bias
  32.4 Risk Functions Over Time
  32.5 Probabilistic Approach for Causality Assessment
  32.6 Methods Based on Prescription Data
  References

33 Population Pharmacokinetic and Pharmacodynamic Methods
  33.1 Introduction
  33.2 Terminology
  33.3 Fixed Effects Models
  33.4 Random Effects Models
  33.5 Model Building and Parameter Estimation
  33.6 Software
  33.7 Model Evaluation
  33.8 Stochastic Simulation
  33.9 Experimental Design
  33.10 Applications
  References
  Further Reading

34 Proportions: Inferences and Comparisons
  34.1 Introduction
  34.2 One-Sample Case
  34.3 Two Independent Samples
  34.4 Note on Software
  References

35 Publication Bias
  35.1 Publication Bias and the Validity of Research Reviews
  35.2 Research on Publication Bias
  35.3 Data Suppression Mechanisms Related to Publication Bias
  35.4 Prevention of Publication Bias
  35.5 Assessment of Publication Bias
  35.6 Impact of Publication Bias
  References
  Further Reading

36 Quality of Life
  36.1 Background
  36.2 Measuring Health-Related Quality of Life
  36.3 Development and Validation of HRQoL Measures
  36.4 Use in Research Studies
  36.5 Interpretation/Clinical Significance
  36.6 Conclusions
  References

37 Relative Risk Modeling
  37.1 Introduction
  37.2 Why Model Relative Risks?
  37.3 Data Structures and Likelihoods
  37.4 Approaches to Model Specification
  37.5 Mechanistic Models
  References

38 Sample Size Considerations for Morbidity/Mortality Trials
  38.1 Introduction
  38.2 General Framework for Sample Size Calculation
  38.3 Choice of Test Statistics
  38.4 Adjustment of Treatment Effect
  38.5 Informative Noncompliance
  References

39 Sample Size for Comparing Means
  39.1 Introduction
  39.2 One-Sample Design
  39.3 Two-Sample Parallel Design
  39.4 Two-Sample Crossover Design
  39.5 Multiple-Sample One-Way ANOVA
  39.6 Multiple-Sample Williams Design
  39.7 Discussion
  References

40 Sample Size for Comparing Proportions
  40.1 Introduction
  40.2 One-Sample Design
  40.3 Two-Sample Parallel Design
  40.4 Two-Sample Crossover Design
  40.5 Relative Risk—Parallel Design
  40.6 Relative Risk—Crossover Design
  40.7 Discussion
  References

41 Sample Size for Comparing Time-to-Event Data
  41.1 Introduction
  41.2 Exponential Model
  41.3 Cox's Proportional Hazards Model
  41.4 Log-Rank Test
  41.5 Discussion
  References

42 Sample Size for Comparing Variabilities
  42.1 Introduction
  42.2 Comparing Intrasubject Variabilities
  42.3 Comparing Intersubject Variabilities
  42.4 Comparing Total Variabilities
  42.5 Discussion
  References

43 Screening, Models of
  43.1 Introduction
  43.2 What is Screening?
  43.3 Why Use Modeling?
  43.4 Characteristics of Screening Models
  43.5 A Simple Disease and Screening Model
  43.6 Analytic Models for Cancer
  43.7 Simulation Models for Cancer
  43.8 Model Fitting and Validation
  43.9 Models for Other Diseases
  43.10 Current State and Future Directions
  References

44 Screening Trials
  44.1 Introduction
  44.2 Design Issues
  44.3 Sample Size
  44.4 Study Designs
  44.5 Analysis
  44.6 Trial Monitoring
  References

45 Secondary Efficacy End Points
  45.1 Introduction
  45.2 Literature Review
  45.3 Review of Methodology for Multiplicity Adjustment and Gatekeeping Strategies for Secondary End Points
  45.4 Summary
  References
  Further Reading

46 Sensitivity, Specificity, and Receiver Operator Characteristic (ROC) Methods
  46.1 Evaluating a Single Binary Test Against a Binary Criterion
  46.2 Evaluation of a Single Binary Test: ROC Methods
  46.3 Evaluation of a Test Response Measured on an Ordinal Scale: ROC Methods
  46.4 Evaluation of Multiple Different Tests
  46.5 The Optimal Sequence of Tests
  46.6 Sampling and Measurement Issues
  46.7 Summary
  References

47 Software for Genetics/Genomics
  47.1 Introduction
  47.2 Data Management
  47.3 Genetic Analysis
  47.4 Genomic Analysis
  47.5 Other
  References
  Further Reading

48 Stability Study Designs
  48.1 Introduction
  48.2 Stability Study Designs
  48.3 Criteria for Design Comparison
  48.4 Stability Protocol
  48.5 Basic Design Considerations
  48.6 Conclusions
  References

49 Subgroup Analysis
  49.1 Introduction
  49.2 The Dilemma of Subgroup Analysis
  49.3 Planned Versus Unplanned Subgroup Analysis
  49.4 Frequentist Methods
  49.5 Testing Treatment by Subgroup Interactions
  49.6 Subgroup Analyses in Positive Clinical Trials
  49.7 Confidence Intervals for Treatment Effects within Subgroups
  49.8 Bayesian Methods
  References

50 Survival Analysis, Overview
  50.1 Introduction
  50.2 History
  50.3 Survival Analysis Concepts
  50.4 Nonparametric Estimation and Testing
  50.5 Parametric Inference
  50.6 Comparison with Expected Survival
  50.7 The Cox Regression Model
  50.8 Other Regression Models for Survival Data
  50.9 Multistate Models
  50.10 Other Kinds of Incomplete Observation
  50.11 Multivariate Survival Analysis
  50.12 Concluding Remarks
  References

51 The FDA and Regulatory Issues
  51.1 Caveat
  51.2 Introduction
  51.3 Chronology of Drug Regulation in the United States
  51.4 FDA Basic Structure
  51.5 IND Application Process
  51.6 Drug Development and Approval Time Frame
  51.7 NDA Process
  51.8 U.S. Pharmacopeia and FDA
  51.9 CDER Freedom of Information Electronic Reading Room
  51.10 Conclusion

52 The Kappa Index
  52.1 Introduction
  52.2 The Kappa Index
  52.3 Inference for Kappa via Generalized Estimating Equations
  52.4 The Dependence of Kappa on Marginal Rates
  52.5 General Remarks
  References

53 Treatment Interruption
  53.1 Introduction
  53.2 Therapeutic TI Studies in HIV/AIDS
  53.3 Management of Chronic Disease
  53.4 Analytic Treatment Interruption in Therapeutic Vaccine Trials
  53.5 Randomized Discontinuation Designs
  53.6 Final Comments
  References

54 Trial Reports: Improving Reporting, Minimizing Bias, and Producing Better Evidence-Based Practice
  54.1 Introduction
  54.2 Reporting Issues in Clinical Trials
  54.3 Moral Obligation to Improve the Reporting of Trials
  54.4 Consequences of Poor Reporting of Trials
  54.5 Distinguishing Between Methodological and Reporting Issues
  54.6 One Solution to Poor Reporting: CONSORT 2010 and CONSORT Extensions
  54.7 Impact of CONSORT
  54.8 Guidance for Reporting Randomized Trial Protocols: SPIRIT
  54.9 Trial Registration
  54.10 Final Thoughts
  References

55 U.S. Department of Veterans Affairs Cooperative Studies Program
  55.1 Introduction
  55.2 History of the Cooperative Studies Program (CSP)
  55.3 Organization and Functioning of the CSP
  55.4 Roles of the Biostatistician and Pharmacist in the CSP
  55.5 Ongoing and Completed Cooperative Studies (1972-2000)
  55.6 Current Challenges and Opportunities
  55.7 Concluding Remarks
  References

56 Women's Health Initiative: Statistical Aspects and Selected Early Results
  56.1 Introduction
  56.2 WHI Clinical Trial and Observational Study
  56.3 Study Organization
  56.4 Principal Clinical Trial Comparisons, Power Calculations, and Safety and Data Monitoring
  56.5 Biomarkers and Intermediate Outcomes
  56.6 Data Management and Computing Infrastructure
  56.7 Quality Assurance Program Overview
  56.8 Early Results from the WHI Clinical Trial
  56.9 Summary and Discussion
  References

57 World Health Organization (WHO): Global Health Situation
  57.1 Introduction
  57.2 Program Activities to the End of the Twentieth Century
  57.3 Vision for the Use and Generation of Data in the First Quarter of the Twenty-First Century
  Reference
  Further Reading

Index

Contributors

Chul Ahn, University of Texas Southwestern Medical Center, Dallas, TX, chul.ahn@utsouthwestern.edu

Per Kragh Andersen, University of Copenhagen, Copenhagen, Denmark, pka@biostat.ku.dk

Garnet L. Anderson, Fred Hutchinson Cancer Research Center, Seattle, WA, garnet@whi.org

Edgar Brunner, Professor Emeritus of Biostatistics, University Medical Center, Göttingen, Germany, Edgar.Brunner@ams.med.uni-goettingen.de

Jürgen B. Bulitta, State University of New York at Buffalo, Buffalo, NY, Jurgen.Bulitta@monash.edu

Jianwen Cai, University of North Carolina, Chapel Hill, NC, [email protected]

Patrizio Capasso, University of Kentucky, Lexington, KY, patriziocapasso@aol.com

Robert C. Capen, Merck Research Laboratories, West Point, PA

Jhelum Chakravorty, McGill University, Montreal, QC, Canada, jhelum.chakravorty@mail.mcgill.ca

Chi Wan Chen, Pfizer Inc., New York, NY

Shein-Chung Chow, Duke University, Durham, NC, sheinchung.chow@duke.edu

David H. Christiansen, Christiansen Consulting, Boise, ID

Joseph F. Collins

Jason T. Connor, Berry Consultants, Orlando, FL, jason@berryconsultants.com

Richard J. Cook, University of Waterloo, Waterloo, ON, Canada, rjcook@uwaterloo.ca

Xiangqin Cui, University of Alabama at Birmingham, Birmingham, AL, [email protected]

C. B. Dean, Western University, Western Science Centre, London, ON, Canada, dean@stats.uwo.ca

Yu Deng, University of North Carolina, Chapel Hill, NC, [email protected]

Diane L. Fairclough, University of Colorado Health Sciences Center, Denver, CO, [email protected]

John R. Feussner, Medical University of South Carolina, Charleston, SC

Boris Freidlin, National Cancer Institute, Bethesda, MD, freidlinb@ctep.nci.nih.gov

Patricia A. Ganz, UCLA Jonsson Comprehensive Cancer Center, Los Angeles, CA, [email protected]

Courtney Gray-McGuire, Case Western Reserve University, Cleveland, OH, courtney.gray-mcguire@case.edu

Birgit Grund, University of Minnesota, Minneapolis, MN, [email protected]

Kilem L. Gwet, Advanced Analytics, LLC, Gaithersburg, MD, gwet62@gmail.com

H. R. Hapsara, World Health Organization, Geneva, Switzerland, [email protected]

William R. Hendee, Medical College of Wisconsin, Milwaukee, WI, whendee@mcw.edu

William G. Henderson

Tim Hesterberg, Insightful Corporation, Seattle, WA

Nicholas H. G. Holford, University of Auckland, Auckland, New Zealand, n.Holford@auckland.ac.nz

Norbert Holländer, University Hospital of Freiburg, Freiburg, Germany, norbert.hollaender@novartis.com

David W. Hosmer, University of Massachusetts, Amherst, MA, hosmer@schoolph.umass.edu

Alan D. Hutson, University at Buffalo, Buffalo, NY, [email protected]

Peter B. Imrey, Cleveland Clinic, Cleveland, OH, [email protected]

Elizabeth Juarez-Colunga, University of Colorado Denver, Aurora, CO, elizabeth.juarezcolunga@ucdenver.edu

Seung-Ho Kang, Ewha Woman's University, Seoul, South Korea

Jorg Kaufmann, AG Schering SBU Diagnostics & Radiopharmaceuticals, Berlin, Germany

Niels Keiding, University of Copenhagen, Copenhagen, Denmark, [email protected]

Celestin C. Kokonendji, University of Franche-Comté, Besançon, France, [email protected]

Helena Chmura Kraemer, Stanford University, Palo Alto, CA, hckhome@pacbell.net

John M. Lachin, George Washington University, Washington, DC, [email protected]

Philip W. Lavori, Stanford University School of Medicine, Stanford, CA, lavori@stanford.edu

Morven Leese, Institute of Psychiatry—Health Services and Population Research Department, London, UK

Stanley Lemeshow, Ohio State University, Columbus, OH, lemeshow.1@osu.edu

Jason J. Z. Liao, Merck Research Laboratories, West Point, PA, Jason.Liao@tevausa.com

Tsae-Yun Daphne Lin, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Rockville, MD, daphne.lin@fda.hhs.gov

Qing Lu, Michigan State University, East Lansing, MI, [email protected]

Aditya Mahajan, McGill University, Montreal, QC, Canada, aditya.mahajan@mcgill.ca

Michael A. McIsaac, University of Waterloo, Waterloo, ON, Canada, mamcisaa@uwaterloo.ca

David Moher, Clinical Epidemiology Program, Ottawa Hospital Research Institute, Ottawa, ON, Canada and Department of Epidemiology and Community Medicine, University of Ottawa, Ottawa, ON, Canada, [email protected]

Grier P. Page, RTI International, Research Triangle Park, NC, [email protected]

Peter Peduzzi, Yale School of Public Health, New Haven, CT, peter.peduzzi@yale.edu

Ross L. Prentice, Fred Hutchinson Cancer Research Center, Seattle, WA, rprentic@fhcrc.org

Philip C. Prorok, National Institutes of Health, Bethesda, MD, Philip.Prorok@nih.hhs.gov

Michael A. Proschan, National Institute of Allergy and Infectious Diseases, Bethesda, MD, ProschaM@mail.nih.gov

Frank Rockhold, GlaxoSmithKline R&D, King of Prussia, PA, frank.w.rockhold@gsk.com

Hannah R. Rothstein, City University of New York, NY, Hannah.Rothstein@baruch.cuny.edu

W. Janusz Rzeszotarski, U.S. Food and Drug Administration, Rockville, MD

Mike R. Sather, Department of Veterans Affairs, Albuquerque, NM, mike.sather@va.gov

Tony Segreti, Research Triangle Institute, Research Triangle, NC

Larissa Shamseer, Clinical Epidemiology Program, Ottawa Hospital Research Institute, Ottawa, ON, Canada and Department of Epidemiology and Community Medicine, University of Ottawa, Ottawa, ON, Canada, lshamseer@ohri.ca

Joanna H. Shih, National Cancer Institute, Bethesda, MD

Richard M. Simon, National Cancer Institute, Bethesda, MD, [email protected]

Yeunjoo Song, Case Western Reserve University, Cleveland, OH

Chris Stevenson, Monash University, Victoria, Australia, Christopher.Stevenson@monash.edu

Samy Suissa, McGill University, Montreal, QC, Canada, samy.suissa@clinepi.mcgill.ca

Ming T. Tan, Georgetown University, Washington, DC, mtt34@georgetown.edu

Duncan C. Thomas, University of Southern California, Los Angeles, CA, dthomas@usc.edu

Susan Todd, University of Reading, Reading, Berkshire, UK, s.c.todd@reading.ac.uk

Lucy Turner, Clinical Epidemiology Program, Ottawa Hospital Research Institute, Ottawa, ON, Canada, lturner@ohri.ca

Albert Vexler, University at Buffalo, Buffalo, NY, [email protected]

Hansheng Wang, Peking University, Beijing, P. R. China, [email protected]. edu. cn

Xikui Wang, University of Manitoba, Winnipeg, MN, Canada, xikui.wang@umanitoba.ca

C. S. Wayne Weng, Chung Yuan Christian University, Chungli, Taiwan

Andreas Wienke, University Halle-Wittenberg, Halle, Germany, andreas.wienke@uk-halle.de

Anthony B. Wolbarst, University of Kentucky, Lexington, KY, awolbarst2@outlook.com

Andrew R. Wyant, University of Kentucky, Lexington, KY, andrew.wyant@uky.edu

Yang Xie, University of Texas Southwestern Medical Center, Dallas, TX

Jihnhee Yu, University at Buffalo, Buffalo, NY, jinheeyu@buffalo.edu

Antonia Zapf, University Medical Center, Göttingen, Germany, Antonia.Zapf@med.uni-goettingen.de

Donglin Zeng, University of North Carolina, Chapel Hill, NC, dzeng@email.unc.edu

Yinghui Zhou, The University of Reading, Reading, Berkshire, UK

David M. Zucker, Hebrew University of Jerusalem, Jerusalem, Israel, mszucker@mscc.huji.ac.il

Preface

Planning, developing, and implementing clinical trials have become an important and integral part of life. More and more effort and care go into conducting clinical trials, as they have been responsible for key advances in medicine and in treatments for different illnesses. Today, clinical trials have become mandatory in the development and evaluation of modern drugs and in identifying the association of risk factors with diseases. Due to the complexity of the issues surrounding clinical trials, regulatory agencies oversee their approval and also ensure impartial review.

The main purpose of this two-volume handbook is to provide a detailed exposition of historical developments and also to highlight modern advances in methods and analysis for clinical trials. It is important to mention that the four-volume Wiley Encyclopedia of Clinical Trials served as a basis for this handbook. While many pertinent entries from this Encyclopedia have been included here, a number of them have been updated to reflect recent developments on their topics. Some new articles detailing modern advances in statistical methods in clinical trials and their applications have also been included.

A volume of this size and nature cannot be successfully completed without the cooperation and support of the contributing authors, and my sincere thanks and gratitude go to all of them. Thanks are also due to Mr. Steve Quigley and Ms. Sari Friedman (of John Wiley & Sons, Inc.) for their keen interest in this project from day one, as well as for their support and constant encouragement (and, of course, occasional nudges, too) throughout the course of this project. The careful and diligent work of Mrs. Debbie Iscoe in the typesetting of this volume and of Angioline Loredo at the production stage is gratefully acknowledged. Partial financial support of the Natural Sciences and Engineering Research Council of Canada also assisted in the preparation of this handbook, and this support is much appreciated.

This is the seventh in a series of handbooks on methods and applications of statistics. The first handbook focused on life and health sciences; the second on business, finance, and management sciences; the third on engineering, quality control, and physical sciences; the fourth on behavioral and social sciences; the fifth on atmospheric and earth sciences; and the sixth on methods and applications of statistics to clinical trials. This is the second of two volumes describing in detail statistical developments concerning clinical trials, focusing specifically on planning, analysis, and inferential methods.

It is my sincere hope that this handbook and the others in the series will become basic reference resources for those involved in these fields of research!

PROF. N. BALAKRISHNAN
McMASTER UNIVERSITY
Hamilton, Canada
February 2014

1 Analysis of Over- and Underdispersed Data

Elizabeth Juarez-Colunga and C. B. Dean

1.1 Introduction

In the analysis of discrete data, for example, count data analyzed under a Poisson model or binary data analyzed under a binomial model, quite often the empirical variance exceeds the theoretical variance under the presumed model. This phenomenon is called overdispersion. If overdispersion is ignored, standard errors of parameter estimates will be underestimated, and therefore p-values for tests of hypotheses will be too small, leading to incorrectly declaring a predictor as significant when in fact it may not be.

The Poisson and binomial distributions are simple models but have strict assumptions. In particular, they assume a special mean-variance relationship, since each of these distributions is determined by a single parameter. On the other hand, the normal distribution is determined by two parameters, the mean $\mu$ and variance $\sigma^2$, which characterize the location and the spread of the data around the mean. In both the Poisson and binomial distributions, the variance is fixed once the mean or the probability of success has been defined.

Hilbe [25] provides a very comprehensive discussion of what he calls apparent overdispersion, which refers to scenarios in which the data exhibit variation beyond what can be explained by the model and this lack of fit is due to several "fixable" reasons. These reasons may be omitting important predictors in the model, the presence of outliers, omitting important interactions as predictors, the need for a transformation of a predictor, and misspecifying the link function relating the mean response to the predictors. Hilbe [25] also discusses how to recognize overdispersion and how to adjust for it when it is present beyond apparent cases, and provides an excellent overall review of the topic.

It is important to note that if apparent overdispersion has been ruled out, in loglinear or logistic analyses, the point estimates of the covariate effects will be quite similar regardless of whether overdispersion is accounted for or not. Hence, treatment and other effects will not be aberrant or give a hint of the presence of overdispersion. As well, this suggests that adjusting for overdispersion can be handled through adjustments of variance estimates [35]. Evidence of apparent or real overdispersion exists when the Pearson or deviance residuals are too large [6]; the corresponding Pearson and deviance goodness-of-fit statistics indicate a poor fit. Several tests have been developed for overdispersion in the context of Poisson or binomial analyses [11, 12, 54], as well as in the context of zero-heavy data [30, 51, 53, 52].
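As a rough illustration of the goodness-of-fit check described above, the following Python sketch computes the Pearson statistic divided by its degrees of freedom for an intercept-only Poisson model, where the fitted mean is simply the sample mean. The function name and the example counts are hypothetical and are not taken from the chapter. Values well above 1 are commonly read as a sign of overdispersion, but this is a heuristic and not one of the formal tests cited above.

```python
def pearson_dispersion(counts):
    """Pearson chi-square statistic divided by its degrees of freedom
    for an intercept-only Poisson model (fitted mean = sample mean)."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(counts) / n
    if mean <= 0:
        raise ValueError("sample mean must be positive")
    # Under the Poisson model the variance equals the mean, so each
    # squared Pearson residual is (y - mean)^2 / mean.
    pearson_chi2 = sum((y - mean) ** 2 / mean for y in counts)
    return pearson_chi2 / (n - 1)


# Hypothetical counts, for illustration only.
clumped = [0, 0, 1, 1, 2, 8, 0, 12]
regular = [2, 3, 3, 4, 3, 3, 2, 4]

print(pearson_dispersion(clumped))  # ratio far above 1
print(pearson_dispersion(regular))  # ratio below 1
```

With covariates in the model, the same ratio would use the fitted means from the regression and the residual degrees of freedom instead of the sample mean and n - 1.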

1.2 Overdispersed Binomial and Count Models

1.2.1 Overdispersed Binomial Model

In the binomial context, overdispersion typically arises because the independence assumption is violated. This is commonly caused by clustering of responses; for instance, clinics or hospitals may induce a clustering effect due to differences in patient care strategies across institutions. Let $Y_i$ denote a binomial response for cluster $i$, $i = 1, \ldots, M$, which results from the sum of $m_i$ binary outcomes $Y_{ij}$, that is, $Y_i = \sum_{j=1}^{m_i} Y_{ij}$, where $j$ denotes individual $j$, $j = 1, \ldots, m_i$. If the $Y_{ij}$ are independent binary variables taking values 0 or 1 with probabilities $(1 - p_i)$ and $p_i$, respectively, then $E(Y_i) = m_i p_i$ and $\mathrm{var}(Y_i) = m_i p_i (1 - p_i)$. If there exists correlation between two responses in any given cluster, with $\mathrm{corr}(Y_{ij}, Y_{ik}) = \psi > 0$, then

$$E(Y_i) = m_i p_i \quad \text{and} \quad \mathrm{var}(Y_i) = m_i p_i (1 - p_i)\,[1 + \psi(m_i - 1)], \qquad (1)$$

leading to overdispersion. Note that $\psi < 0$ leads to underdispersion. If we consider the $p_i$ as random variables with $E(p_i) = \pi$ and $\mathrm{var}(p_i) = \psi\pi(1 - \pi)$, then the unconditional mean and variance also have the form of (1). If we further assume that the $p_i$ follow a Beta($\alpha$, $\beta$) distribution, the distribution of $Y_i$ is the so-called beta-binomial distribution, which has been studied extensively (see, for example, Hinde and Demetrio [26] and Molenberghs et al. [37]).
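The variance formula (1) can be evaluated directly. The sketch below is a minimal illustration, with hypothetical function names, of how the inflation factor $1 + \psi(m_i - 1)$ scales the binomial variance. It also uses the standard fact that a Beta($\alpha$, $\beta$) variable has mean $\alpha/(\alpha + \beta)$ and variance $\pi(1 - \pi)/(\alpha + \beta + 1)$, so that $\psi = 1/(\alpha + \beta + 1)$ in the beta-binomial case; this mapping is stated here as an aid and is not spelled out in the text above.

```python
def clustered_binomial_variance(m, p, psi):
    """Variance of a cluster total under Equation (1):
    m * p * (1 - p) * [1 + psi * (m - 1)]."""
    return m * p * (1 - p) * (1 + psi * (m - 1))


def beta_binomial_moments(m, alpha, beta):
    """Mean and variance of a beta-binomial count with cluster size m,
    written in the form of Equation (1)."""
    pi = alpha / (alpha + beta)         # E(p_i)
    psi = 1.0 / (alpha + beta + 1.0)    # var(p_i) = psi * pi * (1 - pi)
    return m * pi, clustered_binomial_variance(m, pi, psi)


# Hypothetical values: cluster size 10, success probability 0.3.
independent = clustered_binomial_variance(10, 0.3, 0.0)  # plain binomial
correlated = clustered_binomial_variance(10, 0.3, 0.1)   # inflated by 1.9
print(independent, correlated)
print(beta_binomial_moments(10, 2.0, 6.0))
```

Setting `psi` to zero recovers the ordinary binomial variance, which makes the size of the inflation easy to see for a given cluster size.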

1.2.2 Overdispersed Poisson Model

Poisson and overdispersed Poisson data are examples of data from counting processes that arise when individuals experience repeated occurrences of events over time. Such data are known as recurrent event data (see, for example, Cook and Lawless [10] and Juarez-Colunga [29]). Consider $M$ individuals, each monitored for occurrence of events from a start time 0 through time $\tau_i$, called the termination time, $i = 1, \ldots, M$. Let $\{N_i(t), t \ge 0\}$ be the right-continuous counting process that records the number of events for individual $i$ over the interval $[0, t]$. The termination time is here assumed to be independent of the counting process $\{N_i(t), t \ge 0\}$. Let the intensity of the counting process be $\lambda_i(t \mid H_i(t))$, where $H_i(t) = \{N_i(s) : 0 \le s < t\}$ represents the history of the process up to time $t$. This intensity represents the instantaneous probability of occurrence of an event at time $t$. If the counting process is Poisson, given the memoryless property of the Poisson process, the intensity only depends on the history through $t$, $\lambda_i(t \mid H_i(t)) = \lambda_i(t)$, and the expected number of events over the entire follow-up can be written as $\mu_{i+} = \int_0^{\tau_i} \lambda_i(t)\,dt$. Let the total number of events in the entire follow-up be $n_{i+}$ for individual $i$; then $n_{i+}$ follows a Poisson distribution with mean $\mu_{i+} = E(n_{i+}) = \mathrm{var}(n_{i+})$.

Two types of data are common in counting processes, and we will consider both here in the context of overdispersion: (1) individual $i$ gives rise to $n_{i+}$ event times recorded as $t_{i1} < t_{i2} < \cdots < t_{i n_{i+}} < \tau_i$, and (2) only counts within specific follow-up times $0 = T_{i,0} < T_{i,1} < \cdots < T_{i,e_i} = \tau_i$ are available; these are called panel counts and are denoted $n_{ip} = N_i(T_{i,p}) - N_i(T_{i,p-1})$, $p = 1, 2, \ldots, e_i$, with the total aggregated count for individual $i$ denoted $n_{i+} = \sum_{p=1}^{e_i} n_{ip}$.

A simple way to incorporate overdispersion is through the use of an individual-specific random effect $\nu_i$. Given $\nu_i$ and the covariate vector $x_i$ corresponding to the $i$th individual, the counting process $N_i(t)$ may be modeled as a Poisson process with intensity function

$$\lambda_i(t \mid \nu_i, x_i) = \nu_i\,\rho(t; \alpha)\exp(x_i'\beta), \qquad (2)$$

where $\rho$ is a twice-differentiable baseline intensity function, depending on the parameter $\alpha$, and $\beta$ are the regression effects. We may take $E(\nu_i) = 1$ without loss of generality, and let $\mathrm{var}(\nu_i) = \phi$. The function $\lambda(t; x)$ is now interpreted as a population average rate function among subjects with covariate vector $x$, since $E(dN(t) \mid x) = \lambda(t; x)\,dt$. In addition to representing covariates unaccounted for, $\nu_i$ may also be a cluster effect, taking the same value for all individuals within the same cluster. This can be used to account for unknown clinic effects, for example, where individuals are patients clustered within clinics. When $\nu_i$ follows a gamma distribution, the marginal distribution of $n_{i+}$ is negative binomial. The variance of the count of total aggregated events $n_{i+}$ has the form $E(n_{i+}) + \phi E(n_{i+})^2$. Let the expected number of events over the entire follow-up $[0, \tau_i]$ be $\mu_{i+} = R_i \exp(x_i'\beta)$, where $R_i = \int_0^{\tau_i} \rho(t; \alpha)\,dt$ is called the cumulative baseline intensity function. Similarly, defining the cumulative baseline intensity function in panel period $p$ as $R_{ip} = \int_{T_{i,p-1}}^{T_{i,p}} \rho(t; \alpha)\,dt$, we have $\mu_{ip} = E(n_{ip})$

[...]

0 to overdispersion, and $\delta = 0$ reduces to the Poisson distribution]; (4) so-called COM-Poisson models, which are a generalization of the Poisson with one more parameter ($\nu$) that allows it to represent under- and overdispersion with respect to the Poisson [45, 44] [they can also be seen as a weighted Poisson with weights $(k!)^{1-\nu}$]. Recently, Sellers et al. [43] provided a survey of the methods and applications related to the COM-Poisson models. Grunwald et al. [21] propose a birth-event process approach to model correlated over- or underdispersed data; this model can handle correlation due to clustering or serial correlation.
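The weighted-Poisson view of the COM-Poisson model mentioned above can be checked numerically. The sketch below assumes the usual parameterization in which the probability of count $k$ is proportional to $\lambda^k/(k!)^\nu$, which is the Poisson kernel $\lambda^k/k!$ multiplied by the weight $(k!)^{1-\nu}$. The function names, the truncation point `max_k`, and the parameter values are illustrative choices and not part of the chapter.

```python
import math


def com_poisson_pmf(lam, nu, max_k=200):
    """COM-Poisson probabilities for k = 0..max_k, normalized over the
    truncated support. Works on the log scale to avoid overflow."""
    log_w = [k * math.log(lam) - nu * math.lgamma(k + 1)
             for k in range(max_k + 1)]
    shift = max(log_w)
    weights = [math.exp(v - shift) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]


def mean_and_variance(pmf):
    """Mean and variance of a distribution on 0, 1, 2, ..."""
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var


for nu in (0.5, 1.0, 2.0):
    mean, var = mean_and_variance(com_poisson_pmf(3.0, nu))
    # nu < 1: variance above the mean; nu = 1: Poisson; nu > 1: below.
    print(nu, round(mean, 3), round(var, 3))
```

The truncation at `max_k` is adequate for the moderate values of `lam` used here; a larger rate or a much smaller `nu` would call for a larger cutoff.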

1.5 Software Notes

Software for incorporating overdispersion includes SAS [42], using, for instance, the procedures LOGISTIC, GENMOD, GLIMMIX, and NLMIXED, and R [39], using, for example, the functions glm and lmer and the package lme4. Parametric mixture models can also be fitted in the MCMC framework using WinBUGS [34], OpenBUGS [33], JAGS [38], or the package mcmc in R.

References

[1] L. M. Ainsworth. Models and Methods for Spatial Data: Detecting Outliers and Handling Zero-Inflated Counts. PhD thesis, Simon Fraser University, 2007.
[2] D. F. Andrews and A. M. Herzberg. Data: A Collection of Problems from Many Fields for the Student and Research Worker. Springer-Verlag, New York, 2000.
[3] Dankmar Bohning and Wilfried Seidel. Editorial: recent developments in mixture models. Computational Statistics & Data Analysis, 41(3-4):349-357, January 2003.
[4] James G. Booth, George Casella, Herwig Friedl, and James P. Hobert. Negative binomial loglinear mixed models. Statistical Modelling, 3(3):179-191, October 2003.
[5] Ronald J. Bosch and Louise M. Ryan. Generalized Poisson models arising from Markov processes. Statistics & Probability Letters, 39(3):205-212, August 1998.
[6] N. E. Breslow. Generalized linear models: checking assumptions and strengthening conclusions. Statistica Applicata, 8:23-41, 1996.
[7] Anne Buu, Runze Li, Xianming Tan, and Robert A. Zucker. Statistical models for longitudinal zero-inflated count data with applications to the substance abuse field. Statistics in Medicine, 31(29):4074-4086, July 2012.
[8] D. Byar, C. Blackard, and the Veterans Administration Co-operative Urological Research Group. Comparisons of placebo, pyridoxine, and topical thiotepa in preventing recurrence of stage I bladder cancer. Urology, 10:556-561, 1977.
[9] A. C. Cameron and P. Johansson. Count data regression using series expansions: with applications. Journal of Applied Econometrics, 12:203-223, 1997.
[10] R. J. Cook and J. F. Lawless. The Statistical Analysis of Recurrent Events. Springer, New York, 2007.
[11] C. Dean and J. F. Lawless. Tests for detecting overdispersion in Poisson regression models. Journal of the American Statistical Association, 84(406):467-472, June 1989.
[12] C. B. Dean. Testing for overdispersion in Poisson and binomial regression models. Journal of the American Statistical Association, 87(418):451-457, June 1992.
[13] Joan Del Castillo and Marta Pérez-Casany. Weighted Poisson distributions for overdispersion and underdispersion situations. Annals of the Institute of Statistical Mathematics, 50(3):567-585, 1998.
[14] Melissa J. Dobbie and A. H. Welsh. Modelling correlated zero-inflated count data. Australian & New Zealand Journal of Statistics, 43(4):431-444, December 2001.
[15] Bradley Efron. Double exponential families and their use in generalized linear regression. Journal of the American Statistical Association, 81(395):709-721, September 1986.
[16] M. J. Faddy. Extended Poisson process modelling and analysis of count data. Biometrical Journal, 39(4):431-440, 1997.
[17] M. J. Faddy and R. J. Bosch. Likelihood-based modeling and analysis of data underdispersed relative to the Poisson distribution. Biometrics, 57(2):620-624, June 2001.
[18] M. Fiocco, H. Putter, and J. C. Van Houwelingen. A new serially correlated gamma-frailty process for longitudinal count data. Biostatistics, 10(2):245-257, April 2009.
[19] Aldo M. Garay, Elizabeth M. Hashimoto, Edwin M. M. Ortega, and Victor H. Lachos. On estimation and influence diagnostics for zero-inflated negative binomial regression models. Computational Statistics & Data Analysis, 55(3):1304-1318, 2011.
[20] Piet Groeneboom, Geurt Jongbloed, and Jon A. Wellner. The support reduction algorithm for computing nonparametric function estimates in mixture models. Scandinavian Journal of Statistics, 35(3):385-399, September 2008.
[21] Gary K. Grunwald, Stephanie L. Bruce, Luohua Jiang, Matthew Strand, and Nathan Rabinovitch. A statistical model for under- or overdispersed clustered and longitudinal count data. Biometrical Journal, 53(4):578-594, June 2011.
[22] D. B. Hall. Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics, 56(4):1030-1039, December 2000.
[23] J. L. Hay and A. N. Pettitt. Bayesian analysis of a time series of counts with covariates: an application to the control of an infectious disease. Biostatistics, 2(4):433-444, December 2001.
[24] Robin Henderson and Silvia Shimakura. A serially correlated gamma frailty model for longitudinal count data. Biometrika, 90(2):355-366, June 2003.
[25] J. M. Hilbe. Negative Binomial Regression. Cambridge University Press, New York, 2nd edition, 2011.
[26] John Hinde and Clarice G. B. Demetrio. Overdispersion: Models and estimation. Computational Statistics & Data Analysis, 27(2):151-170, April 1998.
[27] S. Iddi and G. Molenberghs. A combined overdispersed and marginalized multilevel model. Computational Statistics & Data Analysis, 56(6):1944-1951, June 2012.
[28] Vandna Jowaheer and Brajendra C. Sutradhar. Analysing longitudinal count data with overdispersion. Biometrika, 89(2):389-399, June 2002.
[29] E. Juarez-Colunga. Recurrent Event Studies: Efficient Panel Designs and Joint Modeling of Events and Severities. PhD thesis, Simon Fraser University, 2011.
[30] Byoung Cheol Jung, Myoungshic Jhun, and Jae Won Lee. Bootstrap tests for overdispersion in a zero-inflated Poisson regression model. Biometrics, 61(2):626-628, June 2005.
[31] Kung-Yee Liang and Scott L. Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 73(1):13-22, 1986.
[32] Bruce G. Lindsay. Mixture models: theory, geometry and applications. In NSF-CBMS Regional Conference Series in Probability and Statistics, volume 5 of Institute of Mathematical Statistics, Hayward, 1995.
[33] David D. Lunn, David D. Spiegelhalter, Andrew A. Thomas, and Nicky N. Best. The BUGS project: Evolution, critique and future directions. Statistics in Medicine, 28(25):3049-3067, November 2009.
[34] David J. Lunn, Andrew Thomas, Nicky Best, and David Spiegelhalter. WinBUGS - A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10(4):325-337, 2000.
[35] Peter McCullagh and James A. Nelder. Generalized Linear Models. Chapman Hall, London, 2nd edition, 1989.
[36] Geert Molenberghs, Geert Verbeke, and Clarice G. B. Demetrio. An extended random-effects approach to modeling repeated, overdispersed count data. Lifetime Data Analysis, 13(4):513-531, December 2007.
[37] Geert Molenberghs, Geert Verbeke, Clarice G. B. Demetrio, and Afranio M. C. Vieira. A family of generalized linear models for repeated measures with normal and conjugate random effects. Statistical Science, 25(3):325-347, August 2010.
[38] Martyn Plummer. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling, 2012.
[39] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013.
[40] M. Ridout, C. G. B. Demetrio, and J. Hinde. Models for count data with many zeros. Proceedings of the XIXth International Biometric Conference. Cape Town, 1998.
[41] M. S. Ridout and P. Besbeas. An empirical model for underdispersed count data. Statistical Modelling, 4(1):77-89, April 2004.
[42] SAS Institute Inc. SAS/STAT Software, Version 9.3. Cary, NC, 2011.
[43] K. F. Sellers, S. Borle, and G. Shmueli. The COM-Poisson model for count data: a survey of methods and applications. Applied Stochastic Models in Business and Industry, 28:104-116, 2012.
[44] Kimberly F. Sellers and Galit Shmueli. A flexible regression model for count data. Annals of Applied Statistics, 4(2):943-961, November 2010.
[45] Galit Shmueli, Thomas P. Minka, Joseph B. Kadane, Sharad Borle, and Peter Boatwright. A useful distribution for fitting discrete data: Revival of the Conway-Maxwell-Poisson distribution. Journal of the Royal Statistical Society, Series C (Applied Statistics), 54(1):127-142, January 2005.
[46] Brajendra C. Sutradhar. Dynamic Mixed Models for Familial Longitudinal Data. Springer, New York, 2011.
[47] Francis Tuerlinckx, Frank Rijmen, Geert Verbeke, and Paul De Boeck. Statistical inference in generalized linear mixed models: a review. British Journal of Mathematical and Statistical Psychology, 59(Pt 2):225-255, November 2006.
[48] Wai Yin Wan and Jennifer S. K. Chan. A new approach for handling longitudinal count data with zero-inflation and overdispersion: Poisson geometric process model. Biometrical Journal, 51(4):556-570, August 2009.
[49] Yong Wang. On fast computation of the non-parametric maximum likelihood estimate of a mixing distribution. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 69(2):185-198, January 2007.
[50] Yong Wang. Maximum likelihood computation for fitting semiparametric mixture models. Statistics and Computing, 20(1):75-86, March 2009.
[51] Liming Xiang, Andy H. Lee, Kelvin K. W. Yau, and Geoffrey J. McLachlan. A score test for overdispersion in zero-inflated Poisson mixed regression model. Statistics in Medicine, 26(7):1608-1622, 2007.
[52] F.-C. Xie, B.-C. Wei, and J.-G. Lin. Score tests for zero-inflated generalized Poisson mixed regression models. Computational Statistics & Data Analysis, 53(9):3478-3489, July 2009.
[53] Zhao Yang, James W. Hardin, and Cheryl L. Addy. Testing overdispersion in the zero-inflated Poisson model. Journal of Statistical Planning and Inference, 139(9):3340-3353, September 2009.
[54] Zhao Z. Yang, James W. Hardin, Cheryl L. Addy, and Quang H. Vuong. Testing approaches for overdispersion in Poisson regression versus the generalized Poisson model. Biometrical Journal, 49(4):565-584, August 2007.

2 Analysis of Variance (ANOVA)

Jorg Kaufman

2.1 Introduction

The development of analysis of variance (ANOVA) methodology has in turn had an influence on the types of experimental research being carried out in many fields. ANOVA is one of the most commonly used statistical techniques, with applications across the full spectrum of experiments in agriculture, biology, chemistry, toxicology, pharmaceutical research, clinical development, psychology, social science, and engineering. The procedure involves the separation of total observed variation in the data into individual components attributable to various factors as well as those caused by random or chance fluctuation. It allows performing hypothesis tests of significance to determine which factors influence the outcome of the experiment. However, although hypothesis testing is certainly a very useful feature of the ANOVA, it is by no means the only aspect. The methodology was originally developed by Sir Ronald A. Fisher [1], the pioneer and innovator of the use and applications of statistical methods in experimental design, who coined the name "analysis of variance" (ANOVA).

For most biological phenomena, inherent variability exists within the response processes of treated subjects as well as among the conditions under which treatment is received, which results in sampling variability, meaning that results for a subject included in a study will differ to some extent from those of other subjects in the affected population. Thus, the sources of variability must be investigated and suitably taken into account if data from comparative studies are to be evaluated correctly. Clinical studies are a particularly fruitful field for the application of this methodology.

The basis for generalizability of a successful clinical trial is strengthened when the coverage of a study is as broad as possible with respect to geographical area, patient demographics, and pretreatment characteristics as well as other factors that are potentially associated with the response variables. At the same time, heterogeneity among patients becomes more extensive and conflicts with the precision of statistical estimates, which is usually enhanced by homogeneity of subjects. The methodology of the ANOVA is a means to structure the data and their validation by accounting for the sources of variability such that homogeneity is regained in subsets of subjects and heterogeneity is attributed to the relevant factors. The ANOVA method is based on the use of sums of squares of the deviations of the observations from respective means (see Linear Model). The tradition of arraying sums of squares and resulting F-statistics in an ANOVA table is so firmly entrenched in the analysis of balanced data that extension of the analysis to unbalanced data is necessary. For unbalanced data, many different sums of squares can be defined and then be used in the numerators of F-statistics, providing tests for a wide variety of hypotheses. In order to provide a practically relevant and useful approach, the ANOVA through the cell means model is introduced below.

The concept of the cell means model was introduced by Searle [2,3], Hocking and Speed [4], and Hocking [5] to resolve some of the confusion associated with ANOVA models with unbalanced data. The simplicity of such a model is readily apparent: no confusion exists on which functions are estimable, what their estimators are, and what hypotheses can be tested. The cell means model is conceptually easier, it is useful for understanding the ANOVA models, and it is, from the sampling point of view, the appropriate model to use. In many applications, the statistical analysis is characterized by the fact that a number of detailed questions need to be answered. Even if an overall test is significant, further analyses are, in general, necessary to assess specific differences in the treatments. The cell means model provides, within the ANOVA framework, the appropriate model for correct statistical inference and thus provides honest statements on statistical significance in a clinical investigation.

2.2 Factors, Levels, Effects, and Cells

One of the principal uses of statistical models is to explain variation in measurements. This variation may be caused by the variety of factors of influence, and it manifests itself as variation from one experimental unit to another. In well-controlled clinical studies, the sponsor deliberately changes the levels of experimental factors (e.g., treatment) to induce variation in the measured quantities, leading to a better understanding of the relationship between those experimental factors and the response. Those factors are called independent variables, and the measured quantities are called dependent variables. For example, consider a clinical trial in which three different diagnostic imaging modalities are used on both men and women in different centers. Table 1 shows schematically how the resulting data could be arrayed in a tabular fashion.

Table 1: Factors, Levels, and Cells (i j k)

                          Treatment (k)
Center (i)   Sex (j)      T1      T2              T3
1            male
1            female
2            male                 cell (2 1 2)
2            female
3            male
3            female

The three elements used for classification (center, sex, and treatment) identify the source of variation of each datum and are called factors. The individual classes of the classifications are the levels of the factor (e.g., the three different treatments T1, T2, and T3 are the three levels of the factor treatment). Male and female are the two levels of the factor sex, and center 1, center 2, and center 3 are the three levels of the factor center. A subset of the data present for a "combination" of one level of each factor under investigation is considered a cell of the data. Thus, with the three factors center (3 levels), sex (2 levels), and treatment (3 levels), $3 \times 2 \times 3 = 18$ cells, numbered by the triple index $(i, j, k)$, exist. Repeated measurements in one cell may exist, which they usually do. Unbalanced data occur when the number of repeated observations per cell $n_{ijk}$ differs for at least some of the indices $(i, j, k)$. In clinical research, this occurrence is the rule rather than the exception. One obvious reason could be missing data in an experiment. Restricted availability of patients for a specific factor combination is another often-experienced reason.
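The cell structure described above can be enumerated mechanically. The following Python sketch lists the 18 cells of the center by sex by treatment layout as index triples and checks a set of per-cell counts for balance. The variable names and the example counts are hypothetical; only the factor sizes (3, 2, and 3 levels) come from the text.

```python
from itertools import product

# Levels of the three factors, indexed as in the text: center i, sex j,
# treatment k.
centers = [1, 2, 3]
sexes = [1, 2]          # 1 = male, 2 = female
treatments = [1, 2, 3]  # T1, T2, T3

# Every combination of one level per factor is a cell (i, j, k).
cells = list(product(centers, sexes, treatments))


def is_balanced(counts):
    """True when every cell holds the same number of observations."""
    return len(set(counts.values())) == 1


# Hypothetical per-cell sample sizes n_ijk, for illustration only.
balanced_counts = {cell: 5 for cell in cells}
unbalanced_counts = dict(balanced_counts)
unbalanced_counts[(2, 1, 2)] = 3  # e.g., two missing observations

print(len(cells))
print(is_balanced(balanced_counts), is_balanced(unbalanced_counts))
```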

2.3 Cell Means Model

A customary practice since the seminal work of R. A. Fisher has been that of writing a model equation as a vehicle for describing ANOVA procedures. The cell means model is now introduced via a simple example in which only two treatments and no further factors are considered. Suppose that $y_{ir}$, $i = 1, 2$, $r = 1, \ldots, n_i$, represents a random sample from two normal populations with means $\mu_1$ and $\mu_2$ and common variance $\sigma^2$. The data point $y_{ir}$ denotes the $r$th observation on the $i$th population of size $n_i$, and its value is assumed to follow a Gaussian normal distribution: $y_{ir} \sim N(\mu_i, \sigma^2)$. The fact that the sizes $n_1$ and $n_2$ of the two populations differ indicates that a situation of unbalanced data exists. In linear model form, it is written

$$y_{ir} = \mu_i + e_{ir}, \quad i = 1, 2; \; r = 1, \ldots, n_i, \qquad (1)$$

where the errors $e_{ir}$ are independent and identically distributed (i.i.d.) $N(0, \sigma^2)$ variables. Note that a model consists of more than just a model equation: it is an equation such as Equation 1 plus statements that describe the terms of the equation. In the example above, $\mu_i$ is defined as the population mean of the $i$th population and is equal to the expectation of $y_{ir}$:

$$E(y_{ir}) = \mu_i, \quad r = 1, \ldots, n_i, \; \text{for } i = 1, 2. \qquad (2)$$

The difference $y_{ir} - E(y_{ir}) = y_{ir} - \mu_i = e_{ir}$ is the deviation of the observed $y_{ir}$ value from the expected value $E(y_{ir})$. This deviation, denoted $e_{ir}$, is called the error term or residual error term, and from the introduction of the model above, it is a random variable with expectation zero and variance $v(e_{ir}) = \sigma^2$. Note that, in the model above, the variance is assumed to be the same for all $e_{ir}$. The cell means model can now be summarized as follows:

$$\begin{aligned}
y_{ir} &= \mu_i + e_{ir} \\
E(y_{ir}) &= \mu_i \\
E(e_{ir}) &= 0 \\
v(e_{ir}) &= \sigma^2 \quad \text{for all } i \text{ and } r.
\end{aligned} \qquad (3)$$

Note that Equation 3 does not explicitly assume the Gaussian normal distribution. In fact, one can formulate the cell means model more generally by only specifying means and variances. Below, however, it is restricted to the special assumption $e_{ir} \sim$ i.i.d. $N(0, \sigma^2)$. It will be shown that the cell means model can be used to describe any of the models that are classically known as ANOVA models.
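A minimal sketch of fitting the two-population cell means model may help fix ideas. It assumes the usual estimators, which the section above does not state explicitly: each $\mu_i$ is estimated by the sample mean of cell $i$, the residuals are the deviations from that mean, and the common variance is estimated by the pooled residual sum of squares divided by the total sample size minus the number of cells. The function name and the data are hypothetical.

```python
def fit_cell_means(samples):
    """Fit the cell means model y_ir = mu_i + e_ir.

    samples maps a cell label to its list of observations; the cells
    may have different sizes (unbalanced data).
    Returns (cell means, residuals, pooled variance estimate).
    """
    means = {cell: sum(ys) / len(ys) for cell, ys in samples.items()}
    residuals = {cell: [y - means[cell] for y in ys]
                 for cell, ys in samples.items()}
    # Pooled error sum of squares over all cells.
    sse = sum(e ** 2 for es in residuals.values() for e in es)
    n_total = sum(len(ys) for ys in samples.values())
    sigma2_hat = sse / (n_total - len(samples))
    return means, residuals, sigma2_hat


# Hypothetical unbalanced data: n_1 = 3, n_2 = 4.
data = {"population 1": [1.0, 2.0, 3.0],
        "population 2": [2.0, 4.0, 6.0, 8.0]}
means, residuals, sigma2_hat = fit_cell_means(data)
print(means)
print(sigma2_hat)
```

Because each cell is fitted by its own mean, the residuals within every cell sum to zero, which mirrors the statement $E(e_{ir}) = 0$ in Equation 3.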

2.4 One-Way Classification

Let us begin with the case of a cell means model in which the populations are identified by a single factor with $I$ levels and $n_i$ observations at the $i$th level for $i = 1, \ldots, I$.

2.4.1 Example 1

A clinical study was conducted to compare the effectiveness of three different doses of a new drug and placebo for treating patients with high blood pressure. For the study, 40 patients were included. To control for unknown sources of variation, 10 patients each were assigned at random to the four treatment groups. As response, the study considered the difference in diastolic blood pressure between baseline (pre-value) and the measurement 4 weeks after administration of treatment. The response measurements $y_{ir}$, sample means $\bar{y}_{i\cdot}$, and sample variances $s_i^2$ are shown in Table 2. The cell means model to analyze the data of example 1 is then given as

$$y_{ir} = \mu_i + e_{ir}, \quad i = 1, 2, 3, 4; \; r = 1, \ldots, 10, \qquad e_{ir} \sim \text{i.i.d. } N(0, \sigma^2).$$

[...]

The ANOVA technique partitions the variation among observations into two parts: the sum of squared deviations from the model to the overall mean

$$\sum_i \sum_r (\hat\mu_i - \hat\mu)^2$$

and the sum of squared deviations from the observed values $y_{ir}$ to the model

$$\sum_i \sum_r (y_{ir} - \hat\mu_i)^2.$$

[...]

R(fi) = SST-SSE(//) is denoted the reduction in sum of squares because of fitting the model E(yir) = v. The two models E(Vir) = Vi and E(Vir) = V can now be compared in terms of their respective reductions in sum of squares given by R(fa) and R(v)> The difference R(fa) — R(v) ls the extent to which fitting E(yir) = fa brings about a greater reduction in sum of squares than does fitting E ( y i r ) = V-

Obviously, the R(.) notation is a useful mnemonic for comparing different linear models in terms of the extent to which fitting each accounts for a different reduction in the sum of squares. The works of Searle [2,3], Hocking [5], and Littel et al. [6] are recommended for deeper insight. It is now very easy to partition the total sum of squares SST into terms that develop in the ANOVA. Therefore, the identity SST = i?(/i) + (R(fa) + (SST -

R(/j))

R(fa))

= R(li)+R(fa/ri

+ SSE(fa)

+

SSE(fa)

is used with R(vi/v) = R(fa) — R(v)- The separation of Table 3a and Table 3b is appropriate to Equation 12 the first and last line. Table 3a displays the separation into the components attributable to the model v in the first line, to the model fa extent in the second line, to the error term in the third line, and to the total sum of squares in the last line.

15

Table 3b displays only the separation into the two components attributable to the model fa extent /z and, in the second line, to the error term.

2.7

ANOVA—Hypothesis of Equal Means

Consider the following inferences about the cell means. This analysis includes the initial null hypothesis of equal means (global hypotheses, all means simultaneous by the same) so-called ANOVA hypothesis contingent with pairwise comparisons, contrasts, and other linear function, comprising either hypothesis tests or confidence intervals. In starting off with the model E(yij) = /ij, the global null hypothesis H0 : fa = V2 = - - - = Vi

is of general interest. A suitable F-statistic can be used for testing this hypothesis Ho (see standard References [2-6]). The F-statistic testing Ho and the sums of squares of Equation 12 are tabulated in Table 3a and Table 3b, columns 3-5. The primary goal of the experiment in example 1 was to show that the new drug is effective compared with placebo. At first, one may test the global null hypotheses

(12)

or SST - R(n) = R(fa/v)

of Variance (ANOVA)

H0 : Vi = V2 = Vs = VA

with E(yi r ) = Vi- Table 3c shows information for testing the hypothesis Ho. R(fa/fi)

= R(fa) - R(n) = 67.8} is the

difference for the respective reductions in sum of squares for the two models E(yir) — fa and E(yi r ) = /x. From the mean squares 1) = 22.60 and the error term R(vi/v)/(l~ (SST - SSE(Vi))/(n. - I) = a2 = 1.46 one obtains the F-statistic F = 15.5 for testing the null hypothesis Ho and the probability Pr(F > Fa) < 0.0001. As this probability is less then the type I error a = 0.05, the hypothesis Ho can be rejected in favor of

the alternative $\mu_i \neq \mu_j$ for at least one pair i and j of the four treatments. Rejection of the null hypothesis $H_0$ indicates that differences exist among treatments, but it does not show where the differences are located. Investigators' interest is rarely restricted to this overall test, but rather to comparisons among the doses of the new drug or placebo. As a consequence, multiple comparisons of the three dose classes with placebo are required.

Table 3c: ANOVA Example 1

Source of Variation    df    Sum of Squares    Mean Square    F-Statistic    Pr > F
Model $\mu_i$           3         67.81           22.60          15.5        < 0.0001
Residual               36         52.52            1.46
Total a.f.m.           39        120.33

2.8

Multiple Comparisons

In many clinical trials, more than two drugs or more than two levels of one drug are considered. Having rejected the global hypothesis of equal treatment means (e.g., when the probability of the F-statistic in Table 3c, last column, is less than 0.05), questions related to picking out drugs that are different from others or determining what dose level is different from the others and placebo have to be addressed. These analyses generally require many (multiple) further comparisons among the treatments in order to detect effects of prime interest to the researcher. The excessive use of multiple significance tests in clinical trials can greatly increase the chance of false-positive findings. A large amount of statistical research has been devoted to multiple-comparison procedures and the control of false-positive results caused by multiple testing. Each procedure usually has the objective of controlling the experiment-wise or family-wise error rate. A multiple test controls the experiment-wise or

Figure 1: Two-way classification, no interaction, 2 rows, 3 columns

Figure 2: Two-way classification, interaction, 2 rows, 3 columns


Table 5: Sample Size, Means

        B1           B2
T1    (27) 1.07    (12) 1.36
T2    (32) 1.00    (7) 1.30

often used. The problem in interpreting the output of specific computer programs is to identify those sums of squares that are useful and those that are misleading. The information for conducting tests for row effects, column effects, and interactions between rows and columns is summarized in an extended ANOVA table. Various computational methods exist for generating sums of squares in an ANOVA table (Table 5) since the work of Yates [17]. The advantage of using the $\mu_{ij}$-model notation introduced above is that all $\mu_{ij}$ are clearly defined. Thus, a hypothesis stated in terms of the $\mu_{ij}$ is easily understood. Speed et al. [18], Searle [2,3], and Pendleton et al. [16] gave the interpretations of four different types of sums of squares computed (e.g., by the SAS, SPSS, and other systems). To illustrate the essential points, use the model in Equation 13, assuming all $n_{ij} > 0$. For reference, six hypotheses of weighted means are listed in Table 6 that will be related to the different methods [16,18].

A typical method might refer to $H_1$, $H_2$, or $H_3$ as "main effect of A," the row effect. Hypotheses $H_4$ and $H_5$ are counterparts of $H_2$ and $H_3$, generally associated with "main effect B," the column effect. The hypothesis of no interaction is $H_6$, and it is seen to be common to all methods under the assumption $n_{ij} > 0$. Hypotheses $H_1$, $H_2$, and $H_3$ agree in the balanced case (i.e., if $n_{ij} = n$ for all i and j) but not otherwise. The hypothesis $H_3$ does not depend on the $n_{ij}$. All means have the same weights $1/J$ and are easy to interpret. It states that no difference exists in the levels of factor A when averaged over all levels of factor B (Equation 16). Hypotheses $H_1$ and $H_2$ represent comparisons of weighted averages (Equations 18 and 20), with the weights being a function of the cell frequencies. A hypothesis weighted by the cell frequencies might be appropriate if the frequencies reflected population sizes but would not be considered as the standard hypothesis. Table 7 specifies the particular hypothesis tested by each type of ANOVA sums of squares.

Table 8 shows three different analysis of variance tables for the liver enzyme described in Section 2.10, computed with SAS PROC GLM. The hypothesis for the interaction term T×B is given by $H_6$. The test is the same for Type I, II, and III sums of squares. In this example, no interaction exists between treatment and liver impairment (P-value Pr > F = 0.94). The hypothesis for "main effect A," the treatment comparison, is given by $H_1$, $H_2$, or $H_3$, corresponding to the different weightings of the means (Table 6 and Table 7). The results for the treatment comparison differ (Table 8, source T):

1. for Type I, the P-value, Pr > F = 0.0049, is less than 0.005 (highly significant)
2. for Type II, the P-value, Pr > F = 0.074, is between 0.05 and 0.1 (not significant)
3. for Type III, the P-value, Pr > F = 0.142, is greater than 0.1 (not significant).

The hypothesis $H_2$ (i.e., Type II) is appropriate for the treatment effect in the analysis of this example, a two-way design with unbalanced data.


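Table 6: Cell Means Hypotheses

Hypothesis   Main Effect          Weighted Means                                                                      Remark
$H_1$        Factor A (rows)      $\sum_j n_{ij}\mu_{ij}/n_{i.} = \sum_j n_{i'j}\mu_{i'j}/n_{i'.}$                    weighted mean, Equation (18)
$H_2$        Factor A (rows)      $\sum_j n_{ij}\mu_{ij} = \sum_j \sum_{i'} n_{ij} n_{i'j}\mu_{i'j}/n_{.j}$            Cochran-Mantel-Haenszel weights $n_{ij} n_{i'j}/n_{.j}$, Equation (20)
$H_3$        Factor A (rows)      $\mu_{i.} = \mu_{i'.}$                                                              weighted mean, Equation (16)
$H_4$        Factor B (columns)   $\sum_i n_{ij}\mu_{ij} = \sum_i \sum_{j'} n_{ij} n_{ij'}\mu_{ij'}/n_{i.}$            counterpart of $H_2$ with factor B
$H_5$        Factor B (columns)   $\mu_{.j} = \mu_{.j'}$                                                              counterpart of $H_3$ with factor B
$H_6$        Interaction A × B    $\mu_{ij} - \mu_{i'j} - \mu_{ij'} + \mu_{i'j'} = 0$                                 interaction between factor A and factor B

For all $i, i', j, j'$ with $i \neq i'$ and $j \neq j'$.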

Table 7: Cell Means Hypotheses Being Tested

Sum of Squares             Type I    Type II    Type III
Row effect, Factor A       $H_1$     $H_2$      $H_3$
Column effect, Factor B    $H_4$     $H_4$      $H_5$
Interaction, A × B         $H_6$     $H_6$      $H_6$

Type I, Type II, and Type III agree when balanced data occur.

Table 8: Liver Enzymes, Two-Way Classification with Interaction Term
Treatment (T), Impairment (B), Interaction (T*B)

Source    DF    Type I SS      Mean Square    F-Value    Pr > F
T          1    0.20615        0.20615          8.41     0.0049
B          1    1.22720        1.22720         50.08     0.0001
T*B        1    0.00015        0.00015          0.01     0.9371

Source    DF    Type II SS     Mean Square    F-Value    Pr > F
T          1    0.08038        0.08038          3.28     0.0742
B          1    1.22720        1.22720         50.08     0.0001
T*B        1    0.00015        0.00015          0.01     0.9371

Source    DF    Type III SS    Mean Square    F-Value    Pr > F
T          1    0.05413        0.05413          2.21     0.1415
B          1    1.19127        1.19127         48.61     0.0001
T*B        1    0.00015        0.00015          0.01     0.9371
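As a rough check on how the three hypotheses weight the cell means, the following Python sketch computes single-degree-of-freedom sums of squares for the treatment contrast from the sample sizes and rounded cell means in Table 5. It assumes the usual contrast formula, SS = (sum of c_ij times cell mean)^2 divided by the sum of c_ij^2 / n_ij. The helper name `contrast_ss` and the variable names are illustrative and are not part of the original analysis. Because Table 5 reports rounded means, the results only approximate the PROC GLM values in Table 8.

```python
# Cell sample sizes and (rounded) cell means from Table 5.
# Rows: treatments T1, T2; columns: impairment levels B1, B2.
n = [[27, 12],
     [32, 7]]
mean = [[1.07, 1.36],
        [1.00, 1.30]]


def contrast_ss(coef):
    """Sum of squares for one contrast of cell means (hypothetical helper)."""
    estimate = sum(coef[i][j] * mean[i][j] for i in range(2) for j in range(2))
    scale = sum(coef[i][j] ** 2 / n[i][j] for i in range(2) for j in range(2))
    return estimate ** 2 / scale


row_n = [sum(r) for r in n]                      # n_1., n_2.
col_n = [n[0][j] + n[1][j] for j in range(2)]    # n_.1, n_.2

# H1 (Type I): row means weighted by the cell frequencies.
h1 = [[n[0][j] / row_n[0] for j in range(2)],
      [-n[1][j] / row_n[1] for j in range(2)]]

# H2 (Type II): Cochran-Mantel-Haenszel weights n_1j * n_2j / n_.j.
w = [n[0][j] * n[1][j] / col_n[j] for j in range(2)]
h2 = [[w[0], w[1]], [-w[0], -w[1]]]

# H3 (Type III): equally weighted cell means.
h3 = [[0.5, 0.5], [-0.5, -0.5]]

ss_type1 = contrast_ss(h1)
ss_type2 = contrast_ss(h2)
ss_type3 = contrast_ss(h3)
print(round(ss_type1, 3), round(ss_type2, 3), round(ss_type3, 3))
```

Under these assumptions the three sums of squares come out in the same order as in Table 8 (Type I largest, Type III smallest), which illustrates that the choice of weights, not the data, drives the different P-values for the treatment effect.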


No interaction exists between treatment and liver impairment. The different test results for $H_1$, $H_2$, and $H_3$ result from the unbalanced data and the factor liver impairment. Any rules for combining centers, blocks, or strata in the analysis should be set up prospectively in the protocol. Decisions concerning this approach should always be taken blind to treatment. All features of the statistical model to be adopted for the comparison of treatments should be described in advance in the protocol. Hypothesis $H_2$ is appropriate in the analysis of multicenter trials when treatment differences over all centers [13,14,19] are considered. The essential point emphasized here is that the justification of a method should be based on the hypotheses being tested and not on heuristic grounds or computational convenience. In the presence of a significant interaction, the hypotheses of main effects may not be of general interest, and more specialized hypotheses might be considered. With regard to missing cells, the hypotheses being tested can be somewhat complex for the various procedures or types of ANOVA tables. Complexities associated with those models are simply aggravated when dealing with models for more than two factors. No matter how many factors exist or how many levels each factor has, the mean of the observations in each filled cell is an estimator of the population mean for that cell. Any linear hypothesis about cell means of any nonempty cell is testable; see the work of Searle [Reference 3, pp. 384-415].

References

[1] R. A. Fisher, Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd, 1925.
[2] S. R. Searle, Linear Models. New York: John Wiley & Sons, 1971.
[3] S. R. Searle, Linear Models for Unbalanced Data. New York: John Wiley & Sons, 1987.
[4] R. R. Hocking and F. M. Speed, A full rank analysis of some linear model problems. JASA 1975; 70: 706-712.
[5] R. R. Hocking, Methods and Applications of Linear Models: Regression and the Analysis of Variance. New York: John Wiley & Sons, 1996.
[6] R. C. Littell, W. W. Stroup, and R. J. Freund, SAS for Linear Models. Cary, NC: SAS Institute Inc., 2002.
[7] P. Bauer, Multiple primary treatment comparisons on closed tests. Drug Inform. J. 1993; 27: 643-649.
[8] P. Bauer, On the assessment of the performance of multiple test procedures. Biometrical J. 1987; 29(8): 895-906.
[9] R. Marcus, E. Peritz, and K. R. Gabriel, On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976; 63: 655-660.
[10] C. W. Dunnett and C. H. Goldsmith, When and how to do multiple comparison statistics in the pharmaceutical industry. In: C. R. Buncher and J. Y. Tsay, eds. Statistics and Monograph, vol. 140. New York: Dekker, 1994.
[11] D. R. Cox, Planning of Experiments. New York: John Wiley & Sons, 1992.
[12] G. G. Koch and W. A. Sollecito, Statistical considerations in the design, analysis, and interpretation of comparative clinical studies. Drug Inform. J. 1984; 18: 131-151.
[13] J. Kaufmann and G. G. Koch, Statistical considerations in the design of clinical trials, weighted means and analysis of covariance. Proc. Conference in Honor of Shayle R. Searle, Biometrics Unit, Cornell University, 1996.
[14] J. L. Fleiss, The Design and Analysis of Clinical Experiments. New York: John Wiley & Sons, 1985.
[15] H. Sahai and M. I. Ageel, The Analysis of Variance: Fixed, Random and Mixed Models. Boston: Birkhäuser, 2000.
[16] O. J. Pendleton, M. von Tress, and R. Bremer, Interpretation of the four types of analysis of variance tables in SAS. Commun. Statist.-Theor. Meth. 1986; 15: 2785-2808.
[17] F. Yates, The analysis of multiple classifications with unequal numbers in the different classes. JASA 1934; 29: 51-56.
[18] F. M. Speed, R. R. Hocking, and O. P. Hackney, Methods of analysis of linear models with unbalanced data. JASA 1978; 73: 105-112.
[19] S. Senn, Some controversies in planning and analysing multi-centre trials. Stat. Med. 1998; 17: 1753-1765.

3

Assessment of Health-Related Quality of Life

C. S. Wayne Weng

3.1

Introduction

Randomized clinical trials are the gold standard for evaluating new therapies. The primary focus of clinical trials has traditionally been evaluation of efficacy and safety. As clinical trials evolved from traditional efficacy and safety assessment of new therapies, clinicians were interested in an overall evaluation of the clinical impact of these new therapies on patient daily functioning and well-being as measured by health-related quality of life (HRQOL). As a result, HRQOL assessments in clinical trials rose steadily throughout the 1990s and continue into the twenty-first century. What is HRQOL? Generally, quality of life encompasses four major domains [1]:

1. Physical status and functional abilities
2. Psychological status and well-being
3. Social interactions
4. Economic or vocational status and factors

The World Health Organization (WHO) defines "health" [2] as a "state of complete physical, mental, and social well-being and not merely the absence of infirmity and disease." HRQOL focuses on parts of quality of life that are related to an individual's health. The key components of this definition of HRQOL include (1) physical functioning, (2) mental functioning, and (3) social well-being, and a well-balanced HRQOL instrument should include these three key components. For example, the Medical Outcomes Study Short Form-36 (SF-36), a widely used HRQOL instrument, includes a profile of eight domains: (1) Physical Functioning, (2) Role-Physical, (3) Bodily Pain, (4) Vitality, (5) General Health, (6) Social Functioning, (7) Role-Emotional, and (8) Mental Health. These eight domains can be further summarized by two summary scales: the Physical Component Summary (PCS) and Mental Component Summary (MCS) scales.

This article is intended to provide an overview of assessment of HRQOL in clinical trials. For more specific details on a particular topic mentioned in this article, readers should consult the cited references. The development of a new HRQOL questionnaire and its translation into various languages are separate topics and are not covered in this article.


3.2

Choice of HRQOL Instruments

HRQOL instruments can be classified into two types: generic instruments and disease-specific instruments. The generic instrument is designed to evaluate general aspects of a person's HRQOL, which should include physical functioning, mental functioning, and social well-being. A generic instrument can be used to evaluate the HRQOL of a group of people in the general public or a group of patients with a specific disease. As such, data collected with a generic instrument allow comparison of HRQOL among different disease groups or against a general population. Because a generic instrument is designed to cover a broad range of HRQOL issues, it may be less sensitive regarding important issues for a particular disease or condition. Disease-specific instruments focus assessment in a more detailed manner on a particular disease. A more specific instrument allows detection of changes in disease-specific areas that a generic instrument is not sufficiently sensitive to detect. For example, the Health Assessment Questionnaire (HAQ) was developed to measure the functional status of patients with rheumatic disease. The HAQ assesses the ability to function in eight areas of daily life: dressing and grooming, arising, eating, walking, hygiene, reach, grip, and activities. Table 1 [3-30] includes a list of generic and disease-specific HRQOL instruments for common diseases or conditions.

A comprehensive approach to assessing HRQOL in clinical trials can be achieved using either a battery of questionnaires, when a single questionnaire does not address all relevant HRQOL components, or a "module" approach, which includes a core measure of HRQOL domains supplemented in the same questionnaire by a disease- or treatment-specific set of items. The battery approach combines a generic HRQOL instrument and a disease-specific questionnaire. For example, in a clinical trial on rheumatoid arthritis (RA), one can include the SF-36 and HAQ to evaluate treatment effect on HRQOL. The SF-36 allows comparison of RA burden on patients' HRQOL with other diseases as well as the general population. The HAQ, being a disease-specific instrument, measures patients' ability to perform activities of daily life and is more sensitive to changes in an RA patient's condition. The module approach has been widely adopted in oncology, as different tumors impact patients in different ways. The most popular cancer-specific HRQOL questionnaires, EORTC QLQ-C30 and FACT, both include core instruments that measure physical functioning, mental functioning, and social well-being as well as common cancer symptoms, supplemented with a list of tumor- and treatment-specific modules.

In certain diseases, a disease-specific HRQOL instrument is used alone in a trial because the disease's impact on general HRQOL is so small that a generic HRQOL instrument will not be sensitive enough to detect changes in disease severity. For example, the disease burden of allergic rhinitis on generic HRQOL is relatively small compared with the general population. Most published HRQOL studies in allergic rhinitis use Juniper's Rhinoconjunctivitis Quality of Life Questionnaire (RQLQ), a disease-specific HRQOL questionnaire for allergic rhinitis.

3.3

Establishment of Clear Objectives in HRQOL Assessments

A clinical trial is usually designed to address one hypothesis or a small number of hypotheses, evaluating a new therapy's efficacy, safety, or both. When considering whether to include HRQOL assessment in a study, the question of what additional information will be provided by the HRQOL assessment must be asked. As estimated by Moinpour [31], the total cost per patient is $443 to develop an HRQOL study, monitor HRQOL form submission, and analyze HRQOL data. Sloan et al. [32] have revisited the issue of the cost of HRQOL assessment in a number of settings, including clinical trials, and suggest a wide cost range depending on the comprehensiveness of the assessment. This is not a trivial sum of money to spend in a study without a clear objective in HRQOL assessment.

The objective of HRQOL assessment is usually focused on one of four possible outcomes: (1) improvement in efficacy leads to improvement in HRQOL, (2) treatment side effects may cause deterioration in HRQOL, (3) the combined effect of (1) and (2) on HRQOL, and (4) similar efficacy with an improved side effect profile leads to improvement in HRQOL. After considering the possible HRQOL outcomes, one can decide whether HRQOL assessment should be included in the trial. In many published studies, HRQOL was included without a clear objective. These studies generated HRQOL data that provided no additional information at the completion of the studies. Goodwin et al. [33] provide an excellent review of HRQOL measurement in randomized clinical trials in breast cancer. They suggest that, given the existing HRQOL database for breast cancer, it is not necessary to measure HRQOL in every trial, at least until ongoing trials are reported. An exception is interventions with a psychosocial focus, where HRQOL must be the primary outcome.

Table 1: Generic and Disease-Specific HRQOL Instruments for Common Diseases or Conditions

Generic HRQOL Instruments                                          Reference
Short Form-36 (SF-36)                                              3-5
Sickness Impact Profile                                            6
Nottingham Health Profile                                          7
Duke Health Profile                                                8
McMaster Health Index Questionnaire                                9
Functional Status Questionnaire                                    10
WHO Quality of Life Assessment                                     11

Disease-Specific HRQOL Instruments

Disease                              Instrument                                                         Reference
Pain                                 Brief Pain Inventory                                               12
                                     McGill Pain Questionnaire                                          13
                                     Visual Analogue Pain Rating Scales                                 Various authors; see Reference 14, p. 341
Depression                           Beck Depression Inventory                                          15
                                     Center for Epidemiologic Studies Depression Scale (CES-D)          16
                                     Hamilton Rating Scale for Depression                               17
                                     The Hospital Anxiety and Depression Questionnaire                  18
                                     Zung Self-Rating Depression Scale                                  19
                                     WHO Well-Being Questionnaire                                       20
Rheumatic Disease:                   Health Assessment Questionnaire (HAQ)                              21
• Rheumatoid Arthritis
• Osteoarthritis
• Ankylosing Spondylitis
• Juvenile Rheumatoid Arthritis
Inflammatory Bowel Disease           Inflammatory Bowel Disease Questionnaire (IBDQ)                    22,23
Asthma                               Asthma Quality of Life Questionnaire (AQLQ)                        24
Airway Disease                       St. George's Respiratory Questionnaire (SGRQ)                      25
Seasonal Allergic Rhinitis           Rhinoconjunctivitis Quality of Life Questionnaire (RQLQ)           26
Parkinson's Disease                  Parkinson's Disease Questionnaire-39 item (PDQ-39)                 27
                                     Parkinson's Disease Quality of Life Questionnaire (PDQL)           28
Cancer (both have tumor-specific     EORTC QLQ-C30                                                      29
modules)                             Functional Assessment of Cancer Therapy (FACT)                     30

3.4

Methods for HRQOL Assessment

The following components should be included in a study protocol with an HRQOL objective:

• Rationale for assessing HRQOL objective(s) and for the choice of HRQOL instrument(s):
- To help study personnel understand the importance of HRQOL assessment in the study, inclusion of a clear and concise rationale for HRQOL assessment is essential, along with a description of the specific HRQOL instrument(s) chosen.

• HRQOL hypotheses:
- The study protocol should also specify hypothesized HRQOL outcomes with respect to general and specific domains. It is helpful to identify the primary domain and secondary domains for HRQOL analysis in the protocol.

• Frequency of HRQOL assessment:
- In a clinical trial, the minimum number of HRQOL assessments required is two, at baseline and the end of the study, for studies with a fixed treatment duration where most patients are expected to complete the treatment. One or two additional assessments should be considered between baseline and study end point, depending on the length of the study, so that a patient's data will still be useful if end point data were not collected. More frequent assessments should be considered if the treatment's impact on HRQOL may change over time. Three or more assessments are necessary to characterize patterns of change for individual patients. In oncology trials, it is common to assess HRQOL at every treatment cycle, as patients' HRQOL is expected to change over time. However, assessment burden can be minimized if specific time points associated with expected clinical effects are of interest and can be specified by clinicians (e.g., assess HRQOL after the minimum number of cycles of therapy required to observe clinical activity of an agent). Another factor to be considered in the frequency of HRQOL assessment is the recall period for a particular HRQOL instrument. The recall period is the period over which a subject is asked to assess his/her responses to an HRQOL questionnaire. The most common recall periods are 1 week, 2 weeks, and 4 weeks.

• Administering HRQOL questionnaires:
- To objectively evaluate HRQOL, one needs to minimize physician and study nurse influence on the patient's response to HRQOL questions. Therefore, the protocol should indicate that the patient is to complete the HRQOL questionnaire in a quiet place in the doctor's office at the beginning of his/her office visit, prior to any physical examination and clinical evaluation by the study nurse and physician.

• Specify the magnitude of difference in HRQOL domain score that can be detected with the planned sample size:
- This factor is especially important when the HRQOL assessment is considered a secondary end point. As the sample size is based on the primary end point, it may provide only enough power to detect a relatively large difference in HRQOL scores. The question of whether to increase the sample size to cover HRQOL assessment often depends on how many additional patients are needed and the importance of the HRQOL issue for the trial. Collecting HRQOL data when power will be insufficient to detect effects of interest is a waste of clinical resources and the patient's time.

• Specify how HRQOL scores are to be calculated and analyzed in the statistical analysis section:
- Calculation of HRQOL domain scores should be stated clearly, including how missing items will be handled. As a result of the nature of oncology studies, especially in late-stage disease, patients will stop treatment at different time points because of disease progression, intolerance to treatment side effects, or death and therefore fail to complete the HRQOL assessment schedule. For example, if data are missing because of deteriorating patient health, the study estimates of effect on HRQOL will be biased in favor of better HRQOL; the term "informative missing data" is the name for this phenomenon, which must be handled with care. Fairclough [34] has written a book on various longitudinal methods to analyze this type of HRQOL data. However, analyzing and interpreting HRQOL data in this setting remain a challenge.

• Strategies to improve HRQOL data collection:
- Education at the investigators' meeting and during the site initiation visit: It is important to have investigators and study coordinators committed to the importance of HRQOL assessment. Without this extra effort, HRQOL assessment is likely to be unsuccessful, simply because collecting HRQOL data is not part of routine clinical trial conduct.
- Emphasize the importance of the HRQOL data: Baseline HRQOL forms should be required in order to register a patient to a trial. Associate grant payment per patient with submission of a patient's efficacy data. Tying some portion of the grant payment to the submission of the HRQOL form has significantly increased the HRQOL completion rate in the author's clinical experience.
- Establish a prospective reminder system for upcoming HRQOL assessments and a system for routine monitoring of forms at the same time clinical monitoring is being conducted.

The checklist in Table 2 may be helpful when considering inclusion of HRQOL assessment in a clinical trial protocol.
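The bias from informative missing data mentioned in the list above can be made concrete with a small Python sketch. The scores are hypothetical (they are not taken from any trial), and the dropout rule, that patients whose end-of-study score falls below a cutoff do not return a form, is an assumption chosen only to make the mechanism visible.

```python
from statistics import mean

# Hypothetical end-of-study HRQOL scores (0-100 scale) for ten patients.
true_scores = [35, 42, 48, 55, 60, 63, 70, 74, 81, 88]

# Assumed dropout rule: patients whose health has deteriorated
# (score below the cutoff) fail to complete the questionnaire.
CUTOFF = 50
observed_scores = [s for s in true_scores if s >= CUTOFF]

true_mean = mean(true_scores)          # what a complete data set would show
observed_mean = mean(observed_scores)  # what a complete-case analysis sees
bias = observed_mean - true_mean

print(f"true mean {true_mean:.1f}, observed mean {observed_mean:.1f}, bias {bias:+.1f}")
```

In this toy setting the complete-case mean overstates HRQOL because the missing forms belong to the patients doing worst, which is why the analysis plan needs to state in advance how such dropout will be handled.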

3.5

HRQOL as the Primary End Point

To use HRQOL as the primary end point in a clinical trial, prior information must demonstrate at least comparable efficacy of a study treatment to its control. In this context, to design a study with HRQOL as the primary end point, the sample size will have to be large enough to assure adequate power to detect meaningful differences in HRQOL between treatment groups. Another context for a primary HRQOL end point is in the setting of treatment palliation. In this case, treatment efficacy is shown by the agent's ability to palliate disease-related symptoms and overall HRQOL without incurring treatment-related toxicities. For example, patient report of pain reduction can document the achievement of palliation [e.g., see Tannock et al. [35] example below].

An HRQOL instrument usually has several domains to assess various aspects of HRQOL. Some HRQOL instruments also provide for an overall or total score. The HRQOL end point should specify a particular domain, or the total score, as the primary end point of the HRQOL assessment in order to avoid multiplicity issues. If HRQOL is included as a secondary end point, it is good practice to identify a particular domain as the primary focus of the HRQOL assessment. This practice forces specification of the expected outcomes of HRQOL assessments. Some investigators have applied multiplicity adjustments to HRQOL assessments. The approach may be statistically prudent, but it does not provide practical value. The variability of HRQOL domain scores is generally large. With multiple domains being evaluated, only a very large difference between groups will achieve the required statistical significance level. When evaluating HRQOL as a profile of a therapy's impact on patients, clinical judgment of the magnitude of HRQOL changes should be more important than statistical significance. However, this "exploratory" analysis perspective should also be tempered with the recognition that some significant results may be marginally significant and subject to occurrence by chance.
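To see why multiplicity matters when several domains are tested, the Python sketch below computes the family-wise error rate for k tests at level alpha, together with the Bonferroni per-test level. Two assumptions are made for illustration only: the tests are treated as independent (HRQOL domains are usually correlated, so this is an approximation), and Bonferroni stands in as one common adjustment, since the text does not say which adjustment investigators applied. The function names are hypothetical.

```python
def familywise_error(alpha: float, k: int) -> float:
    """Chance of at least one false positive among k independent tests."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_level(alpha: float, k: int) -> float:
    """Per-test level that keeps the family-wise rate at or below alpha."""
    return alpha / k


alpha = 0.05
domains = 8  # e.g., the eight SF-36 domains

unadjusted = familywise_error(alpha, domains)
per_test = bonferroni_level(alpha, domains)
adjusted = familywise_error(per_test, domains)
print(f"unadjusted {unadjusted:.3f}, per-test level {per_test:.5f}, adjusted {adjusted:.3f}")
```

The much smaller per-test level is the reason only very large between-group differences reach significance after adjustment, which is the practical cost described above.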

Table 2: Checklist for HRQOL Assessment

• Rationale to assess HRQOL and the choice of HRQOL instrument(s)
• Hypothesis in terms of expected HRQOL outcomes
• Frequency of HRQOL assessment
• Procedures for administering HRQOL questionnaires
• Specify the magnitude of difference in HRQOL domain score that can be detected with the planned sample size
• Specify in the statistical analysis section how HRQOL scores are to be calculated and analyzed
• Strategies to improve HRQOL data collection

3.6

Interpretation of HRQOL Results

Two approaches have been used to interpret the meaningfulness of observed HRQOL differences between two treatment groups in a clinical trial: distribution-based and anchor-based approaches. The most widely used distribution-based approach is the effect size, among other methods listed in Table 3 [36-45]. Based on the effect size, an observed difference is classified as (1) 0.2 = a small difference, (2) 0.5 = a moderate difference, or (3) 0.8 = a large difference. In advocating the effect size to facilitate the interpretation of HRQOL data, Sloan et al. [46] suggested 0.5 standard deviation as a reasonable benchmark for a difference on a 0-100 scale to be clinically meaningful. This suggestion is consistent with Cohen's [47] suggestion of one-half of a standard deviation as indicating a moderate effect and therefore clinically meaningful.

The anchor-based approach compares observed differences relative to an external standard. Investigators have used this approach to define the minimum important difference (MID). For example, Juniper and Guyatt [26] suggested that a 0.5 change in RQLQ be the MID (RQLQ score ranges from 1 to 7). Osoba et al. [48] suggested that a 10-point change in the EORTC QLQ-C30 questionnaire would be a MID. Both of these MIDs are group average scores. How these MIDs apply to individual patients is still an issue. Another issue in using the MID is related to the starting point of patients' HRQOL scores. Guyatt et al. [49] provide a detailed overview of various strategies to interpret HRQOL results.
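A minimal Python sketch of the distribution-based approach follows, using the effect size and standardized response mean formulas listed in Table 3. The paired scores are hypothetical and illustrative only. Treating 0.2, 0.5, and 0.8 as lower cutoffs for "small," "moderate," and "large" is an assumption about how the benchmarks are applied, and `classify` is a hypothetical helper.

```python
from statistics import mean, stdev

# Hypothetical paired domain scores for eight patients at two assessments
# (illustrative numbers only, not taken from any study).
test1 = [52.0, 47.0, 60.0, 55.0, 49.0, 58.0, 63.0, 44.0]   # first assessment
test2 = [58.0, 50.0, 66.0, 57.0, 56.0, 61.0, 70.0, 46.0]   # second assessment

# Table 3 writes the change as Mean_test1 - Mean_test2, so an increase in
# scores gives a negative value; only the magnitude is classified below.
mean_change = mean(test1) - mean(test2)

# Effect size: mean change divided by the SD at the first assessment.
effect_size = mean_change / stdev(test1)

# Standardized response mean: mean change divided by the SD of the changes.
differences = [a - b for a, b in zip(test1, test2)]
srm = mean_change / stdev(differences)


def classify(es: float) -> str:
    """Label an effect size using the 0.2 / 0.5 / 0.8 benchmarks."""
    size = abs(es)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "below small"


print(f"effect size {effect_size:.2f} ({classify(effect_size)}), SRM {srm:.2f}")
```

The two statistics share a numerator but can differ widely, because the standard deviation of individual changes is often much smaller than the between-patient standard deviation at a single assessment.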

Table 3: Common Methods Used to Measure a Questionnaire's Responsiveness to Change

Method                        Formula                                                                   Reference
Relative change               $(Mean_{test1} - Mean_{test2}) / Mean_{test1}$                            36
Effect size                   $(Mean_{test1} - Mean_{test2}) / SD_{test1}$                              37,38
Relative efficiency           Square of (Effect Size$_{dimension}$ / Effect Size$_{standard}$)           39
Standardized response mean    $(Mean_{test1} - Mean_{test2}) / SD_{difference}$                         40
Responsiveness statistic      $(Mean_{test1} - Mean_{test2}) / SD_{stable\ group}$                      41,42
Paired t statistic            $(Mean_{test1} - Mean_{test2}) / SE_{difference}$                         43
SE of measurement             $SD_{test} \times \sqrt{1 - \text{Reliability Coefficient}_{test}}$      44,45

Reprinted with permission.

3.7

Examples

3.7.1

HRQOL in Asthma

To evaluate salmeterol's effect on quality of life, patients with nocturnal asthma were enrolled into a double-blind, parallel group, placebo-controlled, multicenter study [50]. The study rationale was that patients with nocturnal asthma who are clinically stable have been found to have poorer cognitive performance and poorer subjective and objective sleep quality compared with normal, healthy patients. To assess salmeterol's effect on reducing the impact of nocturnal asthma on patients' daily functioning and well-being, patients were randomized to receive salmeterol 42 μg or placebo twice daily. Patients were allowed to continue theophylline, inhaled corticosteroids, and "as-needed" albuterol. Treatment duration was 12 weeks, with a 2-week run-in period. The primary study objective was to assess the impact of salmeterol on asthma-specific quality of life using the validated Asthma Quality of Life Questionnaire [24] (AQLQ). Patients were to return to the clinic every 4 weeks. Randomized patients were to complete an AQLQ at day 1; weeks 4, 8, 12; and at the time of withdrawal from the study for any reason. Efficacy (FEV1, PEF, nighttime awakenings,

asthma symptoms, and albuterol use) and safety assessments were also conducted at these clinic visits. Scheduling HRQOL assessment prior to efficacy and safety evaluations at office visits minimizes investigator bias and missing HRQOL evaluation forms. The AQLQ is a 32-item, self-administered, asthma-specific instrument that assesses quality of life over a 2-week time interval. Each item is scored using a scale from 1 to 7, with lower scores indicating greater impairment and higher scores indicating less impairment in quality of life. Items are grouped into four domains: (1) activity limitation (assesses the amount of limitation of individualized activities that are important to the patient and are affected by asthma); (2) asthma symptoms (assesses the frequency and degree of discomfort of shortness of breath, chest tightness, wheezing, chest heaviness, cough, difficulty breathing


out, fighting for air, heavy breathing, difficulty getting a good night's sleep); (3) emotional function (assesses the frequency of being afraid of not having medications, concerned about medications, concerned about having asthma, frustrated); and (4) environmental exposure (assesses the frequency of exposure to and avoidance of irritants such as cigarette smoke, dust, and air pollution). Individual domain scores and a global score are calculated. A change of 0.5 (for both global and individual domain scores) is considered the smallest difference that patients perceive as meaningful [51]. Achieving 80% power to detect a difference of 0.5 in AQLQ between two treatment arms at a significance level of 0.05 would require only 80 patients per arm. However, this study was designed to enroll 300 patients per arm so that it could also provide 80% power to detect differences in efficacy variables (e.g., FEV1, nighttime awakening) between two treatment arms at a significance level of 0.05. A total of 474 patients were randomly assigned to treatment. Mean change from baseline for the AQLQ global and each of the four domain scores was significantly greater (P < 0.005) with salmeterol compared with placebo, first observed at week 4 and continuing through week 12. In addition, differences between salmeterol and placebo groups were greater than 0.5 at all visits except at week 4 and week 8 for the environmental exposure domain. At week 12, salmeterol significantly (P < 0.001 compared with placebo) increased mean change from baseline in FEV1, morning and evening PEF, percentage of symptom-free days, percentage of nights with no awakenings due to asthma, and the percentage of days and nights with no supplemental albuterol use. This study demonstrated that salmeterol's improvement of patients' asthma symptoms translated into a more profound improvement in patients' daily activity and well-being.
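The sample size statement above can be illustrated with the usual normal-approximation formula for comparing two means. The sketch below is not the calculation used in the study: the standard deviation of the AQLQ change score is not reported here, so `ASSUMED_SD` is a hypothetical value chosen only for illustration, and `n_per_arm` is a helper name introduced for this example.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Patients per arm for a two-sided, two-sample comparison of means
    (normal approximation, equal allocation, common standard deviation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# ASSUMED_SD is hypothetical: the text does not report the AQLQ standard deviation.
ASSUMED_SD = 1.1
print(n_per_arm(delta=0.5, sd=ASSUMED_SD))
```

Under this formula the required size grows with the square of the assumed standard deviation, so a figure near 80 per arm corresponds to a standard deviation only slightly above 1 on the 7-point scale.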

3.7.2

HRQOL in Seasonal Allergy Rhinitis

A randomized, double-blind, placebo-controlled study was conducted to evaluate the effects on efficacy, safety, and quality of life of two approved therapies (fexofenadine HCl 120 mg and loratadine 10 mg) for the treatment of seasonal allergy rhinitis (SAR) [52]. Clinical efficacy was based on a patient's evaluation of SAR symptoms: (1) sneezing; (2) rhinorrhea; (3) itchy nose, palate, or throat; and (4) itchy, watery, or red eyes. The primary efficacy end point was the total score for the patient symptom evaluation, defined as the sum of the four individual symptom scores. Each of the symptoms was evaluated on a 5-point scale (0 to 4), with higher scores indicating more severe symptoms. Treatment duration was 2 weeks, with a run-in period of 3 to 7 days. After randomization at study day 1, patients were to return to the clinic every week. During these visits, patients were to be evaluated for the severity of SAR symptoms and to complete a quality of life questionnaire. Patient-reported quality of life was evaluated using a validated disease-specific questionnaire: the Rhinoconjunctivitis Quality of Life Questionnaire (RQLQ) [26]. The RQLQ is a 28-item instrument that assesses quality of life over a 1-week time interval. Each item is scored using a scale from 0 (i.e., not troubled) to 6 (i.e., extremely troubled), with higher scores indicating greater impairment in quality of life. Items are grouped into seven domains: (1) sleep, (2) practical problems, (3) nasal symptoms, (4) eye symptoms, (5) non-nose/eye symptoms, (6) activity limitations, and (7) emotional function. Individual domain scores and an overall score are calculated. A change of 0.5 (for

both global and individual domain scores) is considered the smallest difference that patients perceive as meaningful [53]. The RQLQ assessment was a secondary end point. No sample size and power justification was mentioned in the published paper. A total of 688 patients were randomized to receive fexofenadine HCl 120 mg, loratadine 10 mg, or placebo once daily. Mean 24-hour total symptom score (TSS) as evaluated by the patient was significantly reduced by both fexofenadine HCl and loratadine from baseline (P < 0.001) compared with placebo. The difference between fexofenadine HCl and loratadine was not statistically significant. For overall quality of life, a significant improvement from baseline occurred for all three treatment groups (mean improvement was 1.25, 1.00, and 0.93 for fexofenadine HCl, loratadine, and placebo, respectively). The improvement in the fexofenadine HCl group was significantly greater than that in either the loratadine (P < 0.03) or placebo (P < 0.005) groups. However, the magnitude of differences among the treatment groups was less than the minimal important difference of 0.5.

The asthma example demonstrates that salmeterol not only significantly improved patients' asthma-related symptoms, both statistically and clinically, but also relieved their asthma-induced impairments on daily functioning and well-being. On the other hand, the SAR example demonstrates that both fexofenadine HCl and loratadine were effective in relief of SAR symptoms. The difference between fexofenadine HCl and loratadine in HRQOL was statistically significant but not clinically significant. However, Hays and Woolley [54] have cautioned investigators about the potential for oversimplification when applying a single minimal clinically important difference (MCID).

3.7.3

Symptom Relief for Late-Stage Cancers

Although the main objective for the treatment of early-stage cancers is to eradicate the cancer cells and prolong survival, it may not be achievable in late-stage cancers. More often, the objective for the treatment of late-stage cancers is palliation, mainly through relief of cancer-related symptoms. As the relief of cancer-related symptoms represents a clinical benefit to patients, the objective of some clinical trials in late-stage cancer is relief of a specific cancer-related symptom such as pain. To investigate the benefit of mitoxantrone in patients with symptomatic hormone-resistant prostate cancer, hormone-refractory patients with pain were randomized to receive mitoxantrone plus prednisone or prednisone alone [35]. The primary end point was a palliative response defined as a two-point decrease in pain as assessed by a six-point pain scale completed by patients (or complete loss of pain if initially 1+) without an increase in analgesic medication and maintained for two consecutive evaluations at least 3 weeks apart. Palliative response was observed in 23 of 80 patients (29%; 95% confidence interval, 19-40%) who received mitoxantrone plus prednisone and in 10 of 81 patients (12%; 95% confidence interval, 6-22%) who received prednisone alone (P = 0.01). No difference existed in overall survival. In another study assessing gemcitabine's effect on relief of pain [55], 162 patients with advanced symptomatic pancreatic cancer completed a lead-in period to characterize and stabilize pain and were randomized to receive either gemcitabine 1000 mg/m² weekly × 7 followed by 1 week of rest, then weekly × 3 every 4 weeks thereafter, or fluorouracil (5-FU) 600 mg/m² once weekly. The primary efficacy measure was clinical benefit response, which


was a composite of measurements of pain (analgesic consumption and pain intensity), Karnofsky performance status, and weight. Clinical benefit required a sustained (>4 weeks) improvement in at least one parameter without worsening in any others. Clinical benefit response was experienced by 23.8% of gemcitabine-treated patients compared with 4.8% of 5-FU-treated patients (P = 0.0022). In addition, the median survival durations were 5.65 and 4.41 months for gemcitabine-treated and 5-FU-treated patients, respectively (P = 0.0025). Regarding the use of composite variables, researchers have urged investigators to report descriptive results for all components so that composite results do not obscure potential negative results for one or more of the components of the composite [56,57]. In a third study example, although symptom assessment was not the primary end point, it was the main differentiating factor between the two study arms in study outcomes. As second-line treatment of small-cell lung cancer (SCLC), topotecan was compared with cyclophosphamide, doxorubicin, and vincristine (CAV) in 211 patients with SCLC who had relapsed at least 60 days after completion of first-line therapy [58]. Response rate and duration of response were the primary efficacy end points. Patient-reported lung-cancer-related symptoms were also evaluated as secondary end points. Similar efficacy in response rate, progression-free survival, and overall survival was observed between topotecan and CAV. The response rate was 26 of 107 patients (24.3%) treated with topotecan and 19 of 104 patients (18.3%) treated with CAV (P = 0.285). Median times to progression were 13.3 weeks (topotecan) and 12.3 weeks (CAV) (P = 0.552). Median survival was 25.0 weeks for topotecan and 24.7 weeks for CAV (P = 0.795). However, the proportion of patients who experienced symptom

improvement was greater in the topotecan group than in the CAV group for four of eight lung-cancer-related symptoms evaluated, including dyspnea, anorexia, hoarseness, and fatigue, as well as interference with daily activity (P < 0.043).
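The response proportions quoted for the prostate cancer trial can be checked with a short calculation. The sketch below uses the Wilson score interval, which is an assumption made for illustration: the interval method used in the published report is not stated here, so the limits need not agree exactly with the quoted ones. The helper name `wilson_interval` is introduced for this example only.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, level=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Counts are taken from the text; the arm labels are for printing only.
for label, x, n in [("mitoxantrone plus prednisone", 23, 80),
                    ("prednisone alone", 10, 81)]:
    lo, hi = wilson_interval(x, n)
    print(f"{label}: {x}/{n} = {x / n:.1%}, 95% CI {lo:.1%} to {hi:.1%}")
```

A score-type interval is used here because it needs only the standard library; an exact (Clopper-Pearson) interval would typically be slightly wider.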

3.8

Conclusion

Although HRQOL assessment in clinical trials has increased steadily over the years, a substantial challenge remains in interpretation of HRQOL results and acceptance of its value in clinical research. Both issues will require time for clinicians and regulators to fully accept HRQOL assessments. To help build acceptance, existing HRQOL instruments should be validated in each therapeutic area, rather than developing new instruments. The most urgent need in HRQOL research is to increase HRQOL acceptance by clinicians and regulators so that pharmaceutical companies will continue to include financial support for HRQOL assessments in new and existing drug development programs. Acknowledgment. The author is deeply indebted to Carol M. Moinpour for her numerous suggestions and to Carl Chelle for his editorial assistance.

References

[1] J. A. Cramer and B. Spilker, Quality of Life and Pharmacoeconomics: An Introduction. Philadelphia: Lippincott-Raven, 1998.
[2] World Health Organization, The First Ten Years of the World Health Organization. Geneva: World Health Organization, 1958, p. 459.
[3] J. E. Ware, Jr. and C. D. Sherbourne, The MOS 36-Item Short-Form Health Survey (SF-36). I. Conceptual framework and item selection. Med. Care 1992; 30: 473-483.


[4] C. A. McHorney, J. E. Ware, Jr., and A. E. Raczek, The MOS 36-Item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med. Care 1993; 31: 247-263.
[5] C. A. McHorney et al., The MOS 36-Item Short-Form Health Survey (SF-36): III. Tests of data quality, scaling assumptions, and reliability across diverse patient groups. Med. Care 1994; 32: 40-66.
[6] M. Bergner et al., The Sickness Impact Profile: development and final revision of a health status measure. Med. Care 1981; 19: 787-805.
[7] S. M. Hunt, J. McEwen, and S. P. McKenna, Measuring Health Status. London: Croom Helm, 1986.
[8] G. R. Parkerson, Jr., W. E. Broadhead, and C. K. Tse, The Duke Health Profile. A 17-item measure of health and dysfunction. Med. Care 1990; 28: 1056-1072.
[9] L. W. Chambers, The McMaster Health Index Questionnaire (MHIQ): Methodologic Documentation and Report of the Second Generation of Investigations. Hamilton, Ontario, Canada: McMaster University, Department of Clinical Epidemiology and Biostatistics, 1982.
[10] A. M. Jette et al., The Functional Status Questionnaire: reliability and validity when used in primary care. J. Gen. Intern. Med. 1986; 1: 143-149.
[11] S. Szabo (on behalf of the WHOQOL Group), The World Health Organization Quality of Life (WHOQOL) assessment instrument. In: B. Spilker (ed.), Quality of Life and Pharmacoeconomics in Clinical Trials. Philadelphia: Lippincott-Raven, 1996.
[12] C. S. Cleeland, Measurement of pain by subjective report. In: C. R. Chapman and J. D. Loeser (eds.), Issues in Pain Measurement. New York: Raven Press, 1989.
[13] R. Melzack, The McGill Pain Questionnaire: major properties and scoring methods. Pain 1975; 1: 277-299.
[14] I. McDowell and C. Newell, Measuring Health: A Guide to Rating Scales and Questionnaires, 2nd ed. New York: Oxford University Press, 1996.
[15] A. T. Beck et al., An inventory for measuring depression. Arch. Gen. Psychiat. 1961; 4: 561-571.
[16] L. S. Radloff, The CES-D scale: a self-report depression scale for research in the general population. Appl. Psychol. Measure. 1977; 1: 385-401.
[17] M. Hamilton, Standardised assessment and recording of depressive symptoms. Psychiat. Neurol. Neurochir. 1969; 72: 201-205.
[18] A. Zigmond and P. Snaith, The Hospital Anxiety and Depression Questionnaire. Acta Scand. Psychiat. 1983; 67: 361-368.
[19] W. W. K. Zung, A self-rating depression scale. Arch. Gen. Psychiat. 1965; 12: 63-70.
[20] P. Bech et al., The WHO (Ten) Well-Being Index: validation in diabetes. Psychother. Psychosomat. 1996; 65: 183-190.
[21] J. F. Fries et al., The dimension of health outcomes: the Health Assessment Questionnaire, disability and pain scales. J. Rheumatol. 1982; 9: 789-793.
[22] G. H. Guyatt, A. Mitchell, E. J. Irvine et al., A new measure of health status for clinical trials in inflammatory bowel disease. Gastroenterology 1989; 96: 804-810.
[23] E. J. Irvine, B. Feagan et al., Quality of life: a valid and reliable measure of therapeutic efficacy in the treatment of inflammatory bowel disease. Gastroenterology 1994; 106: 287-296.
[24] E. F. Juniper, G. H. Guyatt, P. J. Ferrie, and L. E. Griffith, Measuring quality of life in asthma. Am. Rev. Respir. Dis. 1993; 147: 832-838.


[25] P. W. Jones, F. H. Quirk, and C. M. Baveystock, The St. George's Respiratory Questionnaire. Respir. Med. 1991; 85: 25-31.
[26] E. F. Juniper and G. H. Guyatt, Development and testing of a new measure of health status for clinical trials in rhinoconjunctivitis. Clin. Exp. Allergy 1991; 21: 77-83.
[27] V. Peto et al., The development and validation of a short measure of functioning and well being for individuals with Parkinson's disease. Qual. Life Res. 1995; 4(3): 241-248.
[28] A. G. E. M. De Boer et al., Quality of life in patients with Parkinson's disease: development of a questionnaire. J. Neurol. Neurosurg. Psychiat. 1996; 61(1): 70-74.
[29] N. K. Aaronson et al., The European Organization for Research and Treatment of Cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in oncology. J. Natl. Cancer Inst. 1993; 85: 365-376.
[30] D. F. Cella et al., The Functional Assessment of Cancer Therapy scale: development and validation of the general measure. J. Clin. Oncol. 1993; 11: 570-579.
[31] C. M. Moinpour, Costs of quality-of-life research in Southwest Oncology Group trials. J. Natl. Cancer Inst. Monogr. 1996; 20: 11-16.
[32] J. A. Sloan et al. and the Clinical Significance Consensus Meeting Group, The costs of incorporating quality of life assessments into clinical practice and research: what resources are required? Clin. Therapeut. 2003; 25(Suppl D).
[33] P. J. Goodwin et al., Health-related quality-of-life measurement in randomized clinical trials in breast cancer: taking stock. J. Natl. Cancer Inst. 2003; 95: 263-281.
[34] D. L. Fairclough, Design and Analysis of Quality of Life Studies in Clinical Trials. Boca Raton, FL: Chapman & Hall/CRC Press, 2002.

[35] I. F. Tannock et al., Chemotherapy with mitoxantrone plus prednisone or prednisone alone for symptomatic hormone-resistant prostate cancer: a Canadian randomized trial with palliative end points. J. Clin. Oncol. 1996; 14: 1756-1764.
[36] R. A. Deyo and R. M. Centor, Assessing the responsiveness of functional scales to clinical change: an analogy to diagnostic test performance. J. Chronic Dis. 1986; 39: 897-906.
[37] Kazis et al., 1989.
[38] R. A. Deyo, P. Diehr, and D. L. Patrick, Reproducibility and responsiveness of health status measures: statistics and strategies for evaluation. Control Clin. Trials 1991; 12(4 Suppl): 142S-158S.
[39] C. Bombardier, J. Raboud, and Auranofin Cooperating Group, A comparison of health-related quality-of-life measures for rheumatoid arthritis research. Control Clin. Trials 1991; 12(4 Suppl): 243S-256S.
[40] J. N. Katz et al., Comparative measurement sensitivity of short and longer health status instruments. Med. Care 1992; 30: 917-925.
[41] G. H. Guyatt, S. Walter, and G. Norman, Measuring change over time: assessing the usefulness of evaluative instruments. J. Chronic Dis. 1987; 40: 171-178.
[42] G. H. Guyatt, B. Kirshner, and R. Jaeschke, Measuring health status: what are the necessary measurement properties? J. Clin. Epidemiol. 1992; 45: 1341-1345.
[43] M. H. Liang et al., Comparative measurement efficiency and sensitivity of five health status instruments for arthritis research. Arthritis Rheum. 1985; 28: 542-547.
[44] K. W. Wyrwich, N. A. Nienaber, W. M. Tierney, and F. D. Wolinsky, Linking clinical relevance and statistical significance in evaluating intra-individual changes in health-related quality of life. Med. Care 1999; 37: 469-478.

[45] K. W. Wyrwich, W. M. Tierney, and F. D. Wolinsky, Further evidence supporting an SEM-based criterion for identifying meaningful intra-individual changes in health-related quality of life. J. Clin. Epidemiol. 1999; 52: 861-873.
[46] J. A. Sloan et al., Randomized comparison of four tools measuring overall quality of life in patients with advanced cancer. J. Clin. Oncol. 1998; 16: 3662-3673.
[47] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. London: Academic Press, 1988.
[48] D. Osoba, G. Rodrigues, J. Myles, B. Zee, and J. Pater, Interpreting the significance of changes in health-related quality of life scores. J. Clin. Oncol. 1998; 16: 139-144.
[49] G. H. Guyatt et al. and the Clinical Significance Consensus Meeting Group, Methods to explain the clinical significance of health status measures. Mayo Clin. Proc. 2002; 77: 371-383.
[50] R. F. Lockey et al., Nocturnal asthma: effect of salmeterol on quality of life and clinical outcomes. Chest 1999; 115: 666-673.
[51] E. F. Juniper et al., 1994.
[52] P. Van Cauwenberge and E. F. Juniper, The Star Study Investigating Group. Comparison of the efficacy, safety and quality of life provided by fexofenadine hydrochloride 120 mg, loratadine 10 mg and placebo administered once daily for the treatment of seasonal allergic rhinitis. Clin. Exp. Allergy 2000; 30: 891-899.
[53] E. F. Juniper et al., 1996.
[54] R. D. Hays and J. M. Woolley, The concept of clinically meaningful difference in health-related quality-of-life research. How meaningful is it? Pharmacoeconomics 2000; 18: 419-423.
[55] H. A. Burris, III et al., Improvements in survival and clinical benefit with gemcitabine as first-line therapy for patients with advanced pancreas cancer: a randomized trial. J. Clin. Oncol. 1997; 15: 2403-2413.


[56] N. Freemantle, M. Calvert, J. Wood, J. Eastaugh, and C. Griffin, Composite outcomes in randomized trials. Greater precision but with greater uncertainty? JAMA 2003; 289: 2554-2559.
[57] M. S. Lauer and E. J. Topol, Clinical trials: multiple treatment, multiple end points, and multiple lessons. JAMA 2003; 289: 2575-2577.
[58] J. von Pawel et al., Topotecan versus cyclophosphamide, doxorubicin, and vincristine for the treatment of recurrent small-cell lung cancer. J. Clin. Oncol. 1999; 17: 658-667.

Further Reading

[1] B. Spilker (ed.), Quality of Life and Pharmacoeconomics in Clinical Trials. Philadelphia: Lippincott-Raven, 1996.
[2] M. A. G. Sprangers, C. M. Moinpour, T. J. Moynihan, D. L. Patrick, and D. A. Revicki, Clinical Significance Consensus Meeting Group. Assessing meaningful change in quality of life over time: a users' guide for clinicians. Mayo Clin. Proc. 2002; 77: 561-571.
[3] J. E. Ware et al., SF-36 Health Survey: Manual and Interpretation Guide. Boston: The Health Institute, New England Medical Center, 1993.
[4] World Health Organization, International Classification of Impairments, Disabilities, and Handicaps. Geneva: World Health Organization, 1980.

4 Bandit Processes and Response-Adaptive Clinical Trials: The Art of Exploration Versus Exploitation

Xikui Wang

4.1

Introduction

Statistics show that the global life expectancy was 46 years in 1950 [17] and increased to 61 years in 1980 and to 67 years in 1998 [22]. On its web page, the World Health Organization [23] indicates that the average life expectancy at birth of the global population in 2011 was 70 years. We now live longer and better than ever before. Most of the gains have occurred in low- and middle-income countries and are attributed to improved nutrition and sanitation and improved public health infrastructure. Another important cause is of course the advances in medicine that have helped to improve both the mortality and morbidity of our lives. Clinical trials, as part of the mainstream clinical research, have significantly affected our lives because they provide the most reliable and efficient method to evaluate the effectiveness of new medical interventions. Clinical trials are controlled experiments on human subjects and so involve the paramount issues of collective ethics and individual ethics. They also require that randomization be used as an indispensable element of the planned experiment.

Challenges to clinical trials are often characterized by complex ethical and methodological issues. We face a delicate balance between minimizing biases of treatment comparison (in order to acquire evidence-based scientific knowledge) and maximizing ethics (in order to safeguard the well-being of patients in the trials). In desperate medical situations, collective ethics and individual ethics are in conflict, and it was argued by Pullman and Wang [14] that response-adaptive clinical trials are not only ethically justifiable but also morally required. Response-adaptive designs represent a major advancement in clinical trial methodology that helps balance the ethical issues and improve efficiency without undermining the validity and integrity of the clinical research. Response-adaptive designs use information so far accumulated from the trial to modify the randomization procedure and deliberately bias treatment allocation in order to assign more patients


to the potentially better treatment.

Bandit processes are statistical models for optimal sequential selections from several statistical populations or arms [1]. Some or all of these populations have unknown distributions. Each selection offers a random numerical payoff, the value of which provides information for the statistical inference of its population distribution, if unknown. The goal is to optimize a certain measure of expected payoffs from all selections. Bandit processes have found a variety of applications in diverse fields such as clinical trials [21], stochastic scheduling [3], queueing networks [10], and dynamic pricing [19], to name a few.

The common goal of both response-adaptive clinical trials and bandit processes is to achieve an appropriate balance between the competing goals of potential future benefit and expected immediate payoff. The future benefit is normally achieved by gathering useful information in order to make better informed decisions in the future, that is, by exploring the unknown environment and reducing uncertainty. The optimal expected immediate payoff is usually achieved by exploiting the currently available knowledge about the unknown environment. The essential challenge is that exploration and exploitation cannot be achieved in an optimal manner at the same time. The objective is to determine a certain kind of compromise and find a strategy that achieves the compromise. In the context of a clinical trial, exploration means acquiring scientific knowledge for the advancement of medicine and benefits the collective ethics. In contrast, exploitation means applying the currently available knowledge for the best possible treatment of the current patient in the trial, and hence favors the individual ethics. As for bandit processes, exploration means selecting from populations with unknown distributions in order to reduce uncertainty and exploitation means selecting from the population with the highest expected immediate payoff.

The goal of this article is twofold. First, we unify the response-adaptive clinical trials and bandit processes by Markov decision processes and demonstrate the existence of optimal strategies for bandit processes under the Bayesian approach. Second, when the responses are delayed and their observations are censored, such Markov decision processes are replaced by general controlled stochastic processes depending on the setting of the model. We provide technical details for information gathering to become a stochastic process, by showing the measurability of the state transition.

The article is organized as follows. In Section 4.2, we introduce Markov decision processes and show that bandit processes and response-adaptive clinical trials can be formulated as Markov decision processes. The existence of optimal strategies is shown for bandit processes with complete observations. In Section 4.3, with the Bayesian approach, we show that the mapping from the prior distribution to the posterior distribution is measurable when the observations are censored. Section 4.4 concludes the article.

4.2

Exploration Versus Exploitation with Complete Observations

In this section, we discuss the process of exploration versus exploitation (i.e., the EE process) when the observations are completely known before making decisions. The currently available knowledge of the exploration process forms the state of the dynamic process and is exploited, and the state evolution is described by a stochastic process or Markov process. This process of information gathering is controlled for the purpose of achieving a given optimality criterion.
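The idea that accumulated knowledge forms the state of a controlled process can be made concrete with a small simulation. The sketch below assumes a two-armed trial with binary responses and independent Beta(1, 1) priors, so that the state is the tuple of Beta parameters; it allocates patients by Thompson sampling, a common heuristic that is used here only for illustration and is not the optimal strategy studied in this article. The function names and the response rates are hypothetical.

```python
import random


def update(state, arm, success):
    """Return the new state (tuple of Beta parameters) after one response."""
    a, b = state[arm]
    new_state = list(state)
    new_state[arm] = (a + 1, b) if success else (a, b + 1)
    return tuple(new_state)


def thompson_choice(state, rng):
    """Draw one value from each arm's posterior and pick the largest draw."""
    draws = [rng.betavariate(a, b) for a, b in state]
    return max(range(len(state)), key=draws.__getitem__)


def run_trial(true_rates, n_patients, seed=0):
    """Simulate sequential allocation; true_rates are hypothetical."""
    rng = random.Random(seed)
    state = tuple((1, 1) for _ in true_rates)  # uniform priors on each arm
    allocations = [0] * len(true_rates)
    for _ in range(n_patients):
        arm = thompson_choice(state, rng)          # exploration and exploitation
        success = rng.random() < true_rates[arm]   # simulated patient response
        state = update(state, arm, success)        # state transition
        allocations[arm] += 1
    return state, allocations


final_state, allocations = run_trial([0.3, 0.6], n_patients=200)
print(final_state, allocations)
```

The state transition in `update` depends only on the current state, the chosen arm, and the observed response, which is the Markov property that the formulation relies on.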

4.2.1

The Model of Markov Decision Processes

Markov decision processes are a marriage between Markov processes and dynamic programming [15]. Alternatively, they are known as controlled Markov processes, stochastic dynamic programming, Markov decision programming, and Markov control processes. They capture the two fundamental features underlying both response-adaptive clinical trials and bandit processes: a practical situation involving uncertainty and a sequential setting. For simplicity, we consider only a discrete time Markov decision process consisting of five elements: the state space [...]

$$\lim_{N \to \infty} E_{s_1}\left[\sum_{n=1}^{N} \alpha^{n-1} r(s_n, a_n)\right]$$

if the limit exists, where $0 < \alpha < 1$, and (3) the infinite horizon average criterion

$$\lim_{N \to \infty} \frac{1}{N} E_{s_1}\left[\sum_{n=1}^{N} r(s_n, a_n)\right]$$

if the limit exists. The goal is to find an optimal strategy such that the controlled stochastic system performs optimally with respect to a predetermined optimality criterion.
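The discounted and average criteria mentioned above can be compared numerically on a single realized reward sequence. This is a minimal sketch under stated assumptions: it evaluates finite truncations along one path rather than expectations over the process, it takes the discount weight of the n-th reward to be the discount factor raised to n - 1, and the function names are introduced only for this example.

```python
def discounted_return(rewards, alpha):
    """Sum of alpha**(n-1) * r_n over n = 1..N for one reward sequence."""
    return sum(alpha ** k * r for k, r in enumerate(rewards))


def average_return(rewards):
    """Arithmetic mean of the rewards, the finite-N average criterion."""
    return sum(rewards) / len(rewards)


constant_rewards = [1.0] * 1000
# With a constant reward of 1 the discounted sum approaches 1 / (1 - alpha).
print(discounted_return(constant_rewards, alpha=0.9))
print(average_return(constant_rewards))
```

The discounted criterion weights early rewards most heavily, whereas the average criterion is insensitive to any finite initial stretch of the sequence, which is one reason the two can favor different strategies.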

4.2.2

The Bandit Processes Under the Bayesian Approach

Traditionally a bandit process is directly defined as a Markov decision process [5]. We view a bandit process from the statistical point of view [1] and reformulate it as a Markov decision process. For the bandit


process, the underlying environment has unknown parameters and we use the prior or posterior distribution of the unknown parameters as the state of the controlled stochastic (or Markov) process. Such a state summarizes all the information we need for making a decision. Let $A = I$ be the index set of populations (or arms), which can be finite, countably infinite, or compact. For each $i \in I$, population $i$, denoted as $\mathrm{POP}_i = \{X_{i,n}, n = 1, 2, \ldots\}$, consists of conditionally independent and identically distributed random variables, given the distribution $F_i$ of the population. Populations are assumed to be independent. The bandit process, denoted as $\mathrm{BP} = \{\mathrm{POP}_i : i \in I\}$, consists of populations indexed by $i \in I$ and is characterized by the collection $F = (F_i, i \in I)$ of distributions, where $F$ is unknown. If population $i \in I$ is selected for observation at time $n = 1, 2, \ldots$, the expected value of $X_{i,n}$ is the expected immediate payoff and the observation $x$ of $X_{i,n}$ helps understand the distribution if unknown. Let $V$ be the set of all distribution functions on $[0, \infty)$ and $T$ be a subset of $V^I = \{(F$ [...]

[...] $\log L \mid d_{ij}, y) > \pi_0$, (4)

where $\pi_0$, the tolerance level, can be set at a low value such as 0.05 or 0.20. The dose at which the above probability is equal to $\pi_0$ is called the maximum safe dose for the $i$th subject following the $j$th observation. The maximum safe dose is subject related, and posterior estimates may differ among subjects who have already been observed, being lower for a subject who previously had absorbed more drug than average and higher if the absorption was less. After each treatment period of the dose-escalation procedure, the posterior predictive probability that a future response will lie above the safety limit is updated. The decision of which dose to administer to each subject in each dosing period is made using a predefined criterion. This criterion can be based on safety; for example, one could use the maximum safe dose as the recommended dose. It can also be based on accuracy of estimates of unknown parameters; for example, the optimal choice of doses is that which will minimize the determinant of the posterior variance-covariance matrix of the joint posterior distribution or minimize the posterior variance of some key parameter.

5.3

An Example of Dose Escalation in Healthy Volunteer Studies

An example, described in detail by Whitehead et al. [5], in which the principal pharmacokinetic measure was $C_{\max}$ is outlined briefly here. The safety cutoff for this response was taken to be $y_L = \log(200)$. Seven doses were used according to the schedule: 2.5, 5, 10, 25, 50, 100, 150 µg. The actual trial was conducted according to a conventional design, and the dosing structure and the resulting data are listed in Table 2. From a SAS PROC MIXED analysis, maximum likelihood estimates of the parameters in model (1) are

$\theta_1 = 1.167$, $\theta_2 = 0.822$, $\sigma^2 = 0.053$, and $\tau^2 = 0.073$. (5)
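The estimates in (5) can be turned into two quick numerical checks. The sketch below rests on assumptions that are not spelled out in this part of the text: that the model is linear in log dose on the log $C_{\max}$ scale with a between-subject variance component (taken here to be the 0.073 estimate) and a within-subject component (the 0.053 estimate), and that plugging the point estimates into a normal distribution is an acceptable stand-in for the Bayesian predictive distribution. The second function is therefore a rough plug-in approximation, not the Bayesian maximum safe dose itself, and both function names are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# Point estimates from (5); the safety limit of 200 is from the text.
THETA1, THETA2 = 1.167, 0.822
SIGMA2, TAU2 = 0.053, 0.073
SAFETY_LIMIT = 200.0


def within_subject_correlation(sigma2, tau2):
    """Correlation between two responses from the same subject."""
    return tau2 / (sigma2 + tau2)


def plug_in_safe_dose(pi0):
    """Dose at which a new subject's response exceeds the limit with
    probability pi0, ignoring parameter uncertainty (an approximation)."""
    sd = sqrt(SIGMA2 + TAU2)                 # total standard deviation
    z = NormalDist().inv_cdf(1 - pi0)
    log_dose = (log(SAFETY_LIMIT) - z * sd - THETA1) / THETA2
    return exp(log_dose)


print(round(within_subject_correlation(SIGMA2, TAU2), 3))
print(plug_in_safe_dose(0.05), plug_in_safe_dose(0.20))
```

Because parameter uncertainty is ignored, this plug-in dose will generally be higher than a dose derived from a posterior predictive distribution early in a trial, when few observations are available.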

Note that it follows that the within-subject correlation is estimated as 0.579. As an illustration, the Bayesian method described above is applied retrospectively to this situation. Conjugate priors must first be expressed. It is assumed that prior expert opinion suggested that the $C_{\max}$ values would be 5.32 and 319.2 pg/ml at the doses 2.5 and 150 µg, respectively. This prior indicates that the highest dose, 150 µg, is not a safe dose because the predicted $C_{\max}$ exceeds the safety limit of 200 pg/ml. The value for $\rho$ is set as 0.6. This forms the bivariate normal distribution for $\theta$,

$$\theta \sim N\left(\begin{pmatrix} 0.756 \\ \cdots \end{pmatrix}, \begin{pmatrix} 0.940 & -0.109 \\ \cdots & \cdots \end{pmatrix}\right).$$

The value for $\alpha_0$ is set as 1, as suggested by Whitehead et al. [5]. The value for

Bayesian Dose-Finding Designs in Healthy Volunteers

Table 2: Real Data from a Healthy Volunteer Trial in Whitehead et al. (2006). For each of subjects 1 to 25, the table lists the dose (µg) and the observed Cmax (pg/ml) in each of dosing periods 1 to 4.

Source: Whitehead et al. Stat Med. 2006; 25: 433-445.

$\beta_0$ can be found via the safety constraint: $P(y_{0i} > \log L \mid d_{0i} = 2.5) = 0.05$. This implies that dose 2.5 µg will be the maximum safe dose for all new subjects at the first dosing period in the first cohort. Therefore, $\nu \sim \mathrm{Ga}(1, 0.309)$. To illustrate Bayesian dose escalation, data are simulated based on the parameters found from the mixed model analysis given by Whitehead et al. [5]. Thus, $C_{\max}$ values are generated from three cohorts of eight healthy volunteers, each treated in four consecutive periods and receiving three active doses and one randomly placed placebo. A simulated dose escalation with doses chosen according to the maximum safe dose criterion is shown in Table 3. The first six subjects received the lowest dose, 2.5 µg. All subjects at the next dosing period received dose 10 µg, so that dose 5 µg was skipped for subjects 1 to 4. Subjects 7 and 8, who were on placebo in the first dosing period, skipped two doses, 2.5 and 5 µg, to receive 10 µg in the second dosing period. If this dosing proposal were presented to a safety committee in a real trial, the committee members might wish to alter this dose recommendation. The Bayesian approach provides a scientific recommendation for dose escalations. However, the final decision on which doses are given should come from a safety committee. The procedure would be able to make use of results from any dose administered. In Table 3, it is shown that the maximum safe dose for a new subject at the beginning of the second cohort is 25 µg. All subjects in the second cohort received 50 µg at least twice. Subjects 17 to 22 in the first dosing period of the final cohort received 50 µg. However, they all had a high value of $C_{\max}$. The Bayesian approach then recommended a lower dose, 25 µg, for subjects 17 to 20 (subjects 21 and 22 were on placebo). This shows that the Bayesian approach can react to different situations quickly: When escalation looks

Volunteers

safe, dose levels can be skipped; when escalation appears to be unsafe, lower doses are recommended. Two high doses that were administered in the real trial, 100 and 150 /xg, were never used in this illustrative run. The posterior distributions for 6 at the end of the third cohort are , ~

/ / l . 3 7 6 \ ( 0.024 v^o.79o;' v - ° - 0 0 6

—0.006 \ \ ° - 0 0 2 / J'

and v ~ Ga(37, 2.625). Figure 1 shows the doses administered to each subject and the corresponding responses. Table 4 gives the maximum likelihood estimates from the real data in Table 2 that were used as the true values in the simulation, together with the maximum likelihood estimates from the simulated data in Table 3. Results show that o 2 and T2 were underestimated from the simulated data, with there being no evidence of between-subject variation. Consequently, the estimated correlation from the simulated data is zero, in contrast to the true value of 0.579 used in the simulation. This is a consequence of the small data set, and it illustrates the value of fixing p during the escalation process. Different prior settings will result in different dose escalations. For example, if the value for ao is changed from 1.0 to 0.1, then the dose escalation will be more rapid. Table 5 summarizes the recommended doses and simulated responses from another simulation run where ao = 0.1. In the second dosing period of the first cohort, subjects 1, 2, and 4 skipped two doses, 5 and 10 /xg. Subjects 3, 7, and 8 skipped three doses in that cohort. The starting dose for all subjects in the first dosing period of the second and third cohorts was 50 /xg (25 /xg and 50 /xg were the doses in Table 3). All subjects, except subject 13 at the fourth period, repeatedly received 50 /xg during the second cohort. In the third cohort, the dose of 100 /xg was used quite frequently. The highest dose, 150 /xg, was never used. This example shows that different prior settings will


Table 3: A Simulated Dose Escalation
(Doses in μg. Each period lists, in subject order, the six subjects per cohort who received an active dose; subjects on placebo in a period have no entry.)

Period 1
  Cohort 1 (subjects 1-8)    Dose: 2.5  2.5  2.5  2.5  2.5  2.5     Cmax: 9.9  12.4  10.4  11.1  6.5  7.7
  Cohort 2 (subjects 9-16)   Dose: 25  25  25  25  25  25           Cmax: 38.9  41.4  42.5  53.4  50.8  48.7
  Cohort 3 (subjects 17-24)  Dose: 50  50  50  50  50  50           Cmax: 138.2  194.4  173.2  139.9  181.6  117.9

Period 2
  Cohort 1                   Dose: 10  10  10  10  10  10           Cmax: 19.7  27.0  35.6  35.7  21.3  24.5
  Cohort 2                   Dose: 50  50  50  50  50  50           Cmax: 84.4  71.0  87.1  67.2  84.4  82.2
  Cohort 3                   Dose: 25  25  25  25  50  50           Cmax: 73.6  57.9  76.6  43.1  93.6  97.5

Period 3
  Cohort 1                   Dose: 25  25  25  25  50  25           Cmax: 64.1  76.4  33.7  62.9  93.2  52.8
  Cohort 2                   Dose: 50  50  50  50  50  50           Cmax: 53.7  58.6  60.4  66.5  89.4  75.5
  Cohort 3                   Dose: 25  25  25  50  50  50           Cmax: 46.5  69.2  47.2  87.7  75.5  79.2

Period 4
  Cohort 1                   Dose: 25  25  50  50  50  50           Cmax: 44.8  27.1  58.8  57.9  58.0  58.1
  Cohort 2                   Dose: 50  50  50  50  50  50           Cmax: 92.7  91.9  79.6  110.6  62.6  76.4
  Cohort 3                   Dose: 25  50  50  50  50  50           Cmax: 37.4  80.2  79.2  63.9  82.5  65.9

Table 4: Maximum Likelihood Estimates (MLE) and Bayesian Model Estimates of Simulated Data in Table 3 (with Standard Errors or Standard Deviations)

                                               $\theta_1$       $\theta_2$       $\sigma^2$   $\tau^2$   $\rho$   $d_f^*$
Truth for simulations (MLE from real data)     1.167 (0.158)    0.822 (0.046)    0.053        0.073      0.579    73.42
Final MLE (exclude the prior information)      1.568 (0.143)    0.741 (0.041)    0.090        0.000      0.000    69.45
Bayesian prior model estimates                 0.759 (0.967)    1.00 (0.192)     0.309        0.463      0.6      2.5
Bayesian posterior model estimates             1.376 (0.156)    0.790 (0.042)    0.071        0.106      0.6      58.18

Figure 1: An illustration of a simulated dose escalation (using data from Table 3). [Upper panel: dose by subject (1 to 24). Lower panel: $C_{\max}$ by subject, with reference lines at the safety limit $L$, 50% of the safety limit, and 20% of the safety limit. Key: plotting symbols distinguish $C_{\max} \in (0, 0.2L)$, $C_{\max} \in (0.2L, 0.5L)$, and $C_{\max} \in (0.5L, L)$.]


affect dose-escalation procedures. Multiple simulations should therefore be conducted to gain a better understanding of the properties of a design. Different scenarios should be tried to ensure that the procedure has good properties, whether the drug is safe, unsafe, or only safe for some lower doses.

5.4

Discussion

Bayesian methods offer advantages over conventional designs. Unlike with conventional designs, more doses within the predefined dose range, or even outside of it, can be explored without necessarily needing extra dosing time, because dose level skipping can be permitted. From the simulation runs in Tables 3 and 5, more doses, such as 40, 60, and 80 μg, could perhaps be included in the dose scheme. Simulations have shown that skipping dose levels does not harm either safety or accuracy; on the contrary, it improves safety or accuracy, as a wider range of doses is used [12]. Providing a greater choice of doses, while allowing dose skipping, leads to procedures that are more likely to find the target dose and to learn about the dose-response relationship efficiently.

Ethical concerns can be expressed through cautious priors or safety constraints. Dose escalation will be dominated by prior opinion at the beginning, but it will soon be influenced more by the accumulating real data. Before a Bayesian design is implemented in a specific phase I trial, intensive simulations should be used to evaluate different prior settings, safety constraints, and dosing criteria under a range of scenarios. Investigators can then choose a satisfactory design based on the simulation properties that they are interested in. For instance, they may look for a design that gives the smallest number of toxicities or the most accurate estimates. Until recently, such prospective assessment of design options was not possible, but now that the necessary computing tools are available, it would appear inexcusable not to explore any proposed design before it is implemented.

Although the methodology described here is presented only for a single pharmacokinetic outcome, the principles are easily generalized to multiple endpoints. Optimal doses can be found according to the safety limits for each of the endpoints, and then the lowest of these doses can be recommended. The Bayesian decision-theoretic approach has also been extended for application to an attention deficit disorder study [13], where a pharmacodynamic response (heart rate change from baseline), a pharmacokinetic response (AUC), and a binary response (occurrence of any dose-limiting events) are modeled. A one-off simulation run indicates that the Bayesian approach controls unwanted events, dose-limiting events, and AUC levels exceeding the safety limit, while achieving more heart rate changes within the therapeutic range.

Bayesian methodology only provides scientific dose recommendations. These should be treated as additional information for guidance of the safety committee of a trial rather than dictating final doses to be administered. Statisticians and clinicians need to be familiar with this methodology. Once the trial starts, data need to be available quickly and presented unblinded with dose recommendations to the safety committee. The committee can then make a final decision, based on the formal recommendations, together with all of the additional safety, laboratory, and other data available.


Table 5: A Simulated Dose Escalation (the value for $\alpha_0$ is 0.1)
(Doses in μg. Each period lists, in subject order, the six subjects per cohort who received an active dose; subjects on placebo in a period have no entry.)

Period 1
  Cohort 1 (subjects 1-8)    Dose: 2.5  2.5  2.5  2.5  2.5  2.5     Cmax: 8.2  9.9  6.2  8.4  8.0  6.4
  Cohort 2 (subjects 9-16)   Dose: 50  50  50  50  50  50           Cmax: 88.9  104.4  74.9  98.3  86.2  89.4
  Cohort 3 (subjects 17-24)  Dose: 50  50  50  50  50  50           Cmax: 58.5  48.3  70.5  45.3  55.3  49.1

Period 2
  Cohort 1                   Dose: 25  25  50  25  25  25           Cmax: 53.4  40.4  118.5  61.1  65.1  37.7
  Cohort 2                   Dose: 50  50  50  50  50  50           Cmax: 64.4  49.5  64.7  84.0  109.9  115.0
  Cohort 3                   Dose: 100  100  50  100  100  50       Cmax: 172.8  140.7  73.8  89.7  157.6  72.8

Period 3
  Cohort 1                   Dose: 50  50  50  50  50  50           Cmax: 66.1  78.9  92.7  74.1  60.9  98.2
  Cohort 2                   Dose: 50  50  50  50  50  50           Cmax: 46.6  46.2  48.4  56.7  46.6  68.5
  Cohort 3                   Dose: 50  100  100  100  100  50       Cmax: 60.5  98.2  142.0  115.6  102.5  63.9

Period 4
  Cohort 1                   Dose: 50  50  50  50  50  50           Cmax: 76.7  84.2  101.7  83.0  78.2  72.0
  Cohort 2                   Dose: 50  50  100  50  50  50          Cmax: 57.9  72.6  83.6  60.8  59.2  93.4
  Cohort 3                   Dose: 50  100  100  100  100  100      Cmax: 92.7  119.8  106.5  101.5  156.5  145.3

References

[1] U.S. Food and Drug Administration. 1997. Guidance for Industry: General Considerations for the Clinical Evaluation of Drugs.
[2] K. Gough, M. Hutchison, O. Keene, B. Byrom, S. Ellis, et al., Assessment of dose proportionality: Report from the Statisticians in the Pharmaceutical Industry/Pharmacokinetic UK Joint Working Party. Drug Inf J. 1995; 29: 1039-1048.
[3] S. Patterson, S. Francis, M. Ireson, D. Webber, and J. Whitehead, A novel Bayesian decision procedure for early-phase dose finding studies. J Biopharm Stat. 1999; 9: 583-597.
[4] J. Whitehead, Y. Zhou, S. Patterson, D. Webber, and S. Francis, Easy-to-implement Bayesian methods for dose-escalation studies in healthy volunteers. Biostatistics. 2001; 2: 47-61.
[5] J. Whitehead, Y. Zhou, A. Mander, S. Ritchie, A. Sabin, and A. Wright, An evaluation of Bayesian designs for dose-escalation studies in healthy volunteers. Stat Med. 2006; 25: 433-445.
[6] Y. Zhou, Dose-escalation methods for phase I healthy volunteer studies. In: S. Chevret (ed.), Statistical Methods for Dose-Finding Experiments. Chichester, UK: Wiley, 2006, pp. 189-204.
[7] S. C. Chow and J. P. Liu, Design and Analysis of Bioavailability and Bioequivalence Studies. Amsterdam: Dekker, 1999.
[8] W. J. Westlake, Bioavailability and bioequivalence of pharmaceutical formulations. In: K. E. Peace (ed.), Biopharmaceutical Statistics for Drug Development. New York: Dekker, 1988, pp. 329-352.
[9] D. V. Lindley, Making Decisions. London: Wiley, 1971.
[10] J. Q. Smith, Decision Analysis: A Bayesian Approach. London: Chapman & Hall, 1988.
[11] J. O. Berger, Statistical Decision Theory and Bayesian Analysis. New York: Springer, 1985.
[12] Y. Zhou and M. Lucini, Gaining acceptability for the Bayesian decision-theoretic approach in dose escalation studies. Pharm Stat. 2005; 4: 161-171.
[13] L. Hampson, Bayesian Methods to Escalate Doses in a Phase I Clinical Trial. M.Sc. dissertation. School of Applied Statistics, University of Reading, Reading, UK, 2005.

6 Bootstrap

Tim Hesterberg

6.1

Introduction

We begin with an example of the simplest type of bootstrapping in this section. We then discuss the idea behind the bootstrap, implementation by random sampling, using the bootstrap to estimate standard error and bias, the central limit theorem and different types of bootstraps, the accuracy of the bootstrap, confidence intervals, hypothesis tests, planning clinical trials, and the number of bootstrap samples needed and ways to reduce this number. We conclude with references for additional reading.

Figure 1 shows a normal quantile plot of arsenic concentrations from 271 wells in Bangladesh, from http://www.bgs.ac.uk/arsenic/bangladesh/Data/SpecialStudyData.csv, referenced from statlib, http://lib.stat.cmu.edu/datasets. The sample mean and standard deviation are $\bar{x} = 124.5$ and $s = 298$, respectively. The usual formula standard error is $s/\sqrt{n} = 18.1$, and the usual 95% confidence interval $\bar{x} \pm t_{\alpha/2,\,n-1}\, s/\sqrt{n}$ is (88.8, 160.2). This interval may be suspect because of the skewness of the data, despite the reasonably large sample size.

We may use the bootstrap for inferences for the mean of this data set. We draw a bootstrap sample, or resample, of size $n$ with replacement from the data and compute the mean. We repeat this process many times, say $10^4$ or more. The resulting bootstrap means comprise the bootstrap distribution, which we use to estimate aspects of the sampling distribution for $\bar{X}$. Figure 2 shows a histogram and a normal quantile plot of the bootstrap distribution. The bootstrap standard error is the standard deviation of the bootstrap distribution; in this case the bootstrap standard error is 18.2, which is close to the formula standard error. The mean of the bootstrap means is 124.4, which is close to $\bar{x}$ (the difference is -0.047, to three decimal places). The bootstrap distribution looks normal, with some skewness. This amount of skewness is a cause for concern.

This example may be counter to the intuition of many readers, who use normal probability plots to look at data. This bootstrap distribution corresponds to a sampling distribution, not raw data. This distribution is after the central limit theorem has had its one chance to work, so any deviations from normality here may translate into errors in inferences. We may quantify how badly this amount of skewness affects confidence intervals; we defer this to Section 6.8, Bootstrap Confidence Intervals. We first discuss the idea behind the bootstrap, and give some idea of its versatility.

Figure 1: Arsenic concentrations in 271 wells in Bangladesh. [Normal quantile plot; horizontal axis: Quantiles of Standard Normal.]

Figure 2: Bootstrap distribution for arsenic concentrations.
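The resampling loop described in this section can be sketched in a few lines. This is only an illustration under stated assumptions: the arsenic data are not reproduced here, so the sketch uses hypothetical skewed data of the same size, and the function name `bootstrap_means` and the choice B = 2000 are ours.

```python
import random
import statistics

def bootstrap_means(data, B=2000, seed=0):
    """Return B means, each from a resample of size n drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(B)]

# Hypothetical skewed data standing in for the 271 arsenic concentrations.
data_rng = random.Random(1)
data = [data_rng.lognormvariate(3.5, 1.5) for _ in range(271)]

boot = bootstrap_means(data)
formula_se = statistics.stdev(data) / len(data) ** 0.5  # usual s / sqrt(n)
boot_se = statistics.stdev(boot)                        # SD of the bootstrap means
print(f"formula SE = {formula_se:.2f}, bootstrap SE = {boot_se:.2f}")
```

For the sample mean the two standard errors should agree closely; the bootstrap becomes more valuable for statistics that have no simple formula.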

6.2

Plug-In Principle

The idea behind the bootstrap is the plug-in principle [1]: if a quantity is unknown, then we plug in an estimate for it. This principle is used all the time in statistics. The standard deviation of a sample mean for i.i.d. observations from a population with standard deviation $\sigma$ is $\sigma/\sqrt{n}$; when $\sigma$ is unknown, we plug in an estimate $s$ to obtain the usual standard error $s/\sqrt{n}$. What is different in the bootstrap is that we plug in an estimate for the whole population, not just for a numerical summary of the population.

Statistical inference depends on the sampling distribution. The sampling distribution depends on the following: 1. the underlying population(s), 2. the sampling procedure, and 3. the statistic, such as $\bar{X}$. Conceptually, the sampling distribution is the result of drawing many samples from the population and calculating the statistic for each. The bootstrap principle is to plug in an estimate for the population, then mimic the real-life sampling procedure and statistic calculation. The bootstrap distribution depends on 1. an estimate for the population(s), 2. the sampling procedure, and 3. the statistic, such as $\bar{X}$.

The simplest case is when the original data are an i.i.d. sample from a single population, and we use the empirical distribution $\hat{F}_n$ to estimate the population, where $\hat{F}_n(u) = (1/n) \sum_{i=1}^{n} I(x_i \le u)$. This equation gives the ordinary nonparametric bootstrap, which corresponds to drawing samples of size $n$ with replacement from the original data.

6.2.1

How Useful is the Bootstrap Distribution?

A fundamental question is how well the bootstrap distribution approximates the sampling distribution. We discuss this question in greater detail in Section 6.7, Accuracy of Bootstrap Distributions, but note a few key points here. For most common estimators (statistics that are estimates of a population parameter; e.g., $\bar{X}$ is an estimator for $\mu$, whereas a $t$ statistic is not an estimator), and under fairly general distribution assumptions:

center: the center of the bootstrap distribution is not an accurate approximation for the center of the sampling distribution. For example, the bootstrap distribution for $\bar{X}$ is centered at approximately $\bar{x}$, the mean of the sample, whereas the sampling distribution is centered at $\mu$.

spread: the spread of the bootstrap distribution does reflect the spread of the sampling distribution.

bias: the bootstrap bias estimate (see below) does reflect the bias of the sampling distribution.

skewness: the skewness of the bootstrap distribution does reflect the skewness of the sampling distribution.

The first point bears emphasis. It means that the bootstrap is not used to get better parameter estimates because the bootstrap distributions are centered around statistics $\hat\theta$ calculated from the data (e.g., $\bar{x}$ or regression slope $\hat\beta$) rather than the unknown population values (e.g., $\mu$ or $\beta$). Drawing thousands of bootstrap observations from the original data is not like drawing observations from the underlying population; it does not create new data. Instead, the bootstrap sampling is useful for quantifying the behavior of a parameter estimate, such as its standard error or bias, or calculating confidence intervals.

Exceptions do exist where bootstrap averages are useful for estimation, such as random forests [2]. These examples are beyond the scope of this article, except that we give a toy example to illustrate the mechanism. Consider the case of simple linear regression, and suppose that a strong linear relationship exists between $y$ and $x$. However, instead of using linear regression, one uses a step function; the data are split into eight equal-size groups based on $x$, and the $y$ values in each group are averaged to obtain the altitude for the step. Applying the same procedure to bootstrap samples randomizes the location of the step edges, and averaging across the bootstrap samples smooths the edges of the steps. This method is shown in Figure 3. A similar effect holds in random forests, using bootstrap averaging of tree models, which fit higher-dimensional data using multivariate analogs of step functions.

6.2.2

Other Population Estimates

Other estimates of the population may be used. For example, if there were reason to assume that the arsenic data followed a gamma distribution, we could estimate parameters for the gamma distribution, then draw samples from a gamma distribution with those estimated parameters. In other cases, we may believe that the underlying population is continuous; rather than draw from the discrete empirical distribution, we may instead draw samples from a density estimated from the data, say a kernel density estimate. We return to this point in Section 6.7.2, Bootstrap Distributions Are Too Narrow.
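Both alternatives mentioned above, a parametric (gamma) bootstrap and a smoothed bootstrap drawn from a kernel density estimate, can be sketched as follows. The assumptions are ours: the gamma parameters are fitted by the method of moments rather than maximum likelihood, the kernel is Gaussian with an arbitrary illustrative bandwidth, and the data are hypothetical.

```python
import random
import statistics

def fit_gamma_moments(data):
    """Method-of-moments gamma fit (an assumption; MLE is another option)."""
    m = statistics.fmean(data)
    v = statistics.pvariance(data)
    return m * m / v, v / m  # shape, scale

def parametric_bootstrap_means(data, B=1000, seed=0):
    """Means of samples of size n drawn from the fitted gamma distribution."""
    rng = random.Random(seed)
    shape, scale = fit_gamma_moments(data)
    n = len(data)
    return [statistics.fmean(rng.gammavariate(shape, scale) for _ in range(n))
            for _ in range(B)]

def smoothed_bootstrap_means(data, bandwidth, B=1000, seed=0):
    """Means of samples from a Gaussian kernel density estimate:
    pick an observation at random, then add N(0, bandwidth^2) noise."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choice(data) + rng.gauss(0.0, bandwidth)
                             for _ in range(n))
            for _ in range(B)]

# Hypothetical positive, right-skewed data.
data_rng = random.Random(2)
data = [data_rng.gammavariate(2.0, 50.0) for _ in range(100)]

par = parametric_bootstrap_means(data)
smo = smoothed_bootstrap_means(data, bandwidth=10.0)
print(statistics.stdev(par), statistics.stdev(smo))
```

The smoothed version adds the kernel variance to each draw, so its bootstrap distribution is slightly wider than that of the ordinary nonparametric bootstrap.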

6.2.3

Other Sampling Procedures

When the original data were not obtained using an i.i.d. sample, the bootstrap sampling should reflect the actual data collection. For example, in stratified sampling applications, the bootstrap sampling should be stratified. If the original data are dependent, the bootstrap sampling should reflect the dependence; this may not be straightforward. Some cases exist where the bootstrap sampling should differ from the actual sampling procedure, including:
• regression (see Section 6.5, Examples)
• planning clinical trials (see Section 6.10, Planning Clinical Trials)
• hypothesis testing (see Section 6.9, Hypothesis Testing), and
• small samples (see Section 6.7.2, Bootstrap Distributions Are Too Narrow)
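For the stratified case, a minimal sketch is to resample within each stratum separately, keeping every stratum at its original size. The stratum names, the data values, and the statistic (a difference in means) are hypothetical choices for illustration.

```python
import random
import statistics

def resample_strata(strata, rng):
    """Resample each stratum with replacement, preserving its size."""
    return {name: rng.choices(values, k=len(values))
            for name, values in strata.items()}

def stratified_bootstrap(strata, statistic, B=1000, seed=0):
    """Apply `statistic` to B stratified resamples."""
    rng = random.Random(seed)
    return [statistic(resample_strata(strata, rng)) for _ in range(B)]

# Hypothetical two-stratum data set.
strata = {
    "site_a": [5.1, 6.3, 4.8, 7.0, 5.9, 6.1],
    "site_b": [4.2, 3.9, 5.0, 4.4],
}

def diff_in_means(s):
    return statistics.fmean(s["site_a"]) - statistics.fmean(s["site_b"])

reps = stratified_bootstrap(strata, diff_in_means)
print("bootstrap SE of the difference:", statistics.stdev(reps))
```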

6.2.4

Other Statistics

The bootstrap procedure may be used with a wide variety of statistics—mean, median, trimmed mean, regression coefficients, hazard ratio, x-intercept in a regression, and others—using the same procedure. It does not require problem-specific analytical calculations. This factor is a major advantage of the bootstrap. It allows statistical inferences such as confidence intervals to be calculated even for statistics for which there are no easy formulas. It offers hope of reforming statistical practice—away from simple but nonrobust estimators like a sample mean or least-squares regression estimate, in favor of robust alternatives.
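The point that one procedure serves many statistics can be sketched with a single helper that accepts any function of the data. The helper name `bootstrap_se`, the 25% trimming fraction, and the data (standard normal values plus five fixed outliers) are hypothetical choices for illustration.

```python
import random
import statistics

def bootstrap_se(data, statistic, B=2000, seed=0):
    """Bootstrap standard error of an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    reps = [statistic(rng.choices(data, k=n)) for _ in range(B)]
    return statistics.stdev(reps)

def trimmed_mean(x, frac=0.25):
    """Mean after dropping the lowest and highest `frac` of the values."""
    xs = sorted(x)
    k = int(len(xs) * frac)
    return statistics.fmean(xs[k:len(xs) - k])

# Hypothetical data: 95 well-behaved values plus 5 gross outliers.
data_rng = random.Random(4)
data = [data_rng.gauss(0.0, 1.0) for _ in range(95)]
data += [-40.0, 35.0, 50.0, -60.0, 45.0]

se = {
    "mean": bootstrap_se(data, statistics.fmean),
    "median": bootstrap_se(data, statistics.median),
    "trimmed mean": bootstrap_se(data, trimmed_mean),
}
for name, value in se.items():
    print(f"{name}: SE = {value:.3f}")
```

With outliers present, the robust statistics should show much smaller standard errors than the mean, and no statistic-specific formula was needed to find that out.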

Figure 3: Step function defined by eight equal-size groups, and average across bootstrap samples of step functions

6.3

Monte Carlo Sampling— The "Second Bootstrap Principle"

The second bootstrap "principle" is that the bootstrap is implemented by random sampling. This aspect is not actually a principle, but an implementation detail. Given that we are drawing i.i.d. samples of size $n$ from the empirical distribution $\hat{F}_n$, there are at most $n^n$ possible samples. In small samples, we could create all possible bootstrap samples, deterministically. In practice, $n$ is usually too large for that to be feasible, so we use random sampling. Let $B$ be the number of bootstrap samples used (e.g., $B = 10^4$). The resulting $B$ statistic values represent a random sample of size $B$ with replacement from the theoretical bootstrap distribution that consists of $n^n$ values (including ties).

In some cases, we can calculate the theoretical bootstrap distribution without simulation. In the arsenic example, parametric bootstrapping from a gamma distribution causes the theoretical bootstrap distribution for the sample mean to be another gamma distribution. In other cases, we can calculate some aspects of the sampling distribution without simulation. In the case of the nonparametric bootstrap when the statistic is the sample mean, the mean and standard deviation of the theoretical bootstrap distribution are $\bar{x}$ and $\hat\sigma_{\hat{F}}/\sqrt{n}$, respectively, where

$$\hat\sigma_{\hat{F}}^2 = n^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

Note that this differs from the usual sample standard deviation in using a divisor of $n$ instead of $n - 1$. We return to this point in Section 6.7.2, Bootstrap Distributions Are Too Narrow.

The use of Monte Carlo sampling adds additional unwanted variability, which may be reduced by increasing the value of $B$. We discuss how large $B$ should be in Section 6.11, How Many Bootstrap Samples Are Needed.
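For a very small sample the theoretical bootstrap distribution can be enumerated exactly, which makes the statements above easy to check. The sketch below uses a hypothetical sample of four values, lists all $n^n = 256$ equally likely resamples, and compares the mean and standard deviation of the resulting means with $\bar{x}$ and the divisor-$n$ standard deviation over $\sqrt{n}$.

```python
import itertools
import math
import statistics

data = [1.0, 2.0, 4.0, 9.0]  # hypothetical tiny sample
n = len(data)

# Every ordered resample of size n with replacement: n**n of them.
all_means = [statistics.fmean(s) for s in itertools.product(data, repeat=n)]

theo_mean = statistics.fmean(all_means)   # center of the theoretical distribution
theo_sd = statistics.pstdev(all_means)    # its exact standard deviation
sigma_F = statistics.pstdev(data)         # divisor n, not n - 1

print(len(all_means), theo_mean, theo_sd, sigma_F / math.sqrt(n))
```

Monte Carlo sampling with $B$ resamples approximates these exact values, with additional random error that shrinks as $B$ grows.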

6.4

Bias and Standard Error

Let $\theta = \theta(F)$ be a parameter of a population, such as the mean, or difference in regression coefficients between subpopulations. Let $\hat\theta$ be the corresponding estimate from the data, $\hat\theta^*$ the estimate from a bootstrap sample,

$$\bar\theta^* = B^{-1} \sum_{b=1}^{B} \hat\theta^*_b$$

the average of $B$ bootstrap estimates, and $s_B$, defined by

$$s_B^2 = (B-1)^{-1} \sum_{b=1}^{B} (\hat\theta^*_b - \bar\theta^*)^2,$$

the sample standard deviation of the bootstrap estimates.

Some bootstrap calculations require that $\hat\theta$ be a functional statistic, which is one that depends on the data only through the empirical distribution, not on $n$. A mean is a functional statistic, whereas the usual sample standard deviation $s$ with divisor $n - 1$ is not: repeating each observation twice gives the same empirical distribution but a different $s$. The bootstrap bias estimate for a functional statistic is

$$\bar\theta^* - \hat\theta. \qquad (1)$$

Note how this example relates to the plug-in principle. The bias of a statistic is $E(\hat\theta) - \theta$, which for a functional statistic may be expanded as $E_F(\hat\theta) - \theta(F)$, the expected value of $\hat\theta$ when sampling from $F$ minus the value for population $F$. Substituting $\hat{F}$ for the unknown $F$ in both terms yields the theoretical bootstrap analog

$$E_{\hat{F}}(\hat\theta^*) - \theta(\hat{F}). \qquad (2)$$

The Monte Carlo version in Equation (1) substitutes the sample average of bootstrap statistics for the expected value.

Figure 4: Bootstrap distribution for relative risk. [Panel labels: Relative Risk; Proportion in low risk group; Slope = Relative Risk.]
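The bias and standard error estimates of this section can be sketched for a functional statistic whose bias is known. The example below is an illustration under our own assumptions: the statistic is the plug-in variance (divisor $n$), for which the theoretical bootstrap bias is exactly $-\hat\theta/n$, and the data are hypothetical.

```python
import random
import statistics

def plug_in_variance(x):
    """Variance with divisor n: a functional statistic."""
    m = statistics.fmean(x)
    return sum((v - m) ** 2 for v in x) / len(x)

def bootstrap_bias_se(data, statistic, B=4000, seed=0):
    """Monte Carlo bias estimate (mean of replicates minus the original
    estimate) and standard error (SD of the replicates)."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistic(data)
    reps = [statistic(rng.choices(data, k=n)) for _ in range(B)]
    theta_bar = statistics.fmean(reps)
    return theta_bar - theta_hat, statistics.stdev(reps)

data_rng = random.Random(5)
data = [data_rng.gauss(10.0, 2.0) for _ in range(30)]  # hypothetical sample

theta_hat = plug_in_variance(data)
bias, se = bootstrap_bias_se(data, plug_in_variance)
print(f"estimate = {theta_hat:.3f}, bias = {bias:.3f}, SE = {se:.3f}")
print(f"theoretical bootstrap bias = {-theta_hat / len(data):.3f}")
```

The Monte Carlo bias differs from the theoretical value only by simulation error, which decreases as $B$ increases.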

6.5

Examples

In this section, we consider some examples, with a particular eye to standard error, bias, and normality of the sampling distribution.

6.5.1

Relative Risk

A major study of the association between blood pressure and cardiovascular disease found that 55 of 3338 men with high blood pressure died of cardiovascular disease during the study period, compared with 21 out of 2676 patients with low blood pressure. The estimated relative risk is $\hat\theta = \hat{p}_1/\hat{p}_2 = 0.0165/0.0078 = 2.10$ (computed from the unrounded proportions). To bootstrap this, we draw samples of size $n_1 = 3338$ with replacement from the first group, independently draw samples of size $n_2 = 2676$ from the second group, and calculate the relative risk $\hat\theta^*$. In addition, we record the individual proportions $\hat{p}_1^*$ and $\hat{p}_2^*$.

The bootstrap distribution for relative risk is shown in the left panel of Figure 4. It is highly skewed, with a long right tail caused by a divisor relatively close to zero. The standard error, from a sample of $10^4$ observations, is 0.6188. The theoretical bootstrap standard error is undefined because some of the $n_1^{n_1} n_2^{n_2}$ bootstrap samples have $\hat\theta^*$ undefined because the denominator $\hat{p}_2^*$ is zero; this aspect is not important in practice. The average of the bootstrap replicates is larger than the original relative risk, which indicates bias. The estimated bias is 2.205 - 2.100 ≈ 0.106 (the displayed values are rounded), which is 0.17 standard errors. Although the bias does not seem large in the figure, this amount of bias can have a substantial impact on inferences; a rough calculation suggests that the actual noncoverage of one side of a two-sided 95% confidence interval would be $\Phi(0.17 - 1.96) = 0.0367$ rather than 0.025, or 47% too large.

The right panel of Figure 4 shows the joint bootstrap distribution of $\hat{p}_1^*$ and $\hat{p}_2^*$. Each point corresponds to one bootstrap sample, and the relative risk is the slope of a line between the origin and the point. The original data are at the intersection of horizontal and vertical lines. The solid diagonal lines exclude 2.5% of the bootstrap observations on each side; the slopes are the end points of a 95% bootstrap percentile confidence interval. The bottom and top dashed diagonal lines are the end points of a $t$ interval with standard error obtained using the usual delta method. This interval corresponds to calculating the standard error of residuals above and below the central line (the line with slope $\hat\theta$), going up and down 1.96 residual standard errors from the central point (the original data) to the circled points; the end points of the interval are the slopes of the lines from the origin to the circled points.

A $t$ interval would not be appropriate in the example because of the bias and skewness. In practice, one would normally do a $t$ interval on a transformed statistic, for example, log of relative risk, or log-odds-ratio $\log\bigl(\hat{p}_1(1-\hat{p}_2)/((1-\hat{p}_1)\hat{p}_2)\bigr)$. Figure 5 shows a normal quantile plot for the bootstrap distribution of the log of relative risk. The distribution is much less skewed than is relative risk, but it is still noticeably skewed. Even with a log transformation, a $t$ interval would only be adequate for work where accuracy is not required. We discuss confidence intervals further in Section 6.8, Bootstrap Confidence Intervals.
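The two-sample resampling and the rough noncoverage calculation can be sketched as follows, using the counts given in the text. Two choices here are ours: each group is resampled by drawing the number of deaths directly (equivalent to resampling the 0/1 outcomes with replacement), and $B = 1000$ is used instead of $10^4$ to keep the run short, so the simulated standard error will differ somewhat from the 0.6188 quoted above.

```python
import random
import statistics
from statistics import NormalDist

d1, n1 = 55, 3338   # deaths / subjects, high blood pressure
d2, n2 = 21, 2676   # deaths / subjects, low blood pressure
rr_hat = (d1 / n1) / (d2 / n2)

def resample_deaths(rng, deaths, n):
    """Number of deaths in a resample of n subjects drawn with replacement
    from a group in which `deaths` of n subjects died."""
    p = deaths / n
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(0)
B = 1000
reps = []
for _ in range(B):
    p1_star = resample_deaths(rng, d1, n1) / n1
    p2_star = resample_deaths(rng, d2, n2) / n2
    if p2_star > 0:              # relative risk undefined when p2* = 0
        reps.append(p1_star / p2_star)

boot_se = statistics.stdev(reps)
boot_bias = statistics.fmean(reps) - rr_hat
print(f"RR = {rr_hat:.3f}, bootstrap SE = {boot_se:.3f}, bias = {boot_bias:.3f}")

# Rough one-sided noncoverage when the bias is 0.17 standard errors.
noncoverage = NormalDist().cdf(0.17 - 1.96)
print(f"noncoverage = {noncoverage:.4f}, ratio to 0.025 = {noncoverage / 0.025:.2f}")
```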

6.5.2

Linear Regression

The next examples for linear regression are based on a data set from a large pharmaceutical company. The response variable is a pharmacokinetic parameter of interest, and candidate predictors are weight, sex, age, and dose (3 levels: 200, 400, and 800). In all, 300 observations are provided, one per subject. Our primary interest in this data set will be to use the bootstrap to investigate the behavior of stepwise regression; however, first we consider some other issues. A standard linear regression using main effects is shown in Table 1.

Table 1: Standard Linear Regression

              Value      Std. Error   t Value    Pr(>|t|)
(Intercept)   32.0819    4.2053        7.6290    0.0000
Wgt            0.2394    0.0372        6.4353    0.0000
Sex           -7.2192    1.2306       -5.8666    0.0000
Age           -0.1507    0.0367       -4.1120    0.0000
Dose           0.0003    0.0018        0.1695    0.8653

The left panel of Figure 6 contains a scatterplot of clearance versus weight, for the 25 females who received dose = 400, as well as regression lines from 30 bootstrap samples. This graph is useful for giving a rough idea of variability. A bootstrap percentile confidence interval for mean clearance given weight would be the range of the middle 95% of heights of regression lines at a given weight. The right panel shows all 300 observations and predictions for the clearance/weight relationship using (1) all 300 observations, (2) the main-effects model, and (3) predictions for the "base case," for females who received dose = 400. In effect, this graph uses the full data set to improve predictions for a subset, "borrowing strength." Much less variability is observed than in the left panel, primarily because of the larger sample size and also because the addition of an important covariate (age) to the model reduces residual variance. Note that the $y$ values here are the actual data; they are not adjusted for differences between the actual sex and age and the base case. Adjusting the male observations would raise these values. Adjusting both would make the apparent residual variation in the plot smaller to match the residual variance from the regression.

Figure 5: Bootstrap distribution for log of relative risk. [Normal quantile plot; horizontal axis: Quantiles of Standard Normal.]

Figure 6: Bootstrap regression lines. Left panel: 25 females receiving dose = 400. Right panel: all observations, predictions for females receiving dose = 400. [Horizontal axis in both panels: Weight.]

Prediction Intervals and Nonnormality. The right panel also hints at the difference between a confidence interval (for mean response given covariates) and a prediction interval (for a new observation). With large $n$, the regression lines show little variation, but the variation of an individual point above and below the (true) line remains constant regardless of $n$. Hence as $n$ increases, confidence intervals become narrow but prediction intervals do not. This is reflected in the standard formulae for confidence intervals:

$$\hat{y} \pm t\,\hat\sigma \sqrt{1/n + (x - \bar{x})^2/S_{xx}} \qquad (3)$$

and prediction intervals in the simple linear regression case:

$$\hat{y} \pm t\,\hat\sigma \sqrt{1 + 1/n + (x - \bar{x})^2/S_{xx}} \qquad (4)$$
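The different large-sample behavior of Equations (3) and (4) can be seen by evaluating the two half-widths for increasing $n$. This sketch makes simplifying assumptions that are ours, not the text's: a normal quantile replaces the $t$ quantile, $\hat\sigma$ is held fixed at 1, and $S_{xx}$ is assumed to grow in proportion to $n$.

```python
import math
from statistics import NormalDist

def half_widths(n, x, xbar, sxx_per_obs, sigma_hat, level=0.95):
    """Half-widths of the confidence interval (3) and prediction interval (4),
    using a normal quantile in place of the t quantile (an approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    sxx = sxx_per_obs * n                      # assumes S_xx grows like n
    core = 1.0 / n + (x - xbar) ** 2 / sxx
    ci = z * sigma_hat * math.sqrt(core)
    pi = z * sigma_hat * math.sqrt(1.0 + core)
    return ci, pi

for n in (10, 100, 10_000, 1_000_000):
    ci, pi = half_widths(n, x=1.0, xbar=0.0, sxx_per_obs=1.0, sigma_hat=1.0)
    print(f"n = {n:>9}: CI half-width = {ci:.4f}, PI half-width = {pi:.4f}")
```

The confidence interval half-width shrinks toward zero, while the prediction interval half-width levels off near $z\hat\sigma$.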

As $n \to \infty$ the terms inside the square root decrease to 0 for a confidence interval but approach 1 for a prediction interval; the prediction interval approaches $\hat{y} \pm z\sigma$. Now, suppose that residuals are not normally distributed. Asymptotically and for reasonably large $n$ the confidence intervals are approximately correct, but prediction intervals are not: the interval $\hat{y} \pm z\sigma$ is only correct for normally distributed data. Prediction intervals should approach $(\hat{y} + \hat{F}^{-1}(\alpha/2),\ \hat{y} + \hat{F}^{-1}(1 - \alpha/2))$ as $n \to \infty$, where $\hat{F}$ is the estimated residual distribution. In other words, no central limit theorem exists for prediction intervals. The outcome for a new observation depends primarily on a single random value, not an average across a large sample. Equation (4) should only be used after confirming that the residual distribution is approximately normal. And, in the opinion of this author, Equation (4) should not be taught in introductory statistics to students ill-equipped to understand that it should only be used if residuals are normally distributed. A bootstrap approach that takes into account both the shape of the residual distribution and the variability in regression lines is outlined in Prediction Intervals below.

Stepwise Regression. Now, consider the case of stepwise regression. We consider models that range from the intercept-only model to a full second-order model that includes all main effects, all interactions, and quadratic functions of dose, age, and weight. We use forward and backward stepwise regression, with terms added or subtracted to minimize the $C_p$ statistic, using the step function of S-PLUS (Insightful Corp., Seattle, WA). The resulting coefficients and inferences are shown in Table 2. The sex coefficient is retained even though it has small t value because main

Figure 7: Bootstrap distribution for dose and sex coefficients in stepwise regression. [Horizontal axes: Coef.dose; Coef.sex.]

Table 2: Coefficients and Inferences

              Value      Std. Error   t Value    Pr(>|t|)
(Intercept)   12.8035    14.1188       0.9068    0.3637
Wgt            0.6278     0.1689       3.7181    0.0002
Sex            9.2008     7.1634       1.2844    0.1980
Age           -0.6583     0.2389      -2.7553    0.0055
I(age^2)       0.0052     0.0024       2.1670    0.0294
Wgt:sex       -0.2077     0.0910      -2.2814    0.0218

effects are included before interactions. We use the bootstrap here to check model stability, obtain standard errors, and check for bias.

6.6

Model Stability

The stepwise procedure selected a six-term model. We may use the bootstrap to check the stability of the procedure under random sampling (does it consistently select the same model, or is there substantial variation?) and to observe which terms are consistently included. Here, we create bootstrap samples by resampling subjects (whole rows of the data) with replacement. We sample whole rows instead of individual values to preserve covariances between variables.

In 1000 bootstrap samples, only 95 resulted in the same model as for the original data; on average, 3.2 terms differed between the original model and the bootstrap models. The original model has 6 terms; the bootstrap models ranged from 4 to 12 terms, with an average of 7.9, which is 1.9 more than for the original data. This result suggests that stepwise regression tends to select more terms for random data than for the corresponding population, which in turn suggests that the original six-term model may also be overfitted.

Figure 7 shows the bootstrap distributions for two coefficients: dose and sex. The dose coefficient is usually zero, although it may be positive or negative. This graph suggests that dose is not very important in determining clearance. The sex coefficient is bimodal, with the modes on opposite sides of zero. It turns out that the sex coefficient is usually negative when the weight-sex interaction is included; otherwise it is positive. Overall, the bootstrap suggests that the original model is not very stable.

For comparison, repeating the experiment with a more stringent criterion for variable inclusion (a modified Cp statistic with double the penalty) results in a more stable model. Applied to the original data, the stricter criterion selects the same six terms. In all, 154 bootstrap samples yielded the same model, and on average the number of different terms was 2.15. The average number of terms was 5.93, which is slightly less than for the original data; this number suggests that stepwise regression may now be slightly underfitting.
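The stability check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure applied to the clearance data: `select_terms` is a hypothetical stand-in that keeps predictors whose absolute correlation with the response exceeds a threshold, whereas the text uses stepwise regression with a Cp criterion. The function names, the threshold, and the row layout (predictors followed by the response) are all assumptions made for the example.

```python
import random
from collections import Counter


def correlation(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def select_terms(rows, names, threshold=0.3):
    """Hypothetical stand-in for a model selection procedure.

    Each row holds the predictors followed by the response. A predictor is
    kept when its absolute correlation with the response exceeds `threshold`.
    """
    response = [row[-1] for row in rows]
    return frozenset(
        name
        for j, name in enumerate(names)
        if abs(correlation([row[j] for row in rows], response)) > threshold
    )


def selection_stability(rows, names, n_boot=1000, seed=0):
    """Resample whole rows with replacement and rerun the selection each time."""
    rng = random.Random(seed)
    n = len(rows)
    original = select_terms(rows, names)
    same_model = 0
    inclusion = Counter()
    sizes, differing = [], []
    for _ in range(n_boot):
        # Whole rows are drawn together, preserving covariances between variables.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        model = select_terms(sample, names)
        same_model += model == original
        inclusion.update(model)
        sizes.append(len(model))
        # Symmetric difference counts terms present in one model but not the other.
        differing.append(len(model ^ original))
    return {
        "original": original,
        "same_model": same_model,
        "inclusion": dict(inclusion),
        "mean_size": sum(sizes) / n_boot,
        "mean_differing": sum(differing) / n_boot,
    }
```

The returned summaries correspond to the quantities reported in the text: how many bootstrap samples reproduce the original model, how often each term is included, the average model size, and the average number of differing terms. Swapping in a real selection procedure only requires replacing `select_terms`.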

Standard Errors

At the end of the stepwise procedure, the table of coefficients, standard errors, and t values was calculated, ignoring the variable selection process. In particular, the standard errors are calculated under the usual regression assumptions, and assuming that the model was fixed from the outset. Call these nominal standard errors. In each bootstrap sample, we recorded the coefficients and the nominal standard errors. For the main effects, the bootstrap standard errors (standard deviation of bootstrap coefficients) and the average of the nominal standard errors are shown in Table 3. The bootstrap standard errors are much larger than the nominal standard errors. This is not surprising: the bootstrap standard errors reflect additional variability because of model selection, such as the bimodal distribution for the sex coefficient. This is not to say that one should use the bootstrap standard errors here. At the end of the stepwise variable selection process, it is appropriate to condition on the model and do inferences accordingly. For example, a confidence interval for the sex coefficient should be conditional on the weight-sex interaction being included in the model. But it does suggest that the nominal standard errors may be optimistic. Indeed they are, even conditional on the model


Table 3: Bootstrap standard errors and average of the nominal standard errors

                  Intercept   Wgt      Sex      Age      Dose
Boot SE           27.9008     0.5122   9.9715   0.3464   0.0229
Avg. Nominal SE   14.0734     0.2022   5.4250   0.2137   0.0091

Figure 8: Bootstrap distributions for $R^2$ (panel "Rsquare") and residual standard deviation (panel "Sigma") in stepwise regression


terms, because the residual standard error is biased.

Bias

Figure 8 shows bootstrap distributions for $R^2$ (unadjusted) and residual standard deviation. Both show very large bias. The bias is not surprising: optimizing generally gives biased results. Consider ordinary linear regression, where unadjusted $R^2$ is biased. If it were calculated using the true $\beta$s instead of the estimated $\hat{\beta}$s, it would not be biased. Optimizing $\hat{\beta}$ to minimize residual squared error (and maximize $R^2$) makes unadjusted $R^2$ biased. In classic linear regression, with the model selected in advance, we commonly use adjusted $R^2$ to counteract the bias. Similarly, we use residual variance calculated using a divisor of $(n - p - 1)$ instead of $n$, where $p$ is the number of terms in the model. But in this case, it is not only the values of the coefficients that are optimized but also which terms are included in the model. This is not reflected in the usual formulae. As a result, the residual standard error obtained from the stepwise procedure is biased downward, even using a divisor of $(n - p - 1)$.

Bootstrapping Rows or Residuals

Two basic ways exist to bootstrap linear regression models: resampling rows or resampling residuals [3]. To resample residuals, we fit the initial model $\hat{y}_i = \hat{\beta}_0 + \sum_j \hat{\beta}_j x_{ij}$, calculate the residuals $r_i = y_i - \hat{y}_i$, then create new bootstrap samples as

$$y_i^* = \hat{y}_i + r_i^* \qquad (5)$$

for $i = 1, \ldots, n$, where $r_i^*$ is sampled with replacement from the observed residuals $\{r_1, \ldots, r_n\}$. We keep the original $x$ and $\hat{y}$ values fixed to create new bootstrap $y^*$ values.
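Equation (5) can be sketched for the simplest case of a single predictor. This is an illustrative sketch using only the Python standard library; the function names `fit_line`, `residual_bootstrap`, and `bootstrap_se` are hypothetical, and a model with several terms would need a general least-squares solver in place of `fit_line`.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_bootstrap(x, y, n_boot=1000, seed=0):
    """Return bootstrap (intercept, slope) pairs from resampled residuals."""
    rng = random.Random(seed)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    coefs = []
    for _ in range(n_boot):
        # y*_i = fitted_i + r*_i, with r*_i drawn with replacement; x stays fixed.
        y_star = [fi + rng.choice(residuals) for fi in fitted]
        coefs.append(fit_line(x, y_star))
    return coefs


def bootstrap_se(values):
    """Standard deviation of the bootstrap replicates."""
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / (len(values) - 1)) ** 0.5
```

For example, `bootstrap_se([slope for _, slope in residual_bootstrap(x, y)])` gives a bootstrap standard error for the slope.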

Resampling rows corresponds to a random effects sampling design, in which x and y are both obtained by random sampling from a joint population. Resampling residuals corresponds to a fixed effects model, in which the xs are fixed by the experimental design and the ys are obtained conditional on the xs. So at first glance, it would seem appropriate to resample rows when the original data collection has random xs. However, in classic statistics we commonly use inferences derived using the fixed effects model, even when the xs are actually random. We do inferences conditional on the observed x values. Similarly, in bootstrapping we may resample residuals even when the xs were originally random.

In practice, the difference matters most when there are factors with rare levels or interactions of factors with rare combinations. If resampling rows, it is possible that a bootstrap sample may have none of the level or combination, in which case the corresponding term cannot be estimated and the software may give an error. Or, what is worse, there may be one or two rows with the rare level, enough so the software would not crash but instead quietly give garbage answers that are imprecise because they are based on few observations. Hence, with factors with rare levels, or small samples more generally, it may be preferable to resample residuals.

Resampling residuals implicitly assumes that the residual distribution is the same for every x and that there is no heteroskedasticity. A variation on resampling residuals that allows heteroskedasticity is the wild bootstrap, which in its simplest form adds either plus or minus the original residual $r_i$ to each fitted value,

$$y_i^* = \hat{y}_i \pm r_i \qquad (6)$$

with equal probabilities. Hence the expected value of $y_i^*$ is $\hat{y}_i$ and the standard deviation is proportional to $|r_i|$. For more discussion see Reference 3. Other variations on resampling residuals exist, such as resampling studentized residuals or weighted error resampling for nonconstant variance [3].

Figure 9: Bootstrap curves for predicted kyphosis, for Age = 87 and Number = 4

Prediction Intervals

The idea of resampling residuals provides a way to obtain more accurate prediction intervals. To capture both variation in the estimated regression line and residual variation, we may resample both. Variation in the regression line may be obtained by resampling either residuals or rows to generate random $\hat{\beta}^*$ values and corresponding $\hat{y}^* = \hat{\beta}_0^* + \sum_j \hat{\beta}_j^* x_{0j}$ for predictions at $x_0$. Independently, we draw random residuals $r^*$ and add them to the $\hat{y}^*$. After repeating this many times, the range of the middle 95% of the $(\hat{y}^* + r^*)$ values gives a prediction interval. For more discussion and alternatives see Reference 3.
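The prediction-interval recipe can be sketched as follows for a single predictor. This is a minimal sketch under stated assumptions: the regression line is perturbed by resampling residuals (the text also allows resampling rows), the interval uses simple order statistics of the simulated values, and the names `fit_line` and `bootstrap_prediction_interval` are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bootstrap_prediction_interval(x, y, x0, n_boot=2000, level=0.95, seed=0):
    """Percentile prediction interval for a new observation at x0."""
    rng = random.Random(seed)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    draws = []
    for _ in range(n_boot):
        # Variation in the regression line: refit on residual-resampled data.
        y_star = [fi + rng.choice(residuals) for fi in fitted]
        b0_star, b1_star = fit_line(x, y_star)
        # Residual variation: an independent residual draw for the new point.
        draws.append(b0_star + b1_star * x0 + rng.choice(residuals))
    draws.sort()
    tail = (1.0 - level) / 2.0
    lower = draws[int(tail * n_boot)]
    upper = draws[min(n_boot - 1, int((1.0 - tail) * n_boot))]
    return lower, upper
```

The raw residuals are used here for simplicity; the variations mentioned in the text, such as studentized residuals, would change only how `residuals` is built.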

6.6.1 Logistic Regression

In logistic regression, it is straightforward to resample rows of the data, but resampling residuals fails: the y values must be either zero or one, but adding the residual from one observation to the prediction from another yields values anywhere between -1 and 2. Instead, we keep the xs fixed and generate y values from the estimated conditional distribution given x. Let $p_i$ be the predicted probability that $y_i = 1$ given $x_i$. Then

$$y_i^* = \begin{cases} 1 & \text{with probability } p_i \\ 0 & \text{with probability } 1 - p_i \end{cases} \qquad (7)$$
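The scheme above, which keeps the xs fixed and simulates each response from its fitted probability, can be sketched for a single predictor. This is an illustration using only the Python standard library, not the software used for the kyphosis analysis: `fit_logistic` is a hypothetical Newton-Raphson fit written for the example, and it simply stops iterating when the information matrix is nearly singular (as happens with separated data), so refits on such samples should not be trusted.

```python
import math
import random


def sigmoid(eta):
    """Logistic function, with the argument clipped to avoid overflow."""
    eta = max(-30.0, min(30.0, eta))
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logistic(x, y, max_iter=25, tol=1e-10):
    """Newton-Raphson fit of logit P(y = 1) = b0 + b1 * x for one predictor."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix entries, with weights p(1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break  # near-singular information matrix, e.g. separated data
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


def draw_bernoulli(p, rng):
    """Draw y*_i = 1 with probability p_i and 0 otherwise."""
    return [1 if rng.random() < pi else 0 for pi in p]


def logistic_parametric_bootstrap(x, y, n_boot=1000, seed=0):
    """Keep x fixed, simulate y* from the fitted probabilities, and refit."""
    rng = random.Random(seed)
    b0, b1 = fit_logistic(x, y)
    p_hat = [sigmoid(b0 + b1 * xi) for xi in x]
    return [fit_logistic(x, draw_bernoulli(p_hat, rng)) for _ in range(n_boot)]
```

Each element of the returned list is a bootstrap (intercept, slope) pair, from which standard errors or bootstrap prediction curves can be computed.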

The kyphosis data set [4] contains observations on 81 children who had corrective spinal surgery, on four variables: Kyphosis (a factor indicating whether a postoperative deformity is present), Age (in months), Number (of vertebrae involved in the operation), and Start (beginning of the range of vertebrae involved). A logistic regression using main effects gives the coefficients shown in Table 4, which suggest that Start is the most important predictor. The left panel of Figure 9 shows Kyphosis versus Start, together with the predicted curve for the base case with Age = 87 (the median) and Number = 4 (the median). This graph is a sunflower plot [5,6],


[Figure residue: two panels, each with horizontal axis "Quantiles of Standard Normal"; plot content not recoverable]