CLINICAL TRIALS
WILEY SERIES IN PROBABILITY AND STATISTICS Established by Walter A. Shewhart and Samuel S. Wilks
Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay

Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, Jozef L. Teugels

The Wiley Series in Probability and Statistics is well established and authoritative. It covers many topics of current research interest in both pure and applied statistics and probability theory. Written by leading statisticians and institutions, the titles span both state-of-the-art developments in the field and classical methods. Reflecting the wide range of current research in statistics, the series encompasses applied, methodological and theoretical statistics, ranging from applications and new techniques made possible by advances in computerized practice to rigorous treatment of theoretical approaches. This series provides essential and invaluable reading for all statisticians, whether in academia, industry, government, or research.

A complete list of titles in this series can be found at http://www.wiley.com/go/wsps
CLINICAL TRIALS
A Methodologic Perspective
Third Edition
STEVEN PIANTADOSI
This edition first published 2017 by John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

© 2017 John Wiley & Sons, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

Edition History
The right of Steven Piantadosi to be identified as the author of this work has been asserted in accordance with law.

Wiley Global Headquarters
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials, or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Library of Congress Cataloging-in-Publication Data
Names: Piantadosi, Steven.
Title: Clinical trials : a methodologic perspective / Steven Piantadosi.
Description: Third edition. | Hoboken, NJ : John Wiley & Sons, Inc., 2017. | Includes bibliographical references and index.
Identifiers: LCCN 2016053939 | ISBN 9781118959206 (cloth) | ISBN 9781118959220 (epub)
Subjects: LCSH: Clinical trials–Statistical methods.
Classification: LCC R853.C55 P53 2017 | DDC 610.72/4–dc23
LC record available at https://lccn.loc.gov/2016053939

Cover image: © axllll/GettyImages
Cover design by Wiley

Set in 10.25/12 pt TimesLtStd-Roman by Thomson Digital, Noida, India

10 9 8 7 6 5 4 3 2 1
CONTENTS
Preface to the Third Edition, xxv
About the Companion Websites, xxviii

1 Preliminaries, 1
1.1 Introduction, 1
1.2 Audiences, 2
1.3 Scope, 3
1.4 Other Sources of Knowledge, 5
1.5 Notation and Terminology, 6 1.5.1 Clinical Trial Terminology, 7 1.5.2 Drug Development Traditionally Recognizes Four Trial Design Types, 7 1.5.3 Descriptive Terminology Is Better, 8
1.6 Examples, Data, and Programs, 9
1.7 Summary, 9

2 Clinical Trials as Research, 10
2.1 Introduction, 10
2.2 Research, 13 2.2.1 What Is Research?, 13 2.2.2 Clinical Reasoning Is Based on the Case History, 14 2.2.3 Statistical Reasoning Emphasizes Inference Based on Designed Data Production, 16 2.2.4 Clinical and Statistical Reasoning Converge in Research, 17
2.3 Defining Clinical Trials, 19 2.3.1 Mixing of Clinical and Statistical Reasoning Is Recent, 19 2.3.2 Clinical Trials Are Rigorously Defined, 21 2.3.3 Theory and Data, 22 2.3.4 Experiments Can Be Misunderstood, 23 2.3.5 Clinical Trials and the Frankenstein Myth, 25 2.3.6 Cavia porcellus, 26 2.3.7 Clinical Trials as Science, 26 2.3.8 Trials and Statistical Methods Fit within a Spectrum of Clinical Research, 28
2.4 Practicalities of Usage, 29 2.4.1 Predicates for a Trial, 29 2.4.2 Trials Can Provide Confirmatory Evidence, 29 2.4.3 Clinical Trials Are Reliable Albeit Unwieldy and Messy, 30 2.4.4 Trials Are Difficult to Apply in Some Circumstances, 31 2.4.5 Randomized Studies Can Be Initiated Early, 32 2.4.6 What Can I Learn from n = 20?, 33
2.5 Nonexperimental Designs, 35 2.5.1 Other Methods Are Valid for Making Some Clinical Inferences, 35 2.5.2 Some Specific Nonexperimental Designs, 38 2.5.3 Causal Relationships, 40 2.5.4 Will Genetic Determinism Replace Design?, 41
2.6 Summary, 41
2.7 Questions for Discussion, 41
3 Why Clinical Trials Are Ethical, 43
3.1 Introduction, 43 3.1.1 Science and Ethics Share Objectives, 44 3.1.2 Equipoise and Uncertainty, 46
3.2 Duality, 47 3.2.1 Clinical Trials Sharpen, But Do Not Create, Duality, 47 3.2.2 A Gene Therapy Tragedy Illustrates Duality, 48 3.2.3 Research and Practice Are Convergent, 48 3.2.4 Hippocratic Tradition Does Not Proscribe Clinical Trials, 52 3.2.5 Physicians Always Have Multiple Roles, 54
3.3 Historically Derived Principles of Ethics, 57 3.3.1 Nuremberg Contributed an Awareness of the Worst Problems, 57 3.3.2 High-Profile Mistakes Were Made in the United States, 58 3.3.3 The Helsinki Declaration Was Widely Adopted, 58 3.3.4 Other International Guidelines Have Been Proposed, 61 3.3.5 Institutional Review Boards Provide Ethics Oversight, 62 3.3.6 Ethics Principles Relevant to Clinical Trials, 63
3.4 Contemporary Foundational Principles, 65 3.4.1 Collaborative Partnership, 66 3.4.2 Scientific Value, 66 3.4.3 Scientific Validity, 66 3.4.4 Fair Subject Selection, 67 3.4.5 Favorable Risk–Benefit, 67 3.4.6 Independent Review, 68 3.4.7 Informed Consent, 68 3.4.8 Respect for Subjects, 71
3.5 Methodologic Reflections, 72 3.5.1 Practice Based on Unproven Treatments Is Not Ethical, 72 3.5.2 Ethics Considerations Are Important Determinants of Design, 74 3.5.3 Specific Methods Have Justification, 75
3.6 Professional Conduct, 79 3.6.1 Advocacy, 79 3.6.2 Physician to Physician Communication Is Not Research, 81 3.6.3 Investigator Responsibilities, 82 3.6.4 Professional Ethics, 83
3.7 Summary, 85
3.8 Questions for Discussion, 86
4 Contexts for Clinical Trials, 87
4.1 Introduction, 87 4.1.1 Clinical Trial Registries, 88 4.1.2 Public Perception Versus Science, 90
4.2 Drugs, 91 4.2.1 Are Drugs Special?, 92 4.2.2 Why Trials Are Used Extensively for Drugs, 93
4.3 Devices, 95 4.3.1 Use of Trials for Medical Devices, 95 4.3.2 Are Devices Different from Drugs?, 97 4.3.3 Case Study, 98
4.4 Prevention, 99 4.4.1 The Prevention versus Therapy Dichotomy Is Over-worked, 100 4.4.2 Vaccines and Biologicals, 101 4.4.3 Ebola 2014 and Beyond, 102 4.4.4 A Perspective on Risk–Benefit, 103 4.4.5 Methodology and Framework for Prevention Trials, 105
4.5 Complementary and Alternative Medicine, 106 4.5.1 Science Is the Study of Natural Phenomena, 108 4.5.2 Ignorance Is Important, 109 4.5.3 The Essential Paradox of CAM and Clinical Trials, 110 4.5.4 Why Trials Have Not Been Used Extensively in CAM, 111 4.5.5 Some Principles for Rigorous Evaluation, 113 4.5.6 Historic Examples, 115
4.6 Surgery and Skill-Dependent Therapies, 116 4.6.1 Why Trials Have Been Used Less Extensively in Surgery, 118 4.6.2 Reasons Why Some Surgical Therapies Require Less Rigorous Study Designs, 120 4.6.3 Sources of Variation, 121 4.6.4 Difficulties of Inference, 121 4.6.5 Control of Observer Bias Is Possible, 122 4.6.6 Illustrations from an Emphysema Surgery Trial, 124
4.7 A Brief View of Some Other Contexts, 130 4.7.1 Screening Trials, 130 4.7.2 Diagnostic Trials, 134 4.7.3 Radiation Therapy, 134
4.8 Summary, 135
4.9 Questions for Discussion, 136
5 Measurement, 137
5.1 Introduction, 137 5.1.1 Types of Uncertainty, 138
5.2 Objectives, 140 5.2.1 Estimation Is The Most Common Objective, 141 5.2.2 Selection Can Also Be an Objective, 141 5.2.3 Objectives Require Various Scales of Measurement, 142
5.3 Measurement Design, 143 5.3.1 Mixed Outcomes and Predictors, 143 5.3.2 Criteria for Evaluating Outcomes, 144 5.3.3 Prefer Hard or Objective Outcomes, 145 5.3.4 Outcomes Can Be Quantitative or Qualitative, 146 5.3.5 Measures Are Useful and Efficient Outcomes, 146 5.3.6 Some Outcomes Are Summarized as Counts, 147 5.3.7 Ordered Categories Are Commonly Used for Severity or Toxicity, 147 5.3.8 Unordered Categories Are Sometimes Used, 148 5.3.9 Dichotomies Are Simple Summaries, 148 5.3.10 Measures of Risk, 149 5.3.11 Primary and Others, 153 5.3.12 Composites, 154 5.3.13 Event Times and Censoring, 155 5.3.14 Longitudinal Measures, 160 5.3.15 Central Review, 161 5.3.16 Patient Reported Outcomes, 161
5.4 Surrogate Outcomes, 162 5.4.1 Surrogate Outcomes Are Disease-Specific, 164 5.4.2 Surrogate Outcomes Can Make Trials More Efficient, 167 5.4.3 Surrogate Outcomes Have Significant Limitations, 168
5.5 Summary, 170
5.6 Questions for Discussion, 171

6 Random Error and Bias, 172
6.1 Introduction, 172 6.1.1 The Effects of Random and Systematic Errors Are Distinct, 173 6.1.2 Hypothesis Tests versus Significance Tests, 174 6.1.3 Hypothesis Tests Are Subject to Two Types of Random Error, 175 6.1.4 Type I Errors Are Relatively Easy to Control, 176 6.1.5 The Properties of Confidence Intervals Are Similar to Hypothesis Tests, 176 6.1.6 Using a One- or Two-Sided Hypothesis Test Is Not the Right Question, 177 6.1.7 P-Values Quantify the Type I Error, 178 6.1.8 Type II Errors Depend on the Clinical Difference of Interest, 178 6.1.9 Post Hoc Power Calculations Are Useless, 180
6.2 Clinical Bias, 181 6.2.1 Relative Size of Random Error and Bias Is Important, 182 6.2.2 Bias Arises from Numerous Sources, 182 6.2.3 Controlling Structural Bias Is Conceptually Simple, 185
6.3 Statistical Bias, 188 6.3.1 Selection Bias, 188 6.3.2 Some Statistical Bias Can Be Corrected, 192 6.3.3 Unbiasedness Is Not the Only Desirable Attribute of an Estimator, 192
6.4 Summary, 194
6.5 Questions for Discussion, 194
7 Statistical Perspectives, 196
7.1 Introduction, 196
7.2 Differences in Statistical Perspectives, 197 7.2.1 Models and Parameters, 197 7.2.2 Philosophy of Inference Divides Statisticians, 198 7.2.3 Resolution, 199 7.2.4 Points of Agreement, 199
7.3 Frequentist, 202 7.3.1 Binomial Case Study, 203 7.3.2 Other Issues, 204
7.4 Bayesian, 204 7.4.1 Choice of a Prior Distribution Is a Source of Contention, 205 7.4.2 Binomial Case Study, 206 7.4.3 Bayesian Inference Is Different, 209
7.5 Likelihood, 210 7.5.1 Binomial Case Study, 211 7.5.2 Likelihood-Based Design, 211
7.6 Statistics Issues, 212 7.6.1 Perspective, 212 7.6.2 Statistical Procedures Are Not Standardized, 213 7.6.3 Practical Controversies Related to Statistics Exist, 214
7.7 Summary, 215
7.8 Questions for Discussion, 216
8 Experiment Design in Clinical Trials, 217
8.1 Introduction, 217
8.2 Trials As Simple Experiment Designs, 218 8.2.1 Design Space Is Chaotic, 219 8.2.2 Design Is Critical for Inference, 220 8.2.3 The Question Drives the Design, 220 8.2.4 Design Depends on the Observation Model As Well As the Biological Question, 221 8.2.5 Comparing Designs, 222
8.3 Goals of Experiment Design, 223 8.3.1 Control of Random Error and Bias Is the Goal, 223 8.3.2 Conceptual Simplicity Is Also a Goal, 223 8.3.3 Encapsulation of Subjectivity, 224 8.3.4 Leech Case Study, 225
8.4 Design Concepts, 225 8.4.1 The Foundations of Design Are Observation and Theory, 226 8.4.2 A Lesson from the Women's Health Initiative, 227 8.4.3 Experiments Use Three Components of Design, 229
8.5 Design Features, 230 8.5.1 Enrichment, 231 8.5.2 Replication, 232 8.5.3 Experimental and Observational Units, 232 8.5.4 Treatments and Factors, 233 8.5.5 Nesting, 233 8.5.6 Randomization, 234 8.5.7 Blocking, 234 8.5.8 Stratification, 235 8.5.9 Masking, 236
8.6 Special Design Issues, 237 8.6.1 Placebos, 237 8.6.2 Equivalence and Noninferiority, 240 8.6.3 Randomized Discontinuation, 241 8.6.4 Hybrid Designs May Be Needed for Resolving Special Questions, 242 8.6.5 Clinical Trials Cannot Meet Certain Objectives, 242
8.7 Importance of the Protocol Document, 244 8.7.1 Protocols Have Many Functions, 244 8.7.2 Deviations from Protocol Specifications are Common, 245 8.7.3 Protocols Are Structured, Logical, and Complete, 246
8.8 Summary, 252
8.9 Questions for Discussion, 253
9 The Trial Cohort, 254
9.1 Introduction, 254
9.2 Cohort Definition and Selection, 255 9.2.1 Eligibility and Exclusions, 255 9.2.2 Active Sampling and Enrichment, 257 9.2.3 Participation may select subjects with better prognosis, 258 9.2.4 Quantitative Selection Criteria Versus False Precision, 262 9.2.5 Comparative Trials Are Not Sensitive to Selection, 263
9.3 Modeling Accrual, 264 9.3.1 Using a Run-In Period, 264 9.3.2 Estimate Accrual Quantitatively, 265
9.4 Inclusiveness, Representation, and Interactions, 267 9.4.1 Inclusiveness Is a Worthy Goal, 267 9.4.2 Barriers Can Hinder Trial Participation, 268 9.4.3 Efficacy versus Effectiveness Trials, 269 9.4.4 Representation: Politics Blunders into Science, 270
9.5 Summary, 275
9.6 Questions for Discussion, 275

10 Development Paradigms, 277
10.1 Introduction, 277 10.1.1 Stages of Development, 278 10.1.2 Trial Design versus Development Design, 280 10.1.3 Companion Diagnostics in Cancer, 281 10.2 Pipeline Principles and Problems, 281 10.2.1 The Paradigm Is Not Linear, 282 10.2.2 Staging Allows Efficiency, 282 10.2.3 The Pipeline Impacts Study Design, 283 10.2.4 Specificity and Pressures Shape the Pipeline, 283 10.2.5 Problems with Trials, 284 10.2.6 Problems in the Pipeline, 286 10.3 A Simple Quantitative Pipeline, 286 10.3.1 Pipeline Operating Characteristics Can Be Derived, 286 10.3.2 Implications May Be Counterintuitive, 288 10.3.3 Optimization Yields Insights, 288 10.3.4 Overall Implications for the Pipeline, 291 10.4 Late Failures, 292 10.4.1 Generic Mistakes in Evaluating Evidence, 293 10.4.2 “Safety” Begets Efficacy Testing, 293 10.4.3 Pressure to Advance Ideas Is Unprecedented, 294 10.4.4 Scientists Believe Weird Things, 294 10.4.5 Confirmation Bias, 295 10.4.6 Many Biological Endpoints Are Neither Predictive nor Prognostic, 296 10.4.7 Disbelief Is Easier to Suspend Than Belief, 296 10.4.8 Publication Bias, 297 10.4.9 Intellectual Conflicts of Interest, 297 10.4.10 Many Preclinical Models Are Invalid, 298 10.4.11 Variation Despite Genomic Determinism, 299 10.4.12 Weak Evidence Is Likely to Mislead, 300 10.5 Summary, 300 10.6 Questions for Discussion, 301 11 Translational Clinical Trials 11.1 Introduction, 302 11.1.1 Therapeutic Intent or Not?, 303 11.1.2 Mechanistic Trials, 304 11.1.3 Marker Threshold Designs Are Strongly Biased, 305
11.2 Inferential Paradigms, 308 11.2.1 Biologic Paradigm, 308 11.2.2 Clinical Paradigm, 310 11.2.3 Surrogate Paradigm, 311 11.3 Evidence and Theory, 312 11.3.1 Biological Models Are a Key to Translational Trials, 313 11.4 Translational Trials Defined, 313 11.4.1 Translational Paradigm, 313 11.4.2 Character and Definition, 315 11.4.3 Small or “Pilot” Does Not Mean Translational, 316 11.4.4 Hypothetical Example, 316 11.4.5 Nesting Translational Studies, 317 11.5 Information From Translational Trials, 317 11.5.1 Surprise Can Be Defined Mathematically, 318 11.5.2 Parameter Uncertainty Versus Outcome Uncertainty, 318 11.5.3 Expected Surprise and Entropy, 319 11.5.4 Information/Entropy Calculated From Small Samples Is Biased, 321 11.5.5 Variance of Information/Entropy, 322 11.5.6 Sample Size for Translational Trials, 324 11.5.7 Validity, 327 11.6 Summary, 328 11.7 Questions for Discussion, 328 12 Early Development and Dose-Finding 12.1 Introduction, 329 12.2 Basic Concepts, 330 12.2.1 Therapeutic Intent, 330 12.2.2 Feasibility, 331 12.2.3 Dose versus Efficacy, 332 12.3 Essential Concepts for Dose versus Risk, 333 12.3.1 What Does the Terminology Mean?, 333 12.3.2 Distinguish Dose–Risk From Dose–Efficacy, 334 12.3.3 Dose Optimality Is a Design Definition, 335 12.3.4 Unavoidable Subjectivity, 335 12.3.5 Sample Size Is an Outcome of Dose-Finding Studies, 336 12.3.6 Idealized Dose-Finding Design, 336 12.4 Dose-Ranging, 338 12.4.1 Some Historical Designs, 338 12.4.2 Typical Dose-Ranging Design, 339
12.4.3 Operating Characteristics Can Be Calculated, 340 12.4.4 Modifications, Strengths, and Weaknesses, 343
12.5 Dose-Finding Is Model Based, 344 12.5.1 Mathematical Models Facilitate Inferences, 345 12.5.2 Continual Reassessment Method, 345 12.5.3 Pharmacokinetic Measurements Might Be Used to Improve CRM Dose Escalations, 349 12.5.4 The CRM Is an Attractive Design to Criticize, 350 12.5.5 CRM Clinical Examples, 350 12.5.6 Dose Distributions, 351 12.5.7 Estimation with Overdose Control (EWOC), 351 12.5.8 Randomization in Early Development?, 353 12.5.9 Phase I Data Have Other Uses, 353
12.6 General Dose-Finding Issues, 354 12.6.1 The General Dose-Finding Problem Is Unsolved, 354 12.6.2 More than One Drug, 356 12.6.3 More than One Outcome, 361 12.6.4 Envelope Simulation, 363
12.7 Summary, 366
12.8 Questions for Discussion, 368
13 Middle Development 13.1 Introduction, 370 13.1.1 Estimate Treatment Effects, 371 13.2 Characteristics of Middle Development, 372 13.2.1 Constraints, 373 13.2.2 Outcomes, 374 13.2.3 Focus, 375 13.3 Design Issues, 375 13.3.1 Choices in Middle Development, 375 13.3.2 When to Skip Middle Development, 376 13.3.3 Randomization, 377 13.3.4 Other Design Issues, 378 13.4 Middle Development Distills True Positives, 379 13.5 Futility and Nonsuperiority Designs, 381 13.5.1 Asymmetry in Error Control, 382 13.5.2 Should We Control False Positives or False Negatives?, 383 13.5.3 Futility Design Example, 384 13.5.4 A Conventional Approach to Futility, 385 13.6 Dose–Efficacy Questions, 385
13.7 Randomized Comparisons, 386 13.7.1 When to Perform an Error-Prone Comparative Trial, 387 13.7.2 Examples, 388 13.7.3 Randomized Selection, 389 13.8 Cohort Mixtures, 392 13.9 Summary, 395 13.10 Questions for Discussion, 396 14 Comparative Trials
14.1 Introduction, 397 14.2 Elements of Reliability, 398 14.2.1 Key Features, 399 14.2.2 Flexibilities, 400 14.2.3 Other Design Issues, 400 14.3 Biomarker-Based Comparative Designs, 402 14.3.1 Biomarkers Are Diverse, 402 14.3.2 Enrichment, 404 14.3.3 Biomarker-Stratified, 404 14.3.4 Biomarker-Strategy, 405 14.3.5 Multiple-Biomarker Signal-Finding, 406 14.3.6 Prospective–Retrospective Evaluation of a Biomarker, 407 14.3.7 Master Protocols, 407 14.4 Some Special Comparative Designs, 408 14.4.1 Randomized Discontinuation, 408 14.4.2 Delayed Start, 409 14.4.3 Cluster Randomization, 410 14.4.4 Non Inferiority, 410 14.4.5 Multiple Agents versus Control, 410 14.5 Summary, 411 14.6 Questions for Discussion, 412 15 Adaptive Design Features 15.1 Introduction, 413 15.1.1 Advantages and Disadvantages of AD, 414 15.1.2 Design Adaptations Are Tools, Not a Class, 416 15.1.3 Perspective on Bayesian Methods, 417 15.1.4 The Pipeline Is the Main Adaptive Tool, 417 15.2 Some Familiar Adaptations, 418 15.2.1 Dose-Finding Is Adaptive, 418 15.2.2 Adaptive Randomization, 418 15.2.3 Staging is Adaptive, 422
15.2.4 Dropping a Treatment Arm or Subset, 423
15.3 Biomarker Adaptive Trials, 423
15.4 Re-Designs, 425 15.4.1 Sample Size Re-Estimation Requires Caution, 425
15.5 Seamless Designs, 427
15.6 Barriers to the Use of AD, 428
15.7 Adaptive Design Case Study, 428
15.8 Summary, 429
15.9 Questions for Discussion, 429
16 Sample Size and Power 16.1 Introduction, 430 16.2 Principles, 431 16.2.1 What is Precision?, 432 16.2.2 What is Power?, 433 16.2.3 What is Evidence?, 434 16.2.4 Sample Size and Power Calculations are Approximations, 435 16.2.5 The Relationship between Power/Precision and Sample Size is Quadratic, 435 16.3 Early Developmental Trials, 436 16.3.1 Translational Trials, 436 16.3.2 Dose-Finding Trials, 437 16.4 Simple Estimation Designs, 438 16.4.1 Confidence Intervals for a Mean Provide a Sample Size Approach, 438 16.4.2 Estimating Proportions Accurately, 440 16.4.3 Exact Binomial Confidence Limits are Helpful, 441 16.4.4 Precision Helps Detect Improvement, 444 16.4.5 Bayesian Binomial Confidence Intervals, 446 16.4.6 A Bayesian Approach Can Use Prior Information, 447 16.4.7 Likelihood-Based Approach for Proportions, 450 16.5 Event Rates, 451 16.5.1 Confidence Intervals for Event Rates Can Determine Sample Size, 451 16.5.2 Likelihood-Based Approach for Event Rates, 454 16.6 Staged Studies, 455 16.6.1 Ineffective or Unsafe Treatments Should Be Discarded Early, 455 16.6.2 Two-Stage Designs Increase Efficiency, 456 16.7 Comparative Trials, 457 16.7.1 How to Choose Type I and II Error Rates?, 459 16.7.2 Comparisons Using the t-Test Are a Good Learning Example, 459
16.7.3 Likelihood-Based Approach, 462 16.7.4 Dichotomous Responses Are More Complex, 463 16.7.5 Hazard Comparisons Yield Similar Equations, 464 16.7.6 Parametric and Nonparametric Equations Are Connected, 467 16.7.7 Accommodating Unbalanced Treatment Assignments, 467 16.7.8 A Simple Accrual Model Can Also Be Incorporated, 469 16.7.9 Stratification, 471 16.7.10 Noninferiority, 472 16.8 Expanded Safety Trials, 478 16.8.1 Model Rare Events with the Poisson Distribution, 479 16.8.2 Likelihood Approach for Poisson Rates, 479 16.9 Other Considerations, 481 16.9.1 Cluster Randomization Requires Increased Sample Size, 481 16.9.2 Simple Cost Optimization, 482 16.9.3 Increase the Sample Size for Nonadherence, 482 16.9.4 Simulated Lifetables Can Be a Simple Design Tool, 485 16.9.5 Sample Size for Prognostic Factor Studies, 486 16.9.6 Computer Programs Simplify Calculations, 487 16.9.7 Simulation Is a Powerful and Flexible Design Alternative, 487 16.9.8 Power Curves Are Sigmoid Shaped, 488 16.10 Summary, 489 16.11 Questions for Discussion, 490 17 Treatment Allocation 17.1 Introduction, 492 17.1.1 Balance and Bias Are Independent, 493 17.2 Randomization, 494 17.2.1 Heuristic Proof of the Value of Randomization, 495 17.2.2 Control the Influence of Unknown Factors, 497 17.2.3 Haphazard Assignments Are Not Random, 498 17.2.4 Simple Randomization Can Yield Imbalances, 499 17.3 Constrained Randomization, 500 17.3.1 Blocking Improves Balance, 500 17.3.2 Blocking and Stratifying Balances Prognostic Factors, 501 17.3.3 Other Considerations Regarding Blocking, 503 17.4 Adaptive Allocation, 504 17.4.1 Urn Designs Also Improve Balance, 504 17.4.2 Minimization Yields Tight Balance, 504 17.4.3 Play the Winner, 505 17.5 Other Issues Regarding Randomization, 507
17.5.1 Administration of the Randomization, 507 17.5.2 Computers Generate Pseudorandom Numbers, 508 17.5.3 Randomized Treatment Assignment Justifies Type I Errors, 509
17.6 Unequal Treatment Allocation, 514 17.6.1 Subsets May Be of Interest, 514 17.6.2 Treatments May Differ Greatly in Cost, 515 17.6.3 Variances May Be Different, 515 17.6.4 Multiarm Trials May Require Asymmetric Allocation, 516 17.6.5 Generalization, 517 17.6.6 Failed Randomization?, 518
17.7 Randomization Before Consent, 519
17.8 Summary, 520
17.9 Questions for Discussion, 520
18 Treatment Effects Monitoring 18.1 Introduction, 522 18.1.1 Motives for Monitoring, 523 18.1.2 Components of Responsible Monitoring, 524 18.1.3 Trials Can Be Stopped for a Variety of Reasons, 524 18.1.4 There Is Tension in the Decision to Stop, 526 18.2 Administrative Issues in Trial Monitoring, 527 18.2.1 Monitoring of Single-Center Studies Relies on Periodic Investigator Reporting, 527 18.2.2 Composition and Organization of the TEMC, 528 18.2.3 Complete Objectivity Is Not Ethical, 535 18.2.4 Independent Experts in Monitoring, 537 18.3 Organizational Issues Related to Monitoring, 537 18.3.1 Initial TEMC Meeting, 538 18.3.2 The TEMC Assesses Baseline Comparability, 538 18.3.3 The TEMC Reviews Accrual and Expected Time to Study Completion, 539 18.3.4 Timeliness of Data and Reporting Lags, 539 18.3.5 Data Quality Is a Major Focus of the TEMC, 540 18.3.6 The TEMC Reviews Safety and Toxicity Data, 541 18.3.7 Efficacy Differences Are Assessed by the TEMC, 541 18.3.8 The TEMC Should Address Some Practical Questions Specifically, 541 18.3.9 The TEMC Mechanism Has Potential Weaknesses, 544 18.4 Statistical Methods for Monitoring, 545 18.4.1 There Are Several Aproaches to Evaluating Incomplete Evidence, 545
18.4.2 Monitoring Developmental Trials for Risk, 547 18.4.3 Likelihood-Based Methods, 551 18.4.4 Bayesian Methods, 557 18.4.5 Decision-Theoretic Methods, 559 18.4.6 Frequentist Methods, 560 18.4.7 Other Monitoring Tools, 566 18.4.8 Some Software, 570 18.5 Summary, 570 18.6 Questions for Discussion, 572 19 Counting Subjects and Events
19.1 Introduction, 573 19.2 Imperfection and Validity, 574 19.3 Treatment Nonadherence, 575 19.3.1 Intention to Treat Is a Policy of Inclusion, 575 19.3.2 Coronary Drug Project Results Illustrate the Pitfalls of Exclusions Based on Nonadherence, 576 19.3.3 Statistical Studies Support the ITT Approach, 577 19.3.4 Trials Are Tests of Treatment Policy, 577 19.3.5 ITT Analyses Cannot Always Be Applied, 578 19.3.6 Trial Inferences Depend on the Experiment Design, 579 19.4 Protocol Nonadherence, 580 19.4.1 Eligibility, 580 19.4.2 Treatment, 581 19.4.3 Defects in Retrospect, 582 19.5 Data Imperfections, 583 19.5.1 Evaluability Criteria Are a Methodologic Error, 583 19.5.2 Statistical Methods Can Cope with Some Types of Missing Data, 584 19.6 Summary, 588 19.7 Questions for Discussion, 589 20 Estimating Clinical Effects 20.1 Introduction, 590 20.1.1 Invisibility Works Against Validity, 591 20.1.2 Structure Aids Internal and External Validity, 591 20.1.3 Estimates of Risk Are Natural and Useful, 592 20.2 Dose-Finding and Pharmacokinetic Trials, 594 20.2.1 Pharmacokinetic Models Are Essential for Analyzing DF Trials, 594 20.2.2 A Two-Compartment Model Is Simple but Realistic, 595
20.2.3 PK Models Are Used By “Model Fitting”, 598 20.3 Middle Development Studies, 599 20.3.1 Mesothelioma Clinical Trial Example, 599 20.3.2 Summarize Risk for Dichotomous Factors, 600 20.3.3 Nonparametric Estimates of Survival Are Robust, 601 20.3.4 Parametric (Exponential) Summaries of Survival Are Efficient, 603 20.3.5 Percent Change and Waterfall Plots, 605 20.4 Randomized Comparative Trials, 606 20.4.1 Examples of Comparative Trials Used in This Section, 607 20.4.2 Continuous Measures Estimate Treatment Differences, 608 20.4.3 Baseline Measurements Can Increase Precision, 609 20.4.4 Comparing Counts, 610 20.4.5 Nonparametric Survival Comparisons, 612 20.4.6 Risk (Hazard) Ratios and Confidence Intervals Are Clinically Useful Data Summaries, 614 20.4.7 Statistical Models Are Necessary Tools, 615 20.5 Problems With P-Values, 616 20.5.1 P-Values Do Not Represent Treatment Effects, 618 20.5.2 P-Values Do Not Imply Reproducibility, 618 20.5.3 P-Values Do Not Measure Evidence, 619 20.6 Strength of Evidence Through Support Intervals, 620 20.6.1 Support Intervals Are Based on the Likelihood Function, 620 20.6.2 Support Intervals Can Be Used with Any Outcome, 621 20.7 Special Methods of Analysis, 622 20.7.1 The Bootstrap Is Based on Resampling, 623 20.7.2 Some Clinical Questions Require Other Special Methods of Analysis, 623 20.8 Exploratory Analyses, 628 20.8.1 Clinical Trial Data Lend Themselves to Exploratory Analyses, 628 20.8.2 Multiple Tests Multiply Type I Errors, 629 20.8.3 Kinds of Multiplicity, 630 20.8.4 Inevitible Risks from Subgroups, 630 20.8.5 Tale of a Subset Analysis Gone Wrong, 632 20.8.6 Perspective on Subgroup Analyses, 635 20.8.7 Effects The Trial Was Not Designed to Detect, 636 20.8.8 Safety Signals, 637 20.8.9 Subsets, 637 20.8.10 Interactions, 638 20.9 Summary, 639 20.10 Questions for Discussion, 640
21 Prognostic Factor Analyses, 644
21.1 Introduction, 644 21.1.1 Studying Prognostic Factors is Broadly Useful, 645 21.1.2 Prognostic Factors Can Be Constant or Time-Varying, 646 21.2 Model-Based Methods, 647 21.2.1 Models Combine Theory and Data, 647 21.2.2 Scale and Coding May Be Important, 648 21.2.3 Use Flexible Covariate Models, 648 21.2.4 Building Parsimonious Models Is the Next Step, 650 21.2.5 Incompletely Specified Models May Yield Biased Estimates, 655 21.2.6 Study Second-Order Effects (Interactions), 656 21.2.7 PFAs Can Help Describe Risk Groups, 656 21.2.8 Power and Sample Size for PFAs, 660 21.3 Adjusted Analyses of Comparative Trials, 661 21.3.1 What Should We Adjust For?, 662 21.3.2 What Can Happen?, 663 21.3.3 Brain Tumor Case Study, 664 21.4 PFAS Without Models, 666 21.4.1 Recursive Partitioning Uses Dichotomies, 666 21.4.2 Neural Networks Are Used for Pattern Recognition, 667 21.5 Summary, 669 21.6 Questions for Discussion, 669 22 Factorial Designs 22.1 Introduction, 671 22.2 Characteristics of Factorial Designs, 672 22.2.1 Interactions or Efficiency, But Not Both Simultaneously, 672 22.2.2 Factorial Designs Are Defined by Their Structure, 672 22.2.3 Factorial Designs Can Be Made Efficient, 674 22.3 Treatment Interactions, 675 22.3.1 Factorial Designs Are the Only Way to Study Interactions, 675 22.3.2 Interactions Depend on the Scale of Measurement, 677 22.3.3 The Interpretation of Main Effects Depends on Interactions, 677 22.3.4 Analyses Can Employ Linear Models, 678 22.4 Examples of Factorial Designs, 680 22.5 Partial, Fractional, and Incomplete Factorials, 682 22.5.1 Use Partial Factorial Designs When Interactions Are Absent, 682 22.5.2 Incomplete Designs Present Special Problems, 682 22.6 Summary, 683 22.7 Questions for Discussion, 683
23 Crossover Designs, 684
23.1 Introduction, 684 23.1.1 Other Ways of Giving Multiple Treatments Are Not Crossovers, 685 23.1.2 Treatment Periods May Be Randomly Assigned, 686 23.2 Advantages and Disadvantages, 686 23.2.1 Crossover Designs Can Increase Precision, 687 23.2.2 A Crossover Design Might Improve Recruitment, 687 23.2.3 Carryover Effects Are a Potential Problem, 688 23.2.4 Dropouts Have Strong Effects, 689 23.2.5 Analysis is More Complex Than for a Parallel-Group Design, 689 23.2.6 Prerequisites Are Needed to Apply Crossover Designs, 689 23.2.7 Other Uses for the Design, 690 23.3 Analysis, 691 23.3.1 Simple Approaches, 691 23.3.2 Analysis Can Be Based on a Cell Means Model, 692 23.3.3 Other Issues in Analysis, 696 23.4 Classic Case Study, 696 23.5 Summary, 696 23.6 Questions for Discussion, 697 24 Meta-Analyses
24.1 Introduction, 698 24.1.1 Meta-Analyses Formalize Synthesis and Increase Precision, 699 24.2 A Sketch of Meta-Analysis Methods, 700 24.2.1 Meta-Analysis Necessitates Prerequisites, 700 24.2.2 Many Studies Are Potentially Relevant, 701 24.2.3 Select Studies, 702 24.2.4 Plan the Statistical Analysis, 703 24.2.5 Summarize the Data Using Observed and Expected, 703 24.3 Other Issues, 705 24.3.1 Cumulative Meta-Analyses, 705 24.3.2 Meta-Analyses Have Practical and Theoretical Limitations, 706 24.3.3 Meta-Analysis Has Taught Useful Lessons, 707 24.4 Summary, 707 24.5 Questions for Discussion, 708 25 Reporting and Authorship 25.1 Introduction, 709
25.2 General Issues in Reporting, 710 25.2.1 Uniformity Improves Comprehension, 711 25.2.2 Quality of the Literature, 712 25.2.3 Peer Review Is the Only Game in Town, 712 25.2.4 Publication Bias Can Distort Impressions Based on the Literature, 713 25.3 Clinical Trial Reports, 715 25.3.1 General Considerations, 716 25.3.2 Employ a Complete Outline for Comparative Trial Reporting, 721 25.4 Authorship, 726 25.4.1 Inclusion and Ordering, 727 25.4.2 Responsibility of Authorship, 727 25.4.3 Authorship Models, 728 25.4.4 Some Other Practicalities, 730 25.5 Other Issues in Disseminating Results, 731 25.5.1 Open Access, 731 25.5.2 Clinical Alerts, 731 25.5.3 Retractions, 732 25.6 Summary, 732 25.7 Questions for Discussion, 733 26 Misconduct and Fraud in Clinical Research 26.1 Introduction, 734 26.1.1 Integrity and Accountability Are Critically Important, 736 26.1.2 Fraud and Misconduct Are Difficult to Define, 738 26.2 Research Practices, 741 26.2.1 Misconduct May Be Increasing in Frequency, 741 26.2.2 Causes of Misconduct, 742 26.3 Approach to Allegations of Misconduct, 743 26.3.1 Institutions, 744 26.3.2 Problem Areas, 746 26.4 Characteristics of Some Misconduct Cases, 747 26.4.1 Darsee Case, 747 26.4.2 Poisson (NSABP) Case, 749 26.4.3 Two Recent Cases from Germany, 752 26.4.4 Fiddes Case, 753 26.4.5 Potti Case, 754 26.5 Lessons, 754 26.5.1 Recognizing Fraud or Misconduct, 754 26.5.2 Misconduct Cases Yield Other Lessons, 756
26.6 Clinical Investigators' Responsibilities, 757 26.6.1 General Responsibilities, 757 26.6.2 Additional Responsibilities Related to INDs, 758 26.6.3 Sponsor Responsibilities, 759
26.7 Summary, 759
26.8 Questions for Discussion, 760

Appendix A Data and Programs, 761
A.1 Introduction, 761
A.2 Design Programs, 761 A.2.1 Power and Sample Size Program, 761 A.2.2 Blocked Stratified Randomization, 763 A.2.3 Continual Reassessment Method, 763 A.2.4 Envelope Simulation, 763
A.3 Mathematica Code, 763

Appendix B Abbreviations, 764

Appendix C Notation and Terminology, 769
C.1 Introduction, 769
C.2 Notation, 769 C.2.1 Greek Letters, 770 C.2.2 Roman Letters, 771 C.2.3 Other Symbols, 772
C.3 Terminology and Concepts, 772

Appendix D Nuremberg Code, 788
D.1 Permissible Medical Experiments, 788
References, 790

Index, 871
PREFACE TO THE THIRD EDITION
The third edition of this work reflects three trends. The most important trend is the continued evolution of clinical trials as a scientific tool. Changes in therapeutic approaches based on knowledge of gene function, cell pathways, and disease mechanisms, for example, have in recent years fundamentally altered therapeutic questions and the designs that assess them. This is very evident in fields such as cancer, where it is commonplace to talk about precision medicine and targeted therapy as products and goals of increased biological knowledge. Most other medical disciplines are experiencing similar evolution. Detailed biological knowledge yields sharper therapeutic questions, and often requires refined clinical trial designs. This evolution will certainly continue into the future.

The second trend underlying this third edition derives from my uneasiness with the apparent imbalance of theory and empiricism in clinical trials. Too many clinical investigators appear to mis-learn the lessons of clinical trials, and carry inflexible views and voids in place of workable principles. Trials in late development are strongly empirical, but the role of biological theory is dominant in translation and early development. This edition places greater emphasis on the balanced interplay of ideas that is essential to understand the wide application of clinical trials.

A third trend motivating this edition relates to the training of clinical researchers. Since I first began teaching this subject in the 1980s, the structure of medical education has changed very little. Clinical investigator training at many institutions has stalled in a compromise somewhere between therapeutics and public health. Despite the huge body of work on methods for therapeutic research specifically, the training of high-quality clinical investigators remains endangered by weak curricula, impossible funding and competition for their time, over-regulation, and skeptical views of their science.

The implication of all these trends is that more time, effort, and clarity are needed in teaching the science of clinical trials. It is my hope that this edition will make some progress in each of these areas.

Colleagues have continued to support this work by giving me their valuable time in discussion and manuscript review. I am grateful to many of them for such help.
Chris Szekely, Ph.D., at Cedars-Sinai Medical Center reviewed many chapters and references in detail. Chengwu Yang, M.D., M.S., Ph.D., at Hershey Medical Center was also kind enough to review chapters and provide advice based on classroom experience. Special thanks to Jim Tonascia, Ph.D., and Shing Lee, Ph.D., for constructive advice. Students in my class Design and Analysis of Clinical Trials, first at Johns Hopkins and, from 2007 to 2014, in the Specialty Training and Research (STaR) Program at the University of California Los Angeles, provide the best motivation for writing through their needs and curiosity. My UCLA classes in 2013 and 2014 read and commented on new chapters and late revisions.

I would also like to thank the faculty, students, and sponsors from several teaching workshops who provided many practical questions and examples for interesting clinical trial designs over the years. These include the Clinical Trial Methods in Neurology Workshop in Vail, Colorado, from 2007 to 2012, and the AACR Methods in Cancer Biostatistics Workshop held in Sonoma in 2007 and 2009 and in Tahoe in 2015, which sharpened my focus on the role and needs of biostatisticians in clinical cancer research. Similarly, our annual Clinical and Translational Research Workshop at Cedars-Sinai Medical Center draws students and faculty together in didactics and mentoring for trial development. Those and similar teaching venues stimulate clarity and flexibility from a clinical trialist.

My current role as Director of a Cancer Institute has not lessened my interest in methodologic topics that also extend outside cancer, or my belief that such diversification improves understanding overall. My local environment teaches the importance of evolving methodology to keep pace with scientific questions, and the high value of clinical and translational research for the improved lives of patients and advancement of science. My hope is that this book contributes to those goals across disciplinary boundaries.

STEVEN PIANTADOSI
Los Angeles, California
2017
Books are fatal: they are the curse of the human race. Nine-tenths of existing books are nonsense, and the clever books are the refutation of that nonsense. The greatest misfortune that ever befell man was the invention of printing. [Benjamin Disraeli]

Writing is an adventure. To begin with, it is a toy and an amusement. Then it becomes a mistress, then it becomes a master, then it becomes a tyrant. The last phase is that just as you are about to be reconciled to your servitude, you kill the monster and fling him to the public. [Winston Churchill]

Writing is easy; all you do is sit staring at a blank sheet of paper until the drops of blood form on your forehead. [Gene Fowler]

The product of paper and printed ink, that we commonly call the book, is one of the great visible mediators between spirit and time, and, reflecting zeitgeist, lasts as long as ore and stone. [Johann Georg Hamann]

Don't take any shit from the zeitgeist. [George Carlin]
The first draft of anything is shit. [Ernest Hemingway]
All writers are vain, selfish and lazy, and at the very bottom of their motives lies a mystery. Writing a book is a horrible, exhausting struggle, like a long bout of some painful illness. One would never undertake such a thing if one were not driven on by some demon whom one can neither resist nor understand. [George Orwell]
Books are never finished, they are merely abandoned. [Oscar Wilde]
Reading is more important than writing. [Roberto Bolaño]
ABOUT THE COMPANION WEBSITE
This book is accompanied by a companion website:

www.wiley.com/go/Piantadosi/ClinicalTrials3e

The website includes:

· A set of downloadable files that tracks the chapters
· Data files
· Program files
1 PRELIMINARIES
1.1 INTRODUCTION
The best time to contemplate the quality of evidence from a clinical trial is before it begins. High-quality evidence about the effects of a new treatment is a consequence of good study design and execution, which themselves are the results of careful planning. This book presents elements of clinical trial methodology that are needed in planning, designing, conducting, analyzing, and assessing clinical trials with the goal of improving the evidence derived from these important studies. The topics discussed include subjects relevant to early planning and design, some that find general agreement among methodologists and others that are contentious. It is unlikely that a reader experienced with clinical trials will agree with all that I say or emphasize, but my perspective should be mainstream, internally consistent, and useful for learning.

Much of what is written about clinical trials tells us what to do. As a result, many investigators, sponsors, and regulators become proficient in standard practice, perhaps to the point that they view it as restrictive. Not as much of the literature explains why we do what is recommended. Knowing only what to do creates patterned thinking, whereas knowing why we do it allows appropriate creative exceptions and coping with atypical circumstances. I try throughout this book to emphasize the rationale for what we do in clinical trials, and specifically discourage unhelpful patterned thinking.

My further purpose is to challenge the dogma and traditions to which many young investigators have been exposed by their older colleagues. I will not stereotype and validate impressions gathered from the very imperfect culture of clinical investigation that exists nearly everywhere. In recent years especially, I have noticed some aggressive ignorance regarding the scientific principles underlying clinical trials. The goal of this book is to teach foundational principles and their sometimes complicated implications so
that the sublime, ordinary, and absurd in clinical trials become more clear. As a general rule, I will discard cherished perspectives if they lack a strong scientific basis.

This book is not an introduction to clinical trials, and it has evolved to become more technical and lengthy as it has been updated through the years. Even so, I am aware from colleagues that portions of it have found their way into didactics for beginning students of clinical trials. My intent is for this book to be part of a one-quarter or one-semester structured postgraduate course for an audience with quantitative skills and therapeutic focus. I take for granted that the audience has already completed a formal didactic introduction to clinical trials. I also expect that the audience will have completed a basic postgraduate series in biostatistics.

The first edition of this book evolved over a dozen years by merging a course in experimental design with one on clinical trials. The second edition was the result of seven additional years of teaching and concomitant changes in the field. The third edition incorporates changes and needs that I have perceived in the field over the past 8–10 years, and responds partly to my experiences teaching diverse clinical investigators at a second major medical school and academic medical center.
1.2 AUDIENCES
The intended teaching audience includes both medical and statistical professionals early in their clinical research careers. Each component of the audience presents a certain dilemma. The book assumes a working knowledge of basic biostatistics. It is not really possible to get beyond the introductory ideas of clinical trials without statistics. It is also helpful if the reader understands some more advanced statistical concepts, including ideas like methods of inference, error control, lifetables, survival models, and likelihoods. Clinicians often lack this knowledge. However, modern clinical researchers are seeking the required quantitative background through formal training in clinical investigation methods or experimental therapeutics.

Biostatistics professionals require no clinical knowledge to understand the concepts in this book. Even so, fundamental principles of clinical research are helpful, including some basic human biology, research ethics, therapeutic developmental paradigms, clinical terminology, and ideas of clinical pharmacology. Eventually, for the clinical trials biostatistician, immersion in a disease-oriented discipline will be vital to make the most substantive collaborative contributions as well as to identify and solve relevant methodological problems.

Uneven discussion (basic to technically complex) results from this mixed target audience and perspectives, as well as from the very nature of clinical trials. The classes that I teach using this book typically contain a mixture of biostatistics graduate students, medical doctors in specialty or subspecialty training (especially those working toward a degree in clinical investigation), and other health professionals training to be sophisticated managers or consumers of clinical trials. For such an audience, the goal is to provide breadth and to write so as not to be misunderstood. This book should be supplemented with lecture and discussion, and possibly a computer lab. The reader who does not have an opportunity for formal classroom dialogue will need to explore the references more extensively. Exercises and discussion questions are provided at the end of each chapter. Most are intentionally made open-ended, with a suggestion that the student answer them in the form of a one- or two-page memorandum, as though providing an expert opinion to less-experienced investigators.
In short, this book targets clinical trialists, who are not so simple to define. Operationally, a clinical trialist is someone whose career is focused on the science of trials. A trialist uses investigational methods across disciplinary boundaries, compared to a specialist who might perform some trials in a single domain. Being true interdisciplinary experts, trialists can be derived from a number of sources: (i) quantitative or biostatistical, (ii) administrative or managerial, (iii) clinical, or (iv) ethics. Students can effectively approach the subject from any of these perspectives. Eventually, however, a mature trialist will be conversant with important details from all the foundational fields. It is common today for rigorous trialists to be strongly statistical. This is because of the fairly rapid recent pace of methods for clinical trials coming from that field, and because statistics pertains to all of the disciplines in which trials are conducted (or all of science, for that matter). However, the discussion in this book does not neglect the other viewpoints that are also essential to understanding trials. Many examples herein relate to cancer because that is the primary field in which I work, but the concepts will generalize to other areas. Studying trials in different clinical disciplines is the best device for understanding principles. Unfortunately, the structure of many institutions and collaborations inhibits this.

Scientists who specialize in clinical trials are frequently dubbed "statisticians" by their clinical colleagues. I will sometimes use that term with the following warning regarding rigor: statistics is an old and broad profession, and there is not a one-to-one correspondence between statisticians or biostatisticians and knowledge of clinical trials. However, trial methodologists, whether statisticians or not, are likely to know a lot about biostatistics and will be accustomed to working with statistical experts. Many trial methodologists are not statisticians at all, but evolve from epidemiologists or clinicians with a strongly quantitative orientation, as indicated above. My personal experience is that there are many excellent clinical trialists whom statisticians would label "physicians." The stereotyping is not important. The essential idea is that the subject has many doorways.

In recent years, many biostatistics professionals have gravitated toward bioinformatics, computational biology, or high-dimensional data. I shall not take the space here to define these terms or distinguish them from research informatics or (bio)medical informatics. It is just worth pointing out that many biostatisticians have no experience with clinical trials. Technologies are of high value, but each waxes and wanes with limited impact on methods of comparison. The principles of valid comparisons in medicine have evolved over hundreds of years, and especially in the recent century. These principles have not been altered by the phenomenal ongoing scientific or technological advances during that period of time. Revolutions such as germ theory, drug development, anesthesia and surgery, immunology, genomic science, biomedical imaging, nutrition, computerization, and all the others of history have not diminished the foundational need of medicine to assay treatments fairly. In fact, as these revolutions have integrated themselves into scientific medicine, the need for therapeutic comparison has expanded while the principles for doing so remain constant.
1.3 SCOPE
I have made an effort to delineate and emphasize principles common to all types of trials: translational, developmental, safety, comparative, and large-scale studies. This follows
from a belief that it is more helpful to learn about the similarities among trials rather than their differences. However, it is unavoidable that distinctions must be made and the discussion tailored to specific types of studies. I have tried to keep such distinctions, which are often artificial, to a minimum. Various clinical contexts also treat trials differently, a topic discussed briefly in Chapter 4.

There are many important aspects of clinical trials not covered here in any detail. These include administration, funding, conduct, quality control, and the considerable infrastructure necessary to conduct trials. These topics might be described as the technology of trials, whereas my intent is to focus on the science of trials. Technology is vitally important, but falls outside the scope of this book. Fortunately, there are excellent sources for this material.

No book or course can substitute for regular interaction with a trial methodologist during both the planning and analysis of a clinical investigation. Passive reliance on such consultations is unwise, but true collaborations between clinicians and trialists will result when both grasp the relevant body of knowledge. Although many clinicians think of bringing their final data to a statistician, a collaboration will be most valuable during the design phase of a study, when an experienced trialist may prevent serious methodologic errors, create efficiencies, or help avoid mistakes of inference.

The wide availability of computers is a strong benefit for clinical researchers, but presents some hidden dangers. Although computers facilitate efficient, accurate, and timely keeping of data, modern software also permits or encourages researchers to produce "statistical" reports without much attention to study design and without fully understanding the assumptions, methods, limitations, and pitfalls of the procedures being employed. Sometimes a person who knows how to run procedure-oriented packages on computerized data is called the "statistician," even though he or she might be a novice at the basic theory underlying the analyses. It then becomes possible to produce a final report of a study without the clinical investigator understanding the study execution or the limitations of analysis, and without the analyst being conversant with the data. What a weak chain that is.

An additional warning about the limits of technology is necessary. There is no computerized summary of clinical data that can ensure its correctness. Inspection of a computer-generated printout of data values by a knowledgeable investigator can help reassure us that gross errors are unlikely. But once the data are collapsed, re-coded, tabulated, transformed, summarized, or otherwise combined with each other by the computer, errors are likely to be hidden. Orderly-looking summaries do not ensure correctness of the underlying data, as the brief sketch below illustrates.

The ideas in this book are intended to assist reliability, not by being old-fashioned but by being rigorous. Good design inhibits errors by involving a statistical expert in the study as a collaborator from the beginning. Most aspects of the study will improve as a result, including reliability, resource utilization, quality assurance, precision, and the scope of inference. Good design can also simplify analyses by reducing bias and variability and removing the influence of irrelevant factors. In this way, number crunching becomes less important than sound statistical reasoning.
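To make the warning about hidden data errors concrete, the brief sketch below shows how a tidy summary can conceal a gross recording mistake. It is a hypothetical illustration, not an example from this text or its companion programs; the values and variable names are invented, and Python is used only for convenience.

    # Hypothetical illustration: a single transposed diastolic blood pressure value
    # (91 entered as 19) survives into a plausible-looking summary statistic.
    correct = [78, 82, 85, 88, 90, 84, 86, 91, 79, 87]   # mmHg, as measured
    entered = [78, 82, 85, 88, 90, 84, 86, 19, 79, 87]   # mmHg, as keyed into the database

    def summarize(label, values):
        mean = sum(values) / len(values)
        print(f"{label}: n = {len(values)}, mean = {mean:.1f} mmHg")

    summarize("Measured data", correct)
    summarize("Database data", entered)
    # Both lines print orderly, clinically plausible means; only inspection of the
    # individual values (or a simple range check) exposes the transposition error.

A range or outlier check on the raw values, rather than reliance on the summary alone, is exactly the kind of inspection recommended above.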
The student of clinical trials should also understand that the field is growing and changing in response to both biological and statistical developments. A picture of good methodology today may be inadequate in the near future. Designing trials to use biomarkers and genomics data effectively is one area of evolution. But change is probably even more true of analytic methods than design, where the fundamentals change slowly.
Analysis methods depend on new statistical developments or theory. These in turn depend on (i) computing hardware, (ii) reliable and accessible software, (iii) training and re-training of trialists in the use of new methods, (iv) acceptance of the procedure by the statistical and biological communities, and (v) sufficient time for the innovations to diffuse into practice.

It is equally important to understand what changes or new concepts do not improve methodology, but are put forward in response to nonscience issues or because of creeping regulation. A good example of this is the increasing sacrifice of expertise in favor of objectivity in clinical trial monitoring (discussed in Chapter 18). Such practices are sometimes as ill-considered as they are well-meaning, and may be promulgated by sponsors without peer review or national consensus.

Good trial design requires a willingness to examine many alternatives within the confines of reliably answering the basic biological question. The most common errors related to trial design are devoting insufficient resources or time to the study, rigidly using standard types of designs when better (e.g., more efficient) designs are available, or undoing the benefits of a good design with poor execution or analysis. I hope that the reader of this book will come to understand where there is much flexibility in the design and analysis of trials and where there is not.
1.4 OTHER SOURCES OF KNOWLEDGE
The periodical literature related to clinical trial methods is large and can only be approached via electronic searches. The relevant journals are not all accessible through a single source. There is also a considerable volume of methodology appearing in clinical journals. Some useful web resources are listed in Table 1.1. With regard to the Internet, it is important to recognize that it is not static, and therefore not the best source for references. However, many documents important to clinical trials can be found there with simple searches. Some of these documents in text form have been dropped from this edition because of the ease with which they can be located online. A number of books and monographs have dealt with many facets of clinical trials. I will mention only a few that have been recently updated or have proved to be quite durable. The classic text by Meinert [1026, 1028] is recently updated and takes a practical view of the organization, infrastructure, and administrative supports necessary to perform multicenter randomized trials. In fourth edition as of 2010 is the book by Friedman, Furberg, and DeMets [546] that is an excellent introduction. Pocock [1216] also discusses conceptual and practical issues without the need for extensive statistical background, and has a recent series of design papers oriented to the clinician investigator [1220–1223]. Some recent design developments are covered by Harrington [678], and controversial topics are discussed by Chow [258]. Every trialist should read the extensive work on placebos by Shapiro and Shapiro [1367]. Monographs about clinical trials appear occasionally in disease-specific contexts. This was true in cancer and AIDS in the last decade or two, but many of the books are now slightly dated. Even in a very active program of clinical research, a relatively short exposure to the practical side of clinical trials cannot illustrate all the important lessons. This is because it may take years for any single clinical trial, and many such studies, to yield all of their
information useful for learning about methodology. Even so, the student of clinical trials will learn some lessons more quickly by being involved in an actual study, compared with simply studying theory. In this book, I illustrate many concepts with published trials. In this way, the reader can have the benefit of observing studies from a long-term perspective, which would otherwise be difficult to acquire.

TABLE 1.1 Some Web Resources for Clinical Trials Information

assert-statement.org: Standards for scientific and ethical review of clinical trials
cochrane.org: Trial-based information about the effects of health care
clinicaltrials.gov: Federally and privately supported clinical research
consort-statement.org: Evidence-based tool to improve the quality of reports of randomized trials
jameslindlibrary.org: Evolution of fair tests of medical treatments; examples including key passages of text
icmje.org: Uniform requirements for manuscripts submitted to biomedical journals
gpp-guidelines.org: Encourages responsible and ethical publication of clinical trials sponsored by pharmaceutical companies
mcclurenet.com/ICHefficacy.html: ICH efficacy guidelines
controlled-trials.com: Current controlled trials; provides access to peer-reviewed biomedical research
1.5 NOTATION AND TERMINOLOGY
There is no escaping the need for mathematical formalism in the study of clinical trials. It would be unreasonable, a priori, to expect mathematics to be as useful as it is in describing nature [1555]. Nevertheless, it is, and the mathematics of probability is the particular area most helpful for clinical trials. Galileo said:

The book of the universe is written in mathematical language, without which one wanders in vain through a dark labyrinth.
Statisticians light their dark labyrinth using abstract ideas and symbols (Greek letters among them) as a shorthand for important mathematical quantities and concepts. I will also use these ideas and symbols when appropriate in this book. Mathematics is a superb shorthand, and many essential ideas cannot be explained well without its notation. However, because this is not a book primarily about statistics, the use of symbols and abstraction is as minimal as possible. A review and explanation of common usage of symbols consistent with the clinical trials literature is given in Appendix B. Because some statistical terms may be unfamiliar to some readers, definitions and examples are also listed in that appendix. Abbreviations used in the book are also explained there. This book cannot provide the fundamental statistical background needed to understand clinical trial design and analysis thoroughly. As stated above, I assume the reader already
has much of this knowledge. Help with statistical principles that underlie clinical trial methods is available in the form of many practical and readable references for “medical statistics.” I would add only that some older texts dealing with that topic may be perfectly satisfactory since the fundamentals change quite slowly. Some specialized references will be mentioned later.
1.5.1 Clinical Trial Terminology
The terminology of clinical trials is not without its vagaries, despite Meinert’s efforts to standardize definitions in a dictionary devoted to the topic [1029]. Most of my terms are consistent with those definitions (also see Appendix B). However, I employ descriptive alternatives to the widely used, uninformative, inconsistent, outdated, and difficult-to-generalize “phase I, II, III, or IV” designations for clinical trials. That opaque terminology seems to have been derived or solidified from the heavy use of developmental trials for cytotoxic drugs in the closing decades of the twentieth century. Those terms have become jargon and found their way inappropriately to other contexts, helped by regulatory speak. In research and clinical care, we would never permit imprecision to the degree allowed in this old clinical trial jargon. It really can be bad enough to be unethical, and my belief is that such jargon is presently inhibiting creative design. Drug development terms can be ambiguous or inappropriate for nondrug trials. Even newer cancer therapies may not fit easily into the old terminology. Although medical disciplines tend to view their own research issues as being unique, and encourage local or specialty-specific terminology, teaching requires terminology that is independent of the context. Some terms that I specifically avoid are pilot, proof of concept, and exploratory. These terms tend to be context dependent, but more importantly have no definitions. They seem to be used mainly to deflect criticisms from obvious design flaws.
1.5.2 Drug Development Traditionally Recognizes Four Trial Design Types
One cannot separate terminology from the developmental paradigm employed. This actually represents the entire issue at hand, because incorrect terms imply an incorrect paradigm. A look at Chapter 10 might be useful now because there the paradigm and terminology are consistent and descriptive. In therapeutic drug (especially cytotoxic drug) development historically, clinical trials were classified simply as phases I, II, III, and IV. Phase I studies are pharmacologically oriented and usually attempt to find the best dose of drug to employ. Phase II trials look for evidence of activity, efficacy, and safety at a fixed dose. Phases I and II are usually not formally hypothesis driven, meaning that comparisons to other treatments are often external to the experiment. In phase III, new treatments are compared with alternatives, no therapy, or placebo. The comparison group is internal to the design. Investigators would not undertake the greater expense and effort of phase III comparative testing unless there was preliminary evidence from phase I and II that a new treatment was safe and active or effective. Phase IV is postmarketing surveillance and may occur after regulatory approval of a new treatment or drug to look for uncommon side effects. This type of study design is also used for purposes other than safety and activity, such as marketing, and to uncover potential new indications that might support continued product exclusivity.
TABLE 1.2 Descriptive Terminology for Clinical Trials

Jargon                   Developmental Stage   Descriptive Terminology
Pilot or phase 0         Translation           Translational trial
Phase I                  Early                 Treatment mechanism
                                               Dose-finding (DF)
                                               Dose-ranging
Phase II, Phase IIA      Middle                Safety and activity (SA)
Randomized phase II^a    Late                  Underpowered comparative
Phase IIB                                      Comparative
Phase III                                      Comparative treatment efficacy (CTE)
Phase IV                 Post-development      Expanded safety (ES)
Large simple                                   Large scale (LS)

^a Usually discussed as part of “phase II” but is a late development (comparative) design.
Although this terminology has been widely applied to therapeutics, it often does not fit well. In some cancer settings, phase II is divided into IIa and IIb. Phase IIa trials are small-scale feasibility studies using surrogate or intermediate endpoints such as cancer precursor lesions or biomarkers. Surrogate outcomes are defined and discussed in Chapter 5. Phase IIb trials are randomized comparative studies using intermediate endpoints. Phase III cancer prevention trials are comparative designs (like IIb) using definitive clinical endpoints such as cancer incidence [831]. Some cancer prevention investigators have used the term “phase IV” to mean a defined population study [641, 642]. The same authors define phase V to be demonstration and implementation trials.

1.5.3 Descriptive Terminology Is Better
The “phase I, II, III” labels have become too strong a metaphor, and frequently interfere with our need to think creatively about study purposes and designs. It often seems fitting that the labels have Roman numbering. It is common to see protocols with “phase I/II” titles, and I have even seen one or two titled “phase I/II/III,” indicating how investigators labor under the restrictive labels to accommodate more flexible goals (or how confused they are). A more general and useful description of studies should take into account the purposes of the design and the stage of the developmental paradigm, independent of the treatment being studied. A mapping from old to new terminology is provided in Table 1.2. Although mostly obvious, the descriptions are unconventional, so I will use them in parallel with traditional labels when needed for clarity. In this book, the terms phase I, II, III, and IV, if used at all, refer narrowly to drug trials. Descriptive terminology also accounts for translational trials, an important class discussed in Chapter 11. It distinguishes several types of early developmental designs, particularly those aimed at establishing a safe dose of a new drug or agent to study. Comparative efficacy is an important class of trial designs that embody many of the rigorous design fundamentals discussed throughout this book. Details about these design types are given in Chapter 14.
1.6 EXAMPLES, DATA, AND PROGRAMS
It is not possible to learn all the important lessons about clinical trials from classroom instruction or reading, nor is it possible for every student to be involved with actual trials as part of a structured course. This problem is most correctable for topics related to the analysis of trial results, where real data can usually be provided easily. For some examples and problems used in this book, data are provided on a Web site described in Appendix A. I have made a concerted effort to provide examples of trials that are instructive but small, so as to be digestible by the student. Computerized data files and some programs to read and analyze them are provided. The site also contains some sample size and related programs that may be helpful for design calculations. More powerful sample size (and other) design software is available commercially and elsewhere on the Internet. Experience with the design of clinical trials can be difficult to acquire. Although oriented primarily toward that topic, this book cannot replace an experienced teacher, good mentorship, collaborations with knowledgeable experts, and participation in many trials. I encourage the student to work as many examples as feasible and return to the discussion here as actual circumstances arise.
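To give a flavor of the kind of design calculation that sample size programs perform, the following is a minimal Python sketch (not taken from the book's website) computing an approximate per-group sample size for a two-arm comparison of proportions using the standard normal-approximation formula. The response rates, type I error, and power shown are hypothetical, and validated design software should be used for actual trials.

```python
from math import ceil, sqrt
from statistics import NormalDist

def two_proportion_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for detecting a difference between
    two proportions, using the usual normal-approximation formula."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - type II error
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Hypothetical design: detect an improvement in response rate from 30% to 45%
# with two-sided alpha = 0.05 and 80% power (about 163 subjects per group).
print(two_proportion_sample_size(0.30, 0.45))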
1.7 SUMMARY
The premise of this book is that well-designed experimental research is a necessary basis for therapeutic development and clinical care decisions. The purpose of this book is to address issues in the methodology of clinical trials in a format accessible to interested clinical trialists and statistical scientists. The audience is intended to be practicing clinicians, statisticians, trialists, and others with a need for understanding good clinical research methodology. The reader familiar with clinical trials will notice a few differences from usual discussions, including descriptive terms for types of trials, and an even-handed treatment of different statistical perspectives. Examples from the clinical trials literature are used, and data and computer programs for some topics are available. A review of essential notation and terminology is also provided.
2 CLINICAL TRIALS AS RESEARCH
2.1 INTRODUCTION
In the late nineteenth and early twentieth century, therapeutics was in a state of nihilism. Nineteenth-century science had discovered that many diseases improved without therapy, and that many popular treatments such as certain natural products and bloodletting were ineffective. The nihilism was likely justifiable because it could be claimed that nearly the entire accomplishment of therapeutics up to that point was only the history of the placebo effect [1367]. The skepticism that would become a part of scientific medicine would be applied unevenly up to the present day [636]. When scientists showed that diseases such as pellagra and diabetes could have their effects relieved with medicinals, belief in treatment began to strengthen. Following the discovery of penicillin and sulfanilamide in the twentieth century, the period of nihilism ended [282, 1476]. In the twentieth century, discovery of effective drugs for the treatment of many diseases such as cancer, cardiovascular disease, infections, and mental illness as well as the crafting of vaccines and other preventive measures demonstrated the value of therapeutics. Clinical trials played a key role in the development of many therapies and preventives, especially in the last half of the century. There is evidence to support the idea that the overall economic status of developed nations is substantially due to health of the population [1184], itself dependent on effective public health and therapeutic interventions. Randomized trials have been the centerpiece of development because of bias control, but their complexity, changing role, and other pressures on the acquisition of
medical knowledge have now made them a focal point of conflicting interests [173]. Currently there is a premium on evidence generated by clinical trials and regulatory requirements to use them. But there are also many obstacles to their implementation and success. This is especially true in the broad scope of trial applications in the search for effective prevention agents, therapeutics, comparing competing therapies, establishing optimum treatment combinations and schedules, assessing devices, and surgical treatments. In the twenty-first century, we are focused on personalized or precision medicine, roughly equivalent to the idea that we are all victims of our own rare disease. This perspective is derived from awareness of the implications of gene expressions, epigenetic factors, gene–environment interactions, gene–gene interactions, and the proteosome. The idea of homogeneous diseases has been deconstructed, and a new synthesis has yet to arrive. Despite the new terminology, what is presently driving therapeutics via genomics is not fundamentally new to scientific medicine. We are actually seeing the elucidation of fundamental modes of action within the framework of validated biological theory, and the opening of those mechanisms to new therapeutic approaches. Despite the promise of precision medicine, it is necessary to keep in mind potential limitations. One problem was illustrated by the targeted drug ivacaftor that compensates for a protein defect in 5% of people with cystic fibrosis. It consumed decades of development time and has been hailed as a shining example of the promise of precision medicine. Unfortunately it costs $300,000 per year per patient. Even more problematic is a recent trial indicating that high-dose ibuprofen, aerosolized saline, and azithromycin yield nearly equal benefits for about $300 per year, but applicable to 100% of cystic fibrosis patients [763]. Discovery and utility questions are both accessible to clinical trials. Clinical trials, being part of the empirical arm of science, are just now catching up to the consequences of the new discoveries. Research problems derived from genome-based medicine that impinge directly on clinical trial design and conduct include biologically based cohort selection or enrichment, use of validated biomarkers and imaging to increase trial efficiency, assessment of therapeutic interactions, and tests of a growing multiplicity of new agents and modalities. Experiment design and analysis have become increasingly important because of the greater detail in modern biological theories and the complexities in treatments of disease. This is well illustrated by genome-based medicine. The clinician is interested in modest sized, biologically significant treatment effects that could be obscured by uncontrolled natural variation or bias in poorly designed studies. This focus places rigorous clinical trials at the very center of clinical research today, although our interest in small effect sizes creates many issues for clinical investigation [16]. Other contemporary pressures also encourage the application of rigorous clinical trials. Societal expectations to relieve suffering through medical progress, governmental regulation of prescription drugs and devices, and the economics of pharmaceutical development all encourage or demand efficient and valid study design. Nearly all good clinical trials have basic biological, public health, and commercial value, encouraging investigators to design studies that yield timely and reliable results. 
In the current era, the pendulum of nihilism has swung very strongly in the opposite direction. Today, there is belief in the therapeutic efficacy of many treatments, as well
as our ability to develop new ones. Unproven, complimentary, alternative, fringe, and other methods abound, with their own advocates and practitioners. Many patients put their confidence in untested or inadequately tested treatments. Even in disease areas where therapies are regularly evaluated rigorously, many patients assume treatments are effective, or at least worth the risk, or they would not be under investigation. Other people perhaps with few alternatives are simply willing to take a chance that a new treatment will work, especially when the side effects appear to be minimal. The view from inside science is that we are in the era of rationally designed or targeted drugs and biologicals. Targeting the right molecular entities in pathogenic or regulatory pathways promises high efficacy with low toxicity, motivating current therapeutic optimism. The approach is founded on identification of targets in disease pathogenesis, ability to detect the targets, and specific treatments, which are often small molecules or monoclonal antibodies. Numerous successes have greatly stimulated more research and optimism. Examples include statins in atherosclerosis, imatinib mesylate in chronic myelogenous leukemia, Herceptin in patients with overexpressed HER2 protein breast cancer, and PD-1 or PD-L1 blockade in advanced cancers such as melanoma, to name only a few. To a modern clinical scientist, there is great opportunity to serve the needs of society by providing the most reliable evidence about effects, risks, and value of smart new treatments. This provides pressure to use good clinical trials. However, the same circumstances can create incentives to bypass rigorous evaluation because strong belief in efficacy can arise from unreliable data, as history has shown. Whether contemporary circumstances encourage or discourage clinical trials depend largely on mindset and values. Cost containment and the need to add more value to medical practice with lower expenditures may also amplify conflicting pressures on clinical trials. The clinical scientist is vitally interested in using strong evidence to support effective therapies and discarding ineffective ones. Unfortunately it is cheaper and occasionally easier to test happenstance data (which sometimes exist for financial or related reasons) for evidence of therapeutic differences. Some seriously well-meaning, empowered, confused health care administrators might trade designed data for less costly happenstance data as a basis for therapeutic decisions. Some recent studies of how clinical trials perform as agents of change indicate that scientific medicine is doing a reasonable job. A review of 363 publications testing current practice at a high impact clinical journal showed 40% of such practices to be ineffective, while 38% were supported, and 22% of results were inconclusive [1231]. It is likely the case that controversial or suspicious treatments were chosen for study, so we might expect that current practice contains more than 40% truly useful therapies. Importantly there seems to be some willingness of investigators to test established practice. But the same evidence indicates that some new therapies are embraced prematurely, and consequently require a high cost to perform definitive evaluations later. Another investigation of 860 published randomized trials over the last 50 years [385] found that the chance of demonstrating superiority of a new therapy compared to standard is 50–60%. 
This reassuring and appropriate—trials seem to have been implemented when uncertainty is maximal, which is where they have their greatest utility. When 1128 systematic reviews from the Cochrane collaboration in the years 2004–2011 were studied [1512], over 44% were assessed as providing insufficient evidence to support clinical practice. This indicates that the best evidence that we can gather bearing on
important therapeutic questions often requires additional rigorous evaluations in the form of comparative clinical trials.
2.2 RESEARCH

2.2.1 What Is Research?
Research is systematic, creative, process-driven work intended to increase knowledge of the natural world. Original research is highly creative, whereas derivative research is more obvious. This definition is not the one found in standard dictionaries but is appropriate for the purposes here. Later, the focus of this definition on the natural world will be important. The reasons why we engage in research include to (1) generate new knowledge, (2) verify or refute other findings or theory, (3) explain observations and determine causes, (4) make reliable predictions, (5) uncover past experience, and (6) collate, organize, or simplify knowledge. The origins for research are curiosity, the needs of society and individuals, and human nature. Research has a number of characteristics that distinguish it from other activities. These include the following:

1. Goal orientation: the purpose is specific and finite.
2. Empirical component: based on direct experience or observation by the researcher.
3. Theory-based component: based on an existing body of knowledge (theory).
4. Objectivity and logic: principles and procedures are employed that are known to be valid for the purposes.
5. Design: process steps are designed to achieve the goals.
6. Descriptive and analytical: findings are presented in ways extending pure description further toward analytic interpretation.
7. Critical disposition: assessment of findings is made without undue respect for subjectivity.
8. Methodical approach: conducted without bias using systematic methods and procedures.
9. Repeatability: design and procedures can be replicated to validate results.
This is the signature of a unique human activity. The clinical trialist uses every one of these research methods. The results of research improve quality of life, reduce work and other burdens, and satisfy cultural and psychological needs. Often overlooked is the strongly positive economic effect that research enterprises have. The impact on the researcher is professional satisfaction, increased overall competency, visibility and respect, and in some cases financial gain. A practice-oriented physician who does not conduct research still needs basic research competency as a consumer of research output which comes often in the form of clinical trials. A goal of any clinical trial research is to “make progress,” but that is only motivational. The real issues are granular and require discipline and attention to detail. That awareness distinguishes the experienced investigator from the beginner. Any trialist must understand two different modes of thinking—clinical and statistical—that support science. Each way
of reasoning is specific, and both underlie the re-emergence of therapeutics as a modern science. The two methods of reasoning arose independently and must be combined skillfully if they are to serve therapeutic questions effectively.

2.2.2 Clinical Reasoning Is Based on the Case History
The word clinical is derived from the Greek kline, which means bed. In modern usage, clinical not only refers to the bedside but pertains more generally to the care of human patients. The quantum unit of clinical reasoning is the case history, and the primary focus of clinical inference is the individual patient. Before the widespread use of the experimental method, clinical modes of generalizing from the individual to the population were informal. The concepts of person-to-person variability and its sources were primitive and handled informally—uncertainty was qualitative. Medical experience and judgment was not, and probably cannot be, captured in a set of rules. Instead, it is a form of “tacit knowledge” [1224, 1225] and is very concrete. The lack of tools to deal formally with variation is a serious shortcoming of clinical reasoning. Under appreciation of natural variation may also contribute to the attractiveness of what might be called “genomic determinism” or the implicit idea that disease and treatment efficacy are essentially fully determined by factors in the genome. Determinism draws strength from the fact that it is actually true in some cases. More generally, clinical inference benefits from formal accounting of variation, either because the practitioner understands the biological domain is fundamentally random, or accepts it as a mere practicality. New and potentially useful clinical observations are made against this background of reliable experience. Following such observation, many advances have been made by incremental improvement of existing ideas. This process explains much of the progress made in medicine and biology up to the twentieth century. Incremental improvement is a reliable but slow method that can optimize many complex processes, especially those amenable to quantification. For example, the writing of this book proceeded by slightly improving earlier drafts, especially true for the second and third editions. However, there was a foundation of design that greatly facilitated the entire process. Clinical trials can provide a similar foundation of design for clinical inference, greatly amplifying the benefits of careful observation and small improvements. There is often discomfort in clinical settings over how or if population-based estimates, such as those from a clinical trial, pertain to any individual, especially a new patient outside the study. This is not so much a problem interpreting the results of a clinical trial as a difficulty trying to use results appropriately to select the best treatment for a new individual. There is no formal way to accomplish this generalization in a purely clinical framework. It depends partly on judgment, which itself depends on experience. However, clinical experience historically has been summarized in nonstatistical ways. A stylized example of clinical reasoning, and a rich microcosm of issues, can be seen in the following case history, transmitted by Francis Galton: The season of strawberries is at hand, but doctors are full of fads, and for the most part forbid them to the gouty. Let me put heart to those unfortunate persons to withstand a cruel medical tyranny by quoting the experience of the great Linnæus. It will be found in the biographical notes, written by himself in excellent dog-Latin, and published in the life of him by Dr. H. Stoever, translated from German into English by Joseph Trapp (1794).
Linnæus describes the goutiness of his constitution in p. 416 (cf. p. 415) and says that in 1750 he was attacked so severely by sciatica that he could hardly make his way home. The pain kept him awake during a whole week. He asked for opium, but a friend dissuaded it. Then his wife suggested “Won’t you eat strawberries?” It was the season for them. Linnæus, in the spirit of experimental philosopher, replied, “tentabo—I will make the trial.” He did so, and quickly fell into a sweet sleep that lasted 2 hours, and when he awoke the pain had sensibly diminished. He asked whether any strawberries were left: there were some, and he ate them all. Then he slept right away till morning. On the next day, he devoured as many strawberries as he could, and on the subsequent morning the pain was wholly gone, and he was able to leave his bed. Gouty pains returned at the same date in the next year, but were again wholly driven off by the delicious fruit; similarly in the third year. Linnæus died soon after, so the experiment ceased. What lucrative schemes are suggested by this narrative. Why should gouty persons drink nasty waters, at stuffy foreign spas, when strawberry gardens abound in England? Let enthusiastic young doctors throw heart and soul into the new system. Let a company be run to build a curhaus in Kent, and let them offer me board and lodging gratis in return for my valuable hints [561].
The pedigree of the story may have been more influential than the evidence it provides. It has been viewed both as quackery and as legitimate [1228]. But as a trialist and occasional sufferer of gout, I find the story both quaint and enlightening with regard to a clinical mindset. Note especially how the terms trial and experiment were used, and the tone of determinism. Despite its successes, clinical reasoning by itself has no way to deal formally with a fundamental problem regarding treatment inefficacy. Simply stated that problem is why ineffective treatments frequently appear to be effective. This question plays in our minds as we read Linnæus’ case history. The answer to this question may include some of the following reasons:
· The disease has finished its natural course.
· There is a natural exacerbation–remission cycle.
· Spontaneous cure has occurred.
· The placebo effect.
· There is a psychosomatic cause and, hence, a cure by suggestion.
· The diagnosis is incorrect.
· Relief of symptoms has been confused with cure.
· Distortions of fact by the practitioner or patient.
· Chance.
However, the attribution of effect to one or more of these causes cannot be made reliably using clinical reasoning alone. Of course, the same types of factors can make an effective treatment appear ineffective, a circumstance for which pure clinical reasoning also cannot offer reliable remedies. This is especially a problem when seeking clinically important benefits of small magnitude. The solution offered by statistical reasoning is to control the signal-to-noise ratio using experiment design.
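The inefficacy problem can be made concrete with a small simulation. The sketch below, in Python with purely hypothetical numbers, supposes a condition in which 40% of patients improve on their own and a therapy that adds nothing. An uncontrolled case series still reports a 40% "response rate," while a concurrent randomized comparison correctly shows essentially no difference; this is the signal-to-noise control that design provides.

```python
import random

random.seed(1)

SPONTANEOUS_IMPROVEMENT = 0.40  # hypothetical natural remission rate
TREATMENT_EFFECT = 0.00         # the therapy truly adds nothing

def improves(treated):
    """Return True if a simulated patient improves."""
    p = SPONTANEOUS_IMPROVEMENT + (TREATMENT_EFFECT if treated else 0.0)
    return random.random() < p

# Uncontrolled case series: every patient receives the therapy, and the
# spontaneous improvements are read as a 'response rate' for the drug.
series = [improves(treated=True) for _ in range(200)]
print("Case series response rate:", sum(series) / len(series))

# Concurrent randomized comparison: the same natural history affects both
# arms, so the estimated difference correctly hovers around zero.
treated = [improves(treated=True) for _ in range(200)]
control = [improves(treated=False) for _ in range(200)]
print("Randomized difference:",
      sum(treated) / len(treated) - sum(control) / len(control))
```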
2.2.3 Statistical Reasoning Emphasizes Inference Based on Designed Data Production

The word statistics is derived from the Greek statis and statista, which mean state. The exact origin of the modern usage of the term statistics is obscured by the fact that the word was used mostly in a political context to describe territory, populations, trade, industry, and related characteristics of countries from the 1500s until about 1850. A brief review of this history was given by Kendall, who stated that scholars began using data in a reasoned way around 1660 [833]. The word statistik was used in 1748 to describe a particular body of analytic knowledge by the German scholar Gottfried Achenwall (1719–1772) in Vorbereitung zur Staatswissenschaft [6, 675]. The context seems to indicate that the word was already used in the way we mean it now, but some later writers suggest that he originated it [443, 939]. Porter [1229] gives the date for the use of the German term statistik as 1749. Statistics is a highly developed information science. It encompasses the formal study of the inferential process, especially the planning and analysis of experiments, surveys, or observational studies. It became a distinct field of study only in the twentieth century [1450, 1451]. Although based largely on probability theory, statistics is not, strictly speaking, a branch of mathematics. Even though the same methods of axioms, formal deductive reasoning, and logical proof are used in both statistics and mathematics, the fields are distinct in origin, theory, practice, and application. Barnett (1982) discusses various views of statistics, eventually defining it as the study of how information should be employed to reflect on, and give guidance for action in, a practical situation involving uncertainty [113].
Making reasonable, accurate, and reliable inferences from data in the presence of uncertainty is an important and far-reaching intellectual skill. Statistics is not merely a collection of ad hoc tricks and techniques, an unfortunate view occasionally held by clinicians and grant reviewers. Statistics is a way of thinking or an approach to everyday problems that relies heavily on designed data production and quantitative uncertainty. An essential value of statistical thought is that it minimizes the chance of drawing incorrect conclusions from either good or bad data. Modern statistical theory is the product of extensive intellectual development from the early twentieth century up to today, and has found application in all areas of science. It is not obvious in advance that such a theory should be applicable across so many disciplines, but statistics employs powerful and appropriate abstractions that allow generalization. That is one of the most remarkable aspects of statistical theory. Despite the universal applicability of statistical reasoning, it remains an area of substantial ignorance for many scientists. Statistical reasoning is characterized by the following general methods, in roughly this order:
1. Establish an objective framework for conducting an investigation.
2. Place data and theory on an equal scientific footing.
3. Employ designed data production through experimentation.
4. Quantify the influence of chance on outcomes.
5. Estimate systematic and random effects.
6. Combine theory and data using formal methods to make inferences.

Reasoning using these tools enhances validity and permits efficient use of information, time, and resources. Although all of these components are important, designed data production is an absolute requirement for clinical trials and for the experimental method generally. The quality of inference from data designed specifically for the question of interest cannot be overstated. Although estimating systematic and random effects (analysis) is also an important element of statistical reasoning, when applied to data that exist for other purposes without appropriate elements of design, the potential for error is high. Perhaps because it embodies a sufficient degree of abstraction but remains grounded by practical questions, statistics has been very broadly successful. Through its mathematical connections, statistical reasoning permits or encourages abstraction that is useful for solving the problem at hand and other similar ones. This universality is a great advantage of abstraction. In addition, abstraction can often clarify outcomes, measurements, or analyses that might otherwise be poorly defined. Finally, abstraction is a vehicle for creativity.
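As a concrete, if simplified, illustration of quantifying chance and estimating systematic and random effects (steps 4 and 5 above), the following Python sketch uses entirely hypothetical data to estimate a difference in mean response between two groups, along with a standard error and an approximate 95% confidence interval.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical outcome measurements from a small two-arm comparison.
treatment = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4, 5.5, 6.8]
control = [4.2, 5.0, 4.6, 5.3, 4.1, 4.9, 5.2, 4.4]

# Systematic effect: the estimated difference in mean response.
effect = mean(treatment) - mean(control)

# Random effects: sampling variability summarized by a standard error,
# assuming approximately independent observations in each group.
se = sqrt(stdev(treatment) ** 2 / len(treatment)
          + stdev(control) ** 2 / len(control))

# Approximate 95% confidence interval using a normal reference
# (a t reference would be more exact for samples this small).
lower, upper = effect - 1.96 * se, effect + 1.96 * se
print(f"Estimated effect {effect:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```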
2.2.4 Clinical and Statistical Reasoning Converge in Research
Clinical and statistical reasoning have different origins and purposes. Statistical science is so much younger than clinical science that the two could be viewed as fundamentally incompatible. But the need that combines these different types of reasoning is research, which is ageless. A clinical researcher is someone who investigates formal hypotheses arising from work in the clinic [534, 536]. This requires two interdependent tasks that statistics does well: generalizing observations from few to many, and combining empirical and theory-based knowledge. It is statistics that unifies and balances the 2000-year-old dichotomy of rationalism (dogmatism) and empiricism in science. There is no science without statistics and vice versa. Statistical reasoning operates at three levels in research. The basic level is purely descriptive, consistent with the origins of both natural and statistical science. The ability to control and replicate description is fundamental. At the next level, statistics facilitates reliable measurement of associations between observables. This depends on appropriate models of association as well as joint quantification. At the highest level, statistics measures and validates causal relationships between observables. The additional validity to support causality is derived from appropriate research design that controls or eliminates all factors except the relevant effect. In the classical statistical paradigm, population inferences are justified on the basis of analysis from a representative random sample. A good illustration of this is the ability to predict the outcome of an election by surveying a random sample of potential voters. However, a random sample of potential voters is not necessarily representative of those who actually vote. Hence the most relevant and efficient sample for election prediction is not random at all but is constructed by exit polling of those who just voted. Although a slight overstatement and somewhat heretical to say, clinical trials, as with other biomedical experiments and the voting example, do not perfectly follow the notion of sample and population. In a laboratory animal experiment, is anyone concerned that the rats or mice are not a random sample of the world’s rodents? Not only that, but most
scientists are not concerned immediately that the experimental subjects are not human. What is relevant is only that there is an appropriate model of the human condition under study, and that inferences grounded in biological fact will then extend more or less to humans. We even humanize or add and remove genes from some laboratory animals to make models better, or to remove extraneous effects. Of course there are limits to models, which is why clinical trials are eventually needed. But it illustrates that the inferential process is not purely sample based. As another example, suppose we have performed a case-control study in a factory and discovered that exposure to some chemical is associated with a high risk of disease. To whom should our inference about limiting exposure pertain? We would not say that only those factory workers should not be exposed, nor would we say that only factory workers should not be exposed. Based on understanding biology and risk, we would caution against exposure of everyone despite the complete antithesis of a random sample. This inferential extrapolation is not merely conservative. It is affirmative in that we think we have learned something that generalizes reliably despite the absence of a true experiment design and a representative random sample. Using this reasoning, the concern that clinical trials and their cohorts are not representative of the real world is incorrect or meaningless. Clinical trials are the real world because they study humans from whom we make generalizations based on biology rather than representative sampling. There is no more real world that exists in any critic’s environment—all clinic-based experiences represent biased selection. More importantly with regard to comparative trials, relative treatment effects tend to generalize despite biased selection, as the case-control example above illustrates. This happens because meaningful treatment–covariate interactions, or effect modifiers, are relatively uncommon. In the science of clinical research, empirical knowledge comes from experience, observation, and data. Theory-based knowledge arises from either established biology or hypothesis. In statistics, the empirical knowledge comes from data or observations, while the theory-based knowledge is that of probability and determinism, formalized in mathematical models. Models specifically, and statistics in general, are the most efficient and useful way to combine theory and observation. This mixture of reasoning explains both the successful application of statistics widely and the difficulty that some clinicians have in understanding and applying statistical modes of thought. In most purely clinical tasks, as indicated above, there is relatively little need for statistical modes of reasoning. The best use and interpretation of diagnostic tests is one interesting exception. Clinical research, in contrast, demands critical and quantitative views of research designs and data. The mixture of modes of reasoning provides a solution to the inefficacy problem outlined in Section 2.2.2. To perform, report, and interpret research studies reliably, clinical modes of reasoning must be reformed by statistical ideas. Carter et al. focused on this point appropriately when they said Statistics is unique among academic disciplines in that statistical thought is needed at every stage of virtually all research investigations, including planning the study, selecting the sample, managing the data, and interpreting the results [236].
Failure to master statistical concepts can lead to numerous and important errors and biases in medical research, a compendium of which is given by Andersen [37]. Coincident
with this need for statistical knowledge in the clinic, it is necessary for the clinical trials statistician to master fundamental biological and clinical concepts and knowledge relevant to the disease under study. Failure to accomplish this can also lead to serious methodological and inferential errors. A clinical researcher must consult the statistical expert early enough in the conceptual development of the experiment to improve the study. The clinical researcher who involves a statistician only in the analysis of data from a trial can expect a substantially inferior product overall.
2.3 DEFINING CLINICAL TRIALS

2.3.1 Mixing of Clinical and Statistical Reasoning Is Recent
The historical development of clinical trials has depended mostly on biological and medical advances, as opposed to applied mathematical or statistical developments. A broad survey of mathematical advances in the biological and medical sciences supports this interpretation [889]. For example, the experimental method was known to the Greeks, especially Strato of Lampsacus c. 250 BCE [973]. The Greek anatomists, Herophilus and Erasistratis in the third century BCE demonstrated by vivisection of prisoners that loss of movement or sensation occurred when nerves were severed. Such studies were not perpetuated, but it would be two millennia before adequate explanations for the observations would be formulated [1515, 1566]. There was considerable opposition to the application of statistics in medicine, especially in the late eighteenth century and early nineteenth century when methods were first being developed. The numerical method, as it was called, was proposed and developed in the early nineteenth century and has become most frequently associated with Pierre Charles Alexander Louis. His best known and most controversial study was published in 1828, and examined the effects of bloodletting as treatment for pneumonia [955]. Although the results did not clearly favor or disfavor bloodletting, his work became controversial because it appeared to challenge conventional practice on the basis of numerical results. The study was criticized, in part, because the individual cases were heterogeneous and the number of patients was relatively small. There was even a claim in 1836 by d’Amador that the use of probability in therapeutics was antiscientific [333]. Opposition to the application of mathematical methods in biology also came from Claude Bernard in 1865 [144, 145]. His argument was also based partly on individual heterogeneity. Averages, he felt, were as obscuring as they were illuminating. Gavarret gave a formal specification of the principles of medical statistics in 1840 [570]. He proposed that at least 200 cases, and possibly up to 500 cases were necessary for reliable conclusions. In the 1920s, R. A. Fisher demonstrated and advocated the use of true experiment designs, especially randomization, in studying biological problems [176, 471, 473]. Yet it was the mid-1900s before the methodology of clinical trials began to be applied earnestly. The delay in applying existing quantitative methods to clinical problem solving was probably a consequence of many factors, including inaccurate models of disease, lack of development of drugs and other therapeutic options, physician resistance, an authoritarian medical system that relied heavily on expert opinion, and the absence of the infrastructure needed to support clinical trials. In fact the introduction of
numerical comparisons and statistical methods into assessments of therapeutic efficacy has been resisted by medical practitioners at nearly every opportunity over the last 200 years [999]. Even today in biomedical research institutions and clinical trial collaborations, there is a firm tendency toward the marginalization of statistical thinking. The contrast with the recognized and overwhelming utility of mathematics in the physical sciences is striking. Among others, Eugene Wigner pointed this out around 1960, stating that

. . . the enormous usefulness of mathematics in the natural sciences is something bordering on the mysterious and that there is no rational explanation for it.
It is noteworthy that he referred broadly to natural science [1555, 1556]. Part of the difficulty fitting statistics into clinical and biomedical science lies with the training of health professionals, particularly physicians who are often the ones responsible for managing clinical trials. Most medical school applicants have minimal mathematical skills, and the coverage of statistical concepts in medical school curricula is usually brief, if at all. Statistical reasoning is often not presented correctly or effectively in postgraduate training programs. The result is ignorance, discomfort, and the tendency to treat statistics as a post hoc service function. For example, consider the important synergistic role of biostatistics and clinical trials in cancer therapeutics over the last 50 years. Biostatistical resources in cancer centers sponsored by the National Institutes of Health were until relatively recently evaluated for funding using the same guidelines and procedures as for resources such as glass washing, animal care, and electron microscopes [1087]. A nice historical review of statistical developments behind clinical trials is given by Gehan and Lemak [572]. Broader discussions of the history of trials are given by Bull [201], Meinert [1026, 1028], and Pocock [1213]. It is interesting to read early discussions of trial methods to see issues of concern that persist today [706]. Since the 1940s clinical trials have seen a widening scope of applicability. This increase is a consequence of many factors, including the questioning of medical dogma, notable success in applying experiment designs in both the clinical and basic science fields, governmental funding priorities, regulatory oversight of drugs and medical devices with its more stringent demands, and development of applied statistical methods. In addition, the public, governmental, industrial, and academic response to important diseases such as cardiovascular disease, cancer, and AIDS has increased the willingness of, and necessity for, clinical scientists to engage in structured experiments to answer important questions reliably [573, 637, 670]. We also should not forget the history of the polio epidemic, particularly in the United States from about 1950 to 1965, and the large-scale clinical trials that were needed to prove a vaccine. As a child during that period, I had only a slight awareness of the considerable public fear of polio, but do remember the wide implementation of the vaccines. The overall story of the polio field trials is told by Kluger [857], and is a remarkable microcosm of modern clinical trial issues. It is also noteworthy for the differences in public perceptions of clinical trials and biomedical research broadly. The perception that well-done clinical trials are robust has led to their wider use today. To refine this idea, individual clinical trials are not necessarily robust to deviations from their design assumptions. For example, imprecision, low power, or bias may result
from complications or incorrect assumptions designing a trial. However some design components are strongly robust, even against the unknown. Randomization, for example, controls the effects of both known and unknown confounders—this is one of its principal strengths. More generally, we can say that the methodology of clinical trials is robust. In fact, trial methods could be described as antifragile, to use a concept articulated by N. Taleb in his book of that name [1462]. This means that various stresses on trials (not chaos or disorder) have led to methodologic strengthening and improvement, yielding better individual trials. The same might be said of science generally. Discomfort with, and arguments against, the use of rigorous experimental methods in clinical medicine do persist. No single issue is more of a focal point for such objections than the use of randomization because of the central role it plays in comparative trials. Typical complaints about randomization are illustrated by Abel and Koch [2, 3], who explicitly reject randomization as a (1) means to validate certain statistical tests, (2) basis for (causal) inference, (3) facilitation of masking, and (4) method to balance comparison groups. Similar arguments are given by Urbach [1499]. These criticisms of randomization are extreme and I believe them to be misguided. I will return to this discussion in Chapter 17.
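The claim that randomization tends to balance both known and unknown confounders can be illustrated with a small simulation; the sketch below uses only the Python standard library and hypothetical numbers. Subjects carry an unmeasured prognostic factor. Under coin-flip randomization the factor is split about evenly between groups, whereas under a selection process in which sicker patients preferentially receive the new treatment (confounding by indication) the groups are badly imbalanced.

```python
import random

random.seed(2)
N = 2000

# Unmeasured prognostic factor: 1 = high risk (about 30% of subjects).
risk = [1 if random.random() < 0.3 else 0 for _ in range(N)]

def high_risk_fraction(group):
    return sum(risk[i] for i in group) / len(group)

# Randomized assignment: a fair coin decides who receives the new treatment.
rand_assign = [random.random() < 0.5 for _ in range(N)]
rand_treated = [i for i in range(N) if rand_assign[i]]
rand_control = [i for i in range(N) if not rand_assign[i]]

# Nonexperimental assignment: sicker patients preferentially receive the
# new treatment, a typical source of confounding by indication.
sel_assign = [random.random() < (0.7 if risk[i] else 0.3) for i in range(N)]
sel_treated = [i for i in range(N) if sel_assign[i]]
sel_control = [i for i in range(N) if not sel_assign[i]]

print("Randomized groups: %.2f vs %.2f high risk"
      % (high_risk_fraction(rand_treated), high_risk_fraction(rand_control)))
print("Selected groups:   %.2f vs %.2f high risk"
      % (high_risk_fraction(sel_treated), high_risk_fraction(sel_control)))
```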
2.3.2 Clinical Trials Are Rigorously Defined
An experiment is a series of observations made under conditions controlled by the scientist. To have an experiment, the scientist must control the application of treatment (or intervention). The essential characteristic that distinguishes experimental from nonexperimental studies is whether or not the scientist controls or manipulates the treatment (factors) under investigation. In nonexperimental studies, subjects are exposed (note the different term) for reasons beyond the investigator’s control. The reasons why exposure occurred may be unknown, or possibly even known to be confounded with outcome. Usually in experiments there is a second locus of control over extraneous influences, but this is not definitional. The play of chance is an influence that the scientist intends to reduce, for example. This additional control isolates the effect of treatment on outcome, makes the experiment efficient, enhances validity, and supports causality. However, efficiency is relative and is not a required characteristic of an experiment. Design is the process or structure that controls treatment administration and isolates the factors of interest. A clinical trial is an experiment testing a medical treatment on human subjects. Based on the reasoning above, a control group internal to the study is not required to satisfy the definition. In particular, nonrandomized studies can be clinical trials. The clinical trialist also attempts to control extraneous factors that may affect inference about the treatment. Which factors to control depends on context, resources, and the type of inference planned. The investigator will typically control (minimize) factors that contribute to outcome variability, selection bias, inconsistent application of the treatment, and incomplete or biased ascertainment of outcomes. We might think of these extraneous factors as noise with random and nonrandom components. Use of the term observational to describe medical studies that are not clinical trials, such as epidemiological studies, is common but inaccurate. All scientific studies depend on observation. The best terminology to distinguish clinical trials from other studies is
experimental versus nonexperimental. Conceptual plans for observation, data capture, follow-up of study participants, and analysis are similar for many types of medical studies. Use of an explicit comparison does not distinguish a clinical trial from a nonexperimental study. All medical studies are at least implicitly comparative—what do we observe compared to that which is expected under different conditions? In an important class of clinical trials, the comparison group is internal to the experiment, for example in randomized trials. But trials are not the only type of comparative studies. In 2014, the National Institutes of Health revised its definition of a “clinical trial” [1100, 1101]. The new definition is A research study in which one or more human subjects are prospectively assigned to one or more interventions (which may include placebo or other control) to evaluate the effects of those interventions on health-related biomedical or behavioral outcomes.
The terms in italics are further defined in NIH documents. Specifically prospective assignment indicates that the trial may contain only a single cohort. It is not perfectly clear to me that earlier NIH definitions of a clinical trial were inadequate, but this one is certainly consistent with the perspective here. 2.3.3
Theory and Data
The word theory can carry nearly opposite meanings depending on context, a characteristic shared by only a few words in the English language called contronyms or auto-antonyms. First, theory represents a coherent body of evidence – essentially established fact. Every branch of science has such theory from which new experiments are derived and against which they are measured. Some theory is mature and unlikely to be revolutionized because it has been well tested (e.g., physical chemistry and plane geometry), whereas other theories seem established but need to be reconciled (e.g., relativity and quantum mechanics). Some theory, such as evolution, is literally written in stone (not to mention comparative anatomy, embryology, and molecular biology!), but misunderstood for other reasons. A body of evidence could be explained by more than one coherent theory, in which case crucial experiments or more sophisticated observations will cause one to prevail. Used in this way, theory is connected to rationalism or dogmatism historically. I use the word theory more or less exclusively in this first sense. The second meaning of theory is speculation, frequently used as such in popular discussion, and is virtually opposite to its primary meaning above. Because scientists often speculate, it is easy to blur the usage of the term. Scientific speculations are formal, structured, and resolved by data (empiric evidence). From inside science, there is seldom any confusion about established knowledge as opposed to speculation. Workers know when they are fitting a stone, as opposed to redesigning an entire edifice. From outside science, established fact is sometimes challenged as speculation by misapplying the word theory as in the evolution example above. More subtly, one can confuse established fact with speculation as to the mechanisms responsible. For example, it may be incontrovertible that two molecules react to yield a product, and that the reaction is consistent with chemical theory. Nevertheless, there can be legitimate scientific speculation as to the mechanisms by which the products are formed at the atomic or subatomic level.
Clinical trials are examples of the experimental method, a balanced synthesis of theory (rationalism) and data (empiricism). The experimental method requires theory to construct useful clinical hypotheses, and the paradigm by which they are evaluated. The method also requires empirical knowledge, that is, data, to provide evidence either in support of, or against, the hypothesis. The ability of data derived from carefully planned experiments to disprove theory (falsifiability) is a recognizable hallmark of the scientific method [1227]. Statistical reasoning (Section 2.2.3) contributes greatly but not exclusively to the empirical side of the balance in the experimental method. Statistics is also a science and, like mathematics, is dominated by theory. The execution of a clinical trial represents an empiric act that tends to hide the considerable biological theory on which it is actually based. Even the foundation for ordinary therapeutic questions rests on a huge substrate of established biology. Suppose we test a new analgesic to see if it relieves the pain of migraine headaches. Any result from such a trial is unlikely to revise our knowledge about pathophysiology, and will be even less consequential for foundational clinical sciences such as pharmacology, anatomy, and neurology. For example, a negative result will be taken as a failure of the drug, not of the scientific paradigms on which the trial or headache pathophysiology are based. In contrast, the success of a migraine drug that acts through a nonanalgesic mechanism could be revolutionary for pathophysiology without altering fundamental clinical knowledge. Finally, if we are convinced that surreptitiously removing needles from voodoo dolls relieved migraine pain, we would have to re-interpret vast amounts of clinical and basic science, if not physics. It's not clear how this last question would even be constructed from a balanced scientific view of existing knowledge and designed data production. Incompatibility with established theory is at the core of my skeptical view of some complementary and alternative medicine therapies in Section 4.5.

2.3.4 Experiments Can Be Misunderstood
The denotations outlined in the previous section are not always appreciated by researchers. Even so, most medical practitioners and clinical researchers have a positive disposition toward clinical trials based on their experience. This is not always the case for the public, where use of the word experiment is often pejorative and makes some individuals uncomfortable. See also Section 2.3.5 which touches upon this topic. Terminology can convey values, especially to potential clinical trial participants. One informal use of the word experiment implies an unacceptable situation in which the experimenter lacks respect for the study participants. As an example, consider this article from a national daily newspaper: Patients are not always fully informed that they are guinea pigs in medical research studies authorized by the Food and Drug Administration, The (Cleveland) Plain Dealer said. The newspaper analysis of FDA files found that in 4154 inspections of researchers testing new drugs on humans, the FDA cited more than 53% for failing to fully disclose the experimental nature of the work [55].
This was the entire article, and it tends to equate being guinea pigs with experiment and deception, some of it on the part of the FDA. This text is little more than an advertisement favoring misconceptions. It is not brevity that corrupts, as evidenced in a lengthy piece by Lemonick and Goldstein [920]. This longer article, discussed in more
detail in Chapter 3, is similarly misinformed and suffers principally from its own double standards. Another sign of lack of respect for research participants is typified by the 2012 Toshiba advertisement in which a “professional medical test subject” suffering from side effects indicates his acceptance of being a “guinea pig,” though not with regard to his computer [478, 1190]. That such circumstances would be a topic of humor is disrespectful rather than creative. The insensitivity to research volunteers was surprising also because the company has used volunteers to assess some of its medical products such as MRI scanners. Toshiba seems to have missed the point even after they were challenged on it. Despite misconceptions, the use of the word experiment in a scientific context to describe clinical trials is appropriate, and has a tradition [133, 483, 541, 542, 717, 827, 1019, 1394]. The usage of this term in this book will always be according to the definition above, as in conducting an experiment, rather than indicating disrespect for, or experimenting on, study participants. Aside from terminology, the newspaper blurb illustrates another issue about populist perceptions regarding clinical trials, particularly in the United States. To most nonscientists and the lay press, clinical trials are viewed as a component of the medical establishment. Stories about wasting resources, abuses, and difficulties with clinical trials fit well with popular anti-establishment thinking. Similarly there is broad sympathy with the following stereotypes: the lone practitioner idealized as representing the best interests of the patient; one who has discovered a cure; new, less costly, or alternative approaches; oppression by rigor, unthinking critics, or tradition. Alternative medicine sometimes appeals to these stereotypes, which can make good newspaper copy. The history and current status of the failed anticancer agent, hydrazine sulfate (discussed in Section 20.8.5), fits this paradigm almost perfectly. Unfortunately, such populist thinking about clinical trials and the mass media perspective that encourages it are exactly backward. Clinical trials have nearly always been an anti-establishment tool, challenging authoritarian views and those aligned with special or conflicted interests. Trials may never receive a great deal of public sympathy or the perspective they deserve—they are expensive and require government or corporate funding, and don’t appear to advocate directly for the individual. Nevertheless, clinical trials are our most powerful tool to counteract dogma, a perspective rarely taken by the media. Another issue related to terminology surrounds the terms patient, subject, and participant in the context of clinical trials. Many investigators use these terms interchangeably. There is a general dislike of the term subject in the advocacy community, because it appears to imply a lack of respect for individuals in a clinical trial. My view is that this is a misperception based partly on the melding of different definitions for the term. The relevant definition is as the topic, focus, or object of the research, in which case subject is respectful, clear, and appropriate. For retrospective or nonexperimental designs, the term subject is also appropriate. The irrelevant definition refers to someone under the rule of another, which is not the case in medical research as articulated by the principle of autonomy in Chapter 3. 
One can't simply replace subject with participant because the latter term sometimes informally refers to investigators and support staff in a trial. Participants in a clinical trial are not patients, strictly speaking, because the physician–patient relationship is different from the investigator–subject relationship. We should
therefore also be careful to distinguish the roles of physician and investigator. Having said this, it is impossible to be perfectly consistent in the use of these terms. Participants in a clinical trial are always patients in some context even while being research subjects. Similarly, physicians also carry a dual role as practitioners and investigators (Section 3.2). Although I have tried to use these terms consistently and appropriately in this book, the language is imperfect but always respectful.
2.3.5 Clinical Trials and the Frankenstein Myth
For decades the public has had a double standard and inconsistent view of clinical trials as objects of either risk or benefit. Trials always represent both risks and benefits of biomedical research, but the public portrayal of a trial is usually either risk or benefit. Risk is the perspective offered most often or when a trial is criticized for any reason. The double standard is that our culture wants benefits without risk. Currently we are also hesitating about the cost, which tends not to be seen as an investment in the future.

Public fear and criticism regarding clinical trials invariably follows the "Frankenstein myth." The theme of this myth is that an arrogant scientist has overstepped his or her intellectual, moral, ethical, or legal boundaries and taken advantage of the unsuspecting public. This version of the myth comes from the popular twentieth century version of Frankenstein, which is a corrupted derivative of the original Frankenstein or the Modern Prometheus by Mary Shelley [1375–1378]. Shelley's original work is a fascinating read for the modern scientist now nearly 200 years later. The Frankenstein myth is virtually the only storyline offered in criticism of science by modern media.

A principal theme of Frankenstein or the Modern Prometheus is the alienation of the creature from his scientist–creator, suggesting that unnatural creation is a violation of natural law. The Prometheus in the title turns out to be a synthesis of both the Greek and Roman myths: the Greek Prometheus stole fire from Olympus to soothe and save humankind; the Roman Prometheus did not save men but created them from earth and fire. Shelley's Victor Frankenstein (the creature remained unnamed) sought to perform both unnatural acts.

A fear of the modern lay person as a prospective clinical trial participant is the same as that of Frankenstein's creation: manipulation by the powerful hands of others. In modern terms, this is loss of autonomy, one of the worst predicaments possible. Fear is proportional to the power and unfamiliarity of the scientific creations. Nuclear energy, genetic modifications, and human subjects research, for example, generate enormous fear. Clinical trials represent a compromise of autonomy for both the research participant and the physician scientist. For the participant, ethics issues related to autonomy are discussed in Chapter 3. To a lesser degree, the investigator also surrenders some autonomy to the research protocol and to peer review. The compromise in autonomy required by a clinical trial is not forced, but is a carefully considered agreement among all parties. The trite modern spin is more about equating the investigator with Victor Frankenstein, and less about a shared perspective between the creature and the contemporary research subject. However, the superficial and stereotypical predicament is not the most significant one.

The risk–benefit calculus for a clinical trial is done multiple times—by the sponsor, investigators, peer reviewers, regulators, IRBs, treating physicians, and ultimately by
the study participant. Lack of alignment in these assessments of risk–benefit is a real problem for both science and the public. How difficult it should be to perform clinical trials based on regulation, economic resources, scientific priorities, and societal attitudes is our collective predicament. The twentieth century Frankenstein metaphor gives voice to some fears but does not help with key issues.

2.3.6 Cavia porcellus
If Victor Frankenstein is not enough, it is worth making a few comments about the term guinea pig, a widespread pejorative layman's expression that embodies public anxiety about medical research. This term has absolutely no place among professionals in reference to human experimentation. It is always emotional and derogatory no matter the prior tone of the conversation. This is why the newspaper article quoted above is only negative. The history of the guinea pig (Cavia porcellus) and its involvement in medical research is fascinating and has been sketched by Endersby [424]. Cavia is derived from the Inca (Quechua) name for the animals, cui or cuy, which itself might reflect the sounds they make. The cuy was domesticated as livestock perhaps as early as 7000 BCE and today is still kept in homes, fed scraps, and cooked. They also had, and continue to have, ceremonial uses. When the Spanish arrived in South America, the cuy was a domestic animal whose descendants would, via selective breeding, become the new species Cavia porcellus. By the seventeenth century the animal was known in Europe, formally classified by naturalists, and was beginning to be called a "guinea pig" for historical reasons that are not clear. The animal's popularity in Europe grew through the eighteenth and nineteenth centuries in both homes and laboratories. Cavia porcellus made good laboratory subjects, and were used by Lavoisier, Koch, and Brown-Séquard, the American physiologist. Guinea pigs were central to many debates on vivisection and their name came to be identified with the idea of experimental subject about two centuries ago. They can get scurvy, contributed heavily to the discovery of vitamin C, and were instrumental in genetic studies in the work of Sewall Wright, JBS Haldane, and RA Fisher. The gentleness and domestication of Cavia combined with their connection to laboratory research is perhaps a natural metaphor for the anxiety that human research subjects might feel. Although much more widely used than Cavia today, rats and mice do not seem to evoke the same sympathy or metaphor. Any reference to guinea pigs completely destroys constructive dialog regarding clinical trials.

2.3.7 Clinical Trials as Science
The use of clinical trials is consistent with, if not at the heart of, the character of the scientific process in medicine. Clinical trials usually do not require that science be defined more rigorously than a statement due to Thomas Henry Huxley, who said in 1880: science is simply common sense at its best; that is, rigidly accurate in observation and merciless to fallacy in logic [756].
Accordingly we could suppose that medicine is not science because of its historical origins, occasional lack of sense, inaccurate observation, illogic, and considerable other nonscience content and activity. But there is at least a strong consilience, to use Wilson's term [1564], between the methods of medicine and those of science, much as there is between statistics and mathematics. Differentiating between science and nonscience in a formal way turns out to be a difficult philosophical problem, often approached by debates over "demarcation" criteria [322]. One influential demarcation criterion put forward as a hallmark of the scientific method is falsification of theory [1227]. Used alone, it is probably inadequate broadly as well as for clinical trials. I cannot review this complex issue here, but it is safe to say that clinical trials incorporate definitive characteristics of the scientific method as outlined below. Science can be differentiated readily from other fields (for example, see reference [1003] for one perspective) partly on the basis of experimentation. Scientific medicine advances through its experimental construct, the clinical trial. In 1963, Sir Austin Bradford Hill stated the case quite well [717]:

In the assessment of a treatment, medicine always has proceeded, and always must proceed, by way of experiment. The experiment may merely consist in giving the treatment to a particular patient or series of patients, and of observing and recording what follows—with all the difficulty of interpretation, of distinguishing the propter hoc from the post hoc. Nevertheless, even in these circumstances and in face of the unknown a question has been asked of Nature, and it has been asked by means of trial in the human being. There can be no possible escape from that. This is human experimentation—of one kind at least. Somebody must be the first to exhibit a new treatment in man. Some patient, whether for good or ill, must be the first to be exposed to it.
Clinical trials have become so well recognized as necessary to the method of scientific medicine that some researchers ignore, discount, or fail to recognize the essential role of biological theory. Scientific inference is not a structure, as represented by an experiment, but a dialectic or process of reconciliation between experiments and theory. This point is vital to understanding why a clinical trial alone does not represent a scientific test of a therapy in the absence of a plausible mechanism of action for that therapy. Experiments or trials cause us to modify theory—more specifically to replace one theory with another. Science requires both the theory and the experiment, but never intends for us to replace the theory with the experiment. Furthermore in a mature science the right experiment is determined by questions that arise from prevailing theory [544, 545]. This issue will surface again in the discussion of trials for complementary and alternative medicine (Section 4.5). In addition to being experiments, clinical trials embody important general characteristics of the scientific method. These include instrumentalizing perception or measurement, which enhances repeatability and quantification; externalizing plans and memory in a written record, to facilitate reference and defense; control of extraneous factors as part of the study design (e.g., using internal controls and methods to control bias); and submitting completed work to external recognition, verification, or disproof. All these fundamental and general characteristics of scientific inquiry are integrated in the modern practice of clinical trials. Sometimes investigations that superficially appear to be clinical trials are not. Examples are so-called seeding trials, occasionally conducted by pharmaceutical companies as
marketing tools because they encourage physicians to prescribe a new drug [840]. The distinction between such efforts and true clinical trials can be made by examining the purposes and design of the study. Characteristics of seeding trials include (1) a design that cannot support the research goals, (2) investigators recruited because of their prescribing habits rather than scientific expertise, (3) a sponsor that provides unrealistically high reimbursements for participation, (4) minimal data collected or those of little scientific interest, (5) the study conducted through the sponsor’s marketing program rather than a research division, and (6) the agent tested being similar to numerous therapeutic alternatives.
2.3.8 Trials and Statistical Methods Fit Within a Spectrum of Clinical Research
Clinical experiments must be taken in context, both for the evidence they yield and for the methodology they employ. Trial results coexist with clinical and preclinical research of all types, some of which is supportive of their findings and some of which is not. The following categorization of clinical research, adapted from Ref. [16], shows how designed experiments fit within the spectrum of clinical research methods. Based on scientific objectives, sponsor funding, and technical skills and training of investigators, clinical research can be divided into seven areas:
1. Studies of disease mechanisms: The studies may be either descriptive or analytic but require laboratory methods and clinical observations under controlled conditions. Examples are metabolic and pharmacological studies.

2. Studies of disease management: These studies involve evaluations of developing treatments, modalities, or preventive measures and have outcomes with direct clinical relevance. Often internal controls and true experiment designs will be used. Most clinical trials fit within this category.

3. In vitro studies on materials of human origin: These are studies using observational designs and blood, tissue, or other samples that attempt to show associations thought to be clinically important. Series of surgical or pathologic studies are often this type.

4. Models of human health and disease processes: These studies include animal and theoretical (e.g., mathematical) models of disease, often used to guide laboratory and clinical investigations.

5. Field surveys: These are descriptive or analytic studies of risk factors or disease correlates in populations. Examples include epidemiologic and genetic studies.

6. Technology development: This type of research includes development and mechanistic testing of new diagnostic and therapeutic methods. Examples of technology developed in these types of studies include imaging devices and methods, diagnostics, vaccines, and applied analytic methods such as biostatistical methods.

7. Health care delivery: These studies focus on economic and social effects of health care practice and delivery. Examples include studies of education and training, cost effectiveness, health care access, and financing.
Any one or several of these types of clinical research studies can relate directly to clinical trials. For example, they may be precursor, follow-up, or parallel studies essential to the design and interpretation of a clinical trial. In any case, experiment designs have a central role in the spectrum of clinical research studies.
2.4 PRACTICALITIES OF USAGE

2.4.1 Predicates for a Trial
A clinical trial cannot take place unless key circumstances support it. The list of requirements includes (1) having a good question, (2) uncertainty/equipoise in the scientific community, (3) an appropriate level of risk–benefit for the interventions, (4) receptivity in the context for the trial, (5) a design appropriate to the scientific question, and (6) the necessary resources and technology to perform the study. A deficiency in any of these components will stop the trial from either getting underway or being successful. Aside from these factors, it might be useful to review investigators' responsibilities as outlined in Section 3.6.3. The extent to which these prerequisites are met is a matter of judgment at the outset but will be obvious by the time a study is well underway. If we view trials through these principles, we can get a good perspective on their likely success, quality, and perhaps even impact. Furthermore, these underpinnings provide a way to see the considerable commonality of methodology across the wide spectrum to which trials are applied.

2.4.2 Trials Can Provide Confirmatory Evidence
Aside from testing novel ideas, clinical trials are frequently used to test or reinforce weak evidence from earlier studies. There are a variety of reasons why more than one trial might be performed for nearly the same research question (usually by different sets of investigators). In some circumstances, this may happen inadvertently. For example, in pharmaceutical studies the onset or planning of a trial may be confidential and could be undertaken simultaneously by different companies using slightly different treatments. In retrospect these trials may be seen as confirmatory. The same situation may arise in different countries for scientific or political reasons. Confirmatory trials may also be planned. When the results of a new trial seem to contradict prevailing biological theory, many researchers or practitioners may be unconvinced by the findings, particularly if there are methodological problems with the study design, as there frequently are. This could be true, for example, when there is relatively little supporting evidence from preclinical experiments or epidemiologic studies. Confirmatory trials may be needed to establish the new findings as being correct and convince researchers to modify their beliefs. When the magnitude of estimated treatment effects is disproportionate to what one might expect biologically, a similar need for confirmatory studies may arise. This use of trials is difficult, and can be controversial, but is approachable by structured reasoning [1182]. There are numerous types of design flaws, methodologic errors, problems with study conduct, or analysis and reporting mistakes that can render clinical trials less than convincing. We would expect the analysis and reporting errors to be correctable if the design
of the study is good, but they are frequently accompanied by more serious shortcomings. In any case, trials with obvious flaws are open to criticism, which may limit their widespread acceptance. Such skepticism may motivate a confirmatory study. Finally, there is a tendency for studies with positive results to find their way preferentially into the literature (publication bias), which can distort the impression that practitioners get (Chapter 25). This implies that even a published positive result may leave some doubt as to true efficacy. Some positive studies are the result of chance, data-driven hypotheses, or subgroup analyses that are more error prone than a nominal p-value would suggest. Readers of clinical trials tend to protect themselves by reserving final judgment until findings have been verified independently or assimilated with other knowledge. The way in which practitioners use the information from clinical trials seems to be very complex. An interesting perspective on this is given by Henderson [700] in the context of breast cancer clinical trials. Medicine is a conservative science, and behavior usually does not change on the basis of one study. Thus, confirmatory trials of some type are often necessary to provide a firm basis for changing clinical practice.

2.4.3 Clinical Trials Are Reliable Albeit Unwieldy and Messy
Clinical trials are constrained by fiscal and human resources, ethical concern for the participants, and the current scientific and political milieu. Trials are unwieldy and expensive to conduct, requiring collaborations among patients, physicians, nurses, data managers, and methodologists. They are subject to extensive review and oversight at the institutional level, by funding agencies, and by regulators. Trials consume significant time and resources and, like all experiments, can yield errors. Multiinstitutional trials may cost from millions up to hundreds of millions of dollars and often take many years to complete. Clinical trials studying treatments for prevention of disease are among the most important but cumbersome, expensive, and lengthy to conduct. Certain studies may be feasible at one time, given all these constraints, but infeasible a few months or years later (or earlier). In other words, many studies have a window of opportunity during which they can be accomplished and during which they are more likely to have an impact on clinical practice. For comparative trials, this window often exists relatively early after the process of development of a therapy. Later clinicians’ opinions about the treatment solidify, even without good data, and economic incentives will dominate the decision to apply or withhold a new treatment. Many practitioners may be reluctant to undertake or participate in a clinical trial for economic and infrastructural reasons. The issues surrounding a trial of extracorporeal membrane oxygenation (ECMO) in infants with respiratory failure illustrate, in part, the window of opportunity. This trial is discussed in Sections 3.5.3, 17.4.3, and 17.7. Clinical trials are frequently messier than we would like. A highly experienced and respected basic science researcher tells the story of an experiment he instructed a lab technician to perform by growing tumors in laboratory rats: Once we decided to set up an immunologic experiment to study host defenses. I was out of town, and I had a new technician on board, and I asked him to order 40 rats. When I returned, I discovered we had a mixture of Sprague-Dawley, Wistar, and Copenhagen rats. Furthermore, their ages ranged from three months to two years. I had asked him to inject
some tumor cells and was horrified to find out that he had used three tumor lines of prostate cancer cells and administered 10³–10⁸ cells in various parts of the animals and on different days during a period of 2 weeks. Unfortunately, 20% of the animals escaped during the experiment so we could present data only on age-matched controls [277].
At this point we might begin to lose confidence in the findings of this experiment as a controlled laboratory trial. However, these heterogeneous conditions might well describe a clinical trial under normal circumstances, and it could be a useful one because of the amount of structure used successfully. These types of complications are common in clinical trials, not because of the investigator's laxity but because of the nature of working with sick humans. Alternatives to clinical trials are usually messier than trials themselves. Trials are more common now than ever before because they are the most reliable means of making correct inferences about treatments in many commonly encountered clinical circumstances. When treatment differences are about the same size as patient-to-patient variability and/or the bias from flawed research designs, a rigorous clinical trial (especially a randomized design) is the only reliable way to separate the treatment effect from the noise. Physicians should be genuinely interested in treatments that improve standard therapy only moderately. We can be sure that patients are always interested in even small to moderate treatment benefits. The more common or widespread a disease is, the more important even modest improvements in outcome will be. The bias and variability that obscure small but important treatment effects have larger consequences if we expect benefits to accrue mostly in the future. This is the case for prevention and treatment of many chronic diseases. Without rigorous design and disciplined outcome ascertainment, significant advances in treatment aggregated from several smaller improvements at earlier times would be missed. This lesson is quite evident from the treatment of breast cancer since about 1985, for example. A paradoxical consequence of the reliability of rigorous trials to demonstrate modest-sized treatment effects, and/or those evident only after years of observation, is that the methodology might sometimes be viewed as not contributing to substantial advancement. It is easy to undervalue trials when they each produce only a modest effect. Hence it is important to ask the right research questions and identify the optimal research designs for them. Well-performed clinical trials offer other advantages over most uncontrolled studies. These include a complete and accurate specification of the study population at baseline, rigorous definitions of treatment, bias control, and active ascertainment of endpoints. These features will be discussed later.
2.4.4 Trials Are Difficult to Apply in Some Circumstances
Not all therapeutic areas can tolerate an infusion of rigor, however needed it may be. Comparative trials may be difficult to apply in some settings because of human factors, such as ethical considerations, or strongly held beliefs of practitioners or participants. Logistical difficulties prevent many trials that otherwise could be appropriate. Early on, the AIDS epidemic motivated trialists to examine many basic assumptions about how studies were designed and conducted and relax some of the developmental rigor [218].
In oncology there are many conditions requiring multiple therapies or modalities for treatment. Although numerous factors are theoretically manageable by study design, practical limitations can thwart systematic evaluation of treatment combinations and interactions. Similarly, when minor changes in a complex therapy are needed, a trial may be impractical or inappropriate. In very early developmental studies when the technique or treatment is changing significantly or rapidly, formal trials may be unnecessary and could slow the developmental process. Trials are probably more difficult to apply when studying treatments that depend on proficiency than when studying drugs. Proficiency may require specific ancillary treatments or complex diagnostic or therapeutic procedures such as surgery. Well-designed trials can in principle isolate the therapeutic component of interest, but this may be a challenge. When treatments are difficult to study because they depend strongly on proficiency, they may also be less useful because they may be difficult to apply broadly. Time and money may prevent the implementation of some trials. Trials are also not the right tool for studying rare outcomes in large cohorts or some diagnostic modalities. However, in all these cases the methods used to gather data or otherwise to design methods to answer the biological question can benefit from quantification, structure, and statistical modes of thought. Bailar et al. suggest a framework for evaluating studies without internal controls [100]. This provides additional support for the value of studying trial methodology. Biological knowledge may suggest that trials would be futile unless carefully targeted. For example, suppose a disease results from a defect in any one of several metabolic or genetic pathways, and that the population is heterogeneous with respect to such factors. Treatments that target only one of the pathways are not likely to be effective in everyone. Therefore clinical trials would be informative and feasible only if done in a restricted population.
2.4.5 Randomized Studies Can Be Initiated Early
While comparative clinical trials may not be the method of choice for making treatment inferences in some circumstances, there are many situations where it is advantageous to implement trials early in the process of developing a new therapy. Some clinical trials methodologists have called for "randomization from the first patient" to reflect these circumstances [245, 246, 1427]. The window of opportunity for performing a randomized trial may only exist early and can close as investigators gather more knowledge of the treatment. Many disincentives to experimentation can arise later, such as practitioner bias, economic factors, and ethical constraints. Reasons why very early initiation of randomized trials is desirable include the following: (1) the ethics climate is conducive to, or even requires, it, (2) high-quality scientific evidence may be most useful early, (3) early trials might delay or prevent widespread adoption of ineffective therapies, and (4) nonrandomized designs used after a treatment is widespread may not yield useful evidence. These points may be particularly important when no or few treatments are available for a serious disease (consider the history of AIDS) or when the new therapy is known to be safe as in some disease prevention trials. Many times randomized trials yield important information about ancillary or secondary endpoints that would be difficult to obtain in other ways.
There are also reasons why randomization from the first patient may be difficult or impossible to apply: (1) the best study design and protocol may depend on information that is not available early (e.g., the correct dose of a drug), (2) adequate resources may not be available, (3) investigators may be unduly influenced by a few selected case reports, (4) patients, physicians, and sponsors may be biased, (5) the trial may not accrue well enough to be feasible, and (6) sample size and other design parameters may be impossible to judge without preliminary trials. When safe and effective treatments are already available, there may not be much incentive to conduct randomized trials early. For examples of randomized trials initiated early, see Refs [246], [564], or [1266].
2.4.6 What Can I Learn from n = 20?
It is common for clinical investigators to have resource limitations that will permit only a relatively small number of subjects for their clinical trial. In most cases they proceed with formal study and protocol development until encountering a statistician, at which point the difficulties in achieving good precision for clinical outcomes become clear. Then it may become more attractive to turn the question around backward: "what can I achieve with a specified small sample size?", rather than "how many subjects are required to meet my inferential goals?". Investigators tend not to have a good sense of what is reasonable and what is unrealistic from a small sample. Yet these small cohorts are frequently employed, or added on to dose-finding studies as expansion cohorts, for example. For this discussion, I assume that 20 subjects will be accrued on a protocol-driven trial (not a case series) and that it will yield some person-months of follow-up observation. What can be learned from such an experience?

Variability
Twenty subjects will yield good estimates of person-to-person variability if the observations come from a single common distribution that is well behaved, such as the normal. This may not be the case if the cohort is heterogeneous or a mixture of sub-populations—overall variability will exceed that from any homogeneous subgroup. Estimating variability would be essential to the planning of additional studies.

Mean
Means of a measured value are relatively efficient and can be determined with reasonable accuracy from 20 observations. If we can estimate variability well, then we can characterize the mean within that variability well. If the cohort is a mixture, the overall mean may be uninterpretable and carry little clinical significance.

Median
The estimate of the median is the middle observation when all are ranked. It is less well characterized than the mean for a given sample size. Our estimate of the median will have 10 observations above and 10 below (to simplify, I have added a single observation and assumed no ties). The next observation will necessarily cause the estimated median to jump to the next higher or lower value in the data. This discreteness illustrates how the median is estimated less precisely than the mean. The same caveats regarding heterogeneity apply. Any quantile in the tail of a distribution is less well estimated than the median. The same can be said of the range and interquartile range.
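To make the relative precision of these simple summaries concrete, here is a minimal Python simulation sketch (added for illustration and not part of the original text; it assumes observations from a standard normal distribution and uses only numpy) comparing the sampling variability of the mean and the median in samples of 20 observations.

    # Sketch: sampling variability of the mean versus the median when n = 20,
    # assuming a single well-behaved (standard normal) distribution.
    import numpy as np

    rng = np.random.default_rng(1)
    n, reps = 20, 100_000
    samples = rng.normal(size=(reps, n))

    se_mean = samples.mean(axis=1).std()
    se_median = np.median(samples, axis=1).std()

    # The mean's standard error is about 1/sqrt(20) = 0.22; the median's is
    # noticeably larger (asymptotically by a factor of about sqrt(pi/2)).
    print(f"empirical SE of mean:   {se_mean:.3f}")
    print(f"empirical SE of median: {se_median:.3f}")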
Correlations
Like means, correlations are relatively efficient statistics assuming each individual will have a pair of measurements to contribute. The correlation coefficient, r, can be transformed into an approximately normal random variate, z, by

z = (1/2) log[(1 + r)/(1 − r)],

the so-called z-transformation. The variance is Var{z} ≈ 1/(n − 3), which is independent of r. On the z scale, an approximate 95% confidence interval based on 20 pairs will be ±0.475. Unwinding the transformation yields the confidence bounds for the correlation shown in Table 2.1. Only strong correlations will have reasonable precision.

TABLE 2.1 Precision in Correlation Coefficients Based on 20 Subjects

    r̂       95% Lower Bound    95% Upper Bound
    0.0         −0.443              0.443
    0.4         −0.052              0.716
    0.5          0.074              0.772
    0.6          0.214              0.824
    0.7          0.373              0.872
    0.8          0.553              0.918
    0.9          0.760              0.960

Extremes
If we are counting or measuring one event per subject, the best (or worst) case is when k = 0 (or k = 20) where k is the count. Using the result of Section 16.4.3, the upper one-sided 95% exact binomial confidence limit will be 3/20 = 0.15. If k = 20, the lower confidence bound will be 0.85. As thresholds, these may not be so good, depending on the clinical implications of the events. This reasoning is directly applicable to claims of safety, often made on the basis of even smaller experiences. Knowing that serious adverse events are unlikely to be occurring with a frequency higher than 15% is not an adequate degree of confidence for many clinical circumstances.

Proportions and Their Variability
For proportions, the estimate of the success rate is p̂ = k/n. The estimate of the variance is Var{p̂} = p̂(1 − p̂)/n, which is maximal when p̂ = 1/2. For a denominator of n = 20 and k = 10, a two-sided 95% confidence interval is approximately 0.28–0.72. These bounds are conservative with respect to p̂ or k, and anticonservative with respect to n.
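The figures quoted in the last three subsections can be checked directly. The following Python sketch (an added illustration, not part of the original text; it assumes a two-sided alpha of 0.05 and uses numpy and scipy) reproduces the bounds in Table 2.1, the exact upper limit for 0 events in 20 subjects, and the approximate interval for a proportion of 10/20.

    # Sketch: reproducing the n = 20 precision calculations described above.
    import numpy as np
    from scipy.stats import beta, norm

    n = 20
    z = norm.ppf(0.975)                      # two-sided 95%, about 1.96

    # Correlations: Fisher z-transformation with Var(z) ~ 1/(n - 3).
    half_width = z / np.sqrt(n - 3)          # about 0.475 on the z scale
    for r in (0.0, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
        lo = np.tanh(np.arctanh(r) - half_width)
        hi = np.tanh(np.arctanh(r) + half_width)
        print(f"r = {r:.1f}: {lo:+.3f} to {hi:+.3f}")   # matches Table 2.1

    # Extremes: exact (Clopper-Pearson) one-sided 95% upper bound for 0 events in 20.
    # The 3/n "rule of three" quoted in the text gives 3/20 = 0.15.
    upper_0_of_20 = beta.ppf(0.95, 0 + 1, n - 0)        # about 0.139
    print(f"0/20 upper bound: {upper_0_of_20:.3f}")

    # Proportions: normal-approximation 95% interval for 10 successes in 20.
    p_hat = 10 / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    print(f"10/20: {p_hat - z*se:.2f} to {p_hat + z*se:.2f}")   # about 0.28 to 0.72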
Prior Information
It seems reasonable to assume that any prior information we bring to our small study is also based on about 20 observations or fewer of actual data. If more information were already available, our new trial would probably not add much. If this is a reasonable assumption, n = 20 would then be optimistically doubling the available information. The consequences will be seen in precision or confidence intervals. Again using proportions as an example, the denominator of the standard error of p̂ would go from √n to √(2n), which is a factor of √2. The error then decreases to 71% of its former value, which seems to represent a worthwhile improvement. This rule of thumb is independent of scale, so doubling sample size will generally increase precision by about 30%, which may or may not be worth the cost depending on the circumstance.

Event Rates
Event rates are estimated by the number of events, k, divided by T, the total follow-up time in the cohort, or λ̂ = k/T. The denominator is the sum of censored plus uncensored event times. Confidence intervals for λ̂ depend on k. Using the normal approximation on a log scale, the interval on the natural scale is λ̂ × exp(±Zα/√k), where Zα is the standard normal quantile. So for a given number of events we know the bounds relative to λ̂. A few such multipliers are shown in Table 2.2. Additional details can be found in Section 16.5.1 and Table 16.9. For 20 events, the approximate 95% lower confidence bound is 65% of the point estimate and the upper bound is 155% of the point estimate, this being almost a threefold range. The breadth of such an interval suggests that we have little reliable information about the true event rate when k ≤ 20. Events may be quite slow to accumulate depending on the absolute risk.

TABLE 2.2 Relative Precision in Event Rates as a Fraction of the Point Estimate

    Events    95% Lower Bound    95% Upper Bound
      20           0.65               1.55
      15           0.60               1.66
      10           0.54               1.86

In summary, the amount to be learned from a study with n = 20 is minimal. The precision attainable for inefficient outcomes such as censored event rates and categorical outcomes is low. Estimates of variability, means, and correlations in homogeneous cohorts will likely be precise enough for planning additional studies. Investigators who find themselves resource limited to such small studies must be realistic about what can be achieved.
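As a quick numerical check on the multipliers in Table 2.2 and the sample-size rule of thumb above, here is a minimal Python sketch (added for illustration, not part of the original text) that applies the log-scale normal approximation for the event rate.

    # Sketch: 95% bounds for an event rate relative to the point estimate,
    # using lambda-hat * exp(+/- Z / sqrt(k)) where k is the number of events.
    import numpy as np
    from scipy.stats import norm

    z = norm.ppf(0.975)
    for k in (20, 15, 10):
        lower, upper = np.exp(-z / np.sqrt(k)), np.exp(z / np.sqrt(k))
        print(f"{k} events: {lower:.2f} to {upper:.2f} times the point estimate")

    # Doubling the sample size shrinks a standard error by 1/sqrt(2), i.e., to
    # about 71% of its former value -- roughly a 30% gain in precision.
    print(f"1/sqrt(2) = {1 / np.sqrt(2):.2f}")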
2.5 NONEXPERIMENTAL DESIGNS

2.5.1 Other Methods Are Valid for Making Some Clinical Inferences
Nonexperimental comparisons are common in disciplines such as anthropology, history, archeology, sociology, and political science [369]. Many medical advances have been made without the formal methods of comparison found in controlled clinical trials—in particular without randomization, formal control of bias or random error, or statistical analysis. For example, vitamins, insulin, many drugs and antimicrobials, vaccines, and
tobacco smoke have had their effects convincingly demonstrated without using controlled clinical trials. Large treatment effects of nearly any kind are candidates for this type of discovery. Studies that demonstrated these effects were epidemiological, historically controlled, and, in some cases, uncontrolled. True experiment designs are not the only path to advancing medical knowledge. A humorous angle on this point was provided by Smith and Pell [1407] who examined parachute use to prevent death or trauma, and poked fun at overzealous evidence-based requirements that treatment efficacy be established only by systematic reviews of randomized trials. Some of my clinical colleagues are quite sympathetic to the implications of this humor, emphasizing that an appropriate balance is needed when assessing the types of studies that support a proposed treatment and the magnitude of the effect. Although Smith and Pell appeal to “common sense,” there is ample evidence in medical history that such a basis alone is inadequate. We could probably define common sense as a set of context-friendly biases to emphasize its lack of an objective basis. The five requirements discussed below can be used to supplement common sense. Many provocative associations between human disease and environmental exposures or other factors are not amenable to study by clinical trials because the exposures or factors are not under the experimenter’s control. A good example of this was the question of right- versus left-handedness and its relationship to longevity [671]. Ethics and other constraints will always keep some biologically interesting questions out of reach of an experiment, leaving a need to learn from naturally occurring circumstances using the best study designs possible. Similarly, comparisons internal to the trial are not always possible, which reinforces the need for single cohort studies. There is much to learn from these experiences provided the analysis and interpretation is temperate. One example where this did not happen is discussed by Mantel [984]. Presently there are political and economic forces at work to encourage the use of nonexperimental study designs to support therapeutic inferences and related questions of medical practice. There are many demands for reliable answers to such questions and we cannot expect clinical trials with their expense and regulatory overlay to address all of them. While those are the economic realities, there are occasional valid scientific reasons to prefer convenient nonexperimental designs for properly selected questions. For example, suppose we are interested in improving an algorithm for clinical decisionmaking or care in an intensive setting such as an emergency room or intensive care unit. Randomizing individuals would be a valid study design for such a question, but it might be too costly or logistically impractical. Having nurses or physicians with patients under different care models in the same setting would be impractical and dangerous. So we might start down a path of relaxed experiment features to get a useful but imperfect answer to an important question. Many possible designs might be considered, but any or all of them would be classified as nonexperimental as soon as the investigator loses control over the assignment of subjects to treatment groups. Questions regarding practice using nonexperimental designs are often labeled comparative effectiveness research, health services research, or patient-centered outcomes research. 
The term outcomes research is also used, but often refers to analyses of data that exist for reasons unrelated to the therapeutic question. None of these terms have precise definitions. For years most of us doing clinical trials thought we were doing comparative effectiveness research, but this label was taken in a slightly different direction in 2010. When provisions were planned for this type of research in the Affordable Care Act
[294], some commercial entities were uncomfortable with the type of product-to-product comparisons that it might imply. The name was altered to patient-centered outcomes research.
Examples
Ganciclovir was approved by the U.S. Food and Drug Administration for the treatment of retinitis caused by cytomegalovirus in patients with human immunodeficiency virus (HIV) infection. The evidence in favor of ganciclovir efficacy consisted of a relatively small but well-performed study using historical controls [782]. The treatment benefit was large, supported by other data, and the toxicity profile of the treatment was acceptable. (However, it took a clinical trial to establish that another drug, foscarnet, was associated with longer survival and similar visual outcome as ganciclovir.) In contrast, consider the difficulty in inferring the benefit of zidovudine (AZT) treatment in prolonging the survival of HIV positive patients from retrospective studies [1060]. Although AZT appeared to prolong survival in this study, the control group had a shorter survival than untreated patients earlier in the epidemic. Sicker patients did not live long enough to receive AZT, making the treatment appear beneficial. This illustrates the difficulty in defining treatment in retrospect. Some issues related to using observational data for assessing AIDS treatments are discussed by Gail [557].
Five Requirements
Using nonexperimental designs to make reliable treatment inferences requires five things. First, there must be some physicians using the treatment in the way that we intend to study it. Because there is no clinical trial framework to specify treatment parameters such as dose and schedule, we have to find the treatment being used somewhere in the relevant way. In an epidemiological setting the corresponding concept is exposure to the agent of interest. Second, the study subjects have to provide valid observations for the biological question. We must be able to ascertain the relevant outcomes in all the subjects, for example. Missing outcomes will create a bias. Depending on the design of the study, outcomes may be ascertained prospectively or retrospectively. Third, the natural history of the disease with standard therapy, or in the absence of the intervention, must be known. Alternatively we require a control cohort. Controls might come from the same or similar data repository, or be constructed by appropriate matching. Fourth, the effect of the treatment or intervention must be large relative to random error and bias. For example, a single institution may have a large database of patients treated for the same condition by several different methods. Selection or indication bias can explain treatment choice, and such biases can be large enough to appear as treatment differences. Finally, evidence of efficacy must be consistent with other biological knowledge. We cannot reasonably expect that incompletely controlled data will outweigh what seemed to be established fact. These five criteria are difficult to satisfy convincingly because they highlight the absence of investigator control over some key aspects of a study. Results may also be influenced by hidden factors excluded from available data. A potentially helpful method for some alternative therapies is to evaluate "best case series" as a way to assess any therapeutic promise [1079].
2.5.2 Some Specific Nonexperimental Designs
Many nonexperimental designs have been put forward for answering therapeutic questions. The motivations for employing nonexperimental designs are convenience, cost, or feasibility, but never reliability. Some basic concepts for these research designs will be discussed here, especially to contrast them with clinical trials. For more detail on some nonexperimental designs, see Refs. [747, 854]. There are at least three types of nonexperimental designs that might provide evidence of treatment differences: epidemiological designs, historically controlled cohorts, and databases. Epidemiological designs are an efficient way to study rare events, and can control random error and some bias. Studies with historical controls can also control some types of bias, but are incapable of eliminating others. Database analyses can often be done for a small fraction of the cost of other studies, but offer the least opportunity to optimize design.

Historical Controls
Historical controls violate the temporal ordering of an experiment, and may render the investigator completely passive with regard to that half of the cohort. There may be no control over selection or treatment administration in the control group. There cannot be any control over time trends or factors correlated with them in such designs. Although improvements in supportive care, temporal trends, or different methods of evaluation for the same disease can make these studies uninterpretable, there are circumstances in which they might be valid and convincing. Consider a study of promising new treatments in a disease with a uniformly poor prognosis like advanced cancer. Large beneficial treatment effects could show themselves by prolonging survival far beyond that which is known to be possible (historically) using conventional therapy. However, consider how unreliable treatment inferences based on distant controls would be in a disease such as AIDS, where supportive care and treatment of complications have improved so much in the last 10–15 years. Silverman [1394] gives some striking examples of time trends in infectious disease mortality. Stroke mortality has also shown clinically significant secular trends [169, 195]. Similar problems can be found with historical controls in many diseases [396]. The perils of historical controls are so well known that further discussion here is unnecessary.

Epidemiological Designs
The cohort in a clinical trial embodies a strict temporal ordering of eligibility, treatment assignment, intervention, follow-up, and outcome ascertainment. Aside from the temporal ordering, the investigator designs and actively participates in implementing each component. In a very real sense there is a single cohort in a randomized experiment. The two halves of the cohort differ only in regard to the intervention of interest, strongly supporting causality for consequent effects. In nonexperimental studies, some part of this temporal ordering is perturbed, and the investigator may play a passive role with regard to other key elements. Nonexperimental designs cannot support causality as strongly as clinical trials, if at all. Three basic epidemiological designs are the single cohort study, the case-control study, and the cross-sectional study. Because none of these designs permit the investigator to control exposure, the characteristics of study subjects who have the relevant exposure or treatment are outcomes of the study rather than design parameters.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
NONEXPERIMENTAL DESIGNS
39
TABLE 2.3 Margin Totals of a 2 × 2 Table Indicating Control Points for Basic Epidemiological Designs

                Exposed    Nonexposed      Total
    Case           A            B           A+B
    Noncase        C            D           C+D
    Total         A+C          B+D        A+B+C+D

How
the exposure group compares to the nonexposure group with regard to those characteristics may determine the reliability of the results. The investigator can control various aspects of each study, as illustrated in Table 2.3, with different implications for the results. A cross-sectional design controls only the total sample size. Then the numbers of exposed and unexposed subjects, as well as the numbers of cases and noncases, are outcomes of the study. A convenient or relevant cohort is not guaranteed to yield sufficient outcomes, or even be appropriate in other ways. A cross-sectional study classically uses random sampling of a dynamic population to ascertain disease status and related variables at a point in time. No follow-up is obtained, so the design permits measuring associations between variables but cannot resolve temporal ordering. In an epidemiological context, it is useful for assessing disease prevalence but not incidence. Such a design would not be useful for uncommon conditions or those of short duration. There are further limitations to this design if the sampling is not representative. A cross-sectional study might reveal relationships between a widely used treatment and associated risk factors, prognostic factors, safety signals, or outcomes. Lack of follow-up and inability to validate random sampling make this design generally unsuitable for reliable nuanced therapeutic inferences. In a cohort design, the investigator can at least partially control the total numbers of exposed and unexposed subjects (A+C and B+D in Table 2.3) using the cohort definition. Then the numbers of cases and noncases is an outcome. Such a design is not useful if the absolute risk in the cohort is too low to yield sufficient cases. A clinical trial, being a special type of cohort study, is also subject to this problem. If the absolute risk in the cohort is too low, the trial will not yield enough events to inform the difference between the treatments (exposures). If the exposure of interest is to a therapeutic intervention, cohort studies cannot control for factors contributing to the choice of treatments in the first place (confounding by indication). In a case-control design, the investigator controls the total number of cases and noncases or controls (A+B and C+D in Table 2.3), causing the exposure variable to be an outcome of the study. The case-control design can allow a relatively efficient study of rare conditions, whereas a cohort study may not. In a case-control study, the cohort may be constructed retrospectively following case ascertainment; or the case status may be determined retrospectively following the cohort identification. The investigator is a passive observer with respect to exposure. For example, Jenks and Volkers [792] reported a number of factors possibly associated with increased cancer risk. Some were biologically plausible, while others seem to be based only on statistical associations. Apparent increases in cancer risks for factors such as electric razors and height could be due to bias (or chance) uncontrolled by the research design. The critical weaknesses of case-control designs for therapeutic inferences are confounding by indication in the case cohort, and systematic differences in the control cohort.
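As a small illustration of the control points summarized in Table 2.3, the sketch below (Python, with purely hypothetical counts) lays out the cell counts and margins; the comments note which totals each basic design fixes by construction, the remaining margins being outcomes of the study.

    # Hypothetical 2 x 2 counts laid out as in Table 2.3 (illustrative values only).
    A, B, C, D = 30, 15, 70, 85   # A: exposed case, B: nonexposed case, etc.

    cases, noncases = A + B, C + D         # row totals
    exposed, nonexposed = A + C, B + D     # column totals
    total = A + B + C + D

    # Cross-sectional design: only the grand total is fixed by the investigator.
    # Cohort design: the exposed/nonexposed totals (A+C, B+D) are fixed.
    # Case-control design: the case/noncase totals (A+B, C+D) are fixed.
    print(f"cases={cases}, noncases={noncases}, exposed={exposed}, "
          f"nonexposed={nonexposed}, total={total}")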
Databases

Database analyses are convenient and inexpensive to conduct, but surrender nearly all design controls. Patients treated at a single institution are now usually followed in data systems that facilitate both clinical management and some research questions. In these situations, treatment choice is not a pure predictor, but is also partly an outcome of how the patient is doing. Treatment comparisons will be suggestive but not definitive or reliable tests of efficacy. Computerization also makes it easier to combine data from different sources, yielding very large datasets for some analyses. Sample size controls random variation but cannot correct bias. Common definitions of variables across independent data sources remain an unsolved problem.

Two other contrasts between database analyses and experiment designs are noteworthy. One is that a database can account only for known confounders. Factors that are known to affect outcome can be recorded for later analysis, but it is very unlikely that investigators will accidentally archive confounders that have yet to be discovered. Hence the control over bias may be limited. This is unlike the role of randomization in controlling bias even from unknown confounders. A second limitation comes from the necessary use of statistical models to account for the effect of confounders. These models carry their own assumptions and may not yield accurate or unbiased estimates of the effects of interest when those requirements are violated. In randomized experiments, by contrast, such models are not required even though they are sometimes used—unadjusted comparisons will be valid. These and other limitations of database analyses have been well known for decades [213] but are often overlooked.

Database analyses often carry fancy names such as outcomes research and data mining. It is important to know why the database exists in the first place to determine if it can provide reliable evidence for any therapeutic question. Some databases exist as a byproduct of billing for medical services, which qualifies them only as happenstance data with respect to therapeutic questions. There is no control over observer bias, selection bias, or outcome ascertainment. An excellent example of the inadequacy of such studies is provided by Medicare data and the background to lung volume reduction surgery prior to the National Emphysema Treatment Trial (Section 4.6.6).
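The confounding-by-indication problem can be made concrete with a small simulation sketch. Everything in it is assumed for illustration: the variable names, the event rates, and the way treatment assignment depends on severity are invented, not drawn from any study discussed here. The treatment has no effect on the outcome, yet because sicker patients are more likely to receive it, the unadjusted database comparison makes the treatment look harmful; if the severity variable was never archived, no statistical model can recover the truth, whereas randomization would have broken the link by design.

```python
# Minimal simulation of confounding by indication in a hypothetical treatment database.
# Names and effect sizes are invented for illustration.
import random

random.seed(1)

treated = {"n": 0, "deaths": 0}
untreated = {"n": 0, "deaths": 0}

for _ in range(100_000):
    severity = random.random()                      # prognostic factor, perhaps never recorded
    gets_new_drug = random.random() < severity      # sicker patients more often receive the new drug
    death = random.random() < 0.05 + 0.25 * severity  # outcome depends on severity, not on treatment
    group = treated if gets_new_drug else untreated
    group["n"] += 1
    group["deaths"] += death

# The unadjusted comparison makes a truly inert treatment look harmful.
print("treated mortality:  ", treated["deaths"] / treated["n"])
print("untreated mortality:", untreated["deaths"] / untreated["n"])
```

Running this sketch shows roughly 22% mortality among the treated and 13% among the untreated, an apparent harm created entirely by who was selected for treatment.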
2.5.3 Causal Relationships
A central issue is whether or not a given study design can support a causal relationship between treatment (exposure) and outcome. Because the investigator controls the administration of treatment in experiments, they permit reliable inference about causality. Control over confounding factors, whether through randomization, eligibility, masking, or balancing, augments confidence in causality, but causality never seems to be in serious doubt in experiments. The design alone is sufficient to support causal inference about treatment effects. An appropriate experiment design not only supports causal inference; strong findings from a well-designed experiment can also lead to revisions in biological theory. In contrast, nonexperimental designs lack control over treatment administration, immediately weakening causality. Retrospective assembly of exposure history or cohort definition, as is typical in epidemiological studies, adds to the possibility of confounding and weakens causality. This is not to say that we can never be reasonably assured about treatment or exposure causing an observed effect in a nonexperimental design. It means that the design itself is insufficient to yield the certainty we seek. Evidence from outside the study design might be brought to bear to convince us.
The difficulty in detecting causality is illustrated by the ease with which strong correlations can be found between obviously unrelated quantities. While it is a cliché to say that correlation does not imply causality, seemingly strong associations can be found so frequently that rationalizations and supporting information can often be gathered by someone determined to support one. Proving the uselessness of a chance association may be more difficult than finding some small support for it.
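The point can be demonstrated with a few lines of simulation. This is only a sketch with arbitrary, assumed sizes (200 variables, 20 observations each), not an analysis from the text: generate mutually independent random series and search all pairs for the strongest correlation. Some pairs will correlate impressively by chance alone, and any one of them could be dressed up with a plausible story.

```python
# Sketch: among enough mutually unrelated variables, strong correlations arise by chance.
# The number of variables and observations are arbitrary choices for illustration.
import itertools
import random
import statistics

random.seed(2)
n_vars, n_obs = 200, 20
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

def corr(x, y):
    # Pearson correlation computed from first principles.
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

largest = max(abs(corr(data[i], data[j]))
              for i, j in itertools.combinations(range(n_vars), 2))
n_pairs = n_vars * (n_vars - 1) // 2
print(f"largest |r| among {n_pairs} pairs of unrelated series: {largest:.2f}")
```

With nearly 20,000 pairs examined, the largest absolute correlation will typically exceed 0.7 even though every series is pure noise.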
2.5.4 Will Genetic Determinism Replace Design?
Genomic medicine seems to be dividing diseases into smaller and smaller subgroups based on variants or abnormalities. New therapies target these characteristics, and we expect that the resulting treatment effects in selected subgroups will be larger than effects seen with nontargeted treatments, or when targeted agents are applied in a heterogeneous population. It is too soon to know the extent to which such expectations will be realized, or if we can contemplate using nonexperimental study designs for assessing the large treatment effects. What is clear presently in looking at genomic medicine is that we now seem to have many more diseases to cope with, each with fewer patients, and we are facing the need to test combinations of targeted therapies. It is a fair question to ask if rational drug design, targeted therapy, or personalized medicine will lessen the demands for direct observer control over the application of treatment to the appropriate cohort. Other than counting on cures or other large treatment effects that can’t possibly be due to bias, I see no reason to expect that smaller, more homogeneous cohorts will allow us to escape bias.
2.6 SUMMARY
Clinical and statistical thinking are different in origin and objectives, but not incompatible. They are complementary and must be joined for the purposes of research. The goal of both types of reasoning is to produce generalized knowledge. The clinical approach generalizes primarily on a biological basis, while statistical modes of reasoning generalize primarily on the basis of data. Clinical trials require a carefully planned combination of clinical and statistical reasoning. This leads to a formal definition of a clinical trial as a designed experiment. Although trials can be unwieldy because of their size, complexity, duration, and cost, they are applicable to many areas of medicine and public health. While not the only method for making valid inferences, clinical trials are essential in situations where it is important to learn about treatment effects or differences that are about the same magnitude as the random error and bias that invariably plague medical studies. Investigators should be alert to circumstances where the window of opportunity to conduct randomized trials occurs early.
2.7 QUESTIONS FOR DISCUSSION
1. Outline specific situations in your discipline where it would be difficult or inefficient to perform clinical trials and explain why.
2. Give some circumstances or examples of clinical trials that do not have all the listed characteristics of the scientific method.
3. Historically, the development of statistical methods has been stimulated by gambling, astronomy, agriculture, manufacturing, economics, and medicine. Briefly sketch the importance of each of these for developing statistical methods. Within medicine, what fields have stimulated clinical trials methods the most? Why?
4. Discuss reasons why randomized disease prevention trials, particularly those employing treatments such as diet, trace elements, and vitamins, should or should not be conducted without extensive “developmental” trials.
3 WHY CLINICAL TRIALS ARE ETHICAL
3.1 INTRODUCTION
Clinical trials are ethically appropriate, necessary, or imperative in many circumstances of medical uncertainty. Physicians should be comfortable with their obligation to gather and disseminate reliable generalizable knowledge. This perspective must be balanced by awareness of the potential for research to disadvantage study participants. To prevent harm to study subjects while also honoring the obligation to learn, we are guided by principles that are the subject of this chapter.

Research ethics have evolved and matured over time. But ethics is not deductive, and principles can conflict with one another. Judgments are frequently necessary to chart the most appropriate course of action. The ethics landscape can also change within the time frame of a single trial as a consequence of either internal study findings or external events. The result of these dynamics is that ethics are as influential as science on the design and conduct of clinical trials, which explains why studies are examined so critically from this perspective.

Our society has an inconsistent and often contradictory view of clinical trials. The public view of a trial often adopts the most unfavorable perspective—risk or benefit—that a given circumstance will allow. For example, when viewed as social justice (e.g., ethnic participation), trials are seen as a great benefit and participation is highly valued or required. In contrast, when viewed as a risk (e.g., participant injury), trials, participants, investigators, and overseers arouse suspicion. From an economic perspective, clinical trials are universally viewed as an expense, the least favorable image. In reality, most trials are actually favorable investments that yield benefits to the well-being of participants, biological knowledge that extends beyond the primary question answered, and the continued returns of reliable therapeutic evidence for current and future patients.
This potentially precarious balance cannot tolerate any cloudiness in ethics foundations. Research participants place a great deal of trust in the physician-investigator and often join trials when the chance of personal benefit is low [1587]. This may be a perfectly logical way for them to act because it reflects an understanding of research goals, therapeutic choices, and likely outcomes. Participants also accept or even endorse the dual role of physicians as practitioners and researchers. But illness of nearly any kind can render a person vulnerable to overly optimistic assessments of risk and benefit, requiring extra thoughtfulness when approaching such individuals to participate in research. Clinical investigators must continually earn the privilege to perform research by carefully protecting the interests of those who grant it. For all of these important and pervasive reasons, I discuss ethics early in this book so that its impact can be appreciated.

There are many thoughtful and detailed discussions of ethics related to medicine broadly [1255], medical research [1579], and clinical trials specifically [127, 541, 542, 827, 927, 1244]. Special problems related to cancer treatments are discussed in depth and from a modern perspective in Ref. [1561]. An international view is given by McNeill [1019]. A very thoughtful perspective is given by Ashcroft et al. [84]. The discussion here is focused on the design of clinical trials but should also provide a perspective on more general issues.

My discussion will partition the ethics landscape into five areas: (1) the nature and consequences of duality in the obligation of physicians, (2) historically derived principles of medical ethics, (3) contemporary foundational principles for clinical trials, (4) concerns about specific experimental methods like randomization or the use of placebos, and (5) current topics including professional conduct. All these areas have a significant relationship to the design and conduct of clinical trials. Randomized designs can attract ethics concerns because they bring fundamental issues into sharp focus. However, the core issues do not depend on the method of treatment allocation, but may be related more to the fact that a practitioner can also be an experimentalist. So the discussion here is intended to include trials regardless of the mode of treatment allocation. The term “physician” should be interpreted to indicate health care practitioners broadly.

Ethics concerns are not alleviated by the fact that important discoveries have been facilitated by clinical experiments (for example, see the debate framed in references [698] and [1183]). Our moral calculus does not work that way. The importance of science or the knowledge gained is not a counterweight to concerns about ethics. Conversely, practitioners have an ethical obligation to perform appropriate designed experiments – there is a moral imperative to learn. Fortunately, it is not necessary to choose between good experiments and ethical behavior. A large number of important clinical questions require, and are amenable to, a true experimental approach incorporating ethical treatment of research subjects.
3.1.1 Science and Ethics Share Objectives
Nature guarantees that some ill individuals will receive inferior treatments. Suboptimal or ineffective treatments can result from dogma, deception, miscalculation, desperation, misunderstanding, or ignorance—when the best therapy has not yet been reliably determined and widely accepted.
Our moral obligation is to learn efficiently and humanely from this unavoidable truth in a way that balances individual rights with the collective good. Clinical trials are often the best way to do this. Furthermore, no historical tradition or modern ethics principle proscribes either the advocacy of clinical trials by physicians or the role of physicians as experimentalists. When properly timed, designed, presented to subjects, and conducted, a clinical trial may be the most ethically appropriate way to acquire new knowledge.

Although much can be learned without formal experimentation, persistent ignorance can be unethical. Therapeutic choices not based on reasonable scientific certainty represent their own kind of questionable ethic. Failure to resolve ignorance when there exists a reasonable tool to do so is unacceptable behavior for the physician. In some circumstances, there may actually be a moral obligation for individuals to participate in research [680]. One example is biobanking, where the specimen is obtained at minimal risk or already exists for other reasons (I assume the appropriate privacy and use protections are in place). It makes little moral sense to allow unenlightened autonomy to inhibit such research [1334].

The present widespread use of robotic prostatectomies highlights a failure to conduct reasonable, timely clinical trials. I will put aside the important question as to whether observation or intervention (surgery or radiotherapy) is the appropriate choice for early prostate cancer. To date, over one million robotic prostatectomy procedures have been performed without large, high-quality randomized evidence of the superiority of this method over conventional surgery. Probably 90% of such surgeries are done robotically today. The robotic procedure is more expensive, as one might expect from initial outlays for a new technology. Some evidence is emerging that robotic procedures may yield a lower rate of complications such as infection, bleeding, incontinence, or impotence, but none that reflects on definitive long-term outcomes such as recurrence-free or overall survival. Respect for the opinions of surgeons who perform these procedures drives the demand. But we must also have respect for the biases and the economic and other incentives that push technology in the absence of reliable evidence.

The practitioner has an obligation to acknowledge uncertainty honestly and take steps to avoid and resolve ignorance. Well-designed and conducted clinical trials are ethically more appropriate than some alternatives, such as acquiring knowledge ineffectively or by tragic accident, or failing to learn from seemingly unavoidable but serious clinical errors. Thus, there are imperatives to conduct medical research [415]. The objectives of science and ethics are convergent.

Medical therapies, diagnostics, interventions, or research designs of any type, including clinical trials, can be carelessly applied. When discussing potential ethics shortcomings, we must distinguish between the methodology of trials as a science and the characteristics, implementation, or application of a particular trial. If problems regarding ethics arise, they will nearly always be a consequence of the circumstance, details of the scientific question, design, or conduct of a specific trial. Problems are much less likely to be grounded in methodologic principles, which are generally flexible.
When investigators follow the basic principles of ethics and consent that have evolved over the last 70 years, the resulting studies are likely to be appropriate.
3.1.2 Equipoise and Uncertainty
Equipoise is the concept that a clinical trial, especially a randomized trial, is motivated by collective uncertainty about the superiority of one treatment versus its alternative. The existence of equipoise helps satisfy the requirement that study participants not be disadvantaged. It is a practical counterweight to the idealization of the personal care principle, discussed below, and supports a comparative trial as the optimal course of action to resolve scientific uncertainty, rather than merely a tolerable alternative. From the original description: . . . at the start of the trial, there must be a state of clinical equipoise regarding the merits of the regimens to be tested, and the trial must be designed in such a way as to make it reasonable to expect that, if it is successfully conducted, clinical equipoise will be disturbed [518].
Although often discussed from the perspective of the individual practitioner, equipoise is a collective concept, in contrast to the personal care principle. It means that uncertainty exists among experts in the medical community about the treatments being compared. It is not necessary for each investigator in a trial to be on a personal fence. Equipoise is the product of individuals’ doubt of certitude and respect for contrary expert opinion. It exists even when individual investigators have weakly held preferences for one treatment or another. Some investigators might hold firm beliefs about treatments, and consequently not think it appropriate to participate in a trial. This often happens but does not dismantle the notion of equipoise in the larger community. Equipoise is delicate in some ways. It may be lost with preliminary data from a trial, or be unattainable in the presence of weak or biased evidence. The traditional notion of equipoise also breaks down in the context of adaptive methods in clinical trials, which require modifications in the conduct of the trial based on interim findings. It also ignores the uncertainty inherent in the common practice of holding a hunch. As a practical matter, equipoise can be difficult to establish. A second concept, the uncertainty principle, can correct some of these difficulties and can supplement or displace equipoise as a foundational requirement for a trial. Uncertainty is defined by the individual practitioner and his or her comfort in recommending that a patient participate in a clinical trial. If the practitioner has genuine uncertainty about the appropriate treatment, then the patient can be encouraged to participate in a trial: A patient can be entered if, and only if, the responsible clinician is substantially uncertain which of the trial treatments would be most appropriate for that particular patient. A patient should not be entered if the responsible clinician or the patient for any medical or non-medical reasons [is] reasonably certain that one of the treatments that might be allocated would be inappropriate for this particular individual (in comparison with either no treatment or some other treatment that could be offered to the patient in or outside the trial) [1195].
The uncertainty principle has more appeal at the level of the individual physician, where it allows him or her to be unsure whether a hunch is correct [1323]. It does not correspond perfectly with equipoise because an investigator could be uncertain but represent a distinct minority opinion.
A sizable number of practitioners who are uncertain will define a state of equipoise.
3.2 DUALITY
Health care provides benefits to both the individual and society. Physician practice is weighted by ideals toward the individual, but societal benefit is at least indirect. Clinical trials also provide benefits to the individual and society, balanced less strongly in favor of the individual. They highlight the potential conflict of obligations that physicians and other health care practitioners have between their current patients and those yet to be encountered. I will refer to the roles of the physician on behalf of the individual and society as duality.
3.2.1 Clinical Trials Sharpen, But Do Not Create, Duality
Discussions of the dual obligations of physicians have been taking place in the literature for many years [590, 703, 717, 1490]. Guttentag [661] acknowledged this duality and suggested that research and practice for the same patient might need to be conducted by different physicians. Schafer [1335] points out the conflict of obligations, stating: In his traditional role of healer, the physician’s commitment is exclusively to his patient. By contrast, in his modern role of scientific investigator, the physician engaged in medical research or experimentation has a commitment to promote the acquisition of scientific knowledge.
The tension between these two roles is acute in clinical trials but is also evident in other areas of medicine. For example, epidemiologists recognize the need for protection of their research subjects and have proposed ethical guidelines for studies [128, 305]. Quantitative medical decision-making methods can also highlight ethical dilemmas [184]. Research and practice are not the only categories of physician activity. Levine [927] also distinguishes “nonvalidated practices” and “practice for the benefit of others.” Nonvalidated or investigational practices are those that have not been convincingly shown to be effective. The reasons for this might be that the therapy is new or that its efficacy was never rigorously shown in the first place. Practice for the benefit of others includes organ donation, vaccination, and quarantine. Clinical trials are a lightning rod for ethics concerns because they bring into sharp focus two seemingly opposed roles of the physician in modern medicine [928]. But these competing demands on the physician have been present throughout history and are not created uniquely by clinical trials. Before the wide use of clinical trials, Ivy [778] suggested that the patient is always an experimental subject. Shimkin [1382] also took this view. Even the ancient Greek physicians acknowledged this (see Section 3.2.4). Today the roles of patient advocate and researcher are not the only two points of conflict for the physician. For example, the desires for academic and financial success compete with both patient care and research. Thus, the practitioner is faced with numerous potential ethics conflicts.
3.2.2 A Gene Therapy Tragedy Illustrates Duality
One of the most poignant cases highlighting the dual nature of medical care and the ethics concerns surrounding it occurred at the University of Pennsylvania in 1999. An 18-year-old man with ornithine transcarbamylase (OTC) deficiency died after participating in a clinical trial testing a gene-based therapy for the condition. OTC deficiency is an inherited disorder in liver cells that prevents them from properly metabolizing nitrogen. In its common form, it causes death in affected newborn males. The clinical trial in question was a dose-ranging study, in which a modified human adenovirus was being used to introduce a normal OTC gene into liver cells. The participant who died was the last of 18 originally planned for the trial and received a relatively high dose of the viral vector. He died from multiple organ system failure due to an immunologically mediated reaction.

The patient’s clinical condition prior to participating in the trial was not urgent. Altruism played a strong role in his decision to volunteer, as did the expectation of benefit. However, the investigators were (prematurely) optimistic about producing a lasting therapeutic benefit. In retrospect, they probably had too favorable a view of the safety and efficacy of the treatment. If so, this was an error of judgment on their part but not a violation of the standards of performing research in a practice setting. Their behavior reflected the principles of dual obligation in the same way as that of clinical researchers at all major medical centers. In the aftermath of this incident, reviewers found deficiencies in the oversight and monitoring of the program of gene therapy clinical trials. The mixture of a fatal toxicity with ethics issues, regulation, and gene therapy gave this case a high profile.
3.2.3 Research and Practice Are Convergent
One of the principal discomforts about the ethics of clinical trials arises from the widely held, but artificial, distinction between research and practice in medicine. Some activities of the physician seem to be done primarily for the good of the individual patient, and we place these in the domain of practice. Other actions seem to be performed for the collective good or to acquire generalizable knowledge, and we label these as research. This distinction is often convenient, but artificial, because expert physician behavior is nearly always both research and practice. A more accurate distinction could be based on the physician’s degree of certainty. When the physician is certain of the outcome from a specific therapy for a particular patient, applying the treatment might be described as practice. When the physician is unsure of the outcome, applying the treatment could be considered research, at least in part. The continuum in degrees of certainty illustrates the artificial nature of the research–practice dichotomy. There are very few actions that the physician can carry out for the benefit of the individual patient and not yield some knowledge applicable to the general good. Likewise, nearly all knowledge gained from research can be of some benefit to individual patients. It is possible for patients to be misinformed about, or harmed by, either inappropriate practice actions or improper research activities. Both activities can have large or small risk–benefit ratios. Practice versus research seems to be partly a result of the setting in which the action takes place, which may be a consequence of the motives of the physician.
TABLE 3.1 Rational Motivations for Risk Acceptance in Clinical Trials

Societal:     Direct and indirect benefits to the current population from past studies
              Cost savings from new treatments for serious diseases
              Use of appropriate risk-benefit considerations to develop new therapies
              Proven therapies replace ineffective but seemingly low-risk ones, and not vice versa
Individual:   Altruism
              Amelioration of disease or improved quality of life
              Improved supportive care
              Hope
Physician behavior can be “practice” in one setting, but when part of a different structure, the same behavior becomes “research.” Comparison of two standard therapies in a randomized trial might illustrate this. For example, there are two widely used treatments for early-stage prostate cancer: surgery and radiotherapy. Because of history, practice, and referral patterns, specialty training and belief, economics, and stubbornness, no randomized comparison has been done. If one were conducted, it would require more careful explanation of risks, benefits, and alternatives than either practice now demands.

In modern academic medical centers, the convergence of research and practice is clearly evident. In these settings, physicians are trained not only to engage in good practice but also to maintain a mind-set that encourages learning both from clinical practice and from formal experimentation. Even with strong financial incentives to separate them, it is clear that research and clinical practice are one and the same in many of these institutions. Not surprisingly, these academic settings are the places most closely associated with performing clinical trials. However, academic medicine is a development of recent history. In the past, many medical advances were made in mostly traditional practice settings, precisely because physicians both took advantage of and created research opportunities.

Double Standards

The artificial distinction between research and practice can create double standards regarding ethical conduct. The practitioner is free to represent his or her treatment preferences for the patient with relatively informal requirements for explaining alternatives, rationale, risks, and benefits. Similarly, a second practitioner who prefers a different treatment may have minimal requirements to offer alternatives. However, the investigator studying a randomized comparison of the two standard treatments to determine which one is superior will likely incur ethical obligations well beyond those of either practitioner. A perspective on this point is given by Lantos [893, 894].

A second, and more far-reaching, double standard can be seen in cases where individuals have been injured as a result of treatments that were not thoroughly studied during development. Thalidomide for sedation during pregnancy is one example. Physicians cannot expect to provide any individual with health care that is in his or her best interest without knowing the detailed properties of that therapy, including risks and benefits. This knowledge often comes most efficiently from clinical trials. A discussion of the sources and potential problems arising from this double standard is given by Chalmers and Silverman [244].

A third source of double standards arises in considering different cultural expectations regarding medical care, patient information, and research.
Some cultures place fewer restrictions on research practices in humans than does the United States, or perhaps the individuals in the population have a greater expectation that these types of studies will be done. One cannot make value judgments from this fact. For example, there are considerable differences in the expectations regarding informed consent for research studies among the United States, Europe, Africa, and Asia. These differences are a consequence of many factors, including the structure and economics of health care, physician attitudes, patient expectations, attitudes toward litigation, the nature of threats to the public health, and cultural norms, especially regarding risk acceptance and consent.

Even our own cultural perspective is inconsistent in this regard. I have seen no better illustration of this than the article by Lemonick and Goldstein [920] and the accompanying cover of Time magazine. This article should be read by everyone working in clinical trials, not for its factual content but because it is a perfect example of cultural double standards regarding medical research broadly and clinical trials specifically. As a result, it strongly reinforces the negative light in which such studies are frequently presented to the public. The sensational costs of clinical trials in human terms are easy to display—deaths of two research volunteers who should not have been harmed. In short, we desire risk-free medical studies. However, acceptance of risk at both the societal and individual levels is not only inevitable but also rational behavior. Some reasons are listed in Table 3.1. The wish for risk-free medical trials is as vain as the desire for risk-free automobiles or air travel, both of which predictably kill sizable numbers of healthy people. The societal and scientific necessities are to manage risk–benefit, inform participants and remove coercion, and learn from accidents and mistakes. Clinical trials have a very respectable track record in these regards, probably much better than biomedical research in general. Even so, the reaction typified by Lemonick and Goldstein is to recapitulate mistakes of medical research, attribute them to clinical trials, and ignore the improvements and safeguards implemented after learning from mistakes and accidents.

To add perspective, there has been much attention in recent years to deaths attributable to medical mistakes—largely medication errors. There may be tens of thousands of such deaths in the United States each year. It seems likely that the risk of serious harm on clinical trials is proportionately much lower because of the safeguards surrounding such studies. Factors that contribute to the safety of clinical trials include, but are not limited to, peer review of research protocols, Institutional Review Board oversight, reporting of adverse events, independent trial monitoring committees, attentive clinical care at centers of excellence, and openness of research results. Routine clinical care will probably never implement such an array of safeguards. A fourth societal double standard with regard to research and practice can be seen in off-trial access or compassionate use, discussed next.
Compassionate Use

Evaluation of a new therapeutic is highly structured within each clinical trial as well as at the program or pipeline level. Eligibility and safety restrictions may exclude some individuals from trials who might otherwise benefit. Ethics imperatives also encourage allowing those whose last chance of benefit is an incompletely tested agent to receive that therapy, even outside ongoing trials. This “compassionate use” is a common circumstance at research institutions, and each year a few difficult cases may find their way into public awareness, where the representation by the media is universally in favor of access.
If pharmaceutical sponsors or the FDA restrict access, they are portrayed as unsympathetic to the individual’s plight. A similar problem can arise after developmental trials are complete but before definitive regulatory action. Individuals who appear to have benefitted while on study can be faced with temporary loss of the therapy, although that would be uncommon. More likely, the drug will not be available to new patients during the hiatus, creating a problem directly proportional to the promise of the new drug and the severity of the disease. Physicians who have become associated with a promising new agent also have multiple interests in sustaining availability outside developmental trials using these “expanded access” venues.

The task of supervising compassionate use and expanded access also falls to Institutional Review Boards, where such requests can be a significant fraction of the workload. Risk–benefit is as much the issue as it is for a clinical trial. However, it is important to recognize the subtle but significant shift in perspective that this represents. It is actually the research apparatus supporting clinical care rather than the reverse. On expanded access, the demands to capture research-quality data and side effects are identical to those of a clinical trial, but the information may never contribute to the scientific goals. Support for such activities by society is further evidence of double standards and of the physician duality discussed below.
Personal Care

The “personal care principle” (PCP) has been raised by some as an important ethical guideline for practitioner conduct. The PCP appears to have roots in the Hippocratic Oath and states that physicians have an obligation to act always in the best interests of the individual patient [542]. There are echoes of this in the Helsinki Declaration. In a sense, physicians are taught to believe in, and endorse, the personal care ideal. Many accept the notion that they should always make the best recommendation based on the knowledge available and the patient’s wishes. This does not mean that such recommendations are always based on scientific fact or well-formed opinion. Sometimes physician preferences are based on important nonrational factors. Other times preferences are based on unimportant factors. Discussions of issues surrounding the PCP are given by Markman [519, 520, 988, 1297].

The PCP does not support the lack of knowledge or opinion that often motivates making treatment choices as part of a structured experiment, particularly by random assignment. If the PCP were in fact a minimal standard of conduct or ethics principle, it could create serious problems for conducting randomized clinical trials (e.g., Ref. [1299])—except that patients could always agree to participate in trials despite their practitioners’ opinions. For this reason, I will spend a little effort arguing against it. In its simplest form, the PCP makes the physician–patient relationship analogous to the attorney–client relationship. As such, it appears to be a product of, and a contributor to, an adversarial relationship between society and the patient, casting the sick person as in need of physician protection. Although physicians and patients are partners, the therapeutic environment is not usually adversarial. Even research and clinical trials are not adversarial circumstances for participants, in which one compromises to obtain an overall best outcome. Current requirements for representative research studies, although misguided in other ways, illustrate this.
There are fundamental problems with the PCP. For example, the PCP does not encourage patients and physicians to take full advantage of genuine and widespread uncertainty or differences of opinion that motivate randomization. It does not acknowledge the lack of evidence when treatments are being developed. It raises moral and ethical questions of its own, by discouraging or removing an objective evaluation tool that can correct for economic and other biases that enter decisions about treatment selection. All these issues are illustrated by the history of Laetrile, discussed briefly in Section 8.4.3. The ascendancy of a personal care ideal may have been an artifact of a time when medical resources were very abundant. An insightful physician will probably recognize that many of his/her treatment “preferences” are artificial and carry no ethical legitimacy. More to the point, the PCP is not a defining ethical principle but an idealization. The PCP is violated in many instances that have nothing to do with randomization or clinical trials, as discussed below.
3.2.4 Hippocratic Tradition Does Not Proscribe Clinical Trials
The Hippocratic Oath has been a frequently cited modern-day code of conduct for the practice of medicine. It represents an ideal that has been extensively modified by modern medical practice, scientific discovery, cultural standards, and patient expectations. Many physicians no longer repeat the full oath upon graduation from medical school and some have rarely (or never) seen it. Most clinical trialists have not studied it, and occasionally someone will imply that the oath is inconsistent with experimentation. To assess its actual implications for ethics and clinical trials, it is useful to examine the full text. This somewhat literal translation comes from Ref. [1041] (pp. xiii–xiv), which also discusses in great detail the ethical implications of the oath:

I swear by Apollo the physician and by Asclepius and by Health [the god Hygeia] and Panacea and by all the gods as well as goddesses, making them judges [witnesses], to bring the following oath and written covenant to fulfillment, in accordance with my power and judgment: to regard him who has taught me this techné [art and science] as equal to my parents, and to share, in partnership, my livelihood with him and to give him a share when he is in need of necessities, and to judge the offspring [coming] from him equal to [my] male siblings, and to teach them this techné, should they desire to learn [it], without fee and written covenant; and to give a share both of rules and of lectures, and all the rest of learning, to my sons and to the [sons] of him who has taught me and to the pupils who have both made a written contract and sworn by a medical convention but by no other.

And I will use regimens for the benefit of the ill in accordance with my ability and my judgment, but from [what is] to their harm or injustice I will keep [them]. And I will not give a drug that is deadly to anyone if asked [for it], nor will I suggest the way to such a counsel. And likewise I will not give a woman a destructive pessary. And in a pure and holy way I will guard my life and my techné. I will not cut, and certainly not those suffering from stone, but I will cede [this] to men [who are] practitioners of this activity.
Into as many houses as I may enter, I will go for the benefit of the ill, while being far from all voluntary and destructive injustice, especially from sexual acts both upon women’s bodies and upon men’s, both of the free and of the slaves. And about whatever I may see or hear in treatment, or even without treatment, in the life of human beings – things that should not ever be blurted out outside – I will remain silent, holding such things to be unutterable [sacred, not to be divulged]. If I render this oath fulfilled, and if I do not blur and confound it [making it to no effect] may it be [granted] to me to enjoy the benefits of both life and of techné, being held in good repute among all human beings for time eternal. If, however, I transgress and perjure myself, the opposite of these.
Although it underscores the morality needed in the practice of medicine, and therefore in medical research, it is easy to see why the oath is of marginal value to modern physicians. It contains components that are polytheistic, sexist, proscriptive, celibate, and superstitious. Specific clinical proscriptions (i.e., abortion, surgery for urinary stones, and the use of “deadly” drugs) have long since fallen, as has the pledge for free instruction. Principles that have found widespread acceptance or are reflected in the ethics of today’s medical practice are “doing no harm,” prescribing for “the benefit of the ill,” confidentiality, and an obligation to teach. Even confidentiality in modern medicine is open to some question [1393], particularly in the situations described below surrounding reportable diseases. One could not expect the Hippocratic Oath to mention clinical trials, even implicitly, because the concept did not exist when it was written. However, acting for the benefit of the ill (implied plural) is noteworthy because it is an obligation to society as well as to the individual. The idea of experiment was known to ancient Greek physicians, but it was taken to mean a change in treatment for a patient on recommended therapy [1041]. Many medical treatises were simply the compilation of such cases.
Modern Ethic

The Hippocratic Oath is primarily an idealization filtered through modern views, and not a required code of conduct for all circumstances. These limitations also apply to our contemporary explicit ethical principles. The American Medical Association (AMA) does not endorse the Hippocratic Oath. The AMA’s own principles of ethics contain similar sentiments but are modernized and oriented toward practitioners [29]:

The medical profession has long subscribed to a body of ethical statements developed primarily for the benefit of the patient. As a member of this profession, a physician must recognize responsibility to patients first and foremost, as well as to society, to other health professionals, and to self. The following Principles adopted by the American Medical Association are not laws, but standards of conduct which define the essentials of honorable behavior for the physician.

1. A physician shall be dedicated to providing competent medical care, with compassion and respect for human dignity and rights.
2. A physician shall uphold the standards of professionalism, be honest in all professional interactions, and strive to report physicians deficient in character or competence, or engaging in fraud or deception, to appropriate entities.
3. A physician shall respect the law and also recognize a responsibility to seek changes in those requirements which are contrary to the best interests of the patient.
4. A physician shall respect the rights of patients, colleagues, and other health professionals, and shall safeguard patient confidences and privacy within the constraints of the law.
5. A physician shall continue to study, apply, and advance scientific knowledge, maintain a commitment to medical education, make relevant information available to patients, colleagues, and the public, obtain consultation, and use the talents of other health professionals when indicated.
6. A physician shall, in the provision of appropriate patient care, except in emergencies, be free to choose whom to serve, with whom to associate, and the environment in which to provide medical care.
7. A physician shall recognize a responsibility to participate in activities contributing to the improvement of the community and the betterment of public health.
8. A physician shall, while caring for a patient, regard responsibility to the patient as paramount.
9. A physician shall support access to medical care for all people.
Principles 5 and 7 can be interpreted as explicitly recognizing the obligation to conduct research, and therefore to perform clinical trials.
3.2.5 Physicians Always Have Multiple Roles
There are many circumstances besides clinical trials in which the physician’s duty extends beyond the individual patient or otherwise conflicts with the societal obligations stated in the AMA principles. In some circumstances, significant obligations extend to other members of society or to patients yet to be encountered. In other circumstances, the physician may have conflicting obligations to the same patient (e.g., care of the terminally ill patient). Dual roles for the physician are acknowledged by the codes of conduct cited in this chapter. The examples listed below are circumstances consistent with both tradition and current ideas about “good” health care. They illustrate the somewhat mythical nature of acting only in the best interests of the individual patient. Although clinical trials highlight dual obligations, medical practitioners encounter and manage similar conflicts of commitment in other places. There have been occasional historical experimental studies that illustrate this point [112].
Examples

The first example is the teaching and training of new physicians and health care professionals. This is necessary for the well-being of future patients, and is explicitly required by the Hippocratic tradition, but is not always in the best interests of the current patient. It is true that with adequate supervision of trainees, the risks to patients from teaching are small. In some ways the patient derives benefit from being on a teaching service. However, the incremental risks attributable to inexperienced practitioners are not zero, and there may be few benefits for the patient. Even if safe, teaching programs often result in additional inconvenience to the patient.
Aside from outright errors, teaching programs may perform more physical exams, diagnostic tests, and procedures than are strictly necessary. Thus, the safest, most comfortable, and most convenient strategy for the individual patient would be to have an experienced physician and to permit no teaching around his or her bedside.

Vaccination is another example in which the physician advocates, and the individual patient accepts, risks for the benefit of other members of society. With a communicable disease like polio, diphtheria, or pertussis, and a reasonably safe vaccine, the most practical strategy for the individual is to be vaccinated. However, the optimal strategy for the individual patient is to have everyone else vaccinated so that he or she can derive the benefit without any risk. For example, almost all cases of polio between 1980 and 1994 were caused by the oral polio vaccine [240]. A strategy avoiding vaccination is impractical for the patient because the behavior of others cannot be guaranteed. This is not a workable policy for the physician to promote either, because it can be applied only to a few individuals, demonstrating the obligation of the physician to all patients.

Triage, whether in the emergency room, battlefield hospital, or domestic disaster, is a classic example of the physician placing the interests of some patients above those of others and acting for the collective good. It is a form of rationing of health care, a broader circumstance in which the interests of the individual are not paramount. Although in the United States we are not accustomed to rationing (at least in such stark terms), constrained health care resources produce it, and may increasingly do so in the future. For example, physicians working directly for profit-making companies and managed health plans, rather than being hired by the individual patient, face competing priorities that highlight the dual obligations of the physician.

Abortion is another circumstance in which the physician has obligations other than to the individual patient. No stand on the moral issues surrounding abortion is free of this dilemma. Whether the fetus or the mother is considered the patient, potential conflicts arise. This is true also of other special circumstances, such as the separation of conjoined twins or in utero therapy for fatal diseases, in which there is no obvious line of obligation. For a discussion of the many problems surrounding fetal research, see Ref. [1244].

Organ donation is almost never in the best medical interests of the donor, especially when the donor gives a kidney, bone marrow, blood, or another organ while still alive and healthy. It is true that donors have little risk in many of these situations, especially ones from which they recover quickly, such as blood donation, but the risk is not zero. Even in the case of a donor kept alive only by artificial means, donating organs yields no medical benefit to the donor, only to others. However, it is clear that physicians endorse and facilitate altruistic organ donation, again illustrating their dual roles. Apart from viewing blood products as a form of organ donation, issues surrounding blood banking also highlight the dual obligations of the physician. Many practices surrounding blood banking are regulated (dictated) by the government to ensure the safety and availability of the resource.
The physician overseeing the preparation and storage of blood products will consider needs and resources from a broad perspective rather than as an ideal for each patient. The practitioner may not be free to choose ideally for the patient with regard to such issues as single donor versus pooled platelets, age of the transfused product, and reduction of immunoactive components such as leukocytes.
The dual obligations of the physician can be found in requirements for quarantine, reporting, or contact tracing of certain diseases or conditions. Circumstances requiring this behavior include infectious diseases, sexually transmitted diseases, suspected child abuse, gunshot wounds, and some circumstances in forensic psychiatry. Although these requirements relate, in part, to nonmedical interests, reporting and contact tracing highlight dual obligations and seem to stretch even the principle of physician confidentiality. The fact that required reporting of infectious diseases compromises some rights of the patient is evidenced by government and public unwillingness to require it for HIV-positive status and AIDS diagnoses in response to concerns from those affected. There is no better contemporary example of this than the isolation and restrictions placed on patients with severe acute respiratory syndrome (SARS).

Treatments to control epidemics (aside from quarantine) can present ethical conflicts between those affected and those at risk. For example, in the summer of 1996, there was an epidemic of food poisoning centered in Sakai, Japan, caused by the O157 strain of Escherichia coli. At one point in the epidemic, almost 10,000 people were affected, and the number of cases was growing by 100 per day. Serious complications were due to hemolytic uremic syndrome. At least seven people died, and nearly all those infected were schoolchildren. As a means to stop or slow the epidemic, Sakai health officials considered treating 400 individuals who were apparently in their week-long incubation period with antibiotics. The treatment was controversial because, although antibiotics kill O157 E. coli, the endotoxin released could make the individuals sick. The obligations of health providers in such a situation cannot be resolved without ethical conflicts.

Dual obligations can be found in market-driven health care financing. In this system, employers contract with insurance companies for coverage and reimbursement. Both the patient and physician are effectively removed from negotiating important aspects of clinical care. If the physician is employed by a health care provider rather than the patient, competing obligations are possible [930]. The U.S. Supreme Court recently ruled that health maintenance organizations (HMOs) cannot be sued for denial of services, which is simply their form of rationing health care for economic reasons. The idea that some medical services are not available to those who could benefit from them is commonplace. But it is not accepted as a matter of principle in the United States despite being an economic reality. A robust economy is fueled by bad decisions. Economies slow when consumers make good decisions, spending only on essentials, efficiencies, low-profit items, and quality products. This truth also holds in the medical care economy. Health expenditures will slow as good decisions are forced by regulation or other reforms. Wise purchasing must exclude services that some have come to expect, and physicians will be partners in the reforms, regulations, and decisions about availability, again highlighting the dual nature of their obligations.

These examples of the competing obligations of the physician are no more or less troublesome than clinical trials. Like clinical trials, each has reasons why, and circumstances in which, it is appropriate and ethical.
We are familiar with many of them and see their place in medical practice more clearly than that of clinical trials, which are a relatively recent arrival. However, the dual obligations of the physician have always been with us. Obligations to individuals as well as to “society” can be compassionately and sensibly managed in most circumstances. This inevitable duality of roles that confronts the physician investigator is not a persuasive argument against performing clinical trials.
3.3 HISTORICALLY DERIVED PRINCIPLES OF ETHICS
Biomedical experimentation has taken a difficult path to reach its current state. A brief discussion of history is offered by Reiser, Dyck, and Curran [1256]. The landmarks that have shaped current ethics are relatively few, but highly visible. The major landmarks tend to be crises that result from improper treatment of research subjects. A concise review is given by McNeill [1019]. It can be difficult to extract the required principles, but it is essential to understand the troubled history.
3.3.1 Nuremberg Contributed an Awareness of the Worst Problems
The term “experimentation,” especially related to human beings, was given a dreadful connotation by the events of World War II. The criminal and unscientific behavior of physicians in the concentration camps of Nazi Germany became evident worldwide during the 1946–1947 Nuremberg trials. There were numerous incidents of torture, murder, and experimentation atrocities committed by Nazi physicians [1554]. In fact, 20 physicians and 3 others were tried for these crimes at Nuremberg [47]. Sixteen individuals were convicted and given sentences ranging from imprisonment to death. Four of the seven individuals executed for their crimes were physicians. At the time of the trial, there were no existing international standards for the ethics of experimentation with human subjects. The judges presiding at Nuremberg outlined 10 principles required for the ethical conduct of human experimentation. This was the Nuremberg Code, adopted in 1947 [48]. The full text is given in Appendix D. The Code established principles for the following points:
· Study participants must give voluntary consent.
· There must be no reasonable alternative to conducting the experiment.
· The anticipated results must have a basis in biological knowledge and animal experimentation.
· The procedures should avoid unnecessary suffering and injury.
· There is no expectation for death or disability as a result of the study.
· The degree of risk for the patient is consistent with the humanitarian importance of the study.
· Subjects must be protected against even a remote possibility of death or injury.
· The study must be conducted by qualified scientists.
· The subject can stop participation at will.
· The investigator has an obligation to terminate the experiment if injury seems likely.
The Nuremberg Code has been influential in the United States and in international law, providing the groundwork for standards of ethical conduct and the protection of research subjects. Numerous perspectives on the Code have been offered over the years [588, 1388, 1475].
3.3.2 High-Profile Mistakes Were Made in the United States
Despite the events at Nuremberg, a persistent ethical complacency followed in the United States. In the late 1940s and early 1950s, the American Medical Association (AMA) was keenly aware of Nuremberg and felt that its own principles were sufficient for protecting research subjects [28, 778]. The principles advocated by the AMA at that time were (1) patient consent, (2) safety as demonstrated by animal experiments, and (3) investigator competence. A few examples of ethically inappropriate studies are sketched here, derived from Beecher’s landmark paper [132] (also refer to [802]).

In 1932, the U.S. Public Health Service had started a study of the effects of untreated syphilis in Tuskegee, Alabama. Three hundred ninety-nine African-American men with advanced disease were studied along with 201 controls. The study continued long after effective treatment for the disease was known, coming to public attention in 1972 [177, 803, 1497]. In fact, there had been numerous medical publications relating the findings of the study, some after the development of penicillin [1347]. The study had no written protocol.

In another study, at the Jewish Chronic Diseases Hospital in Brooklyn in 1963, cancer cells were injected into 22 debilitated elderly patients without their knowledge to determine whether they would immunologically reject the cells [827]. Consent was said to have been obtained orally, but records of it were not kept. The hospital’s Board of Trustees was informed of the experiment by several physicians who were concerned that the subjects did not give consent. The Board of Regents of the State University of New York reviewed the case and concluded that the investigators were acting in an experimental rather than therapeutic relationship, requiring subject consent.

At Willowbrook State Hospital in New York, retarded children were deliberately infected with viral hepatitis as part of a study of its natural history [827]. Some subjects were fed extracts of stool from those with the disease. Investigators defended the study because nearly all residents of the facility could be expected to become infected with the virus anyway. However, even the recruitment was ethically suspect because overcrowding prevented some patients from being admitted to the facility unless their parents agreed to the study.

There are some more recent examples of studies in which individuals may have been exposed to harmful treatments without being fully apprised of the risk. Some possible examples from tests of radioactive substances conducted from 1945 to 1975 came to light [738, 1495, 1496]. Presumably patients would not have voluntarily accepted the risks inherent in those studies had they been properly informed. Standards for informed consent were different in that historical period than they are now, making definitive interpretation of those events somewhat difficult.
3.3.3 The Helsinki Declaration Was Widely Adopted
In 1964, the 18th World Medical Association (WMA) meeting in Helsinki, Finland, adopted a formal code of ethics for physicians engaged in clinical research [593, 1271, 1586]. This became known as the Helsinki Declaration, which has been revised by the WMA nine times, most recently in 2013. This declaration is intended to be reviewed
and updated periodically. The current version is on the WMA Web site [1585]. The WMA was founded in 1947 with a service objective, and is an amalgamation of medical associations from the different countries of the world. The U.S. representatives come from the AMA.

Controversy found the Helsinki Declaration after 1994, when the AIDS Clinical Trials Group published the trial of zidovudine therapy for mother–infant transmission of HIV infection [295]. This highlighted a growing international debate over universal versus pluralistic ethics standards and the appropriateness of placebo treatments. The FDA essentially ignored the 1996 and later revisions. A key point of controversy was that trial participants in the United States had access to the effective drug, whereas those in some of the developing countries did not.

In 2000, another revision to the Helsinki Declaration was made. Among other points, the document was restructured, the scope was increased to cover tissue and data, and an emphasis was placed on benefit for the communities participating in research. Specifically, it was stated that “research is only justified if there is a reasonable likelihood that the populations in which the research is carried out stand to benefit from the results of the research.” Even before the exact language was adopted, the document was criticized both as having a utilitarian ethic and as serving to weaken the researchers’ responsibility to protect study subjects. An exchange of points on these and related ideas is offered by Refs. [183, 929].

The most substantive and controversial issues arise from discussions surrounding 16 trials investigating the vertical transmission of AIDS, conducted by U.S. investigators in Africa [962]. Some studies were randomized placebo-controlled trials testing the ability of drug therapy to reduce mother–infant transmission of HIV. The studies were planned by academic investigators, and sponsored by NIH and the CDC. All research protocols underwent extensive ethics review by IRBs and the host country’s health ministries, researchers, and practitioners. The basic criticisms of these trials focus on the use of a placebo treatment when drug therapy is known to be effective at reducing HIV transmission, and on placement of the trials in a developing country setting where no treatment was the norm, and hence the placebo would be seen as appropriate. The studies would not be considered appropriate in the United States, for example, and were consequently judged by some to be globally unethical. The facts of the studies, criticisms, and rejoinders can be seen from the literature [44, 1506]. The issues are complex and relate to cultural differences, scientific needs, humanitarian efforts, medical and scientific paternalism, and ethics. I cannot resolve the issues here. For a mature, reasoned perspective on this, see the discussion by Brody [191].

The essential issue for this discussion is that the Helsinki Declaration was revised in 2000 to make such studies more difficult or impossible. The relevant text was paragraph 29:

The benefits, risks, burdens and effectiveness of the method should be tested against those of the best current prophylactic, diagnostic, and therapeutic methods. This does not exclude the use of placebo, or no treatment, in studies where no proven prophylactic, diagnostic or therapeutic method exists.
Although the language is overtly ambiguous, its intent is clear when one considers the history behind the revision. This provision is poised to hinder needed research in developing countries, be ignored on occasion, or be modified. The FDA had historically pointed to the Helsinki Declaration in its guidelines, but does not permit others to write its regulations. It seemed unlikely that the FDA would be able to embrace this version of the Helsinki Declaration unreservedly. In 2001, the WMA Council took the unusual step of issuing a note of clarification regarding paragraph 29. It was adopted by the WMA General Assembly in 2002:

The WMA is concerned that paragraph 29 of the revised Declaration of Helsinki (October 2000) has led to diverse interpretations and possible confusion. It hereby reaffirms its position that extreme care must be taken in making use of a placebo-controlled trial and that in general this methodology should only be used in the absence of existing proven therapy. However, a placebo-controlled trial may be ethically acceptable, even if proven therapy is available, under the following circumstances: Where for compelling and scientifically sound methodological reasons its use is necessary to determine the efficacy or safety of a prophylactic, diagnostic or therapeutic method; or Where a prophylactic, diagnostic or therapeutic method is being investigated for a minor condition and the subjects who receive placebo will not be subject to any additional risk of serious or irreversible harm. All other provisions of the Declaration of Helsinki must be adhered to, especially the need for appropriate ethical and scientific review.
The second paragraph of the clarification was odd because it seemed to disregard situations where a placebo control was scientifically useful but ethically inappropriate. But the major difficulty with paragraph 29 was dogmatism and the supposition that either equivalence or superiority trials will always be achievable. Another clarification was offered by the WMA in 2004 with apparently little easing of the debate. Provisions of the Helsinki Declaration contain elements that illustrate the dual role of the physician. For example, it begins by stating “It is the mission of the physician to safeguard the health of the people.” However, in the next paragraph, it endorses the Declaration of Geneva of the World Medical Association, which states regarding the obligations of physicians, “the health of my patient will be my first consideration”, and the International Code of Medical Ethics that states “a physician shall act only in the patient’s interest when providing medical care which might have the effect of weakening the physical and mental condition of the patient.” Later it states “Medical progress is based on research which ultimately must rest in part on experimentation involving human subjects.” It is noteworthy that the principle of acting only in the individual patient’s interest seems to be qualified and that the declaration presupposes the ethical legitimacy of biomedical and clinical research. Among other ideas, it outlines principles stating that research involving human subjects must conform to generally accepted scientific principles, be formulated in a written protocol, be conducted only by qualified individuals, and include written informed consent from the participants. The FDA announced in 2006 that it would no longer reference the Declaration, and in 2008 it was replaced with Good Clinical Practice guidelines [848]. NIH training related to human subjects research no longer references Helsinki. Similarly, the European Union
Clinical Trials Directive did not cite the 2004 or 2008 revisions. Another revision of the Helsinki Declaration came in 2008, with many fewer changes than were made in 2000. The current version is from the 64th WMA General Assembly in 2013, and opens some new issues [1046]. The relevant paragraph for the discussion above is now number 33 regarding the use of placebo treatment: The benefits, risks, burdens and effectiveness of a new intervention must be tested against those of the best proven intervention(s), except in the following circumstances: Where no proven intervention exists, the use of placebo, or no intervention, is acceptable; or Where for compelling and scientifically sound methodological reasons the use of any intervention less effective than the best proven one, the use of placebo, or no intervention is necessary to determine the efficacy or safety of an intervention and the patients who receive any intervention less effective than the best proven one, placebo, or no intervention will not be subject to additional risks of serious or irreversible harm as a result of not receiving the best proven intervention. Extreme care must be taken to avoid abuse of this option.
It remains to be seen if the Helsinki Declaration becomes an idealization without literal utility such as the Hippocratic Oath, or finds more universal appeal again.
3.3.4 Other International Guidelines Have Been Proposed
The United Nations General Assembly adopted the International Covenant on Civil and Political Rights in 1976, which states (Article 7): “No one shall be subjected to torture or to cruel, inhuman or degrading treatment or punishment. In particular, no one shall be subjected without his free consent to medical or scientific experimentation.” In 1982, the World Health Organization (WHO) and the Council for International Organizations of Medical Sciences (CIOMS) issued a document, Proposed International Guidelines for Biomedical Research Involving Human Subjects, to help developing countries apply the principles in the Helsinki Declaration and the Nuremberg Code. The guidelines were extended in a second document in 1991 dealing with epidemiologic studies, in part, in response to needs arising from field trials testing AIDS vaccines and drugs. The second document was called International Guidelines for Ethical Review of Epidemiologic Studies. In 1992, the Guidelines were revised at a meeting in Geneva, resulting in the International Ethical Guidelines for Biomedical Research Involving Human Subjects. Some areas of medical research are not mentioned in the guidelines, including human genetic research, embryo and fetal research, and research using fetal tissue (Council for International Organizations of Medical Sciences, 1993). Guideline 11 of these regulations states: As a general rule, pregnant or nursing women should not be subjects of any clinical trials except such trials as are designed to protect or advance the health of pregnant or nursing women or fetuses or nursing infants, and for which women who are not pregnant or nursing would not be suitable subjects.
This Guideline was contrary to the thinking of many advocacy groups in the United States. The FDA and NIH relaxed or removed such restrictions in favor of allowing
pregnant women more self-determination regarding participation in clinical research of all types. In 2002, the guidelines were again revised [306]. Sections 16 and 17 discuss women as research subjects and state:

Investigators, sponsors or ethical review committees should not exclude women of reproductive age from biomedical research. The potential for becoming pregnant during a study should not, in itself, be used as a reason for precluding or limiting participation. However, a thorough discussion of risks to the pregnant woman and to her fetus is a prerequisite for the woman’s ability to make a rational decision to enrol in a clinical study. … Pregnant women should be presumed to be eligible for participation in biomedical research.
In Canada, the Tri-Council Policy for the ethical conduct of research was adopted in 1998 and revised in 2010 [624]. It is noteworthy for its thoughtful treatment of research related to human reproduction and genetic research. Research guidelines are also published by the Nuffield Council [1125] and others [1494].
3.3.5 Institutional Review Boards Provide Ethics Oversight
In response to mistakes and abuses in the United States, Congress established the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research through the 1974 National Research Act. Interestingly, it was the first national commission to function under the new Freedom of Information Act of 1974, so all its deliberations were public and fully recorded. The Act required the establishment of Institutional Review Boards (IRB) for all research funded in whole or in part by the federal government. In the form of the 1978 Belmont Report, this Commission provided a set of recommendations and guidelines for the conduct of research with human subjects and articulated the principles for actions of IRBs (National Commission for Protection of Human Subjects of Biomedical and Behavioral Research, 1978). In 1981, the federal regulations were modified to require IRB approval for all drugs or products regulated by the FDA. This requirement does not depend on the funding source, the research volunteers, or the location of the study. Regulations permitting compassionate use of experimental drugs were disseminated in 1987 and 1991. IRBs must have at least five members with expertise relevant to safeguarding the rights and welfare of patients participating in biomedical research. At least one member of the IRB should be a scientist, one a nonscientist, and at least one should be unaffiliated with the institution. The IRB should be made up of individuals with diverse racial, gender, and cultural backgrounds. Individuals with a conflict of interest may not participate in deliberations. The scope of the IRB includes, but is not limited to, consent procedures and research design. The Belmont Report outlined ethical principles and guidelines for the protection of human subjects. A major component of this report was the nature and definition of informed consent in various research settings. In the Belmont Report, three basic ethical principles relevant to research involving human subjects were identified. The report recognized that “these principles cannot always be applied so as to resolve beyond dispute particular ethical problems.” These principles are discussed in the next section.
IRBs approve human research studies that meet specific prerequisites. The criteria are that (1) the risks to the study participants are minimized, (2) the risks are reasonable in relation to the anticipated benefits, (3) the selection of study participants is equitable, (4) informed consent is obtained and appropriately documented for each participant, (5) there are adequate provisions for monitoring data collected to ensure the safety of the study participants, and (6) the privacy of the participants and confidentiality of the data are protected.

Informed consent is a particularly important aspect of these requirements. The consent procedures and documents must indicate that the study involves research, describe reasonably foreseeable risks and discomforts, and describe potential benefits and alternatives. In addition, the consent document must describe the extent to which privacy of data will be maintained, treatment for injuries incurred, and whom to contact with questions. Finally, the consent indicates that participation is voluntary and that no loss of benefits will occur if the patient does not enter the study.
3.3.6 Ethics Principles Relevant to Clinical Trials
The principles of ethics to which physicians aspire probably cannot be applied universally and simultaneously. Furthermore, one cannot deduce an ethical course of action in all circumstances, even after accepting a set of principles. However, there are three principles of ethics outlined in the Belmont Report that are widely accepted in modern medical practice: respect for persons (individual autonomy), beneficence, and justice [927] (National Commission for Protection of Human Subjects of Biomedical and Behavioral Research, 1978). Taken together, they provide guidance for appropriate behavior when conducting human experimentation. The National Commission addressed the ethics of human research specifically in outlining the principles, and acknowledged the conflicts that can occur in specific circumstances and even between the principles themselves.

Respect for Persons: Autonomy
Autonomy is the right of self-governance and means that patients have the right to decide what should be done for them during their illness. Because autonomy implies decision, it requires information as the basis for a decision. Autonomous patients need to be informed of alternatives, including no treatment when appropriate, and the risks and benefits associated with each. The principle of autonomy is not restricted to clinical trial settings but is broadly applicable in medical care. Practitioners who prefer a particular treatment usually recognize that realistic alternatives are possible and that the individual needs to make informed selections. In situations where the patient is incapacitated or otherwise unable to make informed decisions, the principle of autonomy extends to those closest to the patient. This may mean that patients without autonomy (e.g., children or incapacitated patients) should not be allowed to participate in research.

Clinical trials often ask that subjects surrender some degree of autonomy. For example, in a trial, the subject may not be able to choose between two or more appropriate treatments (randomization may do it), or the subject may be asked to undergo inconvenient or extensive evaluations to comply with the study protocol. In a masked clinical trial, the subject may be made aware of the risks and benefits of each treatment but be unable to apply that information personally with certainty. However, to some extent the
subject’s autonomy is retrievable. When clinical circumstances require it or when reliable information becomes available, the subject can be more fully informed or trial participation can be ended. The “re-consent” process is an example of this. Temporarily giving up autonomy is not unique to clinical trials; it is a feature of many medical circumstances. Consider the patient who undergoes surgery using a general anesthetic. Although informed beforehand of the risks and benefits, the patient is not autonomous during the procedure, particularly if the surgeon encounters something unexpected.

Respect for persons is an idea that incorporates two ethical convictions. The first is autonomy, as discussed above; the second is that persons with diminished autonomy need protection from potential abuses. Some individuals, especially those who have illnesses, mental disability, or circumstances that restrict their personal freedom, have diminished autonomy. People in these categories may need protection, or even exclusion, from certain research activities. Other individuals may need only to acknowledge that they undertake activities freely and are aware of potential risks.

A circumstance in which application of this principle is problematic is the use of prisoners for research purposes. One could presume that prisoners should have the opportunity to volunteer for research. However, prison conditions could be coercive for individuals who appear to volunteer for research activities. This is especially true if there are tangible benefits or privileges to be gained from participation in the research. Consequently, as the Belmont Report states, it is not clear whether prisoners should be allowed to volunteer or should be protected in such circumstances.
Beneficence and Nonmaleficence
Beneficence is a principle that reflects the patient’s right to receive advantageous or favorable consideration, namely, to derive benefit. Nonmaleficence is the physician’s duty to avoid harm (primum non nocere) and to minimize the risk of harm. We can refer to these principles jointly as beneficence. Because physicians also have a duty to benefit others when possible, the principle of beneficence has the potential to conflict with itself. For example, knowledge of what provides benefit and what causes harm comes from research. Therefore, investigators are obliged to make practical and useful assessments of the risks and benefits involved in research. This necessitates resolving the potential conflict between risk to participants and benefit to future patients.

Research can create more than minimal risk without immediate direct benefit to the research subject. Some such research will not be permitted by oversight committees. However, in other cases it may be justified. For example, many arguments have been made by patients infected with HIV that unproven but potentially beneficial treatments should be made available to them. Some of these treatments carry the possibility of harm with a low potential for benefit. The use of baboon bone marrow transplantation in AIDS is an example.

The assessment of risks and benefits requires that research studies be scientifically valid and therefore properly designed. However, valid studies do not automatically have value or significance for science, the participants, or future patients. In addition to validity, the study must investigate an important question and have an appropriate risk–benefit ratio for the participants. Investigators should probably establish the scientific validity of a proposed study prior to considering the ethical question of whether or not it has value
or significance. The principle of beneficence can be satisfied only when both components are favorable.

Justice
The principle of justice addresses the question of fairly distributing the benefits and burdens of research. Compensation for injury during research is a direct application of this principle. Injustice occurs when benefits are denied without good reason or when the burdens are unduly imposed on particular individuals. In the early part of the twentieth century, burdens of research fell largely upon poor patients admitted to the public wards of the hospital. In contrast, benefits of improvements learned at their expense often accrued to private patients. The injustice of denying treatment to men in the Tuskegee Syphilis Study has already been discussed. There are some circumstances where distinctions based on experience, competence, age, and other criteria justify differential treatment. For reasons already mentioned, research should be conducted preferentially on adults rather than children. Institutionalized patients and prisoners should be involved in research only if it relates directly to their conditions and there are no alternative subjects with full autonomy.

TABLE 3.2 Principles for Ethical Clinical Trials as Described by Emanuel, Wendler, and Grady [422]
Collaborative partnership
Scientific value
Scientific validity
Fairness of subject selection
Favorable risk–benefit
Independent review
Informed consent
Respect for enrolled subjects
3.4 CONTEMPORARY FOUNDATIONAL PRINCIPLES
Principles of biomedical ethics imply several contemporary requirements for the ethical conduct of research [1541]. These include informed consent of the participants, assessment and disclosure of risks and benefits, and appropriate selection of research subjects. In today’s practice, application of these principles requires other components such as optimal study design, investigator competence, a balance of risk and benefit for study participants, patient privacy, and impartial oversight of consent procedures [1392]. An excellent synthesis of these ideas is given by Emanuel, Wendler, and Grady [422], who provided requirements for evaluating the ethics of clinical research studies (Table 3.2). These were augmented later to add the requirement for a collaborative partnership [421]. The individual requirements are discussed below. Today, at least in academic medical centers, the appropriateness of these components is ensured by a combination of investigator training, infrastructure, institutional review, and peer review. However, there is no
formula to guarantee these, and as with all scientific endeavors, much rests on judgment and trust.

3.4.1 Collaborative Partnership
A collaborative partnership implies that the research involves the community in which it takes place. Members of the community participate in planning, oversight, and use of research results. Typical mechanisms for accomplishing this include forming community advisory boards, placing patient advocates on review or monitoring committees, and using the community to advocate for research support. This type of partnership may be a critical foundation for ethically conducting some types of research. An example might be studies in emergency situations (e.g., cardiopulmonary resuscitation) where informed consent is not possible. Prospective community awareness of such research would therefore be essential.

3.4.2 Scientific Value
If the study has scientific value, useful knowledge will be derived from the research. This means not only that the question is of importance but also that the results will be made available to society at large, whether “positive” or “negative.” Value is also relevant to the use of scarce resources. Studies of high value will be allowed to consume resources preferentially. A study that has value should produce a product (result) that is visible to the scientific community through publication and presentation. An extension of this idea is that results should not be kept confidential because of proprietary concerns. More to the point, the investigator should not participate in a clinical trial that is likely to be rendered valueless if the sponsor disapproves of the findings.

3.4.3 Scientific Validity
Scientific validity is a consequence of good study design, and means that subjects on a trial are contributing to answering a question that is important and that has a high chance of being answered by the experiment being undertaken. Research designs that are grossly flawed or those that cannot answer the biological question are not ethical. Similarly, those that ask unimportant questions are unethical, even if they pose minimal risk. As Rutstein said: It may be accepted as a maxim that a poorly or improperly designed study involving human subjects … is by definition unethical. Moreover, when a study is in itself scientifically invalid, all other ethical considerations become irrelevant [1319].
A second component of validity derives from investigator competence: technical, research, and humanistic. Technical competence is assessed by education, knowledge, certification, and experience. In addition to technical competence, the investigator must have research competence. This may be based on both training and experience. One tangible aspect of research competence might be that the investigator has performed a competent systematic review of current knowledge and previous trials to be certain that the planned study is justified. When the clinical trial is completed, a valid, accurate,
and complete description of the results should be published to ensure dissemination of the knowledge [712]. A component of a valid finished trial is an appropriate statistical analysis. Humanistic competence requires compassion and empathy. These cannot be taught in the same way that technical and research competence can, but the proper clinical and research environment and good research mentoring facilitate it.

3.4.4 Fair Subject Selection
Historically, participation in research has been viewed as a risk. At the level of the individual subject, it is both a risk and a benefit with acceptable balance as indicated below. When things go awry, such as the extreme case of death attributable to the treatment, participation will be viewed retrospectively purely as a risk. Distributing the risks of research fairly and avoiding exploitation of vulnerable people continues to be a principle of ethical behavior. Voluntary selection is a foundation of fairness. It follows that removal of obstacles to participation is also within the principle of fairness.

In recent times, participation in research has been viewed increasingly as a benefit or a right. This view is not so much at the level of the individual research subject but pertains to sociodemographic, advocacy, or political groups claiming that their health needs have been neglected. This view makes a direct connection between representation in a study cohort and (disproportionate) derived benefits for similar members of society. It has political origins rather than scientific ones. The view of research participation as a right embraces several fallacies. First, it does not fully acknowledge the view based on risk. Second, it assumes that sociodemographic groups derive benefits when their members enter research studies, which is not always true. Third, it assumes that knowledge about groups is gained by direct participation alone, namely, that studies have primarily empirical external validity.

3.4.5 Favorable Risk–Benefit
Patients must be excluded from the study if they are at undue risk or are otherwise vulnerable [1535]. Having the study reviewed by the IRB or other ethics board, satisfying eligibility criteria, and using informed consent do not eliminate this duty. Patients could be at high risk as a result of errors in judgment about their risk, atypical reactions or side effects from the treatment, or for unknown reasons. The individuals affected and others likely to be affected should be excluded from further participation. The assessment of risk and benefits implies that the research is properly designed and has had competent objective review. If alternative ways of providing the anticipated benefits to the patient without involving research are known, they must be chosen. Investigators must distinguish between the probability of harm and the severity of the effect. These distinctions can be obscured when terms like “high risk” or “low risk” are used. For example, if a life-threatening or fatal side effect is encountered with low frequency, is this high risk? Similarly, benefits have magnitudes and probabilities associated with them. Furthermore, risks or benefits may not accrue only to the research subject. In some cases, the risks or benefits may affect patient families or society at large. A favorable risk–benefit setting does not require that the subjects be free of risk or that they be guaranteed benefit. The tolerance for risk increases as the underlying severity of the disease increases. In the setting of a fatal disease, patients and research subjects may
be willing to accept a modest chance for minimal symptomatic improvement with even a significant risk of death. A good example of the reasonableness of this is lung volume reduction surgery for patients with advanced emphysema. Even in properly selected patients, surgery carries a risk of death far in excess of most elective procedures, and there may be only a 30% chance of modest (but symptomatically important) benefit.

3.4.6 Independent Review
Review of proposed and ongoing research studies is performed by institutions through at least two mechanisms. In the United States, the first is the Institutional Review Board (IRB), which is responsible for the ethical oversight of all Public Health Service sponsored investigation. In other countries, this role is covered by an Independent Ethics Committee (IEC) (Australia), a Local Research Ethics Committee (LREC) (England), or a Research Ethics Board (Canada). For convenience, I will refer to all such committees as IRBs. IRB committees are typically composed of medical practitioners, bioethicists, lawyers, and community representatives. They review planned trials from an ethical perspective, including consent documents and procedures. IRBs are also being increasingly asked to review the scientific components of research studies. This can be helpful when the expertise is available and a hindrance when it is not.

A second method by which institutions or collaborative clinical trial groups review studies is by using a Treatment Effects Monitoring Committee (TEMC). These committees oversee ongoing clinical trials with regard to treatment efficacy and safety. If convincing evidence about efficacy is provided by the trial before its planned conclusion, the TEMC will recommend early termination. Similarly, if serious unforeseen toxicities or side effects are discovered, the trial might be halted. Designing and executing this aspect of a trial can be quite complex and is discussed in Chapter 18.

Most large multicenter clinical trials have additional layers of concept development and/or review, although perhaps not focused only on ethics. Reviews might be performed by program staff at sponsors, the FDA, or collaborators. Such reviews tend to be focused on scientific issues but always include a perspective on ethics. At NCI-designated Comprehensive Cancer Centers, an extra formal structured scientific and ethics review is required prior to the submission of a research project to the IRB. The mechanism of study review is itself peer-reviewed, and the process provides a layer of quality assurance for cancer trials that usually makes IRB scrutiny more efficient and productive.

3.4.7 Informed Consent
Informed consent is a complex but important aspect of the practice and regulation of clinical trials. The requirement for consent is grounded in moral and legal theory and clinical practice. A perspective on this and historical developments in informed consent is given in Ref. [437]. Broad reviews of issues surrounding informed consent are given in Refs. [77, 1300, 1301]. In the context of AIDS, a useful review has been written by Gray, Lyons, and Melton [630].

There are three keys to appropriate informed consent: capacity, comprehension, and voluntariness. Capacity is the ability of the prospective participant to understand. Capacity can be diminished by factors such as age, injury, medication, and medical condition. Comprehension is affirmation that the prospective participant understands the
information presented. It requires a face-to-face discussion, not simply a review of documents. It may also require significant time, both with the investigator and afterward in consultation with family members. Voluntariness is the concept that participants have come to their decision without coercion. This means that not only have obvious pressures been removed (time, influence of others), but also the effects of fear have been mitigated.

Comprehension and voluntariness imply that consent is a process rather than a document. This process may take several meetings with investigators or other members of the research team over several days. The potential research participant will likely have additional discussions with family members or other trusted people. Investigators should be familiar with generic clinical trial information that potential participants are likely to read [586]. Common errors in consent documents include excessive length and too high a reading level for adequate comprehension. Both of these are easily correctable, but finding the right balance between the amount of information to include and the length of an informed consent document is difficult. Also, investigators often wrongly view consent as a document rather than a process. The goal is to transmit culturally valid information regarding risks and benefits of participation in research. Because of the liability climate in the United States, there is sometimes a tendency to view the consent document and process largely as a protection for the investigator.

TABLE 3.3 Elements of Informed Consent from CFR 21, Section 50.25
1. A statement of the purpose of the research; procedures to be followed; duration of participation; investigational treatments or procedures
2. Descriptions of risks and discomforts
3. Description of benefits that can be reasonably expected
4. Disclosure of alternatives that might be advantageous to the potential participant
5. Description of how confidentiality will be maintained
6. Explanation of compensation and medical treatments if injury occurs
7. Contacts to answer questions and questions about research related injuries
8. Explanation that participation is voluntary and that there will be no penalty or lost benefits for refusing to participate
9. Potential risks to the participant, embryo, or fetus if the volunteer is or becomes pregnant
10. Circumstances under which the investigator may terminate a volunteer’s participation
11. Additional cost to the participant
12. Consequences of, and procedures for, withdrawal
13. Statement about informing participants of significant new findings that might affect their willingness to participate
14. Number of volunteers participating in the study
Items 9–14 are optional depending on circumstances.

Elements of Informed Consent
Elements of informed consent include information provided to the patient, comprehension of that information by the patient and his or her family, and an assessment of the voluntary nature of the consent (Table 3.3). The information required in a consent document generally includes the nature of the research procedure, its scientific purpose, and alternatives to participation in the study. Patient comprehension is facilitated by careful attention to the organization, style, and reading level of consent forms. When
children are subjects of research, it is frequently necessary to obtain informed consent from the legal guardian and obtain assent from the child. To verify that consent is given voluntarily, the conditions under which it is obtained must be free of coercion and undue influence. Conditions should not permit overt threats or undue influence to affect the individual’s choice.

A problem area for consent is in conducting research on emergency medical treatments. It is instructive to review the problems with consent in the emergency room setting as a microcosm of consent issues [156]. The principal difficulty is in being able to obtain valid consent from critically ill patients to test promising new treatments with good research designs. Some patients may be unconscious, relatives may be unavailable, or there may not be sufficient time to meet the same standards of informativeness and consent as in ordinary hospital or clinic settings. In 1996 the FDA and NIH proposed new measures for the protection of research subjects in emergency settings. The new FDA rules and NIH policies on emergency consent waiver make it easier to study drugs and devices in patients with life-threatening conditions who are unable to give informed consent. The new regulations permit enrolling subjects in research studies without their consent provided the following criteria are met:

1. An independent physician and an IRB agree to the research and that it addresses a life-threatening situation.
2. The patient is in a life-threatening situation.
3. Conventional treatments are unproven or unsatisfactory.
4. The research is necessary to determine the safety and efficacy of the treatment and it cannot be carried out otherwise.
5. Informed consent cannot feasibly be obtained from the patient or legal representative.
6. The risks and potential benefits of the experimental procedure are reasonable compared with those for the underlying medical condition of the patient and standard treatments.
7. Additional protections are in place such as consultations with the community, advance public disclosure of the study design and risks, public disclosure of the study results, and FDA review of the study protocol.

The merits and weaknesses of these specialized rules will be evident in the next few years as they are applied in specific research projects.

Therapeutic Misconception
Therapeutic misconception occurs when a research participant misunderstands the benefit or intent of a research study and therefore expects more benefit than is warranted [76, 847]. This misconception may be crucial in an individual’s agreement to participate in the first place. The chance of potential research subjects taking a too favorable view of a trial is increased when the question is closely tied to clinical care, or when a person is faced with serious or terminal disease with few options. Our principal concern with therapeutic misconception is the risk of coercion when subjects have diminished autonomy. Accurate information, explanation of alternatives, and emphasizing voluntariness are methods to minimize therapeutic misconception.
Similarly, when subjects overestimate benefits or underestimate potential risks, it has been termed therapeutic misestimation. This can happen even when participants fully understand the research procedures and how they differ from regular clinical care. The term therapeutic optimism has been used to describe a participant’s hopes for the best possible outcome. Optimism can reasonably exist while understanding clearly the nature of research and its risks and benefits. It is difficult to see optimism as any sort of ethical problem.

Therapeutic misconception has become a classical criticism of dose-finding cancer trials. A primary goal of such trials is to cause side effects, so that a safe dose of a new treatment can be determined. A secondary goal is legitimately therapeutic within the limits of the lack of clinical knowledge regarding the new agent. But the typical setting for such studies is when subjects are terminally ill and no good therapies are available. It is easy to see how they might have an overly optimistic view of such a trial even when the purpose and benefits are carefully explained. Historically, there has been a track record of legitimate benefit from these trials [736], although at a low frequency.
3.4.8 Respect for Subjects
Respect for subjects pertains to the way that potential participants are approached and given options, as well as to the ongoing engagement of those who agree to participate in the study. One feature of respect is privacy, discussed below. A second is allowing participants to change their minds and withdraw from the trial without incurring penalties. Third, new information gathered from the study must be made available to participants, perhaps during the course of the trial. Results should be available to the participants at the end of the study. Finally, the interests of the participants should be continually monitored while the study is taking place. Treatment effects monitoring (discussed in Chapter 18) provides a mechanism to ensure certain of these rights.

The patient’s right to privacy has a long tradition and has been underscored by the Health Insurance Portability and Accountability Act of 1996 (HIPAA) regulations [1131]. The purpose of HIPAA was to guarantee security and privacy of health information. The burden on researchers has been considerable [621, 812]. The privacy right has been made ascendant even in circumstances such as the AIDS epidemic, where certain benefits to society could be gained by restricting the right to privacy (e.g., contact tracing and AIDS screening). It is maintained by appropriate precautions regarding written records and physician conduct. In the information age, extra care is required to maintain privacy for computerized records or other electronic media that can be easily shared with colleagues and widely disseminated. Maintaining privacy requires steps such as patient consent, restricting the collection of personal information to appropriate settings and items, ensuring security of records, and preventing disclosure. These and other privacy principles have been crafted into a comprehensive set of guidelines in Australia [299] and are relevant broadly.
3.5 METHODOLOGIC REFLECTIONS
The principles of ethics outlined above are directly represented in the conduct of trials, especially with respect to formal ethics review, investigator competence, informed consent, and disclosure of risks. There are further connections between principles of ethics and important study design features discussed later in this section. Some of these design characteristics are occasionally contentious. To begin with, it is worthwhile to examine the ethics imperative to conduct sound research—that is, to generate reliable evidence regarding therapeutic efficacy.
3.5.1 Practice Based on Unproven Treatments Is Not Ethical
Despite knowledge, skill, and empathy for the patient, the practitioner sometimes selects treatments without being absolutely certain about relative effectiveness. Many accepted therapies have not been developed or tested with the rigor of evidence-based medicine. Treatment preference based on weak evidence does not provide an ethics imperative as strong as that for therapies established on a rigorous scientific foundation, if at all. Astute practitioners are alert to both the ethical mandate to use the best treatment and the imperative to understand the basis of a practice and challenge it with rigorous experiments when appropriate. These mandates are not in opposition. On the contrary, the obligation to use the best treatment implies a requirement to test existing ones and develop new therapies. Some untested practices are sensible and are not likely to cause harm, such as meditation, exercise, relaxation, and visualization. These and related activities may be lifestyles as much as they are therapies. They may improve a patient’s quality of life, especially when applied to self-limited conditions, and normally do not replace treatments of established benefit. The discussion here does not pertain to such adjunctive treatments. Ethics concerns arise when unproven or untested therapies replace proven ones, particularly for chronic or fatal illnesses [837]. Ignorance about the relative merits of any treatment cannot carry ethical legitimacy for the physician, even in a purely “practice” setting. To summarize: A physician’s moral obligation to offer each patient the best available treatment cannot be separated from the twin clinical and ethical imperatives to base that choice of treatment on the best available and obtainable evidence. The tension between the interdependent responsibilities of giving personal and compassionate care, as well as scientifically sound and validated treatment, is intrinsic to the practice of medicine today… Controlled clinical trials – randomized when randomization is feasible, ethically achievable, and scientifically appropriate – are an integral part of the ethical imperative that physicians should know what they are doing when they intervene into the bodies, psyches, and biographies of vulnerable, suffering human beings [1292].
Lack of knowledge can persist purely as a consequence of the newness of a therapy. In this circumstance, clinical trials may be the most appropriate way to gain experience and acquire new knowledge about the treatment. However, ignorance can remain even in the light of considerable clinical experience with a therapy, if the experience is outside of a suitably structured environment or setting that permits learning. This can happen when proponents of a treatment have not made sufficient efforts to evaluate
it objectively, even though they may be clinically qualified and have good intentions. An example of this was mentioned in Section 3.1.1 with the discussion of robotic prostatectomy. There are widely used unproven therapies for many conditions, including AIDS, cancer, arthritis, diabetes, musculoskeletal conditions, skin diseases, and lupus. Such therapies are particularly common in cancer and AIDS [222–227]. The National Center for Complementary and Alternative Medicine at NIH defines alternative therapies to be those that are unproven and has attempted to investigate some of them. It does not appear so politically correct today to label a treatment as unproven despite the obvious. Our squeamishness comes partly from a general confidence in therapeutics and partly from the scientific norm of doubt of certitude, both discussed elsewhere in this book. There is a great diversity of such therapies, even if one restricts focus to a single disease like cancer [90, 185, 239, 1140]. The large number of unproven therapies suggests that weak or biased methods of evaluation are the norm. Many fringe treatments for various diseases are discussed on an Internet site that monitors questionable treatments [117]. A broad view of ethics related to the complementary and alternative medicine context is provided in the book by Humber and Almeder [748].
Example: Unproven Cancer Therapies
In the past, the American Cancer Society (ACS) investigated and summarized scientific evidence concerning some unproven cancer treatments through its Committee on Questionable Methods of Cancer Management [27, 52, 53, 116]. Over 100 unproven or untested treatments for cancer were previously listed on its Internet site for fringe therapies. A few examples are provided in Table 3.4. Although many or all of these treatments have professional advocates and some have been associated with anecdotes of benefit, they are most often used in an environment that inhibits rigorous evaluation. See Ref. [118] for a review of this and related subjects.

Unconventional methods like those in Table 3.4 are often made to appear reasonable to patients who have few therapeutic options and lack knowledge about cancer treatment. Few such treatments have been shown to be safe and effective using rigorous clinical trials. Evidence supporting them remains unconvincing despite the clinical experience that surrounds them. Some treatments listed in Table 3.4 have been studied and found to be ineffective, such as Laetrile and hydrazine. Others have proven effective uses, but not when employed unconventionally, such as vitamin C and dimethyl sulfoxide. In the 1990s, as many as 5% of cancer patients abandoned traditional therapies in favor of alternative methods and up to 70% employed some alternative therapy [1011]. Unconventional therapy was used by a sizable fraction of people in the United States [414]. Twenty years later, the proportion is much higher. Issues related to the evaluation of complementary therapy are discussed in Section 4.5, and an example of the evaluation of a particular unconventional treatment is given in Section 20.8.5. For a perspective on the potentially false basis for, and consequences of, some alternative therapies, see the discussion of shark cartilage in Ref. [1167].

Some unproven treatments for other diseases have proponents, but can be found in similar settings that discourage objective evaluation. In any case, strongly held
opinion cannot ethically substitute for experimental testing. Efforts by NIH to study some unconventional treatments have been subject to controversy [990]. Critical views are offered in Refs. [88, 515, 632, 1447].

TABLE 3.4 Examples of Cancer Therapies Widely Regarded as Questionable
Antineoplastons
Anvirzel
Bio-Ionic System
BioResonance Tumor Therapy
CanCell (also called Cantron, Entelev, and Protocel)
Canova Method
Cansema System
Chaparral
Controlled Amino Acid Therapy
Coral calcium
Di Bella therapy
Dimethyl sulfoxide
Electron replacement therapy
Elixir Vitae
Escharotic Salves
Essiac
Galavit
Galvanotherapy
Gerson Method
Gonzalez (Kelley) Metabolic Therapy
Grape cure
Greek Cancer Cure
Hoxsey Treatment
Hydrazine Sulfate
Hyperthermia, whole body
Immuno-Augmentative Therapy
Induced Remission Therapy
Induced Hypoglycemic Treatment
Insulin Potentiation Therapy
Intra-Cellular Hyperthermia Therapy
Krebiozen
Laetrile
Livingston-Wheeler Regimen
Macrobiotic Diet
Metabolic therapy
Mistletoe/Iscador
Moerman Diet
Multi Wave Oscillator
Oncolyn
PC-SPES
Polyatomic Oxygen Therapy
Resan Antitumor Vaccine
Revici Cancer Control
Shark Cartilage
Stockholm Protocol
Sundance Nachez Mineral Water
Ultraviolet Blood Irradiation
Vitamin B-17 Tablets
Vitamin C
Wheat Grass
Zoetron Therapy

3.5.2 Ethics Considerations Are Important Determinants of Design
Clinical trials are designed to accommodate ethical concerns on at least two levels. The first is when the practitioner assesses the risks and benefits of treatment for the individual patient on an ongoing basis. The physician always has an obligation to terminate experimental therapy when it is no longer in the best interests of an individual patient. Second, the physician must be constantly aware of information from the study and decide if the evidence is strong enough to require a change in clinical practice, either with regard
to the participants in the trial or to new patients. In many trials this awareness comes from a Treatment Effects Monitoring Committee, discussed in Chapter 18. Both of these perspectives require substantive design features. Consider early pharmacologically oriented studies in humans as an example. The initial dose of drug employed, while based on evidence from animal experiments, is often very conservatively chosen. It is unlikely that low doses will be of benefit to the subject, particularly with drugs used in life-threatening diseases like cancer or AIDS. Ethical concern over risk of harm requires these designs to test low doses ahead of high doses. Such studies often permit the subject to receive additional doses of drug later, perhaps changing their risk–benefit ratio. In middle development, staged designs, or those with stopping rules, are frequently used to minimize exposing subjects to ineffective drugs, and to detect large improvements quickly. Similar designs are used in comparative trials. These designs, developed largely in response to ethical concerns, are discussed in more depth in Chapter 18. In these and other circumstances, ethical concerns drive modifications to the clinically, statistically, or biologically optimal designs.

The ethics needed in research design were illustrated exceptionally well in circumstances surrounding trials in acute respiratory distress syndrome (ARDS) conducted by the ARDS Network. This multicenter group was initiated in 1994 by the National Heart Lung and Blood Institute (NHLBI). A more detailed history of the controversy sketched here is provided by Steinbrook [1442, 1443]. The ARDS Network randomized study of tidal volumes took place between 1996 and 1999, and was published in 2000 [1113], indicating the superiority of low tidal volume management. A randomized study of the use of pulmonary artery catheter versus central venous catheter was begun in July 2000. Both studies were criticized by two physicians in complaints to the Office for Human Research Protections (OHRP) because of their concerns regarding the nature of the control groups. The tidal volume study was published at the time concerns were voiced, but the fluid and catheter trial was less than half completed. Because of the OHRP attention, the sponsor voluntarily suspended the trial in 2002 and convened an expert review panel. Although the panel firmly supported the study design and conduct, the trial remained closed while OHRP commissioned a separate review. In July 2003, after nearly a year, the OHRP allowed the trial to continue without modification [1443].

This case illustrates the attention given to the ethics of research design and the extraordinary – perhaps inappropriate – power that individuals can have when such questions are raised. I say inappropriate for two reasons. First, the suspension ignored the considerable process leading to the study implementation, including multidisciplinary scientific review and oversight and redundant independent reviews by Institutional Review Boards [391]. The IRBs are explicitly approved by OHRP. Second, the suspension process did not itself provide due process for disagreements (e.g., the OHRP did not make public the names of its eight consultants). The study may well have been damaged, although the OHRP did not require any design changes, asking only for additional IRB reviews and changes to the consent documents.
3.5.3 Specific Methods Have Justification
Rigorous scientific evaluation of therapies fits well with imperatives of ethics. Tensions that arise can be lessened through scrupulous attention to principles but tend to recur
with some specific elements of design. However, most elements of good study design do not trouble our sense of ethics, and they may be required as part of good research practice. Examples of design elements that are always appropriate include a written protocol, prospective analysis plan, defined eligibility, adequate sample size, and safety monitoring. Other components of design may raise ethics questions, at least in some settings. Examples include use of a control group, method of treatment allocation (e.g., randomization), placebos, and the degree of uncertainty required to conduct a trial. Concern over these points can be illustrated by randomized trials of extracorporeal membrane oxygenation (ECMO) for neonates with respiratory failure. This series of studies has been discussed at length in the pediatric and clinical trials literature. Some issues related to treatment allocation and the ECMO trials are discussed in Chapter 17. For a review with useful references from an ethical perspective, see Ref. [1040].
Randomization The justification for randomization follows from two considerations: (1) a state of relative ignorance – equipoise [518] or uncertainty, and (2) the ethical and scientific requirement for reliable design. If sufficient uncertainty or equipoise does not exist, randomization is not justified. Because convincing evidence can develop during the conduct of a trial, the justification for continuing randomization can be assessed by a Treatment Effects Monitoring Committee (Data Monitoring Committee) [122]. This is discussed briefly below and in Chapter 18.

Randomization is a component of reliable design because it is the main tool to eliminate selection bias. It may be our only reliable method for doing so. Randomization controls effects due to known predictors (confounders), but importantly it also controls bias due to unobserved or unknown factors. This latter benefit is the major, and often overlooked, strength of randomization. Methods of treatment allocation, including randomization, are discussed in Chapter 17. The merits of randomization are not universally accepted either on ethical or on methodological grounds [1499], but mainstream opinion agrees on its usefulness and appropriateness. The benefits of randomization with respect to bias reduction do not come from the effects of chance, but simply from the breaking of the natural correlation of prognosis with physician therapeutic choice. Presumably, the same uneasiness would arise from any treatment assignment method that is free of bias.

There are occasional concerns that unbalanced randomization is ethically questionable. It might imply that investigators favor one treatment, and the unbalanced treatment allocation eases their average concern. Unequal allocation will not compensate for lack of true equipoise or uncertainty, which would then be the real issue. However, there are legitimate reasons for unequal allocation that are free of such concerns. They include requiring more clinical experience with one treatment than another, cost optimization, variance minimization, and adaptive randomization for efficiency. Simple randomization, which is always valid in principle, can result in imbalances (a brief simulation illustrating this point follows the Monitoring discussion below). We routinely seek to limit those imbalances not for ethical reasons, but because they are widely viewed as ugly, and they can attract pointless criticism.

In some studies, aggregates of individuals are randomized – families, villages, cities, hospitals, emergency rooms, or clinics, for example. These "cluster" randomizations can be done when it is not feasible to randomize individuals. The intervention is applied
throughout a cluster and special methods for design and analyses are then required. A potential problem is how to apply appropriate consent procedures for subjects within a cluster. A clinic that has adopted a randomized assignment of a treatment must still present to individual study subjects risks, benefits, and alternatives. The choice presented to potential study participants may be less clear under this circumstance. Payment for randomization or participation in a clinical trial is usually not viewed favorably by IRBs. Even seemingly minimal payments may be coercive to disadvantaged people. Payments for time, parking, lunch, lodging, or other inconveniences of donating time to a research study are more appropriate. Treatment Preference A physician with a preference for a particular treatment should not be an investigator in a clinical trial. This assumes that the preference is strong enough so that the physician does not recognize a state of equipoise. A potential participant that strongly prefers one treatment in a comparative trial should not participate in the study. Even if individuals have preferences based on weak evidence or non rational factors, they are not eligible to participate in the trial. A similar proscription applies to patients who prefer some treatment not under study, for example, in a single-arm trial. Subjects with firm preferences or reservations are more likely to withdraw from the study or develop strong dissatisfaction with their results. Exclusions because of strong preferences are required by the principle of autonomy. Investigators should not exert too much effort to overcome the reservations of potential participants because it might be misinterpreted as coercion. Such concerns work against overenthusiastic “inclusiveness”; for example, see Section 9.4.4. Informed Consent Informed consent is a vital enabling element of clinical trials, and was discussed in detail in Section 3.4.7. Some challenging concerns today regarding consent relate to how much true understanding is gained by participants prior to study entry, the role of consent documents as protection for research institutions as much as for information, appropriate cultural and language translation of research consents, process documentation, and providing flexibility and privacy for participant autonomy in the use of biospecimens and genetic information. In short, what is at stake in research now is somewhat different than it was 40 years ago, but the methods of informed consent may not have kept up. When problems with the process of informed consent occur, it is often the result of investigator error. Patients and their families are vulnerable, and technical information about new treatments is hard to understand, especially when it is presented to them quickly. This can also be a problem when using treatments already proven to be effective. Informed consent procedures that evolved to protect research subjects from exploitation are now used by some as a means to protect them from litigation. While such concerns are understandable, the best prevention for legal controversy is an appropriate and documented process that transmits all the relevant information to a prospective study participant. Monitoring During the conduct of a trial, convincing evidence regarding the outcome could become available. Trial designs routinely incorporate interim analyses and guidelines to deal with this possibility without disrupting the statistical properties of the study. If investigators
find early evidence convincing, they should recommend stopping the study to minimize the number of subjects receiving inferior therapy. Interim analysis is especially of concern in comparative experiments. Safety monitoring of data collected during a clinical trial is also a required oversight. Adverse events attributable to therapy must be reported to sponsors and regulators promptly. Both safety and efficacy concerns may be addressed by a Treatment Effects Monitoring Committee (TEMC). This aspect of trial design and conduct is discussed in Chapter 18.
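The imbalance that simple randomization can produce, mentioned under Randomization above, is easy to quantify. The following minimal sketch (in Python; not from the text, and all parameter values are illustrative) estimates how often the two arms of a 100-subject trial differ by more than 10 subjects when each subject is assigned by an independent fair coin flip:

import random

def simulate_imbalance(n_subjects=100, n_trials=10000, threshold=10, seed=1):
    """Estimate how often the two arms differ by more than `threshold` subjects
    under simple (coin-flip) randomization."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_trials):
        arm_a = sum(rng.randint(0, 1) for _ in range(n_subjects))  # 1 = arm A
        arm_b = n_subjects - arm_a
        if abs(arm_a - arm_b) > threshold:
            count += 1
    return count / n_trials

if __name__ == "__main__":
    # With 100 subjects, a split more extreme than 55 versus 45 occurs in
    # roughly a quarter of simulated trials, which is why blocked or otherwise
    # restricted randomization is often used to limit cosmetic imbalance.
    print(simulate_imbalance())

The allocation is unbiased in every simulated trial; the point is only that unrestricted chance assignment yields visibly uneven group sizes fairly often.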
Use of Placebo Controls It has been argued that the history of therapeutics up to the recent age of scientific medicine has been the history of the placebo effect [1367]. Regardless of the accuracy of the claim, the power of suggestion needs to be considered when evaluating any treatment. The effect can be controlled by using a placebo (or sham) in a comparative experiment, and it is often necessary and practical to do so. Thus, the use of a placebo, like randomization, is justified by the imperative to employ valid design. It is not always ethical to employ a placebo – for example, if withholding treatment would place subjects at risk. Some clinical trials continue to be performed using placebos in questionable situations. Rothman and Michels [1290] discussed this problem and offered some examples. The choice of an appropriate control depends, in part, on the nature and strength of physicians’ belief about the treatment in question and alternatives to it. Except for trivial situations, it is not appropriate to replace effective therapy with a placebo, even with “informed consent.” It is better to evaluate the new treatment against standard therapy without compromising the benefit to the subjects in the control group. Placebos and “no treatment” are not exactly equivalent. In a comparative design, subjects might be randomized to a treatment, 𝐵, or its placebo, 𝑃 . If these were the only treatments, then 𝑃 would be a “no treatment” arm. However, if there is an active treatment, 𝐴, that ethically should be administered to everyone, then 𝐵 might still be rigorously tested by randomizing between 𝐴 + 𝐵 and 𝐴 + 𝑃 . Thus, tests of the incremental effects of a new treatment can often be placebo controlled. If 𝐵 is a potential substitute for 𝐴, then a direct comparison, with the need for a placebo, might be warranted. There are circumstances in which we might be comfortable not using standard therapy, and comparing a new treatment to a placebo. This might be the case, for example, if the standard treatment is weakly effective and associated with high morbidity and the new treatment being tested has low morbidity. Such a situation is not unique to trials because patients and physicians could sensibly refuse any treatment where the risk–benefit ratio is unfavorable. Also, placebo controls might be scientifically and ethically justified when new evidence arises suggesting that standard therapy is ineffective. The placebo question is more contentious for surgical treatments, for which a true placebo requires a sham procedure. Sham procedures stereotypically seem to place subjects always in an unacceptable risk–benefit circumstance, and therefore would be unacceptable. This perspective is not sufficiently refined and is discussed further in Chapter 4. Shams are sometimes required for device trials. A perspective on their importance is given by Redberg [1252].
Demonstration Trials When ample evidence already exists or investigators are already convinced that a particular treatment is best, it may be unethical to conduct a trial to convince colleagues of its superiority. Such studies are sometimes termed “demonstration trials” and are subject to comparator bias [982]. There are situations where not all practitioners are convinced by the same evidence or same standard of evidence. This happens often in different countries and/or depending on strength of prior belief and local standards of practice. Ethical standards are also local, and some studies are acceptable to physicians in one culture but viewed as inappropriate in another. In any case, most practitioners would discourage comparative trials performed only to strengthen evidence. Sometimes the standard of care contains procedures or treatments that are unnecessary. Electronic fetal monitoring during labor may be an example. A demonstration trial may be appropriate in such circumstances. More generally, demonstration trials to show equivalence are less problematic than those intended to show a difference.
3.6 PROFESSIONAL CONDUCT

3.6.1 Advocacy
There is no perfect place in this book to discuss advocacy. I discuss it here because it relates strongly to both individual and collective rights even though it is not formally part of the ethics calculus. Advocacy for the most part is a type of professional conduct that seems to be relatively free of guidelines. Advocacy is simply the action of individuals or organizations to influence others. Advocates in the clinical trials domain can be physicians, patients, participants, or other interested individuals. Often they are family members or other individuals close to someone with a disease or condition who organize in ways that allow them to be effective in the dialog about research. Advocates leverage their voice by using a few fundamental tools: organization, information, and money. Organization may consist of not-for-profit entities, simple visible coordinated activities, or support for patients and families. Information is increasingly centered around an Internet presence, but may target patients, families, or political leaders. And money can be used directly to support activities of the advocates or as philanthropy to influence the interest or focus of researchers. It is generally considered honorable to draw attention to the plight of those with illness, and advocates for virtually any such cause enjoy a positive image in society. In some cases societal injustice may be corrected by the influence and awareness created by advocacy. However, advocacy is not inherently guaranteed to yield universal positive or useful effects for no other reason than that it is an unregulated human activity driven by self-interest. As related to clinical trials, advocacy has had some interesting history. It comes almost always packaged in a specific disease context, often for less visible or rare diseases. The purpose of advocacy has not been necessarily to encourage participation in trials. More often the purpose is to increase funding or attention. Early in the AIDS epidemic, for example, advocacy drew attention to the need for more research, sometimes using socially disruptive activities. It also called specifically for the dismantling of the traditional clinical trial process to allow HIV patients to receive investigational new drugs only because they might work, but explicitly not for the primary purpose of evaluation.
Hence, it was not advocacy for research. Advocacy also facilitated and encouraged drug sharing and underground pipelines that no doubt complicated or invalidated some trials. Only later as the epidemic became more chronic and some treatments showed promise did adherence to good experimental principles become an element of AIDS advocacy.

Some advocacy focuses on the regulatory process for several reasons. First, it is a final common pathway where many scientific issues surface publicly. Second, the process attracts attention because drugs, biologicals, and devices are imperfect entities, and the ideal balance between speed, completeness, and reliability is impossible to attain. Third, nearly every therapeutic entity has both proponents and detractors. Thus, every regulatory decision leaves some room for debate. This is a perfect setting for advocates who often take formal stands in the public portion of regulatory proceedings. Only occasionally are there impartial calls for regulatory process changes based on objective reasoning.

Advocacy can have positive effects
Despite the fallibilities of self-interest, positive effects of advocacy include support for disease sufferers and their families, increased public awareness and acceptance, increased funding and leverage of existing funding for research, and education about alternatives and the research process. Philanthropic advocacy often is more risk accepting for innovative research ideas or approaches. In some cases, advocacy works to increase participation in clinical trials, which can leave a priceless legacy.

A positive effect of advocacy is when others are made aware of important nuances or details of the history of someone with a disease. This can increase both awareness and tolerance. There is much reinforcement for this case history approach especially in the lay public. As is often the case, a strength can also be a weakness. In professional forums, retelling of personal case histories has diminished value. Advocates sometimes miss valuable opportunities to transmit enduring collective messages to professional audiences.

And some potential flaws
One does not have to be an expert to be an advocate. But the tools of advocacy, such as visibility and money, can imply otherwise, especially to patients who may be desperate for answers or alternatives. Celebrity advocacy is the best illustration of this.

Advocacy can create competition when resources are fixed, as is typically the case. In the restrictive public funding climate today, one cannot increase funding for HIV research or cancer without reducing funding for neurological disease, diabetes, or some other condition. Advocates often dismiss such issues by saying that they ask for increased funding. That often happens but actually does not remove the concern. We cannot say what is the correct comparative basis for research funding – is it dollars per case of disease, per death from illness, or per life-year lost? What should be the role of political connectedness, public sympathy, or emotion? No matter which our society chooses, there will be inequities in the allocation of research and clinical trial funding across diseases that need our attention. For example, many people think that breast cancer is the leading cause of death from cancer in women. This is partly the by-product of the strong visible advocacy for breast cancer research today. Lung cancer, which is presently the biggest cancer killer in women, is a much less attractive target for advocacy.
Yet it might be far easier to
eliminate lung cancer deaths because we understand more about the behavioral issues and topical carcinogen than we do about breast cancer biology. A by-product of reducing lung cancer mortality in women is that it would likely have immediate relevance for the other half of the population. Is the current balance of advocacy for these two risks one that we would design rationally? Aside from funding for research, advocacy for a single issue may not allow leveraging of important scientific leads or opportunities that arise in seemingly unrelated areas. AZT, the first drug to show activity against HIV, was developed (and essentially dismissed) as an anti-cancer drug. We can’t predict very well where the best leads will arise, but they may not come from within the disease setting. For these and other reasons, advocacy does not always have a perfect relationship with researchers. In fact, advocacy need not have research goals at all. Advocacy groups also create their own culture and need to preserve their own roles. It once came to my attention discussing a design question that an advocacy group for a certain disease was publicly encouraging an unproven treatment. Appropriately, that treatment was to be excluded from the control arm in a trial testing a much more promising agent. By advocating to include unproven therapy, the group was inhibiting a definitive trial that might have yielded true progress against the disease. Such circumstances are unusual for sure, but there is no fundamental accountability for the expertise or morality of any advocacy. The usual face of advocacy in clinical trials is a desire for advocates to participate in the design and discussion of research approaches for their disease. Such participation can bring awareness to the researchers of perspectives and issues that might improve the usefulness or size of a clinical trial. There is a long-standing role for lay public representation on Institutional Review Boards and Ethics Committees. This does not literally mean an advocate in the sense of disease orientation, but might be interpreted as a general sort of advocacy for the public perspective. Given the diversity of research topics encountered by most IRBs, it would not seem appropriate to have a particular disease advocate on the committee.
3.6.2 Physician to Physician Communication Is Not Research

Physician to physician, or professional, communications (PC) regarding trials occur in multiple epochs: during the development of research ideas, when trials become available for human subjects participation, during the execution of a trial, with public presentation, and as part of the long-term use of trial results. Here I will concentrate on the first two of these.

There are occasional concerns regarding the need for IRB supervision of PC. The best gauge for this concern is the modern good practice of clinical trial registration in places like clinicaltrials.gov. This type of registration now carries a scientific and moral imperative, and tends to occur relatively early in trial implementation. PC and registrations are not human subjects research, even though they may be essential to it. They specifically are not regulated by Institutional Review Boards. PC begins early and a good clinical trial cannot develop without these types of discussion. Because of support groups and advocates, awareness of the developing trial in the lay public and among individuals with a disease may be quite extensive. This
would seem to offer only an opportunity for constructive discussion, although there are occasional criticisms. It is typical for pharmaceutical companies to keep clinical trial protocols confidential throughout their internal development process. At some point, however, the protocol document appears in the research offices of academic centers. Then many diverse research staff need to be aware of the details, including administrators, research nurses, clinical research associates, physicians, and large numbers of institutional review committee members. Maintaining strict confidentiality is unrealistic and unnecessary at this point.

When trials reach the point of IRB review or are being considered for inclusion in an institution's research portfolio, a few details may be disseminated to professionals regarding pending availability. This is specifically not advertisement or subject recruitment, which would be IRB regulated. A clinical trial should be registered nationally and visible publicly at this time. The national registration record is typically much more detailed than what PC contains. Nowadays paper listings, e-mail blasts, and web postings for professionals are common. Of course some of these could find their way into the hands of prospective participants, but the information contained in them is simply too minimal to be useful to anyone but a referring physician.

3.6.3 Investigator Responsibilities
The scope of responsibilities that an investigator purposely or unwittingly shoulders is enormous. On a philosophical level, the list of responsibilities includes the following:

1. Clinical and research competence
2. Moral and intellectual integrity
3. Impeccable ethics standards
4. Dogged approach to documentation and data collection
5. Healthy skepticism
6. Patience
7. Perseverance
8. Obligation to publish.
On a practical level, responsibilities that fall to the investigator are shown in Table 3.5. Each topic has a large number of important details, and most of them are ongoing responsibilities throughout the trial and sometimes afterward. Any detail implied by Table 3.5 has the potential to damage or halt a trial if managed poorly. Aside from that potential, executing these responsibilities well provides public assurance that the data and the reported results are credible and accurate, and ensures protections for the rights and confidentiality of participants. There is no single source for all the responsibilities listed, but many details are in the Code of Federal Regulations Title 21. Details are also provided in an FDA Guidance [358]. GCP guidelines state that an investigator must be qualified by education, training, and experience to assume responsibility for the proper conduct of a clinical trial. It should be evident from the long list of responsibilities that contemporary education and training
TABLE 3.5 Some Categories of Responsibility for the Clinical Trial Investigator

Scientific expertise                    Personnel management
Clinical expertise                      Execution of protocol
Protection of human subjects            Amendments
Clinical care of subjects               Communications
Good clinical practice (GCP)            Progress reports
IRB communications                      Safety reports
HIPAA                                   Study website
Compliance and institutional review     Record-keeping
Conflict of interest                    Manuscript preparation
Resource acquisition                    Final report
Trial registration                      Data sharing
do not cover everything, emphasizing the need for a considerable mentored experience. Such experiences are not so easy to acquire. Investigators are usually well into their faculty years before becoming comfortable with this range of duties. It should be noted that commercial, government, or academic sponsors of trials have a similar but different list of responsibilities.

3.6.4 Professional Ethics
Conflict of Interest Objectivity is an unattainable ideal for all scientists, especially those working in most clinical research studies. Elsewhere I have discussed the need for using methods, procedures, and outcomes that encourage objectivity. It is equally important to be objective regarding behavioral factors that affect investigators themselves. Breakdown in objectivity often arises from conflicts of interest, a descriptive term that many organizations also define explicitly. There is much evidence that objectivity in clinical research can be degraded by commercial interests [43]. The generic approach to conflict of interest is to define it in financial terms. For most scientists and research settings, this can be a serious error. Most of us are far more likely to encounter influential intellectual conflicts of interest than financial ones. The actual currency of clinical research is ideas, in which it is all too easy to become invested and lose balance. Even when financial conflicts are managed according to policy, as most are nowadays, there can remain a residual intellectual conflict. This might help explain observations, such as Refs. [138, 934], that industry-sponsored research is more likely to result in positive findings or those favorable to the sponsor. Concerns over this issue are heightened by the fact that many formal collaborations between medical schools and industry sponsors do not follow guidelines for data access and publication set up by the International Committee of Medical Journal Editors [1342]. The problem is acute enough that there are discussions that drug industry research should not be published in scientific journals [384, 1409]. The Code of Federal Regulations (42 CFR 50) specifies rules to ensure objectivity in research sponsored by the Public Health Service. Investigators must disclose significant financial interests defined as equity exceeding $10,000 or 5% or more ownership that would reasonably appear to affect the design, conduct, or reporting of research supported
by the PHS. Personal financial interests as well as those of a spouse and dependent children are included. Significant financial interest is defined as anything of monetary value including salary, consulting fees, honoraria, equity such as stocks or other types of ownership, and intellectual property rights such as patents, copyrights, and royalties. Significant financial interest does not include income from seminars, lectures, or teaching sponsored by public or not-for-profit organizations below the thresholds cited above. Management of financial interests that could create bias may require investigators to reduce or eliminate the holdings, recognition, and oversight by the research institution, public disclosure, or modification of the research plan. Investigators can anticipate such problems and disqualify themselves from participation in studies that present the opportunity for bias. Many experienced investigators have potential conflicts, often of a minor or indirect nature. An example is when their university or institution benefits directly from a particular study, although they do not. Such cases can often be managed simply by public disclosure. Research misconduct or fraud is an extreme example of loss of objectivity. This important subject is discussed in Chapter 26. Misconduct can have many contributing factors, but a lack of objectivity about the role of the investigator is often one cause. Although often unrecognized, competing research interests are a potential source of lack of objectivity. Consider the investigator who has multiple research opportunities, or a mix of research and “practice” options, available to patients. For example, there might be competing clinical trials with the same or similar eligibility criteria. This can lead to the physician acting as an investigator for some patients, a different investigator for others, and as a practitioner for still others. If treatment or research preferences are driven partly by financial concerns, reimbursement issues, ego, research prowess, or other pressures, the potential for conflicting interests is high. Although this situation is common in academic medical centers where many studies are available, investigators often do not recognize it as a source of conflict. A solution to such problems is facilitated by non overlapping investigations, a clear priority for any competing studies, and not concurrently conducting research and practice in the same patient population. This may mean, for example, that when conducting a clinical trial, the physician might best refer ineligible subjects or those declining participation because of a treatment preference to a trustworthy colleague not participating in the study. There are two principal methods for dealing with conflicts of interest. The first is to decline to participate in any activity where one’s conflict of interest would be seen as detrimental to the process. The relevant perceptions are not necessarily our own, but the view likely to be held by others, particularly the public. The second method for dealing with conflicts of interest is disclosure to others involved in the process. This allows them to compensate or replace the affected individual if they deem it necessary. Disclosure may need to be an ongoing process as one’s personal or professional circumstances change, or as the issues and climate of the collaboration evolve.
Professional Statistics Ethics The ethical conduct of statistical science does not have a high profile among either the public or other scientific disciplines that draw heavily from its methods. However, professional societies have examined ethical issues surrounding the practice of statistical methods and have well-evolved guidelines for the profession. The American Statistical
Association (ASA), the Royal Statistical Society (RSS), and the International Statistical Institute (ISI) have all published guidelines for the conduct of their members [31, 771, 1293]. These guidelines are similar in content and spirit. The ASA is the oldest professional society in the United States. Its Committee on Professional Ethics perceives the potential benefit and harm from the use of statistical methods in science and public policy. Circumstances where statistical thinking can be misused include not only clinical trials and other areas of medicine but also statistical findings presented as evidence in courts of law, and some political issues. The ASA’s ethics guidelines followed those of Deming [352], one of its most prominent members, and were subsequently revised through 2016. Although the Society for Clinical Trials is more directly concerned and involved with clinical trials and has many statisticians as members, they have not dealt directly with professional ethics of trialists or statisticians in this context. The existing guidelines for statisticians do not articulate fundamental professional ethical principles directly in the way that has been done for subjects’ rights in medical research, nor do they deal directly with clinical trials. However, the message implicit in the guidelines is similar in spirit to those outlined earlier in this chapter, and is relevant to statistical practice surrounding clinical trials. These include investigator competence, disclosure of potential conflicts, confidentiality, documentation of methods, and openness. The ISI guidelines articulate the shared values of respect, professionalism, truthfulness, and integrity. The ASA Guidelines cover eight topics, including professional integrity and accountability; integrity of data and methods; responsibilities to science, public, sponsors, and clients; responsibilities to research subjects; responsibilities to research team colleagues; responsibilities to other statisticians; responsibilities regarding allegations of misconduct; and responsibilities of those employing statistical practitioners. For the present discussion, responsibilities to research subjects are paramount. Some details under that topic are as follows:
· Know and adhere to rules for the protection of human subjects
· Avoid either excessive or inadequate numbers of research subjects
· Protect the privacy and confidentiality of research subjects
· Be aware of legal limitations on privacy and confidentiality assurances
· Before participating, consider whether appropriate research subject approvals were obtained

3.7 SUMMARY
Clinical trials highlight some of the competing obligations that physicians and patients face in health care today. There is potential for clinical trials to be poorly planned or conducted, compromising the ethical treatment of research subjects. This potential exists in many areas of medical practice and must be counterbalanced by safeguards and standards of ethical conduct. Features of some clinical trials such as randomization and the use of placebos illustrate clearly competing obligations. However, trials do not necessarily generate unique ethical concerns nor make them unsolvable. The history of medical experimentation in the twentieth century illustrates the potential for infringing on the rights of patients and an evolving standard for conducting medical
research of all types. Protections for subjects are based on international agreements and guidelines, governmental regulations, institutional standards and review, and the ethical principles of autonomy, beneficence, and justice. Careful practical implementation of these protections usually yields clinical trials that are ethically acceptable. Ethical norms, especially those for clinical trials, appear to be culturally dependent. There are circumstances where the most ethical course of action for the physician is to recommend participation in a clinical trial. This may be the case when the physician is genuinely uncertain about the benefit of a treatment or the difference between alternative treatments. Opinions about treatments based on evidence of poor quality are not ethically legitimate, even if they are firmly held. Opinions based on scientifically weak evidence are subject to influence by financial, academic, or personal pressures. Clinical trial statisticians are also held to standards of professional conduct. These standards require competence, confidentiality, impartiality, and openness. Like other investigators, the statistician must demonstrate an absence of conflicts of interest.
3.8 QUESTIONS FOR DISCUSSION
1. Financial and academic pressures appear not to be the primary ethical conflict in clinical trials. Is this accurate or not? Discuss how these pressures can compromise the rights of the research subject.

2. Some clinical trials aren't conducted in the United States because of either ethical concerns or lack of patient acceptance. Trials are sometimes more feasible in other countries even when conducted by U.S. sponsors. An example is the pertussis vaccine trial [631, 906, 995]. Comment on the ethics of this practice.

3. Are risks and benefits evenly distributed between trial participants and future patients? Discuss your point of view.

4. In recent years there has been heightened concern over potential differences in treatment effects between men and women, especially if a trial enrolled subjects of only one sex. In some cases trials have been essentially repeated by sex, such as the Physicians' Health Study [1199] and the Women Physicians' Health Study [513]. When is this practice ethical and when is it not? Are there ethical concerns about resource utilization?

5. There is a need for physicians to conduct research in some settings where patients (and their families) can neither be informed nor give consent in the conventional manner. An example might be in the emergency department for cardiopulmonary resuscitation or head injury. Discuss whether or not clinical trials can be conducted ethically in such circumstances.

6. Developing interventions and conducting clinical trials in specific disease areas like medical devices, biological agents, surgery, AIDS, cytotoxic drugs, and disease prevention can be quite different. Discuss how clinical trials in these various areas raise different ethical issues.
4 CONTEXTS FOR CLINICAL TRIALS
4.1 INTRODUCTION
The purpose of this chapter is to discuss the way clinical trials are used in different medical contexts. Context is a methodological subculture, which is partly a consequence of treatment modality and partly the product of the medical specialty. The specific contexts discussed are drugs, devices, surgery, complementary and alternative medicine (CAM), and prevention. My intent is not to fragment principles but to illustrate common themes in diverse areas and explain some of the differences in usage of clinical trials. Aside from design specifics, context helps understand history, ethics, interpretation, and frequency of studies performed. Statistical principles of experiment design are very reliable and highly evolved through service to agriculture, industrial quality control, reliability testing, and medicine. Trials have been used for decades to develop and test interventions in nearly all areas of medicine and public health [1139, 1437]. In recent decades, they have been very actively employed in cancer, cardiovascular disease, and HIV infection. These areas have shared a need to evaluate new drugs, drug combinations, devices, surgery, other treatment modalities, and diagnostics. However, trial methods can also be used to assess treatment algorithms, making them applicable to a wide array of adaptive therapeutic and prevention questions. Clinical trials have not diffused evenly into all medical disciplines or contexts, a result of demands by practitioners and patients, and key external pressures. Pressures by practitioners and patients are different in various specialties and diseases, affecting the perceived necessity and acceptance of clinical trials. For example, the training of practitioners in some disciplines may not cultivate a reliance or insistence on clinical trials as rigorous evaluation tools. Also, patients may place inconsistent demands on different therapies or those administered at different points in the progression of disease.
A good example is treatment for cancer. Soon after diagnosis most patients demand cutting-edge and safe therapies. If the disease progresses, the same patients may turn to alternative therapies that appear safe but are marginally effective or completely unproven, without the same demands for demonstrated efficacy. This illustrates that patients’ risk– benefit assessments change as a disease advances. Different contexts also experience distinct key external pressures. These influences arise from regulation, the pace of therapeutic development—which itself may depend on scientific developments in other fields—chance, and economics. Regulation of drugs is a major determinant of the use of clinical trials in that context (discussed extensively below). Innovation and rigor are frequently at odds with the external pressure of cost control. Clinical trials are often inhibited when health insurance companies are reluctant to pay for routine costs surrounding experimental therapy. For insurers, the definition of experimental therapy can be quite broad. For example, if one compares two standard therapies for a disease in a randomized clinical trial, an insurance company might refuse to pay for either therapy or the associated costs, under the claim that the treatments are “experimental.” In contrast, the same insurer might pay for either treatment outside of the trial. Fortunately, this double standard is lessening. In 1998, the state of Maryland was among the first to pass legislation requiring insurers to pay for nonexperimental patient care costs incurred during clinical trials. This covers parallel costs or care that would have been reimbursed as standard of care. The National Emphysema Treatment Trial [1091] (discussed in more detail in Section 4.6.6) is a study in which the Center for Medicare and Medicaid Services (CMS, formerly the Health Care Financing Administration) agreed to reimburse for lung volume reduction surgery (only in the context of the randomized trial). At the federal level, in June 2000, there was an executive memorandum from the U.S. President to CMS to cover routine costs of qualifying clinical trials. Advocacy groups are pursuing federal legislation that will obligate all third-party payers to do the same. A more recent trial of long-term oxygen treatment (LOTT) for patients with emphysema [1452] also partnered with CMS, but without the strict requirement of NETT that only participants on the trial could receive treatment via randomized assignment. Extensive health care reform ongoing at the time of this writing will certainly affect clinical trial support as well as innovation widely.
4.1.1 Clinical Trial Registries
Learning about the scope of clinical trials in any field is essential. In the United States, cancer practitioners and researchers can obtain details regarding many clinical trials through the Physician Data Query (PDQ) system [1088, 1089]. Some of the information therein often is transmitted to the patient. PDQ was begun in 1984 and reflects an effort to make physicians treating cancer patients aware of the studies being performed, expert opinion about the trials, and sketches of the treatment protocol [745]. Currently, there are over 8000 active and 19,000 closed cancer clinical trials listed in PDQ, and these are updated bimonthly. A similar resource is available for AIDS trials at the AIDSinfo internet site [1098]. The U.S. government maintains a patient resource registry of clinical trials that is not disease specific, maintained by the National Library [1103]. It contains about 220,000 studies at thousands of clinical sites nationwide and in 193 countries. Almost 40,000
TABLE 4.1 Websites for Some Clinical Trials Registries Around the World as of 2016

Country                        URL
Australia and New Zealand      http://www.anzctr.org.au/
Brazil                         http://www.ensaiosclinicos.gov.br/
Canada                         https://health-products.canada.ca/ctdb-bdec/indexeng.jsp
China                          http://www.chictr.org.cn/enIndex.aspx
Cuba                           http://registroclinico.sld.cu/en/home
Europe                         https://www.clinicaltrialsregister.eu/
Germany                        http://drks-neu.uniklinik-freiburg.de/drks web/
India                          http://ctri.nic.in/Clinicaltrials/login.php
Iran                           http://www.irct.ir/
Italy                          http://www.agenziafarmaco.com/en/content/clinicaltrials
Japan                          https://center2.umin.ac.jp/
The Netherlands                http://www.trialregister.nl/trialreg/index.asp
South Africa                   http://www.sanctr.gov.za/
South Korea                    https://cris.nih.go.kr/cris/en/index.jsp
Sri Lanka                      http://www.slctr.lk/
Thailand                       http://www.clinicaltrials.in.th/
Turkey                         http://vistaar.makrocare.com/reg-updates/59turkey-develops-clinical-trials-turkey-database
United States                  clinicaltrials.gov
World Health Organization      http://apps.who.int/trialsearch/
ISRCTN                         http://www.isrctn.com/
of those studies are presently recruiting participants. The European Union has a similar registry [432]. Presently, there are almost 20 such registries worldwide as listed in Table 4.1. The Cochrane Collaboration is an international registry of all randomized clinical trials performed in specified disease areas. It is named after the epidemiologist Archie Cochrane (1909–1988) [273, 274] who first suggested the idea in the 1980s [374]. Since 1993 numerous investigators around the world have been exhaustively compiling the studies and data that will facilitate meta-analyses (Chapter 24) and high-quality evidence-driven treatment decisions. This database is of great use to researchers and practitioners. Additional details can be found at the Cochrane Collaboration Web site [275]. These resources sometimes depend on voluntary submission of information by researchers or sponsors of trials and are incomplete. Lack of public awareness of certain trials and their results can have serious consequences. For example, regulatory decisions regarding safety, efficacy, or labeling could be affected by incomplete information about study results [1444]. The same might be said of more routine therapeutic decisions in circumstances where rigorous evidence is scarce. In 2004 the International Committee of Medical Journal Editors (ICMJE) announced a new requirement effective July 2005 for registration of clinical trials in a public registry as a prerequisite for publishing results in their 11 member journals [344, 345, 767]. The 1997 FDA Modernization Act required
registration of trials in the National Library of Medicine database. Originally only half of industry sponsored trials and most government sponsored trials were registered [1445]. The registration requirements by the ICMJE strongly increased the proportion of trials registered after 2005 [426, 869]. The World Health Organization (WHO) also requests trial registration in their International Clinical Trials Registry Platform (ICTRP) [1583, 1584]. WHO and ICMJE utilize the unique International Standard Randomized Controlled Trial Number and the corresponding ISRCTN registry [66]. As useful as the registries are, their proliferation can be a problem because it increases the number of places one has to search for trials of a given type.
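Most of these registries can be searched programmatically as well as through their websites. The short sketch below (Python; not from the text) illustrates the idea for one registry. It assumes the ClinicalTrials.gov v2 REST endpoint and its JSON field names, which are stated here as assumptions and should be checked against the current API documentation before use:

import requests

def search_trials(condition, max_records=20):
    """Return (NCT id, title, status) for registered studies matching a condition."""
    url = "https://clinicaltrials.gov/api/v2/studies"   # assumed endpoint
    params = {"query.cond": condition, "pageSize": max_records}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    results = []
    for study in response.json().get("studies", []):
        protocol = study.get("protocolSection", {})
        ident = protocol.get("identificationModule", {})
        status = protocol.get("statusModule", {})
        results.append((ident.get("nctId"), ident.get("briefTitle"),
                        status.get("overallStatus")))
    return results

if __name__ == "__main__":
    for nct_id, title, status in search_trials("emphysema"):
        print(nct_id, status, title)

A thorough search for trials of a given type still requires repeating such queries against several registries, which is the proliferation problem noted above.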
4.1.2 Public Perception Versus Science
The most consequential difference in perspective for clinical trials is, in my opinion, the scientific versus public view. This dichotomy is seldom discussed but is important for several reasons. First, the two perspectives are often starkly different because of priorities and understanding. Second, it is the public who eventually participate in trials (or not), providing definitive evidence of how they are perceived. Third, the difference in views is often overlooked by investigators who have a difficult time with perspectives other than science, even when a study encounters important nonscience issues like fraud, conflict of interest, funding, or ethics. Public perception is more important than science whenever a trial is stressed, controversial, or held up as a bad example. Many examples of this dichotomy can be found by reading exoteric accounts of trials and following them back into the scientific literature and back in time (e.g., Ref. [1363]). The public interfaces clinical trials via NIH, FDA, philanthropy, and other less direct ways. Within the science domain the relationship between the frequency, quality, and nature of clinical trials and the context in which they are performed can be assessed using a framework based on the following characteristics:
· role of regulation for the therapy,
· ease with which observer bias can be controlled (e.g., placebos),
· uniformity, or lack thereof, of the treatment,
· expected magnitude of treatment effects,
· relevance of incremental improvement,
· general assessment of risk versus benefit, and
· tradition and training of practitioners, especially with respect to the scientific method.
Table 4.2 provides an overview of the characteristics of some contexts discussed in this chapter with regard to this framework. For specific trials, there may be additional factors explaining their acceptance or lack thereof, such as patient acceptance or window of opportunity. However, the perspective in the following sections regarding details of contexts and characteristics is more universal.
TABLE 4.2 Context Characteristics with Implications for Clinical Trials

Context Characteristic            Drugs       Devices     Prevention   CAM(a)     Surgery
Strength of regulation            Strong      Moderate    Strong       Minimal    None
Ease of bias control              Easy        Difficult   Easy         Easy       Difficult
Treatment uniformity              High        High        High         Low        Low
Likely effect size                Small       Large       Moderate     Small      Large
Use of incremental improvement    Common      Common      Minimal      None       Common
Short-term risk–benefit           Favorable   Varied      Favorable    Unknown    Varied
Long-term risk–benefit            Varied      Varied      Favorable    Unknown    Favorable
Tradition for rigorous trials     Strong      Weak        Strong       None       Varied

(a) Complementary and alternative medicine.
4.2 DRUGS
Most of the discussion in this book pertains directly to the development and evaluation of drugs. Drugs are the most common therapeutic modality employed today, so there is ample reason for them to be the subject of many clinical trials. This context is very heterogeneous, both because of the vast universe of drug products and the different entities performing drug trials worldwide. The latter includes such diverse groups as pharmaceutical companies, single academic investigators, collaborative academic and government cooperative groups (themselves fairly heterogeneous), groups of practitioners, managed care, and various permutations of these groups in different countries. In the United States there is a dichotomy between the pharmaceutical and academic models for performing trials. The pharmaceutical perspective tends to emphasize discovery, systematic early development, market forces, the regulatory overlay, product formulation, toxicity, and liabilities. The academic view tends to emphasize development and comparison with less regard for regulatory considerations or other market forces. Pharmaceutical companies that do most of the drug development in the United States are motivated by a mixture of profit and idealized or regulatory mandated concern for public health and safety. Often the statistical tools and study designs applied by both developers and regulators are not completely adequate. The resulting mix of forces applies stress to most clinical trials, rarely making them ideal for the regulatory decisions that they support. One of the strongest motivations for performing good clinical trials when developing and evaluating drugs is the regulatory apparatus in the United States. To appreciate fully the impact that drug regulation has on the design and conduct of clinical trials, some acquaintance with the regulatory history of the Food and Drug Administration (FDA) is necessary. An introductory source for this information is Ref. [1594] or the FDA guidance documents [497]. The essential FDA perspective is that randomized controlled trials are excellent and irreplaceable tools for clarifying safety and efficacy concerns about new treatments prior to marketing approval [907].
The regulations that govern FDA control over drug products have largely been reactions to specific crises in public health and safety that result in legislation from Congress. This has limited thoughtful regulatory design. The FDA has the difficult job of balancing the conflicting needs of the public within statutory limitations. The public requires safe and effective medications but also requires that they be developed and made available in a timely fashion. RCTs are not strictly required by the FDA’s statutory “adequate and well-controlled” clause but they are the most reliable way to satisfy it. An FDA perspective on statistical issues in drug development is given by Anello [42], and specifically on oncology drugs by Hirschfeld and Pazdur [725]. An interesting perspective on drug regulation arose from the withdrawal of rofecoxib from the market in 2004. This nonsteroidal anti-inflammatory agent had been on the market since 1999 and was designed to be safer than earlier drugs in its class, although cardiovascular safety was questioned early in its history. After 5 years of use and the premature stopping of a polyp prevention trial by a monitoring committee, data were convincing that the drug was associated with an unacceptable risk of myocardial infarction or stroke [479, 1484]. The clinical trials on which approval was based did not adequately address cardiovascular safety, and definitive studies of the question were never undertaken. Thus, deficiencies in evidence and an unexpected finding led to the largest withdrawal of a prescription drug in history, highlighting both effective and ineffective aspects of the regulatory process. Some additional insight into drugs as a context can be gained by considering dietary supplements or botanicals, with their bewildering and faddish history. These agents, mostly natural products, are taken like drugs but are explicitly outside FDA regulation. A perspective on this is given by Marcus and Grollman [987]. The rationale for this is unclear now because of the way dietary supplements are taken, their availability and composition, and implicit or explicit health claims by manufacturers. They represent a microcosm of the drug context in the absence of regulation. The premise of safety is more a reflection of their traditional use rather than evidence from systematic trials. It is essentially impossible for patients and practitioners to know the risk–benefit of these substances because of the absence of rigorous testing.
4.2.1 Are Drugs Special?
The drug context is very heterogeneous and defies generalizations. It can be as different from itself as it is from other settings discussed below. There is little about drugs that makes them uniquely suitable or unsuitable for clinical trials. Often drugs can be relatively easily administered, but there are a large number of exceptions. Drugs are often used to gain small or short-term benefits, especially when their risk is low. Nonprescription analgesics (and the differences among them) are an example of this.

One characteristic that makes drugs amenable for comparative clinical trials is the readiness with which placebos can be used to control observer bias, at least in principle. A placebo should have the same appearance as the active drug, sometimes requiring the same color or similar taste. These are usually not significant restrictions. The feasibility of making a placebo encourages the use of rigorous clinical trials to evaluate drugs. Other modalities, such as therapeutic devices, are not as placebo-friendly. The reduction in bias that placebo controls afford enhances reliable estimation of small treatment differences.
There are characteristics of drugs that can significantly complicate the design and conduct of clinical trials. These include the proper scheduling or relative timing of treatments and titration of the therapeutic ratio. The best schedule of drug administration to satisfy simple optima (e.g., constant serum level) can often be determined from pharmacokinetic experiments. However, if efficacy depends on some other measure of exposure, such as peak concentration or time above a threshold, the schedule problem can be more complex. The situation can be intractable to theory when drug combinations are used because there may be interactions that cannot be predicted from simple time–concentration data.

Another relevant characteristic of drugs is that they typically have uniform formulation and manufacture. This is usually the case even in developmental clinical trials. This uniformity simplifies many trial design considerations and provides reassurance for inference and external validity, especially for single-agent studies. It also means that the only feasible customization of many drugs for the individual is through dosing. For example, the therapeutic ratio is almost never under direct investigator control. Sometimes a change in formulation may improve the therapeutic ratio. This contrasts with radiotherapy, discussed below, where the therapeutic ratio can be controlled. One interesting exception is regional perfusion of drug to maximize exposure to a limb or organ, while minimizing toxicity to the remainder of the body. This technique is used for the treatment of some cancers. Even this complex procedure affords only crude control over the therapeutic ratio because the target is imperfectly isolated from the normal tissues and some drug escapes systemically.

Doses for many drugs are established early in development based on relatively small experiments. This is usually optimal from a developmental perspective. However, the dose question sometimes needs to be addressed in comparative studies because of changes in formulations, better ancillary care, side effects, or other practical reasons. As outlined in Chapter 12, investigators must distinguish sharply between the dose–safety and the dose–efficacy relationship.
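To make the scheduling discussion above more concrete, the following small computational sketch superposes first-order elimination from repeated bolus doses under a one-compartment model and compares the time spent above a target concentration for a single daily dose versus the same total dose divided every 8 hours. All parameter values (dose, volume of distribution, half-life, and the target level) are hypothetical and purely illustrative; the point is only that the same total dose can produce quite different exposure profiles depending on schedule.

import numpy as np

def concentration(times, doses, dose_times, v_d=20.0, half_life=6.0):
    # Superpose first-order decay from bolus doses (absorption phase ignored).
    k_e = np.log(2) / half_life                      # elimination rate constant (1/h)
    conc = np.zeros_like(times)
    for dose, t0 in zip(doses, dose_times):
        conc += np.where(times >= t0, (dose / v_d) * np.exp(-k_e * (times - t0)), 0.0)
    return conc

times = np.linspace(0, 24, 2401)                     # one day at 0.01-hour resolution
threshold = 2.0                                      # hypothetical target concentration (mg/L)

once_daily = concentration(times, [240], [0])                   # 240 mg every 24 hours
every_8h = concentration(times, [80, 80, 80], [0, 8, 16])       # same daily dose, divided

for label, conc in [("240 mg q24h", once_daily), ("80 mg q8h", every_8h)]:
    hours_above = 24 * np.mean(conc > threshold)
    print(f"{label}: peak {conc.max():.1f} mg/L, about {hours_above:.1f} h above target")

In this toy calculation the divided schedule holds the concentration above the target for longer despite a much lower peak, which is exactly the kind of trade-off that cannot be resolved without specifying which measure of exposure drives efficacy.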
4.2.2 Why Trials Are Used Extensively for Drugs
Regulation
Government regulation of drug treatments requiring them to be safe and effective has been constructed in response to various unfortunate events (e.g., thalidomide and birth defects). Nearly all drugs expose the entire body to their actions, creating the potential to produce side effects in unexpected ways. Regulation is the force that balances this risk against evidence of health benefit and ignorance, meaning lack of knowledge about harm. The need for regulation is created, in part, because such treatments are advertised and marketed on a large scale. It remains probably the most dominant influence for the rigorous testing of drugs.

Because the universe of drugs is so large, regulatory considerations for them are not uniform, and allow for the basis of drug approval to depend on clinical circumstances. For example, Subpart H of the regulation allows approval on the basis of a surrogate endpoint or on a clinical endpoint other than survival or irreversible morbidity (21 CFR 314.510) [1502]. Approval can also be restricted to assure safe use (21 CFR 314.520) [1502]. Subpart E is intended to make drugs for life-threatening illnesses available more rapidly by encouraging an early collaboration between sponsors and the FDA to develop
efficient preclinical studies and human trials that could lead to approval without large randomized studies (21 CFR 312.80-312.88) [1502]. Early in the AIDS epidemic, alterations in the drug approval process hastened getting promising agents to the public to fill unmet medical needs. Another example is drugs to treat cancer, where some unique characteristics of the disease and its treatment also change the regulatory perspective. The oncology setting is characterized by a very heterogeneous life-threatening disease, multiple modes of therapy, a unique perspective on risk–benefit and serious adverse events, specialists trained to use dangerous drugs, investigative nature of the discipline, wide variety of products used, and relatively high risk of drug development. These features make oncology drug regulation subjectively somewhat different than for other settings, with frequent interest in pathways for rapid approval.

Tradition
There is a strong tradition of experimental trials for drugs, especially in disease areas such as cardiovascular disease and cancer. Because of the training of practitioners, the culture is respectful of the need for trials and good study design. There is also considerable public expectation that prescription drugs will be safe and effective. Perhaps because of perceived risk, the expectation for drugs is much higher than for, say, alternative medicine therapies.

Confounding
Uncontrolled drug studies are subject to the same selection bias and observer bias as other contexts. For drugs that do not produce strong side effects or stress the recipient, selection bias might be a slightly reduced concern. In any case, there is still a strong potential for observer bias. In short, drugs do not appear to be special with regard to the potential for confounded treatment effects.

Incremental Improvement
Drugs are often designed by incremental improvement of existing agents. Modifications to the compound might reduce unwanted side effects, increase absorption, or prolong half-life, for example. Usually even small changes in a drug require rigorous testing. Minor chemical modifications to a molecule can substantially alter its behavior and efficacy, as has been amply illustrated by drug development. Thus, incremental improvement does not provide a vehicle for minimizing the need for rigorous testing of new agents.

Masking and Placebos
Drugs are almost universally suited to the use of masking and placebos. The clinical setting or features of a drug may prevent placebo use, but the principle holds. Only occasionally do logistical difficulties prevent the use of a placebo.

Economics
The economics of drug therapy is tightly bound to regulation. Despite the high cost of getting a new drug to the commercial market and the high cost to the patient of some newer drugs, they often remain very cost-effective therapies. This encourages their use. Individual practitioners usually do not profit directly from the use of a particular drug, making the economic incentives somewhat different than for procedure-oriented
therapies. These factors tend to make the economic considerations supportive of rigorous testing.

Psychology
The psychological acceptance of most drugs by patients is relatively high, particularly in the short term. They often represent a convenient, cost-effective, safe, and effective solution to symptoms and underlying causes of disease. Analgesics and antibiotics might be typical examples of this. For longer term use, even mild side effects can diminish acceptance or decrease compliance. Knowing that a drug has had fairly rigorous testing for safety and efficacy is a substantial benefit to the psychology of its acceptance. This firmly supports clinical trial testing during development.
4.3 DEVICES
The Food, Drug, and Cosmetic Act of 1938 defines a medical device as an instrument, apparatus, implement, machine, contrivance, implant, in vitro reagent, or other similar or related article, including any component, part, or accessory that is (a) recognized in the official National Formulary, or the U.S. Pharmacopeia, or any supplement to them; (b) intended for use in the diagnosis of disease or other conditions, or in the cure, mitigation, treatment, or prevention of disease in man or other animals; or (c) intended to affect the structure or any function of the body of man or other animals, and which does not achieve its primary intended purposes through chemical action within or on the body of man or other animals and which is not dependent upon being metabolized for the achievement of any of its principal intended purposes [1500].
In brief, a medical device simply is any object used in/on the body about which a health claim is made. Medical devices, loosely characterized, function through physical or electromechanical actions. This definition encompasses diverse entities, including rubber gloves, surgical instruments, cardiac pacemakers, tongue depressors, medical software, and diagnostic test kits.

4.3.1 Use of Trials for Medical Devices
Regulation
There are currently over 1700 medical devices in the United States with a market of $70 billion. Many medical devices have not been subject to the same development or comparative testing as drugs for reasons discussed below. The regulatory apparatus surrounding the development and evaluation of medical devices is important but not as familiar to many clinical trialists as that for drugs. See Ref. [753] for a brief history of device regulation and Ref. [1071] for a discussion of software regulation. It may not seem obvious that medical software should be classified as a device, but in most instances it is no less “mechanical.” Software is certainly more similar to a device than to a drug.

The regulatory climate surrounding devices is substantially different than that for drugs. Before 1976 devices could be marketed and used without demonstration of safety
and effectiveness. The FDA began regulating devices for evidence of safety and effectiveness based on the 1976 Medical Device Amendment to the Food, Drug, and Cosmetic Act of 1938. An interesting view of device regulation from a surgical perspective was given by Brantigan [178]. The summary statistics regarding device regulation presented here are derived from the work of Scott [1352].

Some medical devices are explicitly exempt from FDA clearance based on established safety, although registration, labeling, and good manufacturing practices are required (21 CFR Parts 862-892) [1502]. For medical devices that are not exempt, there are two mechanisms allowed by law under which the FDA can approve marketing. The first is premarket notification (510(k)), which is based on a device being “substantially equivalent” to an existing device before 1976. Additionally, the Safe Medical Devices Act of 1990 allows substantial equivalence claims to be based on 510(k) approvals granted after 1976. Clinical data proving safety and effectiveness are not required using this mechanism, and only about 10% of 510(k) devices have supporting clinical data. New or high-risk medical devices must be approved under a premarket approval application (PMA), which is based on data demonstrating clinical safety and effectiveness. These account for only about 10% of all devices [1352].

Because of differences in statute, regulation is not nearly as effective at maintaining rigor in device development as it is for drugs. Between the 1976 Medical Device Amendment and 2002, 102,533 devices were cleared for marketing in the United States through the 510(k) mechanism, as compared with 883 approved by the more demanding PMA process [1352]. “Substantial equivalence,” which has no statutory definition, was rejected by the FDA only 2% of the time for the 510(k) pathway, leaving the majority of medical devices to enter the market on the basis of being equivalent to something available prior to 1976 or its successor. Given the technological advancements since 1976, this seems incredible. In any case, the level of evidence supporting device approval is much lower than that for drugs, the supporting information is difficult to obtain, and there is no comprehensive documentation of safety and effectiveness for devices [1352].
Tradition
The testing of medical devices should not be, in principle, different from evaluating drugs. Some special exceptions are mentioned below. The practicalities and traditions are different, however. Device studies have a dominant concern over functional and mechanistic features as compared to clinical outcomes. In contrast, drug studies have a more even balance of formulation, bioavailability (and related aspects analogous to device function), and clinical benefit. Like surgical treatments, medical devices can be developed inexpensively by small groups of investigators. At some point in development, investigators must distinguish between device function and clinical outcome. Case series with selected patients tend to confuse the difference between device function and clinical outcome, which should remain separate. A properly designed and functioning device does not automatically produce a beneficial effect in the patient. Trials that test differences in patient outcome attributable to a device need to be as large and rigorous as those for drugs.

Because of the informality of the early developmental process and selection of favorable case reports, a device can sometimes gain widespread use without rigorous testing of clinical outcome. We may require more extensive follow-up to observe device failures than the amount of time required to see side effects from drugs or other systemically
acting agents. Thus, the risk–benefit picture may change substantially over time or take an extended period to become clear. Examples where this may have been a problem because complications became evident only after a long time include prosthetic heart valves found years later to be catastrophically failing, some intrauterine birth control devices, and silicone breast implants, which have been implicated in a variety of ailments. Rigorous developmental trials, long-term animal studies, and post-marketing surveillance studies might have ameliorated each of these problems.
4.3.2 Are Devices Different from Drugs?
There can be reasons why in vivo developmental studies of some devices may not need to be as rigorous as many trials for drugs and biological agents. First, the action of most devices is physiologically localized rather than systemic. This leads to a greater determinism in how they work, how they might fail, and the possible manifestations of failure. Examples are catheters or implantable electrodes that have a relatively small set of possible side effects or complications compared with drugs and biologicals. Second, devices are often constructed of materials that have been tested in similar biological contexts and whose properties are well known. Again, this effectively rules out, or reduces the probability of, certain complications. Examples of this are biologically inert materials, such as some metals and synthetics. Third, devices typically operate on physical, chemical, or electronic principles that are known to be reliable because of previous testing or evaluation outside human subjects. Based on these characteristics, investigators may know much more about a device at a given point in its development than is known about a drug or biological at a comparable stage. These characteristics may contribute to less of a need for extensive developmental testing of devices compared with drugs in humans, but not less rigorous testing overall.

The literature surrounding medical devices contains some justifications as to why randomized controlled trials are not needed or are actually the wrong approach. There are some reasons for the frequent use of designs other than randomized trials to evaluate medical devices. However, the basic perspective of the clinical trialist should be that rigorous, unbiased evaluation methods are appropriate for devices, and that there are important therapeutic questions that are best answered by such methods. In some instances, devices might need more extensive testing than other treatments. Historically, devices have had their share of trouble: a slightly humorous angle on this fact is provided by the Museum of Questionable Medical Devices [1077].

If devices were inherently safer than drugs, it might explain the difference in attitude toward these contexts. A device usually does not have a direct analogy to “dose,” which is one way that drugs produce safety failures or adverse effects. An important safety concern for devices relates to their initial use, such as implantation. This procedure, often a complex invasive one, may carry significant risks, which, although not literally attributable to the device, are inseparable from it. This might be associated with a relatively high short-term risk, namely that of surgery, and a lower long-term risk. The only analogous concepts for drugs would be allergic or idiosyncratic reactions, or the risks of a large loading dose. Drugs and devices both carry risks associated with duration of exposure. Apart from initial use, duration of exposure is the principal dimension by which safety fails for devices, whereas for drugs it is common but secondary to dose, allergy, intolerance, and
the like. Said another way, for drugs the probability of an adverse effect increases strongly with dose and usually less strongly with exposure time. For devices the probability of an adverse event might increase with exposure time.

A rationale for small simple designs for device trials is their potentially large effect size. The natural history is well known for the patient with an irregular heart rhythm, a failed heart valve, degenerated joint, unstable fracture, uncorrected anatomic defect, or many other conditions amenable to treatment by a device. Furthermore, small treatment effects attributable to the use of a particular device are usually not important. Many devices are designed and used only in situations where they can provide a large and specific treatment effect, for example, due to their mechanical properties. These situations in no way obviate the need for study rigor or infrastructure, but they may limit the size, duration, or architecture of a trial.

In other circumstances, a new device represents an incremental improvement of an item known by experience to be safe and effective. Perhaps a new material is employed that is more durable, less reactive, or less prone to infection than one previously in use. The device may offer improved electronics or other features that can be convincingly evaluated ex vivo. The need for trials in these circumstances is not eliminated, but their design might be relaxed compared to drugs, where analogous minor changes in formulation could affect bioavailability or action. We might say that the actions of devices are less dependent than drugs on their respective “initial conditions,” meaning drug effects are potentially more chaotic.

Although the considerations above may apply frequently to devices, there are numerous situations in which devices need to be evaluated for moderate to small treatment effects. These might arise when a device is a new treatment for a condition where standard therapy exists. Questions about efficacy and risk benefit are important, and are answerable using rigorous study designs. In addition, we cannot equate mechanical function and reliability with clinical benefit, as discussed above. The former is partly knowable from preclinical studies, whereas the latter requires definitive trials.
4.3.3 Case Study
Many of the issues surrounding the development, regulation, and clinical use of medical devices are illustrated by deep-brain stimulation for control of manifestations of advanced Parkinson’s disease. Based on animal models and exploratory human studies, electrical stimulation of certain regions of the brain has been shown to reduce tremors and similar symptoms in some patients. The devices utilized, their anatomical targets, and their method of implantation evolved through relatively small developmental trials. Initial regulatory approval was given for a device for unilateral use to control Parkinsonian tremor, indicating basic comfort with safety and efficacy. In 1996 small feasibility trials of bilateral electrical stimulation of the globus pallidus or subthalamic nucleus were begun in the United States and Europe, in subjects with advanced Parkinson’s disease, who had been responsive to L-Dopa but were progressing symptomatically. These trials were designed with sample sizes in the range of two dozen subjects and with subjective clinical endpoints evaluated at 3 months—essentially immediately relative to the course of the disease. Promising early results caused these trials to be expanded to about 150 subjects (absurdly large sizes for feasibility) and longer term follow-up. The data, when presented
in a regulatory forum in 2000, raised relatively little concern over device function or safety, but raised significant questions about clinical efficacy, endpoint selection, study design, statistical analysis, and practitioner expertise. If the treatment effect had not been so large in magnitude (strong improvement in motor subscore of the Unified Parkinson’s Disease Rating Scale), the regulatory approval would likely have failed because of poor clinical trial methodology. Some of the methodological issues can be seen in a reported clinical trial of deep brain stimulation [347]. This illustrates that while devices may proceed through early development taking advantage of some of the features discussed above, questions about clinical efficacy are no different than for drugs.
4.4 PREVENTION
Clinical trials that assess methods for prevention of disease are among the most important, complex, and expensive types of studies that can be done. Aside from this complexity, prevention trials also highlight ethics concerns and risk–benefit judgments because the participants are often at lower risk than individuals with a disease. Even risks at “epidemic” levels (a few percent) in the population may be much lower than risks encountered by individuals who actually have the disease.

Some of the difficulties in conducting prevention trials are illustrated by the Women’s Health Initiative (WHI) randomized placebo-controlled trial also discussed in Section 8.4.2. The WHI may be the most expensive clinical trial ever performed—a study of 16,608 healthy postmenopausal women that cost over $2 billion. Among other things, the trial studied dietary fat, but did not achieve a change in blood lipids in the intervention group. Despite the seemingly indisputable relationship between adiposity and cancer risk, the WHI result suggests that dietary fat does not cause cancer directly. Fruit and vegetable intake did not reduce cancer incidence, but did seem to reduce cardiovascular disease. The relationship between dietary fat and cancer risk seems to be subtle and not so easily teased apart in even a large rigorous clinical trial.

Prevention trials can be loosely categorized as primary, secondary, or tertiary [150]. Here again, terminology seems to have arisen in oncology and then crept into broader contexts. Primary prevention trials assess an intervention in individuals who are initially free of disease. In some cases, such individuals will be at very low absolute risk for the outcome of interest. If a substantial number of cases are required to assess the effects of the treatment, as is often the case, then these trials will need to be quite large and have extensive follow-up to generate the information (events) to meet the scientific objectives. An example is diet and life style change to prevent cancer or cardiovascular disease.

Secondary prevention attempts treatments or interventions in individuals with characteristics or precursors of the disease, or with early-stage disease, to prevent progression or sequelae. An example would be lowering of blood pressure to prevent stroke or myocardial infarction. A precursor condition is something like keratotic skin lesions, which often precede the occurrence of basal cell carcinoma. Individuals with such lesions might be an appropriate population in which to test agents that could reduce the frequency of this skin cancer. Secondary prevention trials tend to be smaller than primary prevention trials, because the population is at higher risk and will yield more events.

Tertiary prevention attempts to prevent recurrences of the disease in individuals who have already had one episode. An example is the use of vitamins or trace elements
to prevent or delay second malignancies in patients who have already had a primary cancer. Such a population is usually at the highest risk of having an event, and consequently these might be the most efficient of all prevention trials with respect to size. However, there may be fundamental biological differences between preventing a disease in a basically healthy population and attempting to prevent a disease when it has already become manifest on one occasion. It is quite possible for the treatment to work under one condition but not under the other. Tertiary prevention is likely to be very important to reduce hospitalizations of affected individuals as a step in reducing health care costs. As prevention widens the nature of its interventions, it may become a less cohesive context for trials. Already we see prevention incorporating diverse interventions such as vaccines, diet, lifestyle, herbals, vitamins, trace elements, and drugs.
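The influence of the cohort’s absolute risk on trial size, which drives most of the practical differences among primary, secondary, and tertiary prevention trials, can be sketched with simple arithmetic. Assuming, purely for illustration, that the planned comparison requires roughly a fixed number of events, the number of participants needed scales inversely with the event rate; the event target and rates below are hypothetical.

# Illustrative arithmetic: cohort size needed to accumulate a fixed number of
# events over a fixed follow-up period, at different absolute event rates.
events_needed = 400            # hypothetical number of events for the comparison
follow_up_years = 5

for setting, annual_rate in [("primary prevention", 0.002),
                             ("secondary prevention", 0.02),
                             ("tertiary prevention", 0.08)]:
    expected_events_per_person = annual_rate * follow_up_years
    n = events_needed / expected_events_per_person
    print(f"{setting}: roughly {n:,.0f} participants to yield {events_needed} events")

Under these made-up rates the primary prevention setting requires tens of thousands of participants, whereas the tertiary setting requires about a thousand, which is the essential reason the categories differ so much in size, duration, and cost.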
4.4.1 The Prevention Versus Therapy Dichotomy Is Overworked
Prevention and therapy trials are usually said to be different based on the origins of ideas to be tested, nature of specific interventions, absolute risk of the study cohort, scientific expertise required, study size, duration, and cost. Consider the difference between trials that test diet for prevention of coronary heart disease versus those that evaluate coronary bypass grafting for its treatment. Those stereotypical differences would be evident there. However, there are few differences in methodologic principles between prevention and therapeutics such as those outlined in Section 2.4.1.

Prevention and therapeutic questions are very similar when viewed as interventions to reduce risk. Therapy reduces risk in the individual, and prevention reduces risk in the population. But these are just convenient frames of reference because the therapy of individuals also reduces risks in the population (increasingly of consequence in our health care decisions). And prevention applied in the population has important consequences for the individual, some of which are discussed below. We often speak of prevention when we merely delay or exchange causes of death. For example, lowering someone’s risk of cancer may mean that later they will be a victim of cardiovascular or neurological disease. Treating one disease may do little to ameliorate a second. Most of the differences between prevention and therapy questions and their corresponding clinical trials relate to technology—size, cost, and duration. But these are a consequence almost entirely of the different absolute risk of the study cohort, rather than a reflection of principles. Thus, the dichotomy is overemphasized, especially for comparative trials.

Some areas where therapeutic and prevention trials have substantive differences, though not necessarily of principle, include the following. Preventive agents are usually not developed with the same pipeline or paradigm as drugs. Evidence for efficacy of prevention agents often comes from epidemiologic or other nonexperimental studies. Because many prevention agents are virtually nontoxic, early safety and dose-finding studies are less critical and may not need to be oriented toward pharmacokinetics. Because of the long study duration that is typical of prevention trials, poor adherence and delayed onset of treatment effectiveness may diminish or distort the treatment effect. Special trial designs and analyses that are insensitive to these problems are needed. Finally, because many prevention treatments have a low incidence of overlapping side effects, they are suitable for simultaneous or combined administration in factorial designs (discussed in Chapter 22).
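The dilution caused by poor adherence, mentioned above, can be quantified with a common back-of-the-envelope approximation: if a fraction of the treated group stops the intervention (drop-outs) and a fraction of the control group adopts it (drop-ins), the observed intention-to-treat effect shrinks roughly by the factor (1 - drop-out rate - drop-in rate), and the required sample size inflates by the square of the reciprocal of that factor. The adherence rates and treatment effect below are hypothetical.

# Rough illustration of effect dilution and sample size inflation from nonadherence.
true_risk_reduction = 0.25                 # hypothetical effect with full adherence

for d_out, d_in in [(0.00, 0.00), (0.10, 0.05), (0.20, 0.10)]:
    dilution = 1.0 - d_out - d_in          # common approximation for the attenuation
    observed = true_risk_reduction * dilution
    inflation = 1.0 / dilution ** 2
    print(f"drop-out {d_out:.0%}, drop-in {d_in:.0%}: "
          f"observed reduction about {observed:.0%}, sample size inflated {inflation:.2f}-fold")

Even modest nonadherence sustained over a long trial can erase much of the planned power, which is why designs and analyses that anticipate it are emphasized here.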
There is one critical difference between therapy and prevention that is essential to understanding both clinical trials and the difficult policy decisions that surround prevention. The risk–benefit calculus tends to be at the individual level for therapy and at the population level for preventives. Preventives have consequences similar to a tax on the population, diffusing the consequences of a disease and spreading it over everyone. Some individuals will be damaged by the intervention and many others will be helped. A good preventive will have minimal negative consequences and strong positive ones at the population level. For example, most vaccines have this property but there are rare negative consequences for individuals. The expectation is that the balance of harm and benefits from the preventive intervention will be favorable when averaged over the population, despite the bad bargain for a few individuals. Exactly the same perspective can be taken for screening trials (below). The risk–benefit calculus for therapies tends to be only at the individual level. But health economics is shifting the traditional individual-level view of treatment risk–benefit to a population view. For example, we might take a dim view of treating hundreds or thousands of individuals with an expensive therapy to improve the outcome of only one. Other cost-effective evaluations of effective therapies might also be relevant. In any case, the view of therapeutic evaluation presently is primarily through effects on the individual, whereas the perspective on preventives (and screening) is primarily through the population.
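One simple way to express this population-level accounting is the number needed to treat, the reciprocal of the absolute risk reduction. The event rates below are hypothetical and serve only to show the arithmetic behind treating many to benefit a few.

# Number needed to treat (NNT) as a population-level summary; rates are hypothetical.
control_risk = 0.040        # 10-year event risk without the preventive
treated_risk = 0.030        # 10-year event risk with the preventive
absolute_risk_reduction = control_risk - treated_risk
nnt = 1.0 / absolute_risk_reduction
print(f"Absolute risk reduction {absolute_risk_reduction:.1%}; "
      f"about {nnt:.0f} people treated to prevent one event")

A preventive with a 1% absolute risk reduction must be given to roughly 100 people to avert a single event, so even rare harms and modest costs are multiplied across the many who cannot benefit.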
4.4.2 Vaccines and Biologicals
Vaccines and biologicals are similar to the extent that both typically utilize natural substances to stimulate the body’s own beneficial responses. Examples of biologics are cells or humoral agents that stimulate responses in the recipient. The components of some vaccines (proteins from bacteria or viruses, live or attenuated cells) are not necessarily “naturally present” in health, but the response is. This characterization is crude because some drugs are natural products also (e.g., steroids and plant products), and some vaccines are not. Through triggering host mechanisms, vaccines and biologics also seem to have the potential for great amplification of effect, substantially changing our perspective about dose. Therapeutic benefit may rely on a different mechanism than drug metabolism or direct effect on receptors. In any case, the distinction is imperfect and this creates some difficulties for sketching the role of clinical trials in the development and evaluation of these substances. In this discussion, I will refer mostly to “vaccines,” although similar remarks often apply to other biological agents.

Many vaccines are among the most effective and cost-effective interventions for public health. They are discussed here separately from other prevention agents only to reflect their uniqueness and as a practical distinction that persists today. Vaccines have been produced mostly in response to infectious diseases, but now some other illnesses may be amenable to vaccine treatment or prevention. Cancer is one possible example. A good deal of effort is also being devoted to both treatment and preventive vaccines for HIV. Gilbert [592] discusses some of the statistical design issues for HIV vaccine trials. Much of the public and professional attitude toward vaccines is predicated on their high degree of effectiveness and safety.

Although somewhat forgotten today, the American public has had a favorable experience with an important vaccine trial, perhaps the largest clinical trial ever done. I refer to
the 1954 Salk polio vaccine trial [340, 1031] in which 1.8 million children were involved. Conclusive and impressive efficacy caused the vaccine to be licensed by the PHS the same day the results were announced. Not only did this study address one of the most important public health questions of its time, it was financed entirely without government money. In the early days of applying the new vaccine, several lots were contaminated with live virus, resulting in 59 cases of paralytic polio and 5 deaths. Despite this, public confidence in the program remained high. The history of the vaccine is remarkable [857], and there is a striking contrast between it and contemporary attitudes and practice. The polio vaccine trials are a classic example of prevention trials [884]. A traditional view of contemporary vaccine trials is given by Farrington and Miller [447].

Vaccine development and use is typified by the following features. They are often intended to treat individuals at risk but without the disease, they often produce a long-lasting or permanent effect, and the diseases they prevent may be at seemingly low frequency in the population. Therapeutic vaccines may not fit this paradigm. These characteristics sharpen the focus on safety. Some vaccine opponents emphasize uncommon or hypothetical side effects and question the benefit. For example, there was some concern that polio vaccines from 50 years ago could have been contaminated by viruses that might be causing or contributing to AIDS or cancer now [735]. Pertussis (whole-cell) vaccination seems to be perpetually criticized, and the anti-vaccine movement had demonstrable negative effects worldwide on the control of this serious disease [562]. A more acrid view of the pertussis case is given by Hoyt [740, 741].

The demand for a strong risk–benefit ratio and concern over uncommon but serious side effects has motivated the application of rigorous large clinical trials for vaccine development. These trials examine short-term outcomes and adverse events as carefully as prevention endpoints. The measure of efficacy might be immunological response, or more definitively, reduction in morbidity, mortality, or the incidence of specific infection. Some studies might also show cross immunity manifest by reduction in other infections.

There are several incentives to perform vaccine trials in developing countries. Delivery of the agent is often technologically simple. The frequency of the target disease may be considerably higher in a developing country. For this reason or because of cultural values, the acceptance of an experimental study by participants can be great. Companies may prefer such a setting because the liability climate is also usually more favorable than in the United States. In any case, trials in developing countries have raised serious and incompletely resolved ethical questions [1368]. Performing such trials in developing countries always raises concerns about “exploitation” or colonialism [1127]. This term is often used but not justified in my opinion. Exploitation implies that by design, the subject is worse off afterward than before entering some agreement. This is almost never the case in a clinical trial. The resolution of such questions depends on examining the risk–benefit for the study participants, the benefit to the culture in which the study is conducted, the peer-review and ethics review of the study in both cultures, and addressing concerns raised by trials in similar contexts.
None of our cherished concepts translate perfectly across cultures.
4.4.3 Ebola 2014 and Beyond
In 2014–2015 the outbreak of Ebola in resource-limited West Africa highlighted an urgency for vaccine development and renewed questions about performing trials in
developing countries. The spread of this infection with an approximately 50% case fatality rate encouraged governments, pharmaceutical companies, and the World Health Organization (WHO) to accelerate vaccine development. It also brought into sharp focus key population methods for epidemic control such as screening those thought to be at high risk, contact tracing, and quarantine with implications for autonomy as discussed in Chapter 3. Early in the epidemic there were even instances of violence toward health care workers and facilities by individuals in the population who disbelieved the root causes of the disease.

Two candidate recombinant vaccines demonstrated preliminary ability to control Ebola virus in nonhuman primates. An adenovirus derived from chimpanzees, cAd3-EBO, was developed jointly by GlaxoSmithKline (GSK) and the National Institute of Allergy and Infectious Diseases (NIAID). A second candidate, rVSV-EBO, was developed by the Public Health Agency of Canada and licensed to NewLink Corporation. At the time of initial testing, risk and safety information for both candidate vaccines was based mostly on the use of the recombinant viruses.

The WHO convened an emergency meeting in October 2014 to plan for the development of a vaccine. The meeting reached several conclusions including that two vaccines should be developed, a fund would be created to cope with possible litigation over adverse effects, two comparative trials would be conducted, and priority for vaccines would be given to health care workers. The first of these points is in stark contrast to the public view of the polio field trials discussed above. There was also discussion of a “stepped wedge” design in which each subject first receives the control therapy (presumably no vaccine in this setting) and then crosses over to the test therapy. The time of crossover to the test treatment is randomized. Such designs can randomize either individuals or clusters and have been useful in scaling seemingly effective therapies up to population interventions to assess effectiveness. This design would consume more calendar time than a traditional comparative trial but might be appropriate when the therapy for the test cohort is in limited supply.

A preliminary report of dosing for cAd3-EBO indicated immunogenicity is dose-dependent and probably requires 2 × 10¹¹ particle-units [910]. Larger clinical trials were ongoing as of early 2015 with a critical need for randomized trials [319]. Late clinical trials appear to support a high degree of protection for the vaccines. In the affected regions of West Africa, health care workers seemed to accept risks inherent in their jobs far in excess of those associated with a well-controlled clinical trial of a new vaccine. One might imagine that there were indirect benefits to these workers for doing so, but it is hard to claim a favorable direct risk–benefit for them. The greater burden on a clinical trial in this case is ironic. There is strong rationale for testing a new vaccine for Ebola in health care workers, but many issues of efficacy, risk, and social justice need to be addressed before wider population testing would be acceptable.
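The “stepped wedge” design mentioned above is easy to visualize: every unit (here, clusters) begins on the control condition and crosses over to the test treatment at a randomly assigned step, so that by the final period all units are treated. The sketch below is purely schematic, with arbitrary numbers of clusters and periods chosen only for illustration.

import random

random.seed(1)
n_clusters, n_periods = 6, 7                    # period 0 is an all-control baseline
crossover_steps = list(range(1, n_periods))     # one distinct crossover step per cluster
assert len(crossover_steps) == n_clusters
random.shuffle(crossover_steps)                 # randomize the order of crossover

print("cluster  " + "  ".join(f"P{p}" for p in range(n_periods)))
for cluster, step in enumerate(crossover_steps, start=1):
    row = ["T" if period >= step else "C" for period in range(n_periods)]
    print(f"{cluster:7d}  " + "   ".join(row))

Comparisons then exploit both the within-cluster before and after contrast and the between-cluster contrast at each period, at the cost of the additional calendar time noted above.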
4.4.4 A Perspective on Risk–Benefit
It is often said that (primary) prevention trials are performed in healthy subjects. If this were true literally, it might present serious difficulties for the usual risk–benefit calculus. I will support the view here that health is not the absence of disease, and that either individual or collective risk is a state intermediate between health and disease. This is certainly how we all behave. Low risk (or perceived low risk) induces behavior and judgments close to those taken by healthy individuals. High risk induces behavior similar
to individuals with disease. Attributing the idea to Hamilton, Pickering et al. [672], Rose [1282] stated:

. . . the idea of a sharp distinction between health and disease is a medical artifact for which nature, if consulted, provides no support.
Consequently, to view the prevention context as black or white creates a trap. Even relatively safe interventions such as vaccines lose their attractiveness if risk of disease is assumed away. Would the polio field trials have ever been done if the perception of risk was low? Dramatic prevention issues, such as smallpox vaccination in the United States in the wake of terrorist activities, center as much on risk perception (i.e., the unhealth of the healthy population) as they do on the properties of the preventive. The continuum that exists between health and disease has been articulated previously.

An early illustration of this trap was the concern over the use of tamoxifen for the prevention of breast cancer in high-risk women. The effectiveness of this therapy is established now, being close to 50% reduction of risk of breast cancer, and it may seem strange that serious objections were raised about investigating the question. However, strong concerns about the tamoxifen prevention trial were raised by some in the breast cancer advocacy community during the design phase of the study (e.g., Ref. [180]). The notion that the target population was healthy was central to the objections. A congressional hearing was held on this question, rationalized by Rep. Donald Payne (D-NJ):

It is crucial that the Federal Government conduct research to prevent the epidemic of breast cancer that is frightening women throughout our country. However, exposing healthy women to a potentially fatal drug may not be the best way to prevent breast cancer. It is important to make sure that federally funded research protects the subjects who participate from unnecessary risks and makes sure they are accurately informed of the likely risks and benefits [198].
Superficially this sounds quite reasonable, but the logical inconsistency of an “epidemic of breast cancer that is frightening women throughout our country” and “healthy women” is the error to which I referred above. Labeling the drug as “potentially fatal” was an overstatement. The sentiment in the last sentence of the quote was an argument against a problem that the proposed study did not have. These mistakes regarding the context of the trial were not unique to politicians. See Ref. [551] for a further example.

There was also significant opposition to the first RCT of zidovudine for prevention of maternal–infant HIV transmission. Oddly, the opposition came from AIDS advocacy groups who perceived the trial as disfavoring the rights of women relative to their offspring. It should not come as a surprise that ideology sometimes opposes science, but finding such a prevention question within the scope of conflict is remarkable. It was soon discovered that prevention worked when the trial was stopped at its first interim analysis [295]. The debate eventually shifted to criticism of the ethics of subsequent HIV prevention trials, under the assumption that efficacy had been established by this first study (e.g., Ref. [962]).

More recent examples of botched concern over prevention trials can also be found. The Alzheimer’s Disease Anti-inflammatory Prevention Trial (ADAPT) was a randomized placebo-controlled trial testing the effect of naproxen and celecoxib to reduce the
incidence of AD [991]. Naproxen is an over-the-counter drug thought to be relatively safe at the appropriate dose. Celecoxib is designed to have fewer side effects than other nonsteroidal anti-inflammatory agents, presently the most widely used class of drugs. ADAPT was publicly criticized and labeled “unethical” for two reasons: unacceptable risk (because it is “known” that the treatments are “incorrect”) and lack of adequate information in its consent form [111]. The criticisms were derived from an unsophisticated interpretation of scientific evidence combined with an overestimate of the risk of gastrointestinal bleeding from the study interventions. The critics of ADAPT built their case on demonstrably false scientific conclusions [181]. They also chose a nonscientific and inflammatory venue for their criticism, a technique that is often the signature of an ulterior motive.

Interestingly, in late 2004, ADAPT enrollment was suspended based on complex safety concerns unrelated to the above criticisms. Rofecoxib, a drug in the same class as celecoxib, was removed from the market earlier in the year because of its association with elevated risk of thromboembolic events [479]. Celecoxib was then being studied for a wide variety of indications, and its safety subsequently came under intense scrutiny. A collaborative NCI trial of celecoxib for prevention of cancer was halted because of increased risk of major cardiovascular events. Based on this and practical concerns, but not due to an internal safety signal, ADAPT enrollment was suspended. At the time of this writing, both the plans for ADAPT and the fate of celecoxib remain uncertain.

As a last illustration of risk–benefit issues, consider the role of surgery as a preventive intervention. It is uncommon for surgery to be used in this role for obvious reasons. But prophylactic mastectomy might be appropriate for women with a BRCA gene mutation who are at exceptionally high risk of developing breast cancer, for example. Joint replacement is a circumstance in which surgery is arguably used for prevention. The high mortality surrounding hip fracture in the elderly makes this a reasonable perspective. We might apply a similar reasoning to the surgical treatment of aneurysms. There are many such examples for all types of prevention interventions.

Health is not so straightforward to define. Risk represents a gradation between the absence and the presence of a disease. The necessary assessment is based on risk versus benefit, and the appropriate behavior follows accordingly.
4.4.5 Methodology and Framework for Prevention Trials
Prevention trials do not have many fundamental methodologic differences from therapeutic trials. Pharmaceutical agents are widely used for prevention, although they are often derived from substances that are considered safer than many drugs. In this sense, prevention agents bear a resemblance to complementary and alternative medicine, except that they are derived directly from mainstream scientific concepts. Additionally, they are invariably supported by extensive preclinical evidence. Examples are vitamins, their analogs, and trace elements. An interesting exception is tamoxifen for prevention of breast cancer, where prevention efficacy was suggested by its performance in treatment trials. A strong safety profile is essential because we expect them to be used in individuals without the targeted disease and administered for prolonged periods of time. It is the nature of the questions being asked, size, and expense of prevention trials that tend to distinguish them from treatment trials. Some prevention agents are literally drugs. These and vaccines are regulated in much the same way as drugs.
Because of these same characteristics, prevention trials are often amenable to testing multiple treatments at the same time. For example, in a study designed to prevent atherosclerosis, it might be possible to test one or more additional treatments, provided they do not have overlapping side effects or toxicities. One could combine dietary changes, exercise, and medication in a single clinical trial. Treatments could be studied jointly or independently. This reasoning leads to the frequent use of factorial designs (discussed in Chapter 22), where more than one treatment (or factor) administered simultaneously is studied in a single trial. If the treatments do not interact with one another, then independent estimates of individual treatment effects can be obtained efficiently from such a design. If treatments do interact with one another, the same factorial designs can be used (with a larger sample size) to assess the direction and strength of interaction.

Many interventions proposed for disease prevention studies are not amenable to placebo control or masking. For example, dietary changes, exercise, and the like do not have corresponding placebo controls. This can lead to the same type of observer bias that arises in other unmasked studies. It may also produce “drop-ins,” where participants on the standard treatment or no-treatment arm adopt the experimental therapy after learning about the trial. Drop-ins are more likely when the intervention is safe and easy, like dietary changes or exercise. They can also occur in drug treatment trials if the risk–benefit appears favorable to those with the condition under study. This happened extensively in the 1980s and 1990s AIDS trials.

Another concern that arises frequently in prevention trials is compliance or treatment adherence. Because these studies frequently require participants to take medication or maintain their treatment for several years, compliance may be considerably worse than on short-term treatment trials. How to measure and improve compliance is often an important, if not critical, issue. Fortunately, many prevention strategies have demonstrated their efficacy despite imperfect compliance.

The early development of prevention agents is often replaced by evidence from epidemiologic studies. Unlike therapeutic trials, where the intermediate step in development is to ascertain clinical activity and compare it to an external standard, the middle developmental step in prevention studies is often a randomized trial using a surrogate outcome (Chapter 5). Surrogate outcomes make such trials shorter, less expensive, and more efficient. Promising findings in such a study could lead to a large-scale prevention trial employing a definitive outcome.

When designing definitive comparative prevention trials, some extra consideration should be given to the type I and II error rates. These rates should be chosen to reflect the consequences of making the respective error. When the treatment is thought to be safe and it is vital to demonstrate it, the type II rate should probably be set quite low, whereas the type I rate might be relaxed compared to the typical therapeutic trial.
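The consequence of shifting the error rates can be seen with the usual normal approximation for comparing two proportions. The event rates and effect size below are hypothetical; the calculation simply shows that exchanging a relaxed type I rate for a stringent type II rate reallocates the errors without greatly changing the size of the trial.

from scipy.stats import norm

def n_per_arm(p_control, p_treated, alpha, beta):
    # Two-sided normal approximation for comparing two proportions.
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(1 - beta)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2

p_control, p_treated = 0.05, 0.04            # hypothetical 20% relative risk reduction
for label, alpha, beta in [("conventional (alpha=0.05, beta=0.10)", 0.05, 0.10),
                           ("prevention emphasis (alpha=0.10, beta=0.05)", 0.10, 0.05)]:
    print(f"{label}: about {n_per_arm(p_control, p_treated, alpha, beta):,.0f} per arm")

In this example the two configurations require similar numbers of participants, but the second accepts a greater chance of a false positive in exchange for a smaller chance of missing a genuinely effective and safe preventive.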
4.5 COMPLEMENTARY AND ALTERNATIVE MEDICINE
It is not easy to define alternative medicine, now often called complementary and alternative medicine (CAM). CAM is an inconsistent and moving target. Two modern textbooks on the subject [530, 1595] do not explicitly define CAM but list a range of
TABLE 4.3 Common Complementary Practices in Approximate Order of Frequency of Use

Practice
Natural products
Deep breathing
Meditation
Chiropractic and osteopathic
Massage
Yoga
Diet-based therapies
Progressive relaxation
Guided imagery
Homeopathy
topics thought to be within its scope. Yuan and Bieber [1595] state that CAM consists of those practices “not currently considered an integral part of conventional therapies.” Another definition was offered by Raso:

Alternative health care (alt-care, alternative care, alternative healing, alternative healing therapies, alternative health, alternative medicine, alternative therapeutics, alternative therapies, alt-med, CAM, complementary health care, complementary medicine, extended therapeutics, fringe medicine, holistic healing, holistic health, holistic medicine, innovative medicine, mind–body medicine, natural healing, natural health, natural medicine, New-Age medicine, new medicine, planet medicine, unconventional medicine, unconventional therapies, unconventional therapy, unorthodox healing, unorthodox therapies, wholistic medicine): A limitless hodgepodge of health-related methods distinguished from biomedical health care partly, perhaps chiefly, by its almost unambiguous acceptance of “spiritual health” as a medical concern. One of its general principles is that a practitioner is a teacher who can “empower” one. Its purported goal is not to cure, but to effect “healing”: an experience of physical, mental, and spiritual “wholeness” [1248].
This illustrates how difficult it can be to craft a suitable definition for this diverse context (see also Ref. [46]). The National Center for Complementary and Alternative Medicine (NCCAM), the NIH Center that is responsible for investigating these therapies, avoids definitions altogether. NCCAM lists a number of complementary practices as shown in Table 4.3. Alternative practices, which might be used in place of mainstream medicine, are not listed by NCCAM. A problem is that many traditional medical practices also fall into those or similar domains. For example, diverse areas such as natural products and biologicals, diet, and mind–body interactions all have accepted roles in traditional medicine. Many CAM treatments are characterized by only regional use. Others are little more than attitude, dietary, or lifestyle changes and may be widely used. Some proponents make extraordinary claims in light of the natural history of the disease, while others attribute to the treatment improvements that are consistent with typical variations in disease course. It is not sensible to define CAM on the basis of content that is constantly in flux. For the purposes of discussing methodology and context, I define CAM as a treatment whose postulated mechanism of action is poorly defined, or inconsistent with established
biology. “Poorly defined” or “inconsistent” is partly a subjective assessment, but this escapes the pitfall of defining CAM by the nature of the therapy. It also takes advantage of the somewhat easier task of directly characterizing scientific medicine, which is mechanistically founded, and ultimately justified by both biological theory and empirical efficacy. The huge contemporary push in genomic sciences illustrates the emphasis on basic biological mechanisms in medicine. The disprovability of a treatment in the domain of scientific medicine is a key characteristic.

CAM therapies are sustained mostly on the basis of belief rather than evidence. Some such therapies can be discredited with strong evidence. This happened with laetrile and hydrazine treatments for cancer, discussed below and in Sections 8.4.3 and 20.8.5, but it might be argued that they were not CAM. I reject the occasional claim that the mechanism of action of a CAM therapy is unknowable or inaccessible to scientific means. Homeopathy is a good example of treatment based on mechanism and theory that are inconsistent with established biology. However, in the case of homeopathy, it seems that no amount of negative evidence can overcome support based entirely on belief. It is worth reading the historic account of homeopathy by the physician Oliver Wendell Holmes for a view that is still contemporary in many ways [730].
4.5.1 Science Is the Study of Natural Phenomena
There are two critical aspects to the discussion of putative therapeutics outside scientific medicine. First is the fact that science is, by definition, the study of natural phenomena, all of which are accessible to it. This idea is as important for what it excludes as for what it includes. Science does not incorporate the supernatural even if such things exist. It has no methods of observation or inference for supernatural phenomena. This is why I was careful to define research in terms of the natural world in Section 2.2.1. Hence an a priori acceptance that some therapeutic effect is supernatural, such as intercessory prayer, renders it strictly out of the reach of science. The structure of an experiment in this circumstance is meaningless. Similarly, declaration that some therapeutic effect operates through unknown mechanisms may render it inaccessible to science, depending on the nature of those mechanisms. This is the trap of some alternative concepts. If a rigorous experiment yields no effect, the true believer can always escape to the unknowable or supernatural realm. If a positive effect is detected, we must then tease apart the mechanism, which could be anything from a false positive, to methodologic error, to observer or detection bias, to a new discovery. What happens too often is that investigators on the fringes fail to posit a mechanism for therapeutic effect prior to the conduct of the experiment. But the scientific method requires a mechanism based on natural phenomena even if it is new or revolutionary.

Second, we must acknowledge the possibility of false positive errors no matter how strong our designs are, and construct methods to detect them. There are two tools to uncover the inevitable false positive error: consistency with established theory, and replication. Both of these tools represent evidence. Our new result is measured against existing validated evidence in the case of theory, or against new evidence in the case of replication. We cannot accept an empirical result as a true positive if it is dissonant with established fact unless additional evidence is gathered. In some circumstances, inconsistency with theory alone may be sufficient to declare a result to be false.
4.5.2 Ignorance Is Important
Ignorance is essential to the scientific process, if not at the heart of it. In the excellent book Ignorance: How It Drives Science, Firestein [463] discusses the critical role that ignorance plays in science. Scientists do not principally build a collection of facts; rather, they focus on what is not yet known. Ignorance does not represent a void as much as it embodies predictions of where it might be necessary and fruitful to look for new information, formulating a good question in light of existing evidence. Ignorance is not the absence of knowledge, but is the support for the articulation of a question. A worthwhile question is derived from the testable predictions of the current body of knowledge, and will offer the possibility of falsifying some portion of existing theory. Nonscience does not make testable predictions.

Some CAM therapies have elaborate justifications inconsistent with established science, such as homeopathy. Others, like therapeutic touch, invoke ill-defined or incomplete mechanisms of action essentially ad captandum vulgus. Some therapies often classified as CAM suggest mechanisms that are not well understood but are very plausible or evident in scientific medicine, such as mind–body interactions. Thus, the dividing line between CAM and scientific medicine is not sharp. Some treatments used in scientific medicine work via unknown mechanisms of action, such as lithium for manic-depressive illness, which is far from saying that they will be found to be inconsistent with established biology. When a therapy of any origin is based on well-defined, established biological mechanisms, and is also shown to be effective, it becomes incorporated into scientific medicine.

The power of mechanistic theory and integration with existing biological knowledge can be seen in recent investigations into the salutary effects of cinnamon on blood sugar, cholesterol, and triglycerides. Based on the origins and contemporary medicinal use of the spice, it falls within the CAM domain. However, there is rigorous science that reveals the molecule responsible for producing cinnamon's insulin-like effects and its cellular mechanisms of action [789]. Some CAM proponents endorsed using the spice on the basis of this single study. Since 2001, clinical research findings have been divided regarding therapeutic uses for cinnamon [851]. Whether continued research produces derivatives or new compounds remains to be seen, but the pathway for potential incorporation into scientific medicine is visible. Adoption of natural products or derivatives into scientific medicine in this way is a historical norm. Many other substances with conceptually similar origins and therapeutic claims in the CAM context have not been able to cross the divide.

Another interesting example is the treatment of verruca vulgaris or common warts in children. A randomized trial in 51 subjects showed duct tape occlusion to be more effective than cryotherapy (85 versus 60% success) over 2 months of treatment [493]. This trial has the additional nice feature of a reported p-value of exactly 0.05. The investigators did not speculate on a mechanism of action for duct tape, though they did point out disadvantages of cryotherapy. This finding could represent a type I error. However, occlusion with tape might stimulate an immunological reaction to the wart similar to that produced by freezing. Alternatively, there might be active substances in the glue on the duct tape, or the benefit might have come from ancillary methods performed in parallel with the tape.
In any case, this unconventional therapeutic did not necessarily disrupt what was known about warts and their resolution. Two later trials of similar size failed to support the activity of duct tape for treatment of warts [342, 1542].
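The fragility of a result reported as exactly significant is easy to illustrate. The following sketch, in Python, reanalyzes a hypothetical version of such a trial. The arm sizes and success counts (22/26 versus 15/25) are assumptions chosen only to approximate the reported 85 and 60% success rates among 51 subjects; they are not the published data.

from scipy.stats import chi2_contingency, fisher_exact

# Assumed 2x2 table approximating the reported comparison: rows are treatments,
# columns are counts of (success, failure). These counts are illustrative only.
table = [[22, 4],   # duct tape occlusion (assumed)
         [15, 10]]  # cryotherapy (assumed)

chi2, p_chi2, dof, expected = chi2_contingency(table, correction=False)
odds_ratio, p_fisher = fisher_exact(table)

print(f"chi-square p-value (uncorrected): {p_chi2:.3f}")
print(f"Fisher exact p-value:             {p_fisher:.3f}")

Under these assumed counts the uncorrected chi-square test yields a p-value just under 0.05 while the exact test yields one somewhat above it, a reminder that a borderline result depends on the choice of test and on the outcomes of only one or two subjects, and deserves replication rather than strong belief.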
4.5.3 The Essential Paradox of CAM and Clinical Trials
The problem and paradox regarding the relationship between CAM (by any definition) and clinical trials is the following. To test any therapy rigorously, it must at least temporarily be brought into the domain of orthodox science, with the attendant requirements regarding mechanisms of action, consistency with other accepted findings, and evidentiary implications such as proof of failure. CAM proponents often desire the imprimatur of science but frequently reject the methods necessary to gain it. In some cases, findings not in accord with prior belief are simply discarded (e.g., Section 20.8.5). Others contend that CAM by its very nature cannot be evidence based [1483]. See also commentary on this idea by Bloom [160]. The greatest difficulty for evaluating CAM therapies using clinical trials arises when the therapeutic question is only partially brought into the scientific method. This occurs when the evaluation structurally proceeds by way of a rigorous experiment design, but the hypothetical means of action relies on unfounded or demonstrably false principles. This issue was introduced in the discussion of science and nonscience in Section 2.3.7. Mechanism (and support from other evidence) is absolutely essential to the experimental method because it is the only basis for resolving the difference between type I errors and true findings. A meta-analysis of intercessory prayer [804] confronts the problem directly. Experiments without a foundation in established biology are little more than elaborate speculations. An excellent illustration of this problem is a randomized clinical trial, funded by the Uniformed Services University of the Health Sciences (USUHS), studying the effect of therapeutic touch (TT) on pain and infection in 99 burn victims. In the words of the lead investigator: The idea behind the practice of Therapeutic Touch is that the human energy field is abundant and flows in balanced patterns in health but is depleted and/or unbalanced in illness or injury. The Therapeutic Touch practitioner assesses the patient’s energy field patterns with his/her hands to identify areas of depleted, congested, blocked, or unbalanced energy. Then, the Therapeutic Touch treatment consists of a series of techniques implemented to replenish, clear, modulate, and rebalance the patient’s energy field patterns [1491, 1492].
The practice of TT is loosely based on the theory of unitary human beings [1280] that essentially equates human beings with energy fields. Beyond this, other humans can both manipulate and assess these energy fields. Although a staple of “nursing theory” and usually couched in scientific terms, these energy-based explanations appear supernatural. In this randomized trial, one group of subjects was treated by a nurse trained in TT, while the other was treated by a nurse who was instructed to mimic the treatment. Because the subjects had first-, second-, or third-degree burns on 5–70% of their bodies, the practitioners did not actually touch the study subjects but only moved their hands over them. Care was taken to train the mimics so their hand movements appeared real. This study design was severely criticized by Selby and Scheiber [1356] in advance of the results because of the unconventional (unscientific) model of disease and treatment, the fact that subjects were not actually touched, and the nature of the mimic or sham treatment control group. This study illustrates exactly the problem of omitting strong biological rationale for mechanism of effect, thereby bringing the therapeutic paradigm only partially into the domain of science.
Ultimately, the investigators reported that TT significantly reduced pain and anxiety compared to sham TT [1492]. They interpreted the finding as support for the proposed mechanism, ignoring obvious ways of testing it directly. For example, energy fields can be measured and no new anatomical energy receptors have been found or proposed in humans. Furthermore, the investigators did not discuss the substantial likelihood that the result arises from methodologic flaws or random error under the assumption that the proposed mechanism is false. A scientific perspective on therapeutic touch is offered by Scheiber and Selby [1338].

Poorly defined mechanisms such as "energy fields" are often invoked as their own explanation. Frequently the names chosen to represent such treatments are value-laden (e.g., therapeutic touch), in contrast with traditional medical nomenclature. Therapies based on such ideas may be demonstrably safe (provided they do not displace a needed treatment) but not so easily shown to be effective. Contrast this with radiotherapy, for example, which also might be said to work through an "energy field" but has exquisite mechanistic explanations and demonstrated efficacy and indications.

In 1998, therapeutic touch suffered a major setback with the publication of a paper in the Journal of the American Medical Association indicating that practitioners of the art were not able to recognize the requisite patient "energy fields" when masked [1281]. The scientific study demonstrating this was performed by a 9-year-old girl as a project for her fourth-grade science fair. She became the youngest person to publish a research paper in a peer-reviewed medical journal, and a media star. The research demonstrated that 21 practitioners of therapeutic touch performed no better than chance when asked to detect the energy fields they claimed to be manipulating. The scope of therapeutic touch seems to have declined greatly as a result of this embarrassing exposure.

A further illustration of the paradox of testing some CAM treatments in clinical trials can be seen in the Cochrane review results for intercessory prayer. The review comprised 7646 subjects in 10 randomized trials and suggested that the evidence justified more trials [1276]. These findings were questioned on methodologic grounds, including the statement that

One of the problems is that researchers who investigate interventions that have no credible mechanism need to interpret positive results very carefully [804].
This is exactly the difficulty of trying to use a scientific method to test a question that is a matter of belief rather than a question of evidence. Belief need not be restricted to supernatural phenomena, but can encompass matters that should be evidence based, such as biological efficacy.

4.5.4 Why Trials Have Not Been Used Extensively in CAM
Formal clinical trials have not been widely applied either inside CAM practice or using CAM treatments in traditional settings, although in current times the need for such evaluations is high [1043]. Reasons for this include the following. Many CAM therapies are used in settings where impartial evaluation is inhibited or discouraged by practitioners and patients alike. There may also be strong financial incentives for practitioners to maintain unchallenged belief in the efficacy of such treatments. Also, rigorous methods developed in conventional medicine are sometimes incorrectly thought to be inapplicable to CAM [235a].
Many medical scientists and patients are willing to apply lower standards of acceptance to some CAM treatments because the absolute risk appears to be low. This seems to obviate the need for rigorous testing. This might be appropriate when there is a reasonable risk–benefit ratio or even when there is only an expectation of such, but it is not sensible in the absence of efficacy. It must be said that nursing and supportive care professionals often provide the gateway for this phenomenon, a good example being therapeutic touch as just discussed.

Perhaps the principal reason why CAM treatments have not been extensively and rigorously studied is foundational. Many alternative systems of therapeutic effect do not rely on structural and functional relationships such as those in scientific medicine. This eliminates the requirement to propose and study anatomical, physiological, or biochemical mechanisms. The alternative systems in Table 4.5 are examples. Nuland said it well:

Scientific medicine is the only tradition in which knowledge of the body's actual structure and functioning is acknowledged not only as the basis of understanding disease but also as something unattainable without dissection of the dead and study of the living, whether at surgery or by means of various biochemical or imaging techniques. Since other systems do not rely on directing therapy at specific well-identified abnormalities within individual organs, detailed knowledge is superfluous and useless to them. Like humoral medicine, such schemes depend for their efficacy on generalized readjustments of entire conditions of constitutional abnormality [1126].
Whether it is theoretically possible to test every CAM question experimentally is an open question. It seems to be true at the empirical level, given only a suitable outcome measure and the superficial structure of an experiment. However, it is not possible to generate true scientific evidence unless the investigator tests outcome and mechanism. There remains the problem of unknown mechanism. Convincing empirical evidence may persuade the scientist to modify, replace, or search for new mechanisms. However, the more that other evidence supports the existing body of theory, the more likely it is that new mechanisms will be consistent with it.

Regulation
There is no direct government regulation of CAM treatments requiring them to be safe or effective. Many such treatments employ dietary supplements or other similar substances that fall explicitly outside FDA oversight. Even so, the claims made on behalf of such therapies, the attitudes of practitioners and patients, and the potential for CAM therapies to displace treatments of proven benefit and to cause their own adverse effects suggest that some regulation might be appropriate [987, 1441].

Tradition
There is not a strong tradition of experimental trials among CAM practitioners, for reasons outlined above. However, the outcomes of CAM therapies are subject to the same selection bias and observer bias as any other treatment, perhaps more so because of the setting in which they are often used. It is possible that many CAM treatments appear more effective than they really are, further complicating our ability to do rigorous trials. There is also not a strong tradition of testing CAM therapies within scientific medicine.
This situation is beginning to change with the application of NIH money. Because of the essential paradox discussed above, it remains to be seen how successful this will be.

Incremental Improvement
Most CAM treatments evolve by incremental improvement, although it is not clear that they are titrated to efficacy as much as to safety and cultural acceptability. There does not appear to be any "science of" or formalized developmental process for CAM therapies, thus alleviating the need to provide mechanistic rationale. However, many CAM treatments are well suited to masking, randomization, and placebo control.

Economics
Many CAM treatments are supported by relatively few practitioners, even a single one. This reduces the need for resources and collaboration, meaning peer review and funding. Economic incentives can be quite favorable for the CAM proponent and correspondingly unfavorable for rigorous voluntary testing. Because many CAM therapies use natural products or existing modalities, the opportunity for patents is limited. This restricts the resources that practitioners and proponents are willing to apply.

Psychology
Although difficult to generalize, patients can be psychologically very accepting of CAM therapies, provided they appear safe and are accompanied by an appropriate veneer of favorable performance and rationale. Uncritical acceptance seems to have played a role in the demand for shark cartilage as a treatment for cancer [1167]. A simple review of the scientific facts might have altered the course of that fad. Decisions to take a CAM therapy are often based on trust, just as with conventional therapy. I can provide no definitive explanation for why patients appear to accept CAM therapies less critically than traditional ones. Perhaps this is actually a misperception, and patients are broadly very accepting of any treatment provided they trust the practitioner and the risk–benefit appears appropriate to their understanding of their disease. Again, we must be mindful that the apparent safety of some therapies may be a product of how they are used rather than evidence from rigorous trials.

In the absence of rigorous evaluation, there is little to prevent ineffective therapies from remaining in the CAM domain. We pay a double price for this. Patients can become preoccupied with useless treatments, and science misses a much-needed opportunity to broaden its view. Even a relatively small amount of rigor injected into the CAM setting would probably provide a large benefit to society.

4.5.5 Some Principles for Rigorous Evaluation
Applying clinical trials to the study of CAM requires several principles. First, we must accept the idea that all natural phenomena are accessible to scientific knowledge. Books could be written about this, but I don’t consider it controversial. (We might even define science as the study of natural phenomena, but that’s not essential.) Some advocates claim that scientific methods must be modified for, or are inapplicable to, the study of CAM. This claim is internally inconsistent. Second, disease processes, administration of a treatment, the treatment itself, recovery from disease, and death are all natural phenomena. CAM proponents often support this
idea by the frequent claim that its treatments are more natural than traditional healing, apart from the fact that the observables are natural. Although, early in its investigation, a CAM therapy may be stereotyped as involving an unexplained phenomenon, this is not a final explanation. Nor is it proof that such observations and phenomena are inaccessible to science.

Third, benefit is assessed by repeated structured observation. Strong evidence is gained by reproducibility. This is explicit in CAM by the use of the case history, case series, and occasional study report to influence opinion. Strong evidence (near proof or high reliability) based on repeatability is claimed by CAM advocates when and because they urge the application of treatments to new patients. Traditional science also uses the method of structured observation and reliability, albeit usually quite rigorously. The essential differences between CAM as it has been traditionally supported and rigorous scientific clinical trials are (1) integration with biological knowledge about mechanism and (2) degree of structure applied on a purely empirical basis to generate data. CAM evaluations have been weaker on both counts. Thus, there are several components of structure that can be introduced to evaluate CAM claims more rigorously:
· Establish the biological framework for acquiring new knowledge.
· Provide background for a reasonable biological mechanism of action.
· Document promising cases/results. Even a handful of such cases could be evidence of benefit and might motivate more reliable and extensive study. Failure to provide it, when large numbers of patients have been treated, can be taken as evidence of no benefit.
· Prospectively write a study plan.
· Adhere to objective, definitive outcomes.
· Account for all subjects treated and all time at risk following therapy. Accounting is based simply on active follow-up of subjects and active ascertainment of outcomes. The results, when compared to the natural history of the disease, provide informal quantitative evidence as to degree of benefit. This component of structure can be attained easily using a noncomparative (safety and activity) clinical trial design.
· Employ a comparative study design. This design should be enhanced by masking, randomization, placebo control, or other design maneuvers to strengthen the reliability of its findings.
· Solicit peer review of plans, process, and results.
· Obtain independent external verification or validation.
The application of any one of these components to many CAM claims would help considerably with evaluation, and using some or all of them systematically would be a service to patients everywhere. There is nothing in this list that works systematically against CAM, but all the steps work against ineffective treatments, deception, and quackery. In the case of laetrile, for example (Section 8.4.3), the application of fairly minimal structure to evaluate efficacy claims was enormously helpful. In the case of hydrazine (Section 20.8.5), rigorous structure and strong evidence were not accepted as disproof by proponents.
4.5.6 Historic Examples
Today, in the era of designed therapeutics, we have an increasing tendency to see natural products as complementary, alternative, or otherwise outside the scope of scientific medicine. This is a mistake, and even more of a misperception historically. Three examples, quinine [5], digitalis [83, 1122, 1123], and penicillin [676], illustrate the flexibility and structure of scientific medicine as it applies to incorporating new ideas from natural products, and willingness to suspend formal and full application of biologic mechanistic knowledge in the face of convincing empirical data.

Cinchona bark containing quinine was used for malaria by indigenous people from about 1600 to 1900. In the seventeenth century, it was introduced into Europe, without universal acceptance. The first attempt to find the active element was by Gomez in the early nineteenth century. The alkaloid quinine was isolated in 1820 by Pelletier and Caventou. By 1900 the pure extract was frequently used for fever. After World War I, Plasmodium was discovered, which led to investigations into quinine's mechanism of action. Synthetic analogs replaced quinine after World War II and as a result of the Vietnam conflict. Presently quinine is used mostly for drug-resistant malaria. The history of quinine illustrates the origin of this therapy as a natural product, with eventual incorporation into scientific medicine and replacement of the natural compound by a designed synthetic.

The history of digitalis from natural product to conventional therapy was shortened by virtue of keen observation by the physician William Withering:

In the year 1775 my opinion was asked concerning a family receipt for the cure of the dropsy. I was told that it had long been kept a secret by an old woman in Shropshire, who had sometimes made cures after the more regular practitioners had failed. I was informed also, that the effects produced were violent vomiting and purging; for the diuretic effects seemed to have been overlooked. This medicine was composed of 20 or more different herbs: but it was not very difficult for one conversant in these subjects, to perceive, that the active herb could be no other than the foxglove [1572, 1573].
The botanical that would be named Digitalis lanata had been known as a medicinal to Dioscorides and Galen, and was also described by Welsh physicians in the thirteenth century. Leonard Fuchs (1501–1566) named the plant in his book De Historia Stirpium Commentarii Insignes in 1542 and recommended it for dropsy [549, 1035]. Withering was aware of Fuchs' book when he described the event above. Dr. John Ash, who was a colleague of Withering, used Digitalis to treat the Principal of Brasenose College, Oxford, who was suffering from pulmonary edema. Mr. Saunders, an apothecary of Stourbridge in Worcestershire, also used the plant routinely for dropsy [1189]. The foxglove plant was not studied scientifically until the nineteenth century. In 1841 the Société de Pharmacie de Paris awarded E. Homolle and Théodore Quevenne a prize for the isolation of digitalin from the plant. Digitoxin was isolated in 1875 by Oscar Schmiedeberg, and digoxin was identified in 1930 by Sydney Smith. Only in the 1970s were the effects of digoxin connected to left ventricular function in heart failure. The long history of Digitalis also illustrates the path from natural remedy to full incorporation into scientific medicine.

The evolution of penicillin from a mere mold to a lifesaving therapy was comparatively rapid. Alexander Fleming drew attention to the mold in 1928 while on the staff of
the Inoculation Department at St. Mary's Hospital, London. There, Fleming noticed the inhibition of growth of S. aureus around a colony of Penicillium notatum on a contaminated culture. Follow-up studies showed a substance that could kill microbes without harming animals. Fleming did not extensively describe his work in the medical literature. It would be 12 years later that Howard Florey, from the Oxford University Sir William Dunn School of Pathology, published convincing evidence of the therapeutic potential of penicillin. The historical scientific details surrounding the discovery of penicillin and its evolution as a therapeutic over the next 50 years are fascinating [676] but cannot be detailed here. For the present purposes, it is sufficient to point out that stories regarding the utility of molds for treatment of infections had been circulated for hundreds of years without scientific study. In this case, chance favored both a prepared mind and a laboratory setting, allowing scientific medicine to incorporate the natural product almost instantaneously.

Looking even lightly at these three examples might cause a critic to insist that the natural products in question were not alternative in the contemporary sense. My point is not whether they were alternative, but that appropriate reductionism can be applied to old or new potential therapeutics from any source to incorporate them into scientific medicine. Furthermore, this is the necessary pathway even if we first employ a therapeutic based on empirical evidence and leave mechanistic understanding until later. Origins are irrelevant provided that strong empirical and mechanistic evidence support therapeutic use.
4.6 SURGERY AND SKILL-DEPENDENT THERAPIES
The frequency and rigor of clinical trials in surgery and skill-dependent therapies are less than in many other medical disciplines [74, 1414]. The problem has been evident from the surgical literature for some time but seems persistent [1009]. Trials of surgical and other skill-dependent therapies, such as certain medical devices, can have substantive differences from those that test drugs or biologicals [449, 1017, 1270]. These include the potential for strong bias, the tradition of practitioners, and an expectation of large effect sizes. None of these eliminates the need for rigorous evaluation methods based on clinical trials. Uncontrolled studies, traditionally the most common design, are vulnerable to confounding by selection and observer bias. Reliance on case series in particular for therapeutic comparisons aggravates the potential for bias. For a practical discussion related to this point, see Horton and the studies he discusses [238, 326, 737, 977, 994].

The insular attitude of some surgeons regarding rigorous trials is superficially similar to that of CAM practitioners, in that it is not unusual for surgeons to distinguish themselves explicitly from "physicians." However, surgery and CAM are worlds apart with regard to biologically based justifications for their respective treatments.

Because of the absence of rigorous evaluation, many widely used surgical procedures have probably been unsafe, expensive, ineffective, or otherwise suboptimal. Grimes [647] provides some examples. Compared to drugs, surgical procedures may have a higher frequency of being ineffective or unsafe, based on the same reasoning. Beecher [131] provides some examples of historical interest where previously accepted surgical treatments were later rejected, such as prefrontal lobotomies for schizophrenia and colectomies for epilepsy.
In the middle stage of therapeutic development, investigators must distinguish between the feasibility of the procedure or technique and clinical benefit. Usually the failure of a surgical procedure is associated with an unfavorable outcome, but this does not mean that success of the procedure will yield a clinical benefit. From a methodologic point of view, we cannot substitute trials demonstrating successful completion of the procedure for those demonstrating improved patient outcome. Technical improvements that seem like a good idea may not easily demonstrate improved patient outcomes in rigorous studies. One example is laparoscopic-assisted colectomy [1198, 1531].

Historically, many surgeons have been reluctant to apply randomization to studies of patient outcomes, which has tended to make comparisons of surgical methods more difficult. Nevertheless, clinical trials of all types have been applied to surgical questions and have produced important information in areas such as vascular, thoracic, and oncologic surgery (e.g., Ref. [902]). For a review of the role of statistical thinking in thoracic surgery studies, see Ref. [1205]. McPeek, Mosteller, and McKneally [1021] discuss general issues regarding randomized trials in surgery, as do McCulloch et al. [1009]. A broad discussion of statistical methods in surgical research is given by Murray [1074, 1075]. When both treatments under study are surgical, a randomized trial might sometimes be easier to motivate (e.g., Ref. [1235]).

Surgery and skill-dependent therapies have other substantive differences from drugs, biologicals, and the like. We need to understand these because they can have a strong impact on the development and evaluation of treatments. Unlike drugs, surgical procedures are often developed by single investigators or small groups at relatively little cost. Surgical procedures have a high degree of determinism associated with them and often work on physical principles, unlike pharmacologic agents. Surgical treatments are only occasionally amenable to evaluation versus placebo. Apart from the "placebo effect" and selection factors favoring those subjects well enough to undergo surgery, a true surgical placebo, such as sham surgery, is nearly always ethically problematic. However, there have been notable trials in which sham surgery was justified. This is discussed in Section 4.6.5.

Among surgeons there is not a uniformity of perspective regarding controlled trials. In multidisciplinary collaborations for single therapeutic questions or in disease-oriented groups, surgical collaborators tend to be enthusiastic about studies. This reflects both enlightened attitudes by individual investigators and the fact that surgery is integral to the treatment of some diseases. Cancer is a case in point, where there is a collaborative oncology clinical trials group sponsored by the American College of Surgeons. At the other ends of the spectrum of therapeutics, where surgery is either the primary mode of therapy or has been little used, attitudes among surgeons toward controlled trials are highly variable and can be quite negative.
Examples of successful rigorous surgical trials include diverse studies such as hernia repair [1115], perineal suture after childbirth [616, 971], adjuvant therapy in lung cancer [1333], breast-sparing surgery for cancer [466, 1508], coronary artery bypass grafting [1176, 1599], thromboembolism prevention [285, 855], epilepsy [1553], lymph node dissection in melanoma and breast cancer [597, 1063], and Parkinson's disease (Section 4.6.5), to name but a few out of a large number. Coronary artery bypass surgery probably would have had its benefits demonstrated earlier if rigorous trials had been a part of its development. Other examples will be mentioned later.
The paradigm of how new surgical therapies are developed is different from that for drugs. The development of surgical techniques does not necessarily follow the structured developmental phases for drugs. Techniques tend to evolve slowly and are often amenable to improvement, even while being studied, to the point of becoming individualized to certain surgeons or institutions. In principle, new surgical techniques could be readily and quickly compared with standard methods in formal clinical trials soon after being shown to be feasible. However, despite the relatively relaxed need for developmental testing prior to a comparative trial, new surgical techniques are infrequently compared formally with old techniques.

4.6.1 Why Trials Have Been Used Less Extensively in Surgery
The prevailing mindset regarding developmental studies of surgery appears to be that such treatments are intuitively justified, have large treatment effects, and have favorable risk–benefit when applied in the right patients. Under these assumptions, relatively informal evaluation methods, such as case series or small prospective studies, are all that is needed to generate evidence. There are specific circumstances in which this confidence is justified, but generally these assumptions are overoptimistic. Surgical interventions are not inherently justified or guaranteed to have either a large treatment effect or favorable risk–benefit.

Regulation
There is no government regulation of surgical treatments requiring them to be safe or effective. Regulation is absent, in part, because such treatments are not advertised and marketed on a commercial scale the way that drugs are. Although there are no corporate entities that market surgical treatments on a large scale, there is compelling reason to have such treatments be safe and effective because a given procedure is likely to be applied extensively by many practitioners. For example, consider the marketing status of vision correction using laser surgery. This technique is currently marketed in a fashion similar to drugs.

Tradition
There is not a strong tradition of experimental trials in the surgical community, although it is improving in some disease areas such as cancer. Because of teaching and training, surgical culture is more respectful of experience and opinion than study design. The same traditions also limit formal training in research methods. I have indicated elsewhere that medicine historically has always resisted the introduction of rigorous evaluation methods. This tendency is aggravated in surgery, especially when surgical leaders set a poor example by their own investigational methods, and/or actively oppose reasonable but rigorous approaches. These factors can mistakenly reduce a surgeon's uncertainty, and correspondingly the equipoise concerning a trial that might actually be necessary.

Confounding
The outcomes of skill-dependent therapies confound three important effects: (1) the selection of subjects (prognosis), (2) the skill and supportive care of the practitioner, and (3) the efficacy of the procedure. Early in development this confounding makes some treatments look promising and may therefore inhibit rigorous independent evaluation
of the therapy. It is hard to overstate the potential for this confounding to obscure the real treatment effects. Though different in origin, this potential for triple confounding is similar to translational trials.

Incremental Improvement
Incremental improvement of accepted procedures is the historical norm for development of skill-dependent treatments. Accepted and familiar procedures can appear to carry ethical mandates before they are proven to be safe and effective. Incrementally improved procedures may therefore inherit a similar mandate. There is a myth in surgery that some procedures work when applied by skilled hands but not when used by inexperienced surgeons, or even that some complex procedures cannot be taught. That is nonsense—every operative procedure being applied routinely today was at some time in the past a complex and specialized procedure. What is demonstrably true is that a surgeon with experience reduces the risk to the patient, whether through the appropriate application of selection criteria, enhanced technical skill, or ancillary care. Demonstrations of this effect in the literature are old and convincing [137, 155, 668, 957]. Here again, this fact cannot be extrapolated to mean that experience also increases efficacy.

Masking and Placebos
Surgery is not very amenable to masking or the use of placebos, diminishing the expected rigor with which some such trials can be done. Many investigators would say that placebo or sham surgery is categorically unethical. This is a reaction to the pure risk, with no benefit, that one would expect on the sham surgery treatment. Although this perspective is reasonable most of the time, it is important to recognize circumstances in which a sham procedure conveys little or no risk, making a placebo trial possible. Also, there are many important therapeutic questions that do not require the use of a placebo control, such as comparison of surgery versus another modality.

Selection
Surgical treatments are often developed in, and applied to, patients with a good prognosis. Although they often increase the short-term risks to the subject, the subsequent risk may be lower than that in a nonsurgical group, simply by selection.

Economics
Surgical treatments are often developed at low cost by a single practitioner or a small group. This cost feasibility reduces the need for resources and collaboration. As a consequence there is less funding for (and less pressure to develop) the considerable infrastructure needed to support good clinical trials. The economics of health care favors the adoption and continued use of a surgical procedure that shows promise. A technique or procedure may become widespread early based largely on economic factors, making it very difficult to perform the needed comparative trials.

Psychology
Patients are psychologically very accepting of surgical therapies because they imply rapid and substantial benefit by correcting defects or removing disease. In reality, surgical
treatments are a very heterogeneous lot—some representing obvious benefits with large treatment effects, others more questionable. An example of a large treatment effect is surgical therapy for trauma, such as control of bleeding or correction of fractures. An example of a potentially small or adverse treatment effect is surgery for Parkinson’s disease, as mentioned above. It is sometimes claimed that a new surgical procedure must be used by a trained practitioner, implying that the standard of evidence in support of safety and efficacy can be relaxed. This second level of application reduces both risk and the need for rigor. However, prescription drugs also anticipate the intervention of a skilled practitioner. The skill of the practitioner may reduce risk, but neither skill nor a favorable attitude can substitute for efficacy.
4.6.2 Reasons Why Some Surgical Therapies Require Less Rigorous Study Designs

A clinical trial with an internal control is a method for estimating a treatment effect that is clinically important but of the same magnitude as, or smaller than, person-to-person variation. Treatments that yield sufficiently large effects, such as penicillin for pneumococcal pneumonia, can have their benefits convincingly demonstrated by study designs that do not control all sources of error. Some surgical interventions are of this type. Examples include trauma care, correction of fatal congenital defects, and control of bleeding. Any therapy that counteracts rapid demise is likely to be judged effective based on minimalist study designs. Large surgical treatment effects might also be seen in some chronic disease settings. An example of this is joint replacement for degenerative arthritis.

Confidence that the clinical and biological setting and preclinical data are consistent with large treatment effects could cause investigators to reduce the rigor of planned trials. Of course, such thinking is not restricted to surgical therapy, and it is often wishful thinking rather than reality. But it is common in surgery, perhaps because only large treatment effects truly offset short-term operative morbidities. The theme appears to be whether a proposed treatment is corrective as opposed to mitigating. We might expect corrective therapies to have large effects, whereas mitigating treatments would have smaller effects. Cures, so to speak, are dramatic and easy to detect. However, the investigator who plans a study only to detect a large corrective effect reliably is likely to miss valuable mitigating effects. Studies should be designed based on the magnitude of effect that is clinically important rather than what one hopes to see.

Another potential rationale for a small or informal study is when the treatment is substantially the same as one about which a great deal is known. This notion is used often with regard to medical device regulation, and seems relevant to many operative procedures that are widely established and then modified. Instinct tells us that incremental improvements do not require extensive evaluation, except that we can occasionally be wrong about the improvement of an increment. A sequence of increments constitutes drift, with the same implication. Also, we do not always know as much about the benefits and risks of the standard procedure as we would like, rigorous studies often being lacking there as well.
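To make the connection between effect size, person-to-person variation, and design rigor concrete, the following minimal sketch applies the standard normal-approximation sample size formula for comparing two means. The particular effect sizes and the 90% power target are hypothetical choices for illustration only; the point is how quickly the required numbers fall once the treatment effect is large relative to the natural variation.

from scipy.stats import norm

def n_per_arm(delta, sigma, alpha=0.05, power=0.90):
    # Approximate per-arm sample size to detect a mean difference delta
    # when the person-to-person standard deviation is sigma.
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

# Treatment effect expressed as a multiple of the person-to-person variation.
for ratio in (0.25, 0.5, 1.0, 2.0):
    print(f"effect = {ratio} x sigma -> about {n_per_arm(ratio, 1.0):.0f} subjects per arm")

An effect one quarter the size of the natural variation requires several hundred subjects per arm, whereas an effect twice the variation requires only a handful. This is the quantitative sense in which dramatic effects, such as control of bleeding, can be convincing without elaborate designs, while mitigating effects cannot.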
4.6.3 Sources of Variation
The comparison of two surgical procedures in a single randomized trial may be less prone to bias because the problem of differential stresses on the subjects is removed. However, it may be unlikely that two different surgical procedures for the same condition carry precisely the same risks. In fact this may be one of the primary reasons for comparing them. In this situation we can expect to encounter the same difficulties utilizing rigorous methods of bias control such as masking, placebos, and objective endpoints as for trials with a single surgical treatment arm.

Operator skill is another source of variation that may also create difficulties for surgical trials. Some surgeons, particularly those who develop a treatment or technique, may have a higher success rate or a lower morbidity rate than practitioners who are just learning to use the procedure. In most cases, the highest success rates and lowest complication rates appear to occur in series where the surgeon performs a large number of similar procedures each year. Whatever the source, different practitioners may produce substantially different outcomes from ostensibly the same surgical procedure. This can aggravate the variation or heterogeneity with which clinical trial designs must cope, and may also affect the generalizability of the results.

A frequent criticism of clinical trials is that they are beyond the scope of ordinary practice and that the subject cohorts are highly self-selected, rendering the results of questionable utility to routine practice. I do not find this criticism compelling, for reasons outlined in the discussion of treatment–covariate interactions. However, if a treatment, surgical or otherwise, cannot be applied as intended outside the scope of a clinical trial, the concerns regarding external validity might be justified.
4.6.4 Difficulties of Inference
The principal difficulty for inference arising from the absence of rigorous evaluation methods for skill-dependent therapies is the triple confounding of the subject’s prognosis (selection), the practitioner’s expectation (observer bias), and the true efficacy of the treatment. This confluence of selection and observer bias confuses the assessment of outcomes and makes reliable comparisons essentially impossible, except within the context of a well-designed and conducted trial. In addition to this confounding of effects, surgeons may be seduced into equating the feasibility of the operative procedure (technique) with clinical efficacy when in fact the two are distinct. The influence of subject selection is often not fully appreciated. Selection is widely used by surgical practitioners who speak of surgical judgment, defined to be not only the proper timing and application of a particular procedure but also the selection of appropriate surgical candidates. Subjects who are not robust enough to survive the surgical procedure will not receive the operation. Consequently, the results of a surgical case series cannot be compared to subjects who did not undergo the procedure, even if the cohorts appear superficially or statistically similar. Somewhat paradoxically, the better surgical judgment is, the stronger the selection effect. Even so, it is surprising how often inappropriate comparisons are made. A classic example of selection effects was seen in a randomized surgical trial of portacaval shunt for the treatment of esophageal varices, using survival as the primary outcome [563, 564]. The investigators followed both randomized groups, as well as a group of 288 subjects not selected for the trial. Although the randomized comparison
showed no beneficial effect of shunt, both enrolled groups did strikingly better than the unselected cohort. This trial is interesting both for its methodologic implications about selection bias and for its historical value. Although it appeared in a prominent journal 50 years ago, it is surprising how resistant surgical research design has been to the possible effects of selection bias.

Comparisons of case series, done so frequently in surgery, raise concerns regarding selection bias. This is similar to a well-known bias in epidemiology, where one version is termed the healthy worker effect. Individuals in the workplace are an unsuitable comparison group (controls) for subjects hospitalized with a particular illness (cases). Despite common exposures, workers are healthier in aggregate than hospitalized patients, or else they would not be able to remain at work. It is plausible that individuals hospitalized at different institutions might also have been subject to substantial selection effects.

Comparison of a surgical treatment to nonsurgical therapy in an RCT presents challenges absent from most drug trials. A problem arises if surgery induces a stress on one group, that is, a differential stress following randomization. Higher mortality in the surgical group, for example, can make the survivors appear to perform better or live longer than subjects in the comparison group even in the absence of a true treatment benefit. In other words, the weaker subjects may fare worse under surgery, making the survivors appear to have improved. Comparisons of mortality may be valid, but comparisons of survivors may not be. Exactly this situation arose in the National Emphysema Treatment Trial (NETT), a randomized comparison of lung volume reduction surgery versus medical management for patients with emphysema (discussed in Section 4.6.6). Short-term mortality in the surgical group might have biased functional outcomes in favor of surgical survivors. Any group comparison based only on survivors, such as averages of functional test measures, is subject to this effect. To correct for it, a dichotomous outcome can be defined in all subjects: improved versus unimproved, where the unimproved category includes deaths or those too ill to perform the test. Thus, such a trial can provide an unbiased estimate of mortality or certain derived outcomes, but it is not guaranteed to yield an unbiased estimate of the magnitude of functional improvement. Fortunately, NETT was designed to give precise estimates of the mortality risk ratio.
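The survivor bias just described is easy to reproduce in a small simulation. The sketch below uses entirely invented numbers: both arms have identical true efficacy, but operative deaths in the surgical arm fall on the frailest subjects. Comparing functional scores among survivors then flatters surgery, while the dichotomous outcome defined for every randomized subject, with deaths counted as unimproved, does not.

import numpy as np

rng = np.random.default_rng(1)
n = 5000  # subjects per arm (hypothetical)

def simulate(surgical):
    frailty = rng.normal(size=n)                      # higher value = weaker subject
    score = 50 - 10 * frailty + rng.normal(0, 5, n)   # identical efficacy in both arms
    if surgical:
        # operative deaths fall on the frailest tenth of the cohort
        died = frailty > np.quantile(frailty, 0.90)
    else:
        died = rng.random(n) < 0.01                   # low background mortality
    improved = (~died) & (score > 55)                 # deaths counted as unimproved
    return score[~died].mean(), improved.mean(), died.mean()

surg = simulate(True)
med = simulate(False)
print(f"mean score among survivors:          surgery {surg[0]:.1f} vs medical {med[0]:.1f}")
print(f"proportion improved (all subjects):  surgery {surg[1]:.2f} vs medical {med[1]:.2f}")
print(f"mortality:                           surgery {surg[2]:.2f} vs medical {med[2]:.2f}")

With these assumed numbers the survivor-only means favor surgery by roughly two points even though the arms are identical, while the improved/unimproved proportions computed over all randomized subjects show no such advantage, and the mortality comparison remains valid in either case.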
4.6.5 Control of Observer Bias Is Possible
An often-cited problem with performing surgical comparative trials is the difficulty in using masking and placebos, the goal being control of observer bias. It is well known that unmasked randomized trials tend to favor the expectations of the investigators. This is even more of a problem when the methods of evaluation are partly subjective, as is often the case in surgical studies. Symptom relief is a common outcome of many trials and is subject to this bias.

Surgical placebos necessarily take the form of sham procedures, which are widely claimed to be unethical. The reason is the unfavorable risk–benefit ratio that results from a sham procedure. This is distinctly different from the use of a drug placebo, which carries only the risk associated with denial or delay of treatment. A classic example of a sham procedure in a surgical trial came from studying internal mammary artery ligation for treatment of angina. The procedure was used in several uncontrolled studies, observed to be "effective," and was being widely accepted by surgeons [852].
Two studies were performed in which a control group was treated by a sham procedure (skin incision only) and compared to mammary artery ligation. Cobb et al. [271] reported similar improvements in both groups, and Dimond, Kittle, and Crockett [381] reported no benefit in either group. These trials convinced practitioners that the procedure was ineffective but also that sham operations were inappropriate.

However, there are circumstances in which the risk of a sham surgical procedure is near zero, and its use may be essential to help resolve an important therapeutic question. Investigators should probably keep an open mind about sham procedures. An example of this occurred in two small NIH-sponsored randomized trials of fetal tissue implantation for treatment of advanced Parkinson's disease [516, 517, 1144]. Parkinson's disease is one condition where it may be vital to control for a placebo effect [1381]. The treatment required a stereotactic surgical procedure to implant dopamine-producing fetal cells in the substantia nigra of patients with the disease. One anatomical approach to accomplish this is through burr holes in the frontal bone. Small bilateral skin incisions were made in the forehead, burr holes were drilled through the bone, and the needle was passed through the dura and advanced until the tip was in the correct location. This procedure is performed under sedation and local anesthesia, meaning the subjects are awake.

These trials were done using a sham procedure in the control group. The ideal placebo would have been transplantation of an inactive cell suspension. However, the investigators could not justify the risk attributable to actually passing needles into the brain of placebo subjects. The sham procedure consisted of the steps up to, but not including, the penetration of the dura. There was broad agreement among neurosurgeons that risk was negligible up to that point. Additional time and maneuvers were conducted in the operating room to be certain that the subject could not tell if the procedure was a sham. The study design also called for placebo-treated subjects to return after an evaluation period to undergo the real procedure, at which time the previously drilled holes would be used. Post-surgical evaluations of the disease were performed by neurologists who were unaware of the actual treatment assignment. The results of these trials do not support the initial enthusiasm for fetal tissue implantation.

These trial designs were reviewed extensively prior to implementation. The reviews included peer review for funding, sponsor review, local IRB review, and examination by an independent performance and safety monitoring board that also carefully followed the conduct of the studies. Even so, ethical questions were raised [20, 532, 970]. Each reader should evaluate these trials for his or her level of comfort. My sense is that they were appropriately designed (although small), well timed in the window of opportunity, and would likely have yielded biased results without the unusual sham procedures employed.

A strict prohibition against sham surgery is probably not sensible [20, 733, 734]. We might expect a sham surgical procedure to be more appropriate under a few fairly rigorous circumstances. These are when the sham is convincing but incomplete enough to produce negligible risk, when a subjective outcome must be used, and when no other design can truly meet the needs of the investigation. A sham procedure was used by Ruffin et al.
[1315] in their trial of gastric freezing for duodenal ulcers and recently by Moseley et al. [1064] in a study of arthroscopic knee surgery. Other procedural (but perhaps not technically surgical) shams have been reported for ulcers [455, 456], hemodialysis [231, 1120, 1345], peritoneal dialysis [1551], and laser photocoagulation [1277]. Investigators must thoroughly examine the clinical contexts and risk of the sham procedure. We cannot pretend that all widely applied surgical procedures are effective or
carry favorable risk–benefit ratios. This represents a risk of its own that might be offset by the use of sham controls in appropriate circumstances. Investigators must also be alert to surgical questions in which masking and placebos can be used without requiring a sham procedure. Examples of this are the biodegradable polymer studies in patients with brain tumors [182, 1543].

4.6.6 Illustrations from an Emphysema Surgery Trial
Many of the contextual issues regarding surgery are illustrated by the National Emphysema Treatment Trial (NETT), which was a randomized comparison of lung volume reduction surgery (LVRS) plus medical management versus medical management alone for patients with advanced emphysema [1092]. From its inception in 1997, NETT was surrounded by controversy rooted deeply in the context of surgery. The principal source of difficulty was that many thoracic surgeons believed that the benefits of lung volume reduction had already been established based on uncontrolled studies of postoperative lung function and the resulting interpretation by experts. Some believed a randomized trial was therefore unnecessary and unethical. Other surgeons were supportive of a randomized trial, especially when discussed in private, but group dynamics made it difficult for that view to become dominant. The poor reception of NETT by the academic surgical community was evident again when the trial was completed, as discussed below.
FIGURE 4.1 Overall survival following randomization in NETT. (Data from NETT Research Group [1092].)
Many surgeons failed to recognize the weaknesses of uncontrolled study designs, and therefore the poor standard of evidence upon which LVRS was based, despite the longevity of the procedure. This reflects both undue respect for experience and opinion, and lack of training in methodology. Probably the best attempt to put LVRS on a scientific footing was the work by Argenziano and Ginsburg [80], which curiously came out in 2002 less than a year before the NETT results would appear. The careful articulation of the state of the surgical art therein would soon be obviated by the report of a single rigorous clinical trial.
Because of deficiencies in the literature regarding LVRS, there was equipoise in the multidisciplinary collaboration needed to perform a definitive randomized trial, which is not to say that every investigator held a neutral position. Uncontrolled studies could not correct for the confounding of selection, experience, and treatment effect mentioned above. Follow-up data in case series were short and often incomplete. No long-term studies of survival had been done. Optimal selection criteria for surgery were uncertain. Medicare claims showed a very broad application of LVRS and a high mortality rate. The operative mortality ranged between 4 and 15%. Medicare data indicated a 6-month mortality of 17% and suggested a similarly high 1-year mortality following surgery. Thus, the information necessary for physicians and patients to assess accurately the risk–benefit profile of this major surgery was felt by many to be inadequate.
FIGURE 4.2 Mortality following randomization in NETT non-high-risk subgroups. (Data from NETT Research Group [1092].)
The Center for Medicare and Medicaid Services (CMS) was particularly interested in the LVRS question because of the high cost, wide application, and high mortality.
Even so, arguments were being made to have the procedure routinely covered by Medicare—an implicit endorsement of an accepted standard therapy. CMS initiated a collaboration with the National Heart Lung and Blood Institute (NHLBI) to test the value of LVRS rigorously and base a coverage decision on the results of the scientific study [228]. This setting was nearly ideal for an RCT. Several factors were less than optimal, however: some strong opinions had already been formed on the basis of weak evidence, the window of opportunity was rapidly closing, and there was a lack of support, or even hostility, among some members of the medical and lay community for a randomized trial. Two assessments of the LVRS question had been undertaken by the NHLBI and the Center for Health Care Technology Assessment to review what was known and identify research gaps. Both assessments endorsed the advisability of an ethically feasible randomized clinical trial [1537]. CMS developed new policies based on conditional coverage to allow a controlled trial to take place.

Design
Drafting the NETT protocol to eliminate deficiencies in earlier studies required much attention to ethics, clinical practice, management, design methodology, and trial conduct. The design objectives of NETT were to

· employ definitive clinical outcomes defined for every subject to reduce observer bias,
· require a lengthy active follow-up,
· use a large sample size and inclusive cohort,
· randomize treatments to reduce selection bias,
· analyze risk factors and definitive outcomes to determine selection criteria, and
· base the trial on biological characteristics and assessments of the disease to enhance external validity.
Meeting these objectives would yield needed data on the risks and benefits of LVRS and help delineate optimal subject selection. Randomization in each of multiple centers occurred only after subjects had been evaluated and assessed to be good surgical candidates, removing the effects of selection bias. Even so, operative mortality would create differential selection after randomization that could confound the treatment effect. For example, if weaker subjects in the surgical group had a higher short-term mortality than stronger subjects, the average postoperative pulmonary function in the surgical group would be superior to that in the medical group even if the treatments were equally efficacious. The concern was appropriate, as illustrated by the early findings from NETT, where a small subgroup of subjects was found to be at high risk of mortality following surgery [1093]. Randomization of this subset of subjects was discontinued early.

This survivor bias was counteracted in two ways: one using design and a second using analysis. First, NETT used overall survival as its primary outcome measure. There was controversy about this because some investigators felt that short-term pulmonary function, exercise, or quality of life was more appropriate. But considerations regarding statistical precision and clinical utility led to a larger, lengthier trial with survival as the primary outcome. Second, functional outcomes in the treatment groups were not
FIGURE 4.3 Distribution of maximum exercise changes at 2 years in non-high-risk NETT patients. (Data from NETT Research Group [1092].)
FIGURE 4.4 Change in maximum exercise measured by cycle ergometry in NETT survivors. (Data from NETT Research Group [1092].)
summarized as mean values defined only in survivors, a regnant flaw in the historical observational studies. Instead, each subject was classified as either improved or unimproved with respect to a given outcome, where the unimproved category included those deceased or unable to complete the evaluation. Thus, a functional outcome was defined for every subject.
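To make the survivor bias concrete, the following sketch (Python, with entirely hypothetical numbers unrelated to NETT) simulates a single treatment group in which frail subjects are both more likely to die early and more likely to have poor functional change. Summarizing the functional outcome only in survivors inflates the apparent benefit, whereas classifying every subject as improved or unimproved, with deaths counted as unimproved, does not.

# Hypothetical illustration of survivor bias; all values are invented for display only.
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Baseline frailty drives both early mortality and functional change.
frailty = rng.normal(0, 1, n)
change = -0.5 * frailty + rng.normal(0, 1, n)          # functional change if observed
p_death = 1 / (1 + np.exp(-(frailty - 2)))             # frail subjects die more often
died = rng.random(n) < p_death

mean_all = change.mean()                               # hypothetical complete data
mean_survivors = change[~died].mean()                  # the flawed historical summary
prop_improved = ((~died) & (change > 0)).mean()        # deaths counted as unimproved

print(f"Mean change, all subjects:         {mean_all:.2f}")
print(f"Mean change, survivors only:       {mean_survivors:.2f}  (optimistic)")
print(f"Proportion improved, all subjects: {prop_improved:.2%}")

The survivors-only mean exceeds the complete-data mean because conditioning on survival selectively removes the frailest subjects; defining the outcome for every randomized subject avoids that distortion.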
Results
NETT was executed smoothly and effectively at 17 collaborating centers. Recruitment was substantially lower than originally anticipated, reflecting the high rate of comorbidities in patients with advanced emphysema and overestimates of the applicability of LVRS. Treatment adherence and completeness of follow-up were outstanding—95 and 99%, respectively. This completeness contrasts with many of the nonexperimental studies of LVRS, illustrating one of the values of designed data production.

The overall survival findings from NETT showed no difference (Fig. 4.1). However, based on NETT’s prospective evaluation of selection criteria proposed by investigators, a qualitative treatment–covariate interaction was observed (Fig. 4.2). This survival result identifies a subset whose survival is improved by surgery (upper lobe disease, low exercise capacity; Fig. 4.2, upper left panel), and one in which surgery should be avoided (non-upper lobe disease, high exercise capacity; Fig. 4.2, lower right panel). The remaining two subsets have modest but clinically significant improvements in parameters such as exercise, pulmonary function, and quality of life [476]. An example is shown in Figure 4.3, which demonstrates the increased exercise capacity at 2 years in non-high-risk subjects assigned to LVRS. This data display accounts for every subject, including those who could not complete the intended evaluation due to death or illness, who are classified as unimproved. This contrasts with summaries such as Figure 4.4 for maximum work measured by cycle ergometry, which demonstrates improvements subject to the survivor bias. Although typical of pre-NETT reasoning and flawed, Figure 4.4 does demonstrate the remarkable heterogeneity in the outcome and the absence of functional determinism that results from LVRS.

The clinical findings from NETT were that (1) a high-risk subgroup could be defined using easily determined disease characteristics and parameters, (2) no overall improvement in survival is attributable to LVRS, (3) selection criteria could be improved for patients undergoing surgery, (4) survival is improved in optimally selected patients, (5) safety is improved by optimally excluding patients, (6) modest but clinically significant improvements were made in exercise capacity, pulmonary function, and quality of life in optimally selected patients, and (7) these findings carried a strong degree of statistical certainty and external validity. Also, NETT yielded a cost-effectiveness analysis that demonstrated costs per quality-adjusted life-year were substantially in excess of $50,000 in the near term but were likely to drop below this threshold after 5–10 years of follow-up [1245].

NETT findings yielded a quantitative basis for decision-making at all levels. CMS implemented a restricted coverage decision based on the trial. Methodological implications of NETT reinforce (1) the feasibility of well-executed trials of complex surgical questions, (2) the need for rigorous design (e.g., control of bias and long-term follow-up), and (3) the deleterious effects of lost data. The trial also provides some insights into the larger context of surgery, especially when examining events before the study was initiated and after it was completed.
Surgical Context after the Trial
The single most telling characteristic regarding the surgical view of NETT was a nonevent: specifically, the failure of the American Association for Thoracic Surgery (AATS) to accommodate the initial presentation of primary results from the RCT at its 2003 annual meeting. The reasons for this appeared superficially complex—scheduling, journal selection, and society procedures. The actual reasons, in my opinion, were a reflection
of weakened scientific values with respect to clinical experimentation in the surgical discipline. Sadly, the AATS suggested that the NETT primary results be presented at a continuing education program prior to its meeting rather than at a regular scientific session. This proposal was extremely unusual for an original scientific study and unacceptable to the NETT Steering Committee, who felt that the trial was too important for such a venue. Even if NETT had been afforded a regular AATS slot, the program allowed only an impossibly short 8 minutes for presentation. The first presentation of NETT results took place just days later in a three-hour scientific session at the 2003 American Thoracic Society, a multidisciplinary medical meeting.

Despite overcoming the obstacles to applying a rigorous randomized study in this context and the value of the clinical findings, NETT continued to attract criticism (e.g., Berger et al. [142]). Paradoxically, much of this criticism came from within the thoracic surgical community after completion of the trial, despite NETT yielding a conceptual concordance with prior belief and strong evidence supporting an LVRS indication in properly selected patients. As long as a year after completion of NETT, thoracic surgeons were still “debating” the trial [298]. Although lively scientific debate is healthy, complaints about NETT were a potpourri of concerns about study motivation and administration that have no effect on the results, minor methodological points, misunderstandings regarding analytic methods, grievances regarding restrictions on investigators, and even criticism of the way the results were peer reviewed. It also became clear that some surgeons did not fully understand the concept of equipoise.

Most of the concerns voiced about NETT a year after its publication had long since received due process. Thoughtful debate took place at many levels typical of a large multidisciplinary investigation: sponsor preparation, Steering Committee discussion about protocol design and implementation, independent ethics panel review, multiple IRB reviews, quarterly DSMB oversight, investigators’ meetings, manuscript preparation, peer review, and public debate. The fact that some of the issues covered continued to resonate almost exclusively within the surgical community illustrates the contextual problem. On all of these points—presentation of NETT results, methodologic criticisms, due process, and interpretation of results—surgeons remained privately divided. This unfortunately placed added strain on surgical investigators who saw the merits of what had been accomplished.

The NETT experience overall emphasizes the undue respect that surgeons traditionally have for uncontrolled studies and expert opinion. For example, critical examinations of the type aimed at NETT were never directed at the uncontrolled preliminary studies that supported LVRS. It also demonstrated that large, rigorous, complex clinical trials were outside the experience of many surgeons, even those in academic-based practice. Until teaching, training, and the culture change, such trials will continue to be feared and misunderstood in this context.

With about 5 years of follow-up for survival, the original findings of NETT were sustained [1109]. By 2014, about one hundred original peer-reviewed research papers had been derived from the NETT cohort.
Aside from that raw productivity, the NETT collaboration and culture stimulated a wave of clinical and translational research into chronic obstructive pulmonary disease. The legacy from the trial has been substantial, but disproportionately in medical disciplines despite the primary surgical question of the trial. This is most likely a consequence of the disparate contextual views of randomized trials.
4.7 A BRIEF VIEW OF SOME OTHER CONTEXTS

4.7.1 Screening Trials
Screening is a particular form of diagnostic testing, applied to a population at risk of developing a disease, with the purpose of diagnosing the condition earlier than natural history would reveal it. Screening is nearly always viewed positively by physicians and the public, perhaps because there is a presupposition that early detection of disease is good. Screening tests are typically applied repeatedly to individuals to diagnose newly incident cases, though at the initiation of any screening program or trial some prevalent cases will also be detected. Screening for cancer in older persons or those at high risk is a standard example.

Screening is not without risk to the screened population [179]. This fact alone adds some complexity to the solely positive impression that screening often conveys. Early detection of disease does not guarantee benefit [1578]. Harm can be done to the population by creating stress, false positives, unnecessary workups for disease, and complications from invasive diagnostic procedures. These phenomena are ubiquitous [696], which places strong demands for efficiency and safety on any screening modality.

The purpose of a screening trial is to assess the utility of the intervention compared to usual care in preventing deaths or complications from the index disease, and sometimes to compare different methods of diagnosis. Like therapeutic trials, a number of different study designs have been applied to screening questions, but the most reliable is the RCT. A general discussion of this context is given by Prorok [1237]. A premise of screening is that early diagnosis makes the condition more treatable or curable. If there are no good treatments, then screening and early diagnosis will not be helpful. Unlike therapeutic studies, screening trials nearly always carry a positive image, perhaps because there is a presupposition that early detection of disease is optimal.

Some generic issues surrounding screening must be understood by investigators at the outset of any study or review of a screening trial. Appreciating these points somewhat abstractly helps alleviate the instinctive bias in favor of early diagnosis that so easily arises in specific disease settings. The first general point has already been made above but is repeated here.

1. Screening can work only when effective therapy is available.
2. Screening creates overdiagnosis—diagnosing cases of disease that are false positives or indolent disease.
3. Exposure to radiation or other harmful modes affects risk–benefit.
4. Screening and early diagnosis will appear to lengthen survival compared to naturally arising diagnoses, even with no beneficial effect, because of lead time bias.
5. Screening detects slower evolving (longer and less aggressive) disease courses more readily than short aggressive disease courses, making it look better than it actually is (length-bias sampling).
6. If screening is a good idea, then screening for precursors of disease is better than screening for disease.
7. In a screening RCT, the control or usual care arm will be contaminated by subjects who receive some form of screening.
8. Even under seemingly optimal circumstances, screening may not show a benefit.
9. Disadvantages of screening can include use of resources, harm caused by work-up or treatment of false positives, and risks of the screening procedures themselves.

Most design issues in screening trials are similar to those for preventive and therapeutic agents, including the best choice for an outcome measure. Survival is often seen as the definitive outcome measure for screening for life-threatening diseases such as cancer. But other outcomes have been proposed, including the population incidence of advanced disease and stage shifts (a higher proportion of cases in earlier stage). Early detection induces a bias in the comparison of survival times that artificially makes screen-detected cases appear to live longer. This lead time bias must be corrected to estimate the true benefit of screening.

Another effect that can make screening look good is overdiagnosis. This is somewhat counterintuitive, but a sensitive screen will detect disease that is inconsequential, or at least more indolent than that which presents itself clinically. Some such cases may never have emerged clinically (i.e., the disease would have regressed or disappeared). These extra cases have good prognosis and inflate the number of disease cases detected without adding bad outcomes. This will seem to credit the value of screening which, in fact, did nothing to improve these outcomes. Overdiagnosis is a real phenomenon that may be difficult to quantify. A good example where overdiagnosis seems to be significant is in PSA screening for prostate cancer.

The number and interval of screenings is a unique design consideration, as is the interplay among sample size, trial duration, and the (temporally delayed) onset of the screening effect. A large, relatively short-duration trial might not provide sufficient time for the benefits of screen-detected cases relative to other cases to be seen. Thus, definitive screening trials may need to be as long as prevention studies, even if the disease under study has a relatively short natural history.

Screening methods and studies are similar to prevention trials with regard to risk–benefit and many other methodologic considerations. For example, the role of developmental studies is similar in that early feasibility trials are not formalized and may be replaced by observational evidence. Some methodologic considerations for such trials, including the use of reciprocal controls for screening trials, have been given by Byar and by Byar and Freedman [211, 214], and by Freedman and Green [524]. Because it employs diagnostic tests, procedures, or algorithms, and may use standard clinical evaluations, the framework for evaluating screening is different from that for prevention.

Prorok [1237] distinguishes screening trials that address a single question from those that investigate more than one question. The former is typified by the Health Insurance Plan (HIP) trial of breast cancer screening [1371] and the National Lung Screening Trial (NLST) [1106]. There are numerous examples of screening trials that examine multiple questions, also depending on how screening is applied (e.g., continuous or split). For some prototypical designs, see the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial [602, 603] and the Stockholm Breast Cancer Screening Trial [548].

Role of Regulation
Screening methods are not explicitly regulated, but their components may be (e.g., radiological procedures and diagnostic tests).
Screening is complex and expensive, and usually comes with sponsorship or the recommendation of governmental authorities. Therefore, it
is a virtually regulated context in that studies are carefully designed, reviewed, conducted, analyzed, and debated before conclusions are made.

Observer Bias
Observer bias can be controlled in screening studies in much the same way as in therapeutic or prevention trials, by using masking and objectively defined outcomes. Lead time bias presents some difficulties that cannot be corrected in a straightforward way by design.

Treatment Uniformity
Easy to define for the purposes of the trial. Uniformity is more difficult to achieve in the way screening is actually implemented on a population basis.

Magnitude of Effect
Screening is expected to have a relatively large effect, both in gaining early diagnosis and in reducing mortality and morbidity. This partly explains the positive view that many have regarding it. However, the real benefits of screening can only be gained through treatments whose effects are greatly amplified by early detection. Such treatments are difficult to develop in diseases like cancer. In contrast, for heart disease secondary to high blood pressure, screening and treatment may yield very large benefits.

Incremental Improvement
The role of incremental improvement can be substantial for screening interventions, particularly when they employ standard methods that are constantly being improved. Examples of this are radiographic procedures, especially computer-based ones, and genetic tests. The ability to improve a screening algorithm may not require extensive testing if convincing technical improvements are made in its components.

Risk–Benefit
Screening is generally very low risk, making any benefits that accrue favorably balanced from this perspective. However, cost-effectiveness may be unfavorable for some screening methods or when applied to low-risk populations. The true positive rate is a strong function of the background probability of disease, so the same algorithm can have very different properties when applied in different risk settings. It is also useful to reconsider the discussion about the potential negative effects of screening in a population mentioned in Section 4.4.1.

Tradition and Training of Practitioners
The tradition and training of practitioners greatly supports rigorous evaluation of screening interventions.

Examples
An interesting example of a recent screening trial is the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) [603]. It was a large, randomized study to determine whether the use of the respective screening tests would reduce death. These four malignancies account for 40% of the cancers diagnosed each year, and 44% of cancer deaths. More than 625,000 Americans were expected to be diagnosed with these cancers in 2015, and more than 250,000 were likely to die of these diseases. In this trial, 10 screening centers in the United States enrolled 155,000 participants (78,000 women
and 76,000 men) aged 55–74 with no previous history of prostate, lung, colorectal, or ovarian cancer. Participants were randomized to either an organized screening program or usual care (which sometimes included screening). Although enrollment began in 1993 and ended in 2001, the duration of the trial was much longer, including the design process and continued follow-up and analyses. The screening tests included the following. For prostate cancer, digital rectal examination and prostate-specific antigen blood test were used. A single-view chest X-ray was used for lung cancer screening. For colorectal cancer, flexible sigmoidoscopy to 60 cm was used. Ovarian cancer screening employed transvaginal ultrasound and serum CA-125 measurement. Most tests were performed yearly, but sigmoidoscopy was done at study entry and either 3 or 5 years later.

The PLCO trial demonstrated that colorectal cancer screening reduced incidence (21%) and death (26%) over an average follow-up of 12 years. Lung cancer mortality was not reduced, even in the subset of participants with a strong smoking history. This negative result seems to be typical of screening chest X-ray. Ovarian cancer screening in the manner indicated increased the number of cancers detected but did not reduce mortality. Complications from invasive pursuit of false positive results created problems for a significant number of women. The results for prostate cancer were followed for 13 years. Screening increased incidence by 12%, but did not reduce mortality. Here again, false positive results and treatments contributed some harm to some men. It is noteworthy that men in the control arm were often also screened for prostate cancer as part of usual care. Overall, PLCO demonstrated both benefits and risks associated with screening, that is, that screening is not the universally positive intervention it is often perceived to be.

A second more recent example is the National Lung Screening Trial (NLST), which compared low-dose CT scanning to single-view X-ray to screen for lung cancer, with the hope of reducing mortality [1106]. This randomized trial was performed at 33 centers in the United States and enrolled over 26,700 participants in each treatment group between 2002 and 2004. Follow-up for mortality continued through 2009. Low-dose CT yielded 24% positive results and X-ray yielded 7% positives. However, the false positive rate in each group was approximately 95%. The relative reduction in lung cancer mortality attributable to low-dose CT was 20%. This translated into needing to screen 320 individuals to prevent one death from lung cancer. The cost of screening in the recommended manner is not clear (a cost-effectiveness analysis was not part of the randomized trial) but could approach $100,000 per life-year, often taken as the break point for cost-effective interventions.

Both the PLCO and the NLST provide insights into key issues of screening that almost always escape public discussion. Screening for nearly any condition enjoys a strong positive public perception without awareness of the potential pitfalls. The fact that most positive screen results are false positives when even an excellent technique is applied to a low-risk population alerts us to the potential for well individuals to be harmed by invasive diagnostic procedures. Screen positives are predominantly true positives only when the population is sufficiently high risk. Screening, in a sense, spreads the potential harm of disease across the population.
Everyone screened tolerates a little inconvenience or minimal harm. Those who suffer a false positive diagnosis experience more worry and inconvenience resolving the true diagnosis. Some of those individuals may suffer considerable harm or death due to treatments or diagnostic procedures that would have been avoided without screening.
And of course those with true positive diagnoses may derive benefit from the early detection. There is no simple way to balance these concerns in general, either in a clinical trial of screening or in application to the population.
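The lead time bias described earlier in this section is easy to demonstrate numerically. In the sketch below (Python, with arbitrary hypothetical rates), screening advances the date of diagnosis but has no effect whatever on the date of death; survival measured from diagnosis nonetheless appears longer in the screen-detected cases.

# Sketch of lead time bias; all rates are hypothetical and chosen only for illustration.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

onset = rng.exponential(5.0, n)                  # preclinical onset (years)
sojourn = rng.exponential(2.0, n)                # detectable preclinical period
clinical_dx = onset + sojourn                    # diagnosis by usual clinical presentation
death = clinical_dx + rng.exponential(3.0, n)    # death time, unaffected by screening

# Screening detects the disease at a random point in the preclinical period.
screen_dx = onset + rng.uniform(0, 1, n) * sojourn

print(f"Mean survival from clinical diagnosis: {(death - clinical_dx).mean():.2f} years")
print(f"Mean survival from screen detection:   {(death - screen_dx).mean():.2f} years")
# The apparent gain is pure lead time; no death was postponed.

Comparing mortality from the time of randomization, rather than survival from the time of diagnosis, avoids this artifact, which is one reason mortality is the preferred outcome in a screening RCT.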
4.7.2 Diagnostic Trials
Diagnostic trials are those that assess the relative effectiveness of two or more diagnostic tests in detecting a disease or condition. Often one test will be a diagnostic “gold standard” and the other will be a new, less expensive, more sensitive, or less invasive alternative. Although not strictly required, the diagnostic procedures are often both applied to the same experimental subjects, leading to a pairing of the responses. In that case it is the discordancies of the diagnostic procedures that carry information about their relative effectiveness.

Diagnostic tests are often not developed in as structured a fashion as drugs and other therapeutic modalities. This is the case because the risks of diagnostic procedures are often low. They may be applied in apparently healthy populations, as in screening programs, and may therefore require exceptional sensitivity and specificity to be useful. Developing better diagnostic modalities presupposes that effective treatment for the disease is available. Otherwise, there is little benefit to learning about it as early as possible.

Diagnostic trials can also be affected by biases that do not trouble treatment trials. Suppose that a disease is potentially detectable by screening during a window of time between its onset (or sometime thereafter) and the time when it would ordinarily become clinically manifest. The “lead time” is the interval from the screening detection to the usual clinical onset. Because of person-to-person variability, those with a longer window are more likely to be screened in the window. Consequently, they will be overrepresented in the population detected with the disease. This produces length-biased sampling. See Ref. [1522] or Ref. [341] for a discussion of this problem.
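When both tests are applied to the same subjects, one simple way to analyze the paired results is an exact McNemar-type comparison that uses only the discordant pairs. The counts below are hypothetical, and the sketch illustrates the idea rather than prescribing an analysis for any particular trial.

# Paired comparison of two diagnostic tests; all counts are hypothetical.
from math import comb

both_pos, a_only, b_only, both_neg = 40, 18, 7, 135   # A+B+, A+B-, A-B+, A-B-
# The concordant counts (both_pos, both_neg) do not enter the comparison.

# Under equal detection rates, each discordant pair favors either test with
# probability 1/2, so a_only ~ Binomial(n_disc, 0.5).
n_disc = a_only + b_only
p_obs = comb(n_disc, a_only) * 0.5**n_disc
p_two_sided = sum(
    comb(n_disc, k) * 0.5**n_disc
    for k in range(n_disc + 1)
    if comb(n_disc, k) * 0.5**n_disc <= p_obs          # outcomes at least as extreme
)

print(f"Discordant pairs: {n_disc} (A+B-: {a_only}, A-B+: {b_only})")
print(f"Exact two-sided p-value: {p_two_sided:.3f}")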
4.7.3 Radiation Therapy
The characteristics that distinguish radiation therapy (RT) from other treatments, especially drugs, include the following. RT uses a dose precision that is higher than that for drugs. The RT dose is not based on weight but on the amount of the treatment, or energy, delivered to the target tissue. Practitioners of RT generally manufacture or produce their own medicine at the time of treatment—it is not produced remotely. There might be some exceptions to this rule for radioisotope use.

RT controls the therapeutic ratio. The dose delivered to the target tissue and that delivered to the adjacent tissue can be determined with considerable accuracy. The ratio can be titrated using source, dose rate, geometry, and other factors to produce the desired effects and minimize toxicities on surrounding tissue. Side effects are typically local toxicity from damage to normal tissue near the target. Many such side effects are manifest weeks or months after treatment, rather than hours or days as in the case of drugs. This has implications for the kinds of developmental designs that are appropriate for RT.

It is also important to note that RT can produce systemic side effects such as suppression of blood elements. One mechanism is via direct effects on bone marrow. But
there can also be significant systemic effects on blood cell counts through other, as yet unknown, mechanisms [653] that might be important for trial design. For the highest standard of validity and reliability, clinical trials in RT must account for these and other characteristics. See Ref. [1194] for a discussion of some quality control topics. Some of the factors related to dosing and treatment administration include the following:
· Dose and dose rate
· Source (e.g., photons or electrons)
· Equipment calibration
· Dose schedule (fractionation)
· Distance from source
· Target volume
· Dose delivered and tolerance of adjacent tissue
· Heterogeneity of dose delivered
· Intervals between fractions or treatments
· Use of concomitant treatment (e.g., drugs, oxygen, or radiosensitizers)

4.8 SUMMARY
The differential use of clinical trials in various medical contexts is a consequence of factors such as the role of regulation, ease with which observer bias can be controlled, treatment uniformity, expected magnitude of treatment effects, relevance of incremental improvement, and the tradition and training of practitioners. In a context such as drug therapy (a useful reference point because of its wide applicability), these factors combine to encourage the routine use of rigorous clinical trials. In contrast, the characteristics of a context such as complementary and alternative medicine generally discourage the use of clinical trials. Recognizing the forces at work and the characteristics of various contexts helps to understand the strengths and weaknesses of clinical trials as a broad tool in medicine and public health.

There are legitimate reasons why the stereotypical paradigm for drug development clinical trials does not generalize into some other contexts. Devices and surgical therapies sometimes share a characteristic (or expectation) that the treatment effect will be large, and therefore evident using relatively informal evaluation methods. They also often proceed using incremental improvement of existing therapies. However, it is critical to recognize the many circumstances where this is not the case, which would then require rigorous trials. The early development of preventive agents also often proceeds without the same types of clinical trials as drugs, using laboratory and epidemiologic evidence. The later development of preventive agents invariably uses the same type of rigorous trials as drugs.

Use of clinical trials in complementary and alternative medicine represents somewhat of a paradox. There is an obvious empirical method that seems relevant to this context, and some trials in CAM have proceeded accordingly. Trials are not purely empirical, however, and require biological underpinnings of mechanism to be truly scientific. CAM
does not routinely supply rationale based on biological mechanism, undermining the utility of trials in this context.
4.9 QUESTIONS FOR DISCUSSION
1. Discuss other specific contexts in which it would be unnecessary or inefficient to perform clinical trials.
2. Give specific examples of clinical trials that do not have all the characteristics of the scientific method.
3. What medical disciplines have stimulated clinical trials methods the most? Why?
4. Discuss reasons why and why not randomized prevention trials, particularly those employing treatments such as diet, trace elements, and vitamins, should be conducted without “developmental” trials.
5. In the therapeutic touch example from Section 4.5.3, note the difference between the grant title and the report title [1491, 1492]. Discuss the implications of this.
5 MEASUREMENT
5.1 INTRODUCTION
Measurement is at the core of science, and is therefore central to clinical trials. Instrumentalizing perception, or measurement, was highlighted as one of the foundational characteristics of science in Section 2.3.7. Our ability to learn about natural phenomena is in direct proportion to the quality of our measurements. This chapter focuses on what we measure in clinical trials and why, more than on how we measure it. There are other sources for measurement theory [343], particularly for developing and validating a new instrument.

There are three levels on which clinical trials represent measurement. First are the outcome measurements obtained on individual research participants, or experimental units of the study. These are the primary focus of this chapter. Second, a clinical trial is itself a sophisticated measurement of an important effect, directly in the cohort of interest, and indirectly in the population. Investigators judge the quality of measurement at the cohort and population level in much the same way as they assess the utility and quality of outcomes in the study cohort. At a third level, meta-analyses are summary measurements of treatment effect over a set of clinical trials.

Trials generally do not require elaborate measurement designs for the participants, even when the outcomes depend on sophisticated measurement technology like functional imaging or gene sequencing. Firm biological grounding of outcomes simplifies measurement design. The best measurements are biologically definitive with respect to the effects of treatment, which implies that they truly reflect changes in the underlying disease process. In dose-finding for example, gathering time-concentration data may present practical difficulties, but there will be little concern regarding the conceptual
validity of the drug levels as appropriate outcomes and predictors of effects. Surrogate outcomes raise questions of validity regarding the extent to which they capture the true treatment effect, even when we have confidence in the construct validity of the surrogate. For example, does progression-free survival inform us properly about treatment effects on overall survival?

Comparative trials can offer an opportunity to employ important indirect measures of treatment effect. Patient reported outcomes (PRO) such as quality of life (QoL) are a good example of an important and valid manifestation of therapy. Methodologic weaknesses in PRO tend to operate equally in comparison groups and yield valid measures of relative treatment effects. Care is needed because missing data can undermine the validity of PRO more readily than for many objective measures.

As useful as patient reported outcomes are, wearable technology may soon provide an even better window into patient-centric measures. These devices are unobtrusive if not actually fashionable, and can capture quantitative information on activity level, sleep, heart rate, stress, and other behavioral and physiological measures. This technology might substitute for some patient reported outcomes that would otherwise be subject to recall bias and lack of quantification. The ability to observe and quantify hidden phenomena like sleep characteristics creates an opportunity to use important patient-centric measures as outcomes or predictors.

A clinical trial may not be the best setting to develop a new measurement instrument. One cannot validate both a new measurement method and a therapeutic effect in the same study. Proven measurements minimize concerns over reliability, validity, responsiveness, and interpretability. On the other hand, the highly structured environment of a trial with active follow-up and outcome ascertainment might be a good opportunity to gather clean and complete data on a new instrument as a secondary goal.
5.1.1 Types of Uncertainty
Measurement replaces qualitative uncertainty with quantitative uncertainty, which is a valuable conversion. Quantitative uncertainty can be controlled in the following ways: (i) it can be made smaller as measurement methods become standardized and instruments improve, (ii) it can be standardized across investigators, institutions, and countries allowing disparate experiences to be integrated into a coherent whole, (iii) control allows us to objectify measures and reduce systematic error, and (iv) quantitative control allows new measures to be validated against standard ones. In short, quantification is the foundation for the key needs of scientific observation: precision, standardization, accuracy, and generalization.

The classical view of uncertainty categorizes error as random versus systematic, and defines accuracy and precision, respectively, as ways to describe the aggregate measurement process for each. If the average of an increasing number of measurements approaches the true value, we say that the measure is accurate and the error is random. If the average of an increasing number of measurements does not approach the true value, we say that the measure is biased. In either case, high precision implies lower random variation. We can be accurate but imprecise, or precise but inaccurate. Total error is the sum of these two pieces.

However, some refinements are helpful. Uncertainty is a broader concept than error. It can, for example, exist before any measurements are taken, and may be a lingering
consequence of design flaws. Thus, even precise measurement may not resolve all uncertainty. Random error exists only when perturbations have no preferred direction, which may not be simple to establish. Perturbations can appear systematic under one measurement scheme, but random under another. The physical properties of a measurement device such as resistance could produce a systematic effect, whereas an electronic method of obtaining the same measurement might remove the effect. An improperly designed questionnaire or assessment tool could show a similar defect. In any case, one should use the terms systematic error and random error only when the circumstance is clear. Random error and bias will be discussed in detail below.
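A small numerical sketch (Python, using hypothetical instrument readings and an assumed true value) may help fix the distinction: one measurement process below is accurate but imprecise, the other precise but biased, and a small standard error alone cannot tell them apart.

# Contrast of random and systematic error around an assumed true value of 100.
import numpy as np

rng = np.random.default_rng(3)
true_value, n = 100.0, 50

accurate_imprecise = true_value + rng.normal(0, 10, n)       # unbiased, large scatter
precise_inaccurate = true_value + 8 + rng.normal(0, 1, n)    # offset of 8, small scatter

for label, x in [("accurate but imprecise", accurate_imprecise),
                 ("precise but inaccurate", precise_inaccurate)]:
    sem = x.std(ddof=1) / np.sqrt(n)
    print(f"{label:24s} mean = {x.mean():6.1f}, SEM = {sem:5.2f}")
# The second process reports a small SEM while remaining far from the true value.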
Certainty or Precision Has Limits
Calibration is the idea of having a standard against which to adjust or correct a measurement. It is an intuitive idea for physical measuring devices and generalizes to control groups in comparative trials. Calibrating a measured effect in the experimental group with a control internal to the experiment is best, for example. Incorrect calibration can produce either random or systematic error. Reproducibility implies that the same readings will be obtained under the same circumstances. Lack of reproducibility is a serious flaw in any biological or clinical measurement. It is a phenomenon of partly subjective tasks, notoriously, for example, when radiologists are tested with the same images for the same readers [101].

Confidence in the results of our clinical trial depends substantially on the nature of the measurement underlying our experiment. The outcome measure must be appropriate to the question, and the actual observations on experimental subjects must be obtained in a way that validates it. The term “measurement(s)” is slightly ambiguous because it can refer either to an outcome measurement in the abstract, or to the observations that have been recorded. Often the singular “measure” or “measurement” refers to the outcome measure, and the plural “measurements” refers to the observations, but this rule is not universal since we can have multiple outcome measures in a trial. I will try to make the important distinctions clear in this chapter.

When taking measurements, judgment and inexactness arise, so that even accurate measurement is also a statement about imperfect human observation. Moreover, we must transmit our measurements to others, which immediately raises questions about reliability. The clinical trial itself is a kind of composite über-measurement. In this mixed subjective and objective process, how much confidence can someone have in our results? Such questions should probably be considered when selecting the primary outcomes for any clinical trial.

Experiment, measure, and reliability are inseparable concepts. To be useful and interpretable, a measured value must include a statement of reliability. But experimental observations are subject to multiple types of uncertainty, so that a single reliability indicator may not be adequate to convince our peers that the measurement is as correct as possible. For example, suppose we calculate a mean from some measurements taken at face value, and characterize reliability using the standard error of the mean (SEM). A small SEM may imply scientifically useful precision but does not guarantee the absence of bias in the measurements. We must look to the conduct of the study, outcome, methods, and even intentions and mindset of the observer to be confident that other sources
of uncertainty such as bias have been eliminated from the measurements. Reliability and precision can be equally elusive. Modern statistical thinking is partly an outgrowth of measurement sophistication, especially regarding a body of observations. Because we need to characterize multiple aspects of reliability like those just mentioned, statistical thinking incorporates reliability design for data production as discussed in Section 2.2.3. In other words, statistical thinking recognizes that reliability is a consequence of the active process of removing uncertainty by design as well as the passive process of characterizing variability post hoc.
5.2 OBJECTIVES
Clinical trials are one tool in a larger process whose goal is to reduce suffering from disease. Reducing the impact of disease includes prevention, early diagnosis to take advantage of existing treatments, more effective therapy, and better symptom control. Any of these broad objectives might employ a clinical trial, which might then be said to carry the implicit and qualitative objective of reducing disease burden. The impact of a given trial in reducing disease burden must account for information external to the study. For example, suppose that a new treatment is found to be superior to conventional therapy in a comparative trial. The management of disease might not improve unless the trial results are generalizable, and the new treatment is feasible to apply widely and is cost effective. Similarly, drug regulation is a high-impact process that depends on actual findings as well as subjective interpretation.

Trials also carry internal research questions or goals that are distinct from the larger objectives. An internal objective for a comparative trial might be to determine which therapy has superior efficacy and/or fewer side effects. Achieving such goals does not depend on which therapy is better, only on obtaining a valid result. Internal objectives include evaluating safety or pharmacokinetics, or establishing equivalence. The objectives of a safety and activity trial can be met whether or not the treatment is judged worthy of continued development.

At a third level, the objectives of a trial are linked to specific outcomes. Outcomes are quantitative measurements implied or required by the objectives. An outcome is determined for each study subject, whereas the objectives are met by appropriately analyzing the aggregate of outcomes. The word “endpoint” is often used synonymously with outcome in this sense. I prefer the latter term because the occurrence of a particular outcome may not imply the “end” of follow-up or anything else for a study participant.

The most appropriate outcome measure depends on the specific objective and the way it is quantified and defined. For example, suppose that the objective of a trial is to determine if a new surgical procedure “reduces peri-operative morbidity” relative to some standard. The assessment of operative morbidity can be partly subjective, which may be a substantial issue if more than one surgeon participates in the study. At least three aspects of morbidity need definition. First is a window of time during which morbid events could be attributed to the operative procedure. Second is a list of events, diagnoses, or complications to be tracked. Third is specification of procedures or tests required to establish each event definitively. Using these criteria, “peri-operative morbid events” can be well defined and a definitive outcome can be established for each participant. Without the necessary definitions, individual outcomes may be ambiguous.
A clinical objective may have more than one way of being quantified or may be described by more than one outcome. For example, “improved survival” or “survival difference” might mean prolonged median survival, higher 5-year survival, or a lower death rate in the first year. Each definition might require different methods of assessment, and need not yield the same sample size, overall design, or analysis plan for the trial. In the peri-operative morbidity example, the outcome could be defined in terms of any of several events or a composite. Knowing which outcome and method of quantification to use in particular clinical circumstances is an essential skill for the statistician. Trials invariably have a single primary outcome, and often many secondary ones. This is largely a bookkeeping device because usually more than one outcome is actually essential. We can only actively control the type II error for one objective. Secondary objectives have type II error properties determined passively by the sample size for the primary objective. Multiple objectives or outcomes require multiple analyses, some of which may be based on subsets of the study cohort. This increases the possibility of error as discussed in Section 20.8.2.
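As an illustration of how one clinical objective can map to several outcomes, the sketch below (Python) computes three quantifications of a survival difference from the same hypothetical, uncensored survival times. A real trial would of course use methods that accommodate censoring.

# Three quantifications of "improved survival" from the same hypothetical data.
import numpy as np

rng = np.random.default_rng(4)
control = rng.exponential(4.0, 1000)    # survival times in years (no censoring)
treated = rng.exponential(5.5, 1000)

for name, t in [("control", control), ("treated", treated)]:
    print(f"{name}: median = {np.median(t):.2f} y, "
          f"5-year survival = {np.mean(t > 5.0):.2f}, "
          f"first-year death rate = {np.mean(t <= 1.0):.2f}")
# Each definition implies a different effect size, and hence a different sample
# size, design, and analysis plan.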
5.2.1 Estimation Is the Most Common Objective
Most clinical objectives translate into a need to estimate an important quantity. In dose-finding trials, the primary purpose is usually to estimate an optimal biological dose, sometimes the maximum tolerated dose (MTD). More generally, we might view these trials as estimating the dose–toxicity relationship in a clinically relevant region (although the conventional designs employed are not very good at this). The objective of middle development studies is usually to estimate activity and toxicity probabilities using a fixed dose of drug or a specific treatment. Randomized comparative trials estimate treatment differences or risk ratios, and large scale trials estimate rates of complications or side effects.

The importance of estimation goes beyond characterizing objectives. Estimation also carries strong implications with respect to how trials should be analyzed, described, and interpreted. A key is the emphasis on clinically relevant effect estimates rather than hypothesis tests.
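For instance, a comparative result can be reported as a risk ratio with a confidence interval rather than as a bare p-value. The event counts below are hypothetical, and the interval is the usual large-sample one on the log scale.

# Risk ratio and 95% confidence interval from hypothetical two-arm event counts.
import math

events_trt, n_trt = 30, 200
events_ctl, n_ctl = 50, 200

rr = (events_trt / n_trt) / (events_ctl / n_ctl)
se_log_rr = math.sqrt(1/events_trt - 1/n_trt + 1/events_ctl - 1/n_ctl)
lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
hi = math.exp(math.log(rr) + 1.96 * se_log_rr)

print(f"Risk ratio = {rr:.2f} (95% CI {lo:.2f} to {hi:.2f})")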
5.2.2 Selection Can Also Be an Objective
Some trials are intended primarily to select a treatment that satisfies important criteria, as opposed to simply estimating the magnitude of an effect. For example, the treatment with the highest response rate could be selected from among several alternatives in a multiarmed randomized middle development trial. The absolute rates could be unimportant for such an objective. The size of each treatment group in such selection designs might be too small to permit typical pairwise comparisons with high power. The goal of such a design is not to test hypotheses about pairwise differences but rather to rank the estimated response rates. Issues related to selection as an objective are discussed in Sections 6.3.1, 13.3.3, and 13.7.3. Frequentist sequential trial designs (see Chapter 18) use significance tests to select the treatment that yields the most favorable outcome. When sequential designs terminate early, the overall type I error is controlled, but the treatment effect or difference may be overestimated. These designs select the best treatment but do not provide an unbiased
estimate of differences. There are many situations where this is useful. For example, when choosing from among several developing treatments, it may be important to know which one performs best without certainty about the magnitude of the difference.
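The operating characteristics of such a selection design are easy to explore by simulation. The response probabilities and per-arm sample size below are hypothetical.

# Simulated pick-the-winner selection among three arms; parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(5)
true_rates = np.array([0.20, 0.20, 0.35])     # the third arm is truly best
n_per_arm, n_sims = 30, 20_000

responses = rng.binomial(n_per_arm, true_rates, size=(n_sims, len(true_rates)))
prob_correct = (responses.argmax(axis=1) == true_rates.argmax()).mean()

print(f"Probability of selecting the truly best arm: {prob_correct:.2f}")
# The design ranks observed rates (ties resolved to the lowest index here); it does
# not give unbiased estimates of pairwise differences, nor power for testing them.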
5.2.3 Objectives Require Various Scales of Measurement
Clinical trial objectives use measurements that fall into one of four numeric categories. The first is classification, or determining into which of several categories an outcome fits. In these cases, the numerical character of the outcome is nominal, namely, it is used for convenience only. There is no meaning ascribed to differences or other mathematical operations between the numbers that label outcomes. In fact, the outcomes need not be described numerically. An example of such a measurement is classifying a test result as normal, abnormal, or indeterminate. The results could be labeled 1, 2, or 3, but the numbers carry no intrinsic meaning.

A second type of objective requires ordering of outcomes that are measured on a degree, severity, or ordinal scale. An example of an ordinal measurement is severity of side effects or toxicity. In this case, we know that category 2 is worse (or better) than category 1, but not how much worse (or better) it is. Similarly, category 3 is more extreme, but the difference between 3 and 2 is not necessarily the same as the difference between 2 and 1. Thus, the rankings are important but the differences between values on the ordinal scale have no meaning.

Still other objectives are to estimate differences and lead to interval scales of measurement. On interval scales, ordering and differences between values are meaningful. However, there is an arbitrary zero point, so ratios or products of measures have no meaning. An example of an interval scale is pain severity, which can be measured with an arbitrary zero point. A difference in pain on a particular scale is meaningful, but the ratio of two scores is not.

Finally, the objectives of some studies are to estimate ratios. These lead to ratio scales on which sums, differences, ratios, and products are all meaningful. Elapsed time from a clinical landmark, such as diagnosis (time at risk), is an example of a ratio scale. Both differences and ratios of time from diagnosis are meaningful. Examples of various scales of measurement for hypothetical data concerning age, sex, and treatment toxicity are shown in Table 5.1. Toxicity grade is an ordinal scale, age rank is an interval scale, age is a ratio scale, and sex code is nominal. The outcomes that are appropriate for each type of objective are discussed in the next section.

The scale that probably comes closest to being ideal for measured values is one expressing sizes or differences relative to natural variation. This can be as simple as dividing measurements by the population standard deviation or its estimate, yielding a “standardized effect.” Representing the effect in standard deviation units is helpful because it automatically accounts for the magnitude of natural variation, something that we were going to do anyway when assessing the importance of an effect size on its usual scale. Natural variation is a very important biological phenomenon, and accounting for its size in estimates of treatment effect is useful. For example, a standardized effect equal to one is easily understood, and will almost always be “clinically significant.” To shift an effect by an amount equal to or larger than the natural person-to-person variation seems destined to be important. Lesser standardized effects may also be important, especially to those with the disease, but a shift of one
TABLE 5.1 Examples of Different Data Types Frequently Used in Clinical Trials

ID Number   Age    Sex       Toxicity Grade   Age Rank   Sex Code
1           41     M         3                1          0
2           53     F         2                5          1
3           47     F         4                2          1
4           51     M         1                4          0
5           60     F         1                9          1
6           49     M         2                3          0
7           57     M         1                7          0
8           55     M         3                6          0
9           59     F         1                8          1
⋮           ⋮      ⋮         ⋮                ⋮          ⋮
Scale       Ratio  Category  Ordinal          Interval   Nominal
standard deviation is difficult to dismiss. It hardly matters what the disease, treatment, or outcome is—such an effect is almost surely of interest clinically.
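A standardized effect is simple to compute. The sketch below (Python) uses hypothetical measurements and a pooled standard deviation as the estimate of natural variation.

# Standardized effect: difference in means expressed in standard deviation units.
import numpy as np

rng = np.random.default_rng(6)
control = rng.normal(50, 10, 120)    # hypothetical outcome measurements
treated = rng.normal(56, 10, 120)

diff = treated.mean() - control.mean()
pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)

print(f"Difference = {diff:.1f} units, standardized effect = {diff / pooled_sd:.2f} SD")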
5.3 MEASUREMENT DESIGN
I will refer to the specification of measurements in a trial as the measurement design. It demands the same attention as all the other elements of trial design. In definitive trials, one might argue that only four types of measurement are actually worthwhile: increasing survival; improving quality of life; decreasing side effects; and decreasing costs. These outcomes seem to be the most relevant for proving worthwhile and valued therapies in any context. These measures are, in my view, patient centered, which means that they all reflect phenomena of direct relevance and benefit to the participant in the trial as well as to future patients with the disease.

My short list of measures may be acceptably restrictive for large definitive trials. However, many other clinical and biological outcomes are needed developmentally. Some of these are not patient centered but are vital for making developmental decisions. Examples are biological markers, blood or tissue levels, surrogates, imaging assessments, and slowing or stabilizing disease progression, among others.

Very early in development investigators often put forward “feasibility” as the primary outcome of a trial. Much of the time, this is a ruse. Feasibility is an appropriate primary outcome provided (i) there are legitimate questions around the technology, (ii) it is well defined by events in the trial, and (iii) the study decision rules are clear. What often happens in practice is that feasibility is put forward vaguely and to deflect criticism, and sometimes to guarantee a success. Often investigators have no intention to terminate development based on feasibility, so no decision rules are articulated.
5.3.1 Mixed Outcomes and Predictors
Measurements taken during a clinical trial can potentially serve one of several purposes. The first purpose is to capture the effect of treatment, that is, as an outcome. A second use
is to predict subsequent events or outcomes. There is not necessarily a problem with this duality, unless the roles are confused by design or analysis. Measurements can represent independent domains in which to evaluate treatment effects, such as assessments of side effects (risk) versus efficacy (benefits). They can also serve as markers of disease, triggers for changing therapy, and surrogate outcomes. The investigator must be thoughtful and explicit about how a given measurement will be used. Both the study design and analysis depend on this clarity.

Measurements that are fixed at the time of study entry can only be used as predictors. Examples are subject attributes such as sex and other demographic characteristics. Any measurement that can change during follow-up could, in principle, serve any role mentioned above. An exception is survival, which is a pure outcome. The most appropriate use of a given measurement depends on biology, the setting of the study, the nature of the treatment, and the purposes of the trial. For example, a subject’s functional abilities could be the most relevant outcome for a palliative therapy, but may also strongly predict later function or other outcomes. Thus, measures of function could have both roles within the same study.
Criteria for Evaluating Outcomes
There are a handful of general considerations for evaluating or selecting an outcome for a clinical trial (Table 5.2). Most of these characteristics are intuitive and require no additional explanation. The degree to which a proposed outcome satisfies ideal characteristics could be the basis for choosing it, although other considerations such as tradition often play a major role. The outcome employed should be methodologically well established so that the investigators can expect the results of the trial to be widely understood and accepted. Outcomes (or endpoints as they are often called) can be parsed endlessly into various categories and attributes. Every disease has its own set of choices with varying merits and problems. However, my view is that we can save ourselves from the detail and potential confusion surrounding outcomes with a simple dichotomy as follows. The most powerful outcomes are those that reliably reflect on modification of the underlying disease, whatever it is. In comparison, there are all the other less useful outcomes. Disease modification means that the biological course of the condition has been improved in a fundamental way. This is distinct from mere symptom improvement, which while it may be useful to a person with the disease, does not automatically reflect an
TABLE 5.2 Considerations for Evaluating and Selecting Outcomes

Characteristic    Meaning
Relevant          Clinically important/useful
Quantifiable      Measured or scored on an appropriate scale
Valid             Measures the intended construct or effect
Objective         Interpreted the same by all observers
Reliable          Same effect yields consistent measurements
Sensitive         Responds to small changes in the effect
Specific          Unaffected by extraneous influences
Precise           Has small variability
underlying improvement. Disease modification is also distinct from other outcomes that track manifestations of the condition, such as biomarkers, that are unproven in reflecting true benefit. One outcome that reflects disease modification is survival, in conditions that are known to shorten it. Improved survival always reflects a beneficial modification to the underlying disease process. In development of cancer therapeutics, another outcome that can have the same connotation is tumor shrinkage or response. It is well known that short-term tumor response does not correlate with prolonged survival, nor does it always yield a tangible benefit. But therapies that show a response offer a particular promise that deserves more developmental effort, because there has been an alteration in the biology of the disease. In CHD and stroke we know that lowering blood pressure is disease modifying. And in HIV infection we know that reduction of viral load also reflects a fundamental modification of the disease.

Tumors, atherosclerotic lesions, and infectious agents produce pathological lesions that can be biopsied, removed, measured, and sampled. Many other diseases cannot be observed in this direct way. The inability to observe the underlying disease process directly presents great challenges to therapeutic development, because it complicates proof of disease modification. This has been the case in degenerative neurological diseases, for example. A trial design that may be helpful in those circumstances is discussed in Section 14.4.2.

The role of the outcome(s) employed in a clinical trial is analogous to that of diagnostic criteria for a disease. Proper assessment and classification depend on each, as does clinical relevance. For translational, early, and middle developmental trials, evidence of biological marker activity is usually sufficient for clinical relevance. Biological marker activity is often evidenced by appropriate laboratory measurements on serum or tissue, or imaging studies, but the trial itself may contain marker validation as a secondary objective. For definitive comparative trials, clinical relevance can only be established by suitably defined measures of efficacy. If a relevant clinical efficacy parameter is not used in such trials, the role of the treatment in practice will not be well defined regardless of the outcome.
5.3.3 Prefer Hard or Objective Outcomes
"Hard" outcomes are clinical landmarks that rank high with respect to the above-stated characteristics: they are well defined biologically (and in the study protocol), are definitive with respect to the disease process, and require no subjectivity. Thus, they are not prone to observer bias. Examples include death, disease relapse or progression, and many laboratory measurements. "Soft" outcomes are those that do not rank as high with regard to the ideal characteristics. This distinction is not the same as that between definitive and "surrogate" outcomes, discussed below.

An example of why subjectivity is undesirable in clinical trial outcomes is the so-called Hawthorne effect, named after experiments at the Hawthorne plant of the Western Electric Company in the 1920s and 1930s [731a]. These studies tested the effects of working conditions on productivity and demonstrated that even adverse changes could improve productivity. However, it turned out that the research subjects were affected by the knowledge that they were being tested, illustrating that study participants can respond in unexpected ways to support the research hypothesis if they are aware of it. While such
effects may be more likely when subjective outcomes are used, they can also influence hard outcomes that depend on changes in behavior. Some useful and reliable outcome measures fall between the extremes of "hard" and "soft". An example is pathologic classification, which is usually based on expert, experienced, and objective judgment. Such outcomes are likely to be useful in clinical trials and prognostic factor studies because they are valid and reliable. The important underlying issue with outcome measures is how error prone they are. Even a good outcome, such as vital status, can be made unreliable if investigators use poor methods of ascertainment.

5.3.4 Outcomes Can Be Quantitative or Qualitative
Defined outcomes and prospective methods of assessing them are among the characteristics that distinguish rigorous experiment designs from other types of studies. Thus, the strength of evidence from a trial depends greatly on these aspects of design. The most important characteristics of the outcome used in a study are that it corresponds to the scientific objective of the trial and that the method of outcome assessment is accurate and free of bias. These characteristics are important, not only for subjective outcomes like functional status or symptom severity but also for more objective measures such as survival and recurrence times. These outcomes usually become evident long after the treatment is begun, providing a chance that incomplete follow-up, ascertainment bias, or informative censoring can affect the results.

There are several classes of outcomes that are used in many types of trials. These include continuously varying measurements, dichotomous outcomes, event times, counts, ordered categories, unordered categories, and repeated measures (Table 5.3). Measurements with established reliability and validity might be called measures. In the next section, each of these will be described with specific examples.

5.3.5 Measures Are Useful and Efficient Outcomes
Measured values that can theoretically vary continuously over some range are common and useful types of assessments in clinical trials. Examples include many laboratory values, blood or tissue levels, functional disability scores, pain scales, and physical dimensions. In a study population, these measurements have a distribution, often characterized by a mean or other location parameter, and variance or other dispersion parameter. Consequently, these outcomes will be most useful when the primary effect of a treatment is to raise or lower the average measure in a population.

TABLE 5.3 Type of Outcomes and Corresponding Summaries and Comparisons

Type                   Summary                  Compare
Measures               Mean and variance        Differences
Counts                 Total; average; rate     Differences; rate ratio
Dichotomy              Proportion               Odds ratio
Ordered categories     Mean; proportion         Odds ratio; proportional odds
Unordered categories   Mean; proportion         Odds ratio
Event time             Hazard                   Hazard ratio
Typical statistical tests that can detect differences such as these include the t-test or a nonparametric analogue and analyses of variance (for more than two groups). To control the effect of confounders or prognostic factors on these outcomes, linear regression models might be used, as is often done in analyses of (co)variance. The efficiency of measured values comes from the fact that every observation contributes to the overall summaries (mean and standard deviation, for example), and that comparisons of mean values are more precise for a given sample size than just about any other statistic.

Measured outcomes are not always as clean clinically as we might like or hope. Consider the problems associated with measuring a difficult factor like mild cognitive impairment (MCI). This construct is important in Alzheimer's, Parkinson's, and Huntington's diseases and other degenerative neurological conditions. Ideally, we would have valid disease-specific tools to measure MCI, although presently they do not all exist. Aside from disease specificity, we would want our assessment to track MCI longitudinally and to be sensitive to treatment interventions. Investigators have a dilemma whether to use a general but nonspecific assessment of MCI, develop and validate disease-specific instruments, or employ an alternate strategy.

Consider as a sort of worst-case alternative that we have a test to detect MCI but it has no validity or sensitivity for longitudinal tracking. It would be valid to randomize subjects to treatment versus control, begin treatment, and test to detect MCI after follow-up. Such a design could attribute differences in incidence of MCI reliably to treatment. In other words, we could develop MCI therapies this way, except that the study design(s) would have to be much larger than ones based on a valid quantitative longitudinal measure.
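As a minimal sketch of the comparison of mean values just described, assuming two hypothetical treatment arms with a continuously measured outcome (the group labels, sample sizes, and effect size are illustrative, not from the text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Hypothetical measured outcome (e.g., a functional score) in two arms;
# the half-standard-deviation difference is purely illustrative.
control = rng.normal(loc=50.0, scale=10.0, size=60)
treated = rng.normal(loc=55.0, scale=10.0, size=60)

# Two-sample t-test comparing the mean values between arms.
t_stat, p_value = stats.ttest_ind(treated, control)

print(f"difference in means = {treated.mean() - control.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

For more than two groups, an analysis of variance (or a regression model with covariates) would play the analogous role.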
5.3.6 Some Outcomes Are Summarized as Counts
Count data also arise frequently in clinical trials. Count data are most common when the unit of observation is a geographic area or an interval of time. Geographic area is only infrequently a source of data in treatment trials, although we might encounter it in disease prevention studies. Counting events during time intervals is quite common in trials in chronic diseases, but the outcomes counted (e.g., survival) usually number 0 or 1. These special types of counts are discussed below. Some assessments naturally yield counts that can take on values higher than 0 or 1. For example, we might be interested in the cancer prevention effects of an agent on the production of colonic polyps. Between each of several examination periods, investigators might count the number of polyps seen during an endoscopic exam of the colon, and then remove them. During the study period, each subject may yield several correlated counts. In the study population, counts might be summarized as averages, or as an average intensity (or density) per unit of observation time.
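To make the idea of an average intensity concrete, here is a small sketch, with entirely hypothetical polyp counts and observation times, of how counts in each arm might be reduced to a rate per person-year and compared as a rate ratio:

```python
import numpy as np

# Hypothetical data: polyp counts per subject and years of observation.
counts_a = np.array([3, 0, 2, 5, 1, 0, 4])   # treatment arm
time_a   = np.array([2.0, 1.5, 2.0, 2.0, 1.0, 2.0, 1.5])
counts_b = np.array([6, 2, 4, 7, 3, 5, 2])   # control arm
time_b   = np.array([2.0, 2.0, 1.5, 2.0, 1.5, 2.0, 2.0])

# Average intensity: total events divided by total person-time in each arm.
rate_a = counts_a.sum() / time_a.sum()
rate_b = counts_b.sum() / time_b.sum()

print(f"rate A = {rate_a:.2f} polyps per person-year")
print(f"rate B = {rate_b:.2f} polyps per person-year")
print(f"rate ratio (A/B) = {rate_a / rate_b:.2f}")
```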
5.3.7 Ordered Categories Are Commonly Used for Severity or Toxicity
Assessments of disease severity or the toxicity of treatments are most naturally summarized as ordered categories. For example, the functional severity of illness might be described as mild, moderate, or severe. The severity of a disease can sometimes be described as an anatomic extent, as in staging systems used widely in cancer. In the case of toxicities from cytotoxic anticancer drugs, five grades are generally acknowledged.
Classifying individuals into a specific grade, however, depends on the organ system affected. Cardiac toxicity, for example, ranges from normal rate and rhythm to atrial arrhythmias to ventricular tachycardia. Neurologic toxicity ranges from normal function, to somnolence, to coma. There is no overall scale of toxicity independent of organ system. When used as a primary outcome or effect of treatment, ordered categories can capture much of the information in quantitative scores but may be easier and more convenient to apply. For example, ordered categories might be useful in assessing outcomes for degree of impairment due to chronic arthritis, functional ability and muscular strength in multiple sclerosis, or functional ability in chronic heart disease. When measures are categorized for simplicity or convenience, investigators should retain the continuous measurements, whenever possible, to facilitate different analyses or definitions in the future.

Summarizing outcomes from ordered categories requires some special methods of analysis. One would not automatically look for linear trends across the categories because the ordering itself may not be linear. For example, the difference between a grade IV and grade V side effect may not imply the same relative change as between grade I and grade II. Consequently, care must be exercised in the choice of analytic methods.
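One analysis that respects the ordering without assuming the grades are equally spaced is a rank-based test. The sketch below, using made-up toxicity grades for two arms, is one reasonable choice among several (a proportional odds model is another):

```python
from scipy import stats

# Hypothetical worst toxicity grade (0-4) recorded for each subject.
grades_a = [0, 1, 1, 2, 0, 3, 1, 2, 0, 1]   # arm A
grades_b = [1, 2, 3, 2, 4, 1, 3, 2, 2, 3]   # arm B

# Wilcoxon rank-sum (Mann-Whitney) test: uses only the ordering of the
# grades, not any assumed numeric spacing between them.
u_stat, p_value = stats.mannwhitneyu(grades_a, grades_b, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.4f}")
```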
5.3.8 Unordered Categories Are Sometimes Used
Outcomes described as unordered categories are uncommon in clinical trials but are occasionally necessary. For example, following bone marrow transplantation for acute leukemia, some patients might develop acute graft versus host disease, others might develop chronic graft versus host disease, and others might remain free of either outcome. These outcomes are not graded versions of one another. In most circumstances unordered categorical outcomes can be reclassified into a series of dichotomies to simplify analyses.
5.3.9 Dichotomies Are Simple Summaries
Some assessments have only two possible values, for example, present or absent. Examples include some imprecise measurements such as shrinkage of a tumor, which might be described as responding or not, and outcomes like infection, which is either present or not. Inaccuracy or difficulty in accurate grading can make a measured value or ordinal assessment into a dichotomous one. In the study population these outcomes can often be summarized as a proportion of “successes” or “failures”. Comparing proportions leads to tests such as the chi-square or exact conditional tests. Another useful population summary for proportions is the odds, log-odds, or odds ratio for the outcome. The effect of one or more prognostic factors (or confounders) on this outcome can be modeled using logistic regression. When summarizing proportions, one should be certain of the correct denominator to use. Problems can arise in two areas. First, there is a tendency to exclude subjects from the denominator for clinical reasons, such as failure to complete an assigned course of treatment. The dangers of this are discussed in Chapter 19. Second, the units of measurement of the denominator must be appropriate. For a proportion, the units of the denominator must be persons, as compared with person-years for a hazard. For example, if we compare the proportions of subjects with myocardial
TABLE 5.4 Cross-Tabulation for Dichotomous Outcomes

Group    Outcome +    Outcome −    Total
A        a            c            a + c
B        b            d            b + d
Total    a + b        c + d        T
infarction in two cohorts, we might calculate $r_1/n_1$ and $r_2/n_2$, where $r_1$ and $r_2$ are the numbers of subjects with the event. However, if the follow-up time in the two cohorts is very different, this summary could be misleading. It might be better to use the total follow-up time in each cohort as the denominator. This treats the outcome as a risk rate rather than a proportion.
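A tiny numerical sketch of this distinction, with hypothetical event counts, cohort sizes, and follow-up times, shows how a proportion and a person-time rate can give different impressions when follow-up differs:

```python
# Hypothetical cohorts: events, subjects, and total follow-up in person-years.
r1, n1, t1 = 30, 1000, 1500.0   # cohort 1: longer average follow-up
r2, n2, t2 = 20, 1000, 600.0    # cohort 2: shorter average follow-up

# Proportions use persons as the denominator.
p1, p2 = r1 / n1, r2 / n2
# Rates use person-time as the denominator.
rate1, rate2 = r1 / t1, r2 / t2

print(f"proportions: {p1:.3f} vs {p2:.3f}  (cohort 1 looks worse)")
print(f"rates/yr:    {rate1:.3f} vs {rate2:.3f}  (cohort 2 looks worse)")
```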
5.3.10 Measures of Risk
Proportions
Assessing risk quantitatively is a very natural and common measurement objective in clinical trials, as it is in medicine and epidemiology broadly. A ubiquitous starting place for many risk measurements is the cross-tabulated data summary shown in Table 5.4. There we might have a clinical trial with treatments A and B and the outcome totals shown. The chance of a positive outcome on treatment A is $p_A = a/(a+c)$ and on treatment B is $p_B = b/(b+d)$. The treatment difference can be summarized as a relative risk, estimated by

$$\hat{R} = \frac{p_A}{p_B} = \frac{a(b+d)}{b(a+c)}. \qquad (5.1)$$

If this study were a randomized trial with nearly equal group sizes, $\hat{R} \approx a/b$, indicating that such a design reduces the estimate of relative treatment effect to a simple ratio requiring no fancy analysis. $\hat{R}$ and other summaries of relative risk are dimensionless numbers that summarize the treatment effect. Another common and useful summary of the results is based on the odds of success, which for group A is

$$\frac{p_A}{1-p_A} = \frac{\frac{a}{a+c}}{\frac{c}{a+c}} = \frac{a}{c},$$

so the odds ratio estimate is

$$\widehat{OR} = \frac{ad}{bc},$$

which is often used as a summary of relative risk. For positive outcomes that are infrequent, $a \ll c$ and $b \ll d$, so equation (5.1) becomes

$$\hat{R} = \frac{a(b+d)}{b(a+c)} \approx \frac{ad}{bc} = \widehat{OR}.$$

This justifies the use of the OR in many epidemiologic studies as a measure of relative risk. However, the OR has its own pedigree as a summary of differential risk.
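A minimal sketch of these calculations, using a hypothetical 2 x 2 table in the notation of Table 5.4 (the counts are invented for illustration, and the confidence interval uses the familiar log odds ratio approximation):

```python
import math

# Hypothetical 2 x 2 table in the notation of Table 5.4:
# group A: a events, c non-events; group B: b events, d non-events.
a, c = 30, 170
b, d = 15, 185

p_a = a / (a + c)
p_b = b / (b + d)

rr = p_a / p_b               # relative risk, equation (5.1)
odds_ratio = (a * d) / (b * c)

# Approximate 95% CI for the odds ratio on the log scale.
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"p_A = {p_a:.3f}, p_B = {p_b:.3f}")
print(f"relative risk = {rr:.2f}")
print(f"odds ratio = {odds_ratio:.2f}  (95% CI {lo:.2f} to {hi:.2f})")
```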
TABLE 5.5 Events, Exposure Time, and Hazards

Group    Outcome +    Outcome −    Exposure Time    Hazard Rate
A        a            c            T_A              a/T_A
B        b            d            T_B              b/T_B
Event Times
Another important summary of risk relates to event times, hazards, and hazard ratios. The key feature of these measures is the need to account for longitudinal follow-up at the subject level and, analogously, total exposure in the trial cohort. An example is shown in Table 5.5, for which the similarity to Table 5.4 should be noted. For an event time analysis, the relevant denominators are not the total number of subjects in the treatment group, but the total person-time or exposure times, $T_A$ and $T_B$. Except for accounting for follow-up time and contribution to total time at risk, we will not make direct use of nonevents, and have little need for totals across the treatment groups. The estimated hazard per person-time of exposure on group A is

$$\hat{\lambda}_A = \frac{a}{T_A},$$

and likewise for group B. The hazard ratio is then estimated by

$$\hat{\Delta} = \frac{\hat{\lambda}_A}{\hat{\lambda}_B} = \frac{a/T_A}{b/T_B}.$$

Again, if the exposure times were equal in the two treatment groups,

$$\hat{\Delta} \approx \frac{a}{b},$$

indicating again the power of design to simplify analyses. As a general rule, however, we always scale by the respective exposure times.

Number Needed to Treat
Another measure of risk difference is the number needed to treat (NNT), defined as the reciprocal of the absolute risk reduction (ARR). The ARR is the difference in absolute outcome probabilities, $p_A$ and $p_B$, on two treatments, so

$$\mathrm{NNT} = \frac{1}{\mathrm{ARR}} = \frac{1}{p_A - p_B},$$

where we assume the A outcome is desirable and $p_A > p_B$. I will not discuss situations in which the ARR and NNT can be negative, implying harm. NNT is not used directly as an outcome in clinical trials but can be helpful in interpreting the clinical significance of relative risks. Low values of NNT are good, indicating that fewer individuals need to receive treatment to produce one good outcome. If every treated person in group A has a good outcome but none in group B do, NNT = 1, which is the optimal value. Large values of NNT indicate a less useful treatment. For example, a difference in outcome
probabilities of 0.01 yields NNT = 100, which we would interpret as having to treat 100 individuals to incrementally benefit one. NNT can reflect the potential ineffectiveness of a treatment even when it is found to be statistically superior or associated with strong relative risk improvements. For example, suppose a new vaccine safely reduces the case fatality rate from an infectious disease from five to two per 100,000 people vaccinated. The relative risk of death is 5/2 = 2.5 times higher on the old treatment. However, NNT ≈ 33,000, indicating that the new vaccine might not be a worthwhile population improvement, and its use might depend strongly on cost. In the hypothetical case of an Ebola vaccine (Section 4.4.3) that reduces the case fatality rate from 50 to near 0%, NNT would be about two, indicating a very strong utility.

NNT has important shortcomings as an outcome measure. For example, as presented above, it is said not to have the dimension of "persons" despite its name and interpretation. The probability components, $p_A$ and $p_B$, are dimensionless numbers, so $1/\mathrm{ARR}$ must be as well. To clarify this point, we might ask: assuming a given ARR, how many people need to be treated to save one life? Including the dimension of people, the equation to solve is

$$\mathrm{ARR} \times N \text{ people} = 1 \text{ person}.$$

This equation has the correct units on both sides. Solving for $N$ yields $N = 1/\mathrm{ARR}$ people. Although the literal reciprocal of ARR is dimensionless, this equation demonstrates that NNT is the number of people with magnitude $1/\mathrm{ARR}$. Its implied dimensions are correct.

Returning to other issues, using standard approximation methods, the expected value of NNT is

$$E\{\mathrm{NNT}\} = E\left\{\frac{1}{\mathrm{ARR}}\right\} \approx \frac{1}{E\{\mathrm{ARR}\}}\left[1 + \frac{\mathrm{var}\{\mathrm{ARR}\}}{E\{\mathrm{ARR}\}^2}\right].$$

The ordinary use and interpretation of NNT therefore assumes the squared coefficient of variation for ARR (the second term) is small compared to 1, which seems unlikely to be generally true. It might be true if ARR is determined from a large population. The simplistic estimate is therefore biased [482]. For the same reason, it is complicated to construct serviceable confidence intervals for NNT [22, 139]. Also, there is no threshold below which we would say that treatments are equivalent. NNT does not utilize an additive scale, and so cannot be used directly in meta-analyses [754, 755, 921].

Risk versus Safety
Risk and safety are distinct concepts. The two assessments do tend to converge in large experiences. However, the distinction is important in the small cohorts of developmental trials. Risk is measurement based. For example, we might count the frequency of clinically unacceptable events attributable to a therapy, such as side effects, toxicities, deaths, hospitalizations, and complications. These measurements allow us to quantify risk probabilities, risk event rates, and their precision. Developmental trials are fully capable of yielding this information, keeping in mind that the precision of such measures relates
inversely to the observed number of events, or the study sample size, both of which will be limited.

An unfortunate illustration of the manifestation of high risk in a developmental trial was the clinical trial of TGN1412 in 2006. The drug was a CD28 superagonist antibody being tested by TeGenero in a dose finding trial. Six volunteers were given the drug intravenously at a dose 500 times smaller than that found safe in animal studies. All six experienced a life threatening cytokine release syndrome and multiorgan failure for which extensive intensive care was required. Some had permanent injuries. This experience has been extensively discussed [750] and resulted in important changes in the early development of biologic agents. For example, the starting dose is now determined based on the minimum anticipated biological effect level (MABEL) as opposed to the previously used no observed adverse effect level (NOAEL). Slower infusion rates, closer patient monitoring, sequential enrollment of subjects, and use of highly skilled treatment centers with ICU facilities are now required. Finally, this experience taught the need for rigorous evaluation of preclinical studies by independent experts.

Safety is not a measurement. Safety is a judgment based on the informative absence of risk events. To be informative or convincing, risk events should be infrequent, but based on circumstances that could have reliably detected them. This requires employing large cohorts or person-time of exposure. Experiences expected to yield nothing in the presence of clinically unacceptable risk are uninformative with respect to safety. This often happens in the small cohorts of developmental trials.

We can quantitate these ideas using the Poisson probability distribution, which is the standard model for rare events [797]. The probability of observing $k$ events when they occur with rate or intensity $\lambda$ is

$$\Pr[k] = \frac{(\lambda n)^k e^{-\lambda n}}{k!},$$

where $n$ is the cohort size. For example, a 95% probability of observing some events equates to a 5% chance of observing exactly 0 events,

$$0.05 = \frac{(\lambda n)^0 e^{-\lambda n}}{0!} = e^{-\lambda n},$$

or

$$\lambda n = -\log(0.05) = 3.0, \qquad (5.2)$$

where $\lambda$ and $n$ refer to specific values that satisfy the last equation. For a cohort of size $n$, $\lambda = 3/n$ is the threshold below which we lose reliable resolution of the event rate. Rates below this value have a greater than 5% chance of yielding zero events. If rates below this threshold are clinically meaningful, the study cannot provide convincing evidence of safety. If we are interested in something other than the case of exactly zero events, confidence bounds like those discussed in Sections 16.4.3 or 16.8.1 need to be used.

Behavior of the 5% safety threshold is shown in Figure 5.1. Any combination of event rate and sample size below the diagonal line stands a reasonable chance of yielding zero events from the cohort. In a study of 20 subjects, $\lambda = 3/20 = 0.15$; event frequencies below 0.15 are likely to yield no events. For example, if the true event rate is $\lambda = 0.1$, there is a 13.5% chance of seeing no events in the cohort.
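The arithmetic behind this threshold is easy to reproduce. A brief sketch, assuming only the Poisson model above (the cohort size and event rates are illustrative):

```python
import math

def prob_zero_events(rate: float, n: int) -> float:
    """Poisson probability of observing no events in a cohort of size n
    when the per-subject event rate is `rate` (expected count = rate * n)."""
    return math.exp(-rate * n)

n = 20
threshold = -math.log(0.05) / n   # rate at which Pr[0 events] = 5%, eq. (5.2)
print(f"threshold rate for n = {n}: {threshold:.3f}")   # about 0.15

for rate in (0.05, 0.10, 0.15, 0.25):
    p0 = prob_zero_events(rate, n)
    print(f"rate {rate:.2f}: Pr[no events] = {p0:.3f}")
```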
FIGURE 5.1 Highest rate consistent with ≥ 5% chance of observing zero events at the specified sample size based on equation (5.2).
This explains how moderately frequent events can remain undetected in trial cohorts. The absence of clinically significant events cannot be taken as evidence of safety when the study cohort does not yield sufficient person-time of exposure. High risk can be detected in small trials based on only a few events, but establishing low risk or safety requires extensive experience. This also explains why safety signals sometimes appear only after a new drug or device is approved and marketed. Developmental experience can be inadequate to reveal rare but important risks, whereas a wider post-market exposure could be sufficient. The well-known experience with the nonsteroidal anti-inflammatory drug rofecoxib [168] and its withdrawal from the commercial market is an example of this phenomenon.

5.3.11 Primary and Others
The most important component of measurement design is designation of a primary outcome that captures the biological effect central to the primary question. A trial has designed statistical properties only with regard to the primary outcome. Precision or power for other outcomes will be passively determined by the resulting sample size. Calculating or describing the passive power for secondaries is useful but not equivalent to active design. There is an important subtlety however. Measurements do not all have the same statistical efficiency. It could be worthwhile to employ an important low efficiency measure as primary and a more efficient measure as secondary, if the priority is otherwise not critical. The study would then yield adequate precision for both. For example, values measured on a continuum and summarized with means and standard deviations tend to have the highest efficiency for comparisons. The sample size needed to detect differences is smallest for such measurements. In contrast, censored event times tend to have low efficiency. If an outcome of each kind were important in a trial, using the event time as
primary would essentially guarantee that the precision for the measured value was high. The converse may not be true. Differing precisions in critical outcomes arose when designing the lung volume reduction surgery trial discussed in Section 4.6.6. Functional outcomes were measures with a high degree of precision, whereas survival was a censored, less efficient outcome. The primary outcome was chosen to be survival; the resulting sample size guaranteed a super-nominal power for the functional outcomes. If the converse were chosen, the trial would have been underpowered for the mortality endpoint.

Aside from precision, designation of a primary outcome helps to reduce the proliferation of hypothesis tests and control overall type I errors. It is a rookie mistake to allow tests to proliferate, which can guarantee a "significant" finding merely by chance. Discipline with regard to primary and secondary outcomes helps to control this multiplicity problem. In addition, a sensible plan for analyzing secondary outcomes, of which there may be several of interest, will control type I errors.

5.3.12 Composites
Composite outcomes are those defined on the basis of more than one individual event. For example, in some trials of cardiovascular disease, a composite event time of major adverse cardiac events (MACE) has been used. MACE usually includes efficacy and safety outcomes but does not have a standard definition across trials [850]. For example, MACE might be defined in some studies by myocardial infarction, unstable angina, cardiovascular death, revascularization, fatal/nonfatal cerebrovascular accident, peripheral arteriopathy, or aortic event. In other trials it might be a subset of these, include relevant hospitalizations, or all-cause mortality. Such composites serve several purposes, including capturing all clinically relevant events for the population under study, increasing the number of study outcomes so that fewer subjects will be needed, and alleviating concern over competing risks when a single endpoint is used. The lack of standardization is a problem however.

With a true composite like MACE, it is appropriate at the end of a trial to look at specific components to see what did and did not contribute to differences. The precision to do this is less than for the overall composite. Also, each component typically contributes equally to the overall outcome. However, some components are more important clinically than others. In MACE, a nonfatal myocardial infarction will have the same weight as sudden cardiac death on the outcome. This may not be clinically appropriate.

The classic composite outcome is overall survival: an event is triggered by death from any cause. This is a primary outcome in many trials in serious chronic diseases like cancer and cardiovascular disease. Clinical investigators sometimes assert that disease-specific mortality would be a more appropriate outcome. There are several problems that reduce the utility of a more specific mortality summary. One is that it yields fewer events, requiring larger study sizes. A second problem occurs when there are competing, nonindependent causes of death, which seems to be the rule rather than an exception. In a study focused on cancer, death attributed to cardiovascular causes may actually contain information about cancer risk. But if events are singularly classified, we have essentially censored a cancer death as if the censoring mechanism was independent of the underlying risk of cancer death. This yields a biased underestimate of the risk we are most interested in. Finally, weighting all causes of death equally seems appropriate. For these reasons, overall survival is the preferred mortality outcome.
Some trials have used composite outcomes of dissimilar measures and employed a global statistical test of differences. This method might allow a range of outcomes from survival, to measured values, to PRO to be treated as though they represented a single outcome [651, 890, 945, 961, 1147]. This strategy helps power and precision and yields a single assessment of a treatment effect. However, it does not provide clarity on the principal drivers of the effect or clinical interpretation of a global outcome construct.

Only clinically important components should contribute to a composite outcome. All components must be likely to respond to the treatment under study, and should be elements in the same spectrum of cause and effect. For example, progression or death in cancer or degenerative neurological disease is a sensible composite. Some consideration should be given to assigning clinical weights or utilities to the components to balance their impact. For example, hospitalization for chest pain and fatal myocardial infarction might be weighted according to their differential clinical importance. Sensitivity of results to such weights needs to be assessed. When reporting a composite outcome, each element should also be summarized individually.

5.3.13 Event Times and Censoring
Measurements of the time interval from treatment, diagnosis, or other baseline landmarks to important clinical events such as death (event times) are common and useful outcomes in chronic disease clinical trials. Survival time is an often cited example of a “definitive” outcome because it is usually determined with minimal error. Many other intervals might be of clinical importance, such as time to hospital discharge or time spent on a ventilator. The distinguishing complication of event time measurements, like many longitudinal assessments, is the possibility of censoring. This means that some subjects being followed on the trial may not experience the event of interest by the end of the observation period. The nature of accruals and event times in a clinical trial with staggered subject entry is shown in Figure 5.2. In this hypothetical study, subjects are accrued until calendar time 𝑇 . After 𝑇 there is a period of additional follow-up lasting to 𝑇 + 𝜏. Some subjects are observed to have events (denoted by x’s) during the study period (e.g., #3–#6). Others
FIGURE 5.2 Accrual, follow-up, and censoring on an event time study. 𝑇 represents the end of accrual, and 𝑇 + 𝜏 the end of follow-up. Observation #2 shows administrative censoring. Observation #1 is lost to follow-up.
are lost to follow-up during the study, as denoted by the circles (e.g., #1), and may have events after the study period. Still others remain event free at the end of the study, also denoted by circles (e.g., #2). Thus, subjects #1 and #2 are censored. Censoring of event times most often occurs when an individual is followed for a period of time but is not observed to have the event of interest. Thus, we know only that the event time was greater than some amount but do not know its exact value. This is often called right censoring. Censoring can also occur if we observe the presence of a state or condition but do not know when it began. For example, suppose that we are estimating the distribution of times from seropositivity to clinical AIDS in patients at high risk of HIV. Some members of the cohort will already be seropositive at the start of the observation period. The time to AIDS is censored for these observations because we do not know the precise point of seroconversion. This is often called left censoring. Event time data can also be interval censored, meaning that individuals can come in and out of observation. Most event time data are right censored only, so the term “censoring” most commonly means “right censoring”. Additionally, three types of censoring based on the nature of follow-up have been described. The usual type of censoring in clinical trials is random (or type III) because the staggered entry and losses to follow-up produce unequal times at risk and censoring. Some laboratory experiments place all animals on study at the same time and end after a fixed observation period. This produces type I censoring where all censored times are the same. Alternatively, if the investigator waits until a fixed proportion of the animals have had events, type II censoring is produced. Type I and II censoring are usually not seen in clinical trials. We do not discard censored or incomplete observations but employ statistical methods to use all the available information about the failure rate contained in the follow-up time (time at risk). For example, if we followed a cohort of subjects for 50 person-years of time and observed no deaths, we have learned something useful about the death rate, even though no deaths have been observed. Using the information in censored observation times requires some special statistical procedures. Event time or “survival” distributions (e.g., life tables) might be used to summarize the data. Clinicians often use medians or proportions at a fixed time to summarize these outcomes. One of the most useful summaries is the hazard rate, which can be thought of as a proportion adjusted for follow-up time. The effect of prognostic factors or confounders on hazard rates can often be modeled using survival regression models. Right censoring can occur administratively, as when the study observation period ends, or throughout a trial if study participants are lost to follow-up. The statistical methods for dealing with these are the same. The common methods that account for right censoring assume that the censoring mechanism is independent of the outcome. This means that there is no information about the outcome in the fact that the event time is censored. If this is not true, a situation called informative censoring, then the usual methods of summarizing the data may yield biased estimates of risk. For example, suppose individuals are more likely to be censored just before they experience an event. 
Treating the censored observations as though they are independent of the event will underestimate the event rate.
TABLE 5.6 Example of Censored Event Time Data

ID Number    Exposure Time    Censoring Indicator
1            141              0
2            153              1
3            124              0
4            251              0
5            160              1
6            419              0
7            257              1
8            355              0
⋮            ⋮                ⋮
Event Time Data Require Two Numerical Values To capture the information in event times, whether censored or not, it is necessary to record two data elements for each individual. The first is the time at risk (follow-up or exposure time). It is a measure such as number of days, weeks, months, or years of time. The second item needed is an indicator variable that designates whether the event time represents the interval to an event or to a censoring point. The censoring indicator is usually given the value 1 for an event and 0 if the observation is censored. An example is shown in Table 5.6. If both left and right censoring are present, two indicator variables will be required to describe the risk interval fully. The distribution of event times can be described in several ways. Often, it is summarized as the cumulative probability of remaining event free over time. Because this method is used so commonly in survival applications, the resulting curves are often called “survival curves” even when the outcome is not death. The most commonly employed method for estimating survival curves is the product-limit method [821]. In many other situations, a more natural descriptive summary of the data is the overall hazard or event rate. Methods for describing such data are discussed in more detail in Chapter 20. Censoring and Loss to Follow-Up Are Not the Same It is easy to confuse the concepts of censoring and lost to follow-up, but important to understand the distinction. Not all censored observations are lost to follow-up. The difference is readily apparent for the commonest type of censoring in a survival study, administrative censoring at closeout. Subjects alive (event free) when the trial closes have censored event times even though they may be very actively followed until the end. They may be returning for regular follow-up, have all required exams, and be readily contacted. Thus, they are in no sense lost to follow-up, but censored nonetheless. Censoring created by calendar termination of a study is not likely to be related to the event status of the subjects. Because subjects typically enter a trial over an extended accrual period, this type of censoring may occur for a sizable fraction of the study cohort. Lost to follow-up means that the event status of a subject cannot be determined, even with an active follow-up effort. This usually occurs much less frequently than administrative censoring. Subjects lost to follow-up during a trial are frequently analyzed as being censored at the time of last contact. This makes the implicit assumption that being lost to follow-up does not carry information about the event; for example, subjects
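As a small illustration of how the two data elements are used together, the following sketch computes a product-limit ("Kaplan-Meier") survival curve by hand from data laid out as in Table 5.6. The exposure times and indicators are the example values from the table; the implementation is a bare-bones illustration (it ignores conventions for tied event and censoring times) rather than a substitute for standard survival software:

```python
# Exposure times and censoring indicators (1 = event, 0 = censored),
# following the layout of Table 5.6.
times  = [141, 153, 124, 251, 160, 419, 257, 355]
events = [  0,   1,   0,   0,   1,   0,   1,   0]

def product_limit(times, events):
    """Return (time, survival probability) pairs at each observed event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    for t, d in data:
        if d == 1:
            # At each event time, multiply by the conditional survival 1 - 1/n.
            surv *= 1.0 - 1.0 / n_at_risk
            curve.append((t, surv))
        # Both events and censored subjects leave the risk set afterward.
        n_at_risk -= 1
    return curve

for t, s in product_limit(times, events):
    print(f"t = {t:>3}  S(t) = {s:.3f}")
```

Censored subjects never contribute a drop in the curve, but they do shrink the risk set, which is how the information in their follow-up time is used.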
are not lost to follow-up as a consequence of impending death. If there is an association between the risk of being lost to follow-up and the risk of the event, then it is not appropriate to analyze the data under the usual assumption of independent censoring. More generally, events that compete with one another (e.g., different causes of mortality in the same cohort) cannot be analyzed as though they were independent. An egregious confusion of administrative censoring with loss to follow-up happened in a meta-analysis of interferon treatment for multiple sclerosis (MS) [459]. One study included in the analysis [783] had substantial administrative censoring, because it was terminated before its planned conclusion when results of other similar trials became known. However, the meta-analysis investigators imputed worst-case values for the administratively censored subjects as part of a sensitivity analysis intended for lost to follow-up observations. Because the meta-analysis of interferon included relatively few studies, this method of handling censored data in a large trial may have clouded the results. Although attention was drawn to this error [1314], the meta-analysts stated: With respect to lost to follow-up, an authoritative definition is that these are patients who “become unavailable for examinations at some stage during the study” for any reason, including “clinical decisions . . . to stop the assigned interventions” [808]. Thus the patients were lost to follow-up [458].
The exact quote is from the discussion of attrition bias in Ref. [808], which reads: Loss to follow-up refers to patients becoming unavailable for examinations at some stage during the study period because they refuse to participate further (also called drop outs), cannot be contacted, or clinical decisions are made to stop the assigned interventions.
This informal definition is ambiguous in light of my discussion here, and is probably irrelevant for the MS meta-analysis. A clinical decision to stop or change treatment during the trial is invariably based on how the subject is doing. Therefore, it likely represents informative censoring and can contribute to the bias that Jüni et al. [808] discussed. Such subjects might be censored when treatment changes but are not necessarily lost to follow-up. To the contrary, their treatment might change because of findings at a follow-up visit. Administrative censoring is different because it occurs at the end of the study, affects all subjects equally, and is generally uninformative (i.e., will not produce a bias). Statistical methods for censored data were developed explicitly to avoid discarding incomplete observations or imputing arbitrary values. As indicated above, these observations are also not necessarily lost to follow-up. Thus, the methods applied in the MS meta-analysis were inappropriate.

One can imagine how problematic it would be to confuse the definitions of censoring and lost to follow-up in a prevention trial (or other setting with a high frequency of administrative censoring). In a low-risk population, most subjects will be event free and administratively censored at the close of the study. Analytic methods that treat the event-free subjects in a prevention trial as lost to follow-up with imputed worst-case values rather than as censored will produce erroneous results.
Survival and Disease Progression
Survival time is often taken to be a prototypical definitive outcome for clinical trials in serious chronic diseases. There is usually little ambiguity about a subject's vital status, and it is therefore very reliable. Prolonging survival is often a worthy goal. Furthermore, treatments that affect survival are very likely to have a fundamental biological action. Survival is a pure outcome; there is no sense in which it predicts anything. As a point of methodology, it is important therefore to understand survival as a clinical trial outcome.

There are a few potential limitations to using survival as an outcome. It may not be the most clinically relevant parameter for some settings. Symptoms or other patient-reported outcomes might be more appropriate measures of therapeutic efficacy, especially for diseases that are not life-threatening. Although reliable, survival may be somewhat inaccessible because of the long follow-up that is required. Because of censoring, survival is often estimated with lower precision than other outcomes, leading to larger trials. All these factors should be considered when selecting survival as the primary outcome for a trial. On balance, survival is seen as a relevant and definitive outcome for studies in many chronic diseases. This is especially true in cancer, where it is something of a gold standard for evaluating drugs [1187]. The regulatory process surrounding anticancer drugs has reinforced the idea that survival outcomes are paramount. However, if one looks carefully at the basis of approval of new anticancer drugs in recent years, it is evident that other outcomes are equally important [794].

Disease progression is a concept similar to survival and is often termed progression-free survival (PFS). PFS is a composite that includes death or advancing disease, and is typically used only when study participants are not rendered free of their disease by a baseline therapy. If ascertained reliably, it can be an early and useful signal of therapeutic activity. Problems with using this outcome are that it is often vaguely defined, inconsistently evaluated and interpreted, or based on infrequent or imprecise clinical evaluations. To be useful, PFS ascertainment should be based on active follow-ups required in the protocol, have relatively short intervals so time bias does not occur, and use reliable diagnostic evaluations. Even when rigorously done, the resolution afforded by the schedule of follow-up evaluations may not allow us to detect small treatment effects on progression reliably. Time-to-progression (TTP) is a related idea that censors death events unassociated with progression. TTP attempts to estimate the pure risk of disease progression, but can only do so if progression and death are independent. This is seldom the case. Disease-free survival (DFS) and time-to-recurrence are related outcomes applicable when trial participants are made free of disease by an initial therapy. DFS is then the interval from baseline to either death or reappearance of disease. Time-to-recurrence, like TTP, censors intervening death events.
Composite Outcomes Instead of Censoring
Investigators would prefer to count only events that relate specifically to the condition under study. There is often a firm clinical rationale for censoring events unrelated to the target disease. For example, in a large trial with a new treatment trying to prolong survival after cancer diagnosis, some subjects will invariably die of noncancer causes such as cardiovascular disease. It seems appropriate to censor these noncancer events rather than count them as deaths. However, this creates some serious unintended consequences. When two or more failure processes affect a population (competing risks), we cannot expect to obtain unbiased estimates of risk for a single cause by censoring all other events.
If the failure processes are not independent of one another, as is usually the case, events attributable to one cause contain some information about events of other types. This concern is often valid biologically. The reporting mechanisms for events can obscure the connections. For example, a myocardial infarction listed as the primary cause of death may be secondary to an advanced state of some other underlying disease, such as cancer. Therefore, the censored cardiovascular events are not independent of the cancer events, creating a potential for bias if independence is assumed. As a further example, death from injury, such as a motor vehicle accident, may be partly a consequence of the patient’s psychological state due to the underlying disease. Rather than assuming independence, the study could count deaths from any cause as a composite outcome. This all-cause or overall mortality is not subject to many biases and has a straightforward interpretation. Consequently, it is the preferred mortality outcome for many clinical trials. Waiting for Good Events Complicates Censoring In most event time studies the interval of interest is measured from a hopeful clinical landmark, such as treatment, to a bad event, such as disease progression or death. Our perspective on censoring is that events such as disease progression or death could be seen with additional follow-up. In a few studies, however, investigators measure the waiting time to a good event. Short event times are better than long ones and censoring could be a more difficult problem than it is in survival studies. For example, suppose that patients undergoing bone marrow transplantation for hematologic malignancies are observed to see how long it takes the new bone marrow to “recover” or become functional. The restoration of bone marrow function is a good outcome, but some patients may never recover fully or may die. Death from complications of not having a functioning bone marrow is more likely early rather than late in follow-up. In this circumstance death censors time to recovery, but the censoring paradigm is different than that discussed above. A short event time is good if it terminates with recovery but bad if it terminates with death (censoring). Long event times are not good but may be unlikely to end in death (censoring). This example illustrates the importance of understanding the relationship between the censoring process and the event process.
5.3.14 Longitudinal Measures
It is sometimes necessary in clinical trials to summarize outcomes measured repeatedly over time, particularly in chronic diseases. Examples include comparing control of blood pressure, anti-inflammatory therapy for arthritis, or treatments for degenerative diseases. A trial could be designed to use before- and after-treatment measures to assess differences. Because of the variability in individual measurements, a better strategy might be to monitor outcomes at intervals longitudinally. This also allows detection of fairly subtle time trends that could be important in understanding the disease or effects of treatment. One difficulty with such a scheme is using all of the information collected. Ideally, we would employ statistical techniques that simultaneously use all the longitudinal information collected, are robust to the inevitable missing data, and are flexible enough to permit valid inferences concerning a variety of questions. Repeated measures analyses, longitudinal linear models, and other methods of analyzing longitudinal data can be used.
However, complexity, higher cost of data collection, and administrative needs for trials with repeated assessments can make these designs harder to implement. Time to event outcomes, like survival, are longitudinal measures of a similar nature. But rather than recording measured values, the entire experience is summarized with a single event time and event indicator. Although censoring is the most distinctive aspect of event time outcomes, the basic design requiring active follow-up is the same as for other longitudinal outcomes. A survival time is typically singular and can often be captured correctly after the actual event. In contrast, many longitudinal measures need to be obtained in specific windows dictated by either the clinical setting or the statistical design. This adds to logistical difficulties and cost.
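As a simple illustration of the kind of longitudinal comparison discussed above, one robust approach is to reduce each subject's repeated measurements to a single slope and then compare slopes between arms. The sketch below uses entirely hypothetical data and is only one of many possible analyses; mixed-effects or other longitudinal models would use the data more fully:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
visits = np.array([0.0, 3.0, 6.0, 9.0, 12.0])   # months of follow-up

def simulate_subject(mean_slope):
    """Hypothetical outcome trajectory: baseline + linear trend + noise."""
    baseline = rng.normal(100.0, 10.0)
    return baseline + mean_slope * visits + rng.normal(0.0, 4.0, size=visits.size)

# Hypothetical arms: treatment slows the rate of decline.
control = [simulate_subject(mean_slope=-2.0) for _ in range(30)]
treated = [simulate_subject(mean_slope=-1.0) for _ in range(30)]

def subject_slope(y):
    """Least-squares slope of the outcome over visit time for one subject."""
    slope, _intercept = np.polyfit(visits, y, deg=1)
    return slope

slopes_c = [subject_slope(y) for y in control]
slopes_t = [subject_slope(y) for y in treated]

t_stat, p_value = stats.ttest_ind(slopes_t, slopes_c)
print(f"mean slope: treated {np.mean(slopes_t):.2f}, control {np.mean(slopes_c):.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```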
5.3.15 Central Review
Central review is a kind of augmented measurement whose purpose is to validate assessments made at local institutions in a multicenter trial. Central review can be applied to laboratory tests, images, pathologic findings, and clinical assessments such as toxicities. Using a single uniform method for key measures in a trial can remove an extraneous source of variation and might enhance validity. A centrally managed resource in a trial is justifiable only when the technology or expertise is scarce. If local technology is inadequate and error prone, there may be no alternative but to use a single valid source of expertise. Centralization then provides efficiency. But this methodology is implemented today even when technology or experts exist at every location in a multicenter trial.

Except for managing scarce technology, central review is unnecessary in randomized multicenter trials. It is not needed to validate eligibility, diagnosis, or outcomes. Nor is it required to remove an extraneous source of variability. Randomization and the principle of "equal slop" assure us that errors will not differentially affect treatment groups. Even if we imagine that errors are of different magnitude from center to center, the typical design of blocked strata removes this effect. Central review superimposed on other methods that obviate it creates inefficiency. There might be some minor justification for masked central review, such as for clinical outcomes, when trial sites are unmasked. But investigators' and reviewers' somewhat misplaced concerns regarding validity are the main motivation for central review.

5.3.16 Patient Reported Outcomes
Patient reported outcomes (PRO) are obvious sources of information on which to base therapeutic evaluations. Practitioners are accustomed to this because symptoms can be reliable indicators of disease and its mitigation. PRO are widely used in clinical trials, where impressions of their reliability vary. Pain is an example of an important PRO that has been extensively used in trials. The essential issues surrounding the use of PRO are the same as for any outcome, as listed in Table 5.2. Not all PRO perform well with regard to the needed characteristics, especially validity and objectivity, leading many investigators to disfavor them. But a given patient reported outcome can be studied and standardized the same as any laboratory measurement. The nature of communicating with sick and frightened patients may cause these outcomes to seem less rigorous than those reported by observers or machines, but PRO may often be more relevant.
Quality of life assessments are a special category of PRO that are broadly used in clinical trials as secondary outcomes, especially in chronic diseases like cancer. These outcomes attempt to capture psychosocial features of the patient’s condition, symptoms of the disease that may be distressing, and/or functional (physical) status. See Ref. [1468] for a review, and Refs. [259, 260] for a discussion of statistical issues. Fairclough also discusses some design problems [438]. It is easy to imagine situations where the underlying disease is equally well controlled by either of two treatments, as measured by objective clinical criteria, but the quality of life might be superior with one treatment. This could be a consequence of the severity and/or nature of the side effects, duration of treatment, or long-term consequences of the therapy. Thus, there is also a strong rationale for using quality of life assessments as primary outcomes in the appropriate setting. Quality of life assessments are often made by summarizing items from a questionnaire (instrument) using a numerical score. Individual responses or assessments on the quality of life instrument might be summed, for example, for an overall score. In other circumstances, quality of life assessments can be used as “utility coefficients” to weight or adjust other outcomes. This has been suggested in cancer investigations to create “quality-adjusted” survival times. Usually, these quality-adjusted analyses require subjective judgments on the part of investigators in assigning utilities to various outcomes or conditions of the subject. For example, one could discount survival after chemotherapy for cancer by subtracting time with severe side effects from the overall survival time. This corresponds to a utility of zero for time spent in that condition. Because of the subjectivity required, quality-adjusted measurements often are not considered as reliable or rigorous as more objective measurements by many investigators. A perpetual issue with quality of life is construct validity: Does the measurement instrument actually measure “quality of life”, or can it even be defined? Overall quality of life is usually explicitly defined as a sum of subscale measurements and there is no guarantee that a given treatment will have an impact on it. Furthermore, how to measure or define quality of life appears to depend on the disease, and to a certain extent, on the treatment. Finally, longitudinally measured quality of life scores are subject to informative missingness because sick or dead subjects do not yield measurements. Although there are approaches to fixing these problems, it has proved challenging to use quality of life as a primary outcome in clinical trials. There is an additional problem with quality of life and other PRO in longitudinal studies of progressive chronic diseases. As a disease advances, patients may become sicker and less willing to report subjective outcomes that require a task such as an extensive questionnaire. Thus, the missing data in studies that use PRO may be informative because of progressive disease or survival. This creates an informative censoring bias that can make the average quality of life in a study cohort appear to improve as the subjects feel worse. This effect can limit the usefulness of such outcomes.
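The quality-adjusted survival idea mentioned above can be made concrete with a toy calculation. The sketch below assigns an illustrative utility of zero to time spent with severe side effects; the times and utilities are hypothetical, and established approaches such as Q-TWiST impose considerably more structure than this:

```python
# Hypothetical subjects: overall survival in months and months spent
# with severe treatment-related side effects (utility assumed = 0).
subjects = [
    {"survival": 24.0, "months_severe_toxicity": 3.0},
    {"survival": 18.0, "months_severe_toxicity": 0.5},
    {"survival": 30.0, "months_severe_toxicity": 6.0},
]

for i, s in enumerate(subjects, start=1):
    # Quality-adjusted survival: discount time in the severe-toxicity state.
    qas = s["survival"] - s["months_severe_toxicity"]
    print(f"subject {i}: survival {s['survival']:.1f} mo, "
          f"quality-adjusted {qas:.1f} mo")
```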
5.4 SURROGATE OUTCOMES
A surrogate outcome is one that is measured in place of the biologically definitive or clinically most meaningful outcome. Typically, a definitive outcome measures clinical benefit, whereas a surrogate outcome tracks the progress or extent of the disease. A good surrogate outcome needs to be convincingly associated with a definitive clinical outcome so that it can be used as a reliable replacement. Investigators choose a surrogate when the
definitive outcome is inaccessible due to cost, time, or difficulty of measurement [708]. The difficulty in employing surrogate outcomes is more a question of their validity or strength of association with definitive outcomes than trouble designing, executing, or analyzing trials that use them. Surrogate outcomes are sometimes called surrogate endpoints, surrogate markers, intermediate, or replacement endpoints. The term “surrogate” may be the best overall descriptor and I will use it here. Some authors have distinguished auxiliary from surrogate outcomes [487, 492]. Auxiliary outcomes are those used to strengthen the analysis of definitive outcome data when the latter are weak because of a lack of events. Such outcomes can be used statistically to recover some of the information that is missing because of unobserved events. Auxiliary outcomes may be measurements such as biomarkers or other manifestations of the disease process. Also, intermediate outcomes can be distinguished from surrogates, particularly in the context of cancer prevention [526, 527, 1336]. Prentice offered a rigorous definition of a surrogate outcome as a response variable for which a test of the null hypothesis of no relationship to the treatment groups under comparison is also a valid test of the corresponding null hypothesis based on the true endpoint [1232].
An outcome meeting this definition will be a good surrogate for the definitive outcome. A measurement that is merely correlated with outcome will not be a useful surrogate, unless it also reflects the effects of treatment on the definitive outcome. Surrogates may exist for efficacy, but there can be no convincing surrogates for safety. It is sometimes useful to distinguish types or levels of surrogates. One type of surrogate is a predictor of disease severity, and it may be useful independently of treatment. In other contexts, these might be seen as prognostic factors. They may be partially outcomes and partially predictors. Examples are measures of extent or severity of disease, such as staging in cancer. Not all prognostic factors are surrogates in this sense because many do not change with the disease (e.g., sex). A second type of surrogate is a disease marker that responds to therapy with a frequency and magnitude that relates to the efficacy of the treatment. Blood pressure, cholesterol level, and PSA are examples of this. These surrogates are likely to be useful in middle developmental trials to demonstrate the activity of the treatment and motivate comparative trials with definitive outcomes. Glycosylated hemoglobin in diabetes is also an example of a surrogate that marks disease control and may also be prognostic. The third type of surrogate is the ideal one—a marker that captures the clinical benefit of the treatment and is informative regarding definitive outcomes. Such a marker would be useful for establishing comparative efficacy. These kinds of surrogates are the ones we wish we had routinely. They are difficult or impossible to find and require strong validation. Intuitively appealing surrogates like tumor burden in cancer do not necessarily accurately reflect the effect of treatment on definitive outcomes like survival. In contrast, measured viral load in HIV appears to reflect impact on survival well, and very useful treatments have been developed using it. Many clinically or biologically relevant factors are proposed as surrogate outcomes. The potential list is enormous, especially nowadays with the central role that both biomarkers and imaging play in therapeutic development. Unfortunately, relevance does not guarantee the validity of a proposed surrogate. Nor does accessibility or safety. Some
TABLE 5.7 Examples of Surrogate Endpoints Frequently Used in Clinical Trials

Disease                   Definitive Endpoint       Surrogate Endpoint
HIV infection             AIDS (or death)           Viral load
Cancer                    Mortality                 Tumor shrinkage
Colon cancer              Disease progression       CEA level
Prostate cancer           Disease progression       PSA level
Cardiovascular disease    Hemorrhagic stroke        Blood pressure
                          Myocardial infarction     Cholesterol level
Glaucoma                  Vision loss               Intraocular pressure
measures can have great overall utility for other purposes without being valid surrogates. The essential point is that a surrogate must be validated empirically and no amount of biological theory can substitute for that.

5.4.1 Surrogate Outcomes Are Disease Specific
Surrogate outcomes are disease specific because they depend on the mechanism of action of the treatment under investigation. A universally valid surrogate for a disease probably cannot be found. Some examples of surrogate-definitive outcome pairs are listed in Table 5.7. Trialists are interested in surrogate outcomes like these because of their potential to shorten, simplify, and economize clinical studies. The potential gain is greatest in chronic diseases, where both the opportunity to observe surrogates and the benefit of doing so are high. However, surrogate outcomes are nearly always accompanied by questions about their validity. Trials with surrogate outcomes usually require independent verification, but success at this does not establish the validity of the surrogate. Some important characteristics of surrogate outcomes can be inferred from Table 5.7. First, a good surrogate can be measured relatively simply and without invasive procedures. Second, a surrogate that is strongly associated with a definitive outcome will likely be part of, or close to, the causal pathway for the true outcome. In other words, the surrogate should be justified on biologically mechanistic grounds. Cholesterol level is an example of this because it fits into the model of disease progression: high cholesterol ⇒ atherosclerosis ⇒ myocardial infarction ⇒ death. This is in contrast to a surrogate like prostatic specific antigen (PSA), which is a reliable marker of tumor burden but is not in the chain of causation. We might say that cholesterol is a direct, and PSA is an indirect, surrogate. However, because of temporal effects, PSA rather than cholesterol may be more strongly associated with a definitive disease state. Third, we would expect a good surrogate outcome to yield the same inference as the definitive outcome. This implies a strong statistical association, even though the definitive outcome may occur less frequently than the surrogate. Several authors have pointed out that this statistical association is not a sufficient criterion for a surrogate to be useful [167]. Fourth, we would like the surrogate to have a short latency with respect to the natural history of the disease. Finally, a good surrogate should be responsive to the effects of treatment.
Cancer

In testing treatments for cancer prevention, surrogate outcomes are frequently proposed because of the long latency period for the disease. Also, we are most interested in applying preventive agents or measures to a population of patients without disease. Even populations at high risk may have only a small fraction of people developing cancer each year. These factors inhibit our ability to observe definitive events such as new cases of cancer or deaths attributed to cancer. This situation provides strong motivation for using surrogate outcomes like biomarkers, provided they are valid [166, 830].

In studies of cancer therapy, investigators also find many reasons to be interested in surrogate outcomes [417]. When the interval between treatment and a definitive outcome is long, there is an opportunity for intercurrent events to confuse the assessment of outcomes. An example of this type of event is when the subject receives additional active treatment during the follow-up period. Deaths due to causes other than the disease under investigation can also confuse outcome assessment.

Tumor size reduction (tumor response) is used as a definitive outcome and proposed as a surrogate outcome in cancer clinical trials. In middle development, measurable tumor response is often taken as evidence that the therapy is active against the disease. The degree of such activity determines whether or not the treatment is recommended for continued testing in comparative trials. In comparative trials, tumor response is sometimes proposed as a surrogate for improved survival or longer disease-free survival. However, the association between response and definitive event times is weak for most cancers. Consequently, tumor shrinkage (response) should not generally be used as the primary outcome variable in comparative trials. The occasional exception might be for conditions in which reduction in tumor size provides a clinically important improvement in quality of life or reduced risk of complications.

There are situations where evidence from studies employing surrogate outcomes is considered strong enough to establish convincing safety and efficacy, and consequently regulatory approval of some treatments for cancer. The clinical context in which this occurs is at least as important as the nature of the study and outcome measure. Recent examples include trastuzumab for treatment of metastatic breast cancer in 1998 using a surrogate marker in 222 subjects, leuprolide for palliation of advanced prostate cancer in 2003 using a surrogate marker in 140 subjects, and temozolomide for treatment of recurrent glioblastoma in 1999 using response rates in 162 subjects. There are many other examples, but these illustrate the balance of circumstance, treatment options, safety, and evidence of benefit inherent in such decisions.

"Cure" or "remission" are stronger types of tumor response that have been used as surrogate outcomes in some cancer trials. For example, permanent or long-term tumor shrinkage below the level of clinical detectability might be labeled a remission or cure, as is commonly the case in studies of childhood hematologic malignancies. There are ample data to support the strong association between this type of outcome and survival in these diseases. The connection is further supported by evidence that the failure rate diminishes to near zero or that the survival curve has a plateau in subjects achieving remission, providing long-term survival for them.
In other cancers a disease-free interval of, say, 5 years after disease disappearance is often labeled as a “cure,” but the failure (recurrence) rate may not be near zero. Thus, tumor response is not a uniformly good surrogate outcome for all types of cancer.
Some cancer biomarkers have been considered reliable enough to serve as surrogate outcomes. Two well-known ones are prostatic specific antigen (PSA) and carcinoembryonic antigen (CEA), produced by some gastrointestinal malignancies. These markers are particularly useful for following disease status after treatment, when an elevation is strongly associated with recurrence. The usefulness of these and other biomarkers as surrogate outcomes remains to be established.

Cardiovascular Diseases

Studies of cardiovascular diseases present opportunities to use potentially valid surrogate outcomes for definitive outcomes like mortality [1574]. This is possible because we understand the mechanisms leading to many cardiovascular events fairly well and can measure entities in the causal path. For example, elevated blood pressure, serum cholesterol, left ventricular ejection fraction, and coronary artery patency are in the chain of events contributing to myocardial infarction and can be measured quantitatively. Sometimes these surrogates, or risk factors, are used as the primary outcomes in a trial, while in other cases they are secondary. Because of the strong mechanistic connection between some cardiovascular surrogates and definitive outcomes, they may be used more effectively in trials in this setting than surrogates for cancer outcomes. However, interventions that modulate risk factors do not necessarily change definitive outcomes. If treatment modifies the risk factor through mechanisms unrelated to action on the definitive outcome, we can be misled by a study using the factor as a surrogate outcome.

HIV Infection

In patients with HIV infection, CD4 positive lymphocyte count is a widely discussed candidate for a surrogate outcome for clinical AIDS and death. Unfortunately, the available data suggest that CD4 count is not reliable enough to serve as a valid surrogate outcome. See Ref. [492] for a review of this point. AIDS, by the Centers for Disease Control clinical criteria, could be considered a surrogate for death because of the severely compromised immune system that it implies. Other clinically valid measures of immune function such as P-24 antigen levels and plasma HIV viral load have been successfully used as surrogate outcomes. Like middle development cancer trials that use tumor response as an outcome despite its poor utility as a definitive outcome, developmental trials in AIDS may be able to use measures of immune function to evaluate the potential benefit of new treatments.

Eye Diseases

In trials studying diseases of the eye, Hillis and Seigel [720] discuss some possible surrogate outcomes. One example is retinal vein occlusion, which can lead to loss of vision. Hypertensive vascular changes are a precursor to vein occlusion and can be observed noninvasively. However, in an eye affected by vein occlusion, the blood vessel changes may not be observable because of tissue damage. Observations in the opposite eye may be useful as a surrogate for the affected eye. Thus, the opposite eye is a surrogate for observing the state of the retinal vessels, which is a possible surrogate for vein occlusion. In this situation the eye least affected by hypertensive vascular changes is likely to be used as a surrogate, leading to a biased underestimate of the relationship between the surrogate and the definitive outcome.
A second example in eye diseases discussed by Hillis and Seigel [720] is the use of intraocular pressure as a surrogate for long-term visual function in patients with glaucoma. The validity of this surrogate depends on certainty that the elevated pressure is a cause of optic nerve damage in glaucoma patients, that the pressure can be determined reliably with the type of measurements commonly used, and that lower intraocular pressure will result in better vision in the long term. Many recent trials of glaucoma therapy have followed this reasoning, correct or not, and used intraocular pressure as a surrogate outcome.
5.4.2 Surrogate Outcomes Can Make Trials More Efficient
Most clinical trials require an extended period of accrual and observation for each subject after treatment. This is especially true of comparative studies with event time as a primary outcome. Disease prevention studies, where event rates are low and event times are long because the study population is relatively healthy, are even more lengthy than most treatment trials. It can be impractical or very expensive to conduct studies that take such a long time to complete. Good surrogate outcomes can shorten such clinical trials, which explains why they are of particular interest in prevention trials. However, to be useful, a surrogate outcome needs to become manifest relatively early in the course of follow-up. A simple example will illustrate the potential gain in efficiency using surrogate outcomes. Suppose that we wish to test the benefit of a new antihypertensive agent against standard therapy in a randomized trial. Survival is a definitive outcome and blood pressure is a surrogate. If it were practical and ethical to follow subjects long enough to observe overall mortality, such a trial would need to be large. For example, using calculations discussed in detail in Chapter 16, the difference between 95 and 90% overall mortality at 5 years requires 1162 subjects to detect as statistically significant with 90% power and a two-sided 0.05 𝛼-level test. By contrast, if we use diastolic blood pressure as the outcome, we could detect a reduction of as little as 1∕2 of a standard deviation using 170 subjects with a trial duration of a few weeks or months. Larger reductions in blood pressure could be detected reliably using fewer subjects. This hypothetical example is not meant to equate this benefit in mortality with this degree of blood pressure reduction. The benefit from such a small reduction is probably smaller. It does illustrate the potential difference in the scope of trials using surrogate outcomes. In some instances trials using surrogate outcomes may provide a clearer picture of the effects of treatment than those employing the “definitive” outcome. For example, suppose that we are studying the effects of a treatment on the prevention of coronary occlusion in subjects with high risk for this event. Clinicians react to these significant and morbid events when they occur by attempting to restore blood flow to the heart muscle using drugs, surgery, or other interventions and often make modifications in other aspects of the patient’s treatment. Thus coronary occlusion is an important clinical milestone and can be used as a basis for establishing the efficacy of treatments. A trial that used a “definitive” outcome such as death can present a somewhat confusing picture of treatment efficacy. Some patients will live a long time after the first coronary occlusion, allowing noncardiac complications to intervene. Also the patient may change lifestyle or therapy after a coronary occlusion and there may be comorbidities that influence the
course of treatment. It may be difficult to describe or account for the effects of these changes on the natural history of the disease. A similar situation is not uncommon in cancer, where effective second-line treatments (if they exist) will be chosen to intervene if patients start to do poorly after primary therapy. A trial with survival as a definitive outcome may not yield the correct treatment effect in such a case because the outcome will be confounded by the secondary therapy. A better outcome to evaluate the primary treatment might be time to treatment failure. Overall survival would essentially be an unsatisfactory surrogate for this. In other cases the ethical acceptability of a trial can be enhanced by using surrogate outcomes. For example, suppose that the definitive outcome becomes apparent only after a long period of follow-up. A rigorous trial design comparing two therapies would likely require control of ancillary treatments administered during the follow-up period. However, restrictions on such treatments may be ethically problematic. By contrast, a valid surrogate outcome that can be measured early in the post-treatment period allows the comparison of interest to proceed and permits physicians to respond with fewer constraints to changes in the patients’ clinical status.
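Returning to the hypothetical antihypertensive comparison above, the contrast in study size can be reproduced approximately with the usual normal-approximation sample size formulas. The sketch below is a rough illustration, not the exact calculation of Chapter 16, so the figures of 1162 and 170 subjects quoted earlier may differ slightly from the numbers produced here depending on the variance convention used.

# Approximate sample sizes illustrating the efficiency of a surrogate outcome.
# Simple normal-approximation formulas are assumed here, not the book's exact method.
from scipy.stats import norm

alpha, power = 0.05, 0.90
z_a = norm.ppf(1 - alpha / 2)   # two-sided critical value
z_b = norm.ppf(power)

# Definitive outcome: compare 5-year mortality of 95% versus 90%.
p1, p2 = 0.95, 0.90
n_per_group = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(round(2 * n_per_group))   # about 1156 in total, close to the 1162 quoted above

# Surrogate outcome: detect a reduction of 0.5 standard deviations in blood pressure.
delta = 0.5
m_per_group = 2 * (z_a + z_b) ** 2 / delta ** 2
print(round(2 * m_per_group))   # about 168 in total, close to the 170 quoted above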
5.4.3 Surrogate Outcomes Have Significant Limitations
Although surrogate outcomes are used frequently for developmental trials and are occasionally helpful for comparative studies, they often have serious limitations. Sources of difficulty include determining the validity of the surrogate, coping with missing data, having the eligibility criteria depend on the surrogate measurement, and the fact that trials using these outcomes may be too small to reliably inform us about uncommon but important events. Of these, the validity of the surrogate outcome is of most concern. For a review of surrogate outcomes and their limitations in various diseases, see Ref. [490]. Many surrogates are proposed because they appear to represent the biological state of disease, and observational data suggest that they are convincingly associated with the definitive outcome. In this setting it seems counterintuitive that a treatment effect on the surrogate could not yield clinical benefit. The problem is that treatment effects on the definitive outcomes may not be predicted accurately by treatment effects on the surrogate. This can occur for two reasons. First is the imperfect association between the surrogate and the true outcome, which may not reflect the effects of treatment. Second is the possibility that treatment affects the true outcome through a mechanism that does not involve the surrogate. Some explanations are suggested by Figure 5.3. In Ia, the treatment affects the surrogate, but there is a second pathway by which the definitive outcome is produced by the disease. In IIb, the surrogate does indeed lie on a causal pathway for the outcome, but there is a second pathway that could invalidate the treatment effect at b. In IIc, the treatment will affect the true outcome, but the surrogate will not capture the effect. IIId appears to be an ideal situation, but difficulties can still arise. The action on the surrogate may produce unanticipated effects on the true outcome. For example, the direction of the effect could be reversed. The time interval between the effect on the surrogate and the final outcome may be so long as to dampen the utility of the surrogate measurement. In
FIGURE 5.3 Possible relationships between surrogate outcomes, definitive outcomes, and treatment effects. Pathway I shows the disease independently producing the surrogate outcome and the true outcome. Pathway II shows the surrogate outcome as a partial intermediate, and pathway III shows a surrogate outcome as a true intermediate. Pathway IV shows a surrogate produced as the consequence of the definitive outcome.
IV, the surrogate might be the product of the definitive outcome rather than vice versa. As a result, the surrogate might or might not be valid.

Some Particular Problems

A problem in the development of cancer treatments arises when trying to evaluate cytostatic (rather than cytotoxic) drugs. A traditional outcome for middle development trials has been response or tumor shrinkage, which, although a surrogate for benefit, is objective and widely understood. Some investigators have proposed new "clinical benefit" response criteria that could be useful for cytostatic drugs [1289]. (See Refs. [576, 1509] for discussions.) However, new outcomes will require validation with respect to disease status, treatment effects, and existing outcomes before being used to evaluate, and possibly discard or accept, new treatments.

In cardiovascular diseases there are sobering accounts of the use of surrogate outcomes in clinical trials. One example is the history of developing treatments for cardiac arrhythmias, which began with promising development trials [223, 224]. This led to the Cardiac Arrhythmia Suppression Trial (CAST), which was a randomized, placebo-controlled, double-masked treatment trial with three drug arms [50, 225, 1317]. The drugs employed, encainide and flecainide, appeared to reduce arrhythmias, and therefore were promising treatments for reducing sudden death and total mortality. After a planned interim analysis, CAST was stopped early in two arms because of a convincing increase in sudden deaths on the treatments. Control of arrhythmia, although seemingly justified biologically, is not a valid surrogate outcome for mortality. A similar experience was reported later with the anti-arrhythmic moricizine [226, 270].

A randomized placebo-controlled trial of milrinone, a phosphodiesterase inhibitor used as a positive inotropic agent in chronic heart failure, in 1088 subjects also showed benefit on surrogate outcomes (measures of hemodynamic action) with increased long-
term morbidity and mortality [1169]. A similar failure of surrogate outcomes was seen in a randomized trial of the vasodilator flosequinan, compared with captopril in 209 subjects with moderate to severe chronic heart failure [313]. Flosequinan had similar long-term efficacy and mortality compared with captopril, but a higher incidence of adverse events. Another example that does not speak well for the use of surrogate outcomes is the randomized study of fluoride treatment on fractures in 202 women with osteoporosis [1272]. Although bone mass was increased with fluoride therapy, the number of nonvertebral fractures was higher. Measures such as bone mass and bone mineral density are not valid surrogates for the definitive clinical outcomes. A surrogate outcome in a specific disease may be useful for some purposes but not others. For example, surrogate outcomes in drug development may be appropriate for verifying the action of a drug under new manufacturing procedures or for a new formulation of an existing drug. Their use for a new drug in the same class as an existing drug may be questionable and their use for testing a new class of drugs may be inappropriate. For illnesses in which patients have a short life expectancy (e.g., advanced cancer and AIDS), it may be worthwhile to use treatments that improve surrogate outcomes, at least until efficacy can be verified definitively in earlier stages of disease.
5.5 SUMMARY
Each trial setting is unique with respect to the efficiency and practicality of various outcome measurements. Investigators may also need to consider the cost efficiency of various outcomes and choose a feasible one or allocate resources appropriately. A discussion from this perspective is given by Terrin [1467]. The primary statistical objective of a clinical trial is usually to estimate some clinically important quantity, for example, a treatment effect or difference. There may be numerous secondary objectives employing different outcomes, but the properties of the trial can usually only be controlled for one (primary) objective. Some trials are not designed to provide unbiased estimates of treatment effects, but instead select the “best” treatment from among several being tested. Different scales of measurement may be required depending on the outcome being used. Scales of measurement include nominal or categorical, ordered, interval (for estimating differences), and ratio. Aside from scales of measurements, outcomes can be classified as measures, categorical, counts, and event times. Event times are widely used outcomes for clinical trials, especially in chronic diseases like cancer, cardiovascular disease, and AIDS. Special statistical methods are required to cope with censored event times, which are frequently present in clinical trial data. Trial methodologists continue to examine and debate the merits of surrogate outcomes, essentially on a study-by-study basis. Plausible surrogate outcomes have been proposed or used in cancer (e.g., tumor size), AIDS (e.g., viral load), cardiovascular disease (e.g., blood pressure), and other disease trials. These types of outcomes can potentially shorten and increase the efficiency of trials. However, they may be imprecisely associated with definitive outcomes such as survival and can, therefore, yield misleading results.
5.6 QUESTIONS FOR DISCUSSION
1. Rank the different scales of measurement in order of their efficiency in using the available information. Discuss the pros and cons of each.

2. Repeated measurements on the same study subjects increase precision. Because repeated measurements are correlated with one another, adding new study subjects may increase precision more. Discuss the merits and weaknesses of each approach.

3. Surrogate or intermediate outcomes are frequently used in prevention trials. Discuss reasons why they may be more appropriate in studies of disease prevention than in studies of treatment.

4. Read and comment on the study by the Chronic Granulomatous Disease Cooperative Study Group [764].
6 RANDOM ERROR AND BIAS
6.1 INTRODUCTION
Error has two components, a purely random one and a systematic one called bias. Understanding the differences between randomness and bias, and the sources of each, is the first step in being able to control them using experiment design. The terms random error, random variability, or just variability are often used to describe the play of chance, particularly when the effects of explanatory or predictive factors have already been taken into account. An operational definition of randomness might be unexplainable fluctuations, that is, fluctuations that remain beyond our ability to attribute them to specific causes. Bias means systematic error: deviations that are not a consequence of chance alone. Bias can arise from numerous sources, such as selection effects, uncontrolled prognostic factors, procedural flaws, and statistical methods, and also from perceptual errors, attitudes, and beliefs. These last three sources are often collectively called observer bias. The exact consequences of bias may be difficult to know in advance, but it is usually simple to understand factors that can contribute to it. Strong sources of bias can often be anticipated well enough to be controlled. In clinical trials we tend to focus on bias as much as random variation as a source of error because many biases such as selection effects are strong relative to the size of treatment effects. Errors in experiments can be discussed in either a purely statistical context or in a larger clinical one. Although mathematical formalism is required to discuss the statistical context, random error and bias are conceptually the same in a clinical context. Most statistical discussions of random error and bias assume that a primary focus of the investigation is hypothesis testing. This framework is convenient and will be used here. However, this is not intended to be a general endorsement of hypothesis tests as the
preferred method for making inferences. Estimates of treatment effect and confidence intervals can be similarly affected by random error and bias.

6.1.1 The Effects of Random and Systematic Errors Are Distinct
Pure random error has no preferred direction. Statistically, we expect its net or average effect to be zero. Clinically, we can always make its relative effect inconsequentially small by averaging over a large number of observations or a long enough period. This does not mean that chance will not affect a particular observation, but that randomness averages out to have a small relative effect in the long run. Replication, that is, increasing the number of observations or repeating the experiment, is the only tool that can reduce the magnitude of random error. Because of sampling variability, subject-to-subject differences, measurement error, or other sources of noise, we can never completely eliminate random error. However, in most experiments it can be controlled and reduced to acceptably low levels by careful attention to the experiment design, particularly sample size.

Bias, unlike random fluctuation, is a component of error that has a net direction and magnitude. In a purely statistical context, bias can sometimes be quantified and may be quite small. In a clinical context, bias has four deadly characteristics: it can arise in numerous, diverse, but common ways; its direction or magnitude cannot be predicted; it can easily be as large as the treatment effect of interest; and it cannot be corrected by replication. As a result, bias can dominate the guesses of practitioners making unstructured treatment evaluations, or invalidate poorly planned or conducted treatment comparisons. The factors that produce bias may not be amenable to quantification but can usually be removed or reduced by good design and conduct of the experiment.

The difference between random variation and bias, and the importance of distinguishing them, is analogous to sound or signal reproduction. In this analogy the treatment effect is the sound level or signal strength, variation is like background noise, and bias is distortion. Background noise (variation) is important when the sound or signal level (treatment effect) is low. Reducing variation is like increasing the signal-to-noise ratio. In contrast, even when the noise is low, the sound produced may be distorted unless the system is designed and performing properly. Even when the variation is low, the treatment effect may be biased unless the trial is designed and conducted properly. Having a strong signal is not sufficient if there is too much distortion.

Example

Suppose that we have two different estimators (or methods of estimation) of a treatment effect, Δ, one unbiased and the other biased (Fig. 6.1). Because of sampling variation, repeated use of either estimate would produce a distribution of values. One estimator, Δ̂, has random error but no bias, and the distribution of values we would obtain is centered around the true value, Δ = 0. The second estimator, Δ̃, has both random error and bias, and its distribution is not centered around the true value. Many times Δ̃ yields answers similar to Δ̂, but on average, Δ̃ does not give the true value.

In practice, the investigator does not see a full distribution of values for any estimate because the experiment is performed only once. Thus, the actual estimate obtained for Δ is a single sample from a distribution. Because both Δ̂ and Δ̃ in this example are subject to randomness, either can yield a value that is far enough away from the true value
FIGURE 6.1 Hypothetical sampling distributions of biased and unbiased estimators. 𝜃 represents the true effect.

and lead us to conclude Δ ≠ 0 (e.g., after a hypothesis test). Despite the bias, Δ̃ can be closer on average to the true treatment effect than Δ̂. Sometimes the overall performance (bias + random error) of a biased estimator can be better than an unbiased one.

6.1.2 Hypothesis Tests versus Significance Tests
Hypothesis testing has had a prominent role in developing modern ideas about the design and analysis of clinical trials and still provides a useful perspective on errors of inference. Following ideas put forth by Neyman and Pearson [917, 918, 1117], hypothesis testing is an approach for choosing between two competing statistical hypotheses. Let the competing hypotheses be labeled 𝐻0 and 𝐻𝑎, and denote a summary of the data (or statistic) by 𝑇. The hypothesis testing approach requires that we define a critical region in advance of taking data values, and then choose 𝐻0 or 𝐻𝑎, depending on whether or not 𝑇 falls within the critical region. In practice, investigators seldom employ hypothesis testing in exactly this way. A different procedure, called significance testing [317], is used more commonly. A nonmathematical comparison of the two procedures is given by Salsburg [1328]. Assume that the probability distribution of the test statistic, 𝑇, is known or can be approximated when 𝐻0 is true. The more extreme the actual value of 𝑇, relative to this distribution, the less likely 𝐻0 is to be true. The significance level is 𝑝 = Pr{𝑇∗ ≥ 𝑇 ∣ 𝐻0}, where 𝑇 is the value of the statistic based on the observed data and 𝑇∗ denotes the value that would be obtained in a hypothetical replication of the experiment. Thus, this "test" yields a significance level (or p-value) instead of a decision. The p-value is intended to help the investigator assess the strength of evidence for or against 𝐻0. In reality, the p-value does not have this nice interpretation. The deficiencies of this approach will be discussed in Chapter 20.
TABLE 6.1 Random Errors from Hypothesis Tests and Other Types of Dichotomized Inferences

                          Truth and Consequences
Result of Test            𝐻0 True            𝐻0 False
Reject 𝐻0                 Type I error       No error
Don’t reject 𝐻0           No error           Type II error
The basic structures of hypothesis and significance tests are the same, as are their properties with respect to random error. This is discussed in the next section.

6.1.3 Hypothesis Tests Are Subject to Two Types of Random Error
The two types of random error that can result from a formal hypothesis test are shown in Table 6.1. The type I error is a false positive result and occurs if there is no treatment effect or difference but the investigator wrongly concludes that there is. The type II error is a false negative and occurs when investigators fail to detect a treatment effect or difference that is actually present. The power of the test is the chance of declaring a treatment effect or difference of a given size to be statistically significantly different from the null hypothesis value when the alternative hypothesis is true, that is, the probability of not making a type II error.

These ideas are illustrated in Figure 6.2, which shows the distributions (assumed to be normal) of a treatment effect estimator, Δ̂, under both the null hypothesis (Δ0 = 0) and an alternative hypothesis (Δ𝑎 ≠ 0). The probability distributions for Δ̂ arise from sampling variability. The short vertical lines represent the critical value chosen to calibrate the type I error to reject the null hypothesis. For example, the critical value could be ±1.96 standard deviations from the null hypothesis mean, which would yield a two-sided type I error of 5%. The following discussion will focus on only the upper critical value, although both have to be considered for a two-sided test.

If the null hypothesis is true, Δ̂ will be a value from the distribution centered at Δ0. If Δ̂ exceeds the critical value, the experimenter would find it improbable that it came from the distribution centered at Δ0 and would reject the null hypothesis. This would constitute a type I error if the null hypothesis were true. If Δ̂ does not exceed the critical value, the experimenter would not reject the null hypothesis.

If the alternative hypothesis is true, Δ̂ would be a value from the distribution centered at Δ𝑎. As before, the experimenter will reject the null hypothesis if Δ̂ is greater than the critical value. However, if Δ̂ is less than the critical value, the experimenter will not reject the null hypothesis, resulting in a type II error. When the alternative hypothesis is true, one would like to reject the null a large fraction of the time, that is, have a test with high power.

The properties of the hypothesis test are determined by the alternative hypothesis and the type I error level. If the alternative hypothesis is taken to be far away from the null relative to sampling variability, the test will have a high power. However, alternatives far away from the null may be clinically unrealistic or uninteresting. Also, the consequences of type I and type II errors are quite different. Aside from the fact that
FIGURE 6.2 Sampling distributions of a treatment effect estimate under null (dashed line) and alternative (solid line) hypotheses. The critical value for testing the null hypothesis is denoted by c.
they are specified under different assumptions about the true state of nature, they require different maneuvers to control them. These are discussed in the following sections.
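To make the description of Figure 6.2 concrete, the short sketch below computes the type I error and power implied by a single upper critical value placed 1.96 standard errors above the null mean. The standard error and the alternative effect size are arbitrary values chosen only for illustration.

# Illustration of type I error and power for an upper rejection region.
# The standard error and alternative effect size are arbitrary choices.
from scipy.stats import norm

se = 1.0            # standard error of the estimated treatment effect
delta_null = 0.0    # null hypothesis value
delta_alt = 3.0     # assumed alternative treatment effect
c = delta_null + 1.96 * se   # upper critical value

type_I = 1 - norm.cdf(c, loc=delta_null, scale=se)   # about 0.025 (upper tail only)
power = 1 - norm.cdf(c, loc=delta_alt, scale=se)     # chance of rejecting under the alternative
type_II = 1 - power

print(round(type_I, 3), round(power, 3), round(type_II, 3))   # 0.025 0.851 0.149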
6.1.4 Type I Errors Are Relatively Easy to Control
Usually there is only one factor that governs the chance of making a type I error, that is, the critical value of the test. The experimentalist is free to set the critical value for the type I error at any desired level, controlling it even during the analysis of an experiment. The type I error rate does not depend on the size of the experiment. For example, in Figure 6.2, the type I error can be reduced by moving the critical value (short vertical line) to the right.

There are some circumstances in which the investigator must consider more than just the critical value to control the type I error. This error can become inflated when multiple tests are performed. This is likely to happen in three situations. The first is when investigators examine accumulating data and repeatedly perform statistical tests, as is done in sequential or group sequential monitoring of clinical trials. A second situation in which the type I error can become inflated occurs when many outcomes or treatment groups are examined using multiple hypothesis tests. A third circumstance arises when multiple subsets, interactions, or other exploratory analyses are performed. Although the error of each test can be controlled in the manner outlined above, the overall or experimentwide probability of an error will increase. Some corrections during analysis are possible, but the overall error rate should be carefully considered during the design of the trial. This point will arise again in the discussion of sequential methods in Chapter 18. Inflation of the type I error is also discussed in Section 20.8.2.
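The inflation from multiple tests can be illustrated for the idealized case of independent tests, where the overall error is 1 − (1 − 𝛼)^k for k tests. Repeated looks at accumulating data are correlated, so this simple formula overstates the inflation in the sequential setting, but it conveys the general behavior.

# Inflation of the overall type I error with multiple independent tests
# (an idealized case; correlated sequential tests inflate the error less).
alpha = 0.05
for k in (1, 2, 5, 10):
    familywise = 1 - (1 - alpha) ** k
    print(k, round(familywise, 2))
# approximately 0.05, 0.10, 0.23, and 0.40 for k = 1, 2, 5, and 10 tests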
6.1.5 The Properties of Confidence Intervals Are Similar to Hypothesis Tests
Summarizing the observed data using effect estimates and confidence intervals yields more information than the results of a hypothesis test. Confidence intervals are centered on the estimated effect and convey useful information regarding its precision. This descriptive quality renders them more useful than hypothesis tests or p-values, which are inadequate as data summaries. This is also discussed in Chapter 25. However, hypothesis tests and confidence intervals share some common properties. For example, each can
FIGURE 6.3 Confidence interval centered on an estimated treatment effect, Δ̃. Δℎ and Δ denote a hypothetical and true treatment effect, respectively. U and L are the confidence bounds around the estimate. (The true parameter value remains unknown.) Part A is similar to a type I error; Part B is similar to a type II error.
be reconstructed from the other. A confidence interval is essentially a collection of hypotheses that cannot be rejected at a specified 𝛼-level.

Suppose that Δ̃ is the estimated treatment effect from our trial (Fig. 6.3). We construct a confidence interval around Δ̃, but it is a confidence interval for Δ, the true treatment effect. Suppose investigators think that the true treatment effect is Δℎ. A situation like a type I error occurs if the confidence interval around Δ̃ excludes Δℎ (Fig. 6.3A). Even though the investigators were correct in that Δℎ ≈ Δ, the study sample seems to have been an atypical one, and we would conclude that the data were inconsistent with Δℎ, despite Δℎ ≈ Δ. A mistake like the type II error can occur when the confidence interval includes both Δ and Δℎ (Fig. 6.3B). The estimated treatment effect is seen to be consistent with Δℎ, although in reality it is substantially different from Δ. We would expect a narrower confidence interval based on a larger sample to correct this.

6.1.6 Using a One- or Two-Sided Hypothesis Test Is Not The Right Question
In comparative experiments, investigators frequently ask if a one-sided hypothesis test is appropriate. When there is biological knowledge that the treatment difference can have only one direction, a one-sided test may be appropriate. Firm knowledge concerning the direction of a treatment effect is not easy to acquire. As one possibility we might imagine testing the effects of a nontoxic adjuvant, such as A versus A+B, where B can only augment the effect of A. Given sufficient knowledge about B, a one-sided test may be sensible. This is not the same as “only being interested in” differences in one direction, which happens frequently. Not all hypothesis tests are inferentially symmetric. If a treatment is being compared with a placebo, investigators will not be interested in demonstrating conclusively that the new treatment is worse than the placebo. Similarly, if a new treatment is compared to standard therapy, learning that the new treatment is not superior
to the standard is sufficient. Some questions have only one interesting, useful, and appropriate side. However, the directionality of the biological question in no way settles the appropriate probability level to use in a significance or hypothesis test. Should investigators accept a lower standard of significance just because there is a known or preferred direction for the treatment difference compared to when it is not known? I think not, but this is precisely what happens if a one-sided critical value is chosen for 𝛼 = 0.05 when otherwise a two-sided 𝛼 = 0.05 test would be used. Using the same type I error probability for one- and two-sided questions amounts to using a less rigorous critical value for the one-sided case. A more consistent procedure would be to employ a one-sided 𝛼 = 0.025 type I error, which would yield the same standard of significance as the conventional two-sided test. This discussion is not meant to define evidence in terms of the p-value. Thus, the right question is the standard of evidence reflected by the critical value, and not the direction of the hypothesis test. It is not appropriate to employ a one-sided test as a falsely clever way to reduce the sample size. It is always necessary to adjust the type I error rate to suit the question at hand.
6.1.7 P-Values Quantify the Type I Error
P-values are probability statements made under the null hypothesis regarding replication of the experiment. Suppose that the treatment effect is a fixed constant of nature, and the null hypothesis is true. Because of random sampling variability, the estimated effect or difference that we observe in our experiment will be different from zero. If the null hypothesis is correct, how likely are we to obtain the observed result, or one more extreme? If the observed result is unlikely, we take this as evidence that our sample must have come from a distribution other than the null. If we reject the null hypothesis when it is correct, it is a type I error. The significance level, or p-value, is the probability of obtaining the observed result or one further away from the null when the null hypothesis is in fact true. If the observed result or one more extreme is relatively likely, we would not reject the null hypothesis. If we could repeat our experiment many times, the estimates of clinical effect obtained would average out close to the true value, assuming no bias. Then the mean of the probability distribution of the estimates would be evident and there would be little trouble in deciding if an effect was present or not. However, we usually perform only a single study, meaning that our judgment is subject to error. We assess instead the estimate obtained under an assumption of no effect. In this way of thinking, the probability distribution or uncertainty refers to the estimate obtained and not to the true treatment effect. Thus, the p-value is not a statement about the true treatment effect but about estimates that might be obtained if the null hypothesis were true and we could repeat the identical trial many times.
6.1.8 Type II Errors Depend on the Clinical Difference of Interest
There are three factors that influence the chance of making a type II error. These are the critical value for the rejection of the null hypothesis, the width or variance of the distribution of the estimator under the alternative hypothesis, and the distance between the centers of the null and alternative distributions, which is the alternative
FIGURE 6.4 Power as a function of hazard ratio and number of events for the logrank test. All sample sizes can have “high power” if the effect size is assumed to be sufficiently large.
treatment effect or difference (Fig. 6.2). Investigators have control over the rejection region as discussed above and the width of the distribution, which is a direct consequence of sample size. The magnitude of the alternative hypothesis should not truly be under investigator control: it should be a consequence of the minimally important difference (MID) that is specified based on clinical considerations. We can calculate power for different hypothetical MIDs. Larger assumed MIDs decrease the type II error for a given sample size. Thus, all trials have a high power to detect MIDs that are hypothetically large enough. Unfortunately, large treatment differences are usually not plausible clinically, so high power to detect them is not useful. In contrast, the utility of a large study is that it can detect realistic but important small differences with high power.

These ideas can be made more quantitative by considering the accompanying power curve (Fig. 6.4) for a hypothetical clinical trial comparing survival in two treatment groups. The power of the study is plotted against the assumed treatment difference measured as a hazard ratio, roughly the ratio of median event times. A study with 100 events (solid line) has only 53% power to detect a hazard ratio of 1.5, but 90% power to detect a ratio of 2 or higher. Smaller studies have lower power for all effect sizes, but they eventually reach 90% power to detect hazard ratios greater than 2. In the presence of censoring there is not a perfect correspondence between events and sample size, but this is unimportant for the purposes here. A more detailed discussion of power appears in Chapter 16.

It is useful to examine the consequences of type I and type II error rates beyond the context of a single trial. If these error rates are habitually relaxed as sometimes happens
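The solid curve in Figure 6.4 can be approximated with the familiar normal approximation relating the number of events to the hazard ratio for the logrank test. The sketch below uses that approximation, which is assumed here and discussed in detail in Chapter 16, and reproduces the figures quoted above: roughly 53% power at a hazard ratio of 1.5 and over 90% at a ratio of 2 with 100 events.

# Approximate power of the logrank test as a function of events and hazard ratio,
# using the usual normal approximation (assumed here; see Chapter 16).
from math import log, sqrt
from scipy.stats import norm

def logrank_power(events, hazard_ratio, alpha=0.05):
    """Two-sided alpha; equal allocation between treatment groups."""
    z_alpha = norm.ppf(1 - alpha / 2)
    noncentrality = sqrt(events) * abs(log(hazard_ratio)) / 2
    return norm.cdf(noncentrality - z_alpha)

print(round(logrank_power(100, 1.5), 2))   # about 0.53
print(round(logrank_power(100, 2.0), 2))   # about 0.93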
TABLE 6.2 Chance of a False Positive Study Result as a Function of Power and Prior Probability of Success*

Entries are 𝑃[𝑇−|𝑆+] for prior probability of success 𝑝 and type II error 𝛽.

𝑝        𝛽 = 0.1    𝛽 = 0.2    𝛽 = 0.5
0.01     0.73       0.76       0.83
0.02     0.58       0.60       0.71
0.03     0.47       0.50       0.62
0.04     0.40       0.43       0.55
0.05     0.35       0.37       0.49
0.10     0.20       0.22       0.31
0.20     0.10       0.11       0.17
0.30     0.06       0.07       0.10
0.40     0.04       0.04       0.07
0.50     0.03       0.03       0.05
0.60     0.02       0.02       0.03
0.70     0.01       0.01       0.02
0.80     0.01       0.01       0.01
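The entries in Table 6.2 follow from Bayes' rule applied to the outcome of a significance test. The sketch below reproduces the tabled values under a one-sided type I error of 0.025; that value is my assumption, chosen because it matches the table, with p denoting the prior probability that the treatment is truly effective and beta the type II error.

# Probability that a "positive" (significant) trial is a false positive,
# as a function of the prior probability p that the treatment truly works
# and the type II error beta.  A one-sided type I error of 0.025 is assumed
# here because it reproduces the tabled values.
def false_positive_prob(p, beta, alpha=0.025):
    true_pos = (1 - beta) * p        # truly effective and detected
    false_pos = alpha * (1 - p)      # ineffective but "significant"
    return false_pos / (false_pos + true_pos)

for p in (0.01, 0.10, 0.50):
    print(p, [round(false_positive_prob(p, b), 2) for b in (0.1, 0.2, 0.5)])
# p = 0.01 gives 0.73, 0.76, 0.83 for beta = 0.1, 0.2, 0.5, matching Table 6.2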
Note that 𝐷(0) = 0, and

\[
\lim_{t \to \infty} D(t) = a_0 T,
\]

where 𝑎0𝑇 is simply the total accrual. An example of 𝑛(𝑡) and 𝐷(𝑡) is shown in Figure 9.3, where 𝑇 = 10 and 𝑎0 = 25, with a Weibull survival function $S(t) = e^{-\mu t^{k}}$, with 𝜇 = 0.1 and 𝑘 = 1.5. If we make the additional simplifying assumption of exponential survival, equation (9.2) would yield

\[
n(t) = a_0 \lambda \int_0^{\tau} e^{-\lambda (t-u)} \, du.
\]
Then,

\[
n(t) =
\begin{cases}
a_0 \left( 1 - e^{-\lambda t} \right), & \text{if } t \le T, \\
a_0 \left( e^{-\lambda (t-T)} - e^{-\lambda t} \right), & \text{if } t > T.
\end{cases}
\tag{9.3}
\]

From equation (9.3), the cumulative number of events is

\[
D(t) =
\begin{cases}
\dfrac{a_0}{\lambda} \left( \lambda t + e^{-\lambda t} - 1 \right), & \text{if } t \le T, \\
\dfrac{a_0}{\lambda} \left( \lambda T + e^{-\lambda t} - e^{-\lambda (t-T)} \right), & \text{if } t > T.
\end{cases}
\tag{9.4}
\]
Example 9.5. Suppose that a clinical trial requires 180 events to achieve its planned power. If accrual proceeds at a constant rate of 80 subjects per year for 4 years and the event rate is 0.13 per person-year of follow-up, how many events will have taken place after 4 years? We assume exponential survival. Substituting into equation (9.4) yields

\[
D(4) = \frac{80}{0.13} \left( 0.13 \times 4 + e^{-0.13 \times 4} - 1 \right) \approx 70 \text{ events}.
\]
The study will be only 40% finished, with respect to events, after 4 years. However, additional events will accumulate more quickly after 4 years due to the size of the study cohort. The number of subjects remaining on-study at 4 years is 320 − 70 = 250 and the length of time required to observe 180 events is just under 7 years.
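The arithmetic of Example 9.5 can be checked directly from equation (9.4). The short sketch below evaluates the cumulative events at 4 years and the number of accrued subjects who remain event-free, under the same exponential assumption; it is only a numerical check of the figures above.

# Check of Example 9.5 under the exponential model of equation (9.4).
from math import exp

a0, lam, T = 80.0, 0.13, 4.0   # accrual rate per year, event rate, accrual period

def cumulative_events(t):
    """Cumulative events D(t) from equation (9.4)."""
    if t <= T:
        return (a0 / lam) * (lam * t + exp(-lam * t) - 1)
    return (a0 / lam) * (lam * T + exp(-lam * t) - exp(-lam * (t - T)))

events_4yr = cumulative_events(4)        # about 70 events
at_risk_4yr = a0 * T - events_4yr        # about 250 subjects accrued and still event-free
print(round(events_4yr), round(at_risk_4yr))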
9.4 INCLUSIVENESS, REPRESENTATION, AND INTERACTIONS

9.4.1 Inclusiveness Is a Worthy Goal
Having a broadly inclusive study cohort is a good idea from at least two perspectives. It allows an opportunity to investigate biological heterogeneity if it truly is present. This is a more valid point with respect to genotypic rather than phenotypic differences in study subjects. Nevertheless, it is applied mostly in the latter case. Second, a broadly inclusive study cohort helps justly distribute the risks and burdens of research in society. The removal of barriers to participation in research is an
important sociopolitical goal that scientists must assist. Even so, participation in research is voluntary and reflects the personal and cultural attitudes and beliefs of subjects. Medical investigators have only limited opportunity to influence those factors. I did not say that an inclusive study cohort distributes the benefits of research justly. As indicated elsewhere in this chapter, knowledge from clinical trials generalizes on the basis of biology rather than on the superficial characteristics of the cohort (empiricism). We are all more similar than different with regard to biology. Clinical trials clearly have the potential to provide benefit to individual participants, but only indirectly to others. The societal benefits of this research depend almost entirely on the structure and delivery of health care, which is determined by political, social, and economic factors.
9.4.2 Barriers Can Hinder Trial Participation
Among adult cancer patients in the United States, less than 3% participate in clinical trials [622]. A recent study of disparities in cooperative oncology groups shows the same low participation rate and poor acceptance by minorities, women, and the elderly [1076]. Similar levels of participation in research studies would probably be found for many diseases. Exceptions might include AIDS and some uncommon pediatric conditions. Although the reasons for lack of participation are complex, they fall into three general categories: physician, patient, and administrative.

The reasons that physicians give for failing to enter eligible subjects onto clinical trials include the perception that trials interfere with the physician–patient relationship and difficulties with informed consent [1465]. In some cases participation in a trial may threaten the supposed expertise and authority of the physician. This is likely to be less of a problem in the future as patients become more active in choosing from among treatment options. The increasing use of second opinions instituted by patients and insurers and patients’ assertive participation in multiple AIDS treatments are examples.

Informed consent procedures can be cumbersome, intimidating, and time-consuming. Often patients who give consent do not retain the detailed knowledge about the treatments or the study that investigators would like them to. This suggests difficulties with the consent process that may discourage patients and their families from completing it. Many consent forms are not written at an appropriate reading level. The process tends to overstate risks and understate benefits from participating in research studies as a way of minimizing the liability of investigators and their institutions. An accurate portrayal of these details would require excessively long and technical consent documents.

Some investigators are implicitly or explicitly formulating the idea that patients have a right to participate in research studies [416]. This new right is seen to arise as an extension of the principle of autonomy. I believe this notion is incorrect because it presupposes benefits from participation that may not exist. This is true on the individual participant level and perhaps more so on a group or societal level. This point will be expanded below. Furthermore, this right cannot be exercised universally, even if it exists, because the number of clinical trials is too small.

Many patients are mistrustful of the medical establishment even if they trust their individual physicians. Recent public reactions to proposed changes in health care, such as managed care, indicate this. Stronger reactions are often evident among racial and ethnic minority groups regarding participation in clinical trials. Of the small amount of research that has been done in this area, results related to cancer suggest three points:
(1) patients are not well informed about trials, (2) they believe that trials will be of personal benefit, and (3) patients are mistrustful [1274]. Such factors partially explain the low minority participation in many research studies [793].

9.4.3 Efficacy versus Effectiveness Trials
There are two views of the role of clinical trials in evaluating medical therapies today. The views overlap considerably with regard to the methodology that is required, but can generate differences of opinion about the appropriate study sample to enroll, coping with data imperfections (Chapter 19), and the best settings for applying clinical trials. The first is that trials are primarily developmental tools used to make inferences about biological questions. These types of studies tend to be smaller than their counterparts (discussed below) and employ study cohorts that are relatively homogeneous. Studies with these characteristics are sometimes called “efficacy” trials, a term and distinction incorrectly attributed to Cochrane [274]. With this perspective, investigators tend to emphasize the internal validity of the study and generalize to other settings based primarily on biological knowledge.

A second perspective on trials is that they are evaluation tools used to test the worth of interventions that should be applied on a large scale. They are motivated by the fact that a biologically sound therapy may not be effective when applied outside the controlled setting of a developmental trial. These trials are large, applied in heterogeneous populations, and use simple methods of assessment and data capture. In Chapter 18, I referred to these studies as large-scale (LS) trials. They have also been called effectiveness trials, large simple trials, and public health trials. Investigators conducting these trials tend to emphasize the external validity of the studies and expect that their findings will have immediate impact on medical practice. A concise summary of the difference between these types of studies is shown in Table 9.1.

These characterizations are approximations and simplifications. It is easy to find studies that fit neither category well or have elements of both. As we have already seen, some treatments become widely accepted and validated without using any structured methods of evaluation. Other treatments are developed using both types of studies at different times. Viewed in this way, we can see that the proper emphasis of clinical trials is not an “either-or” question about effectiveness or efficacy. Both pathways and types of studies have useful roles in developing and disseminating medical therapies.
TABLE 9.1 General Characteristics and Orientation of Efficacy and Effectiveness Trials

    Characteristic            Efficacy trial                Effectiveness trial
    Purpose                   Test a biological question    Assess effectiveness
    Number of participants    Less than 1000                Tens of thousands
    Cost                      Moderate                      Large
    Orientation               Treatment                     Prevention
    Cohort                    Homogeneous                   Heterogeneous
    Data                      Complex and detailed          Simple
    Focus of inference        Internal validity             External validity
    Eligibility               Strict                        Relaxed
The importance of these distinctions for the study cohort primarily relates to the breadth of the eligibility criteria. For any particular trial, these criteria should be broad enough to permit inclusion of enough subjects to answer the research question quickly. However, the criteria should not permit more heterogeneity than is clinically useful. For example, the bioavailability of a compound may be in question and could be established using a small number of study subjects who meet fairly narrow eligibility criteria. Questions about the bioavailability in different individuals are important; they might be answered by combining the study result with other biological knowledge. It may not be efficient to try to determine differences in bioavailability empirically.

9.4.4 Representation: Politics Blunders into Science
In recent decades the clinical trials community has focused a great deal of attention on the composition of study cohorts, particularly with regard to their gender and minority makeup. A perception developed early in the discussion that women and minorities were underrepresented in clinical studies of all types, and that this was a problem for scientific inference regarding treatment effects [1274, 1340, 1353]. One of the strongest contributing factors to this perception might have been that women of childbearing potential were historically considered ineligible for many drug trials. Ignoring Benjamin Franklin’s admonition that “the greatest folly is wisdom spun too fine,” many politicians thought it necessary to have scientists correct this perceived injustice. The legislative solution was contained in the NIH Revitalization Act of 1993:

The Director of NIH shall ensure that the trial is designed and carried out in a manner sufficient to provide for a valid analysis of whether the variables being studied in the trial affect women or members of minority groups, as the case may be, differently than other subjects in the trial [1097, 1501].
Some exceptions to this requirement are permitted, although the law explicitly excludes cost as a reason for noncompliance. There are scientific reasons why one would occasionally need to know if sex or ethnicity modulate the efficacy of the treatment. More specifically, do the biological components of sex or ethnicity interact with a given treatment in a way that affects therapeutic decisions? However, to require always testing for them while ignoring the consequences of known effect modifiers such as major organ function, and to increase the cost of performing trials to do so, requires something other than a scientific perspective.

As for the underrepresentation belief, in the case of women, reliable and convincing data to support or refute such a claim were simply not available at the time [997]. Data from published studies suggest that female:male participation of individuals in clinical trials was about 2:3. Published female-only trials tend to be larger than male-only studies, and there were probably as many female-only trials as male-only trials before the law was passed [596]. Recent data from NIH-sponsored trials in cancer show a firm female preponderance [1329]. It would not be a surprise to see the same in other diseases. Possibly there never was a women’s participation problem, and according to the political calculus of 1993, we might now be neglecting men’s health.

The sex question now has a life of its own, there being a body of scientific evidence that males and females are different [585]. However, it remains difficult to tell politically correct differences from politically incorrect ones. Aside from reproductive biology
and its immediate hormonal consequences, there are other male–female differences in metabolism as well as risk and consequences of chronic disease. What is not clear is how many such differences are important therapeutically (we know that many are not) and if we should try to uncover them all.

For ethnic subsets relatively few data are available, but it is a universal experience that minorities join clinical trials less often than, say, Caucasians. This is true even in diseases such as cancer where some minorities have higher incidence rates and higher death rates. Many factors contribute to this, including history and cultural perceptions, underserved primary medical needs, and comorbidities that render potential participants ineligible for some research studies. Comorbidities were an important factor in low participation in a survey even at a minority institution [7]. Other barriers to participation in cancer studies have been described as more attributable to religion, education, and income than race [11]. This last paper explicitly makes the mistake of stating generally that there is questionable applicability of research findings to ethnic groups when they are underrepresented. Aside from the biological weakness of such a claim, it is inconsistent with the conclusion of the investigators who find that race is largely a surrogate for religion, education, and income.
Convenience versus Active Samples Revisited

In issues of efficacy and efficiency, it is important to understand the difference between convenience samples and cohorts that actively control one or more aggregate characteristics, such as sex or ethnic composition. Convenience samples accept the composition of the cohort passively. The nature of the cohort is dictated by local factors such as referral patterns, chance, selection effects, and the like. Only infrequently would such a sample resemble the population with the disease under investigation, and even then it most likely would not be a true random sample. However, even a convenience sample uses important filters such as eligibility and ineligibility criteria. Convenience samples have the advantage of simplicity and presumably the lowest cost.

A sample in which subject characteristics are controlled (e.g., demographics or risk factors) has potential advantages over a sample based largely on convenience. Active sampling offers the possibility of a truly random study cohort representative of the population with the disease. This could be a critical design feature if the study must provide an accurate estimate of an absolute rate—not a very frequent need in clinical trials. A common circumstance where an active sample is required is when predicting political elections. Active samples require a great deal of planning and expense compared to convenience samples.

A simple example will illustrate the potential inefficiency of an active sample. Suppose that a disease under study has an asymmetric sex distribution, 25% female versus 75% male, and we need to control the study cohort so it has the same sex composition. Suppose further that the convenience sample readily available to investigators has the inverse composition, 75% female versus 25% male. For simplicity, assume that all subjects are actually eligible for the study. To obtain the target of three times as many males as females, we have to discard almost 90% of the female potential participants—well over half of the convenience sample. (From 100 available subjects, 75 female and 25 male, the 25 males can be matched by only about 8 females to preserve the 3:1 male:female ratio, so roughly 67 of the 75 females must be set aside.) Thus the extra calendar time and resources required to actively control the sex composition would have to be well justified in terms of scientific gain.

This dichotomy is not perfect, but as a rule, clinical trials do well with samples that are closer to the passive end of the spectrum than the active end. Reasons why include the
following: many biologically important characteristics are already controlled by routine eligibility criteria; accurate inferences about the disease population may not be needed; relative differences rather than absolute treatment effects might be the primary focus, and these can be estimated from simpler designs; other factors typically controlled in active samples do not often have biological importance. It is this last reason that is most relevant in the present context.

NIH and FDA Guidelines

The response of scientists at NIH to the requirements of the law is evident in the guidelines published to implement them [529, 1095]. The guidelines essentially restrict the applicability of the law to comparative trials and require an assessment of existing evidence from animal studies, clinical observations, natural history, epidemiology, and other sources to determine the likelihood of “significant differences of clinical or public health importance in intervention effect.” When such differences are expected, the design of the trial must permit answering the primary question in each subset. When differences in treatment effects are not expected, are known not to exist, or are unimportant for making therapeutic decisions, the composition of the study group is unimportant. In cases where the evidence neither supports nor refutes the possibility of different treatment effects, representation is required, although “the trial will not be required to provide high statistical power for each subgroup.”

In view of the wording of the law, the NIH guidelines are helpful because they intend to restore a scientific perspective on the design of phase III trials. When no conclusive information about treatment–covariate interactions is available, a valid analysis is implicitly defined as being unbiased because trials with low statistical power are permitted. Presumably very large interactions would be evident from such designs. However, if valid analysis means unbiased as opposed to high statistical power, then some clinically important interactions could remain undetected. This seems to be contrary to the intent of the law. Thus there does not appear to be a reasonable scientific perspective that can be superimposed on the law, and the repair is inescapably empirical.

The FDA guidelines concerning women of childbearing potential [495] also attempt to impose scientific thinking on an empirical law. They deal with the issues in a more mechanistic fashion, emphasizing the role of pharmacokinetic analyses in detecting possible gender differences. Also, the exclusion of women of childbearing potential from early trials has been eliminated. Additionally, the FDA states:

… representatives of both genders should be included in clinical trials in numbers adequate to allow detection of clinically significant gender-related differences in drug response [495].
Although questions of “validity” and “power” are not addressed, this statement seems to be at odds with the NIH guidelines. Nowhere in the guidelines does it say explicitly that FDA will refuse drug approval on the basis of failure to study subjects adequately representative of those with the disease. One can easily imagine some circumstances in which this would be reasonable and other situations in which it would not be. An interesting example arose when the FDA approved the use of tamoxifen for treatment of male breast cancer, virtually exclusively on the basis of biological similarity of the disease to female breast cancer, rather than on the results of well-performed clinical trials in men. Furthermore, during advisory committee review of the issue, there was little, if any, discussion of male–female differences in response to tamoxifen. Outcomes such
as this are consistent with a biology-based approach to the issues but are at odds with the current law and written policy.

Treatment–Covariate Interactions

A primary reason for requiring a cohort with a particular composition (e.g., age, sex, or ethnic background) on a clinical trial is to be able to study interactions between the treatment and the actively controlled factor (covariate). For example, if males and females are likely to have clinically significantly different treatment effects, the trial should probably be designed to permit studying the difference. To accomplish this efficiently, equal numbers of males and females should be studied (rather than proportions representative of the disease population).

Another important point regarding treatment–covariate interactions is the distinction between modifying the magnitude of the treatment effect as compared to changing its direction (Fig. 9.4). The former is termed a quantitative interaction, whereas the latter is termed qualitative. In principle, we would be interested in sizable interactions of either type. However, qualitative interactions are the ones of real therapeutic concern because they mean that one subset should receive treatment A, whereas the other should receive treatment B. Such circumstances are exceptionally rare. Quantitative interactions are less consequential because they inform us that both subsets should receive the same treatment, although one subset will benefit more than the other.

The cost of testing treatment–covariate interactions can be high, even for balanced covariates. As will be seen in Chapter 22 in some detail, the test of interaction in a balanced 2 × 2 factorial trial has variance 4σ²/n, whereas the main effects have variance σ²/n. This indicates the inefficiency of studying interactions. Similarly, the variance of the difference of two means, analogous to a main effect, is 2σ²/n, whereas the variance of the difference of differences, analogous to a treatment–covariate interaction, is 4σ²/n. This also indicates the inefficiency of testing treatment–covariate interaction. Covariate imbalances will decrease the precision. Ordinarily we would not design trials to detect such effects with high precision, unless it was very important to learn about them.

Final Comments

The U.S. government’s fetish with racial and sex composition of study cohorts has important practical consequences and must be examined from a scientific perspective, despite the fact that it originated from political concerns. The only way to make sense out of the inconsistent thinking that plagues this issue both within and outside the government is to separate political considerations from scientific ones. I cannot say generally if the law and its consequent NIH policy make good politics or good science policy. My guess would be not. But it is reasonably certain that they do not generally make good science. Even so, there are many apologists who view the requirements as justified on scientific or ethical grounds. The rationale is that such differences are common and important. I believe that neither is the case—people are demonstrably more similar than they are different. In fact the legislation and the guidelines support my claim. If differences are common and important, then both the funding levels and valid analysis rules are terrible injustices. Gender differences in treatment effects are very uncommon. If real, they are mostly inconsequential clinically.
FIGURE 9.4  Different types of treatment–covariate interaction.

In 2004, plans to complete development of a new drug for heart failure targeted to African–Americans were announced with great fanfare. The drug is a combination of hydralazine and isosorbide dinitrate, neither of which is new. This combination was said to be the first “ethnic” drug and has been rationalized on the basis of epidemiologic and other differences in heart disease and its treatment in blacks and whites. Superficially, this circumstance appears to support the idea of consequential biological differences attributable to race. However, the driving forces behind it are sociopolitical, not scientific. The relevant history behind this drug combination and its proposed indication is deconstructed by Kahn [811].

Race is not a biological construct but a sociodemographic one. The study of human genetic heterogeneity does not support the general attribution of biological differences to race. Genetic differences are generally larger, more varied, and more consequential within racial categories than between them [106–109]. It would be a surprise if factors derived on this superficial basis turned out to have important biological consequences. Schwartz [1351] stated the problem especially well:

. . . Such research mistakenly assumes an inherent biological difference between black-skinned and white-skinned people. It falls into error by attributing a complex physiological or clinical phenomenon to arbitrary aspects of external appearance. It is implausible that the few genes that account for such outward characteristics could be meaningfully linked to multigenic diseases such as diabetes mellitus or to the intricacies of the therapeutic effect of a drug.
Thus therapeutic differences attributable to race rather than to a more fundamental biological construct are more likely to be errors. We can often extrapolate inferences from animal species (or even in vitro experiments) to humans (e.g., carcinogenicity and dose testing). Why would we not think we can learn about all humans from studies in other humans?
On the ethics side, there is an interesting paradox that surrounds such concerns. The notion is that the principle of justice requires us to address the concerns of sex and minority subsets, the premise being that subgroups need to be studied to derive the benefits of biomedical research. When health concerns have been identified, usually by means other than clinical trials, this is an important concern. For clinical trials broadly and abstractly, I believe it is exactly backward. Because we have learned much about the treatment of human disease, we have learned about the treatment of disease in particular subsets. If individuals from one of those subsets have not participated in clinical trials, they have derived benefit without incurring risk, very likely disproportionately.

Consider the diagnosis and treatment of prostate cancer as an example. African–American men have been underrepresented in most trials, but we have learned a great deal about screening and treatment that applies directly to them. Disparities in outcome may be a consequence of failures to apply knowledge or of health care access, but within the domain of clinical trials the justice argument does not apply. Of course, additional studies could target specific questions.

It is reasonable to ask what has been learned from trials under the requirements for representation now that the NIH policy has been in place for 20 years. It is not possible to say conclusively that little has been learned across all of medicine. But no qualitative treatment–covariate interactions based on race or sex appear to have entered clinical practice. This is predictable from the biological arguments above.
9.5 SUMMARY
A clinical trial cohort is defined by the eligibility and exclusion criteria. The criteria are chosen with a target population in mind, but often the study cohort will not closely resemble the intended population. This selection effect is a consequence of interpretations by the investigators and of chance, and it limits the comparability of different single-arm studies, even though they have the same selection criteria. In randomized comparative trials, the selection effects are the same in all treatment groups, validating the comparisons.

Realistic quantitative assessments of accrual are necessary when planning a clinical trial. Simple mathematical models can help predict the expected duration of a study and the number of events that will be observed over time.

In the last few years there has been a great deal of interest in the gender and ethnic composition of trial participants. This interest arises, in part, from concerns about the generalizability of trial results. Large heterogeneous study populations offer some advantages for generalizing results empirically, but may not be optimal for proving biological principles. Removing barriers to trial participation and employing representative cohorts is essential for the health of medical studies. However, required representation in clinical trials could be a hindrance to acquiring new knowledge if it consumes too many resources.
9.6 QUESTIONS FOR DISCUSSION
1. Suppose that the population consists of only two types of individuals. Half of all subjects have a response probability of 0.25 and the other half have a response
probability of 0.40. If both types of subjects are equally likely to be accrued on a trial, what response probability can we expect to observe? How will this change if the eligibility criteria select 2:1 in favor of those with a higher response probability?

2. Suppose that the accrual rate onto a trial is constant and the event times (survival) are exponentially distributed with hazard λ. After the accrual period, there is a fixed-length follow-up period. Derive expressions for the number of events and cumulative number of events as functions of time and these parameters. Illustrate how follow-up time (or total study duration) can be substituted for new accruals.

3. Response data from a clinical trial will be analyzed using a linear model. For the ith subject, Y_i = β_0 + β_1 T_i + β_2 X_i + γ T_i X_i, where T and X are binary indicator variables for treatment group and covariate, respectively, Y is the response, and the other parameters are to be estimated from the data. γ models the interaction between treatment and covariate. Using ordinary least squares (or maximum likelihood), what is the relative magnitude of var{β_1}, var{β_2}, and var{γ}? Discuss.

4. One way to generalize the results from clinical trials is empirical. This perspective suggests that, if females are not in the study population, the results do not pertain to them. Another way to generalize is biological. This perspective suggests that learning about males informs us about females insofar as biological similarities will allow. Discuss the merits and deficiencies of each point of view.

5. Read the study report by Exner et al. [436a]. Comment on the research design, methods, and conclusions.
10 DEVELOPMENT PARADIGMS
10.1 INTRODUCTION

A development paradigm is a conceptual model for generating strong efficacy and safety evidence using an ordered series of experiments. The structured evidence generated in this way informs reliable decisions regarding a new treatment. Development embodies two strategies simultaneously—one for the overall goal of accepting a new therapy, and a second for stopping as early as possible if signals are unfavorable. Individual trials are the tactics used for each clinical question or stage of development. The development paradigm exists because there is a scientific, regulatory, and ethics consensus as to which research designs are appropriate to answer recurrent questions. The development pathway can be fairly clear for new drugs, biologicals, and devices, for example, but perhaps less evident for unregulated therapies. Studies that cannot provide definitive causal evidence about a new therapy will not have a formal place in the paradigm, but may be useful predevelopment.

Overall properties of a development paradigm are the result of (i) ordering and structure of therapeutic goals, (ii) individual trial designs used to address typical developmental questions, and (iii) decision rules that act on trial results. I will use the term “pipeline” to refer to some realization of a paradigm and its properties. Each therapy probably deserves individualized development, but the disease, recurring scientific questions, regulation, decision rules, and patterned thinking tend to sustain stereotypical paradigms.

Some tasks in development can be accomplished in parallel. For example, we can estimate both an optimal dose and some bounds on risk threshold from a single dose-finding trial or several performed at roughly the same time. However, there is always a necessary sequencing of some tasks—things that must be known before the next step can take place. For example, demonstrating that a drug engages its target or biological
site of action will precede evaluation of clinical outcomes. When drugs have significant side effects, it is usually necessary to establish a safe dose before testing efficacy. Also, before spending resources on a large comparative trial, it is usually wise to require a positive signal from a smaller preliminary trial.

In academic settings especially, there is a tendency to view trials individually without much regard for an overall process that necessitates them. That results in part from a singular focus on the chosen scientific question. This is changing. In the pharmaceutical and biotech settings, overall scientific and financial needs have always required a more integrated, longitudinal, and strategic view of development [297]. There the development pipeline must meet scientific needs while simultaneously being optimized for efficiency and regulatory requirements. Commercial entities may have additional challenges if they sponsor therapeutic development in several areas, in which case there are multiple pipelines that may share or compete for resources. I will not discuss those issues.

Strategy for development reflects essential therapeutic questions driven by the needs of those affected by the disease. A flexible development strategy could help optimize individual trials, but there tends to be an overly rigid view of the pipeline. The current era of precision therapeutics will likely change this. Not all drugs can or should undergo the same developmental pathway. For example, in oncology we seem to be at the threshold of discarding organ- or morphology-based primary classification of cancers in favor of gene-based categorization that reflects therapeutic targets. But cancer drug regulation is not yet ready to adopt this paradigm because not all drugs are targeted, and it would have strong implications for trial design.

10.1.1 Stages of Development
The classical drug development paradigm recognizes three stages or phases. This may echo phase I, II, and III, but my view is that the converse is true, and the trilogy reflects a necessary and efficient balance between competing strategies for proceeding with development versus terminating it. The three stages are sometimes labeled discovery, learning, and application or confirmatory (Figure 10.1) [1372]. I prefer to describe the paradigm simply as translation, early, middle, and late development, so as not to imply particular study designs. Translation employs both nonclinical and clinical experiments, and properly is part of discovery. Translational clinical trials are described in Chapter 11.

FIGURE 10.1  Phases of therapeutic development

Early development of drugs and biologicals tends to focus on dosing. Devices and surgical treatments have early developmental steps that are more variable. More universally, early development focuses on risk. Affirmation of safety, though popular to claim as an objective, is not within reach of the small sample sizes employed in early development. Dose-finding as a prototypical early development question is discussed in Chapter 12.

Middle development is a diverse space that tends to focus on activity, risk, and safety (Chapter 13). These questions are nearly universal, but the ideal study designs are highly variable. Discarding underperforming therapies is mostly a middle development question. In some cases, middle development can be skipped entirely, despite the important questions typically asked. Doing this has strong effects on the overall performance of the pipeline, and may represent considerable economic and intellectual risk.

Late development is usually formally comparative, especially employing randomized trials (Chapter 14). Evidence from properly designed late development trials always supersedes findings from middle development, which is one reason why it is theoretically possible to skip middle development. Comparative trials are larger, longer, and more expensive than earlier developmental studies, and most pipelines would seek to minimize how many such studies need to be done.

Some questions do not fit neatly into the descriptive categories above. Biomarker trials could take place in any stage of the developmental paradigm, though many interesting ones are comparative. Questions that relate dose of drug and efficacy (as opposed to dose and safety) often arise in middle development, but structurally are expensive comparative trials. Assessment of safety is a formal objective of every clinical trial regardless of developmental stage. But a true perspective on safety requires large cohorts. A mapping of main study objectives by developmental phase or stage is shown in Table 10.1. This mapping is typical in the traditional scheme but should not be viewed as required. The following discussion will emphasize freedom to recast paradigms in response to scientific needs.
TABLE 10.1 Objectives and Elements of Trials According to Developmental Stage

    Discovery          Learning              Application
    Translation        Early                 Middle                 Late
    Targeting          Dosing                Feasibility            Relative efficacy
    Signaling          Optimal dosing        Risk and safety        Safety
    Delivery           Pharmacokinetics      Activity               Effectiveness
                       Pharmacodynamics      Futility               (Dosing)
                       Side effects          (Relative efficacy)
                       Risk
10.1.2 Trial Design versus Development Design
Clinical trials are under scrutiny by well-meaning individuals who question every aspect of their design and conduct. Criticisms come from the lay public, ethicists, advocates, basic scientists, practitioners, and even from trialists themselves. A common claim with new therapeutics is that “we need new clinical trial designs to address these questions.” Most of the concerns do not and cannot challenge the scientific foundations of clinical trials. It is easy to forget that the designs and pipeline used classically were highly evolved for the purposes to which they were put. The paradigm should change only when the character of the therapeutics entering the pipeline changes the questions being asked.

The problems appear to me to be less about individual trial design and more about the way we put studies together in developmental pathways. For example, a crowded therapeutic landscape demands development strategies that do not displace effective but imperfect treatments. Tough diseases with few treatment options demand a rapid strategy. Both cases can have very similar requirements for individual questions, but the respective pipelines could have key differences. In oncology, much has been said regarding targeted therapies for which new development strategies may be needed [153].

It is essential to formulate a perspective regarding biological determinism for both individual trial designs and development pathways. Biological determinism means the extent to which characteristics of the study subjects influence or determine the effects of treatment. Before targeted therapies and the concept of precision medicine, nearly all treatments were assumed to work equally well or equally poorly in all people. We now know that some therapies can only work in certain subsets of the disease because they were designed that way. A given treatment may take advantage of a variant of the disease with favorable characteristics—certain pathways or dependencies in cancers, for example. This kind of determinism has a strong effect on how we design trials. Determinism implies a different question in the subpopulation of interest where efficacy signals may be large, compared to the overall population where efficacy signals may be diluted to near zero.

Belief in biological determinism abounds very broadly. For example, characteristics such as sex, race, and ethnicity are often mentioned as biological determinants of treatment effect, or at least key factors to be designed around (see Section 9.4). Here we must be careful to separate issues of social justice, for which there is reason to pay attention to these characteristics, from biology, for which there is little if any evidence of determinism. It is also known that some such factors are surrogates for health-related behaviors that can be relevant.

Closely related to the question of biological determinism is our belief regarding random variability. I am repeatedly surprised at the near disbelief in randomness among some investigators working in genomic sciences. If we can understand enough about the working of the genome, will random variation in therapeutic responses disappear, making large clinical experiments unnecessary? The answer is of course “no,” but is not provable, and it may be a philosophical rather than a scientific question. In the laboratory with genetically identical rodents kept in a common environment we still see significant outcome variation. The need for large experiments to control variation will not soon disappear.
10.1.3 Companion Diagnostics in Cancer
Genomic-based targeted therapies suggest great promise in cancer therapy when applied to a subset of individuals who have the relevant tumor or host characteristic. For example, a cancer that seems to depend on a specific growth pathway or factor amplified by mutation might respond well to a drug targeted to block that pathway. This pairing of genetically dysregulated pathways and drugs specifically designed to target them is a main element of “precision” or “personalized” treatment in cancer.

An essential question remains how one determines that a pathway is disrupted. A diagnostic test tailored to measure the relevant gene expression or dysregulated pathway would be useful, and might need to be developed along with a therapeutic. This companion diagnostic would validate an appropriate subset of individuals in whom the new therapy can work. Outside the subset, one would not expect the drug to work as effectively, if at all.

Validation of this diagnostic–therapeutic couple can be advantageous for both scientific and economic reasons, and we might imagine them being paired in a development paradigm. In fact, it might be hard to separate the diagnostic from the therapy. Assuming that the relevant subset is a small fraction of the population, which often seems to be the case in cancer therapy, the diagnostic is as key as the therapy, which would be nearly useless without proper application. This would motivate linking them from a regulatory perspective, for example. Also, when the cost of the therapy is high, a good diagnostic test prevents an expensive therapy from being misapplied to individuals who cannot benefit.

A problem is that the initial companion diagnostic may not be as accurate as tests developed later, perhaps based on superior technology. A more sensitive or specific test developed secondarily can yield more effectiveness by improving population targeting without increasing the biological efficacy of the therapy (a small numerical sketch below illustrates the point). We must be alert to this possibility and not restrict therapies with first-generation diagnostics as a matter of principle. A flexible development paradigm does not necessarily require specific companion diagnostics for targeted treatments.
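As a rough illustration of the targeting point, consider the following sketch with hypothetical numbers (20% marker prevalence and a drug that benefits only marker carriers). Improving the specificity of the companion diagnostic raises the average benefit observed among treated, test-positive patients without any change in the drug itself.

    # Hypothetical sketch: observed effectiveness in the test-positive (treated)
    # group depends on the positive predictive value of the companion diagnostic.
    # Prevalence, sensitivity, specificity, and carrier benefit are invented numbers.

    def treated_group_effect(prevalence, sensitivity, specificity, benefit_in_carriers):
        """Average benefit among test-positive patients, assuming only carriers benefit."""
        true_pos = prevalence * sensitivity
        false_pos = (1 - prevalence) * (1 - specificity)
        ppv = true_pos / (true_pos + false_pos)   # fraction of treated who can benefit
        return ppv * benefit_in_carriers

    first_gen = treated_group_effect(0.20, 0.90, 0.80, benefit_in_carriers=1.0)
    second_gen = treated_group_effect(0.20, 0.90, 0.95, benefit_in_carriers=1.0)
    print(round(first_gen, 2), round(second_gen, 2))   # roughly 0.53 versus 0.82

Biological efficacy, the benefit among carriers, is identical in both cases; only the population targeting has improved.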
10.2 PIPELINE PRINCIPLES AND PROBLEMS
The overall drug development success rate is about 11% [862]. In cardiovascular disease, the rate is as high as 20%, but it is only about 5% for cancer drugs. The average amount of time consumed in development of a new drug is over 10 years and the average cost exceeds $1 billion. Although costs are extraordinary, we have to assume that since many pharmaceutical, biotechnology, and device companies survive, therapeutics development remains profitable. Late stage failures appear to be the largest barrier to development productivity [1186]. The development pipeline is therefore a very consequential entity. Even small improvements in it could strongly affect cost and development time.

Genomically derived targeted agents fuel expectations of large treatment effects and greater safety. A suitably crafted pipeline should yield a relatively high success rate and lower cost of development. Although it is somewhat heretical to say, I suspect that this ideal is not at all guaranteed. The current technology to detect genetic abnormalities, particularly in chaotic diseases like cancer, and to design and synthesize drugs to affect those targets, is far ahead of our knowledge of how to understand the implications. The
short-term result may be potent but ineffective drugs. Some early indications may already be evident—single-agent failures despite insightful design. We must be vigilant about pipeline properties so they do not contribute to a decreased success rate when so much promise is possible.
10.2.1 The Paradigm Is Not Linear
In whatever way we conceptualize it, a development paradigm is not necessarily linear. Interim findings accompanied by external pressures will change directions and priorities. Translational questions circulate between clinic and lab, for example. Relevant clinical questions depend on current evidence that does not evolve linearly. At any stage of development, more than one type of trial may be required to gather the findings needed to advance. An obvious example of nonlinearity is when a drug demonstrates usefulness for a purpose other than the one for which it was designed. Sildenafil, developed initially for hypertension but now famous for erectile dysfunction, is an example. Unexpected toxicities or drug interactions similarly alter the direction of development.

The business and clinical models also influence the developmental paradigm. The historical business model for the pharmaceutical and biotech industries has been a “vial on the shelf” that can be manufactured, stored, and applied to a sizeable population of patients. However, precision medicine suggests that some therapies may need to be manufactured on site, individualized or customized, and administered only to the original patient. These would not lend themselves to the standard business model and may require a different developmental paradigm. An active cancer therapy that has historically been produced in exactly this way is bone marrow transplantation, although it did not evolve according to our paradigm. Certain new immunotherapies for cancer may follow a similar path. Although the paradigms for individualized treatments might be different from the one discussed throughout this chapter, the experiment design principles for any particular trial remain valid.
10.2.2 Staging Allows Efficiency
Staging is a response to the competing needs of development, specifically the requirement to move ahead quickly and efficiently, balanced by early decision points to minimize investments in a failed agent. Interim analyses in a single clinical trial are based on exactly the same principle. More stages are better for making an early exit, but fewer stages are best for reducing overhead costs and calendar time. Stages occur naturally between developmental trials, at which points the results of the latest study may indicate failure. However, there can be staged decision points within a trial, as typically happens in middle and late development. Two stages in a complex process probably capture most of the efficiency of a larger number of steps. As each step increases in complexity and cost, it would naturally tend to be staged itself.

Two key developmental requirements in cancer drugs prior to a comparative trial are dosing and clinical activity. These outcomes are connected by safety issues, but different enough that two steps are implied, as discussed in Chapters 12 and 13. The third and most visible decision point occurs after comparative trials. Three literal stages seem inevitable, although there may have been several predevelopmental checkpoints as well. The dose or activity of some therapies might be established by
historical practice, with potential to shorten the pipeline to two stages, or even directly to a comparative trial. Two stages are not necessarily shorter in calendar time or less expensive to conduct than three.

Two methodologic factors can shorten development as a matter of principle. One is a sufficient number of early points at which development can halt, saving later resources (a rough costing sketch follows). This pipeline feature would demonstrate its value over many development cases—any specific therapy can only be halted once. A second possible efficiency factor is an early efficacy signal that reduces the cost of comparative trials. Such a factor would likely be circumstance specific, but we constantly seek new valid efficacy outcomes that could be applied in every case. Validation of an early efficacy signal is challenging; for example, see the discussion of surrogate outcomes in Section 5.4.
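The costing sketch below uses invented numbers and ignores calendar time, but it shows why the value of an early exit accrues on average across many candidates even though any single therapy is halted only once.

    # Hypothetical expected cost per candidate with and without a middle
    # development filter placed before an expensive comparative trial.
    early_cost = 2.0        # cost of a middle development study (arbitrary units)
    late_cost = 40.0        # cost of a comparative trial
    p_stop_early = 0.6      # chance the middle step halts a candidate

    with_filter = early_cost + (1 - p_stop_early) * late_cost    # 2 + 0.4 * 40 = 18
    without_filter = late_cost                                   # every candidate goes late
    print(with_filter, without_filter)

Whether such a filter pays off depends on how often it stops candidates that would otherwise fail late, which is exactly the kind of pipeline property examined quantitatively in Section 10.3.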
10.2.3 The Pipeline Impacts Study Design
Pipeline structure facilitates, and is a consequence of, reductionism, which focuses scientific questions. Focus encourages individual studies to be optimized for the questions they address. Hence, optimal study design is a function of both the scientific question and the overall developmental paradigm. But the scientific questions are derived from specific demands of the therapy, whereas the developmental paradigm embodies general principles. This difference can create discord, and assures that trial methodologists will always have a job. A series of optimal study designs does not necessarily yield an optimal pipeline.

An illustration of this dilemma is the decision whether to conduct an underpowered comparative trial in the window for middle development. When viewed as a middle development trial, such a trial may be seen as having optimal statistical features. Can this design improve the pipeline? Or are the required pipeline properties dictating resource conservation in middle development? We can’t answer these questions unless we reason quantitatively about the paradigm and its principles, as well as exceptions that might be appropriate for a particular drug or therapy.

Development of some treatments might require that questions of efficacy be addressed early, or that we immediately jump into comparative trials. For example, we may have considerable experience to suggest that our putative treatment is safe, and the question being addressed is quite serious. The use of vitamins or trace elements for reducing risk of degenerative neurological disease might be this sort of question. Perhaps no good therapy exists, and the cost of a possible new one is low. The pipeline to develop a low-cost proven-safe treatment will induce quite different study designs than those used when treatments carry substantial risk, or when their success or failure is less consequential.
10.2.4 Specificity and Pressures Shape the Pipeline
The discussion to this point should demonstrate that the developmental model, as well as individual trial designs, depends on the kind of questions asked and is therefore likely to be disease- or context-specific. The old cancer cytotoxic development structure is not automatically appropriate for therapeutic development in other diseases. In fact it is not even well suited to newer therapeutics in the same field. Efficient development demands specificity in pipeline properties. Suppose for one disease there are reasonable but risky therapeutic alternatives, and the prior chance that
new treatments are true positives is low. The pipeline for this circumstance should be skeptical, meaning that early exit points, small type I errors, and cautious steps are appropriate. Suppose for a second disease there are no active treatments and a promising and demonstrably safe possibility emerges. An optimistic pipeline that moves quickly to comparative trials with more emphasis on the type II error and less on the type I error would likely serve development better. This might characterize the context of MSA mentioned briefly in Section 8.2.4. We might also consider the economic resources needed to evaluate each treatment and see that the resulting pipelines do not need to resemble each other.

Overburdened pipelines require different study designs than well-resourced ones. As indicated above, a simple example can be seen in the role for middle development, which might reasonably be skipped in a well-resourced setting with few available therapeutic alternatives. In a pressured pipeline with some existing safe and effective therapies, we would not skip middle development, but instead make it a tight filter to decrease false positives. This more skeptical pipeline is typical of cancer therapeutics and might remain so even in the new age of targeted treatments.

Many other practical issues might influence our development paradigm design. These include the best time to address optimal dosing, encountering therapies serially (perhaps at intervals) versus multiple ones simultaneously, and the consequences of type I and II errors. Ideally, we might examine the potential resource consumption for different development strategies and choose an efficient one.
10.2.5 Problems with Trials
It is fashionable to emphasize deficiencies in our clinical trial apparatus (to the extent that it can be characterized as such), and to propose various rescues. This was done earnestly early in the AIDS epidemic when there were few and inadequate therapeutic alternatives for a fatal disease. Many people had the impression that rigorous trials were part of a problem of conservativeness, and various deconstructions of developmental trials for AIDS therapy were undertaken. One example was a “parallel track” that allowed investigational drugs to be used at least partly outside the formal clinical trial setting. This was a drastic alteration of the development pipeline that diminished or removed controls. While the pipeline for anti-HIV drugs as it existed at the time may well have been unsuitable for the task, informal alterations were not necessarily better. The best progress in developing antiretroviral therapy was made only after these poor ideas were replaced by good study designs again.

More recently in the United States, the large and highly productive cooperative cancer clinical trials apparatus sponsored by NCI has undergone reforms and consolidation brought about by cogent criticisms and obvious inefficiencies. Much of this was triggered by an Institute of Medicine report in 2010 [426]. Unfortunately, these changes evolved in an era of contracting research spending and reduced volunteer participation in cancer clinical trials, so the utility of reforms may not be easily visible. The changes have been administrative and managerial, that is to say not focused on individual study design. More importantly, the pipelines themselves were not studied, characterized, or redesigned to satisfy any overall optimality criteria. It is not clear what to expect in the future as a result.
Although it is difficult to generalize, pipeline properties historically in cancer have been predicated on somewhat optimistic assumptions—despite a high proportion of new therapeutic entities being eliminated by inadequate performance in early and middle development, comparative trials also had a high failure rate. This suggests that there may have been too much leniency in the early pipeline, and much blame for this has been assigned to middle development, or “phase II,” in particular. Cooperative oncology group reforms mentioned above do not appear to have redesigned the pipeline to, say, a more skeptical one, which might have been appropriate if decent therapies are now widespread and we do not want to risk replacing them with false positives. On the other hand, one might argue for a more optimistic pipeline now, because the new generation of treatments seem to be rationally targeted to specific abnormalities, less toxic, more effective in appropriate disease subsets, and easier to administer. What should our pipeline look like for the new era of therapeutics, and should it in fact be unique to each agent?

Our culture expresses its values with regard to clinical trials at two levels. One is at the level of research ethics and values broadly. The second level is the application of those principles to individual studies. Between these levels is the development pipeline, which never seems to be discussed, engaged, or designed as a real entity. Its properties are left to the passive consequences of other forces. The modern forces affecting clinical trials can be severe, and include the following:

1. A public that sends mixed messages about the value of clinical trials.
2. Regulation based on snapshots of crises, technology, and concerns from bygone years.
3. Unrealistic demands on individual trials for flexibility, breadth, and quantity of evidence.
4. Creeping regulatory policies without national consensus.
5. Hypertrophic respect for privacy and individual autonomy that discourages participation.
6. Political correctness intruding on science.
7. Low trial participation rates.
8. Conservative protectionism by IRBs and other overseers.
9. High costs and resistance by insurers to cover standard care in the trial context.
10. Inadequate funding levels for research by NIH and similar sponsors.
11. New scientific pressures driven by discoveries in genomic science.
12. Outdated training for clinical investigators.
13. Marketing-driven rather than science-driven pharmaceutical priorities.
14. Pressures to replace therapeutic trials by less reliable research designs.

It is highly unpredictable what properties will emerge passively from development paradigms that evolve in this sort of setting. None of the changes in the regulation or oversight of clinical trials in the last 35 years actually makes it easier to perform these studies. What is remarkable and foolish is that the common pathway for many concerns about these studies finds its way back to clinical trial design, which is supposed to fix or compensate for problems via innovation, efficiency, flexibility, adaptation, and other clever modifications. Who actually believes that individual study design can compensate for structural problems like those above restricting our ability to do clinical trials? It would be much more sensible to assemble research designs into a pipeline with desirable properties.
10.2.6 Problems in the Pipeline
Specific development paradigms are not without problems and criticisms. A common focus for concerns is the FDA regulatory process, which is one of the few places that decision rules based on evidence surface for public view and discussion. The FDA has two choices—drugs can be approved either too fast or too slow. In reality, the regulatory process in the United States has tended to compensate appropriately for important factors like availability of therapeutic alternatives and seriousness of the disease, measured against benefits and risks of the new treatment being evaluated. However, no regulatory process can conform perfectly to all the nuances of therapeutic development. All observers will bring their own priorities and biases to such decisions.

A universal problem is that developmental pipelines are not easily visible. Individual study protocols are registered publicly, but important aspects of decision making in them may not be open like the final regulatory steps. Examples are monitoring charters and decision processes for other design adaptations. I am not arguing here that such deliberations should be open, simply that their confidentiality prevents a complete characterization of the contribution of a study design to the developmental paradigm. The second element of the paradigm—the use of trial evidence to make decisions—is similarly diffuse or invisible. Also, such decisions may not always be purely rational.

The result of all this is that it is not simple to design or characterize a developmental pipeline. What actually happens is that the operational characteristics of the pipeline emerge from the common clinical trial designs employed and the way studies are implemented and interpreted. The reverse should be the case, but tools to help us proceed in the correct direction are lacking. In the next section, I discuss one alternate approach.
10.3 A SIMPLE QUANTITATIVE PIPELINE
The simplest pipeline consists of a few serial steps. The steps depend on one another scientifically but are statistically independent. A stream of therapeutic ideas enters the first step, and results from each step feed into the subsequent one. Each step could represent more than a single clinical trial. As will be seen below, only the type I and II errors for each step are important for this discussion—specific design elements such as adaptive features or number of treatment groups are irrelevant. Some quantitative properties, or operating characteristics (OC), of this simple scheme can be determined analytically.
10.3.1 Pipeline Operating Characteristics Can Be Derived
Assume that the frequency of true positive agents entering the pipeline is 𝑓 ≪ 1, meaning that only a fraction of them are actually worthwhile. Each developmental step, whether it represents a single trial or multiple trials, has an aggregate type I and type II error, 𝛼𝑖
and 𝛽𝑖, respectively. Pipeline step number 𝑖 allows true positives to pass with probability 𝜃𝑖 = 1 − 𝛽𝑖, and allows true nulls to pass with probability 𝛼𝑖. As will be done for middle developmental trials (Section 13.4), Bayes rule [123, 124] can be used to calculate the frequency of true positives from any pipeline step. Let 𝐴 represent the event that the treatment under study is a true advance, and 𝐴̃ the event that the treatment is not a true advance, Pr[𝐴̃] = 1 − Pr[𝐴]. Then,

\Pr[A \mid S_i^+] = \frac{\Pr[S_i^+ \mid A]\,\Pr[A]}{\Pr[S_i^+ \mid A]\,\Pr[A] + \Pr[S_i^+ \mid \tilde{A}]\,(1 - \Pr[A])},

where 𝑆𝑖+ denotes the event of a positive result for step 𝑖 of the pipeline. Using the error rates above,

\Pr[A \mid S_i^+] = \frac{\theta_i \Pr[A]}{\theta_i \Pr[A] + \alpha_i (1 - \Pr[A])}.    (10.1)
We would not get to step 𝑖 unless there are positive results from step 𝑖 − 1. So Pr[𝐴] represents the "prior" probability of success for step 𝑖, but is the "posterior" probability of success, or output, from step 𝑖 − 1. Substituting serially for Pr[𝐴] in equation (10.1) yields

\Pr[A \mid S_3^+] = \frac{1}{1 + \frac{\alpha_3 \alpha_2 \alpha_1}{\theta_3 \theta_2 \theta_1}\left(\frac{1-f}{f}\right)},    (10.2)
for a pipeline with three serial steps. Thus, the odds of a true positive result at the end of development is

\frac{p_m}{1-p_m} = \frac{f}{1-f}\prod_{i=1}^{m}\frac{\theta_i}{\alpha_i} = \frac{f}{1-f}\prod_{i=1}^{m}\frac{1-\beta_i}{\alpha_i},    (10.3)
where the product is taken over all 𝑚 steps. This result has a familiar structure—the posterior odds of a true positive is the prior odds of a true positive multiplied by a Bayes factor. The Bayes factor is formed from the type I and II error rates, and is essentially the odds ratio effect of the pipeline. The pipeline amplifies the prior odds according to the cumulative error properties of the steps. Equation (10.3) is an OC for the pipeline, and could have been written directly based on the following reasoning. The pipeline frequency of true positive therapies, 𝑓, is diminished with each trial according to the power. So over a series of 𝑚 steps, the true positive frequency becomes

p_m = f (1-\beta_1)(1-\beta_2)\cdots.    (10.4)
The pipeline frequency of null treatments, 1 − 𝑓, is sustained with each trial only by the type I error rate, so over a series of trials it becomes

1 - p_m = (1-f)\,\alpha_1 \alpha_2 \cdots    (10.5)
The ratio of equations (10.4) and (10.5) is the odds of a true positive, and yields equation (10.3).
The truth of equation (10.3) does not rely on any specific types of designs within the pipeline. The individual trials can be frequentist or Bayesian, adaptive or fixed, randomized or not, and so on. As filters or selection devices, the trials have implicit or explicit error properties that contribute to the ultimate result. Thus, no matter what designs are employed, the pipeline is a Bayesian construct that might be viewed as a learning algorithm. While this is interesting and probably useful, it does not carry implications for the design of individual clinical trials, which as demonstrated can be anything appropriate to the developmental question.
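To make these operating characteristics concrete, the short sketch below computes the posterior odds and probability of a true positive from equation (10.3) for a hypothetical three-step pipeline. The prior frequency 𝑓 and the per-step error rates are illustrative values chosen for this example, not recommendations.

# Sketch: pipeline operating characteristics from equation (10.3).
# The prior frequency f and the per-step error rates are hypothetical
# values for illustration only.
def pipeline_posterior(f, alphas, betas):
    # Posterior odds and probability of a true positive after all steps:
    # prior odds multiplied by the product of (1 - beta_i) / alpha_i.
    prior_odds = f / (1.0 - f)
    bayes_factor = 1.0
    for a, b in zip(alphas, betas):
        bayes_factor *= (1.0 - b) / a
    post_odds = prior_odds * bayes_factor
    return post_odds, post_odds / (1.0 + post_odds)

# Hypothetical three-step pipeline with conventional error levels.
odds, prob = pipeline_posterior(f=0.05,
                                alphas=[0.10, 0.05, 0.025],
                                betas=[0.20, 0.20, 0.10])
print(odds, prob)   # the prior odds (about 0.053) are amplified 4608-fold

The Bayes factor here is (0.8/0.1)(0.8/0.05)(0.9/0.025) = 4608, so a weak prior frequency of 5% still yields a posterior probability of a true positive above 0.99 at the end of this particular pipeline.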
10.3.2 Implications May Be Counterintuitive
If the type I errors are very small individually or collectively, then the true positive yield approaches 100% as we might expect—eliminating all type I errors at any point in the pipeline gives only true positives at the end. Type II errors have more limited influence. Even when all type II errors are zero, the multiplier for the prior odds is \prod_i \alpha_i^{-1}, the product of the reciprocal type I error rates. The differential consequences of type I and II error control are an important point for establishing appropriate null hypotheses as discussed in Section 13.5.

The OC embodied in equation (10.3) illustrates several important facts about serial pipeline elements. First, type I error rates multiply, indicating the strengthening of the pipeline with respect to this error as serial steps are added. This is in accord with intuition. This does not mean that the overall pipeline type I error rate goes down, only that the odds of a true positive increases at the end. Second, power probabilities also multiply, which represents a degradation of the pipeline with regard to additional serial elements. This is probably not intuitive. The power probabilities are less than one, so their joint product decreases, and the multiplier for the prior odds decreases with it. This diminishes the true positive odds when all other performance criteria are constant. Of course, high power is good in individual steps, but additional steps risk reducing overall power. Third, the prior probability of a true positive finding, 𝑓, or the corresponding odds, is the most influential single factor regarding the performance of the pipeline. This is intuitive, as it is with typical applications of Bayes' rule. When the prior odds of success are low, the true positive rate from development is strongly reduced.
10.3.3 Optimization Yields Insights
Based on equation (10.3), we can always increase the final odds of a true positive by adding any step with 1 − 𝛽 > 𝛼 to the pipeline, which is not a very demanding criterion. But additional steps increase errors and consume sample size and other resources. Most simply, we could make 𝛼 and 𝛽 very small in a single step. As mentioned above, the pipeline will yield 100% true positives if any step has 𝛼 = 0. The type I error is always under the control of the experimenter (even after the data are in hand) as a consequence of the critical value used for hypothesis tests. But eliminating false positives always costs some true positives, which is why type I errors are not routinely driven to zero. Hence, the odds of a true positive result at the end of a developmental pipeline is not the sole performance criterion. Our pipeline should control type I and II errors and sample size (or another appropriate measure of cost), in addition to the true positive rate.
For multiple steps, the overall chance of a type I error is

\alpha^* = 1 - \prod_i (1 - \alpha_i),    (10.6)

and similarly for a type II error, so the overall power is

1 - \beta^* = \prod_i (1 - \beta_i).    (10.7)

The overall sample size is also a consequence of 𝛼 and 𝛽, and would satisfy a relationship like

N^* = \sum_i N_i \propto \sum_i \left(Z_{\alpha_i} + Z_{\beta_i}\right)^2,    (10.8)
if we imagine a series of simple trials with standard sample size determinations and a common treatment effect. The 𝑍-scores are taken to be positive by definition. The common treatment effect does not need to appear in equation (10.8).

Although simplistic, these ideas provide a conceptual path to optimization. For example, a pipeline might be optimized by simultaneously maximizing the odds for a true positive while minimizing sample size over a given number of steps. One way to do this is to numerically maximize a ratio of the odds multiplier from equation (10.3) divided by the total sample size from equation (10.8). To place the odds multiplier, which is a product, on approximately the same numerical scale as the sample size term, the logarithm of the product term can be used. The resulting ratio, denoted by 𝑅, is purely ad hoc, and is

R = \frac{\sum_{i=1}^{m} w_i \left( \log(1 - \beta_i) - \lambda \log(\alpha_i) \right)}{\sum_{i=1}^{m} w_i \left( Z_{\alpha_i} + Z_{\beta_i} + 1/Z_{\alpha_i} + 1/Z_{\beta_i} \right)^2},    (10.9)
where the sums are taken over all 𝑚 pipeline steps. Equation (10.9) can be viewed as a design equation for a pipeline.

Practical Nuances
The denominator in equation (10.9) differs from equation (10.8) for the following reason. 𝑍-scores near zero for type I and II error rates would inflate 𝑅 arbitrarily, corrupting maximization. To discourage 𝑍-scores that yield a zero denominator, reciprocal penalty terms are included in the denominator. When type I and II error rates are small, 𝑍-scores are large and the penalty terms are small. The 𝜆 factor in the numerator is a discretionary parameter that is similarly intended to prevent any 𝛼𝑖 from being pushed relentlessly to zero, which would also inflate 𝑅. By amplifying the alpha term (we would choose 𝜆 > 0), very small values become unnecessary and the maximization is stabilized.

Pipeline steps are distinguished from one another through the weights, 𝑤𝑖, also discretionary parameters. Otherwise, we could expect only a single value for all 𝛼𝑖 and a single value for all 𝛽𝑖 at the maximum. In other words, multiple steps are equivalent to a single step in our developmental pipeline with respect to the final odds of a true positive result, unless we distinguish steps on the basis of cost or resource utilization. Thus, 𝑤𝑖
are utilities that the investigator must supply and upon which the optimization will partly depend. In this hypothetical optimization, there is one utility parameter for each pipeline stage plus the factor 𝜆, giving 𝑘 + 1 subjective parameters for a 𝑘-step pipeline. Why not simply view the 𝛼's and 𝛽's as discretionary, which is the conventional behavior, and have 2𝑘 parameters for a 𝑘-step process? That might seem more familiar and less complicated. An answer is that the weights 𝑤𝑖 can likely be connected to real considerations such as cost or other resources. Then they would be objective and provide a way to quantify additional relevant effects on development, enhancing optimization. Moreover, the sketch here is not the sole path to optimizing this complex process. Other design equations or methods might be employed using fewer or more objective parameters. It is fruitless to try to remove expert judgment from the management of a complex process, but worthwhile to focus it in natural and easily understood places. One alternative is suggested after the following example.

Example Optimization
Some numerically maximized values of 𝑅 with 𝜆 = 1 are shown in Table 10.2. The weights assigned in each example reflect greater influence for late stages, but are otherwise arbitrary. The error control determined by the optimization is strict compared to conventional choices. The algorithm for Table 10.2 used a stochastic method, simulated annealing, so the resulting values could be local optima only. The intent of these examples is not to motivate particular choices for 𝛼 and 𝛽 in development stages, but to indicate the accessibility of quantitative pipeline properties to analysis and optimization.

Another possibility for optimization is to recognize that there are 2𝑘 parameters (𝛼𝑖's and 𝛽𝑖's) for a pipeline of 𝑘 stages. Therefore, supplying 2𝑘 constraints in the form of equations that must be satisfied could yield an exact solution for all the 𝛼𝑖's and 𝛽𝑖's. Constraints or equations to satisfy might include specification of the overall type I error intended (equation 10.6), overall power (equation 10.7), and total sample size (equation 10.8). Additionally, we might specify the Bayes factor or pipeline odds ratio
TABLE 10.2 Optimized Error Rates for Hypothetical 2-, 3-, and 4-Step Pipelines

Pipeline   Step   α        1 − β   w_i    log ∏ (1−β_i)/α_i    Σ (Z_{α_i} + Z_{β_i})²
1          1      0.003    0.95    1
           2      0.002    0.89    5
           All    0.005    0.84           11.9                  36.0
2          1      0.03     0.87    1
           2      0.001    0.79    2
           3      0.05     0.77    10
           All    0.08     0.53           12.3                  28.9
3          1      0.02     0.85    1
           2      0.06     0.87    2
           3      0.01     0.94    5
           4      0.0002   0.78    10
           All    0.10     0.55           18.6                  49.0
from equation (10.3). Those four constraints could uniquely specify a pipeline of two stages. It may be unlikely that we would know pipeline parameters rather than type I and II error probabilities for the component trials. The point is that we could actively design the pipeline.
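A minimal sketch of this kind of numerical optimization is shown below, assuming the ad hoc criterion of equation (10.9). It uses a crude random search rather than simulated annealing, and the weights, 𝜆, and search ranges are hypothetical choices, so the resulting error rates are illustrative only and will not reproduce Table 10.2.

# Sketch: numerically maximize the ad hoc ratio R of equation (10.9)
# for a k-step pipeline. Weights w and lambda are hypothetical; a crude
# random search stands in for simulated annealing.
import math, random
from statistics import NormalDist

def R(alphas, betas, w, lam=1.0):
    z = NormalDist().inv_cdf            # standard normal quantile function
    num = sum(wi * (math.log(1 - b) - lam * math.log(a))
              for a, b, wi in zip(alphas, betas, w))
    den = sum(wi * (z(1 - a) + z(1 - b) + 1 / z(1 - a) + 1 / z(1 - b)) ** 2
              for a, b, wi in zip(alphas, betas, w))
    return num / den

def optimize(w, lam=1.0, iters=20000, seed=1):
    rng = random.Random(seed)
    best = None
    for _ in range(iters):
        a = [rng.uniform(1e-4, 0.2) for _ in w]    # candidate alpha_i
        b = [rng.uniform(0.05, 0.4) for _ in w]    # candidate beta_i
        r = R(a, b, w, lam)
        if best is None or r > best[0]:
            best = (r, a, b)
    return best

print(optimize(w=[1, 5]))   # two-step pipeline with more weight on the late stage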
10.3.4 Overall Implications for the Pipeline
This simple derivation shows some of the value in thinking quantitatively about the overall developmental paradigm. One implication is that we might wish to control the odds of true positives that emerge from our pipeline. Restricting our design focus to individual trials within the pipeline is not guaranteed to render the properties that are needed for a particular therapeutic or disciplinary area. Second, developmental steps are relevant to resource utilization but not directly to errors from the pipeline. It is sensible to break the pipeline into steps to make reliable decisions at appropriate points without consuming too many resources, or for ethics reasons. It is not essential or wise to add steps or increase the resource utilization of individual steps, especially without assessing the impact on the overall properties of the pipeline.

Following the same reasoning, one has to look skeptically at the use of randomization in middle development (Section 13.7). To make these trials small, the error controls are relaxed. Low power increases false positives [205]. The resulting (1 − 𝛽)∕𝛼 multiplier is weak and may not be worth the resource expenditure. In principle, the larger sample size could have a beneficial effect on overall pipeline type II errors, but this is not assured using typical designs. An overall view of the pipeline suggests that individual study design and analysis is not as powerful as a few well-crafted steps. We have to examine carefully the frequent calls for "new study designs" in the light of this larger view. New designs may not be helpful unless they are responsive to new questions. Whether the questions are new or standard, adjusting the properties of the pipeline is the most powerful tool for development that we have. All these points speak to the usefulness of modeling error properties and optimizing the pipeline.

Another implication relates to the appropriate choice of the null hypothesis and the concept of futility. This is discussed in Section 13.5, where I point out that the most weighty biological question is generally taken as the null hypothesis. This principle can be violated in the futility design. Equation (10.9) emphasizes that the role reversal of 𝛼 and 𝛽 in the futility design has a different impact on pipeline properties. The pipeline always improves with respect to 𝛼 and degrades with respect to 𝛽 when steps are added. Reversing error roles may not yield the intended properties in all circumstances. Whatever view we take of these concerns, it seems reasonable to incorporate this thinking in rational design or analysis of any developmental paradigm.

Currently in the oncology community, there is wide discussion of the large cooperative group mechanism for therapy evaluation and development, now called the NCI Clinical Trials Network (NCTN), resulting from criticisms about cost, resources, and administration. There are also parallel calls for clinical trials redesign, adaptive design, learning systems, and other modern concepts from many stakeholders anticipating new questions for targeted therapeutics. And there have been substantial alterations to the management and preferred clinical trial designs from sponsors like the National Cancer Institute. But as yet, there has not been discussion of any desirable developmental paradigm or actively crafting
its inferential properties for therapeutic classes. We seem destined to allow the overall behavior of our pipeline to be a passive consequence after adjusting other aspects of the process.
10.4 LATE FAILURES
The theoretical considerations in this chapter can shed light on a serious unsolved problem of development, namely, failures at the end of the pipeline. By that I mean finding near the end of a development process that the treatment of interest is worse than, or no better than, standard. This can of course happen by chance, but it seems to occur much more frequently than that, especially given the good biology behind many current therapeutic ideas, promising preclinical studies, and the results of early development trials. Late null findings represent enormous lost time and resources, and we should do all that is reasonable to prevent this waste.

For the purposes of this discussion, the pipeline could be literally the classic phases of clinical development. Or we might imagine the pipeline as preclinical only, with the final step being the whole of clinical development. In both cases there is a stream of potential therapeutics filtered by an evaluative process. At the final step of evaluation, we find more null results than expected. Assuming reasonable quantitative properties for the evaluations, it must be that the input stream contains a lower than expected true positive rate.

Clinical investigators have been looking intently at middle development for an explanation for this phenomenon. Somewhat relaxed error properties, the small sample sizes that classic externally controlled single cohort trials permit, and potential selection biases suggest that middle development may not have always reliably filtered out ineffective therapies. The truth becomes evident years and many millions of dollars later. One proposed fix-up for this problem in oncology has been to double the size and cost of middle development trials by incorporating a randomized control group. It is not obvious that this is the correct articulation of the problem or a solution for late failures.

In Section 13.3.2 I indicated that the strongest effect on the frequency of true positives from a pipeline is the prior probability for them, 𝑓, which we often assume is low. If it is low, most of the apparent positive results during the early steps of the pipeline will be false positives. This means that the input into late development is not as enriched with true positives as we would like, setting up late failures. Another consequence of a low true positive prior probability, which is as yet unverified, is that a substantial fraction of the positive findings from the pipeline will be false positives. But our biological, laboratory, and preclinical evidence in this modern era seems strong. Shouldn't the prior probability, 𝑓, be high, not low? Any proof for this is at the end of the pipeline, not the beginning. There does seem to be a problem with reproducibility of findings from the laboratory [963, 1020], even with drug targets [1236]. This may point to a weakness in the quality of preclinical evidence and an overly optimistic view of the true positives input into our development pipelines. We might do better to ask why assumptions about 𝑓 can be incorrect, rather than ask the pipeline to be unrealistically reliable.

Generally, it is not difficult to see how we might be taking ineffective therapeutic ideas out of the laboratory and into development. Some factors likely contributing to the initiation and propagation of unpromising therapeutic ideas are described in the following sections, including biases, statistical principles, and the mindset of clinical investigation. Any single factor may not be likely
to sustain something questionable. But collectively these factors make it easier for truly unpromising ideas to gain developmental traction.

10.4.1 Generic Mistakes in Evaluating Evidence
Clinical investigators are prone to the same flaws in thinking as less technically knowledgeable people. Adapting Kida's discussion of basic errors in how we think [845] to the problems of a development pipeline, we can see how the following defects allow misdirection and error in a complex process:

1. Case examples versus statistically validated observations: The fundamental unit of clinical reasoning is the case history. Favorable anecdotes are especially supportive in the murky early steps of therapeutic assessment. A similar phenomenon may be at work in basic science studies, where a few positive findings may be more akin to an anecdote than strong numerical evidence.
2. Instinct to confirm held ideas rather than question them: The classic dilemma of confirmation bias.
3. Failure to appreciate the strong role of chance: Random variability can only be controlled through replication. But replication is precisely what is lacking early in development, leaving chance to play a strong role.
4. Mismeasure and misperception: New measurements lacking adequate validation may be central to early development. Lack of experience with new assessments can propagate errors.
5. Inadequate institutionalized memory: Publication bias is a good illustration of this problem.
6. Simplistic thinking about complex processes: Simple models are essential tools of scientific reductionism. But they may encourage an unrealistically simplistic view of disease pathways, validity of biomarkers, or the adequacy of targeted treatments.

These and other predilections for error will be discussed in more detail below.

10.4.2 "Safety" Begets Efficacy Testing
The way investigators talk about ideas reflects the way they think about them. It is common to imply that safety is verified when no adverse events are observed in small early development cohorts. The serious defect with this inference was discussed in Section 5.3.10. If a small cohort experience supports safety, a major barrier to efficacy testing is removed, and momentum for efficacy is created. This process can be hard to stop for reasons listed below. The steps that sustain momentum are (i) a small cohort shows no adverse events, (ii) safety is inferred, (iii) efficacy testing can and should proceed, (iv) efficacy testing supports investment and belief in efficacy, (v) the process can be stopped only by strong evidence, which does not arrive until late development. The typical dose-finding trial is unlikely to halt therapeutic development for lack of efficacy. It might do so for a worrisome safety signal, but more likely the dose employed will be titrated downward to alleviate safety concerns, and therapeutic testing will move forward. The selectivity of the pipeline for efficacy would not be much improved—a weak prior probability of success will then extend directly into middle development. We
will see in Section 13.4 that a middle development trial does not have to control error rates very strongly to raise the posterior probability of a true positive result, provided that the prior probability is on the order of 0.1 or higher. While we can't be certain as to that prior probability, suppose that 70% of late development trials are null and, conservatively, that all positive findings are true positives; the odds of a true positive entering late development is then about 3∕7 ≈ 0.43. Also assume that the pipeline prior to that point was moderately good at filtering false positives, specifically with two development stages each providing a 16-fold amplification of the prior odds. This would represent a 256-fold multiplier overall, and might be provided by two stages, each with 𝛽 = 0.2 and 𝛼 = 0.05. Back-calculating, this would imply that the prior odds of success was only about 1 in 600 (0.43∕256). If the first two developmental stages were weaker, say each with 𝛽 = 0.3 and 𝛼 = 0.15, the implied prior odds of success would have been roughly 1 in 50. So it seems that a high fraction of failures in late development coupled with reasonable pipeline properties up to that point suggests, contrary to intuition, that the prior odds of success is indeed very low.
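The back-calculation above can be verified with a few lines of arithmetic. The sketch below assumes the figures quoted in the text (a 3∕7 odds of success entering late development and two earlier stages with the stated error rates); it simply inverts equation (10.3).

# Sketch: back-calculate the implied prior odds of success from a high
# late-failure rate, inverting equation (10.3). Inputs are the
# illustrative figures quoted in the text.
def implied_prior_odds(posterior_odds, stage_errors):
    # stage_errors is a list of (alpha, beta) pairs for the earlier stages.
    multiplier = 1.0
    for alpha, beta in stage_errors:
        multiplier *= (1.0 - beta) / alpha
    return posterior_odds / multiplier

posterior = 3.0 / 7.0                      # odds of success entering late development
strong = [(0.05, 0.2), (0.05, 0.2)]        # two stages, 16-fold amplification each
weak   = [(0.15, 0.3), (0.15, 0.3)]        # two weaker stages

print(1.0 / implied_prior_odds(posterior, strong))   # about 1 in 600
print(1.0 / implied_prior_odds(posterior, weak))     # about 1 in 50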
10.4.3 Pressure to Advance Ideas Is Unprecedented
There is probably no other time in history when the pressure to advance therapeutic ideas has been so great. This seems to be true in both academic and commercial entities. Cultural values in many societies support this through belief in and acceptance of therapeutics, and through government mandates and funding. Corporations also see great value in therapeutics because they yield profits, resources, reputation, and contributions to human health. Individual incentives range from the desire to improve humanity, to fame, fortune, career advancement, and admiration.

Have we oversupplied therapeutic ideas? This could be happening if we have underestimated the complexity or resilience of biological systems, or overestimated the capabilities of therapeutics. A consequence would be many true nulls and ultimate failures and false positives, not unlike the heuristic argument above. The expansion of biomedical research in the last 25–50 years has created a critical mass of ideas derived appropriately from the growing body of scientific knowledge. Our therapeutic ideas therefore seem more promising than in the past. But we have no way to assess the scale of this enterprise we have built compared to the frequency of true therapeutic solutions. We could be over-driving the system, metaphorically speaking.

In these times of restricted peer-reviewed and institutional research funding, many investigators look to foundation, philanthropic, biotech, pharmaceutical, device, and venture capital entities to support their research. In many cases, these sources pay well and contribute to the pressure to move quickly and less critically through the research process. The value of a successful therapeutic can depend strongly on calendar time, as can the ability to attract risk-accepting support. The economics of clinical research infrastructure support also favors sustained testing more than a halt in early development.
10.4.4 Scientists Believe Weird Things
It is easy for humans to believe weird things, and scientists are no different. How beliefs are formed and reinforced is the subject of excellent books by Shermer [1379, 1380]. Many ideas of science began as heresy, were strange, fantastic, unconventional, contrarian, or remain so in their maturity. Openness and creativity in science may actually
accentuate this innate trait, and some ideas in science are contrary to intuition. Scientists themselves are often outré. The modern mind can look at certain old beliefs and easily reclassify them as myths, superstitions, mistakes, and the like. It's not so easy to dislodge other ideas from popular thinking, even with negative evidence. Examples include Bigfoot, alien visits, the Loch Ness monster, demonic possession, astrology, clairvoyance, various urban legends, and so on. Belief can be stronger than factual evidence to the contrary. In the United States, evolution is sometimes denied or ridiculed as a "theory" by the ignorant, despite it being proved by comparative anatomy, embryology, paleontology, and genetics, not to mention its defense by some of the best minds in the history of science.

A few ideas are made to seem plausible or are defended by some scientists despite essentially a total lack of objective evidence. Examples of these are alien abduction with the work of John E. Mack [969], the morphic resonance of Rupert Sheldrake [1374], F. J. Tipler's omega point, and therapeutic touch discussed in Section 4.5.3. Some case studies of this phenomenon are discussed by Park [1175]. Many scientists cultivate supernatural beliefs alongside their scientific ones, such as deity, angels, and the like, and see no inherent conflict, or actually profess added value in doing so. Maybe it is good that the human mind is so flexible. Value judgments aside, we must accept that in the mind of the scientist, incoherent ideas coexist with coherent ones.

Which of someone's "scientific" ideas are actually scientifically incoherent? Such questions have answers if we accept that the scientific method can eventually arrive at truths of nature. In the short term, the real issue is how well a particular idea is defended. Smart people believe strange ideas as readily as anyone else does. Smart people are better at defending their weird ideas than others, and such ideas can be made to persist. It has to be that some of the constructs of modern therapeutic science are well defended but bogus. Such ideas will not seem strange to the larger community if they are convincingly defended.

Weird beliefs do not have to exist on a grand scale to stimulate therapeutic ideas. There is no format for widespread agreement on the potential of a given therapeutic approach. To the contrary, it is natural for individuals or corporations to protect promising ideas until disclosure is unavoidable. Relatively few individuals may be required to sustain this type of support.
10.4.5 Confirmation Bias
Confirmation bias is the tendency to look for, notice, or remember things that confirm one's beliefs and to ignore or undervalue contrary findings. Confirmatory or supportive information always appears to be given excessive weight, perhaps because doing so requires less cognitive energy than coping with contradictory evidence [1380]. Even designed experiments can be constructed in ways that tend to confirm rather than refute beliefs. A common manifestation of this is the use of one-sided hypothesis tests "because the treatment effect can only go in one direction." While this is sometimes actually the case, a one-sided test constitutes a weaker standard than a two-sided test at the same overall significance level.

There is probably no simple way to measure confirmation bias in science. This bias is difficult to counteract because it operates at an interpretive or synthetic level rather than
at a perceptual level. So while it may be difficult to point to explicit examples, it must exist even in our present age of relative sophistication.

Conspiracy theories illustrate the power of confirmation bias. These are too numerous to count in modern politics and folklore. There are functional equivalents of conspiracies in science, considering the presentation and support of certain ideas, the potential to misuse or selectively use evidence, use of references to reinforce support and validity, and refusal to consider contrary ideas. A famous historical circumstance with these characteristics was the geocentric model of the solar system and the damage done to Galileo when he challenged it. Ideas do not survive within science if they have traits of conspiracy. More recent examples might be phrenology, eugenics, ESP, and homeopathy. A conspiracy-like idea cannot enter science from external origins, as in the case of creation science and intelligent design.

10.4.6 Many Biological Endpoints Are Neither Predictive nor Prognostic
One useful kind of measurement accurately reflects the current status of disease, such as a diagnostic test. A more useful measurement also contains information regarding prognosis or the future state of disease. Among many others, functional classifications are examples of measurements that often have prognostic ability for disease progression or survival. A measurement that reliably indicates the utility of a treatment is said to be predictive. A gene- or pathway-based marker might predict whether or not a treatment targeted to that pathway can possibly yield the intended action. In some cases such a marker is literally the target, as in the human estrogen receptor in breast cancer. Validation of measurements for each of these roles may require extensive studies.

We can measure many more biological phenomena than are useful in the vital ways just outlined. Although focused, promising, unique, or seemingly relevant measurements underlie many new therapeutic approaches, validity for classification, prediction, or prognosis is never guaranteed. Measurements without established validity, even though accurate, will almost certainly contribute to eventual developmental failures.

10.4.7 Disbelief Is Easier to Suspend Than Belief
Most of us, even scientists, enjoy suspending disbelief. The idea was described in 1817 by Samuel Taylor Coleridge in relation to the interpretation of poetry [283]. It represents a form of relaxation, entertainment, escape, or diversion that is required in nearly all the arts, and occasionally in science. As such it is often a pleasure, even if the literal interpretation of the topic is an uncomfortable one like a horror movie. When it is aligned with other tendencies, suspended disbelief can support widespread and enduring cultural icons like superheroes. Given the myths and fables in all cultures, it may be that suspended disbelief is a human need.

The asymmetry between disbelief and belief is remarkable. Bertrand Russell said:
Suspended disbelief can offer an interesting alternative short or long term, and may be impersonal. In contrast, suspending belief seems difficult for us, and may not be accompanied by an attractive replacement. We may be required to suspend a very personal
belief. My view is that suspended belief is required by science when we pose questions, evaluate data, and interpret findings. The skill to do so is not taught didactically, takes years to master, and is easily lost in circumstance.

Development can offer clues that should cause us to suspend our belief in the current hot therapeutic idea. One such clue might be the complete absence of efficacy in safety or dosing studies. We expect strong efficacy effects to be visible even in small cohorts. Investigators often expect side effects to be visible in small studies—hence their absence indicates "safety"—so why not efficacy? However, we are generally interested even in small beneficial effects (as well as small risks), so the absence of efficacy in early development cannot be taken as definitive for a loser. It can however serve to weaken our belief.

Another suspension of belief should occur with the first failure. It's not uncommon to blame early unexpected failure on methodology—we measured the wrong outcome, accrued the wrong population, made the trial too small, didn't account for a critical variable, and so on ad infinitum. Investigators can be so convinced of the irrelevance of early failure that they mentally delete the experience. Perhaps early failure predicts late failure when we are unable to suspend belief. In general we should maintain more confidence in good methods of evaluation than in the idea behind the therapy.
10.4.8 Publication Bias
Publication bias is the tendency for positive or statistically significant results to be accepted into the scientific literature, whereas null results are less likely to appear. This effect is well known to clinical researchers, is discussed in Section 25.2.4, and needs little more emphasis here. Most experiments are designed so that a positive finding supports the promise, utility, or potential efficacy of the intervention. Hence, a biased sample of positive findings, for example, from preclinical studies, helps to support ongoing investigation and can contribute to inactive treatments being sustained in the developmental pipeline.

Aside from publication bias affecting the content of the literature, there is also bias in actual publications. Scientists are affected by the interpretations put forward by study authors, even when those interpretations are overly ambitious with respect to the data. In other words, scientists are affected by the spin of the publication [174].
10.4.9 Intellectual Conflicts of Interest
Conflict of interest (COI) is almost always discussed only in terms of money. In fact, scientists broadly speaking seem to have more vested interest in their intellectual positions on key issues than in financial matters. In most cases, there is no money at stake, but there are real reputational, credit, academic, and other important nonfinancial issues that drive self-interest. These intellectual preferences show up in citations, review of journal articles, review of grant applications, idea advocacy, editorial debate, and many other places where money is not directly an issue. My personal belief is that intellectual conflict of interest influences scientific behavior more than financial COI.

Ideally, we would remove as much financial COI as feasible from influencing therapeutic development. This ideal is fundamentally flawed because the dominant therapeutics development apparatus in the world is commercial, and therefore inherently
conflicted by money. Financial incentive is vital, however, to offset the monetary risks. Money can and does support commercial interest in some therapeutic settings and concepts. The ability of money to create a false positive result from which profits might be gained is probably low, especially considering the strong regulatory overlay for these activities.

In contrast, there does not seem to be much alertness to intellectual conflicts of interest in supporting the initiation or sustenance of therapeutic strategies. There is certainly no meritocracy of ideas in this respect because there is no valid forum for assessing relative merit. We cannot point to scientific meetings, journal publications, citations, patents, or any single venue that compensates for potential intellectual conflicts of interest on the part of decision makers. It seems unavoidable that inputs to our development pipelines must accordingly be inflated with potential false positives by intellectual conflicts of interest.
10.4.10 Many Preclinical Models Are Invalid
One of the most valuable concepts in the biological sciences, as in science broadly, is the notion of a model. The full elucidation of this idea would require a book, but a short version is that interim answers to therapeutic questions are often guided by compact, simple, wieldy, imperfect versions of reality called models. Models are useful when they accelerate asking and answering questions and return reasonably valid answers. In short, models are agents of efficiency in scientific investigation. They are ubiquitous and often based on tissue lines, animals, explants, and increasingly, on engineered versions of cells and animals that are required to address some specific phenomena.

By necessity models omit some elements of reality, just as they always include some irrelevant components. A cancer tissue cell line lacks an organ and a host. A model of cardiac muscle injury may lack chronicity. A model of diabetes may lack intrinsic and as yet unknown underlying causes. These are simplistic examples, but they illustrate a key principle that can invalidate a model. A model may be valid for some purposes or therapies but will not be for others.

All preclinical development depends on models. Signals from a full spectrum of biological questions are queried first in them, including targeting, biodistribution, absorption, metabolism, toxicity, and efficacy. In cancer, for example, it is not unusual for a promising therapeutic to cure a substantial fraction of model subjects, albeit under ideal circumstances. Clinical development specifically depends on models to a lesser degree, unless we artificially define healthy subjects for some pharmacologic trials as models, or similarly for self-selected trial participants. There can be no model for safety, but there can be for risk. Similarly, there is no universal definitive model for efficacy.

As careful as scientists are with the crafting and use of model systems, many of them are invalid with respect to human disease. I say this not because I have insight into specific deficiencies, but because it is obviously the case. Tests in a truly valid model would obviate some or all human testing. This has never happened. Some of our invalid models support the attempted development of ineffective therapies. Similarly, some halt preclinical development of truly effective therapies (false negatives), but that is not the focus of this discussion.

The failure of models generically is partially a result of multiplicity. Aside from models being imperfect, studies that employ them allow type I errors, and a series of
such studies may inflate false positive errors. Also there may be many model systems that can generate evidence regarding a therapeutic. Positive results anywhere are likely to be taken as encouraging. These characteristics are unlikely to increase the background true positive rate entering the pipeline. Here I must comment on an irony in the way we use models in therapeutic development. We use our least perfect preclinical models to support initiation and maintenance of development. Despite some potential deficiencies just outlined, this is appropriate. Later in development however, we hesitate to allow therapeutic evidence to cross sociodemographic boundaries within fellow humans. Males don’t inform us about females; ethnic group A does not inform us about ethnic group B. In other words, the utility of models seems to decline sharply just when they begin to be based on other humans.
10.4.11 Variation Despite Genomic Determinism
Genomic determinism is the idea that most biologic variability disappears when we have a full accounting of gene expression and epigenetic factors. Unfortunately this is not true [1536]. We know that some genetic mutations are singularly responsible for diseases, and that some functional targets in gene-regulated pathways can be manipulated by small molecules yielding large effects. These are indications that some deterministic genome-based effects are larger than random variation. It does not however indicate that randomness disappears with a fuller understanding of the genome. Many genomic determinants are multifactorial, yielding at least partially random outcomes.

Another mechanism for the occurrence of randomness stems from chaos—the essentially random downstream output of a complex process superimposed on minute changes in initial conditions. Gene–gene interactions, post-translation modifications, epigenetic factors, and environment guarantee small differences in initial conditions in otherwise identical cells or organisms. Random variation in measured outcomes is the consequence. But it is not only initial conditions that lead to such consequences. Cumulative small factors over time, most too small to measure or account for, can have the same effect.

Consider an experiment where genetically identical laboratory animals are entered into a lifetime survival study, perhaps incorporating a risk factor or exposure. They are all housed in the same environment and have a common diet. Why don't they all die or have events on the same day? Imagine a second experiment where a large population of identical subjects is given an exposure, and over time a small fraction develop a disease. If we could literally repeat the experiment with the same individuals under the same conditions, we would expect the same proportion to develop disease. But would the very same individuals develop disease with each repetition of the experiment? Although the proportion affected in each realization of this experiment would remain constant apart from sampling variability, we could not expect to observe the very same affected individuals every time. Minuscule changes in initial conditions and over time would produce random outcomes. One might hypothetically imagine an even finer degree of determinism to explain the outcomes. But the resolution is not just practically limited; it is a consequence of fundamental random processes at the physical level. Hence natural variation is superimposed on even genomic determinants.
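The thought experiment above can be mimicked with a small simulation. The sketch below assumes an arbitrary 10% disease probability applied to "identical" subjects; repeated realizations give nearly the same proportion affected but largely different affected individuals each time.

# Sketch: identical subjects with identical risk, yet different individuals
# are affected in each realization. The 10% risk and cohort size are
# arbitrary values chosen only for illustration.
import random

def one_realization(n_subjects, risk, rng):
    # Return the set of subject indices who develop disease in one run.
    return {i for i in range(n_subjects) if rng.random() < risk}

rng = random.Random(2017)
first = one_realization(1000, 0.10, rng)
second = one_realization(1000, 0.10, rng)

print(len(first), len(second))   # counts near 100 (10%) in each realization
print(len(first & second))       # only about 10 subjects affected both times, by chance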
10.4.12 Weak Evidence Is Likely to Mislead
Some preclinical steps and much of early development are based on weak evidence. By this I mean two things. First, whatever measure is taken as primary for guiding decisions, the relative evidence to support the estimated effect compared to reasonable alternatives will be low. Second, because the only tool for controlling variability is replication, the small sample sizes of preclinical and early developmental studies have a relatively high chance of yielding effect estimates far from the truth. In other words, weak evidence is much more likely to point in the wrong direction than is strong evidence [1294, 1298].

Investigators are not accustomed to thinking in terms of relative evidence (Section 7.5), but this principle may be familiar when expressed in terms of precision. Suppose a biological effect of interest is estimated without bias by the mean of a series of observations that are samples from a normal distribution, although the principle does not depend on an exact distributional form. Our effect estimate is the mean of a series of observations, and has standard error proportional to 1/\sqrt{n}. The relative precision of two different estimates is therefore \sqrt{n_1/n_2}, which is also the relative width of confidence intervals. This square-root relationship indicates that a developmental study will produce a confidence interval twice as wide as a definitive study with four times as many observations. Suppose we take this four-fold ratio as the operational threshold between weak and strong evidence. The wider confidence interval for weak evidence reflects how it is consistent with more incorrect estimates, and can therefore be misleading. Because a confidence interval can be made as small as desired by using large samples, stronger and stronger evidence will unfailingly indicate the correct biological effect. I trust the reader will forgive my technically incorrect interpretation of confidence intervals to illustrate this point: weak evidence is more likely to mislead, but strong evidence will indicate the truth.
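A short simulation can illustrate the square-root relationship and how weak evidence misleads. The effect size, standard deviation, and sample sizes below are arbitrary values chosen only to demonstrate the principle.

# Sketch: small studies estimate the same true effect with much wider
# spread and more often point in the wrong direction. The effect size
# and sample sizes are arbitrary illustrative values.
import random, statistics

def estimates(n, true_effect=0.25, sigma=1.0, reps=5000, seed=11):
    rng = random.Random(seed)
    return [statistics.fmean(rng.gauss(true_effect, sigma) for _ in range(n))
            for _ in range(reps)]

small, large = estimates(25), estimates(100)   # four-fold difference in study size

print(statistics.stdev(small) / statistics.stdev(large))   # close to sqrt(100/25) = 2
print(sum(e < 0 for e in small) / len(small))               # wrong-direction estimates, small study
print(sum(e < 0 for e in large) / len(large))               # far fewer with the larger study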
10.5 SUMMARY
A development paradigm must explicitly or implicitly account for two competing needs: to provide evidence necessary to accept new therapies as being safe and effective, and to stop development at the earliest feasible time for treatments that prove unsafe or ineffective. The usual ordering of therapeutic goals and general consensus on appropriate clinical questions and trial designs to address them tends to produce a typical development paradigm or pipeline in a given context. It is not clear to what extent the phases in typical development pipelines are constructed by prior design versus being consequences of staging that naturally permits flexibility, especially for early termination of unsuccessful therapies. Whatever the driving forces, staging is the principal tool for flexibility and adaptability in the development pipeline, just as it is in individual clinical trial designs.

Ideally, the overall properties of the pipeline would actively drive appropriate study design. In practice, the pipeline properties are often dictated passively by individual trial designs, each of which may be optimized without regard to the overall development goals. Quantitatively, a simple pipeline follows Bayes' rule, where the posterior odds of a true positive result is determined by the prior odds of a true positive multiplied by a factor depending entirely on the type I and II error levels for each stage. The posterior odds is
strengthened by additional steps with strong type I error control, but weakened by the type II errors of additional steps. Such properties, along with subjective utilities for sample size and type I error bounds, can be used to optimize the error rates for each pipeline step. This approach appears not to have been tried as a prior design strategy for development in any discipline.

Looking overall at a developmental paradigm suggests some insights as to why late failures occur. The strongest influence on the odds of late failure is the prior odds of a true positive. The quantitative error properties of middle development trials are not a strong factor. Although we have some reasons to believe that modern basic and translational science offers a stream of putative therapeutics with a relatively high prior probability of success, there are also quantitative, perceptual, and inferential reasons why this may not actually be true. A moderately high rate of late failures suggests that the prior probability of success is indeed quite low, essentially regardless of middle development properties.
10.6 QUESTIONS FOR DISCUSSION
1. In equation (10.9), what effect on 𝛼 and 𝛽 would you expect from different values for the utilities? 2. Describe how pipelines might differ if they focused on preventive vaccines, targeted drugs, or risky drugs.
11 TRANSLATIONAL CLINICAL TRIALS
11.1 INTRODUCTION

Only a small fraction of therapeutic ideas originating in the laboratory mature to developmental clinical trials. The transition from laboratory to clinic is guided by small targeted studies rather than large clinical trials. Translational trials are this bridge between the laboratory and clinical development. They are, therefore, among the most common types of clinical trials performed, but they are difficult and delicate to perform [926]. What and how we learn from translational trials is the focus of this chapter. Eight principles or guideposts for understanding and using them will be discussed.

It is not necessary to define or discuss translational research broadly, with its variable and inconsistent definition, to understand the relevant trial type. In contrast, bridges between laboratory and clinic seem to share well-defined settings, traits, and principles as outlined further. The term phase 0 has been occasionally but inconsistently used for some predevelopmental questions [567, 849, 872, 873, 896, 1072, 1313]. That term is not descriptive and can obscure the purpose of the trial. It also implies a developmental step that does not exist, and does not unify concepts across multiple disease areas. Similarly, many pre-developmental studies are labeled as pilot, for which there has never been a definition. Translational trials are none of these, and their methodology is still being formulated in the literature. A consistent framework is needed to describe them. Ideas presented here regarding methodology are based on my early discussion of the topic [1202], and recent attempts to teach the essence of translational trials.

Despite variations in definitions and methods, this topic attracts great interest among researchers and sponsors because nearly every new therapy depends on a successful transition from laboratory to clinic. Such trials are relevant in fields where therapeutic development is highly active, such as oncology. However, translation is equally important
in settings where new therapeutic ideas arise less frequently; good methods may be even more valuable there. Translational studies can sometimes be performed as add-ons to later developmental trials, provided that the subjects and the questions are compatible with such an arrangement. Even when piggybacked in that way, translational questions are pre-developmental.

Sometimes, phase I is taken to be the dividing line between basic research and clinical development. Phase I is occasionally equated to the earliest human trial of a new therapy. But dose-finding is a developmental question, and is often not the key therapeutic question needed at the interface between laboratory and clinic. Failure to recognize the geography and role of translational trials has probably impeded both their formalization and that of dose-finding designs. Dose-finding and related experiment designs are discussed in Chapter 12.

I have indicated elsewhere in this book that clinical trials can be viewed either as empirical devices that test or select treatments, or as biological experiments that reveal truths of nature. In actuality, they are both simultaneously. How best to view a trial depends on the context of the question and the study design. Many randomized trials or large-scale studies fit better in the empirical end of the spectrum, while translational trials are typical of the biological end. The translational model explicitly incorporates elements from both the laboratory and clinical paradigms for the disease.

Another major impediment to defining translational trials may be skepticism among methodologists that such small studies, as these typically are, can provide useful information. In a sense, the class of translational trials is reverse engineered from the small sample sizes that clinical investigators instinctively know will yield needed information about the worth of a new therapeutic idea. This is at the heart of the matter, and clinical investigators never seem to be in doubt as to the informational utility of such trials. Consider the relatively large amount and important nature of information that dose-finding/ranging trials provide with minimal sample size. It is similarity of purpose and structure that codifies a design class. Translational trials can be typed well enough across disciplines to justify their definition as a class.

In this chapter, I also discuss certain biomarker studies that are closely related to translational trials. The biomarker concept is invoked often as a useful tool for trial design, prognosis, and therapeutic prediction. In reality, the number of reliable biomarkers for such purposes is small. However, the biological signal that is central to the understanding and implementation of a translational trial is a biomarker. So it is important to have a perspective on biomarkers in parallel with translational trials.
11.1.1 Therapeutic Intent or Not?
An immediate problem with translational clinical trials is the presence or absence of therapeutic intent [719]. This has occasionally been an issue in dose-finding trials, though I think inappropriately so as discussed in Section 12.2.1. For translational trials, the issue is more substantial. Some investigators explicitly state that such trials lack therapeutic intent. If true, this makes these trials problematic, because most of them would be expected to carry more than minimal risk for participants. New agents and invasive diagnostic procedures seem routine. Most Institutional Review Boards might take a dim view of the resulting risk-benefit profile in the absence of therapeutic intent. The literature regarding translational trials is not very helpful on this point. There is no universal way
around these issues from the perspective of trial design, but some principles can be kept in mind. First, it is possible to design some translational trials with therapeutic intent. Driving biological signals or markers in a good direction seems to be a therapeutic idea even if we are not yet certain about dose, durability of effect, definitiveness, side effects, and other important aspects of a therapy. Second, investigators should choose minimal risk assessments like blood tests, non-invasive imaging, or biological fluid analyses as evaluation methods whenever possible. Third, these trials should be coupled with standard therapy to take advantage of more aggressive treatment without adding significant risk. For example, a subject who requires a biopsy or surgery as standard of care might also contribute to answering an appropriately constructed translational question.
11.1.2 Mechanistic Trials
Early developmental studies test treatment mechanisms (TM). To a biomedical engineer, mechanism might have a conventional interpretation in terms of device function. To the clinical pharmacologist, treatment mechanism might mean bioavailability of the drug; to a surgeon, it might mean procedure or technique; to a gene therapist, it might mean cell transfection or function of an engineered gene; and so on. Even diagnostic and screening trials have treatment mechanisms in this sense of the word. Mechanistic success is not equivalent to clinical outcome, which must be assessed in later and larger clinical trials.

Some early drug studies focus on more than the mechanism of drug availability. They frequently explore a dose-finding (DF) goal based on appropriate clinical outcomes. For some drugs or preventive agents, we might be interested in the minimum effective dose. In cytotoxic drugs for oncology, the focus is often the highest tolerated dose, under the belief that it will provide the best therapeutic effect. TM and dose-finding studies are closely connected to biological models of pharmacokinetics or dose-response. Other types of treatment may be founded on more complex models of normal function or disease. The outcome of such studies may be best described in terms of such a model. In later developmental trials, the model of the treatment and its interaction with the disease process or clinical outcome is less important.
Can Randomization be Used in Early Developmental Trials?
Early developmental trials seldom have internal controls. Selection bias tends not to be a major concern for trials focused on treatment mechanism, pharmacologic endpoints, and biological signals. Therefore randomization seems to have little role in such studies. Some new agents, such as genetically engineered vaccines, can be expensive to manufacture and administer but have large treatment effects. Dose-finding study objectives for these agents are to determine the largest number of cells (dose) that can be administered and to test for side effects of the immunization. Side effects could arise from transfected genes or the cells themselves. A randomized design could separate these effects. Such considerations led one group of investigators to perform a randomized dose escalation trial in the gene therapy of renal cell carcinoma [1401]. An advantage of this design was that the therapy had the potential for large beneficial effects that might have been demonstrated
convincingly from the trial. However, as a general rule, the strong connectedness of translational trials to biological models reduces the need for randomized designs.
11.1.3 Marker Threshold Designs Are Strongly Biased
A frequently used translational design is based on a biomarker threshold. It is sometimes labeled a "phase 0" design [1313], although that term is not consistently applied. A baseline assessment of a target or biomarker is made with imaging or biopsy. Following therapy and a suitable interval, the target is assessed again as it was at baseline. Differences in the marker reflect the action of the therapy. This design depends on the ability to sample tissue with an actual biopsy, functional imaging, microdialysis, or other method of directly observing a target. Even if we assume the validity of the marker, a variety of other complications can render this idealization infeasible. For example, we may not be able to obtain multiple biopsies, or have appropriate cellular functional imaging. But when assessments can be performed, the design is efficient at evaluating the action of the therapy. The main efficiency stems from the before and after therapy evaluations allowing within-subject effects to be estimated. This saves at least half the sample size relative to a treatment versus control design. Using as few as 20 subjects, one can reliably detect a shift in marker or target level of one standard deviation. It is almost always biologically meaningful to cause a shift in a valid marker equal to the magnitude of person-to-person variability.

Unfortunately, this design is very fragile with respect to the availability of baseline assessments. When they are not available, we have two design options with very different operating properties. One is an independent group comparison, like a randomized trial. That will require at least twice as many subjects as the pre–post design, but will yield an unbiased estimate of the treatment effect. Alternatively, we could test a single cohort for a post-treatment shift in marker levels above a defined threshold. (Defining that threshold might have required a control group in a separate study.) This is a marker threshold design, and might use a one-sample binomial test to compare an expected high proportion of successes to some minimal background frequency. The marker threshold design seems to be able to detect treatment effects reliably, and appears more efficient than an independent groups trial. But in fact, it is strongly biased to discard active therapies. This is counterintuitive.

The bias in the marker threshold design can be seen in the following way. Prior to treatment, the background distribution of values for a specific marker can be described by the probability distribution function, PDF, f(x). This is the probability of observing a background marker level of any value x. The effect of treatment is represented by a second probability function G(z) that describes the chance of increasing the marker level by an amount z or more. Post-treatment values above a threshold c are declared to be successes and otherwise are failures. We could make the threshold a marker reduction rather than an increase without altering this development or its conclusions. The main interest in the single cohort trial is gauging treatment success by the frequency of post-treatment marker values exceeding the threshold. For a baseline value x to end up above c, the treatment must increase it by the amount c − x or greater. Accord-
ing to definitions above, the probability of observing this event for baseline value x is f(x)G(c − x), where we assume that the treatment effect is independent of the marker level. The overall frequency from the entire cohort is the convolution

P(c) = \int_{-\infty}^{\infty} f(x)\, G(c - x)\, dx,    (11.1)
where we have integrated over all possible baseline marker values. It makes sense to view this probability as a function of the threshold c because more extreme criteria will have a lower chance of being satisfied. This convolution represents the sum of two random variables, one for the baseline marker value and a second one for the treatment effect. As such, we might have written equation (11.1) directly [457].

We need specific distributions for equation (11.1) to be useful quantitatively. I will not attempt to produce an analytic form for P(c) with overly simple assumptions, but instead make some realistic choices for f(x) and G(x) to illustrate quantitative results. First, assume the baseline f(x) to be a standard normal probability distribution function (PDF). All marker values will then be expressed as z-scores relative to this PDF. Assume optimistically that the effect of treatment is positive in all subjects and can be modeled by an exponential distribution,

G(z) = \begin{cases} 1, & \text{if } z < 0 \\ e^{-\log(2)\, z}, & \text{otherwise,} \end{cases}

shown in Figure 11.1. The median treatment effect is assumed to be one standard deviation (G^{-1}(0.5) = 1). Every baseline value is increased following therapy, some quite substantially. Now, using these distributions and equation (11.1), values for P(c) can be calculated numerically. The results are surprising in two ways. First, the probabilities of observing marker values above given thresholds are relatively small. For example, if we choose c = 2
FIGURE 11.1 Hypothetical exponential marker response distribution, G(x).
FIGURE 11.2 Probability of post-treatment marker exceeding threshold C (solid line), baseline probability of exceeding C (dotted line), and difference (dashed line).
(a threshold of two standard deviations), we expect only 31% of subjects to have post-treatment marker values above 2, despite the fact that every subject had an increase. The probability of exceeding any threshold drops off quickly for higher and higher criteria. These results are shown in Figure 11.2 (solid line). This illustrates the terrible inefficiency of this method compared to a pre–post design that uses the information from all marker-level increases.

Second, keep in mind that some baseline values may already be above any specific threshold. As c is moved toward lower values, the probability that baseline marker values exceed it increases. Under the present assumptions, this is a cumulative normal probability shown in Figure 11.2 (dotted line). Subtracting the baseline chance of exceeding c from the post-treatment chance yields the increase in successes resulting from the trial design and the assumed properties of the therapy. Such a curve is also shown in Figure 11.2 (dashed line), where it can be seen that the design performs with a maximal increase in successes, though still poorly, when the threshold is not extreme. The best threshold to use is about the same value as the median treatment improvement. While we can't know that value, it makes no sense to choose an extreme threshold when we expect a modest increase in marker value attributable to treatment, since that simply degrades the performance of the design.

As a final lesson, imagine what sort of treatment effect G(z) it takes to yield a high frequency of marker measurements above c. In the present framework, G(z) would have to be extreme, with a median effect of several standard deviations and a large fraction of the cohort receiving increases several times that size, to yield a high proportion of post-treatment measurements above c = 2. However, such unrealistic treatment effects are implicitly assumed when a study design proposes to detect a large success rate compared to some minimal background frequency. Without huge effects, the design will be indicating failure of the therapy. In other words, the design is strongly biased against modest sized treatment effects. If we have good treatments and only care to pursue home run effects, these properties are fine. On the other hand, if we lack good treatments and need confirmation of a positive biological effect, these designs will be unserviceable.
FIGURE 11.3 Idealized biological paradigm illustrating the role of an irrefutable signal on inferences regarding a potential therapy. As long as an irrefutable signal is affected by the therapy, invalid signals will be irrelevant.
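The bias just described is easy to check numerically. The following sketch (Python; the function names and integration bounds are mine, and the distributions are simply the illustrative choices made above) evaluates equation (11.1) for a standard normal baseline and an exponential treatment shift with median 1:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def exceed_prob(c, median_effect=1.0):
    """P(post-treatment marker > c) from equation (11.1), assuming a standard
    normal baseline f(x) and an exponential treatment shift with the given median."""
    lam = np.log(2.0) / median_effect          # rate such that the median shift equals median_effect
    def integrand(x):
        # G(c - x): chance that the shift carries a baseline value x above c
        g = 1.0 if c - x < 0 else np.exp(-lam * (c - x))
        return stats.norm.pdf(x) * g
    value, _ = quad(integrand, -10, 10)        # baseline density is negligible beyond +/- 10
    return value

for c in (1.0, 1.5, 2.0, 3.0):
    post = exceed_prob(c)
    base = stats.norm.sf(c)                    # chance a baseline value already exceeds c
    print(f"c={c:.1f}: post={post:.3f}, baseline={base:.3f}, gain={post - base:.3f}")
```

For c = 2 this gives a post-treatment exceedance probability of about 0.31, matching the 31% figure quoted above, and the gain over baseline is largest for thresholds near the median treatment effect.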
11.2 INFERENTIAL PARADIGMS
There are four idealized paradigms of inference that are relevant to translational trials, if not to all clinical trials. I will discuss three here, leaving the fourth for elaboration later when defining translational trials explicitly. The paradigms are biological, clinical, surrogate, and translational.

11.2.1 Biologic Paradigm
The mainstay of preclinical research is the biologic paradigm (Fig. 11.3). It centers on a validated model that contains essential aspects of the actual disease. A problematic word here is essential, because it may be difficult to say what must be captured by the model to allow valid inferences about preclinical development of a therapy. In any case, the model system must be sufficiently real, and great effort is spent to be sure this is true. The biologic paradigm is applied especially as an incremental tool in determining the most appropriate next set of experiments, rather than to draw definitive conclusions. Firm conclusions derive from a sequence of such experiments. Examples of biologic model systems include bacterial systems for carcinogenicity testing [33, 34], rodent tumor models, explants of cell lines, explants of actual tumors in immunocompromised (nude) mice, and so on. An infectious disease model system could be as simple and obvious as injecting the pathogen into otherwise healthy animals. A simple surgical model might be based on intentional disruption of normal physiology by a surgical intervention performed to yield a pathologic condition. When good model systems do not exist, as has been the case for many years in Alzheimer's disease and emphysema, the progress of therapeutic development is often slowed. Validation of a model system consists of conclusively demonstrating characteristics and behavior relevant to human disease. In the biologic paradigm, therapeutic response
is the prized characteristic on which to base validation. An animal model judged to be valid with respect to one therapy would require a separate validation for a new therapy. For translational research, an essential feature of the model system is an irrefutable measurable signal that relates to the progress of the disease and on which the therapy acts (Fig. 11.3). A simple example of an irrefutable signal is survival of the host, assuming the disease is chronic and severe. Other signals might be frequency or timing of clinical events, tissue or organ changes evident on pathologic examination, enzyme levels, receptor binding, or gene expression. We require that our putative treatment pushes the irrefutable signal in a direction that indicates benefit. If the signal does not change sufficiently, or changes in an unfavorable direction, then we will abandon the treatment. Secondary safety signals may also be captured, but they are not essential in understanding the nature of the biological paradigm. One cannot say, generally, how large a change in a biological signal is needed to sustain an interest in a given therapy, because we can't know at such an early stage how changes in a signal will influence clinical outcomes.

The implications of outcomes from such a system are clear but not definitive. A positive influence on the signal will advance our therapeutic development. Anything else will modify or terminate it. There is ample opportunity for the therapy to fail in later experiments due to actions unrelated to the main signal. Failure could be the result of our biological model being deficient in a key way, such as toxicities that become evident later, or inability to deliver the therapy in real-world circumstances or at sufficiently high concentrations.

The biologic paradigm tends to be clean with respect to some key features, including (1) homogeneous experimental subjects, (2) low variability, (3) absence of delivery/adherence as a major variable in therapy, (4) well-controlled (laboratory) conditions of treatment administration, (5) unbiased and complete outcome ascertainment, and (6) well-controlled sources of extraneous variation. As these ideal circumstances disappear in human cohorts, we may be unable to replicate model-based findings.

Even subtle defects in the biologic paradigm carry serious consequences. BIA 10-2474 was a fatty acid amide hydrolase inhibitor developed for treatment of neurological disorders such as anxiety, Parkinson's disease, and multiple sclerosis. In 2015, it was studied in a masked, randomized, placebo-controlled, multiple-dose study in healthy volunteers conducted at the University Hospital in Rennes, France. Ninety of 128 participants received drug at doses ranging from 0.25 to 100 mg. After about 6 months, five out of six subjects at the highest dose were hospitalized with necrotic and hemorrhagic brain lesions. One person died. Toxicities were likely due to off-target binding, enhanced by doses 40 times higher than those needed for hydrolase inhibition [399]. Pharmacokinetic analyses were not available in real time; there was gradual accumulation of drug in the brain due to nonlinear PK behavior and saturation of the elimination mechanisms. In retrospect, some animal deaths in preclinical studies were not reported. In previous trials with similar drugs, no serious toxicities were seen, leading to a false sense of security. These defects in understanding how the drug worked were aggravated by problems with the design of the trial and event reporting [685, 686].
FIGURE 11.4 Idealized clinical paradigm illustrating the role of definitive outcomes and secondary outcomes on inferences regarding a new therapy.
11.2.2 Clinical Paradigm
The idealized clinical inference paradigm rests on a well-defined disease and a valid definitive outcome measure at the individual subject level. The definition of disease in humans may include genetic, pathophysiologic, and behavioral requirements. In addition, a trial will have eligibility restrictions that enhance the safety or reliability of a treatment. The therapy is expected to affect the definitive outcome favorably (Fig. 11.4), and will likely produce effects on safety or other secondary outcomes that will also influence decisions regarding clinical utility. The clinical paradigm is most relevant to middle and late developmental trials where inference is empirically grounded. The implications of this paradigm are clear and definitive in clinical terms. However, we may still be left with concerns about the magnitude of therapeutic benefit, cost, side or secondary effects of treatment, appropriate application, and effectiveness on a population scale. We aspire to the definitiveness of the clinical paradigm in developmental clinical trials, but perhaps less so in translational trials.

Implications from the clinical paradigm do not depend on biological mechanisms, although we would very much prefer to understand them. The pure clinical paradigm is empirical, and we could find interesting avenues for investigation on that basis alone. For example, many natural products that have come into wide use were only later mechanistically validated. It also explains why many clinical scientists are willing to approach unconventional or "alternative" therapies with no more scientific underpinnings than the empirical structure of a clinical trial. This point is an additional illustration of the dual role of the clinical trial as experiment. Aside from overwhelming empirical evidence, the best way to filter false positive results is through a mechanistic understanding. An essential point about the clinical paradigm is that biological mechanisms are important and useful but not required, which is very much unlike the biological paradigm.

Defects in the clinical paradigm can have severe consequences for both development and subject safety. One extreme example was the fialuridine clinical trial in 1995 that
FIGURE 11.5 Idealized surrogate paradigm illustrating the role of a surrogate (or intermediate) outcome in making inferences regarding the action of a therapy ultimately on a definitive outcome.
was a phase II study in 15 subjects of two doses of the nucleoside analog for treatment of chronic hepatitis B. After week 13, participants began to experience hepatic failure and lactic acidosis. Although the drug was stopped at the first such report, seven participants developed progressive hepatic failure and five eventually died. Two participants survived by emergency liver transplants [1014]. In retrospect, chronic administration was seen to cause drug to accumulate in the liver, leading to mitochondrial damage and lactic acid deposition. There had been no chronic pre-clinical testing of the drug. Similarly, the chronic use schedule had not been tested in dose-finding trials. In those early studies, three subjects had died with hepatic failure but pathology follow-up on the circumstances was not extensive. This tragedy resulted in corresponding changes in pre-clinical testing.

11.2.3 Surrogate Paradigm
The third idealized inferential therapeutic framework or paradigm is based on surrogate outcomes. It may be useful to revisit the potential difficulties of surrogate outcomes as discussed in Chapter 5. My use of the term here will be slightly less formal but in no way invalidates the earlier discussion. The surrogate paradigm incorporates a well-defined disease along with a surrogate outcome measure, which we can assume is validated. Therapy favorably affects the surrogate outcome, and this affords some certainty that the definitive outcome will also have a beneficial change (Fig. 11.5). The implications of the result on the surrogate outcome are clear but not absolutely definitive. Unclear implications would result from applying a new type of therapy to a previously validated surrogate paradigm, which unfortunately is exactly what we might encounter in therapeutic research. As in the clinical paradigm, biological mechanisms are important and useful but not required to interpret results from surrogates. The chain from surrogate to definitive outcome is strengthened by knowledge of mechanisms, but this paradigm is also inher-
ently empirical. Shortly, I will synthesize these three inferential paradigms in a fourth translational system, but will first discuss the important role of evidence and theory.
11.3 EVIDENCE AND THEORY
Suppose that a treatment A is compared to a placebo in a masked randomized clinical trial. The result shows A is superior to placebo with a standardized effect estimate of 0.25 (i.e., 1/4 standard deviation) and p-value equal to 0.06. Our interpretation of this result would be different under the following two scenarios.

Scenario 1: A is a targeted new drug. A body of biological evidence supports the rationale for the design and testing of A, including preclinical evidence and early developmental trials that seem to validate a mechanism for its efficacy. Under this scenario, the comparative trial result would likely be taken as reasonable evidence of the efficacy of A. One could debate whether or not practice should generally be altered because of this result, but at least the therapeutic development of A would likely be pursued. Many investigators would characterize the result as "statistically significant."

Scenario 2: A is a homeopathic remedy. The homeopathic preparation of A involves dilution of a parent solution to the point that no active compound can possibly remain in solution according to the principles of physical chemistry. Efficacy is claimed on the basis of the solution's "memory" of what was previously dissolved. Here, the evidence in favor of the therapy would likely be judged weak, and might well be attributed to a type I error. The support for the presumed mechanism of efficacy is wholly inconsistent with established pharmacology and physical chemistry.
Thus, the power of the very same observation is enhanced or weakened according to the biology behind it. This illustrates the First principle of translational research: A good study design in the context of a strong theory maximally leverages the available observations (efficiency principle).
The leverage afforded by an excellent biological theory is strong enough to resolve some ambiguity on the "p less than or equal to 0.05" boundary. I am using the term "theory" in the sense of an established body of knowledge, as opposed to a speculation.

Now consider a different experiment. Suppose I flip a coin four times and show all heads. One may ask "Is the coin fair?" To help answer the question, we might calculate (1/2)^4 ≈ 0.06. Most of us would not yet reject the expectation of fairness in the coin based on such evidence. It is worth pointing out the crude analogy between this example and the traditional presentation of p-values in "Table 1" in many randomized clinical trial presentations. In this latter case, we seem to be asking if the allocations are consistent with chance, judged by an empirical p-value, when we already know that the assignments were random. To continue the experiment, suppose I disclose that I actually had two coins. One is indeed fair, whereas the second coin has two heads. One of the coins was chosen indiscriminately from my pocket to produce the results. With this additional information, your inference regarding fairness based on a sequence of four heads is likely to change. This illustrates the
Second principle of translational research: The same evidence can produce different conclusions depending on the working model (theory) on which they are based (model-evidence dependency).
For biological examples where this may be at play, see Neufeld et al. or Pang et al. [1114, 1174].

11.3.1 Biological Models Are a Key to Translational Trials
A model is simply a representation of its object. It is composed of real and imagined parts, assembled to represent its object in a useful way. Conceptually, models are quite general and include diverse forms such as humans temporarily repurposing themselves, literal or abstract physical forms, animals with human disease attributes, mathematical equations, and computer algorithms.

In quantitative science, there are two regular uses for models. One use is interpolatory, in which the model summarizes and simplifies that which has already been observed. This model is often the most concise way to summarize a large number of observations. An example is the statistical model used to estimate treatment or covariate effects (e.g., see Chapter 20). Under ideal conditions, statistical models capture all the available information, and hence represent superb data reduction devices. A second role is extrapolatory, where the model is used to predict or mimic behavior that has not been observed. The model may be tailored or fit to observations in one range, but really intended to inform us about an unobserved scope. Statistical examples include the "low-dose extrapolation" models for carcinogenicity testing. Test animals are exposed to high levels of potential carcinogens. The outcome data are represented by mathematical models that inform us about risks at low-dose exposures. These models save us from having to do huge, long-term studies at low doses of toxins. The continual reassessment method for dose-finding employs its model for both interpolation and extrapolation as a way to gain efficiency (discussed also in Chapter 12).

Biological models used for therapeutic testing are extrapolatory. They inform us to a certain extent how the treatment would fare in humans. The biological models that we think are valid are given a great deal of power; we discard therapies based on how they perform in those models. Low-dose extrapolation models serve this purpose for environmental exposures. But of course, we never accept therapies unless they pass real-world (human) testing.

In the second principle of translational research, evidence is model dependent. The inseparable nature of evidence and models is true throughout science. It is particularly an issue in translational research, because it is somewhat minimized in more familiar and empirical developmental trials. Developmental therapeutics have taught us some atypical lessons, but translational research is more like the rest of science.
11.4 TRANSLATIONAL TRIALS DEFINED

11.4.1 Translational Paradigm
The idealized inferential paradigm for translational research contains elements of the biological, clinical, and surrogate paradigms (Fig. 11.6). Like the biological paradigm, a translational study requires an irrefutable signal on which to base its inference. This
FIGURE 11.6 Idealized translational inference paradigm illustrating elements from the biological, clinical, and surrogate paradigms. The irrefutable signal is measured in subjects with the actual disease but arises from the biological paradigm. It plays a role similar to a surrogate in that the trial may not permit direct observation of the definitive outcome.
signal is selected based on evidence learned via an appropriate biological model. Like the clinical paradigm, a translational study is performed in subjects with a well-defined disease or condition. The surrogate paradigm indicates how a convenient, timely, and measured outcome may relate to a definitive clinical outcome. The translational study borrows this concept, anticipating that the measured signal will, under the working model of disease and therapy, yield a definitive clinical benefit in future studies designed specifically to show it. Translational inference is a unique hybrid from laboratory and clinical paradigms. Metaphorically the bridge is built from both shores. The treatment applied in a translational clinical trial is changeable and may be modified based on the results. Developmental clinical trials use either definitive outcomes such as survival, or surrogate outcomes such as toxicity and safety. During these developmental trials the treatment is not changeable except for dose or schedule. The outcome measure employed in a translational trial is a target or biological marker that itself may require additional validation as part of the study. This outcome is not a surrogate outcome as defined in Chapter 5 because it is not used to make inferences about clinical benefit. It might anticipate later questions of clinical benefit, but it has more urgent purposes. The activity of the treatment on the target defines the next experimental steps to be taken. The outcome has to be measurable with reasonable precision following treatment, and the absence of a positive change can be taken as reliable evidence of inactivity of the treatment. A carefully selected biological outcome is the linchpin of the experiment because it provides definitive mechanistic evidence of effect—an irrefutable signal—within the working paradigm of disease and treatment. The signal has to reveal promising changes in direction and magnitude for proof of principle and to support further clinical development. Good biological signals might be mediated through, or the direct result of, a change in levels of a protein or gene expression or the activity of some enzyme. Such a finding
by itself would not prove clinical benefit but might lay a foundation for continued development. As an example of a possible biological outcome, consider a new drug for the secondary prevention of cancer. The translational goal might require that our new drug reduce the presence of specific biomarkers in the tissue (biopsy specimens) or increase other relevant biomarkers. The absence of these effects would necessitate discarding or modifying the drug, or reconsidering the appropriate target. Weakly positive biomarker changes might suggest additional preclinical improvements. Neither outcome could reliably establish clinical efficacy.

11.4.2 Character and Definition
The basic characteristics of a translational trial are the following:
· The trial is predicated on promising preclinical evidence that creates a need to evaluate the new treatment in human subjects.
· The treatment and/or its algorithm can be changed.
· The treatment and method of target evaluation are fully specified in a written protocol.
· The evaluation relies on one or more biological targets that provide definitive evidence of mechanistic effect within the working paradigm of disease and treatment.
· The outcome is measurable with small uncertainty relative to the effect size. Target validation is sometimes also an objective, in which case imprecision in the outcome measurement may also represent a failure.
· Large effects on the target are sought.
· There is a clear definition of "lack of effect" or failure to demonstrate effects on the target.
· The protocol specifies the next experimental step to be taken for any possible outcome of the trial. Thus, the trial will be informative with respect to a subsequent experiment.
· The study is sized to provide reliable information to guide additional experiments, but not necessarily to yield strong statistical evidence.
· The study is typically undertaken by a small group of investigators with limited resources.
These considerations lead to the following formal definition of a translational trial: A clinical trial where the primary outcome: (1) is a biological measurement (target) derived from a well-established paradigm of disease, and (2) represents an irrefutable signal regarding the intended therapeutic effect. The design and purposes of the trial are to guide further experiments in the laboratory or clinic, inform treatment modifications, and validate the target, but not necessarily to provide reliable evidence regarding clinical outcomes [1202].
Translational trials imply a circularity between the clinic and the laboratory, with continued experimentation as the primary immediate objective. We are likely to use more than one biological outcome in such studies, and the trial may also provide evidence about
the utility of each. Many therapeutic ideas will prove useless or not feasible during this cycle. For others, the lab–clinic iteration will beget the familiar linear development of a new therapy, perhaps after numerous false starts.
11.4.3 Small or "Pilot" Does Not Mean Translational
My attempt to formalize translational trial design is not a general justification for small clinical trials. Although translational studies often use small sample sizes, small trials are not necessarily translational. The important distinctions are the setting, purpose, nature of the outcome, and how the studies are designed, conducted, and interpreted. For example, small comparative trials using clinical outcomes are seldom justified, almost always being the product of resource limitations. Some limits on small trials are discussed in Section 2.4.6. Translational clinical trials need no special justification. Investigators are convinced that they learn important information from such studies, and the long run evidence supports the truth of this view.

A common term that appears in this context is "pilot study," a label that should be avoided. The reason to shun it is that there is no definition for "pilot" in the literature that codifies it as a class of design. This term is often used when investigators refuse to think creatively, quantitatively, and descriptively about the requirements of their study design. Though it seems sensible not to use terms without good definitions, the word "pilot" appears paradoxically in places like "phase I pilot" or "phase II pilot." I see it often in regulatory settings and grant reviews. The main purpose of the term is to deflect criticism rather than to inform others about the purpose of a clinical trial. It should not be used in the context of translational trials. Any well-written and clearly purposed clinical trial protocol will be able to avoid the use of "pilot" in the title. Eliminating the term from our lexicon for clinical trials will help young investigators sharpen the objectives of their studies.
11.4.4 Hypothetical Example
Suppose we have a local gene therapy for brain cancer using a well-studied (safe) adenoviral vector. The therapy is placed by injection after a surgical resection. Production of the vector is straightforward; injecting the viral particles is a solved problem; the correct dose of viral particles is known from preclinical studies. To design an appropriate translational clinical trial, the following questions are relevant. (1) What is the appropriate study goal, outcome, and design for the first administration of this therapy to humans? (2) What evidence is needed to inform the next step? (3) Is there a dose question? (4) What does it mean to establish safety? (5) What can we measure that might indicate the therapy is promising? There does not seem to be a dose question, at least not a strong one that should drive the design. Unexpectedly perhaps, it might be possible to remove some of the uncertainty regarding therapeutic benefit before being certain about safety. This is the reverse of conventional wisdom about developmental trials, where evidence regarding safety seems a more easily attainable goal than evidence for efficacy. For example, to rule out safety events with a threshold of 10% requires 30 subjects. This is based on an exact binomial 95% upper confidence bound on 0/30 safety events (Section 16.4.3). But we might reliably establish the presence or absence of gene product in the tumor
with half as many subjects. An off-the-shelf developmental design (e.g., dose-finding) is not appropriate for addressing the relevant biological questions. A successful view of efficacy would set the stage for a subsequent and slightly larger safety trial. This example makes clear the Third principle of translational research: In a well-designed translational study, either a positive or null result will be informative provided the primary goal is to acquire enough information to guide the next experiment, rather than to estimate clinical outcomes reliably (fail-safe principle).
Precision of estimation is not a primary design requirement as it is with later clinical trials. This represents a departure from key aspects of the statistical paradigm for developmental clinical trials, but not from the use of statistical reasoning.
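As a minimal sketch of the safety sample size arithmetic mentioned above (my own function names; it assumes the exact one-sided binomial bound referenced in Section 16.4.3), the 95% upper confidence bound on the event probability after observing 0 events in n subjects can be computed directly:

```python
def upper_bound_zero_events(n, conf=0.95):
    """Exact one-sided upper confidence bound on an event probability when
    0 events are seen in n subjects: the largest p with (1 - p)**n >= 1 - conf."""
    return 1.0 - (1.0 - conf) ** (1.0 / n)

for n in (15, 30):
    print(f"n={n}: 95% upper bound = {upper_bound_zero_events(n):.3f}")
```

With 30 subjects and no events the bound is roughly 9.5%, consistent with needing about 30 subjects to rule out a 10% risk, whereas a biological signal such as gene product in the tumor might be established with far fewer subjects.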
11.4.5 Nesting Translational Studies
Generally, there are no major statistical design issues in nesting a translational study within a developmental clinical trial. The translational trial is almost always smaller than the parent trial, and the outcomes required tend to be simple and measured early. Also, the analysis is simple. There are other significant issues, however, such as

· assuring that translational objectives can be met from the cohort being studied,
· assay/target validation,
· destructive sampling of specimens,
· selection bias,
· interfering with clinical outcomes, and
· subject safety/convenience.
The brain tumor example trial already discussed might be nested in an existing “phase II” trial.
11.5 INFORMATION FROM TRANSLATIONAL TRIALS
Introducing a new treatment into humans is usually characterized by high uncertainty about both beneficial effects and risks. In this circumstance, even a fairly small experiment can substantially increase information for the purposes of guiding subsequent studies. The reduction of uncertainty (gain in information) can be quantified. The combination of biological knowledge with information from formal observation is a powerful tool for designing early experiments. This is the Fourth principle of translational research: When properly designed, a translational trial acquires measurable information, particularly in the context of high uncertainty (such as concerning the treatment effect for an emerging therapeutic idea).
Entropy can quantify the information gained during the trial and help determine the design properties of the study. My view is that investigators use translational designs and inferences in a
FIGURE 11.7 Relationship between the probability of an event and surprise, if that event occurs.
way consistent with acquiring modest, measurable, but critical pieces of information. To develop this idea, first consider what it means to have surprising information.

11.5.1 Surprise Can Be Defined Mathematically
It's helpful at this point to offer a precise quantitative definition of surprise as it pertains to the probability of observing certain events. This definition will be useful later. Intuitively, if we actually observe an event that is thought to have high probability, we cannot say we are surprised. In contrast, if we actually observe an event that we judged to be unlikely, we would be surprised. Hence, there is a connection between the probability of an event and the degree of surprise resulting from observing it. A certain event must yield zero surprise, whereas an "impossible" event would bring infinite surprise. The simplest mathematical function of probability that has the necessary properties is s = −log(p), where the minus sign is taken merely to give surprise, s, a positive value since 0 ≤ p ≤ 1. We will take this as a formal definition of surprise (Fig. 11.7), and it will provide a link to the concept of information. Information is what we seek in any clinical trial. The idea of information is often used informally, but in translational trials it is helpful to describe formally and quantitatively the information gained because it captures the utility of the experiment.

11.5.2 Parameter Uncertainty Versus Outcome Uncertainty
The usual statistical perspective associates uncertainty with unknown parameters related to the problem at hand. Examples of parameters that are of interest in clinical trials
TABLE 11.1 Hypothetical results of three translational trials. A positive outcome is denoted by +, indifferent by +/−, and unsatisfactory by −. a, b, and c are the numbers of subjects in each outcome category. All three trials contain the same amount of information

                Outcome Frequency
Trial     Good      Neutral      Bad
1         a         b            c
2         b         c            a
3         c         a            b
include relative risks of failure or death, probability of benefit, and mean plasma level of a drug. The appropriate size of many experiments may be driven by the precision with which such parameters must be determined.

A second level of uncertainty associated with therapy is a consequence of different possible outcomes. Information regarding the parameter(s) may not greatly reduce this overall outcome uncertainty. For example, we could be very sure that the probability, p, of one of two outcomes following treatment A is 0.5. Despite knowing the parameter fairly precisely, there is still much uncertainty associated with the use of treatment A, because either outcome is equally likely. In contrast, if we were sure that treatment A produces one outcome with probability 0.95, the overall uncertainty would be lower. Therefore, even if we know probability parameters with high precision, outcome (overall) uncertainty remains. Furthermore, outcome uncertainty depends on the value of the parameter, but not as much on the precision of the parameter.

Entropy (defined later) measures overall uncertainty. Specifically, it formalizes and quantifies the idea of information. This is a useful concept for translational trials because they guide early development more from the perspective of overall uncertainty than from precision of estimation. There is no proof of this assertion, only the general behavior of translational investigators. Later developmental trials are appropriately focused on precision of estimation, as discussed elsewhere in this book.

Suppose a, b, and c represent the numbers of subjects with good (+), bad (−), or neutral (±) outcomes in a translational trial. The results of three hypothetical translational trials in Table 11.1 yield equal amounts of information and have the same total sample size. However, the therapeutic implications of the results could be quite different in each case. (Table 11.1 omits three additional theoretically possible study outcomes.) Thus, information is not result-specific but is a consequence of the discrete states and probabilities that can be attained.

11.5.3 Expected Surprise and Entropy
It is now an appropriate place to formally define information. In Section 11.5.1, I defined surprise and justified it heuristically. I will assume that the result of our experiment is multinomial, that is, a set of discrete outcomes each with some probability of occurring. Information, denoted by H, is simply the statistical expectation (expected value) of surprise with respect to the possible outcomes of our experiment,

H = -\sum_{i=1}^{m} p_i \log p_i,    (11.2)
TABLE 11.2 Different sized translational trials that carry the same information. Each cell represents the number of subjects in that outcome category

                Outcome Frequency
Trial     Good      Neutral      Bad
1         a         b            c
2         5a        5b           5c
3         10a       10b          10c
where there are m possible outcome states, each with probability p_i, and \sum p_i = 1. Thus, information is essentially a measure of how much surprise we can expect from the results of our experiment.

A few details need to be cleaned up. Equation 11.2 is Shannon's entropy [1365] that measures the information in a random variable that takes one of m states. The entropy/information concept is identical in all of science. It can also be placed on an axiomatic footing, not required here. The outcome probabilities, p_i, are, in principle, known. However, I will treat the probabilities as approximate, empirical, or arising from the results of the translational trial at hand. We cannot know them in advance or there would be no reason to perform the trial. Empirically determined information will be denoted by H* or ΔH* as appropriate. Gain in information is measured by change in entropy. If we characterize the pre-experiment uncertainty by p_i's and the post-experiment state by q_i's, the acquired information is

\Delta H = \sum_{i=1}^{m} q_i \log q_i - \sum_{i=1}^{m} p_i \log p_i.    (11.3)
Many results yield the same amount of information, as seen in Table 11.2, because the ordering of the probabilities in equations 11.2 or 11.3 is immaterial. Large sample sizes do not produce proportionate gains in information, because the proportions in the outcome categories tend to stabilize after modest sample sizes. More data increase the precision in the estimated probabilities but may not change those probabilities substantially. In other words, the observed probability of each outcome stabilizes fairly quickly with respect to sample size. Thus, the apparent information is not strongly a function of sample size.

The properties of information seem to be reflected by investigator behavior with respect to translational trials. Many clinical outcomes are equally informative about the treatment effect, though not equally promising. Large studies seem to be viewed as unnecessary to accomplish the goals. Definitive clinical inferences are neither made nor needed, but the information gathered can be used to guide subsequent experiments. I suspect that investigators are instinctively reacting to reduced uncertainty, or gain in information instead of statistical precision, when they design and assess translational trials.

As an illustration, consider the three trial outcomes in Table 11.2. The therapeutic implications for each of the three trials are the same because the probability of the favorable, unfavorable, and indifferent outcomes are identical across cases. Also, the information is the same in all three cases. However, which overall study size is most suitable and efficient for reaching a reasonable conclusion remains unanswered.
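A minimal sketch in Python (the counts and helper name are mine) makes these properties concrete: permuting the outcome categories, or scaling the whole trial up, leaves the empirical entropy of equation 11.2 unchanged.

```python
import numpy as np

def entropy(counts):
    """Empirical entropy H* = -sum p_i log p_i from observed category counts
    (equation 11.2, natural logarithm)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                              # treat 0 log 0 as 0
    return -np.sum(p * np.log(p))

# Table 11.1 pattern with hypothetical counts a=8, b=4, c=2: permutations are equivalent.
print(entropy([8, 4, 2]), entropy([4, 2, 8]), entropy([2, 8, 4]))

# Table 11.2 pattern: scaling the sample size does not change the apparent information.
print(entropy([8, 4, 2]), entropy([40, 20, 10]), entropy([80, 40, 20]))

# Gain in information (equation 11.3) relative to maximal pre-trial uncertainty.
print(entropy([1, 1, 1]) - entropy([8, 4, 2]))
```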
Another circumstance is illustrated in Table 11.3, where each of the outcome probability ratios for the trials is 3:1:1. But the therapeutic implications for each case are quite different, because the precision and trial sizes are so different. Apart from sample size, the apparent gain in information from a study relative to maximal uncertainty depends on the degree of non-uniformity in the outcome probabilities, as illustrated by the hypothetical trials in Table 11.4. There, for example, the first trial seems to represent a strong reduction in uncertainty.
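For instance, a short sketch (natural logarithms, as above) computing the entropy of each outcome distribution in Table 11.4 shows how non-uniform probabilities translate into an apparent gain relative to the maximal uncertainty of log 3:

```python
import numpy as np

# Outcome probabilities (good, neutral, bad) for the four trials of Table 11.4.
table_11_4 = {1: [0.0, 0.1, 0.9], 2: [0.33, 0.33, 0.33],
              3: [0.3, 0.5, 0.2], 4: [0.4, 0.4, 0.2]}
for trial, probs in table_11_4.items():
    p = np.array([x for x in probs if x > 0])      # drop zero categories (0 log 0 = 0)
    h = -np.sum(p * np.log(p))
    print(f"Trial {trial}: H = {h:.3f}, apparent gain vs uniform = {np.log(3) - h:.3f}")
```

Trial 1, with its highly skewed distribution, gives by far the largest apparent reduction in uncertainty; trial 2 gives essentially none.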
11.5.4 Information/Entropy Calculated From Small Samples Is Biased
Equation 11.3 was applied as though the q_i's from the experiment were the true frequencies. However, small studies may tend to underrepresent some outcomes, so equation 11.3 can overestimate the gain in information from the study. The expected value of H*, using the observed data from n subjects, is

E\{H^*\} = -E\left\{ \sum_{i=1}^{m} \frac{r_i}{n} \log \frac{r_i}{n} \right\}
         = -\sum_{i=1}^{m} \frac{1}{n}\left( E\{r_i \log r_i\} - E\{r_i\} \log n \right).

The responses are multinomial and the number in a given category is binomial. Using the definition of expectation for the binomial distribution, we have

E\{H^*\} = -\sum_{i=1}^{m} \left( \frac{1}{n} \sum_{k=0}^{n} k \log k \binom{n}{k} p_i^k (1 - p_i)^{n-k} - p_i \log n \right),    (11.4)

where the p_i are the true classification probabilities. In other words, E\{H^*\} \neq H = -\sum p_i \log p_i from equation 11.4. E\{H^*\} is a function of sample size, the number of categories, and the true p_i. The bias in H* or ΔH* is the difference between equations 11.4 and 11.2,

b\{H^*\} = \sum_{i=1}^{m} \left( \frac{1}{n} \sum_{k=0}^{n} k \log k \binom{n}{k} p_i^k (1 - p_i)^{n-k} - p_i \log n \right) - \sum_{i=1}^{m} p_i \log p_i.    (11.5)
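As a numerical check on these expressions (a sketch under the same assumptions; the helper names are mine, and the example probabilities are those of curve B in Figure 11.8), equation 11.4 can be evaluated directly and compared with the true entropy:

```python
import numpy as np
from scipy.stats import binom

def true_entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def expected_empirical_entropy(p, n):
    """E{H*} from equation 11.4: each category count r_i is Binomial(n, p_i)."""
    k = np.arange(1, n + 1)                     # the k = 0 term contributes nothing to k log k
    total = 0.0
    for pi in p:
        e_k_log_k = np.sum(k * np.log(k) * binom.pmf(k, n, pi))
        total += e_k_log_k / n - pi * np.log(n)
    return -total

p = [0.46, 0.26, 0.15, 0.08, 0.05]
for n in (5, 10, 15, 30):
    bias = true_entropy(p) - expected_empirical_entropy(p, n)   # equation 11.5
    print(f"n={n}: bias in H* = {bias:.3f}")
```

The bias is positive and shrinks quickly with sample size, echoing the behavior plotted in Figure 11.8.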
TABLE 11.3 Different translational trials with disparate outcomes. All trials carry the same amount of information. Each cell represents the number of subjects in that outcome category

                Outcome Frequency
Trial     Good      Neutral      Bad
1         3         1            1
2         1         1            3
3         12        4            4
4         24        8            8
TABLE 11.4 Translational trials that carry different amounts of information. Each cell represents the proportion of subjects in that outcome category

                Outcome Frequency
Trial     Good      Neutral      Bad
1         0         0.1          0.9
2         0.33      0.33         0.33
3         0.3       0.5          0.2
4         0.4       0.4          0.2
H* underestimates H, that is, b\{H^*\} > 0. Because the function p \log p is convex, Jensen's inequality states

E\left\{ \frac{r_i}{n} \log \frac{r_i}{n} \right\} \geq E\left\{ \frac{r_i}{n} \right\} \log E\left\{ \frac{r_i}{n} \right\} = p_i \log p_i,

where the strict equality is a consequence of the fact that r_i/n is unbiased for p_i. Hence,

\sum_{i=1}^{m} E\left\{ \frac{r_i}{n} \log \frac{r_i}{n} \right\} - \sum_{i=1}^{m} p_i \log p_i \geq 0,

or H - E\{H^*\} \geq 0. The bias calculated from equation 11.5 is illustrated in Figure 11.8. The empirical entropy overestimates the information gained from a study. Some bias is present even for large sample sizes but is reduced substantially for sample sizes above 15.

11.5.5 Variance of Information/Entropy
The variance of H* can be calculated from var\{H^*\} = E\{H^{*2}\} - E\{H^*\}^2. The first term yields

E\{H^{*2}\} = \sum_{i=1}^{m} \sum_{j=1}^{m} E\left\{ \left( \frac{r_i}{n} \log \frac{r_i}{n} \right) \left( \frac{r_j}{n} \log \frac{r_j}{n} \right) \right\}.

The outcomes r_i and r_j are not independent. The expectation inside the sum can be calculated using E\{XY\} = E_X\{X E_Y\{Y|X\}\}. For m \geq 3,

E\left\{ \left( \frac{r_i}{n} \log \frac{r_i}{n} \right) \left( \frac{r_j}{n} \log \frac{r_j}{n} \right) \right\} =
\begin{cases}
\displaystyle \sum_{k=0}^{n} \frac{k}{n} \log \frac{k}{n} \left( \sum_{r=0}^{n-k} \frac{r}{n} \log \frac{r}{n} \binom{n-k}{r} \left( \frac{p_j}{1-p_i} \right)^{r} \left( 1 - \frac{p_j}{1-p_i} \right)^{n-k-r} \right) \binom{n}{k} p_i^{k} (1-p_i)^{n-k}, & \text{for } i \neq j, \\[2ex]
\displaystyle \sum_{k=0}^{n} \left( \frac{k}{n} \log \frac{k}{n} \right)^{2} \binom{n}{k} p_i^{k} (1-p_i)^{n-k}, & \text{for } i = j.
\end{cases}
FIGURE 11.8 Bias versus sample size for empirically calculated entropy. The true probabilities used to generate each curve are 𝐴 = {0.20, 0.20, 0.20, 0.20, 0.20}; 𝐵 = {0.46, 0.26, 0.15, 0.08, 0.05}, 𝐶 = {0.50, 0.30, 0.10, 0.05, 0.05}, and 𝐷 = {0.01, 0.15, 0.80, 0.03, 0.01}.
Therefore,

var\{H^*\} = \sum_{i=1}^{m} \sum_{\substack{j=1 \\ j \neq i}}^{m} \sum_{k=0}^{n} \frac{k}{n} \log \frac{k}{n} \left( \sum_{r=0}^{n-k} \frac{r}{n} \log \frac{r}{n} \binom{n-k}{r} \left( \frac{p_j}{1-p_i} \right)^{r} \left( 1 - \frac{p_j}{1-p_i} \right)^{n-k-r} \right) \binom{n}{k} p_i^{k} (1-p_i)^{n-k}
    + \sum_{i=1}^{m} \sum_{k=0}^{n} \left( \frac{k}{n} \log \frac{k}{n} \right)^{2} \binom{n}{k} p_i^{k} (1-p_i)^{n-k}
    - \left[ \sum_{i=1}^{m} \left( \frac{1}{n} \sum_{k=0}^{n} k \log k \binom{n}{k} p_i^{k} (1-p_i)^{n-k} - p_i \log n \right) \right]^{2}.    (11.6)
For m = 2 (binomial), the conditional expectation simplifies, so that

E\left\{ \left( \frac{r_i}{n} \log \frac{r_i}{n} \right) \left( \frac{r_j}{n} \log \frac{r_j}{n} \right) \right\} =
\begin{cases}
\displaystyle \sum_{k=0}^{n} \frac{k}{n} \log \frac{k}{n} \, \frac{n-k}{n} \log \frac{n-k}{n} \binom{n}{k} p_i^{k} (1-p_i)^{n-k}, & \text{for } i \neq j, \\[2ex]
\displaystyle \sum_{k=0}^{n} \left( \frac{k}{n} \log \frac{k}{n} \right)^{2} \binom{n}{k} p_i^{k} (1-p_i)^{n-k}, & \text{for } i = j,
\end{cases}
FIGURE 11.9 Reduction in bias and variance of entropy as a function of sample sizes greater than 3 for a binary outcome. Lines represent outcome probabilities of {0.1, 0.9}, {0.2, 0.8}, {0.3, 0.7}, {0.4, 0.6}, and {0.5, 0.5} from top to bottom.
and equation 11.6 becomes

var_2\{H^*\} = \sum_{i=1}^{m} \sum_{k=0}^{n} \frac{k}{n} \log \frac{k}{n} \, \frac{n-k}{n} \log \frac{n-k}{n} \binom{n}{k} p_i^{k} (1-p_i)^{n-k}
    + \sum_{i=1}^{m} \sum_{k=0}^{n} \left( \frac{k}{n} \log \frac{k}{n} \right)^{2} \binom{n}{k} p_i^{k} (1-p_i)^{n-k}
    - \left[ \sum_{i=1}^{m} \left( \frac{1}{n} \sum_{k=0}^{n} k \log k \binom{n}{k} p_i^{k} (1-p_i)^{n-k} - p_i \log n \right) \right]^{2}.    (11.7)
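These moments can be checked numerically. The sketch below (my own helper names) evaluates E{H*} and var{H*} for a binary outcome by direct enumeration over the binomial distribution, which should agree with equations 11.4 and 11.7 for m = 2:

```python
import numpy as np
from scipy.stats import binom

def h_star(k, n):
    """Empirical entropy when k of n subjects fall in the first of two categories."""
    h = 0.0
    for r in (k, n - k):
        if r > 0:
            h -= (r / n) * np.log(r / n)
    return h

def moments_h_star(p, n):
    """Exact mean and variance of H* for a binary outcome, r ~ Binomial(n, p)."""
    ks = np.arange(n + 1)
    pmf = binom.pmf(ks, n, p)
    h = np.array([h_star(k, n) for k in ks])
    mean = np.sum(pmf * h)
    return mean, np.sum(pmf * h**2) - mean**2

true_h = -(0.3 * np.log(0.3) + 0.7 * np.log(0.7))
for n in (5, 10, 20):
    mean, var = moments_h_star(0.3, n)
    print(f"n={n}: bias = {true_h - mean:.3f}, var = {var:.4f}")
```

Both the bias and the variance fall quickly as n grows, mirroring the behavior summarized in Figure 11.9.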
11.5.6 Sample Size for Translational Trials
Now that we can calculate H*, var{H*}, and Bias{H*}, a method can be proposed to select an appropriate sample size for a translational trial. As a first case, assume that the outcome is binary with response probability p. For a range of outcome probabilities we can calculate var{H*} and Bias{H*}, and observe how they decline with increasing sample size. Taking a small sample size (n = 3) as a baseline, the relative sizes of var{H*} and Bias{H*} are shown in Figure 11.9 for a range of outcome probabilities. Provided p is not extreme, H* can be estimated with low MSE with sample sizes less than 20. When p deviates strongly from 1/2, var{H*} and Bias{H*} decline more slowly. For multinomial outcomes, complexity increases rapidly if we attempt to specify a range of possibilities for the response probabilities {p_1, p_2, …, p_n}. This is shown in Figure 11.10. It is more feasible to consider specifying a minimal feature of the response distribution, particularly the mean, and choosing the p_i to have conditionally
FIGURE 11.10 Relationship of bias, variance, and sample size for empirically calculated entropy. The true probabilities used to generate each curve are 𝐴 = {0.2, 0.2, 0.2, 0.2, 0.2}, 𝐵 = {0.46, 0.26, 0.15, 0.08, 0.05}, 𝐶 = {0.01, 0.15, 0.80, 0.03, 0.01}, and 𝐷 = {0.50, 0.30, 0.10, 0.05, 0.05}.
the maximum entropy (greatest uncertainty). If we specify only the mean μ, maximum entropy is given by a Gibbs distribution. For example, suppose there are n possible outcomes scored as x_1, x_2, …, x_n. The mean outcome is then

\mu = \sum_{i=1}^{n} p_i x_i,

and the p_i satisfy

p_i = \frac{e^{\xi x_i}}{\sum e^{\xi x_i}}.    (11.8)
From the implied 𝑛 + 1 equations, we can determine 𝜉 and all 𝑝𝑖 that will yield maximum entropy. For example, suppose an outcome has five levels that we score as {−2, −1, 0, 1, 2}. Appropriate clinical labels might be attached to each score. If the mean outcome is 0, equation 11.8 yields 𝑝 = {0.2, 0.2, 0.2, 0.2, 0.2} as the distribution of maximum uncertainty. In contrast, if the mean outcome is 1, equation 11.8 yields 𝑝 = {0.048, 0.084, 0.148, 0.261, 0.459} as the distribution of maximum uncertainty conditional on 𝜇 = 1. Bias versus variance of 𝐻 ∗ is plotted in Figures 11.9 and 11.11 for binary and multinomial responses of length 5, respectively. For binary outcomes in Figure 11.9, the response probability matters somewhat, but the gain in information is well approximated for very modest sample sizes. With a five-level multinomial response as in Figure 11.11, the mean of the Gibbs distribution is not strongly influential, but larger sample sizes are required to reduce the bias and variability in the estimate of the information acquired. Because the examples have distributions with maximum entropy conditional on the chosen mean,
FIGURE 11.11 Reduction in bias and variance of entropy as a function of sample sizes for a multinomial outcome with 5 categories scored {−2, −1, 0, 1, 2}. For each set of calculations, a mean is specified and the remainder of the distribution is maximum entropy (Gibbs). Lines represent means of 0, 0.4, 0.8, 1.2, and 1.6.
the results are likely to be conservative with regard to sample size. When thinking about the appropriate size for a translational study, the range of possible outcomes is more important than which outcome might dominate. The general behavior is similar across many cases. Modest increases in sample size reduce the bias and variance substantially. Large increases in sample size reduce the bias and variance to negligible levels, but most of the benefit seems to be achieved by sample sizes around 15 to 20. It makes sense to balance bias and gain in information. We cannot generally afford to eliminate bias completely, and we will be satisfied to get a useful reduction in uncertainty. Any hypothetical truth of nature should be studied quantitatively, but as a general rule relatively small sample sizes provide adequate information for guiding the next experiments.

Example 11.1. Say we plan a translational trial with a dichotomous outcome (success or failure) and assume maximal uncertainty with regard to the outcome (50% chance of success). This circumstance corresponds to Figure 11.9. Substantial reductions in the variance and bias of the empirical entropy can be attained with a sample size of 10 to 15 subjects. Sample size in that range will provide a large fraction of the available information.

Example 11.2. Suppose that a translational trial is planned with an outcome classified into one of five ordinal categories, and the probabilities of the outcomes are expected to be {0.46, 0.26, 0.15, 0.08, 0.05}. Then Figure 11.11 applies. A large fraction of the available information will be gained with 15 subjects.
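As a brief numerical sketch (Python; the solver bracket and function names are mine), equation 11.8 can be solved for ξ to reproduce the maximum-entropy distributions quoted above for scores {−2, −1, 0, 1, 2}:

```python
import numpy as np
from scipy.optimize import brentq

def gibbs_probs(xi, scores):
    """Maximum-entropy (Gibbs) probabilities for tilt parameter xi (equation 11.8)."""
    w = np.exp(xi * np.asarray(scores, dtype=float))
    return w / w.sum()

def max_entropy_given_mean(mu, scores):
    """Find xi so that the Gibbs distribution has the specified mean outcome."""
    f = lambda xi: np.dot(gibbs_probs(xi, scores), scores) - mu
    return gibbs_probs(brentq(f, -20, 20), scores)

scores = [-2, -1, 0, 1, 2]
print(np.round(max_entropy_given_mean(0.0, scores), 3))   # uniform: 0.2 in each category
print(np.round(max_entropy_given_mean(1.0, scores), 3))   # about {0.048, 0.084, 0.148, 0.261, 0.459}
```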
11.5.7 Validity
My discussion of translational trials is nontraditional with respect to the development paradigm, and also deviates from the usual statistical emphasis on precision of parameter estimation. As far as I know, there is not yet sufficient experience with the suggested design concepts to be completely comfortable about where they are serviceable. But they do seem to be appropriately reverse engineered from the instinctive behavior of translational investigators. It is important to be aware of the deficiencies of these types of designs. Limitations include lack of proven clinical validity for the outcome and the poor statistical properties of success or failure estimates. Questions about validity lead to the Fifth principle of translational research: You cannot validate a signal and draw a therapeutic conclusion from the same study (principle of reductionism).
(One cannot generate a hypothesis and test that hypothesis in the same study.) Although some reflections on the validity of the signal are always an element of a translational trial, it’s important to recognize the similarities between the signal and surrogate outcomes, as discussed in Chapter 5. In particular, validating the irrefutable signal in the context of one therapy does not automatically guarantee that it is valid for other treatments. Thus we must articulate the Sixth principle of translational research: The validity of a signal (or biomarker) is conditional on a specific therapy. Every new treatment, in theory, requires validation of a given biomarker (principle of sobriety).
Not so obvious is the fact that the translational trial paradigm partially confounds three things: (1) the correctness of the disease paradigm, (2) the selection of a relevant biological outcome, and (3) the action of the therapy. Errors in any of these components can masquerade as either a positive or a negative treatment effect. However, the correctness of the disease paradigm and the selection of a relevant outcome can be strongly supported using evidence from earlier studies. This leads to the Seventh principle of translational research: A translational trial confounds three things: the correctness of the disease/treatment model, the validity of the target/biomarker, and the effect of the therapy (principle of confounding).
The study itself cannot separate this confounding. Finally, with regard to the quality of evidence produced under the best of circumstances by a translational trial, the small sample size alone will guarantee relatively weak evidence. Investigators must not ask more of such evidence than it can deliver. Subsequent experiments will always take into account the potential for a mistaken premise arising from an earlier study. This view provides the Eighth principle of translational research: Translational studies yield only weak empirical evidence. Weak evidence is more likely to mislead us than strong evidence, a risk made worse by poor study methodology (weak evidence principle).
11.6 SUMMARY
Translational studies represent the interface between therapeutic ideas emerging from the laboratory and clinical development. A translational clinical trial is a study where the primary outcome (1) is a biological measurement based on a well-established paradigm of disease and (2) represents an irrefutable signal regarding the intended therapeutic effect. The design and purpose of the trial are to guide further experiments in the laboratory or clinic, inform treatment modifications, and validate the target, but not necessarily to provide reliable evidence regarding clinical outcomes. Typically, these studies are small compared to developmental trials and are often performed by resource-limited investigators.

Translational trials are usually done in a setting where there is substantial uncertainty regarding the biological effects of the treatment. Hence, a relatively small trial can provide a critical gain in information to guide subsequent studies. The gain in information can be formalized using the standard concept of entropy. Small studies overestimate the information gained, with a bias that is a function of the sample size and outcome probabilities. The sample size of a translational trial can be determined partly on the basis of relative reductions in the bias and variance of the estimated entropy.

Translational trials have significant limitations. They rely very heavily on biological knowledge rather than the empirical structure of an experiment. At best, they provide weak evidence regarding therapeutic effect. Finally, errors in the biological model or the validity of the target measurement can invalidate an apparent effect of treatment.
11.7 QUESTIONS FOR DISCUSSION
1. The mean squared error (MSE) is variance + bias². Plot this function of the empirical entropy versus sample size for a few cases of your choice. What does it suggest regarding sample size?
2. Suppose that reducing the variance of 𝐻∗ is more important than reducing the bias (and vice versa). Suggest an appropriate modification for a sample size approach.
12 EARLY DEVELOPMENT AND DOSE-FINDING
12.1 INTRODUCTION

True therapeutic development focuses on clinical outcomes, and begins only after translational trials demonstrate promising biological signals. Early development is usually driven by positive and negative actions of the therapy on organ systems. For example, early medical device trials focus on the clinical consequences of their action rather than on innate function. Drugs and biologicals might be viewed as nano-devices, and mechanisms such as absorption, biodistribution, elimination, and side effects need to be quantified. Surgical therapies are also mechanistic, but it is risk, risk mitigation, side effects, and efficacy rather than technique that must be studied in early development. Early development is not necessarily the first introduction of an agent into humans, but it does include the first trials to employ clinical outcomes.

The wide range of important questions and the paucity of standard designs for early development create a pitfall. Rather than a broad array of flexible early developmental designs, investigators often encase their question in some sort of “phase I” clinical trial, perhaps for no other reason than that it seems like such trials should come first. Uncritically accepting a design stereotype like this can put the necessary developmental goals at risk. Perspective on appropriate designs can be limited by several factors. Drugs tend to dominate thinking about early development, and drug regulation classically encourages a few standard designs. Although phase I oncology studies are only a subset of early developmental trials, they have strongly influenced, if not damaged, originality in areas
outside cancer. The dose versus risk questions for cytotoxic agents are narrow, artificial, and simpler than both general dose optimization questions and the wider needs of early development. Many clinical investigators have difficulty breaking this restricted mindset. In Section 12.6.4, I describe a design that can be used for very general dose-finding problems, and can free us from the phase I rut. That algorithm requires only a quantitative model connecting dose and a measured outcome.

Because early oncology drug trials are stereotypical, relatively highly evolved from decades of experience with cytotoxic agents, and reflect important design principles, I will discuss them in detail [905]. Targeted and precision treatments do not automatically fit the paradigm of cytotoxics development, but old ideas can often be modified or extended to address the needs of newer therapeutics. Overreliance on phase I thinking will not disappear soon. My intent here is to organize early development sufficiently to encourage clear thinking and creativity, and to discourage inferior methods for the necessary purposes. The discussion pertains mostly to drugs and biologicals because there is sufficient complexity there to illustrate the principles. Important lessons regarding focused objectives, risk versus safety, efficiency, and the limitations of minimalist trials are universal.

An important class of early developmental trials comprises those designed to determine an optimal biological dose (OBD) defined in terms of key clinical and biological outcomes. The OBD, described further as a concept in Section 12.3.3, might be characterized in terms of dose and clinical risk, but could also be based on pharmacokinetics, target binding, or efficacy. Aside from finding an OBD, mechanistic trials could similarly select from different administration schedules, estimate pharmacokinetic parameters that relate dose to risk, and measure the frequency of side effects.

The control variables for drugs and biologicals in development are dose, schedule, and combinations. Schedule and combinations are important general therapeutic issues, but dose questions dominate the designs. Early drug development focuses on the relationship between dose and important clinical/biological outcomes, including side effects (toxicity), clinical efficacy, blood or tissue levels of drug and metabolites and their elimination (pharmacokinetics), and actions of the drug on targets within the body (pharmacodynamics). If there is a descriptive theme for early drug development, it is investigating differential biological effects as a function of dose. A ubiquitous goal is to use differential effects to select a good dose for later trials. Studies to accomplish this are dose-ranging trials. Often it is more appropriate to turn the question around and search for a dose that satisfies predefined criteria. Accomplishing that requires a dose-finding design. The difference between dose-ranging and dose-finding may seem subtle, but it has strong consequences for study design.
12.2 BASIC CONCEPTS

12.2.1 Therapeutic Intent
Early development contains some unique ethics issues. One difficulty is that a purpose of some early trials is to cause and study side effects. Such stereotypes suggest that dose-finding trials lack therapeutic intent. When participants have the target disease, personal
benefit and therapeutic intent are tangible even when there is uncertainty about ultimate clinical outcomes. Well-designed therapeutic research yields a favorable risk–benefit balance for the participants. In oncology dose-finding trials, the subjects are always patients with the disease. Outside of oncology, the participants are likely to be “healthy volunteers” who may be paid nominally for participation, and might not have an expectation of personal benefit. The risk must be lowered in such settings.

Cancer patients who participate in dose-finding or phase I trials are often out of therapeutic options or terminally ill, which carries implications for obtaining valid informed consent. These concerns are balanced by the fact that patients participate in these trials with new agents only when there are no other proven therapeutic options. In these higher risk circumstances, informed consent documents are key, and participants usually consider their decisions carefully with family members. Perhaps altruism is an important motivation, but there are also participants who benefit tangibly.

Publications from NIH indicate that, over recent decades, oncology dose-finding trials yielded a reasonable degree of benefit [736]. Response rates up to 17% were seen in studies that also included an agent known to be effective. Classic single new agent trials yield response rates under 5%. The safety of these trials is similar to that seen historically, with serious toxicities occurring in about 14% of subjects. Deaths due to toxic events occur with a frequency of 0.5%. Despite their focus on toxicity and safety, these trials remain an appropriate option for patients with many types of cancer.

Early developmental trials employing targeted therapies generally have therapeutic intent. This is especially true for the new class of agents specifically targeted for mechanisms in the disease pathway. Although it is not possible to make such statements without exception, the stereotype regarding lack of therapeutic intent mostly belonged to an earlier era in clinical investigation.
12.2.2 Feasibility
Establishing feasibility of administering a new treatment is a common early developmental goal. Infeasibility can result from problems with subject eligibility, recruitment, dosing, biodistribution, side effects, logistics, manufacturing, subject retention, and many other trial stoppers. None of us will try to develop a treatment that is infeasible by these or other measures. Troublesome aspects of feasibility goals are (1) providing a precise definition for infeasibility in the clinical setting, (2) establishing explicit treatment failure criteria for individual participants and for the trial cohort as a whole, and (3) working around the fact that feasibility may be uninformative about efficacy. For example, suppose that an individualized cell-based therapeutic vaccine has to be manufactured and administered. It would be infeasible for a given trial participant if a sufficient number of immune cells could not be obtained or cultured. That would be the basis for a dichotomous success or failure outcome for each study subject. In the cohort, we might judge the treatment to be infeasible if more than 20% of subjects fail to have an adequate vaccine prepared, for example. Investigators have to be prepared to terminate or suspend development on the basis of appropriately chosen cohort criteria. Feasibility criteria can be qualitative but still well defined.
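Because the cohort-level criterion above is a simple binomial rule, its behavior can be examined directly. The following is a minimal sketch (not from the text); the 20% threshold comes from the vaccine example, while the cohort size of 20 and the candidate true failure probabilities are assumptions chosen only for illustration.

```python
# Minimal sketch: chance of declaring a treatment infeasible under a
# "more than 20% of subjects fail manufacture" rule, for a fixed cohort size.
from math import comb

def prob_declared_infeasible(n, threshold, p_fail):
    """Probability that more than threshold*n subjects fail, assuming
    independent per-subject failures with probability p_fail."""
    cutoff = int(threshold * n)  # declare infeasible if failures exceed this count
    return sum(comb(n, k) * p_fail**k * (1 - p_fail)**(n - k)
               for k in range(cutoff + 1, n + 1))

n = 20  # assumed cohort size
for p_fail in (0.05, 0.10, 0.20, 0.30, 0.40):
    print(f"true failure rate {p_fail:.2f}: "
          f"P(declare infeasible) = {prob_declared_infeasible(n, 0.20, p_fail):.2f}")
```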
A treatment can appear feasible, but generate little evidence regarding efficacy. Similarly a clinically appropriate dose could be identified without producing an efficacy signal. As a general rule however, dose and efficacy seem to be more tightly coupled than feasibility and efficacy. Even though a therapy must be feasible to be effective, feasibility may not be the strongest early developmental question. Often feasibility criteria are set up to be straw dogs. Can we recruit enough subjects with the required characteristics? Can we administer this agent in this new population (even though we can administer it to other types of patients)? Can we measure some substitute outcome? Such questions appear disingenuous when their answers are, or should be, already known. The setting in which a trial takes place, experience with similar patient populations, or small modifications of standard procedures are often very telling in this regard. Fake feasibility questions damage clinical protocols and grant applications alike. A feasibility question should have the same importance and legitimate uncertainty as any other scientific objective. Investigators should also provide a rationale for why a clinical trial rather than a less complex study design is required to answer a given feasibility question.
12.2.3 Dose versus Efficacy
A theme of this chapter is the relative accessibility of questions addressing dose versus risk for modest sized trials. Sometimes, however, dose must be established on the basis of efficacy or its surrogate, as when we prefer the most efficacious of several doses. Detecting differential efficacy is a familiar problem in clinical trials, since it is essentially the same as comparing different treatments. There are, however, a few differences between designs that compare two or more treatments and those that select the most efficacious dose, and these differences might keep dose–efficacy trials from being too large. When we compare treatment A to treatment B, we are typically looking for a clinically significant difference to be statistically significant. Selecting the most efficacious dose is simpler if differences are not required to be statistically significant. Selecting a winner is statistically easier than proving that differences exceed a specified magnitude. Selection designs are discussed in Sections 13.6 and 13.7.3, where the considerable savings in sample size compared to superiority questions are explained.

A further advantage in efficiency can be gained in dose–efficacy studies when the treatments are incremental doses across which we would expect a trend. Detecting a trend and its maximum can be accomplished more easily than detecting a prespecified difference. When assessing a trend, dose plays a role like time in a longitudinal design. The usual longitudinal design follows measurements across time, but the design and analysis issues are similar if we follow outcomes across doses.

One adaptive dose-finding design was proposed to address dose–efficacy questions while preserving safety [148]. This design carried two dose stages to assess efficacy. Operating characteristics determined by the authors demonstrated that this strategy requires many more participants than the typical dose-finding trial to address the additional goals. In an actual application, the trial ended with the first stage when the low and medium doses failed to show sufficient superiority over placebo. The highest dose from a possible second stage was never tested. Among other lessons, this experience illustrates the complexities of dose–efficacy questions, many of which would also be present in fixed design (nonadaptive) trials.
12.3 ESSENTIAL CONCEPTS FOR DOSE VERSUS RISK

12.3.1 What Does the Terminology Mean?
A foundation for dose-finding is the relationship between dose and response, particularly a risk response. The risk indicator is often toxicity, usually measured on a probability scale. But responses can also be measured effects on a target, or efficacy outcomes. Dose may refer literally to a drug, say, on a milligram basis, but it is also a general concept that can refer to amount, number, intensity, or exposure, depending on the therapy. We never need to assume that dose is measured on a linear scale.

Investigating the relationship between dose and risk is variously called phase I, dose-escalation, dose-ranging, or dose-finding. These terms are not equivalent, and I distinguish between them in Table 12.1. The term phase I has become so diffuse as to be nearly useless. I will use the term only narrowly, to refer to dose-finding or dose-ranging studies for cytotoxic agents. The phase I concept is often overgeneralized to say that such trials are the first studies in which a new drug is administered to human subjects. A similar mistake labels as phase I any early developmental design, such as a translational trial.

In oncology, phase I studies investigate toxicity and organ systems, establish an optimal biological dose, estimate pharmacokinetics, and assess tolerability and feasibility of the treatment. Secondarily they assess evidence for efficacy, investigate the relation between pharmacokinetics and pharmacodynamics of the drug, and investigate targeting [1525]. Not all these goals can be met completely in any phase I trial, in part because the number of subjects treated is usually small. However, well-conducted phase I studies can achieve substantial progress toward each of these goals, and simultaneously provide therapeutic alternatives for some patients [30]. Phase I trials are not synonymous with dose–response studies, but they have many characteristics in common. For a discussion of dose–response designs, see Cheung [255], Ruberg [1302, 1303], or Wong and Lachenbruch [1577]. Some statistical and ethical issues are discussed by Ratain et al. [1249].
TABLE 12.1 Characteristics of Early Developmental Trials

Treatment mechanism: Early developmental trial that investigates mechanism of treatment effect. Pharmacokinetics is an example of mechanism in dose–safety trials.
Phase I: Imprecise term for dose-ranging designs in oncology, especially for cytotoxic drugs. Such studies are a subset of dose-ranging designs.
Dose-escalation: Design or component of a design that specifies methods for increases in dose for subsequent subjects.
Dose-ranging: Design that tests some or all of a prespecified set of doses, i.e., uses fixed design points.
Dose-finding: Design that titrates dose to a prespecified optimum based on biological or clinical considerations, such as frequency of side effects or degree of efficacy.

Dose-ranging refers to designs that specify in advance the design points or doses to be employed in the investigation. In dose-ranging, simple decision rules are used for
moving from level to level based on outcomes in small subject cohorts. Some criteria based on safety (risk events) are also given for ending the dose jumps and selecting one dose for subsequent studies. Even when quantitative decision rules are employed for escalating, deescalating, or terminating, dose-ranging is not a true optimization because of its typically poor statistical characteristics discussed below. But these primitive designs are the most common type employed for addressing early developmental dose questions for both drugs and biologicals. There is no universal dose-ranging design, but some frequently used designs are discussed below (see also Storer [1453]).

I will use the term dose-finding for designs that attempt to locate an optimal biological dose (OBD) based on defined criteria. Dose-finding designs iteratively employ doses that are themselves determined as intermediate outcomes of the experiment, and they are often chosen from a continuum rather than from a small prespecified set. By this definition, only a few dose-finding designs exist. Dose-ranging operationally tries a small set of doses that are specified in advance, whereas dose-finding employs an algorithm that queries a large or infinite number of doses. Thus only the latter can truly optimize or titrate dose to a specific outcome criterion.

Occasionally it is claimed that doses must be discrete because medication is packaged or prepared that way by manufacturers. Drug packaging is a balance between recommended dosing and profitability. Aside from that, liquid (intravenous) formulations lend themselves perfectly to continuous dosing. Pills can be powdered and re-encapsulated by nearly any research pharmacy. Frequent changes in dosing or formulation would be an issue in multicenter trials with a single drug supplier, even if the shifts were among standard manufactured formulations. For dose-finding trials in institutions with research pharmacists, the issue is not a major problem and should not drive study design.
12.3.2 Distinguish Dose–Risk From Dose–Efficacy
Acute adverse events relate to exposure, which is a consequence of absorption, distribution, metabolism, and elimination of the drug. In turn, these depend on dose. Nearly every drug can produce unwanted acute effects if administered at high enough doses. The chain of causality from dose to pharmacokinetics to acute adverse events can render a drug unusable, and can often be seen early in the drug exposure or post-exposure period using appropriately designed short-term studies. Consequently, the relationship between dose and risk should be addressed early in development. The dose–risk relationship may be more important than the dose–efficacy one, meaning that efficacy is irrelevant if the dose (or treatment) is unsafe.

The connection between dose and efficacy tends to be unreliable early in development whether using short- or long-term outcomes. Also, efficacy may be slow to show itself because its principal manifestation might be the absence or slowing of untoward events, especially in studies of chronic disease. For these reasons we generally try to resolve questions of risk prior to those of efficacy. Aside from the sequencing of the questions, it is important to appreciate how the distinction produces different trial designs. For reasons explained in Section 13.6, there is no dose–efficacy methodology strongly analogous to the dose–safety designs discussed here. Dose–efficacy questions are innately comparative in the same way that treatment comparisons are, and consequently are expensive.

Agents with minimal toxicity can be exceptions to this principle of development ordering. Prior knowledge of safety will likely lead immediately to dose–efficacy testing.
Examples are vaccines or other putative preventive agents such as vitamins or trace elements, which are likely to be safe even when given in slight excess. Then dose–efficacy may be a more imperative question than dose–risk. Even if we recognize such a re-ordering of developmental objectives, study designs will be strongly different. For preventives, safety questions can persist late in development because of the need to assess the frequency of uncommon events. This leads to large, long trials. Nearly always we will have low tolerance for side effects when testing or using a preventive in an at-risk but disease-free cohort.

12.3.3 Dose Optimality Is a Design Definition
Before a clinical investigator can design dose-finding trials, he or she must have some notion of what characterizes an optimum dose. For example, the best dose might be the one with the highest therapeutic index (maximal separation between benefit and risk). In other circumstances, the best dose might be one that maximizes benefit, provided that the risk is below some prespecified threshold. I refer to the broad concept of an ideal dose as optimality, specifically the optimal biological dose (OBD) or just optimal dose. Apart from the exact milligrams or concentration to employ, the OBD needs to be characterized conceptually as part of the study design. The OBD also depends on the clinical circumstances and purposes of treatment.

OBD Examples

The optimal dose of a new analgesic might be the lowest dose that completely relieves mild to moderate pain in 90% of recipients. This is a type of minimum effective dose (MED). In contrast, the optimal dose of a new antibiotic to treat serious infections and discourage resistance might be the highest dose that causes major side effects in less than 5% of subjects. This is a type of maximum nontoxic dose (MND). For cytotoxic drugs to shrink tumors, it has been thought historically that more is better, leading investigators to push doses as high as can be tolerated. We might choose the dose that yields serious (even life-threatening) but reversible toxicity in no more than 30% of subjects. This is a maximum tolerated dose (MTD). The optimal dose for development of a molecular targeted agent might be the dose that suppresses 99% of the target activity in at least 90% of subjects. Until definitive clinical data are obtained, this might be viewed as a most likely to succeed dose (MLS).

For drug combinations or adjuvants, the optimal joint dose might require a substantially more complex definition and a correspondingly more intricate design, depending for example on interactions or synergy. For specific agents, there is no guarantee that any of these optima actually exist or have acceptable safety profiles. Whatever the clinical circumstances, the nature of the OBD must be defined explicitly or dose-finding will be ambiguous at best.

12.3.4 Unavoidable Subjectivity
Any design used for early developmental questions requires a number of subjective judgments. In classic dose-ranging studies, subjectivity enters in (1) the design points chosen to study, including the starting dose, (2) the between-subject or cohort escalation/deescalation rules, (3) the assessment and attribution of side effects, critical to connecting dose and safety, (4) decisions to employ within-subject dose increases
(although the data from higher doses in the same subject is often disregarded in escalation algorithms), (5) reacting to unexpected side effects, and (6) the size of the cohort treated at each dose. Subjectivity is absolutely unavoidable early in development, but it is often denied or overlooked when investigators design trials based on tradition or dogma. Some dose-finding designs, such as the continual reassessment method (CRM), encapsulate subjectivity explicitly. This means that some of the formal components of the design that require subjective assessment are made explicit and self-contained. A standard implementation of the CRM is a Bayesian method and uses a prior probability distribution for unknown dose–toxicity model parameters. Because of the overt role of subjectivity in the CRM, the method is sometimes criticized, when in fact the same characteristics are present in other designs where the subjectivity is more diffuse and obscure. The CRM is discussed in detail below.
12.3.5 Sample Size Is an Outcome of Dose-Finding Studies
Usually it is not possible to specify the exact size of a dose-finding clinical trial in advance. Except in bioassay or other animal experiments, the size of the dose-finding trial depends on the outcomes observed. If the starting dose has been chosen conservatively, the trial will require more subjects than if the early doses are near the MTD. Because the early doses are usually chosen conservatively, most dose-ranging trials are less efficient (larger) than they could be if more were known about the MTD.

Regardless of the specific design used for a dose-ranging trial, sample size, power, and the type I error rate are not the usual currency of concern regarding the properties of the study. Instead, such studies are dominated by clinical concerns relating to pharmacokinetic parameters, properties of the drug, and choosing a dose to use in later developmental trials. In the last few years, investigators have been paying more attention to the quantitative properties of dose-ranging designs, particularly how well they inform us about the true dose–toxicity or dose–response function for the agent.

For dose-finding (DF) trials, sample size is usually not a major statistical design concern. Convincing evidence regarding the relationship between dose and risk can often be obtained after studying only a few subjects. The best dose-finding designs are response adaptive; that is, they do not employ a fixed sample size. The number of study subjects actually entered into this type of trial, although small, is not a fixed design parameter but an outcome of the study. This point will be made clearer in Section 12.5.2. However, small sample size is not a rationale for conducting the entire dose-finding study informally.
12.3.6 Idealized Dose-Finding Design
The ideal design for a DF study in humans might resemble a bioassay experiment. The investigator would employ a range of doses, 𝐷1 … 𝐷𝑘 , bracketing the anticipated optimal one. Trial participants would be randomly assigned to one of the dose levels, with each level tested in 𝑛 subjects. At the completion of the experiment, the probabilities of response (or toxicity) would be calculated as 𝑝1 = 𝑟1 ∕𝑛1 , … , 𝑝𝑘 = 𝑟𝑘 ∕𝑛𝑘 , where 𝑟𝑖 is the number of responses at the 𝑖th dose. Then the dose–response curve could be modeled with an appropriate mathematical form and fit to the observed probabilities. Model fitting facilitates interpolation because optima may not reside at one of the design points used.
Estimating the MTD would not depend strongly on the specific dose–response model used. To implement this design, one would have only to choose an appropriate sample size, a mathematical form for the dose–response model, and a range of doses. Designs such as this are used extensively in quantitative bioassay [625]. Because experimental subjects would be randomly assigned to doses, there would be no confounding of dose with time, as would occur if doses were tested in increasing or decreasing order. One could even use permuted dose blocks of size 𝑘 (or a multiple of 𝑘) containing one of each dose level.

One of the advantages of the ideal design is the ability to estimate reliably and without bias the dose of drug associated with a specific probability of response. This is different from estimating the probability of response associated with a specific dose employed in the design. For example, suppose that investigators are interested in using a dose of drug that yields a 50% response. One would estimate this from the estimated dose–response function after data fitting in the following way. Assume that the relationship between dose, 𝑑, and probability of response is 𝑓(𝑑; 𝜃), where 𝜃 is a vector of parameters to be estimated from data. The target dose is 𝑑50, such that 0.5 = 𝑓(𝑑50; 𝜃̂), where 𝜃̂ is an estimate of 𝜃 obtained by curve fitting, maximum likelihood, Bayesian methods, or some other appropriate method. In other words,

d_{50} = f^{-1}(0.5; \hat{\theta}),

where 𝑓⁻¹(𝑝; 𝜃) is the inverse function of 𝑓(𝑑; 𝜃). For suitable choices of the model 𝑓(𝑑; 𝜃), the inverse will be simple, and the ideal design will assure a good estimate of 𝜃.

For example, for a two-parameter logistic dose–response model, the probability of response, 𝑝, is a function of dose,

p = f(d; \theta) = \frac{1}{1 + e^{-\beta_0 (d - \beta_1)}},    (12.1)

where 𝜃 = {𝛽0, 𝛽1}. In this parameterization, 𝛽1 is the dose associated with a response probability of 50%. A concise way to write this model is

\mathrm{logit}(p) = -\beta_0 (d - \beta_1),    (12.2)

similar to ordinary logistic regression. The inverse probability function is the dose function, or

d = f^{-1}(p; \theta) = \beta_1 + \frac{\mathrm{logit}(p)}{-\beta_0}.    (12.3)

This model will be important in dose-finding designs discussed below. Another widely used mathematical model for the relationship between dose and response is the probit model,

f(d; \theta) = \int_{-\infty}^{(d-\mu)/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx,    (12.4)

where 𝜃 = {𝜇, 𝜎}. This model is complicated by the integral form for 𝑓. Nevertheless, the inverse function can be found using approximations or numerical methods.
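As an illustration of this ideal analysis, the sketch below (not from the text) fits the two-parameter logistic model of equation (12.1) to hypothetical response counts at fixed design points by maximum likelihood and then inverts the fitted curve to estimate the dose giving a 50% response. The doses, cohort sizes, and response counts are invented for the example.

```python
# Minimal sketch: maximum likelihood fit of the two-parameter logistic model in (12.1)
# to bioassay-style data, followed by inversion to estimate d50.
import numpy as np
from scipy.optimize import minimize

doses = np.array([10., 20., 40., 60., 80., 100.])  # hypothetical design points
n = np.full(6, 10)                                  # subjects per dose (hypothetical)
r = np.array([0, 1, 3, 6, 8, 10])                   # responses per dose (hypothetical)

def neg_log_lik(theta):
    b0, b1 = theta
    p = 1.0 / (1.0 + np.exp(-b0 * (doses - b1)))    # equation (12.1)
    p = np.clip(p, 1e-10, 1 - 1e-10)                # guard the log terms
    return -np.sum(r * np.log(p) + (n - r) * np.log(1 - p))

fit = minimize(neg_log_lik, x0=[0.1, 50.0], method="Nelder-Mead")
b0_hat, b1_hat = fit.x

# In this parameterization beta1 is itself the dose giving a 50% response; more
# generally, d_p = beta1 + logit(p)/beta0 inverts the fitted curve for any target p.
p_target = 0.5
d_target = b1_hat + np.log(p_target / (1 - p_target)) / b0_hat
print(f"beta0 = {b0_hat:.3f}, beta1 = {b1_hat:.2f}, estimated d50 = {d_target:.1f}")
```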
12.4 DOSE-RANGING
The ideal bioassay-type design is not feasible in human studies because ethics considerations require us to treat subjects at lower doses before administering higher doses. Also we require designs that minimize the number of subjects treated at both low, ineffective doses and at excessively high, toxic doses. Dose-ranging designs strike a compromise among the various competing needs of ethics, safety, simplicity, and dose optimization.

The components of a dose-ranging design are (1) specification of the study population, (2) selection of a starting dose, (3) specification of either dose increments or exact doses (design points) to test, (4) cohort sizes at each dose, (5) definition of dose-limiting effects, (6) decision rules for escalation, deescalation, or visiting a new dose, and (7) criteria for recommending one of the doses tested. Again, the choices for these design features are subjective; good experiment design does not mandate specific values for any of them. There are certain values that are often used, and this results in them sometimes being mistaken for necessary or required values. One example is a cohort size equal to three subjects, which is common but by no means mandatory. Optimal methods emerge only when additional goals or efficiencies are brought to bear on these simple designs. The starting dose, range, and design points are selected based on preclinical data and experience with similar drugs or the same drug in different patient populations. One method for choosing a starting dose is based on one-tenth of the dose that causes 10% mortality in rodents on a milligram per kilogram (mg/kg) basis, the so-called 𝐿𝐷10.

Properly used, dose-ranging is an appropriate, flexible, and convenient strategy to acquire knowledge about a new therapy. It can be defended most effectively from a purely clinical or operational perspective. Dose-ranging designs usually cannot resolve small differences between doses. Hence, they are most appropriate when the dose question is not sharp, meaning that doses within a broad range are nearly equivalent with regard to the outcomes being assessed. Investigators would then be comfortable recommending a single dose based mostly on clinical judgments from an operationally simple design. If it is necessary to select a true dose optimum reliably, the model-based adaptive designs discussed below will be required.

It should not come as a surprise that the wide latitude in dose-ranging design options does not yield broadly optimal behavior. The bane of dose-ranging is when this design is taken to be optimizing, which actually happens nearly all the time. Special claims are often made for the dose that emerges from a dose-ranging trial, for example that a given dose is safe, or that it is the maximum tolerated dose; such claims cannot be supported by the informal nature of dose-ranging designs. Reasons for this caution are given below, and better designs for optimality will also be suggested.
12.4.1 Some Historical Designs
A classic design based on this method was proposed by Dixon and Mood [382]. Their exact method is not used routinely, however. In most dose-escalation designs, investigators do not fit the probit (cumulative normal) dose-response model as originally proposed, and subjects are often treated in groups of 3 rather than individually before deciding to change the dose. This grouped design was suggested by Wetherill [1544] and a widely used modification by Storer [1453]. Bayesian methods have been suggested [569] as have those
that use graded responses [617]. Dose increments of this type may not be suitable for studies of vaccines and biologicals, where logarithmic dose increments may be needed.

Another important concept is illustrated by the stochastic approximation method [1273]. As with other approaches, fixed dose levels are chosen, and groups of subjects can be used. If we denote the 𝑖th dose level by 𝐷𝑖 and the 𝑖th response by 𝑌𝑖, we then choose 𝐷𝑖+1 = 𝐷𝑖 − 𝑎𝑖[𝑌𝑖 − 𝜃], where 𝜃 is the target response and 𝑎𝑖 is a sequence of positive numbers converging to 0. For example, we might choose the 𝑎𝑖 such that 𝑎𝑖 = 𝑐∕𝑖, where 𝑐 is the inverse of the slope of the dose–toxicity curve. It is clear that when the response 𝑌𝑖 is close to 𝜃, the dose level does not change (this is why the 𝑎𝑖 are chosen to be decreasing in absolute value). The recommended dose is the last design point.

This design has not been widely used in clinical trials, but it illustrates some important principles. It has prespecified doses but might be viewed as dose-finding because of its titration to the target 𝜃. The dose–response model is local, in the sense that only the immediately previous dose is used at any step, and it is linear. Thus the model has only to be locally approximately correct and the algorithm will perform appropriately. These ideas will be important later in discussing true dose-finding designs.
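A minimal sketch (not from the text) of the stochastic approximation update described above follows. The hypothetical dose–toxicity curve, starting dose, constant c, cohort size, and number of steps are assumptions used only to show how the iteration titrates toward the target response θ.

```python
# Minimal sketch of the stochastic approximation update D_{i+1} = D_i - a_i*(Y_i - theta),
# with a_i = c/i, applied to a hypothetical dose-toxicity curve.
import numpy as np

rng = np.random.default_rng(seed=3)

def true_tox_prob(dose):
    """Hypothetical underlying dose-toxicity curve (logistic in dose)."""
    return 1.0 / (1.0 + np.exp(-0.08 * (dose - 60.0)))

theta = 0.30     # target probability of toxicity
dose = 20.0      # conservative starting dose (assumption)
c = 100.0        # roughly the inverse slope of the curve near the target (assumption)
cohort = 3       # subjects treated at each step

for i in range(1, 16):
    tox = rng.binomial(cohort, true_tox_prob(dose))
    y = tox / cohort                       # observed toxicity rate at this dose
    dose = dose - (c / i) * (y - theta)    # Robbins-Monro step toward the target
    print(f"step {i:2d}: observed rate {y:.2f}, next dose {dose:6.1f}")
```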
12.4.2 Typical Dose-Ranging Design
A common technique for specifying the range of doses to be tested is the Fibonacci scheme. Fibonacci (Leonardo of Pisa) was a thirteenth-century Italian number theorist for whom the sequence of numbers 1, 1, 2, 3, 5, 8, 13, 21, … was named, where each term is the sum of the preceding two terms. The ratio of successive terms approaches (√5 − 1)∕2 = 0.61803…, the so-called golden ratio, which has some optimal properties with regard to mathematical searches [1558]. An example of how this scheme might be used to construct dose levels in a dose-ranging trial is shown in Table 12.2. The idealized doses from the sequence are usually modified to increase more slowly, hence the name modified Fibonacci scheme.

TABLE 12.2 Fibonacci Dose-Escalation Scheme (with Modification) for Dose-Finding Trials

Step   Ideal Dose   Actual Dose   Percent Increment
1      D            D             –
2      2 × D        2 × D         100
3      3 × D        3.3 × D       67
4      5 × D        5 × D         50
5      8 × D        7 × D         40
6      13 × D       9 × D         29
7      21 × D       12 × D        33
8      34 × D       16 × D        33

D represents the starting dose employed and is selected based on preclinical information.

In cytotoxic drug development, a dose-limiting toxicity (DLT) is a serious or life-threatening side effect. Death is also dose limiting. However, most DLTs are reversible side effects. The exact definition of a DLT depends on the clinical setting and the nature of the therapy. For non-life-threatening diseases and safe therapies, tolerance for side effects might be quite low. Headache, gastrointestinal side effects, and skin rash, for example, might be substantial dose limitations for many drugs to treat self-limited conditions.
TABLE 12.3 Simulated Phase I Study for Example 12.1 Using Modified Fibonacci Dose-Escalation Scheme

Step   Dose Used   Patients Treated   Number of Responses
1      100         3                  0
2      200         3                  0
3      330         3                  0
4      500         3                  0
5      700         3                  0
6      900         3                  1
7      700         3                  1
⋮      ⋮           ⋮                  ⋮

The MTD is operationally defined as the 700 unit dose.
Rather than treating a fixed number of subjects at every dose, as in bioassay, dose-ranging studies are adaptive. This means that the decision to use a next higher or lower dose depends on the results observed at the current dose, according to the prespecified rules. If unacceptable side effects or toxicity are seen, higher doses will not be used. Conversely, if no DLTs or side effects are seen, the next higher dose will be tested.

Example 12.1. An example of a dose-ranging trial using Fibonacci escalations, cohorts of size 3, and simple decision rules is shown in Table 12.3.

Very frequently this basic design is implemented with cohort sizes of 3 subjects, with some outcome conditions calling for 3 additional subjects at the same dose. This explains the name 3+3 design. There are many variations and flexibilities suggested to tailor the method to specific clinical settings. Differences between Fibonacci, modified Fibonacci, 3+3, up-and-down, accelerated titration, and similar designs are relatively small and unimportant for this discussion.

Although these designs are operationally simple, their ease of use is overshadowed by poor performance with respect to the intended purposes. In particular, these methods are often used to identify an OBD such as the maximum tolerated dose (MTD). As a rule, doses are not escalated beyond one that causes 2 DLTs out of 3 subjects, and the MTD is taken to be the next lower dose that yields 1 DLT out of 3 subjects. The inferential mistake is to conclude that we have then identified the dose associated with a 33% chance of DLT, which often has clinical importance. In short, these designs do not determine any OBD reliably, for reasons that will be made clear in the following discussion.
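The unreliability alluded to here can be seen directly by simulating the decision rules. The following is a minimal simulation sketch (not from the text): it applies the escalate/expand/stop rules described above to an assumed set of true DLT probabilities and tabulates how often each dose level is selected as the "MTD". The true probabilities, number of levels, and replicate count are assumptions.

```python
# Minimal simulation sketch of 3+3-style rules: escalate on 0/3 DLTs, expand on 1/3,
# stop on 2 or more, and declare the next lower dose the MTD.
import numpy as np

rng = np.random.default_rng(seed=7)
true_dlt = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60]   # hypothetical DLT probability per level

def run_3plus3(p):
    for level, prob in enumerate(p):
        dlts = rng.binomial(3, prob)
        if dlts == 1:
            dlts += rng.binomial(3, prob)   # expand the cohort by 3
            if dlts == 1:                   # no additional DLTs in the expansion
                continue                    # escalate
            return level - 1                # 2+ DLTs among 6: stop; MTD is previous level
        if dlts >= 2:
            return level - 1                # stop; MTD is previous level
    return len(p) - 1                       # never stopped: highest level tested

selected = [run_3plus3(true_dlt) for _ in range(5000)]
levels, counts = np.unique(selected, return_counts=True)
for lev, cnt in zip(levels, counts):
    label = "below lowest dose" if lev < 0 else f"dose level {lev + 1}"
    print(f"{label}: selected in {cnt / len(selected):.1%} of simulated trials")
```

Repeating the simulation with different assumed probability curves shows how widely the selected level scatters, which is the variability and bias discussed next.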
12.4.3 Operating Characteristics Can Be Calculated
The principal weakness of the Fibonacci dose-escalation, 3+3, and related designs can be seen in their operating characteristic (OC), which is the expected performance of the design under specified conditions. A good design should have a high probability of
terminating at or near the optimal dose, such as the true MTD, regardless of the design points chosen for study. The cumulative probability of stopping before the true MTD should be low, as should the probability of escalating beyond the MTD. A poor design will have a high chance of stopping at a dose other than the true MTD (bias), at a highly variable stopping point, or both.

We can already anticipate that dose-ranging designs will have some difficulties with bias. One reason is their sequential nature and the possibility of stopping early by chance. Second, when investigators choose the doses to test, it is highly unlikely that they will include the unknown true MTD as a design point. The study is required to stop on one of the design points, guaranteeing that it will not stop at the true MTD.

Aside from heuristic arguments, the OC can be calculated for designs and decision rules like those put forward in Section 12.4.2. The OC for more complex designs can always be determined by simulation. From the decision rules we can calculate the conditional probability of passing from the current dose to the next higher dose, and the unconditional probability of passing any given dose. Such calculations will yield the chance of stopping at or before a given dose, the operating characteristic (OC).

Denote the binomial probability of observing 𝑘 responses (toxicities) out of 𝑛 subjects as 𝑏(𝑘; 𝑛, 𝑝), where the probability of response is 𝑝. Then define

B(a, c; n, p) = \sum_{i=a}^{c} b(i; n, p)

to be the cumulative binomial mass function between 𝑎 and 𝑐 inclusive, with 0 ≤ 𝑎 ≤ 𝑐 ≤ 𝑛. 𝐵 represents the probability of having between 𝑎 and 𝑐 responses out of 𝑛 subjects. Suppose that the true probability of response at the 𝑖th dose is 𝑝𝑖. The 𝑖th cohort size is 𝑛𝑖, and the observed number of responses (toxicities) is 𝑟𝑖. The decision rule is to escalate if 𝑟𝑖 ≤ 𝑢𝑖 and to stop if 𝑟𝑖 ≥ 𝑑𝑖. If 𝑢𝑖 < 𝑟𝑖 < 𝑑𝑖, then 𝑚𝑖 additional subjects will be placed on the same dose. After 𝑚𝑖 subjects are added to the cohort, we escalate if the number of responses in the new cohort is less than or equal to 𝑠𝑖, and otherwise stop. To keep the formulas relatively simple, the design does not incorporate deescalations. Then the conditional probability of escalating past the 𝑖th dose, given that we have just arrived there, is

P_i = B(0, u_i; n_i, p_i) + \sum_{j=u_i+1}^{d_i-1} b(j; n_i, p_i)\, B(0, s_i; m_i, p_i).

In other words, the 𝑖th dose can be passed when it is first visited (first term), or after the cohort is expanded (second term). The chance of deescalating from a given dose to the previous one is

R_i = B(d_i, n_i; n_i, p_i) + \sum_{j=u_i+1}^{d_i-1} b(j; n_i, p_i)\, B(d_i, m_i; m_i, p_i).

To pass the 𝑖th dose, we pass it and do not revisit it from the next dose:

G_i = P_i (1 - R_{i+1}).
The unconditional probability of passing the 𝑖th dose, 𝑄𝑖, depends on what happens at it and lower doses,

Q_i = \prod_{k=1}^{i} P_k,    (12.5)

and the OC is 1 − 𝑄𝑖. The OC does not depend on any doses, only on the underlying probability of toxicity at the doses selected to study (design points). A good OC would demonstrate a high chance of terminating at the optimal dose, independent of the underlying probabilities of the design points. Dose-ranging designs generally have a poor OC. A consequence of the dependency on the exact response probabilities is that different trials for the same drug and question may produce very different estimates.

Numerical Examples

For many dose-ranging designs, all cohorts are the same size and use the same decision rules. For example, cohorts are often taken to be size 3. The dose is escalated if 0 toxicities are observed and stopped if 2 or more are seen. If 1 toxicity is seen, 3 additional subjects are placed on the same dose. Then, if 1 out of 3 in the additional cohort has toxicity, the trial is stopped. Fewer than 1 out of 3 increases the dose to the next level, and more than 1 out of 3 stops. The operating characteristics of this design can be calculated from the general formulas above and equation (12.5). Using the stated cohort sizes and numerical rules, the conditional probability of escalating past the 𝑖th dose, given that the 𝑖th dose has been reached, is

P_i = b(0; 3, p_i) + b(1; 3, p_i)\, b(0; 3, p_i),

where 𝑝𝑖 is the true probability of toxicity at the 𝑖th dose. The 𝑖th dose is passed when there are no toxicities in the first 3 subjects, or when 1 toxicity occurs but there are no additional ones in the expanded cohort. The unconditional probability of passing the 𝑖th dose must account for the probability of passing all earlier doses, meaning it is the product

Q_i = \prod_{k=1}^{i} \left[ b(0; 3, p_k) + b(1; 3, p_k)\, b(0; 3, p_k) \right].

The OC is 1 − 𝑄𝑖. Returning to the simple numerical case given above, the OC can be illustrated under several scenarios. To do so, it is necessary to hypothesize true values for 𝑝1, 𝑝2, …, 𝑝𝑘. Again, the values of the doses are irrelevant; only the response probabilities at the selected doses matter. True probabilities of response for two cases are shown in Figure 12.1, in the order they will be employed in the escalation. For set A, the doses are conservative, that is, they yield a chance of toxicity that increases significantly for later doses. For set B, the probabilities increase steadily. Both of these are plausible based on the way doses are arranged in advance of these experiments.
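The calculation in equation (12.5) is easy to automate. The following is a minimal sketch (not from the text) that evaluates the escalation probabilities and the OC for a hypothetical vector of true toxicity probabilities standing in for a curve like those in Figure 12.1.

```python
# Minimal sketch: operating characteristic of the cohort-of-3 rules via equation (12.5).
import numpy as np
from scipy.stats import binom

p_true = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.70, 0.80])  # hypothetical design points

# P_i = b(0; 3, p_i) + b(1; 3, p_i) * b(0; 3, p_i): pass on 0/3, or 1/3 followed by 0/3 more.
P_pass = binom.pmf(0, 3, p_true) + binom.pmf(1, 3, p_true) * binom.pmf(0, 3, p_true)
Q = np.cumprod(P_pass)   # unconditional probability of passing each dose, equation (12.5)
oc = 1.0 - Q             # operating characteristic: chance of stopping at or before dose i

for i, o in enumerate(oc, start=1):
    print(f"dose {i}: P(pass) = {P_pass[i - 1]:.3f},  P(stop at or before) = {o:.3f}")
```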
FIGURE 12.1 Hypothetical dose–response probabilities.
Assume that the optimal dose is defined as that which yields a 30% chance of toxicity. Then, for the conservative set of probabilities/doses (curve A), the trial should terminate between the 7th and 8th dose with a high probability. For the second set (curve B), the trial should terminate near the 5th dose with a high probability. The calculated OCs for these cases are shown in Figure 12.2. For set A, the expected stopping point is the third dose, and the design has an 85% chance of stopping before the 5th dose (curve A, Fig. 12.2). When such a set of toxicity probabilities is encountered, this design will most likely yield a biased estimate of the MTD. For set B, the expected stopping point is between the 6th and 7th dose, and the design has a nearly 90% chance of stopping before the 8th dose (curve B, Fig. 12.2). Aside from the bias, in both cases the designs are not likely to produce repeatable values for the MTD. This performance is typical for dose-ranging designs when they are pressed into dose-finding service.

12.4.4 Modifications, Strengths, and Weaknesses
Over the years a large number of modifications have been proposed to the basic design sketched above. These modifications are a testament to the flexibility and subjective nature of the design, and to the fact that dose-ranging studies have been primarily the domain of clinical researchers rather than statistical methodologists.

Although limitations are evident, these designs offer simplicity as the main advantage. The decision points and definitions are defined in advance and do not require elaborate statistical considerations. This is an important advantage in the clinic. Sample sizes, often under 20 subjects, allow the study to be completed quickly. Finally, the design is familiar to many drug developers, allowing them to focus on clinical endpoints. The designs are easy to execute, and special methods of analysis are not needed. Furthermore, when modifications are necessary, either during a particular trial or when designing a new study, the changes can often be made by investigators based on their intuition.
FIGURE 12.2 Operating characteristics for designs A and B from Figure 12.1.
Statistical input is avoided even when needed, and many studies are designed and conducted by example. When clinical protocols using such designs are peer reviewed, it is simple to see the doses to be employed and the plans of the investigators.

The principal weaknesses of these designs are the variability and bias in the estimate of the optimal dose. A design that converges quickly to the wrong answer is not useful. Because of their poor performance, these designs should not be used for dose-finding. An additional disadvantage is the illusion of objectivity. Investigators are always required to make subjective judgments about the trial, as discussed earlier. Simple designs used by rote may falsely appear objective.

In recent years various modifications and improvements on the basic Fibonacci dose-ranging design have been proposed. These are generally designed to increase efficiency, particularly to use fewer subjects at low doses where the chance of clinical benefit is lower. A popular example is the accelerated titration method, first proposed as a staged design by Storer [1453, 1454] and refined by Simon [1400], which essentially escalates based on treatment of a single subject at low doses until some toxicity is seen. Such designs are appropriate in many circumstances, especially in oncology drug development, but they retain the fundamental shortcomings of the basic design.
12.5 DOSE-FINDING IS MODEL BASED
Dose-finding studies are conducted with sequentially rising doses in successive cohorts and terminate when certain predefined clinical outcomes are observed. Because of their sequential nature combined with the effects of chance, these designs tend to yield a biased underestimate of the targeted dose. Model-guided dose-finding is not prone to this bias.
The principal model-based dose-finding methods are the continual reassessment method (CRM) and estimation with overdose control (EWOC). The dose–response models built into these methods confer superior operating characteristics compared to primitive designs. Those designs are the methods of choice for dose-escalations of cytotoxic drugs in oncology, for reasons provided below. However, neither EWOC nor the CRM has been as widely employed by clinical investigators as they deserve to be.
12.5.1 Mathematical Models Facilitate Inferences
Previously I indicated that dose-ranging designs are not good at optimization, which is the domain of dose-finding methods. Reliable and efficient identification of a dose that satisfies certain criteria is not a simple task. In fact, it is presently possible only for a subset of dose optimization questions. However, it is remarkable that designs superficially similar to those already discussed can accomplish the goal. The design modifications that permit reliable dose-finding relate to the method described in Section 12.4.1, where a dose–response model was used as the device to guide the titration. To make this idea concrete, a mathematical model for the dose–response relationship can supplement reasonable biological assumptions and data gathered in real time to identify a dose with the intended features. The model embodies other knowledge that is omitted from dose-ranging methods and provides the needed methodologic edge for optimization.

A model that reasonably represents the true state of nature is a powerful descriptive and analytic tool. Provided it is accurate and flexible enough, conclusions will depend minimally on its exact mathematical details. Not only does a model represent information external to the experiment, it also allows more efficient use of internal information. Early dose-ranging trials sometimes employed probit or logit modeling to smooth or interpolate the final data. Probably because of computational difficulties before the widespread availability of computers, routine model-based interpolation was dropped. The resulting simplicity facilitated wide use of these designs by clinical researchers. An unfortunate consequence for oncology studies was that the estimated MTD then had to be the last design point (dose) visited according to the escalation/deescalation rules, rather than a smoothed and interpolated estimate facilitated by the overall model. This contributes to inferior operating characteristics.

The strengths and efficiencies of the continual reassessment method (CRM) follow from its use of a mathematical model to guide the choice of a new dose. The method is inseparable from modeling the dose–toxicity function. The model also provides a formal way for outcomes at every dose to help guide decisions made at any other dose. Dose-ranging cannot do this. Mathematical models that support the design and execution of these trials are distinct from models for analyzing pharmacokinetic data (Chapter 20).
12.5.2 Continual Reassessment Method
The continual reassessment method (CRM) is a dose-finding design that employs an explicit mathematical model describing the relationship between dose and toxicity [1157, 1158, 1160]. A broader view of dose-escalation methods is discussed by Whitehead et al. [1550]. A recent overview is given by Cheung [255]. In the CRM algorithm, the mathematical model is used to fit, interpolate, and smooth acquired data, and to
extrapolate beyond the observed data to the optimal dose. At each point in the escalation, the next cohort of subjects is administered the best estimate of the optimal dose based on all existing data and the model fit and extrapolation. A sigmoidal-shaped dose–response model requires a minimum of two constants or parameters; location and slope is one way to think about the role of the model parameters. As originally described, the CRM is a Bayesian method and therefore functions via probability distributions for the model parameters. These probability distributions represent uncertainty regarding the exact parameter values, which diminishes as information is acquired from treated subjects. Data from each cohort update the estimates of the parameters, the model fit, and the best extrapolated guess for the MTD. The next cohort is treated at what then appears to be the MTD based on all the available data. The process of treating, assessing response, model fitting, and dose estimation is repeated until it converges (no additional dose changes) or until a preset number of subjects have been treated.

The information that investigators need to initiate the CRM is essentially the same as that required to choose a starting dose and design points for a dose-ranging trial. One difference between the CRM and dose-ranging methods is that the CRM does not necessarily specify a limited set of doses at the beginning of the experiment. Once started, it can find the MTD from a continuum of doses by updating the best guess with each cycle. However, the same algorithm can be used to visit and select from among a set of prespecified doses, but this reduces dose-finding efficiency.

The CRM should be understood primarily as a flexible model-based algorithm, whereas dose-ranging is a set-based procedure. The CRM algorithm is formal, quantitative, and best implemented by computer. The set-based procedure of dose-ranging is simple, operationally defined, and usually does not require a computer. The CRM algorithm is unbiased for finding the MTD. Simulations show that the CRM is more efficient (fewer subjects needed to reach the actual MTD) compared to dose-ranging designs. Software by the author to execute the CRM can be found at https://risccweb.csmc.edu/biostats/ and on the book website.
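The following is a minimal sketch (not from the text, and not the author's software) of the Bayesian updating cycle behind the CRM. For simplicity it uses a one-parameter power model on a fixed dose grid with a discrete prior, treats cohorts of 3, and applies no escalation restrictions; the skeleton, prior, target, and simulated "true" toxicity probabilities are all assumptions. A production CRM would typically use the richer dose–toxicity models and safeguards described in the text.

```python
# Minimal CRM sketch: posterior updating of a one-parameter power model,
# p_k(a) = skeleton_k ** exp(a), and assignment of each new cohort to the dose
# whose estimated toxicity is closest to the target.
import numpy as np

rng = np.random.default_rng(seed=11)
target = 0.30
skeleton = np.array([0.05, 0.12, 0.20, 0.30, 0.42])   # prior guesses of DLT probability per dose
true_p = np.array([0.02, 0.06, 0.14, 0.25, 0.40])     # unknown truth, used only to simulate data

a_grid = np.linspace(-3, 3, 301)                        # grid support for the model parameter
prior = np.exp(-0.5 * a_grid**2) / np.sqrt(2 * np.pi)   # N(0,1) prior evaluated on the grid
prior /= prior.sum()

tox = np.zeros(5, dtype=int)   # DLTs observed at each dose so far
n = np.zeros(5, dtype=int)     # subjects treated at each dose so far
dose = 0                       # start at the lowest dose

for cohort in range(8):
    n[dose] += 3
    tox[dose] += rng.binomial(3, true_p[dose])
    # Posterior over a: binomial likelihood of all accumulated data under the power model.
    p_model = skeleton[None, :] ** np.exp(a_grid)[:, None]
    loglik = np.sum(tox * np.log(p_model) + (n - tox) * np.log(1 - p_model), axis=1)
    post = prior * np.exp(loglik - loglik.max())
    post /= post.sum()
    # Posterior mean toxicity at each dose; next cohort gets the dose closest to target.
    p_hat = post @ p_model
    dose = int(np.argmin(np.abs(p_hat - target)))
    print(f"cohort {cohort + 1}: n={n.tolist()}, DLTs={tox.tolist()}, "
          f"estimated p={np.round(p_hat, 2).tolist()}, next dose level {dose + 1}")
```

Each cycle refits the model to all accumulated data, which is the essential loop (treat, assess, refit, re-estimate the dose) illustrated in Example 12.2 below.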
Example 12.2. An example of how the CRM works is shown in Figure 12.3 and Table 12.4. Small technical details are omitted here to focus on the example. Figure 12.3 shows the dose–toxicity models at each step of the CRM iterations with the data in Table 12.4. Assume that we intend to select the dose of a new drug that yields a target probability of 0.3 of toxicity (horizontal dotted lines in Figure 12.3). The postulated dose–response model at the initiation of the trial (which in this case has the same mathematical form as curve A in Fig. 12.4) is specified by the investigators, and explicitly provides a value of 35 for the starting guess for the optimum dose (Fig. 12.3A). Investigators must be satisfied with this model and starting dose, as they are free to adjust the initial parameters of the model to locate it in accord with their intuition and knowledge. Suppose that 3 subjects are treated at the dose of 35 with no dose-limiting toxicity (DLT). This data point and model fit are shown in Figure 12.3B. The model now indicates that the optimal dose is 68. Investigators might be unwilling to administer a dose increment more than 1.5 times the previous dose. Suppose that they use a dose of 50 instead and treat 3 new subjects with no DLTs. This step is shown in Figure 12.3C, and yields a new recommended dose of 75. Investigators now feel that they are close to knowing the MTD, so 4 subjects are treated at 75, with 2 DLTs being observed (Figure 12.3D). The MTD appears to be overshot, and the CRM indicates 70 as the correct dose. Three subjects are treated at 70
FIGURE 12.3 Simulated CRM dose-finding. Each panel shows the current data, updated model fit (solid curve), and the previous iteration’s model (dotted curve).
with 1 DLT (Figure 12.3E). The iterative process has now nearly converged, with the next dose recommendation being 69. Three more subjects are treated at 69, with the total DLT score of, say, 0.9 (Figure 12.3F). The new recommended dose is also 69. Assuming this meets prespecified convergence criteria (e.g., two recommended doses within 10%), investigators could recommend 69 as the MTD based on all the available data. This simple example illustrates some important features of the CRM. The method incorporates a mixture of clinical judgment and statistical rigor. All available data are used at each step, not just the outcomes from the current dose. It is fairly efficient, the process having converged after 5 dose escalations. The CRM requires only a starting dose as opposed to an entire set of doses specified in advance, the estimation method is unbiased, and it does not depend strongly on the starting dose. It was claimed that the CRM has a tendency to treat a larger fraction of subjects at higher doses, increasing the
TABLE 12.4 Simulated Dose-Finding Trial Using the CRM

Step   CRM Dose   Dose Used   Patients Treated   Score of Responses
A      35         35          3                  0
B      68         50          3                  0
C      75         75          4                  2
D      70         70          3                  1
E      69         69          3                  0.9
F      69         –           –                  –

See Figure 12.3 and Example 12.2 for the model fits.
chances for serious toxicity from new drugs [864]. This aggressiveness can be tempered by grouping subjects in cohorts, and limiting the magnitude of escalation steps [615]. Other modifications have been suggested by Faries [445]. More recently a likelihood, rather than a Bayesian, approach has been proposed for the CRM [1159]. Aside from these strong points, there are key flexibilities in the CRM method as listed in Table 12.5. There are a few disadvantages to the CRM. It is complex compared to dose-ranging designs, requiring special computer software and the aid of a statistical expert. In oncology clinical research today these are not serious limitations, but they may be in some other disciplines. An additional potential disadvantage in the CRM can be seen from the perspective of a peer reviewer or regulator. The CRM requires that we review and approve a search algorithm rather than a set of prespecified doses, leaving some uncertainty as to the design points that will be visited during the trial. This uncertainty, although relatively small, is sometimes consequential in the minds of reviewers. Much can be done to alleviate these concerns at the initiation of the study by enumerating all possible outcomes for the first one or two dose escalations. Assuming integer responses only, a fairly small number of outcomes are possible. Reviewers can then examine all the doses likely to be visited early in the escalation, which provides some reassurance that the escalations will be safe.
TABLE 12.5 Flexible Features of the CRM Algorithm

Cohort sizes are freely variable.
A dose continuum, set, or mixture can be used.
Prior information can be represented by “pseudo-data.”
Data corrections and re-fitting can be performed at any time.
Method does not depend on the assumed model.
Method does not depend on the technique for model fitting.
Clinical judgment can override the recommended dose.
Dose scale is irrelevant.
Pharmacokinetic or other values can replace literal doses.
Covariates can be incorporated.
Fractional response scores can and should be used.
Different outcomes can be scored differently.
12.5.3 Pharmacokinetic Measurements Might Be Used to Improve CRM Dose Escalations

There is potential for further increasing the efficiency of the CRM using the information in pharmacokinetic (PK) measurements of the type that are routinely obtained during phase I investigations [1206]. This method may be helpful when the PK measurements, usually derived by fitting compartmental models to time–concentration data, carry information beyond that conveyed by dose because of complexities in drug absorption, distribution, or elimination. For example, suppose that toxicity is thought to be a consequence of both the dose of drug administered and the area under the time–concentration curve (AUC) for blood. For relatively simple physiological distributions of drug, such as one- or two-compartment models with first-order transfer rates, dose of drug and AUC are proportional and therefore carry redundant information (see Section 20.2.2). In such a situation, we would not expect the addition of AUC to an escalation scheme based on dose to improve efficiency. However, if dose and AUC are “uncoupled” because of complex pharmacokinetics, random variability, or other reasons, and AUC is a strong determinant of toxicity, then information carried by AUC about toxicity may improve the efficiency of escalation schemes based only on dose. The CRM facilitates use of ancillary information from AUC or other PK measurements, or replacing literal dose values. We have only to modify the model that connects the probability of toxicity with dose or other predictor. For example, if x_i is a response for the ith subject that takes the value 1 for a toxicity and 0 otherwise, the logistic dose–toxicity function is

\Pr[x_i = 1] = \frac{1}{1 + e^{-\beta_0 - \beta_1 d_i}}, \qquad \text{or} \qquad \mathrm{logit}(p) = \beta_0 + \beta_1 d_i,    (12.6)
where p is the response probability, d_i is the dose of drug administered, and β0 and β1 are the population parameters that characterize the response function. Note that this model is identical to equation (12.2) although the parameterization using β0 and β1 is slightly different to facilitate the next illustration. In one estimation method, β0 is fixed at the start of the dose escalations and β1 is estimated from subject data during the escalations. After sufficient information becomes available, both parameters are estimated from the data. A Bayesian procedure can be used, by which the parameter estimate is taken to be the posterior mean of a distribution based on the hypothesized prior distribution and the likelihood. For example,

\hat{\beta}_1 = \frac{\int_0^{\infty} \beta_1 \, f(\beta_1) \, L(\beta_1) \, d\beta_1}{\int_0^{\infty} f(\beta_1) \, L(\beta_1) \, d\beta_1},

where f(β1) is the prior distribution for β1 and L(β1) is the likelihood function. These Bayesian estimates have desirable properties for the purposes at hand. However, any of several estimation methods could be used.
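A minimal numerical sketch of this posterior-mean calculation follows. The exponential prior, the fixed intercept, and the outcome data are hypothetical choices used only to illustrate the ratio of integrals, here approximated by a simple Riemann sum.

import math

def likelihood(b1, data, beta0=-3.0):
    # Product of Bernoulli terms for (dose, DLT indicator) pairs.
    like = 1.0
    for d, y in data:
        p = 1.0 / (1.0 + math.exp(-(beta0 + b1 * d)))
        like *= p if y else (1.0 - p)
    return like

def posterior_mean_beta1(data, prior=lambda b: math.exp(-b), upper=2.0, n=4000):
    # Approximate the ratio of integrals over (0, upper) by a Riemann sum.
    h = upper / n
    num = den = 0.0
    for i in range(1, n + 1):
        b = i * h
        w = prior(b) * likelihood(b, data)
        num += b * w * h
        den += w * h
    return num / den

data = [(35, 0), (35, 0), (35, 0), (50, 0), (50, 0), (50, 1)]   # hypothetical outcomes
print(round(posterior_mean_beta1(data), 4))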
Using the same formalism, the effect of AUC on probability of response could be incorporated as

\mathrm{logit}(p) = \beta_0 + \beta_1 d_i + \beta_2 \mathrm{AUC}_i,

where β2 is an additional population parameter. This model is simply an extension of equation (12.6). Large AUCs increase the chance of toxicity and low AUCs decrease it in this model. This is probably not the optimal parameterization for incorporating information from AUC, nor is AUC necessarily the best covariate to combine with dose, but the model illustrates the method. Bayesian estimates for the parameters can be obtained as suggested above. An approach similar to this shows the possibility of increasing the efficiency of dose-escalation trials based on statistical simulations [1206]. However, practical applications and experience with this method are currently lacking.
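As a hypothetical illustration of the extended model, the fragment below evaluates the predicted toxicity probability for two subjects who receive the same dose but have different AUC values; the coefficient values are invented for illustration and carry no clinical meaning.

import math

def prob_tox(dose, auc, b0=-4.0, b1=0.03, b2=0.002):
    # Logit linear in both dose and AUC (hypothetical coefficients).
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * dose + b2 * auc)))

# Same administered dose, different drug exposure (AUC):
print(round(prob_tox(dose=100, auc=500), 3))    # lower exposure
print(round(prob_tox(dose=100, auc=1500), 3))   # higher exposure, higher predicted risk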
12.5.4 The CRM Is an Attractive Design to Criticize
The CRM focuses very sharply on the nature of dose-finding trials. These studies by any method are characterized by subjective designs and assessments, high variability, use of ancillary data from outside the study design, and the need for efficiency. Because the CRM makes these characteristics explicit, is fairly aggressive, and uses a visible dose–response model, the design has often been criticized. In recent years, many criticisms have abated with wider use. Criticisms of the CRM are nearly always actually general problems with dose-finding or dose-ranging designs. Some concerns have been voiced about its efficiency, aggressiveness, and the value or impediment of the model. The CRM is demonstrably more efficient and less biased than classic designs. The fitted dose–toxicity model is central to the method, allowing both interpolation and extrapolation of the available data. It is not falsely precise, nor is it clear how the absence of a model would be better.

12.5.5 CRM Clinical Examples
A clinical example of the use of the CRM for dose-finding can be seen in the study of the anti-neoplastic agent irinotecan for the treatment of malignant gliomas [591]. This trial accrued 40 subjects who received a total of 135 cycles of the study drug. Other than the starting dose, levels were not specified in advance. The CRM implementation selected target doses on a continuum, which were administered to the nearest mg/m2 . Dose cohorts were size 3, and parallel groups were studied for subjects receiving enzyme-inducing anticonvulsants (EIA) or not. The trial provided estimates of the MTD for both subsets of subjects, which were found to differ by a factor of 3.5 (411 mg/m2 /week in the EIA+ subset versus 117 mg/m2 /week in the EIA− subset). As is typical, the trial also provided extensive information regarding the pharmacokinetics of the study drug, and preliminary evidence regarding clinical efficacy. This trial illustrates a successful seamless incorporation of the CRM method into multicenter collaborative dose-finding trials [1209]. Another practical example of the CRM is given by Kant et al. [820] in the context of regional anesthesia. In that trial, the CRM algorithm was used to locate the ED95 dose for bupivacaine used for supraclavicular block. The best starting dose for the algorithm was assessed in the first 8 subjects on study. After the starting dose was selected, the full CRM
located the ED95 using 40 additional participants. The final ED95 was approximately twice as large as the starting dose. Unlike many cytotoxic applications where the dose quantile being sought is in the range of 30% response probability, the ED95 is a more extreme place on the dose–response curve that necessitates a larger number of study subjects to locate precisely. In this clinical trial, the algorithm performed well and yielded an appropriate clinical dose with a precision of within 10%.
12.5.6 Dose Distributions
At each iteration of data fitting for the CRM, the target dose, d, is estimated by an inverse model such as equation (12.3),

d = \frac{\mathrm{logit}(p) - \hat{\beta}_0}{\hat{\beta}_1},    (12.7)
where the notation omits iteration number for simplicity, and 𝛽̂0 and 𝛽̂1 are estimates using any appropriate method. Here the target dose is taken as a fixed number without regard for the obvious uncertainty associated with 𝛽̂0 and 𝛽̂1 . How should 𝑑 account for that uncertainty? Whether 𝛽̂0 and 𝛽̂1 are Bayesian or frequentist estimates, they reflect probability distributions, and consequently 𝑑 would have a distribution also. Equation (12.7) reflects a particular value of 𝑑 somewhere near the middle of its distribution. However, we are not required to use that value, but could choose a value from elsewhere in the distribution. The upper tail would make iterations escalate more aggressively, and conversely for values from the lower tail. Conservative choices for 𝑑 at each iteration would reduce the chance of exceeding the MTD for the next cohort. This kind of thinking supports the dose-escalation method described next.
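The following sketch illustrates this idea: uncertainty in the fitted parameters is propagated into a distribution of candidate target doses, and the median is compared with a more conservative lower quantile. The normal approximations and numerical values are hypothetical and serve only to show the calculation.

import math, random

random.seed(1)
target = 0.3
logit_target = math.log(target / (1.0 - target))

doses = []
for _ in range(10000):
    # Hypothetical approximate sampling distributions for the fitted parameters.
    b0 = random.gauss(-4.0, 0.5)
    b1 = random.gauss(0.05, 0.01)
    if b1 > 0:
        doses.append((logit_target - b0) / b1)

doses.sort()
median_dose = doses[len(doses) // 2]
cautious_dose = doses[int(0.25 * len(doses))]    # lower quantile escalates less aggressively
print(round(median_dose, 1), round(cautious_dose, 1))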
12.5.7 Estimation with Overdose Control (EWOC)
Estimation with overdose control [91, 1480] is a dose-finding method that restricts the probability of employing a dose in excess of the maximum tolerated dose (MTD). The term “overdose” does not refer to populist notions of accidents, death, or abuse, but simply means exceeding the MTD. If we have reasonably accurate estimates of the MTD at each iteration, explicit control over dose excesses would seem to be a good idea. EWOC uses a dose–response model in ways similar to the CRM algorithm. Instead of employing a logit linear model, EWOC parameterizes the model in terms of a criterion, 𝛼, that represents the probability of exceeding a dose tolerance. Holding that probability to a low level provides overdose control. This parameter provides more direct clinical relevance than the logistic model. With each EWOC cycle, subjects receive the current best estimate for the correct dose tempered by a bound on the chance for exceeding the MTD. EWOC uses a fully Bayesian procedure for estimating model parameters, and like the CRM, it operates on a dose continuum rather than from a small set of prespecified doses.
The maximum tolerated dose is the dose at which the chance of a dose-limiting toxicity (DLT) meets a prespecified probability tolerance, θ,

\Pr\{\mathrm{DLT} \mid \mathrm{dose} = \gamma\} = \theta.

The logistic dose–response model can be written

\mathrm{logit}(p) = \mathrm{logit}(\rho_0) + \frac{d}{\gamma}\bigl(\mathrm{logit}(\theta) - \mathrm{logit}(\rho_0)\bigr),    (12.8)

where d is the dose, ρ0 is the probability of a DLT at the starting dose, and 0 < ρ0 < θ. At the MTD, equation (12.8) reduces to logit(p_MTD) = logit(θ) or p_MTD = θ as required by definition. This is helpful, but the essential aspect of EWOC is not re-parameterization of the dose–response model. Denote the posterior cumulative distribution function for γ conditional on results from the first k subjects, calculated using Bayesian methods, by Ψ_k(γ). The next subject then receives a dose d_{k+1} so that Ψ_k(d_{k+1}) = α or d_{k+1} = Ψ_k^{-1}(α), where α is the investigator-specified probability of an overdose. As the posterior distribution Ψ_k(γ) stabilizes, so too will the recommended dose. The dose d_n chosen by EWOC is an estimate of γ with minimal risk according to a loss function

l_{\alpha}(x, \gamma) = \begin{cases} \alpha(\gamma - x), & x \le \gamma, \text{ i.e., } x \text{ is an underdose,} \\ (1 - \alpha)(x - \gamma), & x > \gamma, \text{ i.e., } x \text{ is an overdose.} \end{cases}

This loss function is asymmetric and implies that the relative loss for treating a subject above, compared to below, the MTD by the same amount is (1 − α)∕α. EWOC also requires a firm biostatistical collaboration to support its use. The performance of EWOC is similar to the CRM, depending on what is considered the appropriate performance criterion. Both are superior to 3+3 and related designs by nearly any criterion. Software to implement the EWOC algorithm is readily available [1279], and the algorithm has been adapted for drug combinations in a general formulation that is more capable than any other dose-finding method [1481]. Software to implement EWOC can be found at https://biostatistics.csmc.edu/ewoc/ewocWeb.php. There are many examples of the clinical application of EWOC. In one application the algorithm was used to adjust individual doses of a monoclonal antibody treatment for lung cancer [254]. The monoclonal antibody was fused to a superantigen staphylococcal enterotoxin A (SEA), and the MTD was adjusted for pretreatment anti-SEA antibody level. This type of complicated dose-finding would not be possible without the power and efficiency afforded by modeling. In this trial, 78 subjects were treated with the identification of individual MTDs depending on anti-SEA levels. The sample size seems large at first, but is a consequence of estimating the effect of anti-SEA level on the dose–toxicity relationship. A second EWOC example anticipates the difficult problem of joint dosing of two or more agents. It is a trial of bortezomib and sunitinib in refractory solid tumors [683]. Bortezomib is a targeted agent that produces apoptosis through several mechanisms. Sunitinib is a small molecule inhibitor of multiple growth factors. Each has had some effect as a single agent in advanced cancer. In this trial, EWOC was used to increase the dose of sunitinib with bortezomib at fixed dose. Then the dose of bortezomib was increased. Thirty-seven subjects entered the trial, of whom 30 were evaluable for the
primary endpoint. The joint dose of agents was identified with minimal side effects. While more sophisticated joint dosing paths might be employed now, the study demonstrates the flexibility of EWOC and model-based methods for these emerging dose-finding questions.
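A minimal sketch of the EWOC dose-selection rule described above follows. It uses a crude grid posterior for the MTD γ under the reparameterized logistic model of equation (12.8), a flat prior, and hypothetical data; the next dose is taken as the α-quantile of that posterior. The grid, tolerance θ, and starting probability ρ0 are all invented values for illustration.

import math

def logit(x):
    return math.log(x / (1.0 - x))

def p_tox(d, gamma, theta=0.33, rho0=0.05):
    # Reparameterized model: logit(p) = logit(rho0) + (d/gamma)*(logit(theta) - logit(rho0)).
    lp = logit(rho0) + (d / gamma) * (logit(theta) - logit(rho0))
    return 1.0 / (1.0 + math.exp(-lp))

def ewoc_next_dose(data, alpha=0.25):
    gamma_grid = list(range(50, 501, 5))          # candidate MTD values (flat prior)
    weights = []
    for g in gamma_grid:
        like = 1.0
        for d, y in data:
            p = p_tox(d, g)
            like *= p if y else (1.0 - p)
        weights.append(like)
    total = sum(weights)
    cdf = 0.0
    for g, w in zip(gamma_grid, weights):
        cdf += w / total
        if cdf >= alpha:
            return g                              # alpha-quantile of the posterior for gamma
    return gamma_grid[-1]

data = [(100, 0), (100, 0), (100, 0), (150, 0), (150, 1), (150, 0)]   # hypothetical outcomes
print(ewoc_next_dose(data))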
12.5.8 Randomization in Early Development?
Early developmental trials are not comparative and selection bias is not much of a problem when using primarily mechanistic and pharmacologic endpoints. Randomization seems to have no role in such studies. However, consider the development of new biological agents, which are expensive to manufacture and administer and may have large beneficial treatment effects. Some targeted agents or genetically engineered vaccines stimulate immunologic reactions to tumors and can immunize laboratory animals against their own cancers. Dose-finding objectives might be to determine the largest number of cells (dose) that can be administered and to test for effects of the immunization. Side effects could arise from either new genes in transfected tumor cells or the cells themselves in the case of cell-based vaccines. One way to assess risk of transfected cells is to compare their toxicity with untransfected cells using a randomized design. Such considerations led one group of investigators to perform a randomized dose escalation trial in the gene therapy of renal cell carcinoma [1401]. An advantage of this design was that the therapy had the potential for large beneficial effects, which might have been evident (or convincing) from such a design. As a general rule, comparative goals are too ambitious when the primary clinical question is the relationship between dose and risk. In Section 12.2.3, I discuss some comparative questions that might be unavoidable in early development. These will likely be most appropriate when risk is known to be low, or when the dose question has been resolved in some other way.
12.5.9 Phase I Data Have Other Uses
For all phase I studies, learning about basic clinical pharmacology is important and includes measuring drug uptake, metabolism, distribution, and elimination. This information is vital to the future development and use of the drug, and is helpful in determining the relationship between blood levels and side effects, if any. These goals indicate that the major areas of concern in designing phase I trials will be selection of subjects, choosing a starting dose, rules for escalating dose, and methods for determining the MTD or safe dose. If basic pharmacology were the only goal of a phase I study, the subjects might be selected from any underlying disease and without regard to functioning of specific organ systems. However, phase I studies are usually targeted for subjects with the specific condition under investigation. For example, in phase I cancer trials, subjects are selected from those with a disease type targeted by the new drug. Because the potential risks and benefits of the drug are unknown, subjects often are those with relatively advanced disease. It is usually helpful to enroll subjects with nearly normal cardiac, hepatic, and renal function. Because bone marrow suppression is a common side effect of cytotoxic drugs, it is usually helpful to have normal hematologic function as well when testing
new drugs in cancer patients. In settings other than cancer, the first patients to receive a particular drug might have less extensive disease or even be healthy volunteers.
12.6 GENERAL DOSE-FINDING ISSUES

12.6.1 The General Dose-Finding Problem Is Unsolved
Characterizing the OBD is a critical conceptual step that will suggest specific experiment design features for dose-finding. The narrow question of finding the MTD for cytotoxic drugs is facilitated by several lucky features. First, the relationship between dose and toxicity is biologically predetermined to be a nondecreasing function that saturates at high doses, that is, the chance of toxicity is 100% if the dose increases sufficiently. The curve labeled A in Figure 12.4 shows this qualitative behavior. Second, preclinical studies usually provide reliable guidance regarding the specific side effects and the quantitative dose–toxicity relationship, at least for a starting dose. Third, cytotoxic drugs are often members of a broad class about which much is known pharmacologically. Thus important properties of the dose–toxicity function are known in advance (i.e., its shape), and the problem can be reduced to determining the location and steepness of the shaped curve on the dose axis. Of these, location is the more critical parameter. Fourth, we tend not to need quantitative information about benefit for this class of drugs, only the knowledge that more is better. The most restrictive of the features surrounding MTD determination is the response scale, which is almost always taken to be the probability of a toxicity. Because most therapies outside of oncology do not need to be administered at their maximum tolerated dose, the most general OBD concept is to allow an arbitrary quantitative response scale.
FIGURE 12.4 Some example dose-response curves. The dose axis is arbitrary and possibly nonlinear.
Once we consider nonprobability responses, we have to allow arbitrary and flexible possibilities for the biological model that links dose and response. If we think we know the mathematical form of that relationship, progress toward general dose-finding designs can be made as discussed below. Other general dose–response features might include threshold (zero response at low doses), background (some background response at zero dose), nonsaturating response, decreasing dose–response, schedule effects, and interactions. Threshold implies that a dose must exceed a critical level before it produces any response. This might be true of drugs like enzyme inhibitors. Background means that some level of response is present at zero dose, presumably due to intrinsic factors or perhaps a placebo effect. Pain relief and cure of infections might show this phenomenon. A nonsaturating response means that high doses cannot produce a 100% probability of response. Interaction means that the effect of one agent is modulated by the dose of a second agent. Schedule effects are a type of interaction between dose and time or sequence, which can be critical for optimizing risk–benefit. The effects of schedule are an additional dimension to be considered just as we would investigate joint effects with a second therapy. Nearly always we would attempt to take advantage of such interactions to enhance treatment effects. But interactions can produce complex relationships between the factors. For example, in Figure 12.5, the presence of Agent A alters the dose–response of Agent B. Even when beneficial interactions for efficacy are intended or desired, dose-finding trials often ignore the implications with regard to safety. This will be discussed more fully below.
Experiment designs for the dose optimization of an agent with arbitrary properties have not been developed. Such an experiment would have to provide information about the shape, steepness, and location of the dose–response function. Of these characteristics, shape is probably the most difficult to determine empirically. Curves B, C, and D in Figure 12.4 demonstrate some general shape possibilities. Curve B shows a response that does not completely saturate at low doses. The optimum might be on the shoulder of curve B if response is efficacy—more drug does not help. On the other hand, if the response is toxicity, a more favorable risk–benefit might lie to the right if more drug is desirable. Curve C shows a maximal response declining at high doses. This behavior seems unlikely for most pharmacologic agents, but it might be seen with vaccines or biologicals. If the indicated response is efficacy, a dose near the peak could be optimal. If the response is toxicity, a dose near one extreme or the other might be a better choice. Curve D shows minimal activity at low doses with essentially a threshold, after which response increases rapidly. Here again we will need to know something about both risk and benefit before imagining where the OBD lies. These simple examples show how difficult dose-finding might be in the absence of knowledge about the qualitative behavior of the dose–response functions for both risk and benefit. The practical implications of the general dose-finding problem are that a wider range of doses needs to be tested and the optimal dose may depend on subtle shape features. It is always possible that early trials will reveal a previously unknown property of the drug that strongly influences the definition of optimal dose.
A flexible design for dose-finding would have to accommodate such changes.
FIGURE 12.5 Joint Dose–response surface for two interacting drugs A and B. Inset shows the curve of interaction at response probability 0.3.
12.6.2 More than One Drug
Current dose-finding designs are primitive and inadequate tools for studying treatment combinations, particularly those that intend to take advantage of a true therapeutic interaction. Examples where true synergy is sought are radiosensitizers, two or more drugs in combination for treatment of cancer, or multidrug therapy for infections. Drugs that alter the metabolism, distribution, or excretion of an active agent are another example. A CRM design using partial ordering of the drug combinations has been applied to this problem [1517], but has limited scope to fully explore clinically useful interactions. One approach for introducing a new agent is to keep the dose fixed for one drug, and conduct a standard dose escalation for the new one. This strategy is required when there is an ethics imperative to give the standard drug at full dose. But this approach ignores the potential for true synergy—greater effect (or toxicity) at the same dose, or the same effect at lower dose. A one-dimensional escalation will not yield a good estimate of the global optimum for the combination when interactions are expected, nor will independent
one-dimensional optimizations. The problem requires a two-dimensional search using multiple doses of each drug to identify optima reliably. Some approaches to this problem are suggested in the next section. These types of questions increase design complexity very rapidly for multiple agents, modalities, and adjuvants. No general dose-finding solution is offered here, except to make the reader aware of the shortcomings of simplistic designs. Similar problems historically in industrial production and agriculture have been solved using more sophisticated designs than many clinical investigators employ. The principles can be adapted to dose-finding questions.

Interactions
When more than one drug or therapy is present we must consider the possibility of interaction or synergy between them. Interaction implies that the effect of each agent depends on the dose of the other one. There are two possibilities: (1) the treatments each yield independent effects but no codependent effects, or (2) the treatments interact so that the joint effect is out of proportion to the independent effects. See Chapter 22 for a formal definition of interaction. In this second case, the interaction could be positive (synergy) or negative (inhibition). Most of the time we combine treatments hoping that there will be synergy, but sometimes we are satisfied just to have each therapy’s independent contribution. It never makes logical sense to believe in therapeutic synergy without acknowledging the possibility of toxic synergy. Treatments do not have to be present in the body at exactly the same time to produce an interaction—a lingering effect of one treatment or permanent changes to a person, organ, or tissue might be sufficient. But most of the time we expect interaction to require simultaneity. An interaction may not be dose dependent in the same way that the independent or main effect of a treatment is. It is theoretically possible for an interaction to result from a near-zero dose of one or both agents. An interaction effect of the same magnitude could arise from a continuum of joint doses. This raises the possibility of reducing the dose of a more toxic drug and increasing the dose of the safer agent to yield the same synergy. Interaction can be modeled in much the same way as single effects. Consider a logistic dose–response model similar to that above for the continual reassessment method with two agents A and B. The logit of the response probability relates to the agent doses, D_A and D_B, according to

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 D_A + \beta_2 D_B + \gamma D_A D_B,    (12.9)
where 𝛾 represents the interaction effect. This model allows for independent effects, 𝛽1 and 𝛽2 , and an interaction that could be synergy or inhibition depending on the algebraic sign of 𝛾. Note that the same interaction can be produced by different 𝐷𝐴 and 𝐷𝐵 if their product is the same. In the two-dimensional dose space of agents 𝐴 and 𝐵, interaction is represented by a one-dimensional curve of 𝐷𝐴 × 𝐷𝐵 = 𝑐 (constant). Other models are possible but this is probably the simplest. This circumstance is illustrated in Figure 12.5 where the model used to draw the joint dose surface is equation (12.9). It is often said that interactions are scale-dependent because an outcome might relate to the right-hand side of equation (12.9) in a linear, multiplicative (the above case), or other way. A disproportionate effect on one scale may or may not be proportionate on
another scale. Assuming we have a clinically relevant scale of measurement, the model makes it perfectly clear when an interaction exists, that is, when γ ≠ 0.

Designs
Some reflection on therapeutic synergy is necessary before designing a dose-finding trial for two (or more) agents. If we have reason to believe that a true interaction does not exist (γ = 0), the design of a trial is simplified. In particular, the dose of agent A can be fixed and agent B can be escalated in essentially a one-dimensional design. The discussion above would pertain directly. Such designs are very frequently employed for two drugs, even without precise thought regarding interactions. But the converse implications of that one-dimensional design are also true—it yields an optimum only in the absence of interaction. We cannot expect to locate any point of synergy without exploring the joint dose space. If we expect an interaction or need to detect one to render a particular combination of therapies useful, a more sophisticated testing of the joint dose space will be required. Occasionally some ad hoc or informal changes in the dose level of the background drug are added to a design. These decisions are based on clinical intuition, but are inadequate generally for the purpose. Dose-finding designs for interactions are still evolving, but any successful design will require visiting a number of joint doses of differing magnitude. When adding a potential new therapy, investigators can feel constrained by ethics or efficacy to keep the proven drug at a fixed dose. This could be an insurmountable obstacle to truly determining synergy. But consider also the potential for synergistic toxicity, which, as stated above, must always be possible when synergistic efficacy is sought. Synergistic toxicity is a valid motive for lowering the dose of standard therapy, at least initially, to provide a broader range of doses to be explored. In the end, we might imagine the possibility of minimal doses of a standard agent combined with higher doses of a less toxic partner to produce enhanced efficacy. If such a scenario is possible, it strongly motivates exploration of the joint dose space.

Dual Dose-Finding
One reason why clinical investigators may be reluctant to explore a number of dose combinations is concern that a large number of study subjects will be required. This worry stems in part from thinking narrowly about such designs in the style of the Fibonacci dose escalation. One safe and efficient strategy that explores the design space more thoroughly is depicted in Figure 12.6. Many design points could be tested, only to discover that the optimal combination exists at none of the joint doses actually utilized (gray region in Fig. 12.6). It seems unlikely that investigators can lay out a simple design prior to obtaining any data and choose a dose combination that is precisely optimal. It is more likely that some two-dimensional modeling procedure (analogous to the one-dimensional dose–response modeling of bioassay) will accurately and reliably locate the optimal dose combination. For achieving a dose optimum, the principal issues become (1) selecting the number of subjects to test at each dose combination, (2) working through the dose combinations in a safe fashion, and (3) modeling the relationship between the joint doses and outcome. Workable solutions to these problems have been described in the statistical literature. The problem appears amenable to response surface methodology, a well-studied statistical problem.
The main idea in a response surface is to model response as a function of
FIGURE 12.6 Possible design space for a two-drug combination dose-finding experiment. The gray area indicates that the optimal dose combination may not exist at any design point. The grid numbers indicate a possible sequence of dosing.
multiple experimental factors (linear in the parameters), and to optimize that response. Here we have only two experimental factors, so the problem is relatively simple. The two factors (drug doses) are laid out on the X–Y plane, with the response measured in the Z direction. The response surface is mathematically a two-dimensional “manifold” in three-dimensional space—it is similar to a plane but does not have a flat surface. Plackett and Burman [1211] described the design of optimal multifactorial experiments. Subsequent investigators have refined this methodology. For example, see Box and Draper [175]. Short et al. [1383] discussed some of these design concerns in the setting of anesthetic drugs. The basic idea behind the Plackett and Burman designs is that the response surface can be described using relatively few parameters compared to the number of design points in the experiment. Thus the number of experimental subjects at each design point can be kept small. Classically, the number of design points and number of parameters were set equal, and a single experimental unit was treated at each design point. This stretches the statistical information to the limit. In the present context, a similar design could be employed, but with fewer parameters than design points. Figure 12.7 shows a possible design for a two-drug combination, where each drug is administered at three doses as discussed above and illustrated in Figure 12.6. A hypothetical response surface is shown in the Z direction. Although a response surface could be quite convoluted, it seems biologically reasonable that the surface would be relatively smooth. We might assume that the surface can be parameterized by a cubic polynomial, for example. There are nine design points, and the number of parameters describing a complete cubic response surface is 10, including an intercept and all first-, second-, and third-order terms.
FIGURE 12.7 A hypothetical response surface for the two-drug combination dose-finding design from Figure 12.6. The interior minimum of the surface corresponds approximately to the region of dose optima from Figure 12.6.
In this case a flexible model (but not saturated) theoretically could be fit using a single experimental unit (subject) at each design point. Nevertheless, larger cohorts at each design point are appropriate, particularly in the center of the design space where the optimal is likely to be (assuming the investigators have chosen the dose combinations insightfully). One could use cohorts of size 2 at the “corners” and cohorts of size 3 at all other design points, for example, allowing a fully saturated cubic response surface model to be fit. Such a plan would require only 23 subjects for nine dose combinations. The safest dose at which to begin an escalation is the one in the lower left corner. In special circumstances, additional design points could be added. The dose combinations would be visited sequentially in a manner that escalates only one of the drugs relative to the previous design point. Thus the path through the dose space might be diagonal in the fashion indicated by the sequence numbers in Figure 12.6. The design points in the upper right corner may never be reached if dose-limiting toxicities are encountered. After the data are acquired, a parsimonious but flexible response surface can be fit. The response surface should permit reasonable but smooth nonlinearities to help define the optimum. Investigators might treat additional subjects at the model-defined optimum to acquire more clinical experience. However, this is not required under the assumptions of the design. The first dose-finding study employing this design has yet to be completed. However, the ideas are well worked out statistically and the design seems compatible with the clinical setting and assumptions that typically enter studies of drug combinations. Variations of this basic design will likely follow from early experiences. It is also possible to generalize the CRM to accommodate two drugs and a relatively smooth response function. However, to do so might require four parameters for the
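The following sketch illustrates the response surface idea under simplifying assumptions: a quadratic (rather than cubic) surface is fit by least squares to hypothetical mean responses at the nine design points, and the fitted optimum is located by a grid search. The dose levels and responses are invented for illustration only.

import numpy as np

dose_a = [10, 20, 30]
dose_b = [5, 10, 15]
# Hypothetical mean responses at the nine joint design points, ordered as the (a, b) pairs below.
y = np.array([12, 18, 15, 20, 30, 24, 16, 25, 19], dtype=float)

pairs = [(a, b) for a in dose_a for b in dose_b]
X = np.array([[1, a, b, a * a, a * b, b * b] for a, b in pairs], dtype=float)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)      # least-squares quadratic surface

# Locate the fitted optimum on a finer grid of joint doses.
best = max(
    (float(coef @ np.array([1, a, b, a * a, a * b, b * b])), a, b)
    for a in np.linspace(10, 30, 81)
    for b in np.linspace(5, 15, 81)
)
print("fitted optimum near dose A = %.1f, dose B = %.1f" % (best[1], best[2]))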
typically employed logistic model. In addition to the intercept, one parameter is required for the “slope” of each drug effect, and the last parameter would describe the interaction or synergy between the two drugs. This is the simplest generalization, and even more complex models could be necessary. Because of the difficulties in specifying prior values or information for model parameters (a relatively simple task in the one-dimensional case) when employing complex models, such generalizations of the CRM are not very practical.
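To show numerically what the interaction term in equation (12.9) does, the short fragment below evaluates the joint response probability at the same pair of doses with and without a positive γ; all coefficient values are hypothetical.

import math

def p_response(da, db, b0=-5.0, b1=0.04, b2=0.08, gamma=0.0):
    # Equation (12.9) form: logit(p) = b0 + b1*DA + b2*DB + gamma*DA*DB.
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * da + b2 * db + gamma * da * db)))

# Same joint dose without and with a synergistic interaction (gamma > 0):
print(round(p_response(50, 25, gamma=0.0), 3))
print(round(p_response(50, 25, gamma=0.002), 3))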
12.6.3 More than One Outcome
Studies of cytotoxic agents are greatly aided by a priori knowledge regarding the form of the relationship between dose and toxicity. Investigators know that for such drugs there is a dose that saturates the response. For example, one can administer a high enough dose of a cytotoxic agent to create toxicity in 100% of recipients. This implies that there is a monotonically increasing dose–response function. This knowledge can be used implicitly in the dose escalation design. It can also be used explicitly, such as in a probit or logit CRM model, which yields even more efficiency in the design. Suppose, however, that relatively little is known about the shape of the dose–response function. This might be the case with drugs where the effect saturates, or when the behavior of the response at low doses does not qualitatively mimic a cytotoxic drug. Such assumptions are completely reasonable if investigators are interested in the maximum nontoxic dose (antibiotics) or minimum effective dose (analgesics), as discussed above. When designing dose-finding experiments in the absence of qualitative knowledge about the shape of the dose–response function, the investigator has little option except to study a range and number of doses that permits assessing shape, while simultaneously locating the position that is considered optimal. For example, if the dose–response function has a plateau, the OBD might be a dose just on the shoulder of the curve. Higher doses would not substantially increase the response but could increase the chance of undesirable side effects. In essence, there are two unknowns: the level of the plateau and the dose at which the plateau begins. Extra information is needed from the experiment design to optimize in such a circumstance. A potentially more difficult problem arises when balancing the toxicity of a new drug against its biological marker response or intermediate outcome. We may need to select a dose on this basis before estimating definitive clinical activity. This situation is shown hypothetically in Figure 12.8, where both the probability of toxicity (right-hand axis) and biological response (left-hand axis) are plotted against dose. The two effect curves are not directly comparable because the respective vertical axes are different. An optimum dose would produce low toxicity while nearly maximizing the biological response. Such a dose is not guaranteed to exist. After the data are collected and modeled, the region of the optimum dose might be fairly obvious. However, it might be difficult to specify in advance either a definition of, or a strategy for finding, the optimum. When the dose–response relationships are well defined for both effects, an approach to an optimum might be to consider the parametric space of the outcomes. A curve parameterized by dose exists in the plane whose X and Y axes are toxicity response and biological response. Suppose that the toxicity and biological responses are given by the curves drawn in Figure 12.8. The corresponding parametric curve is shown in Figure 12.9 along with a region of clinical utility defined by high biological response and low toxicity. The parametric curve defines the only biological possibilities that we
FIGURE 12.8 Relationship of toxicity and biological response to dose for a hypothetical drug.
can observe. If the OBD is an actual dose rather than a hypothetical, it would have to be a point on this curve. The response–toxicity curve might pass through the region of clinical utility, in which case it is straightforward to select the OBD. Alternatively, it
FIGURE 12.9 Relationship between toxicity and biological response (parameterized by dose) for the example functions from Figure 12.8.
might be necessary to select as the OBD a point in the parametric space closest to the region of clinical utility (e.g., Fig. 12.9, dose = 50). If the parametric curve does not pass through or close to some clinically appropriate joint outcomes, the drug will not be useful. This might be the case in Figure 12.9 if higher doses are unusable for other reasons. For smooth dose–toxicity and dose–response functions, the parametric space could be nearly linear over a wide range of effects. Fig. 12.9 has this property. This could lead to an efficient design for such studies. The region of clinical utility would be prespecified on a toxicity–response parametric plot. As outcomes are gathered from small experimental cohorts at each dose, points in the parametric space can be plotted and extrapolated linearly (e.g., Fig. 12.10). If the resulting projections convincingly exclude the region of clinical utility, the drug can be abandoned. Such a situation is shown in Figure 12.10, where extrapolations from early data (black circles) suggest the chance of clinical utility (gray region), but data at higher doses demonstrate that the drug will not be useful. No particular ordering of the design points is required, although increasing doses will most likely be required as in other dose-finding trials. In practice, there would be some uncertainty associated with each outcome point in Figure 12.10 in both dimensions, which could be reduced by increasing the sample size at that dose.
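A small sketch of this parametric-curve idea follows: both outcomes are traced over a dose grid and the dose closest to a hypothetical region of clinical utility is reported. The two curves and the utility region are invented for illustration and do not correspond to any real agent.

import math

def toxicity(d):
    # Hypothetical logistic dose-toxicity curve.
    return 1.0 / (1.0 + math.exp(-(d - 70.0) / 10.0))

def response(d):
    # Hypothetical saturating dose-response curve (0-100 units).
    return 100.0 * (1.0 - math.exp(-0.03 * d))

best = None
for d in range(1, 151):
    tox, resp = toxicity(d), response(d)
    # Distance to a utility region defined as toxicity <= 0.20 and response >= 80 units
    # (response distance rescaled to the 0-1 range so the two axes are comparable).
    dist = math.hypot(max(0.0, tox - 0.20), max(0.0, (80.0 - resp) / 100.0))
    if best is None or dist < best[0]:
        best = (dist, d, round(tox, 3), round(resp, 1))

print(best)   # (distance, dose, toxicity, response) for the dose nearest the utility region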
12.6.4 Envelope Simulation
The general dose-finding problem is fairly intractable as discussed above. Any widely suitable design to address it cannot be restricted to probability outcomes. Such a design
FIGURE 12.10 Possible experiment design to assess two outcomes that determine clinical utility. The black circles are design points, not necessarily taken in order, and the gray area is thought to be that of clinical utility. The dashed lines show extrapolations made by investigators based on a sequential experiment. At dose number 4, the lack of utility becomes convincing.
would also have to be able to deal with (1) classic dose escalation to locate an MTD, (2) location of the dose associated with a maximum response, (3) location of a dose that nearly saturates an outcome, (4) titrations under many dose–response relationships, and (5) location of the dose that yields a specified response. A general design should also (6) apply in the laboratory, (7) be able to incorporate ancillary information from the clinical investigator, (8) make efficient use of data, (9) be ethically appropriate for participants, and (10) have simple, transparent, and flexible operational rules. Envelope simulation is a method that can accommodate most of these goals [1204]. The method contains a deterministic component that is model guided, but is not restricted to a single model. It also contains a stochastic component based on statistical simulations to characterize variation. The simulations represent variability in behavior that is otherwise difficult to capture quantitatively. The algorithm for envelope simulation is described in Table 12.6. It is essentially a CRM algorithm with an arbitrary model, dose, and response scales. Any dose–response model can fit into this algorithm. For example, consider a logistic model similar to equation (12.1),

Y(d) = \frac{C}{1 + e^{-\beta(d - d_{50})}},    (12.10)
TABLE 12.6 Steps in the Envelope Simulation Algorithm

Step  Action Required
1     Identify an appropriate mathematical model for the expected dose–response relationship and feature of interest.
2     Use data or clinical insights to locate envelope regions in the dose–response plane.
3     Choose simulation parameters such as number of points per envelope, weights, and number of repetitions.
4     Check that model and envelopes are consistent with each other, and with biological knowledge.
5     Run simulation/model fits and summarize the distribution of estimated doses yielding the feature of interest as a histogram.
6     Use the histogram distribution to select an appropriate dose for the (first) next cohort of study subjects.
7     Choose a cohort size. The algorithm does not dictate a cohort size.
8     Treat cohort and gather outcome data.
9     Include outcome data in a new simulation cycle.
10    Optional: assess the sensitivity of model estimates to additional data and/or its location (design points) via simulation.
11    Repeat steps 5–10 until the histogram of estimated doses indicates that real data are dominating envelope samples.
for dose, d, which generalizes the probability model to allow responses on an arbitrary scale up to a maximum of C. A bilinear model that could be useful in detecting a maximum response is

Y(d) = \begin{cases} C - e^{\beta}\,|d - \alpha|, & \text{if } d \le \alpha, \\ C - e^{\gamma}\,|d - \alpha|, & \text{if } d > \alpha, \end{cases}    (12.11)

which is asymmetric around the maximum value C at dose α. Again the response Y(d) is not required to be a probability. Another useful model is a saturating response

Y(d) = C(1 - e^{-\beta d}),    (12.12)

where C is the maximal response. These represent just a few of many models that could be incorporated into the envelope simulation algorithm. The stochastic portion of the algorithm depends on prespecified envelopes or regions of the dose–response plane from which we expect data to arise. The expectation comes from prior clinical or biological knowledge. As a hypothetical example, suppose that if a drug is administered at a dose between 100 and 200 units, then we expect a response between 300 and 500 on the measured outcome scale; if it is administered at a dose between 300 and 400 units, the anticipated response will be between 750 and 1000. The indicated envelopes are illustrated in Figure 12.11. This quantification comes from knowledge of similar agents, preclinical data, or clinical data at other doses or in other cohorts. Each biological circumstance requires customized envelopes. Different investigators will likely specify different envelopes for the same problem. Even so, the algorithm gives ultimate dominance to data from the subjects as opposed to the envelopes. The envelope simulation algorithm puts together a random sample from envelopes, real response data from treated cohorts, and a dose–response model fitted to the data in each simulation sample. From each model fit, the feature of interest or prediction is extracted in one of three ways: (1) as an estimated parameter value, (2) from an algebraic calculation using estimated parameters, or (3) through numerical root finding.
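The following sketch shows one cycle of this process under simplifying assumptions: pseudo-data are sampled uniformly from the two hypothetical envelopes above, pooled with invented observed data, the saturating model of equation (12.12) is fit by a crude grid search, and the dose giving 90% of the maximal response is extracted from each fit. Nothing here is prescribed by the algorithm itself; the grids, weights, and data are illustrative choices.

import math, random

random.seed(2)
# (dose range, response range) envelopes taken from the hypothetical example above.
envelopes = [((100, 200), (300, 500)), ((300, 400), (750, 1000))]
observed = [(150, 420), (250, 640)]            # hypothetical real (dose, response) data

def fit_saturating(points):
    # Crude least-squares fit of C and beta in Y(d) = C*(1 - exp(-beta*d)) over a coarse grid.
    best = None
    for C in range(600, 1501, 25):
        for k in range(1, 41):
            b = k * 0.0005
            sse = sum((y - C * (1.0 - math.exp(-b * d))) ** 2 for d, y in points)
            if best is None or sse < best[0]:
                best = (sse, C, b)
    return best[1], best[2]

recommended = []
for _ in range(200):                            # simulation repetitions
    sample = list(observed)
    for (dlo, dhi), (rlo, rhi) in envelopes:    # one pseudo-observation per envelope
        sample.append((random.uniform(dlo, dhi), random.uniform(rlo, rhi)))
    C_hat, b_hat = fit_saturating(sample)
    recommended.append(-math.log(0.1) / b_hat)  # dose giving 90% of the maximal response

recommended.sort()
print("median recommended dose:", round(recommended[len(recommended) // 2], 1))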
FIGURE 12.11 Hypothetical envelopes for simulating dose–response data.
The estimated feature or prediction is saved for later summary. Repeating this process a large number of times yields a distribution of predictions for each cycle of the algorithm. That distribution is used by the investigator to choose the dose for the next cycle. The most general way to extract the necessary feature from a model fit is by numerical root finding. In all cases, we assume that the parameter estimates from the fitted model are true constants. Denote the model being employed as f(d, θ), where θ indicates one or more model parameters. If we want to know the dose, d_z, associated with a response value Z, the equation to be satisfied is

f(d_z, \hat{\theta}) - Z = 0,    (12.13)

where θ̂ is the current estimate of θ. This equation can always be solved numerically for d_z. Sometimes it can be solved algebraically. If the problem is to find the maximum response, we can numerically solve

\frac{\partial f}{\partial d}(d_{\max}, \hat{\theta}) = 0    (12.14)

for d_max. In either case we assume θ̂ is a fixed constant. As we cycle through simulations, the set of doses determined by equation (12.13) or (12.14) forms the distribution of interest. We would take some appropriate quantile of that distribution, such as the median, to be the dose for the next cohort of subjects. Eventually this distribution will narrow due to dominance of real data, and the dose recommendations will converge to a constant value. A hypothetical example is shown in Figures 12.12 and 12.13 that contain the initiation and three steps in an envelope simulation using the model from equation (12.11). This model is designed to determine the dose that yields a maximum response. Numerical root finding is not necessary because the model parameters estimate the maximum response at Ĉ and the corresponding dose at α̂. Envelopes, data, and model fits are shown in the left columns of the figures. Predicted distributions of the dose that yields a maximum response are shown in the right columns. Initially the envelopes are not so reliable at isolating the maximum as evidenced by the relatively wide distribution of predicted maxima in step 0 and step 1. As cohorts are tested, the resolution of the dose yielding the maximum response becomes sharp. The final fit illustrated in Figure 12.13 is based only on actual data points. The bilinear model is efficient at isolating the maximum despite it being only locally accurate at best. Although this is an artificial example, it illustrates the ability of the envelope simulation algorithm to cope with arbitrary dose–response models. It also shows operationally how the envelopes initiate use of a model, but are quickly dominated by real data. Even so, dose-finding remains a sparse data problem and conclusions must accommodate the resulting imprecision.
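Before leaving this topic, here is a minimal bisection sketch of the numerical root finding in equation (12.13); the logistic model, its parameter values, and the target response Z are hypothetical.

import math

def f(d, C=1000.0, beta=0.01, d50=300.0):
    # Hypothetical logistic response model on an arbitrary outcome scale.
    return C / (1.0 + math.exp(-beta * (d - d50)))

def solve_dose(Z, lo=0.0, hi=2000.0, tol=1e-6):
    # Bisection: f is increasing in dose, so bracket the root and halve until converged.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < Z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(solve_dose(Z=500.0), 2))   # should return a dose near d50 = 300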
12.7 SUMMARY
The purpose of a dose-finding or dose-ranging clinical trial is to locate an optimal biological dose, often defined with respect to safety. This problem is well defined in the classic oncology development paradigm, especially for cytotoxic agents. It is less well defined for targeted agents, outside oncology, and for therapeutics that are not drugs. The
FIGURE 12.12 Initiation (A) and step 1 (B) of a hypothetical envelope simulation process. Envelopes, data, and model fits are on the left. Predicted distributions of dose that yields a maximum response are on the right.
notion of a dose optimum depends on the purpose of the agent and a balance between safety and efficacy. It could be quite different for analgesics, antibiotics, and cytotoxic agents, for example. There are no completely general clinical trial designs that can locate optimal doses. For toxicity titration for cytotoxic anticancer agents, broadly useful study designs have been developed. All such designs are strongly subjective but are necessary and useful to obtain reliable dosing parameters for subsequent developmental trials. A widely used dose-ranging design is the (modified) Fibonacci dose escalation or one of its derivatives. Although simple to execute and analyze, such designs have poor operating characteristics with respect to true dose-finding. The continual reassessment method (CRM) and related model-guided designs are more efficient and precise designs for cytotoxic dose-finding. They come at a slightly higher cost in terms of complexity but almost universally employ fewer subjects. Neither of these trial designs is automatically the correct choice for early developmental questions in general. Model-guided dosing can be combined with simulation to yield very general study designs similar to the CRM. These methods are not restricted to probability outcomes or simplistic models. Provided there is sufficient biological information to postulate
FIGURE 12.13 Step 2 (A) and step 3 (B) of a hypothetical envelope simulation process. Envelopes, data, and model fits are on the left. Predicted distributions of dose that yields a maximum response are on the right.
a reasonable dose–response model, these methods show enormous flexibility for dose optimization. Other general dose-finding design problems remain unsolved. One is jointly estimating and balancing efficacy against toxicity. New experiment designs that target a region of clinical utility might be helpful in this case. A second incompletely solved problem is optimizing joint doses of multiple agents. To investigate interactions, some type of modified response surface method might be useful. Whatever complications arise, the relationship between dose and safety should be resolved before initiating safety and activity trials, after which dose questions will be inefficient to investigate.
12.8 QUESTIONS FOR DISCUSSION
1. Suppose that drug 𝐴 is an established treatment administered at a fixed dose. Investigators wish to study the combination of drugs 𝐴 and 𝐵 but are unsure of the best dose
of either one in the presence of the other. What type of design would permit studying this question? Compare it with a typical phase I trial. What if the correct doses are known but the order of 𝐴 and 𝐵 is in question?
2. Propose a possible escalation strategy for a two-drug dose-ranging design. What if the drugs are strongly different with regard to risk of toxicity?
3. What mathematical model might be appropriate for a dose–response relationship that monotonically increases with no inflection point, such as Curve B in Figure 12.4?
13 MIDDLE DEVELOPMENT
13.1 INTRODUCTION

Following dose-finding and prior to definitive comparative trials we typically encounter a broad and complex set of therapeutic questions. This is middle development. I have heard and read dozens of investigators and methodologists in several disciplines organize this domain for teaching. All arrange it differently and emphasize different features, and my outline is probably distinct from all of those. I will attempt to extract common principles for middle development that would lead to serviceable designs anywhere. Two ubiquitous evidentiary goals of middle development are safety and therapeutic activity, which is why I often refer to such studies as safety and activity (SA) trials [1362]. No design will achieve multiple objectives definitively, but appropriate modest-sized trials can yield reliable information on focused goals.

After middle development lies an expensive and time-consuming definitive comparative trial. Investigators cannot forgo key information from middle development when a comparative trial looms, unless they are very risk accepting. More than one such trial may be needed, ordered, and focused appropriately on the objectives.

Middle development designs have been very important in cytotoxic drug development to help filter ineffective therapies. With the rise of targeted or smart drugs in that context, safety and activity objectives seem to be relocating earlier in the pipeline, diminishing the role for classic designs. At the same time, the demand for more reliable signals in middle development has contributed to more use of randomized designs. As a result, the traditional forms of these trials may be disappearing from the new oncology drug development paradigm. In other settings, however, they retain a key role.

The properties of the context and pipeline are reflected very strongly when choosing an optimal middle development design. There are three key influences of the environment.
The pipeline can be pressured globally in the disease context, or more locally with regard to the class of therapies under study. Although human subjects are always a precious resource, how much competition is there for research volunteers in the disease being studied? Is the disease rare or are there many potential therapies needing testing? Design features as discussed below will be a consequence of such pressures.

Second, we must have some quantitative prior expectation for the chance of success in the pipeline. This is a valid, empirical, disciplinary question, not an aspirational one for the investigator, who will always think their therapeutic ideas are insightful and likely to work. But history and the pipeline inform us that brilliant ideas frequently fail. This prior probability of success or failure is both a dominant motivation for performing a middle developmental trial in the first place and a driver for optimal design characteristics.

Existing safe and effective therapy is a third factor that strongly influences design characteristics. Without good treatments, patients with serious illness are risk accepting, and investigators are similarly appropriately inclined to adjust clinical thresholds. In the presence of a good therapeutic alternative, our design characteristics would likely be more skeptical, stringent, and risk averse. These perspectives will, at a minimum, be reflected in the error properties of the trial. The combination of a pressured pipeline and existing good therapeutic alternatives would lead to a classical "fail early and often" design bias that has been typical in cancer trials until recently.

A recent trial design task force focused on cancer middle development (phase II) studies did not fully consider these environmental pressures and yielded mostly bland recommendations [1362]. The fact that a single stage/phase of development was isolated by the task force rather than taking a view across development as in Chapter 10 illustrates the problem. A broad view of the topic from an ethics perspective seems to have produced better insights [713].

A large perspective is necessary to choose wisely from a bewildering array of possible middle development study designs, ranging from small single-cohort estimations of variability, average measures, or clinical success rates, to futility designs, to randomized comparisons. Within a discipline, there tends to be repetitive use of certain designs based on familiarity, reliability, and efficiency. Across disciplines, there is little consistency to the pressures and hence to the designs used in middle development.
13.1.1 Estimate Treatment Effects
Middle developmental studies use clinical outcomes to assess risk, safety, activity, and efficacy. Because feasibility is often established earlier in development, safety and activity (SA) is a primary focus of such studies. The classical cancer drug paradigm for middle development presupposes that many therapies are in the pipeline, and that most will fail to be of clinical benefit. That perspective has been realistic for oncology but may not hold in other contexts. If potential therapeutic advances are encountered less frequently, for example, in treatment of degenerative neurologic disease, alternative middle developmental strategies might be better. Sometimes, it makes sense to skip this step of development entirely, as discussed below. Middle developmental studies with rigorous design features such as a randomized comparison group are also occasionally appropriate. The general characteristics of safety and activity trials presented here are broadly serviceable. In this stage of development the critical information gained is an estimate of the probability that patients will benefit from the therapy, or have serious side effects from
it. These risk–benefit estimates will be compared with knowledge about conventional treatment to determine if additional, larger trials are justified and likely to be successful. Unsafe, inactive, or ineffective treatments will be discarded if better alternatives exist. This is in essence the overriding purpose of middle development: to discard inactive treatments without investing heavily in them. In this sense we might say that middle development is pivotal, meaning that it often yields a basis for terminating development, especially if resources are limited. However, the notion of a pivotal trial as a strong experiment design has a regulatory connotation discussed in Chapter 14.

Oncology drug development classically has used prototypical safety and activity trials. Safety is assessed by organ system toxicity measured according to defined criteria. Evidence of activity is usually assessed by the surrogate outcome of response, which is a measure of tumor shrinkage. In some middle development oncology trials, efficacy can be estimated from clinical outcomes such as disease recurrence, progression, or duration of survival. The classic cancer surrogate outcome is tumor shrinkage or response. This outcome seems to be in the causal pathway for definitive outcomes like survival. With discipline it can be measured reliably and is known fairly soon after the completion of treatment. In reality, tumor shrinkage is not an excellent surrogate, but it makes these trials reasonably efficient and reliable. Safety and activity trials can easily be adapted for definitive outcomes like survival. Designs for this are discussed in Chapter 16.

The questions addressed by safety and activity trials are relevant in most developmental areas, but different designs may be necessary depending on the context. For example, if a good standard therapy exists, we would favor a replacement only if it appears to be substantially more effective than standard, or if it has fewer side effects. This has quantitative design implications. For certain types of biological agents like vaccines or gene therapies, we might be willing to pursue only large treatment effects. Designs in high-risk populations or less rigorous comparisons might be appropriate.

Middle development trials usually involve between 25 and 200 subjects. Resources are seldom available to study more subjects, and the gain in precision that would be obtained by doing so is usually not worth the cost. The design can be fixed sample size, staged, or fully sequential. Staged designs are those conducted in groups with an option to terminate the trial after each group of subjects is assessed. Fully sequential designs monitor the treatment effect after each study subject. Staging allows a reliable decision to be made at the earliest possible time. A review of traditional middle development design types was given by Herson [707]; sample size is discussed in Chapter 16.
13.2 CHARACTERISTICS OF MIDDLE DEVELOPMENT
Questions for middle development depend on nuances of the context, therapy, existing evidence, and critical missing information. Even so, there are some typical designs. The knowledge from early development trials will leave a list of questions whose answers are required to support the impending large investment in a definitive randomized comparative trial. A small set of objectives will invariably emerge from that process. Those goals must be reconciled with features of an ideal middle development trial, and tempered by logistical constraints. Middle development goals typically include some of the following:
1. Establish biological activity against the disease using a meaningful clinical outcome.
2. Evaluate safety with a reasonable clinical threshold, as compared to risks estimated in dose-finding trials.
3. Learn more about mechanism of action.
4. Establish feasibility of administering the regimen based on logistics or cost.
5. Obtain estimates of efficacy at a specific dose and schedule.
6. Fine-tune dose, schedule, or combination of drugs.
7. Decide if definitive experiments should be undertaken.

The ideal design for middle development has the following features:

1. Performed in the target population.
2. Provides accurate unbiased information on a clinically relevant outcome.
3. Reasonable precision, but not necessarily definitive.
4. Yields unambiguous information on which to base the next development decision.
5. Estimates parameters needed for designing a large definitive study.
6. Robust to errors in design assumptions.
7. Simple and efficient with moderate duration of follow-up.
8. Transparent.
With regard to safety outcomes, middle development is an opportunity to establish a clinically meaningful tolerance for serious adverse events. Translational and dose-ranging trials can only rule out relatively high frequencies of such events because of their small sample size. In other words, they can point to risk, but cannot assess safety reliably. Importantly, middle development offers the opportunity to increase the frequency of true positive results from definitive comparative trials as discussed in Section 13.4. But the purpose of middle development is expressly not to predict the result of such trials; if it were, we would have no need for the later studies. Finally, efficient middle developmental trials will depressure the pipeline. Absent any pressure, we might well decide to skip middle development as discussed in Section 13.3.1.

13.2.1 Constraints
The mid-pipeline setting brings significant design constraints derived from cost, efficiency, and other priorities of development. The requirements usually are as follows:

1. Cost, time, and sample size efficiency.
2. Small scale (e.g., 30–200 subjects).
3. Moderate duration of follow-up (e.g., days or weeks to 12 months).
4. Simple logistics.
We can see from the balance of factors listed above that there can be a huge variety of trial designs appropriate in middle development. But we might also anticipate that there will be some standard designs reflecting common questions and similar ranking/restriction of goals and constraints.
When enough constraints are brought to bear on a design question, there will usually be an "optimal" solution. We can expect to deal well with the question at hand in middle development, but these trials will be inadequate with respect to choice of outcome, size, precision, and control of bias to form a sole basis for changing therapeutic practice.

The breadth and complexity of middle development has led some investigators to partition the landscape into trials that address dosing and clinical activity (sometimes called Phase IIa) and those oriented toward efficacy and prerequisites for a definitive comparative trial (sometimes called Phase IIb). This terminology is unhelpful because it is not descriptive, is weakly defined, and is inconsistently used, even in oncology where it seems to have originated (see Appendix B). There are many different relevant clinical questions that can arise in middle development, especially now in the era of targeted therapy. We need not restrict imaginative design or descriptive terms with jargon-based names.
13.2.2 Outcomes
Middle development may be the best early opportunity to observe clinical outcomes attributed to a new therapy. To be time efficient we may prefer not to wait for the evolution of definitive clinical outcomes to evaluate performance. Instead, it might make more sense to observe intermediate or surrogate outcomes that can be measured relatively soon following treatment. Shortening the observation period for each subject shortens the trial duration, may reduce sample size, and lowers cost. But it directly complicates the interpretation of results due to the problems associated with surrogate outcome validity. Details of this and related matters are discussed in Chapter 5.

Briefly, a good surrogate outcome will be in the causal pathway for the disease process, strongly correlated with the outcome, evident relatively soon after treatment, and easy and reliable to measure. But the most important characteristic for validity is that the surrogate must respond to treatment in the same way that the definitive outcome does. This is hugely demanding and would require validation across multiple trials. Although there are seemingly valid and useful surrogate efficacy outcomes, such as viral load in HIV and blood pressure for anti-hypertensives, there are examples where investigators have been badly misled by surrogates. An example is the Cardiac Arrhythmia Suppression Trial [50, 225], where there was higher mortality on treatment in definitive trials. This was the reverse of the effect seen in surrogates in early developmental trials.

A valid surrogate outcome essentially allows us to view the underlying disease and the effects of treatment on it, greatly leveraging any middle development design. Viral load in HIV infection is much superior in this regard to CD4 lymphocyte count, the surrogate used early in the epidemic. In cancer, it has seemed historically that tumor mass should be an obvious surrogate for survival time as a definitive outcome. While this is not necessarily true, many middle development trials in cancer have used tumor shrinkage as an outcome and probably with good result.

In some diseases we are not able to view the underlying disease process directly. Examples are many degenerative neurological diseases like Parkinson's or ALS, and other diseases like diabetes and psychiatric illness. Being unable to view the underlying disease process directly is a great disadvantage for clinical trials and development broadly. It means that we must rely on symptom or function assessments to gauge the effects of treatment, and these can be influenced by observer bias or subjects' effort. Even if we can
eliminate such bias, a treatment that improves symptoms only may not be distinguishable from one that truly modifies the course of the underlying disease, unless an appropriate study design is constructed to discriminate between them.

Safety is always an outcome for every clinical trial and there can be no surrogate for it. We might imagine a surrogate for risk, that is, a marker or intermediate outcome that suggests the therapy will produce unwanted or dangerous effects. Transient mild decrements in measures of organ function are examples of this. But for establishing safety, defined as the absence of such effects, only lengthy observation, active ascertainment of relevant outcomes, and frequencies below meaningful clinical thresholds will allow us to declare a treatment safe. Even then, larger and more sophisticated studies can yield evidence of previously unobserved risks.

It is worth recognizing an inherent paradox in the assessments of safety and efficacy in clinical trials generally. From middle development onwards, we tend to design trials with an emphasis on efficacy. This can reduce the reliability and precision for assessing safety signals. When there is reasonable evidence for efficacy, it will be questions of safety that primarily drive the decision to adopt a new treatment. This is particularly true when decent therapies are already available. The psychology of weighing this evidence comes out prominently in regulatory debates.

Biomarkers generally are more suitable for translation than for middle development. The focus in middle development is on clinical outcomes or their surrogates, and few biomarkers have been validated for such a purpose. As biological knowledge increases and we can resolve fundamental disease processes, some biomarkers may emerge as appropriate outcomes for middle development trials. Some measures derived from functional imaging may eventually be validated for this purpose, for example. This could make previously hidden disease processes directly observable.

13.2.3 Focus
It is sometimes said that a middle development design can focus either on the drug, agent, or therapy, or alternatively on the disease. The difference in focus may be simply a way of characterizing appropriate scientific questions. Some trials done after dose finding have to address questions about clinical pharmacology, dosing, and the best indication for the new therapy. These are focused on the drug. Other trials, perhaps employing targeted therapeutics, address questions of clinical benefit, combinations with existing therapy, and preparations for a definitive comparative trial. Such trials are disease focused.
13.3 DESIGN ISSUES

13.3.1 Choices in Middle Development
Understanding existing evidence, developmental needs and pressures, and design options makes middle development as straightforward as it is going to get. The details of the best next trial may be uncertain, but the need for additional evidence will be clear. There will be wide latitude in the size, timing, focus, and design of virtually any middle development trial. There may also be debate as to whether or not to proceed directly to a definitive comparative trial. The alternatives with respect to engaging middle development can be summarized as follows:
1. Skip it entirely
   (a) proceed to a rigorous comparative trial
   (b) proceed to an error-prone comparative trial
2. Discontinue development
3. Screening design
   (a) eliminate the loser, with or without biomarker support
   (b) staged design
   (c) futility
4. Selection design (without standard therapy)
5. Hybrid designs with adaptive features
At first this list may seem incomplete because it doesn't explicitly mention common ideas such as "randomized phase II" or "seamless phase II/III." Those are mostly cosmetic terms, and the appropriate concepts consistent with the list above are discussed in Section 13.7. Discontinuing development is inevitable if safety or feasibility concerns arise in dose-finding or from initial middle development trials. This may be the most valuable decision possible if the operating characteristics of the study design on which the decision is based are strong, because it prevents wasting resources on an ineffective treatment. Skipping middle development entirely and jumping to a definitive comparative trial can also be wise. It is a gamble that can be justified by specific circumstances discussed below. The other options listed above are discussed in the following sections.
13.3.2 When to Skip Middle Development
Most discussions of the developmental paradigm implicitly assume that middle development trials are always performed. However, this step was designed to answer specific questions in a specialized context. Not all therapeutic development fits the need for such trials. Middle development evolved to help us avoid doing expensive, lengthy, comparative trials on treatments that appear unlikely to be beneficial. This was a key issue in cancer drug development throughout much of the last 30 years of the twentieth century because there were many potential therapeutics, most were unlikely to work, and resources would not support testing all of them rigorously. Middle developmental designs with the efficiency of a single cohort and short-term surrogate outcomes can reliably reject ineffective treatments early. By spending a relatively small amount of calendar time and money on a study with reasonable operating characteristics, the overall development process can be made faster and more efficient. In some therapeutic settings, resources for performing a trial are relatively abundant, the number of potential therapeutics is small, and those in development are very promising. Also the marketplace may offer high profit and/or patient needs may be pressing. This could be the case for the development of some very specifically targeted drugs, for example. If so, investigators should move as quickly as possible to definitive comparative trials versus standard therapy, shortening the pipeline, reducing the total cost of development, and maximizing benefits and profits, assuming efficacy will be demonstrated.
The risks in skipping middle developmental trials are considerable, and derive mostly from the scenario where the new therapy is shown to be ineffective after an expensive comparative trial. The following points should be considered before omitting this step:

1. The sponsor must be willing to accept the financial risk of a negative comparative trial, especially if it might have been obviated by a negative middle development trial.
2. The intellectual risk of a long comparative study that yields a null result must be acceptable. This may not be the case for young investigators who need academic credit from activities early in their careers.
3. There should be strong evidence regarding safety and likelihood of efficacy in advance of the comparative trial. This may come from dose-finding or other early developmental trials.
4. The treatment must have a very compelling biological rationale. Examples are drugs or other agents that are targeted to receptors or pathways that are known to be essential in the disease pathway.
5. Calendar time must be critical. The potential for lost opportunity or profit can make this true.
6. The chance of unforeseen events, both within and outside the study, must be low based on the development history up to that point.
7. The cost of middle developmental trials must be high relative to the information gained.

In cases that appear to fit these points, we always intend to follow the principle of least chagrin, sometimes evident only in retrospect.

Middle development in oncology drugs classically depended on tumor response as a surrogate outcome. For some drugs, short-term shrinkage of the tumor does not appear to be biologically possible, such as for agents that retard or stop growth without killing cancer cells. This would seem to provide an additional rationale for eliminating the middle developmental step and proceeding directly to a comparative trial. Doing so is still risky. The history of the development of matrix metalloproteinase inhibitors (MMPI) reflects this risk. Many drugs in this class were pushed forward into comparative studies without evidence from middle development trials. Over 20 comparative trials were performed without a single positive result [222].

13.3.3 Randomization
Use of randomization in middle development trials seems to be widely misunderstood. The question arises most naturally when there are two or more treatments in development for the same population [1399]. We could perform the needed studies either serially or in parallel. Both strategies would use the same amount of calendar time and subjects. However, using a parallel design permits subjects to be randomized to the treatment groups, removing selection and temporal effects that might influence studies done serially. Parallel groups provide a basis for selection of the best-performing treatment without regard to the magnitude of differences. Determining the ranking of an outcome measure is usually easier (more efficient) than estimating relative effect size.
Randomization would be used in this setting to remove selection bias and to control temporal trends. Its purpose is not to motivate a formal hypothesis test of the type routinely used in comparative studies. In other words, a low power randomized trial is not a cheap way to obtain a definitive answer to comparative questions. Randomization assures that selection of the winning treatment is free of selection bias. This situation is comparable to a car, horse, or human race where the starting positions are chosen randomly and the winner is crowned regardless of formal statistical comparisons to the followers. These designs are discussed extensively in Section 13.7.

A second point of frequent confusion regarding randomized middle development trials is whether or not to include a standard therapy arm. As a matter of routine, none of the treatments in such a trial should be standard therapy. The principal reason to exclude them is resource utilization. Although including a standard treatment reference arm would improve the quality of decisions, we generally do not want to trade maximum efficiency for it. We expect that most new treatments tested in middle development will fail to be real advances, placing a premium on efficient evaluation and elimination of them. This perspective originated with the cancer drug development model but is valid more generally. Thus, the prototypical randomized selection trial will have two or more new treatments, no standard arm, and a primary objective of choosing for subsequent comparative testing the best-performing therapy, such as the one with the highest success rate, without regard to magnitude of differences.

Having made these generalizations, there are occasional circumstances in which randomization versus standard therapy would be appropriate. For example, if we have few middle developmental questions, a high degree of confidence in one, and substantial resources, then a randomized design versus standard therapy is easier to justify. However, in such circumstances, it may make sense to proceed directly to a true comparative design as discussed above. It is also important to design the error properties of such trials thoughtfully so that a good treatment is not discarded. It might be useful to look again at Section 10.3.4 to appreciate the potential quantitative properties of this approach.

Some special practical circumstances are required for selection trials. In particular, the accrual rate must be sufficient to complete a parallel-groups design while the question is still timely. Also the decision to conduct studies of some of the treatments cannot be a consequence of the results in other treatments. If so, the parallel design may be unethical, wasteful, or not feasible. Selection designs may be particularly well suited for selecting the best of several treatment schedules for drugs or drug combinations because these questions imply that the same population of patients will be under study. Similarly, when addressing questions about sequencing of different modes of treatment, such as radiotherapy with chemotherapy for cancer, randomized middle development studies may be good designs for treatment development.
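The selection principle can be quantified by the probability of picking the truly better arm when the winner is simply the arm with the higher observed success rate. The sketch below simulates this for a two-arm selection trial; the true response rates and cohort sizes are hypothetical and chosen only to illustrate how modest sample sizes can still give a high chance of correct selection without any formal comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_correct_selection(p_better, p_worse, n_per_arm, nsim=100_000):
    """Simulate a two-arm randomized selection trial: the 'winner' is the arm
    with the higher observed response rate; ties are broken by a coin flip."""
    x_better = rng.binomial(n_per_arm, p_better, nsim)
    x_worse = rng.binomial(n_per_arm, p_worse, nsim)
    wins = (x_better > x_worse) + 0.5 * (x_better == x_worse)
    return wins.mean()

# Hypothetical true response rates (35% vs 20%) at several cohort sizes.
for n in (20, 30, 50):
    print(f"n = {n} per arm: Pr[correct selection] = {prob_correct_selection(0.35, 0.20, n):.2f}")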
13.3.4 Other Design Issues
There are important design considerations for middle development trials that are more of a clinical concern than statistical. For example, if standard therapy is available, investigators may be reluctant to displace it with an unproven therapy. It might be possible to administer the new treatment early, then proceed to standard therapy, if needed, after the delay. This design is sometimes called the "up front window" design, which is hideously bad
terminology in my opinion. Some discussions of this type of design are given by Groninger et al. [649], Balis [104], Frei [535], and Zoubek et al. [1607]. The principal concern in such designs is that delay of standard therapy will be a detriment to the subject. It is also possible that some residual effect of the first treatment will interfere with the benefit of the second. Because this is more of a clinical question than a study design issue, it is not possible to provide uniformly sound advice about it. It might be possible to establish that delay of a standard therapy for a given disease is not harmful, but one cannot generalize from that circumstance to other diseases or even to other treatments for the same condition.
13.4 MIDDLE DEVELOPMENT DISTILLS TRUE POSITIVES
Results from middle development should not be expected to predict the outcome of definitive comparative trials. If we could do that, we would not need to perform the definitive trial. The true value of middle development is that it raises the prior probability of true positive results in the pipeline from a few percent to 15–20% or higher by eliminating therapies likely to fail a definitive comparison. Here, prior is with respect to late development. This can strongly increase the true positive frequency in late development without necessarily raising the overall success rate. This is the same effect that takes place in the pipeline overall as discussed in Chapter 10, but middle development provides a strong boost. Although somewhat redundant, I will repeat some of that discussion here to emphasize its importance.

The difficulty with false positives is exactly the same problem that diagnostic screening tests face. Even an excellent screening test yields mostly false positives when the background frequency of the disease is low. Analogously, a highly reliable randomized comparative trial yields mostly false positive results if applied when the prior chance of a positive result is low. Bayes rule [123, 124] demonstrates this neatly by showing the relationship between the probability of a true positive and a positive trial result. Denote a positive trial result by T⁺. The event A indicates that the treatment is a true advance, and Ā indicates the treatment is not a true advance. Then, Bayes rule can be used to relate the true positive rate, Pr[A|T⁺], to the design parameters of the trial and the pipeline properties. The same approach was used in the discussion of a quantitative pipeline in Section 10.3.
$$
\Pr[A \mid T^{+}] = \frac{\Pr[T^{+} \mid A]\,\Pr[A]}{\Pr[T^{+} \mid A]\,\Pr[A] + \Pr[T^{+} \mid \bar{A}]\,\Pr[\bar{A}]} \qquad (13.1)
$$
where Pr[T⁺|A] is the probability of a positive trial result given that the treatment is truly an advance, that is, the power of the trial, and Pr[T⁺|Ā] is the probability of a positive trial result given that the treatment is not an advance, that is, the type I error. Pr[A] is the
unconditional probability of a treatment advance, which is a property of the pipeline and context at the time the trial is conducted. It must be that Pr[Ā] = 1 − Pr[A]. We can then put equation (13.1) in the form

$$
\Pr[\text{true positive}] = \frac{(1-\beta)\,\Pr[A]}{(1-\beta)\,\Pr[A] + \alpha\,(1-\Pr[A])} \qquad (13.2)
$$
where 𝛼 and 𝛽 are the type I and II error rates. This is the same as equation (10.1). Assuming 𝛼 and 𝛽 are small, as they would be for a definitive trial, equation (13.2) shows that the true positive rate is driven almost entirely by Pr[A], the prior chance of success in the pipeline. This illustrates the role of middle development: to increase the prior probability of success that late development definitive trials see, so that most positive results are true positives. Early in development, Pr[A] will be small, perhaps less than 5%. Even strong comparative designs will yield mostly false positives when this is the case. If middle development can increase Pr[A] to 20%, the great majority of positive comparative trials will be true positives. This effect is shown in Figure 13.1, which plots the true positive rate from equation (13.2) versus study power for varying values of Pr[A], the prior chance of success. It is clear that study power alone is insufficient to improve the true positive rate, which is driven almost entirely by the prior probability of success.

What is not obvious is how difficult it is to enrich positive advances from a background of 5% to over 20%. This does not sound like a difficult task, suggesting that reasonable middle development designs will be sufficient. We can apply the same quantitative approach to this question where the input to middle development is a low prior probability, and the output from middle development screening is the input to definitive comparative trials.
FIGURE 13.1 True positive rate as a function of power and prior probability of success.
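Equation (13.2) is simple enough to check numerically. The short sketch below evaluates it for a definitive trial; the type I error of 0.05 is an assumed value for illustration, not one fixed by the text.

```python
def true_positive_rate(power, alpha, prior):
    """Equation (13.2): Pr[treatment is a true advance | positive trial result]."""
    return power * prior / (power * prior + alpha * (1 - prior))

# A reliable comparative trial (alpha = 0.05 assumed) at two pipeline priors.
for prior in (0.05, 0.20):
    for power in (0.80, 0.90):
        tpr = true_positive_rate(power, 0.05, prior)
        print(f"prior = {prior:.2f}, power = {power:.2f} -> true positive rate = {tpr:.2f}")
```

With a 5% prior, even a well-powered trial gives a true positive rate of roughly 45–50%; raising the prior to 20% pushes it to about 80% or more, which is the enrichment role of middle development described above.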
TABLE 13.1 Solutions to Equation (13.3) with Pr[A] = 0.05 and Pr[A|T⁺] = 0.25

α       1 − β        α       1 − β
0.04    0.25         0.10    0.63
0.05    0.32         0.11    0.70
0.06    0.38         0.12    0.76
0.07    0.44         0.13    0.82
0.08    0.51         0.14    0.89
0.09    0.57         0.15    0.95
To be specific, suppose Pr[A] ≈ 0.05 and we intend for Pr[A|T⁺] ≥ 0.2, where T⁺ now denotes a positive result from middle development. Values of 𝛼 and 𝛽 that satisfy

$$
0.25 = \frac{(1-\beta) \times 0.05}{(1-\beta) \times 0.05 + \alpha \times 0.95} \qquad (13.3)
$$
are shown in Table 13.1. The interpretation is that modest error control in middle development designs can raise the true positive rate from 5% in the early pipeline to 25% prior to definitive randomized comparisons. Based on Figure 13.1, this effect is sufficient to increase the pipeline true positive rate to over 85% following good comparative trials. If Pr[A] is very low, say 1%, it is difficult for any design to raise the true positive rate to over 20%.
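The entries of Table 13.1 follow from rearranging equation (13.3) to give the power required for a given type I error. A minimal check, under the same assumptions (Pr[A] = 0.05 and a target Pr[A|T⁺] = 0.25), is sketched below.

```python
def required_power(alpha, prior=0.05, target=0.25):
    """Solve equation (13.3) for (1 - beta) given alpha:
    target = (1-beta)*prior / ((1-beta)*prior + alpha*(1-prior))."""
    return alpha * ((1 - prior) / prior) * (target / (1 - target))

for alpha in (0.04, 0.05, 0.10, 0.15):
    print(f"alpha = {alpha:.2f} -> required power = {required_power(alpha):.2f}")
```

The printed values (0.25, 0.32, 0.63, 0.95) match the corresponding rows of Table 13.1.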
13.5 FUTILITY AND NONSUPERIORITY DESIGNS
Futility is a familiar concept in both middle and late development, and is implicit in many of the simple screening designs discussed above. For example, in the classic Gehan design for cancer therapy [571], a drug trial would end if it produced no responses out of the first n₁ subjects enrolled, where n₁ was chosen according to a futility bound. Any responses would require that an additional n₂ subjects be enrolled to estimate the success rate with more precision. For a response threshold of 20% and false negative rate of 5%, n₁ = 14 subjects. This design has often been corrupted, particularly by ignoring the second stage. However, it is true that zero responses out of 14 subjects yields an exact binomial upper 95% confidence bound of 20% for the true response rate.

A class of formal futility designs has been proposed for middle development in neurologic therapeutics. A recent discussion of this design is given by Levin [925] based partly on the original paper by Palesch et al. [1171] and the accompanying editorial [924]. The main intent of these futility trials is to identify ineffective treatments rather than establish efficacy. Like the cancer examples, these futility trials screen out therapies that should not proceed developmentally to comparative testing, while minimizing sample size and cost. Specific examples are given below.

One feature of a futility design is a threshold, or threshold improvement, below which we are not interested in further developing a treatment. The threshold is based on clinical considerations. If treatment performs above the threshold, development is continued. This indicates that the treatment does not appear to be futile, but is not a demonstration of efficacy. This idea of futility is similar, but not identical, to that applied to interim monitoring. In the case of conditional power, for example, a trial might be stopped for
futility if the probability of reaching the rejection region at the end of the study is small conditional on being near the null at the interim (see Chapter 18).

Another goal of futility methods is to match the consequences of decision errors with the designed probability of making those errors. For example, if we lack good treatments for a debilitating disease, the consequences of discarding a promising treatment may be severe. In contrast, if we carry forward a treatment that is not of true or added value, that mistake will ultimately be revealed with definitive testing, albeit at incremental cost and delay. We might prefer for the pipeline to be "optimistic" so that we are unlikely to miss a promising therapy, and could choose study designs and error rates to reflect that intent (Chapter 10). The issue of tailoring the type I and II error rates to the consequences of making those mistakes is not new or controversial, and I discuss it also in Section 16.7.1.

A third feature of some frequentist futility designs is a reversal of the typical roles played by the null and alternative hypotheses. Bayesian methods do not have this issue. The classical approach is to make the null hypothesis a statement of equivalence or lack of improvement. In contrast, a futility design might set the null hypothesis to be that the treatment exceeds some performance tolerance, while the alternative hypothesis is that the treatment is no better than control, or performs below the tolerance. This inversion of the null and alternative hypotheses partly gives the design its name. Reversal of the usual null and alternative hypotheses seems to have appeared first with [1171], though the motivation for it was not made explicit. This reversal is an important issue.
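The binomial arithmetic behind the Gehan first stage mentioned at the start of this section is easy to verify. The sketch below reproduces the first-stage size of 14 subjects for a 20% response threshold and a 5% false negative rate, and the corresponding exact upper confidence bound when no responses are seen.

```python
# Gehan-type first stage: smallest n1 such that a drug with true response
# probability 0.20 has at most a 5% chance of producing 0 responses in n1 subjects.
p, false_negative = 0.20, 0.05
n1 = 1
while (1 - p) ** n1 > false_negative:
    n1 += 1
print("first-stage size n1 =", n1)                      # 14

# Zero responses in 14 subjects: exact one-sided upper 95% confidence bound
# for the true response probability, from (1 - p_upper)^14 = 0.05.
p_upper = 1 - false_negative ** (1 / n1)
print("upper 95% bound on response rate =", round(p_upper, 3))   # about 0.19
```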
13.5.1 Asymmetry in Error Control
The statistical literature does not contain extensive guidance about the appropriate way to construct a null hypothesis. However, the traditional arrangement is more than a simple convention, as will be seen below, because of asymmetry in the way that type I and II errors are managed. A hypothesis test provides certain and quantifiable control over the type I error. When we reject the null—classically when we declare that there is a significant difference or a treatment improvement—we know exactly the probability of a type I error. In other words, the chance of advancing a truly ineffective therapy is controlled at the 𝛼-level of the test. Provided we are wise about the ways that a type I error can be inflated (e.g., multiplicity), the probability of this error is always under the control of the experimenter by the choice of the 𝛼-level or critical value for the test. When we reject the null, the type II error is irrelevant. The type II error given for our experiment is a hypothetical, conditioned on a specific effect size. There are many type II error probabilities, some high and some low, depending on the hypothetical effect size (i.e., a power curve). Failing to reject the null hypothesis does not inform us which specific alternative is the relevant one, and therefore does not tell us the type II error. We can only say that we have a low probability of discarding a large effect size and a higher chance of discarding a lower effect size. Thus, the hypothesis testing framework codifies an asymmetry: the type I is knowable unconditionally and the type II is not. Returning to the futility/non superiority design, we can now see a pitfall of reversing the null and alternative hypotheses. For example, suppose the null hypothesis states “the effect of treatment X exceeds standard therapy by 30% or more” and the alternative
states “the effect of treatment X is less than a 30% improvement”. If the null is rejected, treatment X will not be developed further and we know the probability that we have discarded a useful therapy: exactly the 𝛼-level of our hypothesis test. If we do not reject the null hypothesis, development continues but we are unsure of the probability that treatment X actually performs below our intended threshold. Was it important to control the probability of discarding a useful therapy? Or would it have been more useful to control the probability of advancing a poor therapy? Generally, the second question is more important as discussed next, but circumstance can make us attentive to the first.
13.5.2 Should We Control False Positives or False Negatives?
The awkwardness of hypothesis testing inhibits a rapid understanding of the implications of role reversal. But recall that the classical null hypothesis of no effect/improvement makes the type I error a false positive error. Thus, in the standard approach, our test controls the false positive rate exactly but only conditionally controls the false negative rate. Reversing the null and alternative hypotheses as in some futility designs allows our test to control the false negative rate exactly but only conditionally control the false positive rate.

Is it better to control false positive or false negative errors? Clinical circumstances drive the answer, just as they tell us to loosen or tighten error probabilities to reflect the consequences of the respective mistakes. The choice may also depend on where we are in development, and our beliefs about the overall success probability of ideas in the pipeline. Suppose that the disease under study is serious and there is at least one safe and effective treatment available. Then it is likely that we will prefer to control the false positive rate in our development strategy. This is true because we will not want a large risk of displacing even an imperfect treatment with an ineffective one. Missing a second treatment (false negative) is a less worrisome mistake when a treatment is already available. Strong control over the false positive rate in middle development also narrows the pipeline and tends to conserve resources so that fewer late trials may be needed. If the pipeline is full of developmental ideas, a small false positive rate is a wise strategy.

If we have not yet developed any safe and effective treatments, we might prefer to control the false negative rate for our pipeline. The best case for this is when therapeutic advances are rare, the pipeline is unpressured, and we can afford to invest in the ideas that arise. Lowering and controlling the chance of a false negative in middle development gives us the best chance of finding the first safe and effective treatment. However, we must keep in mind the cost for this preference. First, we will have less definitive control over the false positive rate, which might increase the risk that the first treatment developed is actually ineffective. This deficiency can be corrected in late development with larger definitive trials, but the costs will be higher. Lowering the chance of a false negative result in middle development may also mean that more late development trials are needed. In other words, it implies that the pipeline will need more capacity at higher cost. We incur additional developmental cost when choosing tight control over false negatives and less control over false positives rather than the other way around. This strategy should probably not be routine, even in neurological disease, but might be applied while we develop initial therapeutic tools.

As a last perspective, consider that strong evidence is required to reject the null, and strong evidence is unlikely to mislead us. Weak evidence is more likely to mislead, and
unlikely to reject the null. When the null hypothesis is one of efficacy/improvement, weak evidence encourages further development while obscuring the chance of carrying forward a loser. Strong evidence is needed to support non efficacy or non superiority. Generally speaking, we do not want to generate strong evidence that a new therapy is inferior to our benchmark. It is inferentially and ethically more appropriate to use strong evidence to support superiority rather than inferiority.
13.5.3 Futility Design Example
The futility/non superiority design with a reversed null hypothesis has been applied in neurodegenerative diseases such as ALS [618, 931], Parkinson's Disease [1119, 1482], and Huntington's Disease [751]. An example futility objective would be "to rule out that a new treatment is at least 15% superior to standard." These designs use relatively small sample sizes when error criteria are relaxed, as is often the case. A one-sided hypothesis is not sufficient reason to relax the error rate. Negative predictive values are high and positive predictive values are not as strong for reasons given above. Thus, these designs are good at identifying ineffective agents, but not so good at identifying effective agents.

NET-PD was a trial in subjects with early Parkinson's Disease. There were three treatment groups: creatine (n = 67), minocycline (n = 66), and placebo (n = 67). The primary outcome was change from baseline in the total unified Parkinson's Disease Rating Scale (UPDRS) score measured at 1 year or the time at which symptomatic therapy was needed, whichever came first. Investigators hoped that treatment would yield a 30% reduction in the UPDRS decline. If the study evidence ruled out a benefit of this magnitude, then it would be deemed futile to proceed with a confirmatory comparative trial of creatine or minocycline.

An historical control group was the comparator. Placebo and vitamin E treated groups (n = 401) from an earlier Parkinson's Disease trial, DATATOP [1178, 1385], were used for this purpose. In that cohort the mean change was 10.65 points (95% CI 9.63–11.67). A mean change of 7.46 points corresponds to a 30% reduction from that historical value. Based on the historical data, the null hypothesis was formulated as a mean change of 7.46 points, and the alternative hypothesis was a mean change exceeding 7.46. NET-PD was then designed to detect a true mean change of 10.65 points assuming a standard deviation of 10.4. This is an effect size of approximately 1. Type I and II error rates were set at 0.1 and 0.15, respectively, and a 5% non-adherence rate was assumed. The sample size needed to satisfy these design assumptions is 65 subjects. In other words, if treatment were ineffective, the trial would yield 85% power to see that the mean change exceeded 7.46 using a one-sided test with 10% type I error. The NET-PD design was a bit odd, in that the concurrent placebo group was not used directly in the hypothesis test. The placebo group did facilitate treatment masking and allowed the historical control to be validated. The one-sample comparison to a reference provided by the historical control reduced the required sample size substantially.

The trial results showed mean changes in the creatine group of 5.60 ± 8.69 (p = 0.96), minocycline 7.09 ± 8.71 (p = 0.63), and placebo 8.39 ± 9.76 (p = 0.22). The reference group placebo change was 10.65 ± 10.40. Thus, it was harder to establish futility using the historical control. The study points out the dangers of historical controls, changes in supportive care with time, differences in rater behavior and entry criteria, and selection bias.
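The reported p-values can be reproduced closely with a one-sample normal approximation against the historical null value of 7.46 points; the published analysis may have differed in detail (for example, by using a t distribution), so the sketch below is only an illustrative check.

```python
from math import sqrt
from scipy.stats import norm

# One-sided test of H0: mean UPDRS change = 7.46 (treatment at least 30% better
# than the historical reference) versus the alternative that the mean change
# exceeds 7.46 (futility).  Normal approximation assumed for illustration.
null_change = 7.46
groups = {
    "creatine":    (5.60, 8.69, 67),
    "minocycline": (7.09, 8.71, 66),
    "placebo":     (8.39, 9.76, 67),
}
for name, (mean, sd, n) in groups.items():
    z = (mean - null_change) / (sd / sqrt(n))
    p = norm.sf(z)   # probability of a change this large or larger under H0
    print(f"{name}: z = {z:.2f}, one-sided p = {p:.2f}")
```

The computed p-values (about 0.96, 0.63, and 0.22) match those quoted above.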
13.5.4 A Conventional Approach to Futility
My warnings in the previous section are not intended to discredit the idea of futility, which is also widely used in late development. An excellent recent discussion of this is by Herson et al. [711]. Reversal of the usual null and alternative hypotheses is not required either to construct an optimistic pipeline, to demonstrate futility/non superiority, or to align type I and II errors with their consequences. Null and alternative hypothesis reversal does not appear to be a feature of any of the old motivating examples for futility designs from the cancer literature.

Suppose we conventionally construct the null hypothesis to be "the effect of treatment X is less than a 30% improvement over standard," and the alternative hypothesis to be "the effect of treatment X exceeds standard therapy by 30% or more." This is a clean non superiority hypothesis. If we want an optimistic pipeline that favors moving therapies forward, we might set the type I error at 10% or even 20%. It could be a very serious error to miss a treatment that actually represented a 50% improvement, in which case we might want the type II error for that alternative to be as small as, say, 5%. Then weak evidence makes it difficult to advance a treatment, and strong evidence is likely to advance a treatment when it is actually good. Thus, the properties that we admire in a "futility" design can be achieved using conventional ideas.

A commonly employed middle development design, especially in oncology, with the potential to stop for futility is the two-stage design [1398]. This design is discussed quantitatively in Section 16.6.2. This two-stage design uses a conventional arrangement for null and alternative hypotheses. It is predicated on a binomial outcome and employs a decision rule at the end of the first stage to stop if the treatment is yielding a success probability below a prespecified threshold. If the success rate is promising, the trial continues to complete a second stage. A treatment can also be discarded after the second stage. A recent typical example of the use of this design is by Abou-Alfa et al. [4].
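The operating characteristics of such a two-stage rule are straightforward binomial sums. The sketch below computes the probability of early termination and the overall probability of calling a treatment promising under uninteresting and promising true success rates; the specific design parameters are hypothetical and are not taken from published tables or from the trial cited above.

```python
from scipy.stats import binom

def two_stage_oc(p, r1, n1, r, n):
    """Two-stage binomial screening design: stop for futility after stage 1 if
    <= r1 successes are seen in n1 subjects; otherwise enroll to n total and
    call the treatment promising if total successes exceed r."""
    n2 = n - n1
    early_stop = binom.cdf(r1, n1, p)
    promising = sum(binom.pmf(x1, n1, p) * binom.sf(r - x1, n2, p)
                    for x1 in range(r1 + 1, n1 + 1))
    return early_stop, promising

# Hypothetical parameters: first stage 1/10, overall 5/29 (illustration only).
r1, n1, r, n = 1, 10, 5, 29
p0, p1 = 0.10, 0.30   # uninteresting and promising true success probabilities

pet0, alpha = two_stage_oc(p0, r1, n1, r, n)
pet1, power = two_stage_oc(p1, r1, n1, r, n)
print(f"under p0 = {p0}: early stop prob = {pet0:.2f}, type I error = {alpha:.3f}")
print(f"under p1 = {p1}: early stop prob = {pet1:.2f}, power = {power:.2f}")
```

A large early-stop probability under p0 with a small one under p1 is exactly the futility behavior described in the text: ineffective treatments are discarded cheaply while promising ones usually proceed to the second stage.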
13.6 DOSE–EFFICACY QUESTIONS
Dose-finding on the basis of differential efficacy is a relatively expensive question compared to dose-finding based on toxicity or side effects. Toxicity titrated dosing is founded on the principle that more treatment (a higher dose) is better than less treatment, and the principal bound for dose is a prespecified risk threshold. Dose versus efficacy questions never seem amenable to this efficient strategy. For one thing, higher doses don’t necessarily yield higher efficacy, whereas a dose–toxicity relationship has to be monotonically increasing. Second, we never prespecify an efficacy threshold analogous to the “frequency of dose limiting toxicity” that terminates classic dose escalations. For the moment, assume there is no toxicity. Suppose we actually define “dose limiting efficacy” in individuals, meaning that there is (i) an efficacy signal of some level, nature, or quality that we accept as definitive benefit, and (ii) no rationale for using higher doses when we can produce the observed effect in a high proportion of subjects. Finally, suppose our efficacy measure is scaled to take values between 0 and 1. We would then have a situation exactly analogous to classic toxicity-titrated dose finding. We could titrate dose to a desired frequency of efficacy and expect efficiency in trial design similar to that seen in classic dose-finding. In fact, the same statistical and mathematical models could likely be used. This analogy is not perfect because, as indicated
above, there is no guarantee that higher doses are more likely to yield dose limiting efficacy.

One reason we don't do what is described above is that we do not usually have the necessary efficacy signal. Any such signal would have to be definitive and apparent very soon after treatment. What seems to be true more often is that efficacy signals are slower to manifest themselves, and may be manifest as the absence or delay of negative events rather than the occurrence of some positive event. The best example is survival, which is measured by the delay of death rather than the occurrence of some positive happening. Hence, it is difficult to map dose–efficacy into a dose–toxicity study design. As a result, it is frequently a better strategy to establish dose before encountering efficacy questions developmentally. Otherwise, we find that comparing doses with respect to efficacy is the same problem as comparing efficacy with respect to doses, which is as difficult as comparing different treatments. We can expect to pay for this with fully comparative trial designs and larger sample sizes.

As an example, consider the PREQUEL study, a randomized, masked trial of coenzyme Q10 (CoQ10) at three doses in subjects with pre-manifest Huntington's Disease (HD) [758]. The purpose of the study was to determine the highest safe and well-tolerated dose of CoQ10 over 20 weeks. Participants were CAG expansion-positive for HD but without manifest disease. The doses were 600 mg/day (n = 30), 1200 mg/day (n = 30), and 2400 mg/day (n = 30). There was no placebo group. The secondary aim was to estimate a dose–response relationship between CoQ10 and serum levels of biomarkers of oxidative stress, 8OHdG and 8OHrG.

The primary outcome was the ability to tolerate 20 weeks on the originally assigned dosage of study medication. The tolerability rate was expected to exceed 75%. Investigators would reject the null hypothesis of unacceptable tolerability if at least 25/30 subjects succeeded, with 𝛼 = 0.20. If the true tolerability rate was 89%, the trial would have 90% power. Alternatively, using a 26/30 decision rule, the trial would have 77% power to detect the same difference using 𝛼 = 0.10. The 25/30 decision rule was preferred by the investigators because rejecting a truly tolerable dosage is a more serious error than accepting a truly intolerable dosage. If too low a dose were taken into a comparative trial, a null result would be difficult to interpret. Typically, in a comparative trial, dosage reductions or rechallenges could be used to retain subjects. In any case, we can see that 90 subjects were needed to resolve the dose–efficacy question. This is several times as many as might typically be required to resolve a dose–toxicity question. Even so, the PREQUEL design was necessary for the setting.
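The operating characteristics of these decision rules follow directly from the binomial distribution. The sketch below checks them under the stated assumptions (a 75% tolerability threshold under the null and an 89% true rate under the alternative); the results agree with the figures quoted above to within rounding.

```python
from scipy.stats import binom

# Reject the null hypothesis of unacceptable tolerability (true rate at the 75%
# threshold) if at least k of 30 subjects tolerate the assigned dose.
n, p_null, p_alt = 30, 0.75, 0.89
for k in (25, 26):
    alpha = binom.sf(k - 1, n, p_null)   # Pr[X >= k] when the true rate is 75%
    power = binom.sf(k - 1, n, p_alt)    # Pr[X >= k] when the true rate is 89%
    print(f"rule >= {k}/30: alpha = {alpha:.2f}, power = {power:.2f}")
```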
13.7 RANDOMIZED COMPARISONS
Randomized comparisons are increasingly prescribed in middle development in the form of so-called “randomized phase II” trials, as a way to furnish strong evidence for a clinically meaningful treatment effect and the decision to continue or terminate development. Stronger evidence early would seem to fix deficiencies late in the pipeline. Deficiencies show up as too many randomized comparative trials ending with null results. We might expect to reduce such failures with more reliable middle development findings. Unfortunately, favorable middle development results are sometimes contradicted by
definitive comparative trials. This seems to have been the case for the cancer treatments iniparib [1165] and amrubicin [431, 805]. As appealing as this reasoning seems to be, it has defects. First, there is really no such thing as a “randomized phase II” trial—the term is essentially a deception for an under-powered comparative trial. The fact that such a trial is conducted in the typical window does not make it a middle development design. To keep the sample size low for such a trial, investigators must seriously weaken type I and II error control. It has been proposed to allow 40% type I errors, for example [1310]. This type of design is more properly described as an error-prone comparative trial. It can dangerously convey a false sense of reliability because randomization is used. The flip of a coin is randomization as well, but we would not base a developmental decision on it. My purpose here is not to speak against the merits of randomization, but rather to clarify design strengths and weaknesses in practical settings. It is not obvious that late development failures are a consequence of middle development weaknesses alone. My view is that such is not the case, as discussed extensively in Section 10.4. Blame might also be attributed to over-enthusiastic support for emerging therapeutic ideas based on preclinical evidence in idealized, low-noise models. Financial and intellectual investments in promising ideas can soften criteria for terminating development, and thereby encourage comparative trials. This is a classic intellectual conflict of interest. Understanding the best use for these designs depends on (i) errors, (ii) enthusiasm (blame), (iii) efficiency, (iv) ethics, and (v) economics as discussed below.
13.7.1 When to Perform an Error-Prone Comparative Trial
A cleverly corrupted adage says “If it is important enough to do, it’s important enough to do badly.” This thought might relate to a randomized comparative trial with relaxed error criteria. Because even a weak prelude to late development can increase the true positive rate as described in Section 13.4, such trials have to be considered as an option. But we must ask if doubling the sample size for an internal control group, for example, is worth the added reliability gained. It can be under some conditions. But Section 10.3 suggests that this strategy may have significant shortcomings. Reliability may be further weakened if surrogate outcomes are used. Conditions under which small randomized comparisons with relatively high errors might be appropriate include: (i) if the pipeline is not overburdened, (ii) if confidence in the treatment is not high (otherwise we would likely skip middle development), (iii) if a reliable external comparator or reference point is not available, (iv) if there is a strong development bias pushing ineffective therapies forward, and (v) if a new outcome is being used. When such circumstances fail to hold, we should be cautious about using such designs. We must also acknowledge different perspectives regarding the expenditure of resources for control groups in middle development. There are three relevant views: trial participants, academic clinical scientists, and commercial sponsors. Study participants might be very accepting of possible randomization to standard therapy. This is appropriate given how frequently the superior treatment arm is the control. But in fatal diseases or end of life with demonstrated unsatisfactory therapeutic options, only accepting a new therapy is more sensible for participants. This was the behavior of potential study participants early in the AIDS epidemic, and is a common view today among people
with life threatening diseases like brain tumors and other advanced cancers. Academic investigators and IRBs are likely to support this view because of their patient oriented perspectives and the high value placed on the principle of autonomy. It is natural for sponsors to prefer the strongest possible data at every stage of development. This allows a decision to terminate development to be made at the earliest possible time. In the current development paradigm, definitive evidence often arrives late after huge resources have been spent. Expending more resources in middle development and terminating an ineffective or toxic treatment there represents a large potential cost saving. Also, sponsors tend to have a positive bias toward the worth of their products in development beyond objective evidence, making it seem worthwhile to invest more extensively in their evaluation. This illustrates a potential for small randomized trials to put economic resources partially at odds with human resources and ethics. Generally, it is appropriate to take a dim view of trials with high error rates. The value of randomization is greatest when random errors are well controlled. Looking at the pipeline, other circumstances might suggest when high error rates can offset the value of randomization. For example, if the pipeline is full, the opportunity cost is high with a randomized control group because we invest double the resources in one trial. When the early evidence in favor of the new treatment is very high or very low, a small randomized trial may not be the most efficient strategy. High confidence suggests that we skip middle development, and low confidence suggests that we use a simple screening design to eliminate losers. A simple screening design, such as a single cohort with an external comparator, is quite sensible if the potential bias is thought to be small relative to the treatment effect.
13.7.2 Examples
An example of a middle development trial is a preliminary efficacy study first suggested by Schoenfeld [1341] using the now unfortunate term “pilot” studies. The goal of such a trial is to identify a promising efficacy signal from a new treatment. The hypotheses for such a trial are constructed as in a typical superiority study. The alternative hypothesis can be one-sided, although as emphasized elsewhere, this does not settle the question of the proper type I error rate. If a higher than typical type I error can be tolerated, sample size can be reduced appreciably. The type II error control would likely be set near conventional levels. In either case, these error rates should be set to reflect the consequences of the respective mistakes. The design of such a trial as with most middle development studies is more difficult when there are no biomarkers or interim measures of therapeutic activity available, or when clinical outcomes are highly variable or slow to evolve. An example of this type of trial was the middle developmental study of three doses of coenzyme Q10 in subjects with early Parkinson’s Disease who did not yet require conventional treatment. This was a randomized, double-blind, placebo-controlled trial [1386]. Aside from the demonstration that coenzyme Q10 was well tolerated, the trial indicated a slower rate of disability in subjects treated, especially at the highest dose, compared to placebo. The trial contained 16 subjects in the placebo group and 21, 20, and 23 subjects in each of the dose groups. Hence, it was not precise enough to form valid independent treatment assessments, leaving the need for a larger comparative trial. Such a trial in 600 participants was published in 2014, showing no benefit for high dose coenzyme Q10 [1179].
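To make the effect of relaxed error criteria concrete, the following sketch (mine, not the author's software) uses a generic two-sample normal formula of the kind given later in the book, with an assumed standardized effect of 0.5 and 90% power, to show how a larger one-sided type I error shrinks the required sample size.

```python
# Hedged illustration: per-group sample size n = 2*(z_alpha + z_beta)^2 / delta^2
# for a two-sample comparison of means at conventional and relaxed one-sided alpha.
import math
from scipy.stats import norm

def n_per_group(alpha, power, delta):
    z = norm.isf(alpha) + norm.isf(1 - power)
    return math.ceil(2 * z**2 / delta**2)

for alpha in (0.025, 0.10):
    print(f"one-sided alpha = {alpha}: n per group = {n_per_group(alpha, 0.90, 0.5)}")

# Roughly 85 per group at alpha = 0.025 versus roughly 53 at alpha = 0.10,
# showing how relaxed type I error control reduces a middle development trial.
```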
Illustrating some of the pitfalls of underpowered randomized comparisons is the trial of prednisolone versus docetaxel plus prednisolone in androgen-independent prostate cancer [508]. This “randomized phase II” trial included 134 participants and used the six-week PSA response as the primary (surrogate) outcome. There were 25 ineligible subjects, and outcomes were measured in 104 participants at 6 weeks and 97 participants at 12 weeks. Biochemical response rates in the docetaxel arm were approximately double those in the prednisolone alone arm. Survival, disease progression, and quality of life also appeared superior in the docetaxel arm. The authors concluded that “docetaxel plus prednisolone should become the standard systemic treatment in androgen independent prostate cancer”. They did not reconcile the purposes and design of the trial, or its placement in middle development, with the definitive clinical conclusions made. Randomized middle development designs have been suggested for routine use in brain tumor trials. Little progress has been made in that disease in recent years and there have been a few unsuccessful large randomized comparative clinical trials. It seems easy to implicate historically controlled single cohort designs as having selection or other bias in favor of ineffective new therapies, which then leads to failed comparative trials. While more rigorous middle development data would be helpful in this setting, investigators also have to respect the predicament of research participants faced with a fatal disease and minimally effective therapeutic options. They are properly reluctant to participate in trials as members of the standard therapy arm. Risk acceptance for new drugs and combinations is probably appropriate for them. While single cohort designs are subject to bias, only half as many research subjects are needed to answer a question as compared to randomized designs. Furthermore, in such aggressive diseases we should probably be looking for large treatment effects, well above the magnitude of selection bias.
13.7.3 Randomized Selection
As discussed in Chapter 8 and Section 13.3.3, the ideal application of randomization in middle development is to select the best-performing therapy reliably rather than to estimate the magnitude of treatment differences. Selection is an easier or statistically more efficient task than estimation, because it relies on ranking rather than measuring actual effect size. The probability of correctly selecting the superior treatment provides a quantitative hook for determining sample size for these randomized safety and activity trials.
Two Groups Suppose that there are two treatment groups of size n and the treatment group mean is the basis of the selection. Assume that the true means are 𝜇 and 𝜇 + 𝛿, the observed means are 𝜇̂1 and 𝜇̂2, respectively, and that the variance is known. For every possible value of the first mean, we must calculate the chance that the second mean is greater, and sum all the probabilities. Therefore, the chance that 𝜇̂2 ≥ 𝜇̂1 is

Pr(correct ordering) = ∫_{−∞}^{∞} 𝜙(u) ∫_{u}^{∞} 𝜙(v − 𝛿) dv du = ∫_{−∞}^{∞} 𝜙(u) Φ(u + 𝛿) du,    (13.4)
where 𝜙(⋅) is a normal density function with mean 0 and variance 1∕n and Φ(⋅) is the cumulative normal distribution function. The structure of this formula is quite general, but for two groups a shortcut can be used. The probability of a correct ordering of two groups is also the chance that the ordered difference between the observed means is positive. The difference 𝜇̂2 − 𝜇̂1 has mean 𝛿 and variance 2∕n, so that

Pr(correct ordering) = ∫_{0}^{∞} 𝜙(u − 𝛿) du,
where 𝜙(⋅) has mean 0 and variance 2∕n. Thus, we can choose n to make this probability as high as needed, or conversely, solve numerically for n, given a specified probability. For example, to have 90% chance of a correct ordering when 𝛿 = 0.50, 14 subjects are needed in each group (Table 13.2). For a binary outcome, the same theory can be used after transforming the proportions p and p + 𝛿 using the arcsin–square root,

𝜃 = 2(arcsin √(p + 𝛿) − arcsin √p),    (13.5)

which is approximately normal with mean 𝜃 and variance 2∕n. For example, when 𝛿 = 0.1 and p = 0.4, 𝜃 = 0.201. Note that in Table 13.2, 83 subjects are required in each group to yield 90% chance of correct ordering.

TABLE 13.2 Sample Size Needed for Reliable Correct Ordering of Two Means

            Pᵃ
  𝛿      .8     .85     .9     .95     .99
 0.10   142    215    329    542    1083
 0.15    63     96    146    241     482
 0.20    36     54     83    136     271
 0.25    23     35     53     87     174
 0.30    16     24     37     61     121
 0.35    12     18     27     45      89
 0.40     9     14     21     34      68
 0.45     7     11     17     27      54
 0.50     6      9     14     22      44
 0.55     5      8     11     18      36
 0.60     4      6     10     16      31
 0.65     4      6      8     13      26
 0.70     3      5      7     12      23
 0.75     3      4      6     10      20
 0.80     3      4      6      9      17
 0.85     2      3      5      8      15
 0.90     2      3      5      7      14
 0.95     2      3      4      6      12
 1.00     2      3      4      6      11

ᵃ P is the probability of correct ordering.
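A small sketch of this calculation may be helpful; it is mine rather than the author's, and simply inverts the normal approximation Pr(correct ordering) = Φ(𝛿√(n∕2)) to reproduce entries of Table 13.2.

```python
# Sample size for a given probability P of correct ordering under the normal
# approximation: n = ceil(2 * (z_P / delta)^2), where z_P is the normal quantile at P.
import math
from scipy.stats import norm

def n_correct_ordering(delta, P):
    z = norm.ppf(P)
    return math.ceil(2 * (z / delta) ** 2)

print(n_correct_ordering(0.50, 0.90))  # 14, as in the text
print(n_correct_ordering(0.20, 0.90))  # 83, the arcsine-transformed binary example
print(n_correct_ordering(0.10, 0.80))  # 142, the upper-left entry of Table 13.2
```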
There is a small problem with sample sizes determined for binary outcomes using this method. The equations are derived for a continuous distribution, whereas binomial outcomes are discrete. Suppose that r1 and r2 are the number of responses in the treatment groups with true success probabilities p and p + 𝛿, respectively. For r2 > r1, the binomial distribution yields

Pr(correct ordering) = Σ_{i=0}^{n−1} b_i(p) Σ_{j=i+1}^{n} b_j(p + 𝛿),

whereas for r2 ≥ r1,

Pr(correct ordering) = Σ_{i=0}^{n} b_i(p) Σ_{j=i}^{n} b_j(p + 𝛿),

where b_i(p) denotes the binomial probability of exactly i successes in n trials with success probability p.
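These double sums are easy to evaluate directly. The following sketch is my own check, assuming the p = 0.4 and 𝛿 = 0.1 of the example above, and is not drawn from the author's software.

```python
# Exact binomial ordering probabilities: sum over i of Pr(r1 = i) * Pr(r2 beats i).
from scipy.stats import binom

def pr_order(n, p, delta, strict=True):
    lo = binom(n, p)            # distribution of r1
    hi = binom(n, p + delta)    # distribution of r2
    return sum(lo.pmf(i) * (hi.sf(i) if strict else hi.sf(i - 1)) for i in range(n + 1))

print(round(pr_order(91, 0.4, 0.1, strict=True), 3))    # about 0.90, matching n = 91
print(round(pr_order(71, 0.4, 0.1, strict=False), 3))   # about 0.90, matching n = 71
```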
From these exact equations, to have 90% chance of correct ordering defined strictly as r2 > r1, 91 subjects are needed in each group. To have 90% chance of correct ordering defined as r2 ≥ r1, 71 subjects are needed in each group. The normal approximation yielded 83, a compromise between these numbers.

For correctly ordering hazards of failure from time-to-event outcomes, the derivation for the normal case applies with the following modifications. The log hazard is approximately normal with variance 1∕d, where d is the number of events observed in the group. To simplify, assume that the number of events in each of two groups is d. An appropriate transformation for hazards is

𝜂 = log(𝜆 + 𝛿) − log(𝜆) = log((𝜆 + 𝛿)∕𝜆) = log(Δ),

which is approximately normal with mean 𝜂 and variance 2∕d. For example, when 𝛿 = 0.05 and 𝜆 = 0.1, 𝜂 = 0.405. To have a 90% chance of correctly ordering the hazards, Table 13.2 shows that d = 21 events are required in each group. The same accrual dynamics discussed in Section 16.5.1 apply, so more than 21 subjects must be placed on study to yield the required number of events. A more elaborate approach to this type of problem is given by Liu et al. [948].

More Than Two Groups See Desu and Raghavarao [364] for a general discussion of ranking and selection. The essential sample size problem is to make a reliable selection of the best out of k ≥ 2 treatments. We assume that the sample size in each group is the same and that the variances are equal and known. The best treatment can be chosen by chance with probability 1∕k, so we require P > 1∕k, where P is the probability of correct selection. For more than two groups, equation (13.4) generalizes to

∫_{−∞}^{∞} Φ^{k−1}(u + 𝛾) 𝜙(u) du ≥ P.    (13.6)
Values of 𝛾 that satisfy this equation for some k and P have been tabulated by Bechhofer [129] and by Desu and Sobel [363]. They can be used to determine sample size according to

n ≥ (𝛾𝜎∕𝛿)² = (𝛾∕𝜉)²,    (13.7)

where 𝜎 is the standard deviation, 𝛿 is the minimum difference between the largest mean and all others, and 𝜉 = 𝛿∕𝜎 is the effect size of that difference. Equation (13.7) looks similar to other precision formulae given earlier in this chapter. It is relatively easy to calculate the
necessary 𝛾 values that satisfy equation (13.6) using current software. Some values are given in Table 13.3 that can also be used to determine the sample sizes for the two-group selection cases given above.

TABLE 13.3 Values of 𝛾 That Satisfy Equation 13.6 for Selection Designs

               k
  P        2       3       4       5
 0.80   1.190   1.652   1.893   2.053
 0.85   1.466   1.908   2.140   2.294
 0.90   1.812   2.230   2.452   2.600
 0.95   2.326   2.710   2.916   3.055
 0.99   3.290   3.617   3.797   3.920

Example 13.1. Suppose that there are four treatment groups, the effect size (𝛿∕𝜎) is 0.5, and we intend to select the correct group with probability 99%. From Table 13.3, 𝛾 = 3.797. From equation (13.7), the required sample size is n = (3.797∕0.5)² ≈ 58 per group.

Selection Design Example An interesting example of a middle development selection design was the QALS trial of coenzyme Q10 (CoQ10) in subjects with amyotrophic lateral sclerosis (ALS) [931]. The trial was conducted in two stages: the first stage was intended to choose the best of two doses of CoQ10, and the second stage was an efficacy comparison to placebo. The trial was like an adaptive design that dropped one dose group after 8 months of treatment and 9 months of additional follow-up in the first stage. However, stage 1 also incorporated a placebo treated group, which, like the selected CoQ10 dose, was enlarged with additional recruitments in the second stage. The second stage was 6 months of treatment and 9 months of follow-up. The dose selection stage used 35 subjects in each of the three treatment groups, and the efficacy stage used 40 additional subjects in each of the remaining two groups (185 subjects total) [931]. The QALS primary outcome was the change from baseline to 9 months in the ALSFRSr score (a standard measure). The selection stage with N = 35 per group was designed to yield an 80% probability of correct selection in the presence of a standardized effect of 20% difference in score decline. This can be seen in Table 13.2. The placebo group could not be selected in stage 1, and safety was not formally incorporated into the selection rule. Detecting an effect size of 0.2 with 80% power would require over 380 subjects per group in a conventional comparative design. The second stage of QALS was a futility trial using the selected dose of CoQ10.
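As the text notes, current software makes the 𝛾 values easy to obtain. The sketch below is my own illustration, assuming standard SciPy routines, of solving equation (13.6) numerically and then applying equation (13.7) to Example 13.1.

```python
# Find gamma satisfying equation (13.6) for k groups and probability P,
# then compute the selection sample size n = (gamma / effect_size)^2.
import math
from scipy import integrate, optimize
from scipy.stats import norm

def correct_selection_prob(gamma, k):
    # integral of Phi^(k-1)(u + gamma) * phi(u) over the real line
    f = lambda u: norm.cdf(u + gamma) ** (k - 1) * norm.pdf(u)
    return integrate.quad(f, -10, 10)[0]

k, P, effect_size = 4, 0.99, 0.5
gamma = optimize.brentq(lambda g: correct_selection_prob(g, k) - P, 0, 10)
n = math.ceil((gamma / effect_size) ** 2)
print(round(gamma, 3), n)   # roughly 3.797 and 58, agreeing with Table 13.3 and Example 13.1
```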
13.8 COHORT MIXTURES
A cohort under study might be a mixture of subgroups based on a biological characteristic indicated by a biomarker. Increasingly, such subgroups are derived from natural history coupled with genomic findings. Treatment efficacy can vary strongly among the subgroups and the trial design may need to reflect this as discussed in Sections 6.3.1, 8.5.1, and 14.3.2. Such heterogeneity can be complex, but to simplify assume that there
are two types of individuals with proportions p and 1 − p in the study population. In the p subset, the efficacy of the treatment is 𝜃, and in the 1 − p subset the efficacy is 𝜂, so the average treatment effect in the overall population is p𝜃 + (1 − p)𝜂. We can ignore this heterogeneity if p ≈ 1 or 𝜃 ≈ 𝜂. More generally, if or how to accommodate it in the design of our trial will depend on the magnitudes of p, 𝜃, and 𝜂.

Enrichment is the strategy of actively sampling to include disproportionately more p-type subjects in the study cohort. Active sampling is likely to be expensive compared to convenience sampling because it basically discards eligible subjects. So it must be motivated by a firm biological basis, which in this case would be provided by 𝜃 ≫ 𝜂. To assess how large the trial should be, if it should be enriched, and what design has minimal cost, the following considerations can help. Sample size for the enriched trial is N_e and for a trial in the overall population is N. The sample size ratio, R, is N∕N_e. It is also important to note that to accrue N_e subjects, we need to screen at least N_e∕p potential subjects. Denote the cost of screening a potential subject by C1, and the cost of accruing a subject to either the overall cohort or the enriched cohort as C2. Then the total cost of the overall strategy is NC2 and the total cost of the enriched trial is N_e C2 + N_e C1∕p = N_e(C2 + C1∕p). The relative cost of the overall trial to the enriched trial is

W = NC2 ∕ [N_e(C2 + C1∕p)] = R C2∕(C2 + C1∕p) = R ∕ (1 + C1∕(C2 p)).

When W > 1, the overall cohort strategy is more costly, and when W < 1, the enrichment strategy is more costly. The relative cost of the two strategies is the same when W = 1, or when

R = 1 + C1∕(C2 p),   or   C1∕C2 = p(R − 1).    (13.8)

This equation characterizes the break-even point. Recall that R is a function of p and 𝜃. The term p(R − 1) can be studied to tell us what the screening versus accrual costs must be to make one strategy or the other more cost effective. For example, suppose we are comparing mean treatment effects where the sample size is given by a formula like equation (16.22),

n = (Z_𝛼 + Z_𝛽)² 𝜎²∕𝛿²,

where 𝛿 is the treatment effect and 𝜎² is the person-to-person variance. Assuming that the variance in the enriched cohort is the same as in the overall cohort, the sample size ratio will be

R = N∕N_e = [𝜃∕(p𝜃 + (1 − p)𝜂)]²,
where the reference mean is taken to be 0. The critical cost ratio is then

C1∕C2 = p { [𝜃∕(p𝜃 + (1 − p)𝜂)]² − 1 }.

In brackets is a kind of amplification term, which is the ratio of the treatment effect in the subset divided by an average or mixed effect in the overall population. If the treatment has benefit only in the p subset (𝜂 = 0), then

C_r = C1∕C2 = p(1∕p² − 1) = 1∕p − p.    (13.9)

In this case the actual treatment effect is irrelevant, and we prefer the overall cohort or the enriched cohort depending only on the mixture proportion and C_r, the relative cost of screening and accrual. When the screening cost is low compared to the accrual cost, as may often be the case, we prefer the enriched cohort for nearly all mixture proportions p. If the screening cost is high relative to the accrual cost, as might be the case for a very expensive biomarker or genetic test, we might prefer the overall cohort design (Figure 13.2, solid line). These results are in accord with intuition. The decision boundary in Figure 13.2 is nearly linear with respect to the logarithm of the cost ratio, C_r, over four orders of magnitude. The approximate linear boundary is

log₁₀(C_r) = 2 − 4p,    (13.10)

which might be taken as a rule of thumb. Any mixture proportion that falls below the line of equation (13.10) would cause us to prefer an enriched cohort in our trial design.
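Equation (13.9) can also be solved directly for the mixture proportion at which the two strategies cost the same for a given screening-to-accrual cost ratio. The sketch below is my own illustration of that break-even calculation; it assumes nothing beyond equation (13.9).

```python
# Break-even mixture proportion from equation (13.9): costs are equal when
# C1/C2 = 1/p - p, i.e., p is the positive root of p**2 + (C1/C2)*p - 1 = 0.
import math

def break_even_p(cost_ratio):
    return (-cost_ratio + math.sqrt(cost_ratio**2 + 4)) / 2

for cr in (1.0, 0.1):
    print(f"C1/C2 = {cr}: break-even p = {break_even_p(cr):.3f}")

# Equal screening and accrual costs give a break-even near p = 0.62; screening at
# one-tenth the accrual cost pushes it to about p = 0.95, so enrichment is the less
# costly strategy for nearly all mixture proportions below that value.
```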
FIGURE 13.2 Decision boundaries for using an enriched cohort versus an unselected cohort as a function of the mixture proportion, 𝑝, and the ratio of screening to accrual costs. Solid line is for differences in means from equation (13.9). Dashed lines are for hazard ratios from equation (13.12). Upper dashed line is for 𝜃 = 1.4; lower dashed line is for 𝜃 = 2.0.
As a second example, suppose our trial has an event time or survival outcome. A basic sample size requirement from equation (16.29) is

D = 4(Z_𝛼 + Z_𝛽)² ∕ [log(Δ)]²,

where D is the required number of events and Δ is the hazard ratio (HR), which represents a treatment effect. Assume that the HR in the p subset is 𝜃, and in the less interesting part of the population is 1. This means that only the specified subset can benefit from the therapy. Then in a mixed overall cohort, the diluted treatment effect will be the weighted harmonic mean

𝜙 = 1 ∕ (p∕𝜃 + 1 − p),    (13.11)

which can be derived from the actual event rates, as in Section 16.9.3. In reciprocal form, we have 𝜙′ = p𝜃′ + 1 − p. The sample size ratio from above then becomes

R = [log(𝜃)]² ∕ [log(𝜙)]²,

which is the squared ratio of log hazard ratios, not to be confused with either a hazard ratio or a log hazard ratio. R is not sensitive to the direction of either hazard ratio, so we can take them to be their reciprocals if algebraically convenient. The critical cost ratio is

C_r = C1∕C2 = p { [log(𝜃)]²∕[log(𝜙)]² − 1 } = p { [log(𝜃)]²∕[log(p𝜃 + 1 − p)]² − 1 },    (13.12)

using equation (13.11) in reciprocal form. Equation (13.12) has the behavior shown in the dashed lines in Figure 13.2, which are nearly identical to the solid line from equation (13.9). The closeness is no accident because equation (13.9) is formally a first-order series approximation to equation (13.12). The size of the treatment effect in the subset of interest, 𝜃, is not a strong influence on the decision. This discussion also makes it possible to assess the impact of imprecision in the classifying test or biomarker that delineates the subset of interest. Both equations (13.9) and (13.12) demonstrate fairly flat behavior for a range of fractions between 0.2 and 0.8, suggesting that classification errors in that range will not strongly influence a decision regarding enrichment. Similarly, the logarithmic scale suggests that errors in the screening–accrual cost ratio will also be relatively uninfluential. I will return to the general topic of population mixtures driven by biomarkers in Section 14.3.2.
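A brief sketch, again my own rather than the author's code, evaluates equation (13.12) at the two hazard ratios plotted as dashed lines in Figure 13.2 and places it next to the mean-based boundary of equation (13.9).

```python
# Critical screening-to-accrual cost ratio for a survival outcome, equation (13.12),
# alongside the simpler mean-based boundary of equation (13.9).
import math

def cr_hazard(p, theta):
    phi = 1.0 / (p / theta + 1 - p)              # diluted hazard ratio, equation (13.11)
    return p * ((math.log(theta) / math.log(phi)) ** 2 - 1)

def cr_means(p):
    return 1 / p - p                              # equation (13.9)

for p in (0.2, 0.5, 0.8):
    print(f"p = {p}: means {cr_means(p):.2f}, "
          f"HR 1.4 {cr_hazard(p, 1.4):.2f}, HR 2.0 {cr_hazard(p, 2.0):.2f}")

# On the logarithmic scale of Figure 13.2 the boundaries track one another closely,
# and the size of theta has only a modest influence on the decision.
```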
13.9 SUMMARY
Middle development represents an opportunity to rid the pipeline of poorly performing treatments and increase the true positive rate. Classically, this step of development embodies a “fail early and often” philosophy. The purpose of middle development is not to predict the results of subsequent comparative trials—if that were possible, the
later trials would not be needed. Middle development is focused on safety and activity outcomes, and often employs surrogate measures. One option in middle development is to skip this stage entirely. Investigators have to be risk accepting to do so. Calendar time can be saved, but the risk of taking an ineffective treatment to an expensive comparative trial is real. The idea of futility designs with switched null and alternative hypotheses has been popularized in some middle development trials. It is important to recognize the inferential differences between controlling the rate of false positives versus false negatives implied by such futility designs. As a general principle, research designs preferentially control false positives as would be the case with a conventional futility design. Exchanging the null and alternative hypotheses controls false negative error rates. Randomized designs have been proposed in two ways in middle development. One is for selection among a small set of competing new treatments. That design is relatively efficient in terms of sample size. A second use of randomization is for comparisons against a standard-therapy control, which necessarily yields low power if modest middle development sample sizes are employed. However, middle development is defined by questions and design, not sample size. Hence, this second use of randomization is essentially an underpowered comparative trial no matter when it takes place. An important perspective in developmental trials with internal controls is that of the participant. In fatal diseases, arms employing standard treatments may have little appeal. In the current age of targeted therapies, testing in cohorts with specific characteristics defined by biomarkers may be essential in middle development. The relative cost of an enriched cohort versus a conventional approach can be determined. Enrichment is not universally the preferred strategy, depending on the relative cost of biomarker screening.
13.10 QUESTIONS FOR DISCUSSION
1. List some classic design features of middle developmental trials.
2. Sketch a hypothetical clinical circumstance in which it might be appropriate to skip middle development entirely.
3. If the prior probability of a true positive treatment effect is 1%, what true positive probability will result following a trial with 95% power and 2.5% type I error?
4. Investigators require a 99% probability of correctly choosing the winner from two new treatments with an expected standardized difference of 0.3. What sample size will achieve this goal?
5. Assume the cost of screening subjects for a trial is about the same as accruing them to the trial. What mixture proportions for a characteristic would then cause us to prefer including all members of the population? What if screening costs one-tenth of the cost of accrual?
14 COMPARATIVE TRIALS
14.1 INTRODUCTION

Comparative clinical trials are the most glamorous designs because they yield the strongest evidence that we can gather from a single trial, and often inform definitive therapeutic decisions. All clinical trials make comparisons at least implicitly, but I will use this term to refer to designs that contain an internal comparison group. The prototypical design randomly assigns subjects to treatment groups, hence the name randomized clinical (or comparative) trial (RCT). Such trials are also often large multicenter collaborations, increasing the scope and reliability of inferences generated from them. These trials sit atop the hierarchy of evidence discussed in Section 8.4. The three principal dimensions of an RCT are validity, complexity, and cost. The inherent paradox of RCTs is that their rigorous experimental structure simplifies and validates causal inferences, but complicates logistics, with consequent increases in cost. The art in performing such studies is optimal balance of validity, complexity, and cost. A rare excellent trial will optimize all three. Complexity and cost make RCTs targets for criticism, especially by those who discount validity and inferential simplicity. RCTs are limited by their expense and complexity, but that does not diminish their scientific strengths. The results we expect from RCTs are low bias, high precision, high reliability, and strong evidence. They represent the most powerful empirical arm of scientific medicine, a force that counterbalances rationalist deduction and dogma. Findings from an RCT may cause us to discard results of earlier development, overwhelm purely theoretical advantages that a given treatment might have, contradict medical dogma or replace standard of care, revise what we think we know about the biology of disease, regulate the availability of therapies in the marketplace, and make reimbursement decisions. RCTs
can appear invincible, but we must acknowledge the possible effects of bias, random variation, methodologic errors, and limitations in scope and applicability that make them an imperfect tool. It is not necessary or practical to employ an RCT for most therapeutic questions. But they are the best, if not the only, tool when we require a highly reliable (low error) estimate of a treatment effect that has roughly the same magnitude as person-to-person variability. Confidently separating a signal from noise of the same strength is the sole province of the RCT. However, we often use RCTs when we need some but not all of their strengths, which means that not every such trial winds up being definitive. One example of a less than definitive trial is when concurrent randomized controls are used, but error probabilities are relaxed, yielding a less reliable but less costly estimate than a rigorous trial would provide. This was discussed in Section 13.7. Another example is when we reduce control over observer bias by omitting masking or placebos, which can be necessitated by practical or ethics reasons. This is common in tests of surgical treatments. Wise compromises in design are not unique to RCTs. There are essentially no principles about RCTs that are not reflected in earlier development designs. Replicated RCTs provide the strongest empirical evidence regarding efficacy, but we seldom actually duplicate a trial. Trials addressing similar questions might be performed, from which shared evidence could contribute to formal meta-analyses. But true replicates of therapeutic questions in RCTs are uncommon. Unavoidable differences in trials can create uncertainties in meta-analyses or in some cases make them impossible to conduct. Meta-analyses combining trials with slightly different therapeutic questions are a testament to the true validity of biological rather than empirical extrapolation. Meta-analyses are discussed in Chapter 24.
14.2 ELEMENTS OF RELIABILITY
RCTs produce reliable results for several reasons, including (1) their placement in a developmental pipeline that presents a substantial odds of a true positive result just prior to the trial, (2) control over random errors commensurate with the consequences of those errors, (3) elimination of treatment selection bias, (4) reduction of observer bias, (5) control over both known and unknown confounders and extraneous effects, (6) strong evidence and high precision by virtue of sample size, (7) simplified analyses, (8) proper counting of events and missing data, and (9) transparency, oversight, and review. The reliability of RCTs is often seen as a consequence of their high power or precision. But the true reliability of an RCT is as much a consequence of the developmental pipeline as it is a result of individual design parameters. This principle was discussed in Chapter 10 which quantified pipeline effects in distilling out truly effective treatments. The enhanced reliability of an RCT when placed near the end of the developmental pipeline makes them the centerpiece of late development. Table 14.1 shows how a background true positive rate is amplified by a trial. Even a strong design is unable to compensate for a low background frequency of true positive findings. The 50–50 point comes at 3% background true positives with strong type I and II error rates. If we want over 90% true positives from our RCTs, the true positive frequency going into the trial must exceed 20%. Despite the ability of an RCT to amplify the true positive rate, bigger and stronger RCTs are not as much help as one might think.
TABLE 14.1 RCT Error Rates Amplifying True Positives

  Error Rates          Background       Post-Trial
  𝛼       1 − 𝛽        True Positive    True Positive
  0.025   0.90         0.01             0.27
  0.025   0.90         0.03             0.50
  0.025   0.90         0.05             0.65
  0.025   0.90         0.10             0.80
  0.025   0.90         0.20             0.90
  0.025   0.99         0.19             0.90
  0.01    0.90         0.09             0.90
  0.01    0.99         0.08             0.90
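The entries in Table 14.1 are a Bayes' rule update of the background true positive frequency by the trial's error properties. The following sketch is my own illustration of that calculation, not material from the text.

```python
# Post-trial true positive fraction given the trial's alpha, power, and the
# background (pre-trial) probability of a truly effective treatment.
def post_trial_true_positive(alpha, power, background):
    true_pos = power * background
    false_pos = alpha * (1 - background)
    return true_pos / (true_pos + false_pos)

print(round(post_trial_true_positive(0.025, 0.90, 0.01), 2))  # ~0.27, first row of Table 14.1
print(round(post_trial_true_positive(0.025, 0.90, 0.20), 2))  # ~0.90
print(round(post_trial_true_positive(0.01, 0.99, 0.08), 2))   # ~0.90, last row
```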
For example, if our RCT has 99% power and 1% type I error, the performance from the pipeline must be 8% going into the trial if we want to produce 90% true positives. Strong trials and good developmental pipelines are both required to produce true positive treatment advances reliably. Other elements of reliable design have been discussed in Section 8.5 and summarized in Table 8.4. Reliability has relatively little to do with analysis and everything to do with design. There is a large literature associated with the methodology of randomized trials, that requires no amplification here. However, there is much less said about why this sometimes unwieldy method is absolutely essential. The history of therapeutics in the last half of the twentieth century makes this clear, although the lessons may seem diffused nowadays. One nice perspective on the justification for controlled trials is in the classic Pocock text [1216]. The discussion here will take for granted the strong rationale for randomized comparisons and focus instead on contemporary issues related to precision medicine and the integration of biomarkers into randomized designs. Adaptive design features, which are not restricted to comparative trials, are discussed in Chapter 15. 14.2.1
Key Features
Many of the experiment design elements discussed in Chapter 8 are routine in RCTs. Some key features of late development RCTs are the following:

1. Definitive clinical efficacy and safety outcomes.
2. An internal concurrent control group that receives an appropriate comparator treatment.
3. Strong control over random errors.
4. Elimination of selection bias using randomized treatment assignment.
5. Elimination of placebo effect when practical and necessary.
6. Control over observer and ascertainment bias.
7. Control of known extraneous effects using balancing or blocking.
8. Control over unknown extraneous effects using randomization.
9. Large, heterogeneous cohorts that support external validity, which might require multi-center or multi-national sites.
10. Structured formal analytic plans.
11. Independent monitoring and safety supervision.

In short, these designs strongly support causal inferences, or reliable attribution of effects or differences to the treatment of interest.

14.2.2 Flexibilities
Despite what seem to be restrictive features, RCTs allow some key flexibilities, some of which will be discussed below. A few are the following:

1. Definition of treatment or intervention: could be drugs, biologicals, surgery, treatment algorithms, combinations, preventives, adjuvants, or even physician choice from among a set of therapies.
2. Experimental unit: usually the individual, but can also be groups or clusters of individuals.
3. Hypothesis: superiority classically, but can be equivalence or non-inferiority.
4. Cohort: can be large and loosely defined, as in a large simple or large-scale trial focused on population effectiveness as distinct from biological efficacy.
5. Number of treatment arms: two or more.
6. Clinical effect of interest: between-subject differences (completely randomized design), within-subject differences (crossover design), treatment interactions (factorial design), or more than one therapeutic question (factorial design).
7. Adaptive features: see Chapter 15.

14.2.3 Other Design Issues
Comparative trials are sometimes performed on an exceptionally large scale. A typical multicenter collaborative group might randomize a few hundred subjects. Some trials in oncology, cardiovascular disease, and prevention randomize thousands of subjects. The purpose might be to assess a small treatment difference or to enhance external validity by employing a large heterogeneous study cohort similar to the disease population. Some investigators have advocated simple methods of treatment definition and data collection to minimize the cost of these trials, giving rise to the term large simple trials [1597]. I will refer to such studies as large-scale (LS) trials, acknowledging the fact that they are frequently not simple. See also Freedman [523] and Souhami [1417] for a discussion of these types of studies. In a regulatory context RCTs are called pivotal because the evidence they provide is central to decisions regarding approval. The sampling design for most comparative trials uses a single subject as the unit of study. The treatment design can be more complex, however. Single-modality trials tend to be the most simple and the results of such trials are likely to be easily interpretable. The design for multi-modalities must permit isolating the effect of the treatment(s) under investigation without confounding by other modalities. Adjuvant trials are combined modality studies in which one of the treatments is given before or after the primary treatment in an attempt to enhance its effect. An example is the use of systemic chemotherapy following (or preceding) surgery for resection of tumors. Selecting the right combination of drugs and their timing relative to another
therapy is a challenge for RCTs. In cancer prevention, so-called phase IIb studies are comparative trials that employ intermediate endpoints. Thus they are not definitive unless the intermediate endpoint has been validated. Error control for RCTs nearly always consists of either a completely randomized design or blocking and stratification. Precision is increased primarily by increasing the sample size. Sometimes further reductions in variability can be accomplished using within subject differences, such as in crossover trials. Matching, which might be likened to an extreme form of stratification, is seldom used. Another distinguishing feature of comparative trials is that they are frequently performed using the resources of several institutions or hospitals simultaneously. This is necessary because most single institutions do not accrue a sufficient number of subjects to complete comparative testing of therapies in a reasonable length of time. The logistics and organization of these studies can be complex. Furthermore the potential heterogeneity that arises from using subjects in various institutions may be a consideration in the design, analysis, and interpretation of these studies. For many diseases fairly effective treatments are already available and widely used. Developing new treatments in this setting presents special problems for comparative trial design. For example, suppose that we wish to demonstrate that a new anti-inflammatory agent is as effective for chronic arthritis as an agent already in widespread use. Such a clinical trial would not be designed necessarily to show the superiority of the new treatment. Rather, it might be designed to demonstrate the equivalence of the new treatment with standard therapy, at least as far as pain relief and mobility are concerned. The new agent might also have a lower incidence of side effects. These designs, often called equivalence or non inferiority trials, are important, but sometimes difficult to implement.
Expanded Safety Studies Look for Uncommon Effects of Treatment When treatments are initially widely applied, as after regulatory approval, there is an opportunity to learn about uncommon side effects, interactions with other therapies, or unusual complications. Uncommon events may affect the use or indication of the treatment if they are serious enough. Treatments can become widely used after being administered to relatively few subjects in RCTs, as in the case of a few AIDS drugs or orphan drugs. Regulatory approval can be gained after studying only a few hundred individuals. An expanded safety (ES) study can provide important information that was not gathered earlier in development. Some post-marketing surveillance studies are ES trials. However, most such studies capture only serious side effects and may not precisely ascertain the number of subjects who have received the treatment. Furthermore, some of these studies are intended to be marketing research, to uncover new product indications that may protect patents or yield other financial benefits. A true ES trial would be designed to provide a reliable estimate of the incidence of serious side effects. There have been several circumstances in which these types of studies have resulted in removal of new drugs from the market because of side effects that were not thought to be a problem during development. Since 1974 at least 10 drugs have been removed from the market in the United States because of safety concerns. The rate is higher in some other countries [102]. In other circumstances, performing such studies would have been
of benefit to both the subjects and the manufacturers. An example of this is the recent problems attributed to silicone breast implants. When investigators are interested in uncommon but life threatening or irreversible side effects, ES clinical trials employ large sample sizes. However, these studies need not be conducted only in a setting of post-marketing surveillance. For example, similar questions arise in the application of a standard treatment in a common disease. Even single institutions can perform studies of this type to estimate long-term complication rates and the frequency of uncommon side effects. Other late development trials may recapitulate designs discussed above but be applied to subsets or defined populations of subjects. Such studies may investigate or target patients thought to be the most likely to benefit from the treatment and may be driven by hypotheses generated during earlier development. These types of late developmental trials should be governed by methodologic and design concerns similar to those discussed elsewhere in this book. One of the difficulties in interpreting these large-scale post-marketing safety studies is being able to attribute complications or uncommon side effects reliably to the treatment in question, rather than to other factors associated with the underlying disease process or to unrelated factors. This is even more of a problem when events occur at long intervals after the initial treatment. A second problem arises because many ES studies do not determine how many individuals have received the treatment, only the number who experience adverse events. An incidence rate for adverse events cannot be calculated unless the number of individuals at risk can be estimated. In some other countries closer tracking of drug use after marketing permits true incidence rates to be estimated.
14.3 14.3.1
BIOMARKER-BASED COMPARATIVE DESIGNS Biomarkers Are Diverse
A biological marker or biomarker is an objectively measured quantity that serves to inform or indicate an underlying biological state. A biomarker might indicate some normal process, a pathological condition, or a therapeutic response [514]. Operationally, biomarkers are things like metabolites, proteins, genetic markers, genes and gene products, and cells. There is great focus today on gene-based biomarkers. Genes may be assessed directly for amplification or mutation. Gene expression is often measured by mRNA levels. Microarrays can facilitate measuring expression of hundreds or thousands of genes simultaneously. Because a functional gene unit includes a regulatory region and coding region, the term “gene" as a biomarker may require additional clarity. Some familiar biomarkers are known to carry information about disease risk. Examples are cholesterol and lipoproteins, and cancer markers like prostate-specific antigen (PSA). High levels of PSA indicate the presence of cancerous cells, around which there are two debates. One is the extent to which PSA can be taken as a valid surrogate outcome for therapeutic development. The other is when, if, or which prostate cancers require intervention. These questions, while focused on prostate cancer, typify a universal need to use biomarkers for information about future outcomes. The utility of biomarkers can be understood from several perspectives. One follows recommendations of an NIH working group [655]. A type 0 biomarker is one that follows the longitudinal course of a disease process. A type 1 biomarker changes in response
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
BIOMARKER-BASED COMPARATIVE DESIGNS
403
to a therapeutic intervention, without necessarily indicating a change in the underlying disease. A type 2 biomarker reflects an alteration in the disease process and is essentially a surrogate outcome. In terms of therapeutic development, one type of biomarker might indicate target engagement by the new therapy. A second could be an indicator of alteration in the target behavior, especially as a function of dose of a therapy. If we can alter the target behavior, a third biomarker might indicate disease outcome is affected. From the perspective of clinical trial design, biomarkers can be classified into diagnostic markers, screening markers, prognostic markers (disease outcome), predictive markers (treatment response), and surrogate outcomes [978]. Predictive or prognostic factors include diverse constructs such as age, sex, functional status, and other characteristics of the individual that may be metabolically or genomically derived. Well-done clinical trials can be great assets in the identification and validation of any of these types of biomarkers [206]. A surrogate outcome is usually measured before, during, or after treatment to monitor its effects. These are discussed in Section 5.4. Predictive biomarkers are assessed prior to treatment to determine who might benefit from the therapy. For example in the current era of precision therapies for cancer, there would be little point in administering a targeted treatment unless we knew from a valid diagnostic test that the tumor was driven by the factor being targeted. Such a predictive biomarker might save patients from unhelpful treatment and unnecessary side effects, improve therapeutic development, and reduce costs. A prognostic biomarker is typically measured before treatment to indicate the likelihood of a specific long-term outcome following treatment. While prognosis with or without treatment might be different, a prognostic factor does not typically depended on treatment. For example, medical management can help individuals with either good or poor NYHA functional class heart failure, but the classification conveys information about risk even with treatment. Prognostic and predictive biomarkers must be validated, possibly at several levels. If a predictive biomarker emerges from analysis of archived tissue, for example, investigators would have to be certain that its measurement in prospective settings remains valid for the intended use. Beyond this, does the biomarker perform similarly in an independent cohort (clinical validation)? Finally, does use of the biomarker actually convey benefit to patients (clinical utility)? Improving therapeutic decisions via a biomarker is a direct benefit to patients. Clinical utility is the point where some prognostic factors can come up short. Information about prognosis is almost always useful, but the benefit to patients may be indirect. However, it would be clinically useful if a marker indicates that prognosis is so favorable that treatment is not required. For the remainder of this discussion focused on comparative trial design, I will assume that a relevant biomarker has been validated for the intended purpose. A biomarker assay for a multicenter study could be performed locally or centrally, with obvious consequences of misclassifications and discordances. I will not discuss those issues. The next sections discuss how a validated biomarker might be used to help answer therapeutic questions using randomized trial designs. See also [620] and [537]. 
Biomarker adaptive trials are discussed in Section 15.3.
Piantadosi
Date: July 27, 2017
404
Time: 4:45 pm
COMPARATIVE TRIALS
FIGURE 14.1
14.3.2
Biomarker enrichment design.
Enrichment
Enrichment has been discussed in Sections 13.8 in the setting of population mixtures for middle developmental designs, 6.3.1 with regard to selection bias, and 8.5.1 as a principle of trial design. The concept is relevant to comparative trials when the mixture is indicated by the value of a biomarker. The concept is illustrated in Figure 14.1 where all potential trial participants are assessed for the biomarker. Only those who test positive proceed to a randomized comparison. Presumably this biomarker carries some information with regard to the ability to benefit from the new therapy. The structure of this design is the same as a randomized discontinuation trial, although response during the initial part of such a study is not usually described as a biomarker. We would employ this design based on knowledge of the biology of a new agent and perhaps its molecular target. The control treatment allows elimination of possible selection bias attributable to a positive biomarker determination. The design does not allow evaluation of off-target effects of the new agent. Also, final developmental or regulatory indications will be limited to the marker positive subset. Biomarker refinements outside of the marker positive subset are not feasible. 14.3.3
Biomarker-Stratified
A second possible use of biomarkers in comparative trials is shown in Figure 14.2. Here again all potential participants are tested for the biomarker, and both positive and negative subsets are randomized to assess the relative effect of a new treatment. Such a design might be used when there is reasonable confidence in the biomarker, such as a target gene or pathway, but investigators continue to have equipoise for randomization to assess the treatment. This design controls fully for the prognostic effect of the biomarker and directly compares a new agent to control therapy in all subjects. It can permit retrospective evaluation of biomarkers measured by different methods, or different biomarkers in the same pathway. This design has also been called a biomarker interaction design because it permits estimating the interaction between a positive biomarker determination and the treatment effect. To accomplish this, one would have to be certain about the power of the trial for that purpose, because interaction effects are estimated with fourfold lower precision than main effects. An example of this type of design is the FOCUS4 master protocol, which is a molecularly stratified study in colorectal cancer [822, 1022].
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
BIOMARKER-BASED COMPARATIVE DESIGNS
FIGURE 14.2
14.3.4
405
Biomarker-stratified design.
Biomarker-Strategy
In a so-called biomarker-strategy design (Figure 14.3), participants are randomized to marker assessment or an unguided approach. In the biomarker guided group, the marker evaluation assigns subjects to new therapy or control. Participants in the unguided therapy can receive standard treatment or themselves be randomized between control and new therapy. Unfortunately, this design does not allow distinguishing between a biomarker prognostic effect and a treatment effect. The biomarker would have to be measured in the unguided arm to distinguish these effects. Also it is less efficient than a standard randomized design. The secondary randomization of the unguided therapy arm renders the design even less efficient. However, for complex strategies of multiple biomarkers, some version of this design might be required.
FIGURE 14.3
Biomarker-strategy design.
FIGURE 14.4 Multiple-biomarker signal-finding design.

14.3.5 Multiple-Biomarker Signal-Finding
When studying several biomarkers with the expectation that each will indicate a particular treatment, a design such as that shown in Figure 14.4 could be used. It is essentially equivalent to multiple single-arm trials conducted in parallel. As such it is amenable to the designs, sample sizes, and outcomes that might be used in single-cohort trials. However, it can’t really be considered comparative because the cohorts differ systematically by the marker selections. This design is further limited by inability to assess off-target or prognostic effects. Further biomarkers can’t be refined outside of the marker positive group. Some randomized strategies might be required if biomarkers overlap in their implications. To incorporate control therapies, a randomized enrichment design as shown in Figure 14.5 could be used. However, off-target effects still cannot be assessed by this design.
FIGURE 14.5 Multiple-biomarker randomized enrichment design.
14.3.6 Prospective–Retrospective Evaluation of a Biomarker
A study with both prospective and retrospective components may be necessary to evaluate a predictive or prognostic biomarker. Assume at the outset that we have a single welldescribed biomarker, and that it has been analytically validated. We might then imagine designing a prospective clinical trial to assess the clinical utility of the biomarker. The protocol for such a clinical trial might be complex, including a definitive statistical analysis plan (SAP). In a worst case type of scenario, investigators would actually plan and conduct such a trial for these purposes. However, it might be possible to find an existing clinical trial with many of the features intended for the design of a clinical utility study. An existing trial would require an appropriate design for the question at hand, archived specimens suitable for the biomarker assay from a sizeable proportion of participants, and a sufficient sample size to assess clinical utility precisely. An existing clinical trial testing a therapeutic question might not contain a sufficient sample size for a biomarker question. Also, this trial would have to be independent from all data sources used for developing and validating the biomarker. If such a trial exists, it could reasonably be used to test clinical utility of the biomarker in question. In such a case, the key biological specimens would have been collected prospectively and independently of biomarker development. The question of clinical utility would be prospective in design, but would come out of this archive in retrospect. The analysis would take place according to a prospective SAP. Hence the prospective–retrospective terminology.
14.3.7 Master Protocols
A master protocol is the idea that similar research questions can be brought under a common protocol rather than repeating the design, regulatory oversight, and logistics of independent individual clinical trials [1254]. A master protocol could cover multiple diseases, multiple treatments, or multiple biomarkers. A similar concept is the platform trial [149]. In the latter case for example, participants can be screened on the basis of biomarkers and assigned to a randomized sub-study or arm that is most appropriate. Treatment arms can be opened or closed independently of one another. The study depicted in Figure 14.5 could be viewed as a master protocol, for example. Other names have been applied to the master protocol idea, including umbrella trial, cloud trial, and basket trial. Advantages of the master protocol approach are that the trial network and infrastructure can be made more efficient, and data quality and sharing can be facilitated. In some cases, innovative statistical designs can be applied to the collaboration. From a potential participants perspective, overall screen failure or rejection can be lowered compared to an individual trial. The common protocol is more efficient to launch and maintain than individual trials. This overall structure lends consistency to the methods used across the various treatment questions. These benefits can likely shorten the time to completion, review, and approval of new therapies. An early example of this approach is the Lung-MAP study in advanced non small cell lung cancer [705], which uses “umbrella screening” coupled with multiple drugs to match participants with sub-studies based on tumor characteristics. A master protocol also covers the “basket” idea in which a treatment targeting a single pathway or mutation in the case of cancer can be tested across several disease types. Although not randomized,
the NCI MATCH trial has this design [1018]. In the first several hundred individuals screened for entry into the NCI MATCH trial, only about 2.5% could be assigned to one of the treatments, pointing out one of the difficulties with this design.
14.4 SOME SPECIAL COMPARATIVE DESIGNS
There are many variations and modifications of the basic RCT design tailored to meet special circumstances. A few will be discussed here in broad strokes.
14.4.1 Randomized Discontinuation
The randomized discontinuation (RD) design originated as an attempt to reduce placebo therapy [32]. An RD trial initiates with all participants on the treatment of interest. In the first part of the trial subjects are evaluated for clinical benefit. Those who do not benefit are taken off the trial and allowed to receive an appropriate alternative treatment. In the second part of the trial, the remaining subjects are randomized to either continue or stop therapy. Hence the term “randomized discontinuation.” The treatment effect is then estimated from the difference between outcomes in the randomized groups. This design might be most appropriate for long-term trials without severe outcomes like death [1429]. The RD design has several potential strong points. First, it incorporates a type of cohort enrichment because those who do not benefit from the therapy are taken out of the study cohort in the first step. The remainder of the trial is conditioned on a cohort proven capable of the relevant therapeutic response. Investigators do not require a biomarker or other predictive test on which to base a cohort-enrichment strategy as discussed elsewhere in this book. Second, the RD design provides an opportunity for all trial participants to benefit from a new therapy. This may be an attractive recruitment point or a comfort for investigators [1284]. An unfortunate weakness of the RD design is the requirement to withdraw apparently beneficial treatment from one of the randomized groups. This point of ethics can be more acute than the question of benefit in the usual RCT. In the typical RCT, benefit is unknown at the outset, and the trial would incorporate interim analyses and early stopping to maintain an appropriate risk–benefit context for the participants if one of the treatments appears convincingly better than the other. The RD design actually removes active therapy, and the ethics concern is that it specifically disadvantages study participants. An additional potential weakness of the RD design is that it may also not be as efficient as we would expect under idealized circumstances [538]. An example of randomized discontinuation is a comparison of sorafenib versus placebo for their effect on time to disease progression in metastatic renal cell cancer [1250]. This study was described as “phase II” because of developmental timing, despite its formal comparative design and the initial entry of 202 subjects. After the initial treatment period, 33 subjects on placebo were compared to 32 on drug, the remainder having either clearly benefited from the drug or clearly progressed while on treatment. This randomization of those whose disease was stable circumvents the problem
mentioned above of withdrawing a clearly beneficial therapy. This trial showed an increase in progression-free survival from 6 weeks on placebo to 24 weeks on study drug with a 𝑝-value of 0.009. This and other evidence was convincing enough to continue development in a larger comparative trial [429], and the drug was eventually FDA approved. Another example of randomized discontinuation is the trial of aripiprazole for maintenance therapy in schizophrenia [819]. In that trial, 710 subjects began a period of oral therapy for stabilization of symptoms. Of those, 576 progressed to the need for intramuscular (IM) depot therapy; 403 were subsequently randomized 2:1 to masked IM aripiprazole or placebo. The outcome of the trial was time to exacerbation of psychotic symptoms. In this trial, individuals who were stable on aripiprazole were randomized to placebo. The unbalanced randomization may have been intended to compensate for this troublesome circumstance. The trial was stopped after 64 exacerbation events with an estimated hazard ratio of 5 favoring study drug (𝑝 < 0.0001).
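As a simple illustration of the two-part structure described above, the following sketch simulates a randomized discontinuation trial: an open run-in on active therapy, removal of apparent non-responders, and randomization of the remaining enriched cohort to continue or stop. All rates, sample sizes, and the function name are hypothetical choices for illustration only; they are not data from the sorafenib or aripiprazole trials.

```python
# Minimal simulation sketch of a randomized discontinuation design
# (hypothetical response and progression rates, not data from any real trial).
import random

def rd_trial(n=200, p_responder=0.4, p_progress_on_drug=0.2,
             p_progress_off_drug=0.6, seed=7):
    """Part 1: all subjects receive the drug and apparent non-responders leave
    the study.  Part 2: responders are randomized to continue or stop therapy,
    and progression is compared between the randomized groups."""
    random.seed(seed)
    n_responders = sum(random.random() < p_responder for _ in range(n))
    progressed = {"continue": 0, "stop": 0}
    assigned = {"continue": 0, "stop": 0}
    for _ in range(n_responders):
        arm = random.choice(["continue", "stop"])
        p = p_progress_on_drug if arm == "continue" else p_progress_off_drug
        progressed[arm] += random.random() < p
        assigned[arm] += 1
    return {arm: progressed[arm] / assigned[arm] for arm in assigned}

print(rd_trial())   # progression rates by arm among the enriched responder cohort
```

The randomized comparison in the second stage estimates the treatment effect only within the enriched responder cohort, which is the conditional inference discussed above.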
14.4.2 Delayed Start
The delayed start study design has been proposed to detect treatments that produce disease modification, as opposed to mere symptom alleviation [329, 908]. The design is appropriate for some diseases with a sub-optimal observation model, as discussed in Section 8.2.4 and Figure 8.1. The essential problem is when investigators are unable to view the underlying disease process directly, and must infer its status via symptoms. This happens regularly in degenerative neurological diseases, for example. The delayed start design is essentially a one-way crossover. In the first treatment period, subjects are randomized to treatment or control. There is a short wash-in period, during which there may be some improvements in the experimental arm. During the first treatment period the natural history of the disease is progressive, and changes in the treatment and control arms may be parallel. At the end of the first treatment period, the control group begins to receive the new therapy. This delay of therapy in the control arm gives the design its name. Again there may be a short wash-in period during which effects of the new treatment may alter the course of the control arm. Both groups continue on the new therapy for the duration of the second treatment period. If the new therapy provides symptomatic improvement only, both groups will finish the second treatment period at the same place with respect to the chosen outcome measure. It does not matter whether symptoms were alleviated in the first or second treatment period: the underlying disease was unaffected, and both groups progress as though there was no lasting effect. In comparison, if the therapy has modified the underlying disease status, the experimental arm will have had longer on the therapy than the control arm, and the result will be a difference in outcomes after the second treatment period. This design has been implemented in Parkinson's disease [1143, 1145, 1146, 1247]. There are a few potential complicating issues with the delayed start design. One is missing data, which can have strong and differing effects in the two treatment periods. The design seems best suited to an outcome measure that changes linearly over the course of the trial. This is reasonable for functional rating scales for some degenerative neurological conditions, but not universal. Also, the design does not appear to be robust with respect to time or period by treatment interactions.
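The logic of the design can be sketched with a small calculation. Assuming, purely for illustration, a linearly worsening outcome, a fixed symptomatic benefit, and a hypothetical slowing of progression while on drug, a purely symptomatic treatment leaves no end-of-study difference between the early-start and delayed-start groups, whereas a disease-modifying one does. All slopes, durations, and the function name below are invented for this sketch.

```python
# Sketch of delayed start logic under a hypothetical linear progression model.
import numpy as np

def mean_score(start_week, disease_modifying, weeks=72,
               natural_slope=1.0, symptomatic_benefit=5.0, slowing=0.4):
    """Mean symptom score over time (higher = worse) for a group that starts
    active therapy at `start_week`.  The disease worsens by `natural_slope`
    points per week; therapy gives an immediate symptomatic benefit and, if
    `disease_modifying`, slows progression by `slowing` while on drug."""
    t = np.arange(weeks + 1)
    on_drug = t >= start_week
    weekly = np.where(on_drug,
                      natural_slope * (1 - slowing * disease_modifying),
                      natural_slope)
    score = np.concatenate([[0.0], np.cumsum(weekly[1:])])
    return score - symptomatic_benefit * on_drug

for modifying in (False, True):
    early = mean_score(start_week=0, disease_modifying=modifying)
    delayed = mean_score(start_week=36, disease_modifying=modifying)
    print(f"disease modifying={modifying}: end-of-study difference "
          f"(delayed - early) = {delayed[-1] - early[-1]:.1f} points")
```

Under these assumptions the difference at the end of the second treatment period is zero for a purely symptomatic drug and nonzero for a disease-modifying one, which is exactly the contrast the design is built to detect.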
14.4.3 Cluster Randomization
In most therapeutic clinical trials, the experimental unit is the individual human subject. Sometimes, however, the treatment or intervention cannot or should not be narrowed down to an individual. Treatment is then applied to a group or cluster of individuals, which becomes the experimental unit. This circumstance could arise in infectious disease trials where it might make more sense to treat all the members of a household, for example. Some interventions must be applied in settings such as intensive care units or emergency departments, in which case it might be impractical or unsafe for the staff to cope with similar patients receiving different therapies simultaneously. The entire unit might then be randomly assigned one of the treatments, creating a cluster. This partially confounds the treatment effect with the unit or institution effect, but it could be preferable to the impracticality of individual randomization. In prevention trials, the intervention itself might take place at the population level whereas the effects will be felt by individuals. Advertising as an intervention to reduce smoking is an example. Other public health measures implemented in churches, schools, villages, or cities may create the same type of clustering. Clusters may be randomized to one intervention or the other.
14.4.4 Non Inferiority
Non inferiority and equivalence designs were mentioned briefly earlier in this chapter. Their goal is to demonstrate that a new treatment is not clinically significantly worse than standard therapy with respect to a definitive outcome. The threshold for what it means to be worse than standard therapy is determined on clinical grounds, but is expressed quantitatively. For example, in a serious disease we might consider that a 20% higher risk of death (hazard ratio of 1.2) relative to standard treatment is clinically acceptable for a new treatment, provided that it carries some advantages like fewer or milder side effects or improved quality of life. The trial design would have to be able to show a hazard ratio convincingly less than 1.2 instead of less than 1.0 as in a superiority design. The idea of a non inferiority design is a slight misnomer because there is no structural design modification to produce it. The design is entirely a consequence of the crafting of the null and alternative hypotheses. In some instances, the required sample size is larger than that of a typical superiority trial as a result. Illustrations of this are provided in Section 16.7.10. In a non inferiority trial, we can't escape the fact that small differences may be clinically consequential, motivating larger sample sizes than those often employed in optimistic superiority studies.
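A rough numerical sketch shows why a non inferiority margin drives sample size. The calculation below uses the familiar Schoenfeld-type approximation for the number of events needed with 1:1 allocation; the margin, error rates, and function name are illustrative assumptions, not prescriptions from this chapter.

```python
# Illustrative sketch: approximate number of events for a non inferiority
# comparison on the hazard ratio scale (Schoenfeld-type approximation,
# 1:1 allocation).  All numbers are hypothetical.
from math import log
from scipy.stats import norm

def ni_events(margin=1.2, true_hr=1.0, alpha=0.025, power=0.90):
    """One-sided test that the true hazard ratio is convincingly below `margin`."""
    z = norm.ppf(1 - alpha) + norm.ppf(power)
    return 4 * z**2 / (log(margin) - log(true_hr)) ** 2

# Ruling out a 20% excess hazard when the treatments are truly equivalent
print(round(ni_events()))                           # roughly 1260 events
# Contrast with a superiority trial powered for a 25% hazard reduction
print(round(ni_events(margin=1.0, true_hr=0.75)))   # roughly 500 events
```

The comparison illustrates the point made above: demonstrating that small differences are absent typically requires many more events than demonstrating a clinically large improvement.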
14.4.5 Multiple Agents versus Control
There are circumstances where an RCT requires more than two treatment groups. More than one experimental therapy might be compared to the same control group, for example. We might also imagine a need for two control groups, for example with and without placebo, but this seems wasteful. Also, a multiarm trial may be required when studying combinations of treatments. The most important class of those designs is factorial as discussed in Chapter 22. Multiarm non factorial trials are common in cancer, for example, where treatment optimization depends on combination therapy.
Using this sort of parallel structure will be more efficient than serial two-arm RCTs because the control group is not repeated. There is also likely to be some savings of calendar time because the total sample size is correspondingly reduced. Inferences between experimental arms are enhanced because of the common control. Participants may like the overall greater chance of being assigned to a new experimental therapy. But the multiarm design presents a problem of multiple comparisons, for which some investigators will want to restrict the type I error rate. Doing so will remove some but not all of the gains in efficiency [539]. When comparing several experimental therapies to control treatment it is natural to consider dropping arms that perform below some standard at interim analysis. Such a procedure is adaptive in the sense of Chapter 15, and is a selection design. In treatment arms dropped early, the estimated relative treatment effect compared to control will be slightly biased toward the null. The optimal distribution of sample size among more than two treatment arms requires some consideration—equal allocation is not necessarily the best choice. The optimal allocation is discussed in Section 17.6.4 where it is shown that, assuming we intend to minimize the variance of treatment comparisons, the ratio of treatment to control sample sizes should be 1/√k. Here k represents the number of experimental treatments so there are k + 1 total arms. If there are 4 treatments and 1 control group, k = 4 and the optimal control-to-treatment allocation is 2:1 for all groups since 1/√4 = 1/2. The relative chance of assignment to experimental therapy versus control is 2:1, but not 4:1 as might be superficially assumed. There can be some significant challenges with multiarm trials. Masking across all groups may be a problem. Also, the study entry criteria will have to allow participants to be eligible for any of the treatments. Combination therapies may raise concerns about treatment interactions or overlapping toxicities. If we are comparing agents from different sponsors or manufacturers, getting cooperation may not be so straightforward. A commercial entity will not be keen on having its agent shown to be inferior to others. Nevertheless this sort of comparison is vital to improving the quality and economics of care. An example of a three arm trial in cancer is a currently ongoing trial of sunitinib or sorafenib in subjects with locally advanced renal cell cancer sponsored by NCI (clinicaltrials.gov identifier NCT00326898). This study is not formally a selection trial. The allocation ratios are 1:1:1.
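A small sketch of the allocation rule just described may be helpful; the function name and printed fractions are illustrative only.

```python
# Sketch of the optimal allocation rule described above: with k experimental
# arms sharing one control, each experimental arm receives 1/sqrt(k) as many
# subjects as the control when the goal is to minimize the variance of the
# treatment-versus-control comparisons.
from math import sqrt

def allocation_fractions(k):
    """Return (control fraction, per-experimental-arm fraction) of total accrual."""
    control_weight = sqrt(k)          # control : each experimental arm = sqrt(k) : 1
    total_weight = control_weight + k
    return control_weight / total_weight, 1 / total_weight

for k in (2, 3, 4):
    c, e = allocation_fractions(k)
    print(f"k={k}: control {c:.2f}, each experimental arm {e:.2f}")
# For k = 4 the control arm accrues twice as fast as each experimental arm (2:1),
# and experimental therapy overall is favored 2:1 over control, as in the text.
```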
14.5 SUMMARY

The most reliable research design for a clinical trial is randomized group assignment with one or more internal controls. Such designs are nearly always coupled with a definitive outcome. Treatment and evaluation masking adds to validity. The principal value of randomization is control over both known and unknown confounders. Such designs effectively control selection bias and ascertainment bias. Although this research design is typically the most valid, the complexity and cost of performing randomized comparisons demands preparatory developmental studies and substantial funding. It is possible to overlay key flexibilities on randomized designs including multiple treatment arms, interim decision points, and flexible cohorts.
Standard comparative designs are usually structurally simple. However, in the modern era of targeted treatments, incorporation of biomarkers makes the design structure more complex. Biomarkers may be essential for cohort definition or enrichment, and are sometimes also the subject of investigation. Biomarker utility can be assessed by randomizing both marker positive and negative subsets. Many modified designs using randomization have been proposed for special circumstances. These include randomized discontinuation, delayed start designs intended to assess disease modification, cluster randomization, and multiple treatment arms. In the last case, it may be useful to observe that the optimal allocation across arms is not necessarily equal.
14.6 QUESTIONS FOR DISCUSSION
1. In an RCT with 1 control group and 3 treatment groups, what is the optimal allocation ratio between treatment and control?
2. Just before a trial, the prior chance that a treatment is a true positive is 5%. If the trial has a 5% type I error rate and a 20% type II error rate and the treatment appears positive, what is the chance that it is a false positive?
3. Discuss the pros and cons of centralized biomarker assays versus institutional assays in a multicenter trial.
4. Consult the following randomized controlled trials with similar interventions in similar conditions [940, 1030]. Why are the results so different?
15 ADAPTIVE DESIGN FEATURES
15.1 INTRODUCTION

Adaptive design (AD) refers to a varied set of techniques to modify a trial during its conduct based on information from accumulating data. For a recent summary of this topic, see Coffey and Kairalla [276]. AD modifications are put forward as part of the initial design in exactly the same way as fixed design features. All modifications require justification and implementation criteria in the study protocol. Adaptive modifications are not substitutes for proper planning. They help compensate for design assumptions that prove inaccurate. Adaptations are usually points where refined assumptions, actual data, or both are brought to bear to shorten a trial, improve its properties for participants, or achieve objectives. Nonadaptive or fixed designs are driven by information available prior to the acquisition of data, whereas AD is driven additionally by findings from interim data. So AD might more properly be called data-driven design. An important exception to this distinction is the now routine nature of data-dependent stopping discussed in Chapter 18. This feature is ubiquitous in comparative trials and many middle developmental studies. It is formally adaptive in that the design calls for study termination or continuation depending on findings in the data. Data dependent stopping also carries another important criterion for AD that might as well be definitional: adaptive changes should not undermine the validity of the trial. They are typically proposed to enhance the efficiency, ethics, or validity of the study. In 2006, the PhRMA Working Group on Adaptive Design operationally defined AD as
. . . [a design] that uses accumulating data to modify aspects of the study as it continues, without undermining the validity and integrity of the trial. . . . changes are made by design, and not on an ad hoc basis . . . not a remedy for inadequate planning.
An FDA Draft Guidance Document in 2010 defined AD as . . . a study that includes a prospectively planned opportunity for modification of one or more specified aspects of the study design and hypotheses based on analysis of data (usually interim data) from subjects in the study. . . . The term prospective here means that the adaptation was planned (and details specified) before data were examined in an unblinded manner . . . . This can include plans that are introduced or made final after the study has started if the blinded state of the personnel involved is unequivocally maintained when the modification plan is proposed [499].
Several signals based on the data might trigger adaptive choices. They include the current treatment effect and its significance level, a predicted significance level at the end of the trial, various posterior probabilities, aggregate covariate, prognostic marker, or biomarker composition of the treatment groups, safety signals, or data quality. The trial factors that might be changed include drug dose, sample size or termination date, eligibility criteria, opening, closing, or combining treatment arms, probability of assignment to a particular treatment, study procedures, or clinic performance. Some possible adaptations are contentious, such as changing drugs or other essential characteristics of the treatment, primary outcome, statistical parameters like the type 1 error, or the unit of randomization or analysis. If a trial ends with a null result, we might look back and wish that certain aspects of the design had been different. Those features might have been candidates for formal adaptive options. However, it may be more difficult to anticipate regrettable circumstances in a new trial. If we can do this for a new trial and validate the design for its intended purposes, we may have a useful AD. Adaptive design methods will be most useful when (i) outcomes or valid biomarkers are available rapidly in the overall timeframe of the trial, (ii) the disease has substantial morbidity or risk, (iii) uncertainty regarding treatments is high, and (iv) adaptations are logistically practical and acceptable to all trial stakeholders. These circumstances are frequently appropriate.
15.1.1 Advantages and Disadvantages of AD
The right adaptation can confer some advantages on a trial compared to a nonadaptive strategy. For example, we might be able to assign more participants to an evidently superior treatment. A trial might be made more flexible in that endpoints could be modified or doses may be optimized. In some cases, resources can be conserved if we stop early because of risk, futility, or efficacy. It is possible that an adaptive feature will result in more subjects being treated successfully. One should not accept methods put forward as adaptive without a full accounting of strengths and weaknesses. For example, the simple notion that more trial participants can be placed on the better performing treatment by response adaptive randomization
requires consideration beyond the intuitive appeal. The method may overbalance in the wrong direction, and can yield an overestimate of the actual treatment difference [1469]. The appeal to improved ethics of such designs has also been challenged [714]. It should also be noted that typical group sequential methods for trial monitoring may be superior to adaptive designs in many circumstances [1487]. The costs of potential advantages are also not trivial. Responses on which to base adaptations need to be observed relatively promptly following treatment. Adaptations could create a very heterogeneous cohort. Investigators who are not familiar with this methodology may be skeptical of results from such trials. The complexities of a trial could slow recruitment or make consent documents difficult to understand. Some adaptations can yield biased estimates, inflate the type I error, or reduce power. Finally, the logistics of an adaptive trial can complicate implementation at multiple centers.

Trial features can and should be changed when appropriate regardless of the specific design philosophy. Such changes are often recommended by a Treatment Effects Monitoring Committee, Institutional Review Board, independent medical monitor, investigators, or sponsor as part of the due diligence in managing a clinical trial. External information that requires a design change may become available during a trial. It is usually impossible to anticipate this sort of surprise and define alternatives in an adaptive design. Also, a trial may mix nonadaptive or “fixed” design features with AD. For example, in studies of eye diseases, a standard method of treatment assignment might be made to one eye, with AD (such as play-the-winner) used to assign therapy to the second eye.

The discussion to this point has emphasized adaptations based on aggregate features or derivatives of the trial data. We might also consider planned choices based on individual subject characteristics. Pathways through a therapeutic or diagnostic decision algorithm, for example, can depend on disease biomarkers, individual interim outcomes (e.g., response to therapy), or side effects of treatment. The trial can be designed to provide “adaptive” alternatives for each subject depending on such findings. From the viewpoint of the study subject or investigator who is uncertain of the future pathway, the trial seems to adapt to the clinical circumstance of the subject with appropriate treatment plans and information capture. From the viewpoint of the methodologist who has anticipated the measurements required and branch points relevant to the clinical questions, the trial is anticipatory but not adaptive by the definition above. The simplest illustration of this is eligibility criteria, which represent the first and most definitive subject branch-point for every clinical trial. A more consequential example is a cohort enrichment strategy (Sections 15.3 and 13.8), which could be either eligibility-like or depend on an intermediate outcome. More complex still is when therapeutic alternatives for a study subject are driven by interim outcomes. There may be less danger of damaging the inferences from a clinical trial by designing appropriate subject-level choices compared to group- or trial-level decisions. The primary goal of any subject filter is to form homogeneous cohorts, and therapeutic inferences could therefore be conditional on the composition of those groups. This is a familiar problem.
In comparison, the result of structural changes in the trial such as stopping early, dropping a group, or changing the sample size, might alter important operating characteristics of the design in an unhappy way. This is less familiar and more consequential. Most of the discussion in this chapter will be on this structural notion of adaptation.
15.1.2 Design Adaptations Are Tools, Not a Class
Some current usage of the term adaptive design suggests that there is a distinct class of trials employing such methods. There is also sometimes an implied value judgment attached suggesting that AD repairs some of the frustrating features of clinical trials that exist. Despite such usage of the term, it is probably best to refer to adaptive design features, which as stated above, many trials employ. At scientific meetings, you sometimes hear statements like “this is a difficult problem but perhaps we can solve it using adaptive designs.” This can have the effect of off-loading the difficult part of a problem onto the trial design. A trialist would want strong evidence that any design or class of designs can solve a problem that seems difficult for conventional designs. In some cases, problems that trouble conventional designs can be ameliorated by AD. But sometimes a thorny problem is just a thorny problem.

Adaptive design methods have already proven very useful if taken in the broad context as will be discussed below. Some example adaptive methods are listed in Table 15.1. And there is no doubt that we will continue to find new applications of adaptive features and designs, in some cases replacing traditional methods. But adaptive methods do not address some significant problems in clinical trials such as complexity, cost, ethics, acceptance, observer bias, and logistics. AD is a useful tool but not a fix-all, as it can worsen some of these problems.

A common claim is that AD reduces sample size or otherwise increases study efficiency. This is clearly the case for certain design adaptations. However, there are other potential consequences such as bias and/or inflation of error rates when using such designs, so it is prudent not to overstate benefits. With regard to sample size per se, consider interim analysis and possible early stopping for either futility or efficacy in a comparative trial. There are two domains of interest—the theoretical design properties of the trial, and a real-world implementation (usually only one). Under the null hypothesis, the study design might assure stopping at a minimal sample size compared to a fixed sample alternative. Similarly, under a set of reasonable alternative hypotheses, the stopping criteria might also assure minimal expenditure of sample size. In the real world, the occasional early stopping of a large clinical trial (for either futility or efficacy) has to represent a savings of resources to the research community. This Gedanken experiment would seem to support the notion that adaptation must increase efficiency. But some caution is needed. For example, a trial terminated early for efficacy yields a biased overestimate of the true treatment effect. This was discussed in Section 6.3.1.

TABLE 15.1 Examples of Possible Adaptations to Address Specific Study Requirements

Study Requirement                               Adaptation
Dose optimization                               Outcomes modify dose (titration)
Group size balance                              Treatment adaptive randomization
Prognostic factor balance                       Covariate adaptive randomization
Favorable allocation                            Response adaptive randomization
End as early as feasible                        Interim analyses
Guarantee minimum number of events              Extend calendar time
Drop poor performing treatments                 Interim analyses
Compensate for incorrect design assumptions     Re-estimate sample size
Cohort homogeneity                              Modified eligibility or enrichment
This can be clinically unimportant, but illustrates the potential apples-to-oranges nature of the efficiency comparison. What scientific objective has been “more efficiently” reached, and is it the intended one? When adaptive design features support cohort enrichment (Sections 15.3 and 13.8), efficiency depends on the relative cost of eligibility screening compared to the cost of accruals—the balance is not predetermined. Finally, the questions addressed or study properties achieved can be subtly altered by some adaptive features compared to those in a similar design without the modifications. However, the desired flexibility may lead to higher error rates or logistical difficulty. Know all the properties of your design and those of closely related designs.
15.1.3 Perspective on Bayesian Methods
Frequentist, likelihood, and Bayesian methods can all be adaptive. AD seems to be commonly paired with Bayesian techniques in the minds of many investigators. Before the experiment begins, the Bayesian represents knowledge, or more properly ignorance, of the unknown but required parameters in the form of a prior probability distribution. My prior probability distribution and yours may not be the same. Also, a strong prior probability distribution, whether skeptical or optimistic, will influence the final inference. Regardless of adaptations, these are sometimes problem points for Bayesian methods, especially the feature of “subjective probability.” It’s also important to recognize, as I have stated elsewhere in this book, that we can always exchange assumptions for data, which is not to say that we should. Assumptions can take the form of pseudodata (e.g., CRM, Section 12.5.2), models (e.g., covariate adjustment, Section 21.3), imputations (e.g., missing data, Section 19.5.2), or probability distributions (Bayesian priors). The typical result of using assumptions in place of actual data is greater efficiency and ease. We want to know if the putative efficiency of “Bayesian AD” is due to adaptive features intrinsically or to the leveraging effect of assumptions. Either is possible. Bayesian AD has been proposed to perform feats such as altering the stopping point of a trial, dose finding and dose dropping, seamlessly changing “phases,” adaptive randomization, population finding, and ramping up accrual. Non-Bayesian methods can address some of these objectives as well. It should be noted that a Bayesian approach is compulsory for many adaptive features, and some researchers still consider Bayesian procedures to be nonstandard. In the following sections, I discuss some AD elements. The flexibility of these tools is the essential lesson, not their Bayesian or Frequentist connections. The discussion covers two-group trials. Extension to more than two groups is straightforward.
15.1.4 The Pipeline Is the Main Adaptive Tool
The primary adaptive mechanism in therapeutic development is the pipeline itself rather than the individual trial. The stereotypical phases of therapeutic development are attractive, precisely because they represent landmarks where the entire design, purposes, and outcomes for a trial need to adapt depending on discoveries to that point. Traditional phased development is often applied too rigidly, as discussed in Chapter 10, and the inflection points in the pipeline should be more flexible, as should the actual study designs employed.
A great deal of attention is usually given to the optimization of any individual clinical trial. This is appropriate if we recognize the individual trial as a building block. Less attention is given to the overall pipeline, but its properties are much more important.
15.2 SOME FAMILIAR ADAPTATIONS
A sensible classification of adaptive design features as they have been proposed and implemented in the literature is group sequential designs, adaptive dose-finding, adaptive randomization, biomarker adaptive designs, sample size re-estimation, and adaptive seamless designs (Table 15.1). Most of these will be described below. Group sequential designs are a mature class and are discussed in Chapter 18. Similarly, adaptive dose-finding is fairly well evolved and discussed only briefly here.
15.2.1 Dose-Finding Is Adaptive
Most dose-finding and dose-ranging trials are conducted as formally adaptive designs. Examples of this are estimation with overdose control (EWOC) and the continual reassessment method (CRM) discussed in Chapter 12. In both EWOC and the CRM, the dose under study is changed in an adaptive fashion depending on predictions of a dose response model fit to all the available data. Using all the available data in a model increases efficiency and reduces bias in the estimate of the target dose compared to designs based only on dose-ranging. This is consistent with the remark above that models (assumptions) can be viewed as replacements for data. Even dose-ranging designs have had adaptive modifications suggested to improve their performance. The accelerated titration method is an example of this. In that method, the cohort size is initially one subject per dose. The dose is escalated until side effects are seen, at which point the cohort size is increased from one to (say) three subjects. The smaller initial cohorts allow more rapid escalation of doses and fewer subjects treated at inactive levels. This adaptive feature improves the sample size performance of the design on average, but not the tendency to underestimate the MTD.
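The following is a minimal sketch of the kind of model-based updating the CRM uses, with a one-parameter power model evaluated on a grid. The skeleton, prior, target toxicity rate, and data are hypothetical, and the operational CRM described in Chapter 12 includes additional structure and safeguards.

```python
# Minimal sketch of a CRM-style adaptive dose update (hypothetical skeleton,
# target, and data; an illustration only, not a complete dose-finding design).
import numpy as np

skeleton = np.array([0.05, 0.10, 0.20, 0.35, 0.50])   # prior DLT guesses per dose
target = 0.25                                           # target toxicity probability
a_grid = np.linspace(0.1, 3.0, 300)                     # grid for the model parameter
prior = np.exp(-a_grid)                                 # simple prior weight on the grid
prior /= prior.sum()

def next_dose(doses_given, toxicities):
    """Posterior-mean toxicity at each dose under p_i(a) = skeleton_i ** a,
    then recommend the dose whose estimate is closest to the target."""
    post = prior.copy()
    for d, y in zip(doses_given, toxicities):
        p = skeleton[d] ** a_grid                       # toxicity prob at dose d for each a
        post *= p if y else (1 - p)
    post /= post.sum()
    p_hat = np.array([(skeleton[d] ** a_grid * post).sum() for d in range(len(skeleton))])
    return int(np.argmin(np.abs(p_hat - target))), p_hat

# After three subjects at dose level 2 (0-based) with one DLT observed:
dose, p_hat = next_dose([2, 2, 2], [0, 1, 0])
print(dose, np.round(p_hat, 2))
```

The essential adaptive feature is visible here: every new outcome reshapes the model fit, and the recommended dose for the next cohort follows from the updated predictions rather than from a fixed escalation schedule.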
15.2.2 Adaptive Randomization
Adaptive randomization (AR) is a common technique in which the probability of assignment of a new subject to a particular treatment group changes according to findings in the trial. The simplest alteration in adaptive randomization is to change the assignment probability in a way that encourages tight balance in the number of subjects in the treatment groups. This is discussed in Chapter 17. Recall that simple randomization can yield imbalances in the group sizes that may make unsophisticated observers think that the trial is flawed. Two general remedies for this are constrained randomization and adaptive randomization. Assuming we want nearly equal sizes in each of two treatment groups, the probability of assignment to treatment A can be determined by a formula such as
$$P_A = \frac{1 + N_b}{2 + N_a + N_b}, \qquad (15.1)$$
where $N_a$ and $N_b$ are the number of current assignments to treatments A and B, respectively. This is essentially equation (17.2) modified so the first assignment has probability 1/2 of being assigned to either group.

AR can also respond to outcomes of treatments in the assignment groups—so called response adaptive randomization. Theory supporting this is discussed by Hu and Rosenberger [744]. This method preferentially assigns subjects to the group that appears to be performing better, which is occasionally proposed to ameliorate some ethics concerns. This motivation is probably overstressed. Reasonable evidence that one treatment is better than another would terminate the trial—there is no ethics imperative to respond to weaker evidence, such as by altering the allocation ratio. Moreover, the class of trials that are unethical except under response adaptive randomization is likely empty. Putting aside the imperfect ethics rationale, suppose some conventionally randomized assignments have been made, and that $\tilde{\Delta}$ is the current best estimate of the relative risk of a favorable outcome in group B compared to group A. We might assign to treatment A with probability
$$P_A = \frac{1}{\tilde{\Delta} + 1}. \qquad (15.2)$$
If $\tilde{\Delta} \gg 1$, B is doing better than A and the probability of assignment to Group A decreases. This equation drives allocation moderately—even a large treatment effect (e.g., $\tilde{\Delta} = 2$) yields an assignment probability of only 1/3. Also it may be important to keep in mind that this is a one-dimensional assessment of treatment performance, and that we are ignoring imprecision in $\tilde{\Delta}$. With regard to ethics, obligations are at the individual subject level. AR cannot alleviate ethics concerns for treatment allocation in the trial. Specifically, it cannot convert a questionable fixed randomization into an acceptable adaptive one. The ECMO Trial discussed in Section 17.4.3 is an interesting example where a deterministic form of response adaptive treatment allocation was employed. This “play the winner” method assigns all subjects to Group A until a failure is observed. Then assignments are switched to Group B with the same rule. $P_A$ switches between 0 and 1 depending on the most recent failure.

Adaptive randomization may also be driven by covariate or prognostic factors to balance the aggregate risk factor composition in the treatment groups and remove the need for a covariate adjusted analysis of treatment effect. Algorithms to accomplish this may be complex because they track and use the covariates of all subjects previously assigned. The next subject is assigned to a treatment group in a way that reduces the aggregate covariate difference between the groups. As a simple example, suppose we represent the risk for an individual subject as $\exp(\beta\mathbf{x})$, where $\beta$ is a vector of known or assumed weights common to all groups and $\mathbf{x}$ is a vector of individual covariates. The individual subject level index has been suppressed for simplicity. This construct is intentionally related to relative risk regression models. The aggregate risk in the treatment groups could be represented as
$$R_A = \sum_A \exp(\beta\mathbf{x})$$
and
$$R_B = \sum_B \exp(\beta\mathbf{x}),$$
where the sum is over the members of the appropriate treatment group only. To make a new treatment assignment, we calculate $|R_A^* - R_B|$ and $|R_A - R_B^*|$, where the asterisk indicates assignment of the current subject to that group. The subject is actually assigned to the group that yields the smaller difference. This method or something similar will balance the overall risk in the treatment groups assuming that the representation of aggregate risk is appropriate. It will not reliably balance individual covariates marginally as, for example, blocked strata would. Correspondingly, balancing individual covariates marginally will not reliably balance overall risk. One could argue that this is not adaptive randomization because it is not randomization at all. Dynamic balancing might be a better description. However, as a general rule, studies using any method of constrained randomization or balancing are analyzed as though they are completely randomized.

The benefits of adaptive randomization are intuitively clear. An important secondary goal of the trial might be realized by this technique. For example, more subjects could receive the better performing treatment. Or precision (and sometimes credibility) can be augmented by tighter balance in group sizes. The need for multivariate adjustment could be eliminated. The drawbacks are minimal, so these techniques should be considered whenever they seem appropriate.

AR in Factorial Designs

If we are interested in two or more treatments and their interaction, AR could be used to randomize differentially to the treatment groups if it suits all the goals of the trial. Factorial trials are the only way to study interactions (Chapter 22), and it is important to observe that they are relatively inefficient for estimating interaction effects compared to main effects when interactions are absent. Generally, four times as many subjects are needed to estimate an interaction effect with the same precision as a main effect. Unbalanced allocation, whether by original design or as a consequence of AR, will reduce the precision of the estimated interaction effect. As a simple example, we can consult equation (22.3), which is the formula for $\beta_{AB}$, the $AB$ interaction effect in a $2 \times 2$ factorial trial. The variance of the interaction effect is
$$\mathrm{var}(\beta_{AB}) = \frac{\sigma^2}{n_A} + \frac{\sigma^2}{n_0} + \frac{\sigma^2}{n_{AB}} + \frac{\sigma^2}{n_B},$$
where $\sigma^2$ is the person to person variance, $n_A$ is the group size in Treatment $A$, and so on. The notation explicitly indicates an unbalanced design. To incorporate the constraint that the fixed total sample size is $T = n_0 + n_A + n_B + n_{AB}$, we have
$$\mathrm{var}(\beta_{AB}) = \sigma^2 \left( \frac{1}{n_A} + \frac{1}{n_0} + \frac{1}{n_{AB}} + \frac{1}{T - n_A - n_0 - n_{AB}} \right),$$
where $n_B$ has been eliminated from the equation. The minimum variance for the interaction effect can be found by differentiating with respect to the individual group sample
sizes, equating to zero, and solving, which yields equations of identical form
$$n_A = T - n_0 - n_A - n_{AB},$$
$$n_0 = T - n_0 - n_A - n_{AB},$$
$$n_{AB} = T - n_0 - n_A - n_{AB}.$$
Therefore, $n_A = n_0 = n_{AB} = n_B$ is the criterion for the minimum variance for $\beta_{AB}$. This demonstrates that AR in a factorial design is likely to be somewhat inefficient for estimating interactions because it will lead to unbalanced groups. Of course, unequal allocation may be appropriate in other circumstances as discussed in Sections 14.4.5 and 17.6. This represents the same lesson as for unbalanced allocation in two-group comparative trials, where we see significant power loss when the allocation ratio exceeds about 2:1. This is discussed in Chapter 16.

AR in a Selection Design

In a selection design, several treatments are studied in parallel with randomized assignment (to remove bias) between the treatments. The general intent of such designs is to select the best performing treatment according to a single outcome measure. Differences between treatments are not measured precisely, which is why such designs typically require smaller sample sizes than conventional trials. In other words, ordering treatments is an easier task than measuring their differences. AR can be used to assign more subjects to a treatment that appears to be performing better. Provided the overall design is reliable for correct ordering, this provides more experience with the best treatment, increases precision for detecting safety signals, and may be ethically more comfortable. It might be appropriate to have a minimum number of subjects on each therapy before initiating AR. Generally speaking, this is a developmental design and standard therapy should not be one of the arms of a selection design. I say this because direct comparisons to a control are usually predicated on yielding a clinically significant improvement. A selection design will choose a winner regardless of the magnitude of the difference.

Example

An example of an adaptively randomized trial was the drug combination study in adults with adverse karyotype acute myeloid leukemia [594]. In that trial, three induction regimens of idarubicin and ara-C (IA), troxacitabine and ara-C (TA), and troxacitabine and idarubicin (TI) were compared for complete remission (CR) rates. Following an initial equal randomization, the treatment arms with higher CR rates in the first 49 days received a higher proportion of new subjects. The study could have enrolled a maximum of 75 subjects. The probability of assignment to IA was fixed at 1/3 provided all three treatment arms remained open. Using the following definitions,
$$p_1 = \Pr\{TA \gg IA \mid \text{interim data}\},$$
$$p_2 = \Pr\{TI \gg IA \mid \text{interim data}\},$$
$$p_3 = \Pr\{TA \gg TI \mid \text{interim data}\},$$
where the symbol "≫" indicates a superior remission rate, the rules for stopping assignments or closing arms were:

If $p_1 > 0.85$ or $p_2 > 0.85$, drop IA,
If $p_1 < 0.15$ or $p_3 < 0.15$, drop TA,
If $p_2 < 0.15$ or $p_3 > 0.85$, drop TI.

A closed arm could be reopened if evolving information during 49 days altered the estimates above. A total of 34 participants entered the trial. TI was closed after 5 subjects and TA was closed after 11 subjects. Accounting for events beyond 49 days, the CR rates were 10/18 in IA, 5/11 in TA, and 1/5 in TI. There was a 70% chance that TA was inferior to IA, and only a 5% likelihood that the CR rate on TA would exceed that on IA by 20%. No survival differences were observed. This trial illustrates the ability to make choices and save sample size using an adaptive rule. Eighteen participants received the apparently superior IA induction under adaptive rules, whereas only 11 would have under fixed allocation. Even so, imperfections of the method are evident when one examines the operating characteristics, that is, the behavior of the design under assumed truths of nature, as the authors did. For example, if the true CR probabilities were 0.30 on all treatments, the probability of selecting each as being superior would be 0.1 for IA, 0.45 for TA, and 0.45 for TI, with an average of 52 study participants. For the case where the true CR probabilities were 0.5, 0.3, and 0.5 for IA, TA, and TI, respectively (i.e., the TA arm was clearly inferior), the chance of selecting TA was 15% using an average of 45 participants. This illustrates as a general rule the higher type I errors associated with this design. As for any design, one has to assess if the error properties are appropriate for the clinical purposes.
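A minimal simulation sketch of response adaptive assignment in the spirit of equation (15.2) is shown below. It is not the algorithm used in the leukemia trial above; the two-arm structure, response probabilities, burn-in, and sample size are arbitrary illustrations, but it shows how allocation drifts moderately toward the better performing arm and why operating characteristics should be checked by simulation.

```python
# Illustrative simulation of response adaptive randomization for a two-arm
# binary-outcome trial, using an assignment probability like equation (15.2).
# All parameters are hypothetical.
import random

def simulate_trial(p_a=0.30, p_b=0.50, n=75, burn_in=20, seed=1):
    """After a burn-in of alternating assignment, assign to A with probability
    1 / (delta_hat + 1), where delta_hat is the observed relative response
    rate of B versus A."""
    random.seed(seed)
    responses = {"A": 0, "B": 0}
    assigned = {"A": 0, "B": 0}
    for i in range(n):
        if i < burn_in or assigned["A"] == 0 or assigned["B"] == 0:
            arm = "A" if i % 2 == 0 else "B"
        else:
            rate_a = max(responses["A"] / assigned["A"], 1e-6)
            rate_b = responses["B"] / assigned["B"]
            delta_hat = rate_b / rate_a
            arm = "A" if random.random() < 1 / (delta_hat + 1) else "B"
        p = p_a if arm == "A" else p_b
        responses[arm] += random.random() < p
        assigned[arm] += 1
    return assigned, responses

print(simulate_trial())   # allocation drifts moderately toward the better arm B
```

Repeating such a simulation under several assumed truths of nature, including the null case of equal response rates, is the kind of operating-characteristic check emphasized in the example above.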
15.2.3 Staging is Adaptive
Staging may be the most common adaptive feature in clinical trials. In Section 10.2.2, I discussed the efficiencies conferred on the developmental pipeline by staging. This design feature is so familiar that it is often overlooked as an adaptive part of individual trials. A decision to continue or stop is required at the end of each stage of a trial and the typical staged design explicitly uses the study data to inform the decision. Fully sequential trials (data are evaluated after every subject) are an extreme form of staging. The sequential probability ratio test (SPRT) discussed in Chapter 18 is a prototypical example. Group sequential methods discussed in the same chapter represent a more common adaptive feature because they are logistically easier to execute. Staging happens in two subtly different ways in designs—implicitly or explicitly. A fixed sample size trial may call for interim analyses at specific points of information time. The trial could be terminated on the basis of an interim analysis before the fixed sample size has been accrued. This is a typical method in RCTs but usually not described as being staged. The interim analysis is conducted from a snapshot of the data, cleaned up and codified, while the study continues to run in the background, so to speak. The staging is implicit.
An explicitly staged design (usually two or three stages) typically has a formal halting of accrual at the end of each stage while the specified analysis is done and the decision rules are tested. In some circumstances, it may be sensible to deviate from the explicit stage sizes [1588]. These designs are common in middle development and may or may not have a data monitoring committee. Whether implicitly or explicitly staged, such trials meet all the definitional requirements for adaptive designs discussed above.
15.2.4 Dropping a Treatment Arm or Subset
A frequent point of discussion for adaptation is discontinuing accrual in a well-defined subset of the population, or dropping a treatment entirely from a multi-arm trial. These ideas are related in that subjects in a specific subset may be as well delineated as those assigned to a particular treatment group. In practice of course, the multifactorial definition of a risk group may not be as clean as a randomized group assignment. But categorizations based on biomarkers or genomic findings, for example, may be unambiguous. I will assume that the subset definitions and the hypotheses supporting discontinuation are specified in the trial design. The question of stopping the trial in the relevant subset only then becomes similar to more familiar interim analysis problems. Unacceptable events or safety signals in absolute terms, for example, major adverse events when none should occur, could prompt stopping. An interim analysis could demonstrate that the subset is relatively inferior to other treatments or the remaining trial cohort, supporting a stop decision. A combination of safety and efficacy outcomes will inform decisions. Such actions are not controversial when properly planned. The logistics and properties of the trial are likely to be complex, and some quantitative study of its operating characteristics will be necessary to assure that the decision rules are appropriate under reasonable hypothetical truths of nature.

Accrual of a high risk subset of subjects was terminated in the NETT, discussed in Section 4.6.6, based on a planned DSMB review. The methods employed are discussed in Lee et al. [914]. Some controversy accompanied this case, mostly surrounding whether or not the subset should have been included in the randomization initially. The journal report on the recruitment suspension [1093] was also not read carefully by some critics, adding to the controversy. In any case it is a good illustration of what adaptations can be implemented in seemingly traditional designs.

Discontinuation of an entire treatment in a multiarm trial can be accomplished in a similar way. This is essentially the goal of a selection design, or perhaps more accurately an interim selection analysis. The decision could be based on the mean treatment differences or merely on the ranking of mean outcomes. The effect of this type of decision rule is similar to the long-term result of adaptive randomization favoring the best performing treatment. The effect of these adaptations on error rates may not be critical in middle development, but may be quite important in late development or in trials with regulatory implications.
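A caricature of a ranking-based interim dropping rule is sketched below. The margin and minimum per-arm accrual are hypothetical, and this is not the method used in the NETT or any other trial discussed here; it simply illustrates how a prespecified rule might operate on interim data.

```python
# Illustrative interim rule (hypothetical thresholds): drop any arm whose
# observed response rate falls a prespecified margin below the best arm,
# once a minimum number of subjects per arm has been observed.
def interim_drop(counts, margin=0.20, min_per_arm=20):
    """counts: dict arm -> (responses, n).  Returns the arms to drop."""
    rates = {arm: r / n for arm, (r, n) in counts.items() if n >= min_per_arm}
    if not rates:
        return []
    best = max(rates.values())
    return [arm for arm, rate in rates.items() if best - rate >= margin]

interim = {"A": (12, 40), "B": (22, 40), "C": (9, 40)}
print(interim_drop(interim))   # ['A', 'C'] under these illustrative data
```

As emphasized above, the error properties of any such rule should be studied quantitatively under plausible truths of nature before it is written into a protocol.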
15.3 BIOMARKER ADAPTIVE TRIALS
In drug development, biomarkers are indicators of therapeutic effect or response derived from laboratory, physiological, or other assessments of a subject. More broadly,
biomarkers may measure diagnostic, prognostic or physiologic status outside the context of drug response. Biomarkers were also discussed in Section 14.3. The term is widely used and misused, but has been formally defined [654] as a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacological responses to a therapeutic intervention.
Biomarkers are probably more useful for safety assessments than for evidence of therapeutic benefit. The concept is related to surrogate clinical outcomes discussed in Section 5.4. A biomarker adaptive trial uses biomarker measurements to guide adjustments to the study. Biomarkers might be used in an interim analysis, to select a particular population, or enrich the study cohort with respect to a necessary characteristic, thereby changing the composition of the study cohort as discussed in Section 8.5.1. The final analysis is based on a definitive outcome however. As a practical distinction, biomarkers are sometimes classified as either predictive or prognostic. A predictive biomarker is informative with regard to how an individual will respond to a therapy or class of therapies. A prognostic biomarker is informative about the long term outcome or disease course for an individual, perhaps independent of treatment.

Consider how a biomarker might be used to enrich a study cohort with subjects who are likely to benefit from the therapy being studied, perhaps as a consequence of genotype. This is discussed in Chapter 13, Section 13.8. For simplicity, assume that the population, and therefore the study cohort, is a mixture of individuals who benefit from therapy with effect size θ and those who do not benefit. If the frequency of the desired genotype is p, the average effect size in our sample is only θp. Compared to the desired homogeneous cohort, the sample in our heterogeneous cohort has to be increased by a factor of 1/p² to yield the same statistical properties (assuming all members of the cohort have the same variability in the outcome measurement). A moderately efficient biomarker that enriches the study cohort can reduce the required sample size substantially. Doubling the fraction of subjects with the correct genotype, for example, could reduce the required sample size by a factor of 4. It is important to observe that we must still screen a large number of subjects to produce the smaller enriched cohort. Following this reasoning, enriching the study cohort may not always be the best option, depending on the cost of doing so. If half the population has the desired genotype and enrichment with an imperfect biomarker increases the frequency to 0.75, the required sample size is reduced by just over 50%. The cost of the smaller study plus testing may not be advantageous over the simpler heterogeneous cohort.

A biomarker could also guide decisions during a trial. Suppose we are testing two treatments for complications of HIV infection against a background of standard therapy. We randomize study subjects between treatments A and B, and measure viral load as a biomarker. If viral load increases, suggesting the infection is progressing, we could face several options. If the progression is restricted to a single treatment group, that arm might be stopped. If progression occurs in both groups, individuals who progress might be re-randomized to a second therapy. The biomarker mediates structural choices in the study. More generally, a biomarker can help us partition a cohort into subsets linked to different therapeutic questions.
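The dilution penalty described above is easy to tabulate. The sketch below computes the 1/p² inflation factor and the enrichment example from the text; screening and assay costs, which drive the final decision, are not modeled.

```python
# Sketch of the 1/p**2 dilution penalty described above.  If only a fraction
# p of the cohort can benefit (effect theta in that subset, zero elsewhere),
# the diluted average effect is theta*p and the required sample size grows
# by 1/p**2 relative to a fully enriched cohort.
def inflation(p):
    return 1 / p**2

for p in (1.00, 0.75, 0.50, 0.25):
    print(f"benefiting fraction {p:.2f}: sample size multiplier {inflation(p):.1f}")

# Example from the text: raising the benefiting fraction from 0.50 to 0.75
# cuts the required sample size by 1 - (0.50/0.75)**2, about 56 percent,
# i.e., "just over 50%"; whether that pays off depends on screening cost.
print(round(1 - (0.50 / 0.75) ** 2, 2))
```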
An often-cited example is the biomarker-integrated approaches of targeted therapy for lung cancer elimination (BATTLE) trial [846]. In this middle development trial, 255 subjects with refractory lung cancer were randomized to four different treatment strategies. AR was driven by molecular biomarkers analyzed in fresh tissue specimens. This was an early biomarker adaptive design that demonstrated efficacy of some treatments when specific biomarkers were present. From a cost and logistics perspective, however, the BATTLE trial was problematic. Outside of oncology, biomarker-based research can be even more challenging. In heterogeneous diseases, identifying relevant biomarkers may be difficult. Systemic lupus erythematosus is a disease that fits this criterion, for example. Even if a candidate biomarker emerges, it may require extensive validation for the intended purpose. For instance, does it actually identify a relevant subgroup with a higher likelihood of treatment responsiveness?
15.4 RE-DESIGNS
Re-design is, as of this writing, a neologism. But I see no other way to describe the set of AD possibilities that are derived from revisiting essential logistics-based design parameters of a trial such as accrual rate, sample size, duration, outcomes, and the like. Adjustment of these due to practical circumstances or faulty assumptions may propagate changes in more theoretical design elements such as error rates. Therefore, re-designing the trial in light of experience seems like the correct description of what is often done. In any case, such a re-design will not likely result in radical changes as compared to adjustments to keep the trial from failing. Some precautions are in order. Re-design is a very broad and potentially permissive idea that requires constraints to keep it from damaging the inferences from a trial. Properly, the outcome of the study derives from the design—we do not want the study design to be a consequence of the outcome as might be the case if we were to use interim results to continually re-design the trial. An example based on sample size re-estimation is discussed below.
15.4.1 Sample Size Re-Estimation Requires Caution
Sample size re-estimation is a method where the accrual goal of a trial is changed during the study based on interim findings. A recent discussion of this approach is given by Mehta [1024]. An older reference is Herson [710]. In a comparative trial, this sometimes implies that the target sample size is increased beyond that originally specified because the observed treatment effect is smaller than expected. The intent is to yield a statistically significant result at the end of the trial. This simple description probably already indicates some ways in which that kind of sample size estimation can be a bad idea. For example, if the interim treatment effect is clinically unimportant, we might not want to increase the size, expense, and duration of our trial to create statistical significance in the absence of clinical utility. We would never design a fixed sample size trial to be so large as to detect clinically insignificant effects. Similarly, we would not want any AD to yield the same consequence.
However, there may be other circumstances where re-estimation of the sample size or a related goal of a trial is appropriate and necessary. The design of a trial relies on quantitative estimates or assumptions about key parameters such as accrual rate, subject-to-subject variability, adherence, losses to follow-up, and others that are not directly related to the effect of interest. These are often called nuisance parameters. The experience of a portion of the trial may prove that some critical assumptions were too optimistic. The sensible thing to do is to “re-design” the trial in light of the more accurate design parameters to determine if the study remains feasible with the resources at hand. Some adjustments might be made to preserve the ability of the trial to answer the scientific question without undermining the clinical or statistical foundations. Narrowly speaking, this might not be sample size re-estimation, but the principles are the same, and it is done routinely. The term “internal pilot” has sometimes been applied to designs where sample size re-estimation is based only on nuisance parameters.

A noncontroversial example will help illustrate. Trials with survival or event times as the primary outcome depend on the total number of events observed to yield the planned precision (power). The accrual dynamics of such studies cause them to be “back-loaded” with events, meaning that events accumulate slowly at first and then more rapidly as the study cohort gains person-time of follow-up. If either accrual is slower than anticipated or the event rates are lower than planned, the study will not yield sufficient events to meet the precision requirements in the calendar time allotted. A re-design of the trial using the actual accrual and event rates may suggest several options. If the loss of efficiency is mild, we might accept it, particularly if the original design was robust. Events can be increased by adding clinics or lengthening the duration of recruitment. Alternatively, we could increase the number of events observed by increasing the duration of follow-up. This last strategy might work even if accrual was terminated according to the original design of the study.

Another circumstance relating to sample size re-estimation is if the subject-to-subject variability is found to be larger than anticipated. Equation (16.22) demonstrates how sample size can be directly proportional to this variability. To preserve power, the sample size needs to be increased directly in proportion to the larger observed variance. I assume here that the projected treatment difference remains appropriate. Re-estimating the required sample size in light of a more accurate variance estimate seems sensible, and will not undermine the error properties of the trial. It could be explicitly planned as an adaptive strategy from the outset.

Another goal of sample size re-estimation might be to end our trial early if the results are convincing or if the study proves futile because of a weak treatment difference. I will not discuss using AD in this way because I believe such circumstances are more appropriately handled by interim analyses and data dependent stopping guidelines as discussed in Chapter 18.

As an example, sample size re-estimation was attempted in a clinical trial for the secondary prevention of subcortical strokes [1006]. The study was a randomized comparison of aspirin alone versus aspirin plus clopidogrel. Originally, the trial planned 417 subjects to detect a 25% relative risk reduction.
Early data reviewed by the DSMB suggested lower than planned annual stroke recurrence rates. Simulation studies calculated the effect of sample size re-estimation on power and type I error and determined that the type I error could be controlled despite increasing the sample size. Although unplanned, this adaptation appeared justified.
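To illustrate the kind of simulation such a review depends on, the following is a minimal sketch of blinded, nuisance-parameter-based sample size re-estimation for a two-arm comparison of means. It is not the method used in the trial above; the planned and true standard deviations, effect size, and error rates are all hypothetical, and the normal-theory final test is a simplification.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    alpha, power, delta = 0.05, 0.90, 0.5          # hypothetical design values
    sigma_planned, sigma_true = 1.0, 1.4           # assumed vs. actual subject-to-subject SD
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)

    def per_group_n(sigma):
        # two-group normal-theory sample size: n = 2 (Z_a + Z_b)^2 sigma^2 / delta^2
        return int(np.ceil(2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2))

    n_plan = per_group_n(sigma_planned)

    def one_trial(true_diff):
        n1 = n_plan // 2                            # interim look at half the planned accrual
        a = rng.normal(0.0, sigma_true, n1)
        b = rng.normal(true_diff, sigma_true, n1)
        s = np.std(np.concatenate([a, b]), ddof=1)  # blinded (pooled) SD estimate
        n_final = max(n_plan, per_group_n(s))       # re-estimated size, never below the plan
        a = np.concatenate([a, rng.normal(0.0, sigma_true, n_final - n1)])
        b = np.concatenate([b, rng.normal(true_diff, sigma_true, n_final - n1)])
        se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
        return abs(b.mean() - a.mean()) / se > z_a  # naive final z-test

    reps = 5000
    print("planned n per group:", n_plan)
    print("type I error:", np.mean([one_trial(0.0) for _ in range(reps)]))
    print("power:", np.mean([one_trial(delta) for _ in range(reps)]))

Running such a simulation under the null and alternative hypotheses shows whether the type I error remains near its nominal level and how much power is recovered by the re-estimated sample size.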
15.5 SEAMLESS DESIGNS
A seamless design intentionally combines objectives that would normally be addressed in separate, serial trials into a single study [543, 1430, 1431]. Reductionism usually encourages separate trials that each address a narrow range of objectives or a single objective, but there are efficiencies to an overarching protocol or design. A proper seamless design can offer those efficiencies while preserving scientific objectives. A seamless AD uses information from before and after the "seam" in the final analysis and may use interim data to decide if the second portion of the trial should even be done. An example is the so-called phase I/II trial in oncology drug development, where subjects in the dose-finding portion who receive the ultimately recommended dose are combined with those in the phase II component to assess efficacy and safety. Such designs are frequently proposed for efficiency (in terms of both subjects enrolled and calendar time). Most institutional review boards (IRBs) will expect a full accounting of safety from the dose-finding portion of the study before allowing the safety/activity component of the trial to proceed. This is appropriate and may reduce the speed and efficiency of the design.

In recent years there has been a push toward seamless dose-finding designs with very large expansion cohorts. This means that the sponsor or investigators attempt to generate sufficient safety and efficacy data within the framework of a dose-finding trial to satisfy regulatory approval. This approach worked with the cancer immunotherapy drug pembrolizumab, and others have attempted to follow. However, the circumstances have to be exceptional with regard to need, promise, and apparent efficacy for such designs to make sense [1239]. Significant problems with this approach include lack of adequate design for outcomes and precision, absent controls, safety logistics, and monitoring [1068].

A seamless phase II/III design means that the first stage is essentially a randomized phase II trial and the second is a more typical RCT. One advantage is that the usual administrative delay between developmental phases will be short, thereby saving drug development time. A second advantage is that combining subjects from both phases requires a smaller total number. Third, the earliest accrued subjects yield longer safety follow-up. There has not been enough experience with such designs to know if they are superior to more familiar group sequential designs. A randomized comparative trial with possible early stopping for futility may be operationally very similar to a seamless design.

This seamless design strategy was used in a trial of a glucagon-like peptide-1 analog for diabetes control [575, 1424]. A number of significant issues arose while planning the trial. More upfront planning was required compared to a conventional design, including planning for infrastructure coordination. The adaptive features required rapid data acquisition, analysis, and reporting. Because there were nine treatment arms, drug supply demanded careful management across multiple sites worldwide. The statistical design suggested that the trial would identify the correct dose 88% of the time [1404], and the study concluded, apparently successfully [1405].

Seamless designs have significant limitations. One is the limited ability of the second half of a seamless design to accommodate unexpected findings from the first half. A second important limitation is the need for the endpoints in the phases to be identical or similar enough to combine.
While this does not sound very restrictive, it is worth noting that we typically employ different outcomes for serial studies in response to
Piantadosi
Date: July 27, 2017
428
Time: 4:45 pm
ADAPTIVE DESIGN FEATURES
developmental needs or questions. If, for example, middle and late development studies were planned to use the same outcome, it seems very natural to combine them into some type of seamless design. Methods to cope with this issue have been proposed [261].
15.6 BARRIERS TO THE USE OF AD
Substantive adaptive features do not lend themselves to simple analytic design methods. Most statisticians have never been taught to design or analyze an adaptive trial. Understanding the properties of a proposed AD often requires the methodologist to conduct extensive simulation studies. One type of simulation may be needed to find the operating characteristics of a given trial, such as the type I error, power, and sample size distribution. A second related type may be required to determine the sensitivity of critical study parameters to basic design assumptions. Burton et al. [203] gave a nice example of how to develop a protocol for such a simulation study. To perform these design simulations well, substantial time and effort are required. Some trial sponsors cannot afford the research staff required to accomplish this routinely.

Aside from these technical barriers, a complex workflow in AD may also stress study logistics such as budget preparation and administration, informatics support, operating protocols, central randomization, and drug supply. These logistics may amplify costs greatly in larger comparative trials compared to early developmental studies. These potentially higher costs may inhibit the use of AD. Like most other aspects of clinical trials, it is important to be guided primarily by the scientific questions at hand and to apply this methodology when appropriate. However, investigators should be aware of the barriers and make these choices wisely.

Multiple adaptations in a single trial may not be a good strategy. Inferences from a trial with interim analyses, modified dose, altered sample size, duration, or outcome, for example, could be impossible to interpret reliably. It makes sense to limit the number of adaptations in any one trial.
15.7 ADAPTIVE DESIGN CASE STUDY
As an example of AD, consider the multicenter middle development trial of high-dose coenzyme Q10 (an antioxidant and mitochondrial cofactor) in subjects with amyotrophic lateral sclerosis (ALS) [828, 931]. The first stage of this trial was a dose selection, and the second was an efficacy comparison of the selected dose to placebo. Given the usual separation between dose-finding and efficacy in developmental phases, we can view this trial as a seamless design. The outcome for both stages of the trial was the decline over 9 months in the revised ALS Functional Rating Scale (ALSFRSr). The dose selection required 35 subjects per group and used CoQ10 doses of 1800 and 2700 mg/day. The efficacy stage was designed as a futility trial and compared the selected dose (which turned out to be 2700 mg/day) to placebo in groups of 75 subjects. A selection algorithm rather than a formal hypothesis test was used to identify the best dose. Safety concerns had to be taken into account, so a simple pick-the-winner rule could not be used. Masking such a data review also required a plan for dealing with the
TABLE 15.2 Doses and Stages in the High Dose Coenzyme Q10 Middle Development Trial

Stage 1: Dosing            Stage 2: Efficacy          Total
2700 mg/day (N = 35)       2700 mg/day (N = 40)       N = 75
1800 mg/day (N = 35)       –                          –
Placebo (N = 35)           Placebo (N = 40)           N = 75
unexpected situation where the placebo group has the smallest decline. For the QALS study, the selection strategy was described in a written document agreed upon by both the DSMB and investigators. The DSMB formally approved a selection strategy prior to viewing any of the summary data by treatment group.

Without adjustment, an adaptive seamless design increases the type I error because the final test statistic does not account for selection of dose in the first stage. In this case, the investigators developed a bias correction, and the final test statistic incorporated the correction to preserve the overall type I error rate.

The QALS trial was a success. It required only 185 participants to (i) select a preferred dose, and (ii) establish that the selected dose was not promising enough to justify the cost and effort of undertaking a definitive comparative trial. This demonstrates that an adaptive design can be implemented within commonly used trial management structures.
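A small simulation makes the selection effect concrete. The sketch below is illustrative only and is not the QALS bias correction: it generates trials under a global null hypothesis, selects the apparently better of two doses at stage 1, pools the stages for the selected dose versus placebo, and tests naively at the nominal two-sided 0.05 level. The rejection rate it reports exceeds 0.05, which is the inflation a correction must remove. Group sizes follow Table 15.2; all other settings are assumptions.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    n1, n2 = 35, 40                      # stage sizes per arm, as in Table 15.2
    z_crit = norm.ppf(0.975)

    def one_trial():
        # global null: both doses and placebo share the same outcome distribution
        dose_a = rng.normal(0.0, 1.0, n1)
        dose_b = rng.normal(0.0, 1.0, n1)
        plac = rng.normal(0.0, 1.0, n1)
        best = dose_a if dose_a.mean() > dose_b.mean() else dose_b   # naive selection
        # stage 2: continue only the selected dose and placebo, then pool across stages
        dose = np.concatenate([best, rng.normal(0.0, 1.0, n2)])
        plac = np.concatenate([plac, rng.normal(0.0, 1.0, n2)])
        se = np.sqrt(dose.var(ddof=1) / len(dose) + plac.var(ddof=1) / len(plac))
        return abs(dose.mean() - plac.mean()) / se > z_crit

    reps = 20000
    print("uncorrected type I error:", np.mean([one_trial() for _ in range(reps)]))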
15.8 SUMMARY
Adaptive design is a diverse set of techniques, some of which can be applied to nearly any clinical trial. Adaptive design is a tool, not a class of designs. In common usage, adaptive tools have been frequently associated with Bayesian design, but this is not definitional. Familiar adaptations include dose-finding designs, interim monitoring, adaptive randomization, and staging. Less widely used adaptations include sample size re-estimation and integrated designs that cross developmental stages. Biomarkers and other interim assessments can often be used effectively to make adaptive decisions. The operating characteristics of adaptive designs generally require simulation to determine because of their complexity. The design selected should be studied for its behavior under various assumptions to assure investigators that error rates and other properties are controlled appropriately. Adaptive design features add flexibility and efficiency in therapeutic development. Well chosen adaptations can increase the success of development without sacrificing the scientific integrity of the trial.
15.9 QUESTIONS FOR DISCUSSION
1. Discuss ways of determining the impact of sample size re-estimation on the type I error in advance of a trial.

2. Staging of a trial is an adaptive method. How many stages are ideal? How could such a question be answered?
16 SAMPLE SIZE AND POWER
16.1 INTRODUCTION

Precision of estimation, error properties, best sample size, and optimal study duration are questions frequently asked of statisticians about clinical trials. Such questions are appropriate given the importance of these study properties for inference. Independent replication is our only method to control random variation, and studying an adequate number of subjects is the way to align clinical and statistical significance. Most trials have a component of longitudinal follow-up that sometimes can be adjusted to help manage the number of events observed. These design features are especially critical for comparative and middle developmental trials. Size is less important in translational trials because their findings depend strongly on underlying biological models. For dose-finding and dose-ranging, sample size is typically an outcome of the study. However, adequate precision is a universal concern.

The literature regarding sample size and power in clinical trials is large, although not always well organized. Exceptions are the works by Desu and Raghavarao [364], Kraemer and Thiemann [868], and Machin and Campbell [966, 967]. Shuster [1389, 1390] has also provided a compendium of methods and tables. Other useful references will be provided in context below. The widespread availability of flexible computer programs with good interfaces and graphics has made the use of tables nearly unnecessary. Even so, there are overall relationships that a well organized table can highlight, but that may be obscured by a one-at-a-time calculation from a computer program. No matter what computational aid is used to support sample size and power determinations, there is no substitute for common sense and instinct to help validate calculated numbers against one's experience.

There are two widely used frequentist perspectives for determining the appropriate sample size for a clinical trial. One is based on confidence intervals as a representation
TABLE 16.1 Brief Description of Quantitative Design Parameters Frequently Used in Clinical Trials

Power: 1 − β
β level (type II error): type II error probability
α level (type I error): type I error probability
Likelihood ratio: relative strength of evidence
Sample size: number of experimental subjects
Effect size: treatment difference expressed as number of standard deviations
Number of events: number of experimental subjects who have a specific outcome
Study duration: interval from beginning of trial to end of follow-up
Percent censoring: percent of study participants left without an event by the end of follow-up
Allocation ratio: ratio of sample sizes in the treatment groups
Accrual rate: new subjects entered per unit of time
Loss to follow-up rate: rate at which study participants are lost before outcomes can be observed
Follow-up period: interval from end of accrual to end of study
Δ: smallest treatment effect of interest based on clinical considerations

For exact definitions of these terms, see Appendix B.
of precision for the observed effect. Sample size is chosen to make the confidence interval suitably narrow. The second method is based on the chance of rejecting the null hypothesis when a treatment effect of given magnitude is present, which is power. Power is often used to describe and motivate the size of comparative trials. This chapter will also discuss a third approach to sample size based on likelihood ratios or relative evidence. The common theme for all methods is precision of estimation, so it is not surprising that they are closely related or interchangeable. Specification and study of the quantitative properties of a clinical trial requires both concepts and computations. The conceptual framework involves the interplay of (i) the nature of the outcome variable, (ii) the framework for specifying precision, (iii) accrual dynamics, and (iv) the design of the trial. A number of design parameters are, to varying degrees, under the control of the investigator (Table 16.1). It is essential to study the sensitivity of computations to the design parameters to prevent overly optimistic or nonrobust designs. A trial that is feasible despite conservative assumptions is vastly more comfortable than one whose success depends on them.
16.2 PRINCIPLES
Because there is no universal way to answer precision, power, and sample size questions, I will discuss basic ideas and examples on a case-by-case basis, synthesizing whenever possible. Although some abbreviated sample size tables are presented later in the chapter, it will be necessary to use the formulas for direct calculation of specific cases or to consult the original references for more extensive tabulations. For a more statistically oriented
review of fixed sample size calculations, see Donner [387] or Lachin [877]. Tables and equations for specific purposes are presented in the references cited above. Quantitative aspects of group sequential designs for comparative trials are discussed in Chapter 18. For adaptive designs, sample size may be an outcome of the trial rather than a fixed design parameter. Such designs are best studied by simulation, which, although it requires more computer programming, is still merely a computation.
16.2.1 What Is Precision?
The underlying theme of sample size considerations in all clinical trials is precision. Precision is the reproducibility of measurements. High precision implies little variation. Precision therefore can be described by the confidence interval around, or standard error of, the estimate. For this discussion, I will always assume that our estimates are unbiased, meaning accurate. Precision of estimation is the characteristic of an experiment that is most directly a consequence of sample size. In contrast, other important features of the estimates, such as validity, unbiasedness, and reliability do not relate to study size. Precision is a consequence of measurement error, person-to-person (and other) variability, number of replicates (e.g., sample size), experiment design, and methods of analysis. By specifying quantitatively the precision of measurement required in an experiment, the investigator is implicitly outlining the sample size, and possibly other features of the study. We can specify precision in several ways including (i) direct specification on the scale of measurement, (ii) indirect specification through confidence limits, (iii) the power of a statistical hypothesis test, (iv) relative evidence through a likelihood ratio, (v) a guess based on ordinal scales of measuring effect sizes, (vi) ad hoc use of designs similar to those employed by other investigators, and (vii) simulations of the design and analysis assuming different sample sizes. In this chapter, I will discuss primarily approaches 2, 3, and 4. Direct specification of precision is sometimes possible. For example, we may need a sample size sufficient to estimate a mean diastolic blood pressure with a standard error of ±2 mm Hg. Such a specification is unusual in clinical trials. More typically we specify precision through the absolute or relative width of a confidence interval. This is useful in single cohort studies. For example, our sample size may need to be large enough for the 95% confidence interval around the mean diastolic blood pressure to be ±4 mm Hg. The size of a comparative trial could be described in terms of the precision in the estimated relative treatment effect. For example, we might want the lower confidence bound for an expected hazard ratio of 1.75 to be greater than 1.5, which can be translated into a required number of events. However, the sizes of comparative trials are more often described by the power of hypothesis tests. For example, we may need a sample size sufficient to yield 90% power to detect a 5 mm Hg difference as being statistically significant, using a specified type I error level. How this captures the notion of precision will be made more clear later. In all of these circumstances a convenient specification of precision combined with knowledge of variability can be used to determine the sample size.
16.2.2 What Is Power?
Many clinical investigators have difficulty with the concept of power and the way it should be used to design and interpret studies, or indeed whether it should be used to interpret studies at all. A principal reason for this difficulty is that power is an unnatural idea. However, it is central to thinking about clinical trials and therefore necessary to understand. A key utility of power is that it requires investigators to address precision of estimation while a study is being designed.

One source of vagueness is that power can only be defined in terms of a hypothetical treatment effect of a certain size. Note how this contrasts with the type I error, whose basis is the null hypothesis of no treatment difference. For clinical inferences, we might prefer that our study yield the probability that the true treatment effect is at least a certain size. However, power assumes a treatment effect of a specified size and describes the probability that a result as large or larger would be statistically significant. Power is therefore neither the chance that the null hypothesis is false nor a probability statement about the treatment effect, both common errors of interpretation. A third source of difficulty with the idea of power is that scientists tend to conceptualize variation at the level of the individual or the experimental unit, whereas power refers to variation at the level of the experiment. Because we never conduct a large number of replicates of our experiment, variability at the experiment level is always an unobserved abstraction. Individual-level and experiment-level (or sampling frame) variation are connected fundamentally, but understanding the latter is not so intuitive. Investigators take only a single sample from the space of all possible experiments, meaning we perform our study once. It is not obvious what probability statements can be made under such conditions. To be fair, precision also refers to variation at the level of the experiment, but may be easier to understand.

As a final entanglement, we cannot separate power from either the size of the study or the magnitude of the hypothetical treatment effect. All three ideas must be discussed simultaneously. I refer to this as the power triplet. Unfortunately, statements about power are often made without reference to the effect size or study size. For example, someone might say only "This trial has 90% power." Such isolated statements are meaningless or at best ambiguous. Every trial has 90% power for some effect size. But many do not have sufficient power for clinically relevant effect sizes. The frequency of such errors among inexperienced investigators is high, and contributes greatly to misunderstanding power.

The size of a comparative study is intertwined with the intended error rates in the frequentist paradigm. By convention, most power equations are written in terms of the corresponding normal quantiles for the type I and type II error rates, $Z_\alpha$ and $Z_\beta$, respectively, rather than the error probabilities themselves. The definitions of $Z_\alpha$ and $Z_\beta$ follow from that for the cumulative normal distribution, which gives lower (left) tail areas as

$$\Phi(Z) = \int_{-\infty}^{Z} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx. \qquad (16.1)$$
Then the normal quantile for a two-sided type I error would be defined by 1 − 𝛼∕2 = Φ(𝑍1−𝛼∕2 ). Similarly for a type II error, we would have 1 − 𝛽 = Φ(𝑍1−𝛽 ). To simplify the notation slightly in the remainder of the chapter, I will employ a typical inconsistency. Because type I errors are typically small, the respective 𝑍 quantile is large (positive) and
in the upper tail of the distribution. The upper tail area corresponding to power is large, making the corresponding $Z$ quantile negative. To keep both $Z$'s positive, we can define $Z_\alpha = \Phi^{-1}(1 - \alpha/2)$ and $Z_\beta = \Phi^{-1}(1 - \beta)$. For $\alpha = 0.05$, $Z_\alpha = 1.96$, and for $\beta = 0.10$, $Z_\beta = 1.282$.
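These quantiles are easily computed with any statistical library; for example, a brief check using the normal distribution functions in scipy:

    from scipy.stats import norm

    alpha, beta = 0.05, 0.10
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided type I error quantile: about 1.960
    z_beta = norm.ppf(1 - beta)         # quantile for 90% power: about 1.282
    print(round(z_alpha, 3), round(z_beta, 3))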
16.2.3 What Is Evidence?
We would like our clinical trial to provide evidence regarding the treatment effect sufficient for the decisions at hand. One difficulty with the notion of evidence is that it is weighed differently depending on circumstances. For example, in regulation, "substantial evidence of effectiveness" is required for drug approval. For new devices, "reasonable assurance of safety and efficacy" is needed, whereas for existing devices "substantial evidence of equivalence" is the standard. Some professional societies ask that treatments be "useful and effective" before establishing them as standards. In the payer domain, CMS approves those things that are "reasonable and necessary," whereas others use "useful and customary" to set standards. In none of these important cases is evidence or relative evidence actually defined.

Statistically, it is not possible to define evidence in support of a single hypothesis. However, when assessing competing hypotheses, relative evidence can be defined appropriately. This was demonstrated informally in Section 11.5.3, where the information or evidence produced by a small trial could only be assessed in relation to a specific state of uncertainty. There, the reference state was taken to be the state of maximal uncertainty. Evidence in support of one hypothesis compared to another is captured in the statistical likelihood function, specifically the likelihood ratio (LR), which can be interpreted as relative strength of evidence as in Chapter 7 [1298]. Likelihoods form a basis for a wide class of both estimation methods and statistical tests, and it is perhaps not surprising that the LR can be interpreted as relative evidence. The likelihood is essentially the probability of observing the data under an assumed probability model. The parameters of the probability model are the entities about which we would like to make inferences. Clinical trialists have not been accustomed to assessing and designing studies in terms of LRs, but as statistical tools they have very desirable properties. More detail is provided in Chapter 7.

For methods based on the normal distribution of a test statistic, as many of the common power and sample size techniques are, a simple perspective on likelihood-based thinking can be obtained by considering the following: the LR can be calculated from a standard $Z$-score as $R = e^{\Lambda} = \exp(Z^2/2)$. This follows directly from the normal distribution and likelihood. Power formulas derived from hypothesis tests almost always contain a term of the form $(Z_\alpha + Z_\beta)^2$ arising from quantification of the type I and II errors. Therefore, this term can be replaced by $2\log(R)$, where $R$ is an appropriately chosen standard of relative evidence for the comparison being planned. With a little practice, it is no more difficult to choose an appropriate standard for relative evidence than to choose sensible values for type I and type II errors. For example, a two-sided 5% type I error and 90% power would result in $(Z_\alpha + Z_\beta)^2 = (1.96 + 1.282)^2 = 10.5$, for which $R = \exp(10.5/2) = 190$. This LR corresponds to strong evidence, as discussed below.

Evidence can be weak or strong, misleading or not ([1298], Chapter 2). It is relatively easy to reduce the probability of misleading evidence by an appropriate choice for the likelihood ratio. For example, the chance of misleading evidence is 0.021, 0.009, and
0.004 for 𝑅 = 8, 16, and 32, respectively. But producing strong evidence in favor of the correct hypothesis requires sample sizes somewhat larger than those yielded by the typical hypothesis testing framework. With sample sizes based on hypothesis tests in the usual paradigm, we will fail to produce likelihood ratios in excess of eight about 25–30% of the time. The sample sizes required by an evidentiary approach can be 40–75% greater. This emphasizes that choosing between competing hypotheses in the typical framework is an easier task than generating strong evidence regarding the correct hypothesis. Some quantitative consequences of this can be seen in Section 16.7.10.
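The correspondence between the hypothesis-testing and evidential scales can be computed directly. The short sketch below evaluates $(Z_\alpha + Z_\beta)^2$ for conventional error rates, the implied likelihood-ratio standard $R = \exp[(Z_\alpha + Z_\beta)^2/2]$, and the replacement term $2\log(R)$ for evidence standards of 8, 16, and 32.

    import numpy as np
    from scipy.stats import norm

    z_alpha, z_beta = norm.ppf(0.975), norm.ppf(0.90)
    q = (z_alpha + z_beta) ** 2
    print(f"(Z_alpha + Z_beta)^2 = {q:.2f}, implied R = {np.exp(q / 2):.0f}")  # about 10.5 and 190

    # evidence standards and the corresponding term that replaces (Z_alpha + Z_beta)^2
    for R in (8, 16, 32):
        print(f"R = {R:2d}  ->  2*log(R) = {2 * np.log(R):.2f}")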
16.2.4 Sample Size and Power Calculations Are Approximations
There are several ways in which power and sample size calculations are approximations. First, the equations themselves are often based on approximations to the exact statistical distributions. For example, a useful and accurate equation for comparing means using the t-test can be based on a normal approximation with known variance, a so-called z-test. Second, the idea of predicting the number of subjects required in a study is itself an approximating process because it depends on guesses about some of the parameters. For example, the accrual rate, censoring rate, or variance may be known only approximately. The difference in sample sizes produced by hypothesis testing approaches and those based on relative evidence also indicates how solutions to this problem depend on our framework and assumptions.

One cannot take the estimated sample size or power that results from such approximations rigidly. We hope that our errors are small and the required sample size is accurate, generally within 5 or 10% of what is truly needed. However, we must be prepared to make the design more robust if some parameters are not known reliably. A sensitivity analysis will reveal the need for this. Often we must increase sample size to compensate for optimistic assumptions. Furthermore, we should not allow small differences in sample size to dictate important scientific priorities. If it is essential clinically to achieve a certain degree of precision, then it is probably worthwhile to accrue 10 or 20% more participants to accomplish it. Finally, some subjects are usually lost from the experiment. We can inflate the sample size to compensate for this, but the exact amount by which to do so is often only an educated guess.

When solved for sample size, the equations that follow do not necessarily yield whole numbers. The answers must be rounded to a nearby integer. For comparative trial designs, it is probably best to solve for the sample size in one group, round to the next higher integer, and multiply by the number of treatment groups. The tables presented should be consistent with this policy. I have not rounded the answers for many of the calculations in the text because the important concepts do not rely on it.
16.2.5 The Relationship between Power/Precision and Sample Size Is Quadratic
Investigators are often surprised at how small changes in some design parameters yield large changes in the required sample size or study duration for experiments. At least for middle- and late development clinical trials, sample size tends to increase as the square, or inverse square of other design parameters. In general, the sample size increases as the square of the standard deviation of the treatment difference and the sum of normal quantiles for the type I and II error rates. Reducing the error rates increases 𝑍𝛼 and 𝑍𝛽
and increases the required sample size. Larger variability in the endpoint measurement increases the sample size. Also the sample size increases as the inverse square of the treatment difference—detecting small treatment effects increases the required sample size greatly. These and other quantitative ideas will be made clear in the following sections.
16.3 EARLY DEVELOPMENTAL TRIALS
Early developmental trials do not present difficult issues with regard to sample size or precision, but they are somewhat unconventional. These studies are not designed to provide the strong statistical evidence or decision properties of later trials. I include a short discussion here for completeness, but the reader should refer to the appropriate chapters for more details regarding design considerations for these trials. It might also be useful to read Section 2.4.6 at this point.

16.3.1 Translational Trials
Translational clinical trials are small, almost by definition, and nearly always exist below the sample size threshold for any reasonable degree of purely statistical certainty. However, as I indicate in Chapter 11, with relatively small numbers of subjects and biological knowledge, these trials can reduce uncertainty enough to guide subsequent experiments. Their size can be motivated formally by considering the information gained, but practical constraints will usually make them smaller than 20 subjects.

A problem in the design of some translational studies, as well as other circumstances, is taking a large enough sample to estimate reliably the mean of some measurement. Let us assume that the measurements are samples from a normal distribution with mean $\mu$ and known variance $\sigma^2$. The sample mean $\overline{X}$ is an unbiased estimate of $\mu$, and the absolute error is $|\overline{X} - \mu|$. The sample size could be chosen to yield a high probability that the absolute error will be below some tolerance $d$. In other words, we intend

$$\Pr\left(\,|\overline{X} - \mu| \le d\,\right) \ge 1 - \alpha, \qquad (16.2)$$

where $\alpha$ is the chance of exceeding the specified tolerance. The sample size required to satisfy this is

$$n \ge \left(Z_{1-\alpha/2}\,\frac{\sigma}{d}\right)^2. \qquad (16.3)$$

In practice, we round $n$ up to the next highest integer. To test a hypothesis regarding the mean, this equation must be modified slightly to account for the type II error. Specifically, we obtain

$$n \ge \left(\frac{(Z_{1-\alpha/2} + Z_\beta)\,\sigma}{\mu_1 - \mu_0}\right)^2,$$
where $Z_\beta$ is the normal quantile for the intended type II error (Section 16.2.2) and $\mu_1$ and $\mu_0$ represent the alternative and null means. This formula will arise later in the discussion of comparative trials in Section 16.7.2.

If $\sigma^2$ is unknown, we cannot use this formula directly. However, if $d$ is specified in units of $\sigma$, the formula is again applicable. In that case $d = m\sigma$ and

$$n \ge \left(\frac{Z_{1-\alpha/2}}{m}\right)^2.$$

If $d$ cannot be specified in terms of $\sigma$, a staged procedure is needed where the first part of the sample is used to estimate $\sigma^2$ and additional samples are taken to estimate the mean. See Desu and Raghavarao [364] for a sketch of how to proceed in this circumstance.

Example 16.1. Suppose that the variance is unknown, but we require a sample size such that there is a 95% chance that the absolute error in our estimate of the mean is within 1/2 a standard deviation ($0.5\sigma$). We then have

$$n = \left(\frac{1.96}{0.5}\right)^2 \approx 16,$$

so 16 observations are required. The sample size is very strongly affected by the precision. If we intend to be within $0.1\sigma$, for example, almost 400 samples are required.

A similar circumstance arises in a sequence of Bernoulli trials when estimating the proportion of successes. If $p$ is the probability of success with each trial, $r$, the number of successes in $n$ trials, follows a binomial distribution with mean $np$ and variance $np(1-p)$. Equation (16.2) then becomes

$$\Pr\left(\,\left|\frac{r}{n} - p\right| \le d\,\right) \ge 1 - \alpha,$$

and we can use the normal approximation to the binomial with $\sigma^2 = p(1-p)$ so that equation (16.3) yields

$$n \ge \left(Z_{1-\alpha/2}\,\frac{\sqrt{p(1-p)}}{d}\right)^2. \qquad (16.4)$$

The variance is maximal when $p = 1/2$, so conservatively

$$n \ge \left(\frac{Z_{1-\alpha/2}}{2d}\right)^2.$$

Further use of this equation will be illustrated below.
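A few lines of code reproduce these calculations; the function names are arbitrary and only the formulas above are used.

    import math
    from scipy.stats import norm

    z = norm.ppf(0.975)   # Z_{1-alpha/2} for alpha = 0.05

    def n_mean(m):
        # equation (16.3) with the tolerance d expressed as m standard deviations
        return math.ceil((z / m) ** 2)

    def n_proportion(d, p=0.5):
        # equation (16.4); p = 1/2 gives the conservative version
        return math.ceil(z ** 2 * p * (1 - p) / d ** 2)

    print(n_mean(0.5))          # Example 16.1: about 16 observations
    print(n_mean(0.1))          # within 0.1 SD: nearly 400
    print(n_proportion(0.10))   # conservative binomial version with d = 0.10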
16.3.2 Dose-Finding Trials
Dose-finding/ranging trials are commonly not designed to meet criteria of precision in estimation. Instead, they are designed to select or identify a dose that meets certain operational criteria, and the resulting sample size is an outcome of the study. The precision
with which important biological parameters are determined is usually not a formal part of the design of such trials. For example, in classic DF oncology trials we usually do not focus on the precision in the estimate of the MTD or of pharmacokinetic parameters, even though some such assessment might be available from the data at the end of the study. Perhaps this lack of attention to precision is an oversight. Numerous methods might be applied to DF trials to assess their precision. For example, from the model-fitting that is integral to the CRM, the variance of the parameters can be estimated. It would be reduced by increasing cohort sizes or additional dose points. The precision in estimates of PK parameters might be assessed by simulation or bootstrapping and increased in similar ways. Study of such methods would have to provide strong evidence of inadequacy in current designs to change the habits of clinical investigators.

Sampling issues can arise in pharmacokinetic studies. For example, when estimating the area under the blood time–concentration curve (AUC) after administration of a drug, what are the best number and spacing of sample points? Under an assumed model for the time–concentration curve, the precision in the estimated AUC as a function of the number of samples and their spacing could be studied by simulation. This is not routinely done, perhaps for several reasons. First, great precision in the estimated AUC is not usually required. Second, the optimal design will necessarily depend on unknown factors such as the half-life of the drug and its peak concentration. Finally, the sample points are often largely dictated by clinical considerations, such as patient comfort and convenience.
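As an illustration of the kind of simulation that could be used, the sketch below estimates the precision of a trapezoidal AUC estimate as a function of the number of equally spaced samples, under an assumed one-compartment concentration curve with an assumed 15% lognormal assay error. Every parameter here is hypothetical.

    import numpy as np
    from scipy.integrate import trapezoid

    rng = np.random.default_rng(3)
    ka, ke, scale = 1.5, 0.2, 10.0   # hypothetical absorption/elimination rates and scale

    def conc(t):
        # assumed one-compartment oral model, noise-free
        return scale * (np.exp(-ke * t) - np.exp(-ka * t))

    def auc_precision(n_samples, cv=0.15, reps=2000):
        times = np.linspace(0.5, 24.0, n_samples)
        aucs = [trapezoid(conc(times) * rng.lognormal(0.0, cv, n_samples), times)
                for _ in range(reps)]
        return np.mean(aucs), np.std(aucs)

    for k in (4, 6, 12):
        m, s = auc_precision(k)
        print(f"{k:2d} samples: mean AUC(0-24h) = {m:.1f}, SD = {s:.1f}")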
16.4 SIMPLE ESTIMATION DESIGNS
Simplicity, which classically took the form of a single cohort, is a highly desired goal for middle development designs. It contributes to ease of design, analysis, and interpretation, short calendar time and therefore low cost, a minimum number of subjects, and the use of clinically relevant outcomes. Also, ethics considerations are less complex if all participants receive the new therapy. The main difficulty with such trials arises from the lack of internal controls coupled with a select cohort, opening the estimate of treatment effect to bias. Both the benefits and risks of such simple single cohort trials are considerable, and it is a serious debate as to when they can and should be used to screen therapies. Estimation of treatment effects provides both a clinical and statistical entry in the rationale, design, and interpretation of such trials. Therefore, I will discuss several of these designs quantitatively from that viewpoint.
16.4.1 Confidence Intervals for a Mean Provide a Sample Size Approach
When the treatment effect of interest is the mean of a distribution of measured values, the size of a screening trial can be motivated by precision defined via a confidence interval. We assume that the sample mean has a normal distribution. Both mean and standard deviation are needed to characterize a normal distribution. Preliminary data may be needed to estimate the standard deviation. Here I will assume that the standard deviation is known. There is no naturally bounded range for the mean or standard deviation, unlike for the binomial discussed below, so there is no universal scale for specifying the width of relevant confidence intervals. We could express precision either in absolute terms or relative to the mean or variance.
If $\hat{\mu}$ is the estimated mean from $n$ observations, the confidence interval for the true mean is $\hat{\mu} \pm Z_\alpha \times \sigma/\sqrt{n}$, where $\sigma$ is the standard deviation and, to simplify notation, $\alpha$ implies the usual $1 - \alpha/2$ quantile. If our tolerance for the width of the confidence interval is $w = Z_\alpha \times \sigma/\sqrt{n}$, the sample size must satisfy

$$n = \left(\frac{Z_\alpha \sigma}{w}\right)^2.$$

See also Section 16.3.1 and equation (16.3).

Example 16.2. Suppose that we are measuring the serum level of a new drug such that we require a precision of ±0.5 mg/dl in the 95% confidence interval, and the standard deviation of the distribution of drug levels is $\sigma = 5$ mg/dl. The required sample size is then $n = (1.96 \times 5 / 0.5)^2 = 384.2$, or 385 subjects.

The tolerance, $w$, specifies one-half the width of the confidence interval in absolute terms. It is not a percentage, and relates to the scale of the distribution only in absolute terms. In some cases this may be appropriate. However, precision could be expressed relative to either $\mu$ or $\sigma$. If we specify the width of the confidence interval in terms of the standard deviation, $w = w'\sigma$,

$$n = \left(\frac{Z_\alpha \sigma}{w'\sigma}\right)^2 = \left(\frac{Z_\alpha}{w'}\right)^2.$$

Similarly, in terms of the mean, $w = w''\mu$,

$$n = \left(\frac{Z_\alpha \sigma}{w''\mu}\right)^2 = \left(\frac{Z_\alpha}{w''}\right)^2 \left(\frac{\sigma}{\mu}\right)^2,$$

when the mean is not zero. The parameters $w'$ and $w''$ have convenient interpretations: $w'$ is the desired tolerance expressed as a fraction of a standard deviation, and $w''$ has a similar interpretation with respect to the mean. The ratio of standard deviation to mean, $\sigma/\mu$, is the "coefficient of variation."

Example 16.3. To narrow the 95% confidence interval to ±5% of the mean, set $w'' = 0.05$,

$$n = \left(\frac{1.96}{0.05}\right)^2 \left(\frac{\sigma}{\mu}\right)^2 \approx 1536 \times \left(\frac{\sigma}{\mu}\right)^2.$$

Unless the coefficient of variation is quite small, a large sample size will be required to attain this precision. From the other perspective, if the 95% confidence interval for the true mean is required to be one-fourth of a standard deviation, $w' = 0.25$, then

$$n = \left(\frac{1.96}{0.25}\right)^2 \approx 62.$$
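These confidence-interval-based sample sizes are simple to compute directly; for example (function names are arbitrary):

    import math
    from scipy.stats import norm

    z = norm.ppf(0.975)   # 95% two-sided

    def n_absolute(sigma, w):
        # half-width w specified on the measurement scale
        return math.ceil((z * sigma / w) ** 2)

    print(n_absolute(5.0, 0.5))        # Example 16.2: 385 subjects
    print((z / 0.05) ** 2)             # Example 16.3 multiplier: about 1536 times (sigma/mu)^2
    print(math.ceil((z / 0.25) ** 2))  # within a quarter SD: about 62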
16.4.2 Estimating Proportions Accurately
Another simple objective for middle development could be to estimate the proportion of clinical successes or response rate with a specified precision. Response could be objective evidence of benefit, such as tumor shrinkage in the classical oncology context, or more generally measured on an appropriate biological, symptom, or functional scale and dichotomized. A useful measure of precision in this circumstance is, as above, the confidence interval around the estimated proportion. Narrow 95% confidence intervals indicate a high degree of certainty about the magnitude of the true effect. Confidence intervals for proportions also depend on the number of subjects studied, so precision specified in this way can be translated into sample size. There are many methods for estimating these intervals [1514].

The estimated proportion will be denoted by $p$. It is customary to use $\hat{p}$ for estimates, but I will omit the hat to ease readability. A confidence interval around $p$ will be based on the binomial distribution, and could use either exact distribution methods or a normal approximation to the binomial. The normal approximation is fine for $0.2 < p < 0.8$. An approximate $100(1-\alpha)\%$ confidence interval for the proportion, $p$, is

$$p \pm Z_\alpha \times \sqrt{\frac{p(1-p)}{n}},$$

where $n$ is the sample size of the study and $Z_\alpha$ is the normal quantile for the two-sided coverage percentage. For example, $Z_\alpha = 1.96$ for a two-sided 95% confidence interval. The square root term is the binomial standard deviation, but the structure of the formula is from the normal approximation.

Example 16.4. Consider a trial in which patients with esophageal cancer are treated with chemotherapy prior to surgical resection. A complete response is defined as the absence of macroscopic and microscopic tumor at the time of surgery. We suspect that this might occur 35% of the time and would like the 95% confidence interval of our estimate to be ±15%. The formula yields $0.15 = 1.96 \times \sqrt{0.35(1-0.35)/n}$, or $n = 39$ subjects required to meet the stated requirements for precision. Because 35% is just an estimate of the proportion responding and some subjects may not complete the study, the actual sample size to use might be increased slightly. Expected accrual rates can be used to estimate the duration of this study in a straightforward fashion.

A rough but useful general guide for estimating sample sizes needed for proportions may be derived in the same way. Because $p(1-p)$ is maximal for $p = 1/2$, and $Z_{0.975} \approx 2$, an approximate and conservative relationship between $n$, the sample size, and $w$, the half-width of the 95% confidence interval, is $n = 1/w^2$. To achieve a precision of ±10% (0.10) requires 100 subjects, and a precision of ±20% (0.20) requires 25 subjects. The relationship between sample size and precision is always an inverse square law, which, especially in middle development, works against overly optimistic requirements. Very frequently, investigators accept precision requirements in the range of 0.15–0.20 that can be achieved with 25–50 subjects. This rule of thumb is not valid for proportions that deviate greatly from 1/2. For proportions less than about 0.2 or greater than about 0.8, exact binomial methods should be used.
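The same arithmetic in code, including the conservative $1/w^2$ rule of thumb:

    import math
    from scipy.stats import norm

    z = norm.ppf(0.975)

    def n_for_proportion(p, w):
        # normal-approximation sample size for a 95% CI of half-width w around p
        return math.ceil(z ** 2 * p * (1 - p) / w ** 2)

    print(n_for_proportion(0.35, 0.15))   # Example 16.4: 39 subjects
    print(1 / 0.15 ** 2)                  # rule of thumb: about 44, see Example 16.5 below
    print(1 / 0.10 ** 2)                  # 100 subjects for a precision of +/-10%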
Example 16.5. A new therapy is expected to benefit 3/4 of all new people with stroke if applied within the first 6 hours of onset of symptoms. To estimate the proportion of individuals who benefit with a precision of ±15% in the 95% confidence bounds, $n = 1/0.15^2 = 44$ subjects would be needed on a trial.

16.4.3 Exact Binomial Confidence Limits Are Helpful
The method just sketched for middle development studies uses 95% confidence intervals based on the normal approximation to the binomial distribution. This approximation may not be valid when $p$ is extreme, when the sample size is small, or when we require a high degree of confidence as for 99% confidence intervals. Quantitative safety guidelines for middle development trials are often concerned with probabilities that fall into these ranges. For example, we might want to stop a trial if the incidence of serious side effects is greater than 10% or the success rate is less than 20%. In these circumstances, exact confidence limits based on the tail area of the binomial distribution are needed.

Tail areas of a discrete distribution like the binomial can be obtained by summing probability terms. For example, the probability of obtaining exactly $r$ successes out of $n$ independent trials when the success probability for each trial is $p$ is

$$\Pr[X = r] = \binom{n}{r} p^r (1-p)^{n-r},$$

and the chance of $r$ or fewer successes is the lower tail area

$$\Pr[X \le r] = \sum_{k=0}^{r} \binom{n}{k} p^k (1-p)^{n-k}.$$

Our confidence bounds exclude values of $p$ that are not consistent with $\hat{p} = r/n$. Therefore, an upper $100(1 - \alpha/2)\%$ confidence bound for the true proportion is the value of $p$ that satisfies

$$\frac{\alpha}{2} = \sum_{k=0}^{r} \binom{n}{k} p^k (1-p)^{n-k}, \qquad (16.5)$$

and a lower $100(1 - \alpha/2)\%$ confidence bound is the value of $p$ that satisfies

$$\frac{\alpha}{2} = \sum_{k=r}^{n} \binom{n}{k} p^k (1-p)^{n-k}. \qquad (16.6)$$

Note the resemblance of these formulas to equation (16.1). When $r = 0$, the lower limit is defined to be 0, and when $r = n$, the upper limit is defined to be 1. The confidence limits from these equations have the properties that

$$p_L \text{ for } \frac{r}{n} = 1 - p_U \text{ for } \frac{n-r}{n}$$

and

$$p_U \text{ for } \frac{r}{n} = 1 - p_L \text{ for } \frac{n-r}{n}.$$
TABLE 16.2 Exact Binomial 𝟗𝟓% Confidence Limits for 𝒓 Responses Out of 𝒏 Subjects 𝑟 𝑛
0
𝑎
5 0.451 6 0.393 7 0.348 8 0.312 9 0.283 10 0.259 11 0.238 12 0.221
1
2
3
4
5
0.005 0.716 0.004 0.641 0.004 0.579 0.003 0.526 0.003 0.482 0.003 0.445 0.002 0.413 0.018 0.385
0.053 0.853 0.043 0.777 0.037 0.710 0.032 0.651 0.028 0.600 0.025 0.556 0.023 0.518 0.055 0.484
0.118 0.882 0.099 0.816 0.085 0.755 0.075 0.701 0.067 0.652 0.060 0.610 0.099 0.572
0.157 0.843 0.137 0.788 0.122 0.738 0.109 0.692 0.152 0.651
0.187 0.813 0.167 0.766 0.211 0.723
𝑎 One-sided
confidence interval for 𝑟 = 0, two-sided for all other cases. For each 𝑛, the first row is the lower bound; the second row is the upper bound.
In general, these equations have to be solved numerically. Alternatively, some values can be found tabulated [377]. Because this calculation is correct and useful even when the normal approximation can be used, a flexible computer program for it is available, as explained in Appendix A. Some values are provided in Tables 16.2–16.4. Note that the confidence interval, like the binomial distribution, need not be symmetric around $p$, even when both tails contain the same fraction of the distribution.

It is customary to refer to confidence limits based on equations (16.5) and (16.6) as being "exact" because they use the binomial probability distribution and sum tail areas exactly. In reality, they can still be approximate because of the discreteness of the distribution. These binomial confidence limits are known to be conservative, meaning wider than necessary, in general. A slightly improved confidence interval can be based on a Bayesian method [1523]. This is discussed below.

Example 16.6. Suppose that 3 of 19 subjects respond to α-interferon treatment for multiple sclerosis. Exact 95% binomial confidence limits on the proportion responding are (0.034–0.396) (Table 16.3). Approximate 95% confidence limits based on the normal approximation are (0.00–0.32). In this case the normal approximation yielded a negative number for the lower 95% confidence bound.

An interesting special case arises when $r = 0$, that is, when no successes (or failures) out of $n$ trials are seen. For $\alpha = 0.05$, the one-sided version of equation (16.5) is

$$0.95 = \sum_{k=1}^{n} \binom{n}{k} p^k (1-p)^{n-k} = 1 - \binom{n}{0} p^0 (1-p)^n$$
TABLE 16.3 Exact Binomial 𝟗𝟓% Confidence Limits for 𝒓 Responses Out of 𝒏 Subjects 𝑟 𝑛
0
𝑎
13 0.206 14 0.193 15 0.181 16 0.171 17 0.162 18 0.153 19 0.146 20 0.139
1
2
3
4
5
0.002 0.360 0.002 0.339 0.002 0.320 0.002 0.302 0.001 0.287 0.001 0.273 0.001 0.260 0.001 0.249
0.019 0.454 0.018 0.428 0.017 0.405 0.016 0.383 0.015 0.364 0.014 0.347 0.013 0.331 0.012 0.317
0.050 0.538 0.047 0.508 0.043 0.481 0.040 0.456 0.038 0.434 0.036 0.414 0.034 0.396 0.032 0.379
0.091 0.614 0.084 0.581 0.078 0.551 0.073 0.524 0.068 0.499 0.064 0.476 0.060 0.456 0.057 0.437
0.139 0.684 0.128 0.649 0.118 0.616 0.110 0.587 0.103 0.560 0.097 0.535 0.091 0.512 0.087 0.491
𝑎
One-sided confidence interval for 𝑟 = 0, two-sided for all other cases. For each 𝑛, the first row is the lower bound; the second row is the upper bound.
TABLE 16.4 Exact Binomial 𝟗𝟓% Confidence Limits for 𝒓 Responses Out of 𝒏 Subjects r 𝑛
6
7
8
9
10
13
0.192 0.749 0.177 0.711 0.163 0.677 0.152 0.646 0.142 0.617 0.133 0.590 0.126 0.566 0.119 0.543
0.230 0.770 0.213 0.734 0.198 0.701 0.184 0.671 0.173 0.643 0.163 0.616 0.154 0.592
0.247 0.753 0.230 0.722 0.215 0.692 0.203 0.665 0.191 0.639
0.260 0.740 0.244 0.711 0.231 0.685
0.272 0.728
14 15 16 17 18 19 20
For each 𝑛, the first row is the lower bound; the second row is the upper bound.
or $0.05 = (1-p)^n$. Thus, $\log(1-p) = \log(0.05)/n \approx -3/n$. This yields

$$p \approx 1 - e^{-3/n} \approx \frac{3}{n}, \qquad (16.7)$$

for $n \gg 3$. The same result was derived via a different model for equation (5.2). Actually, $n$ does not need to be very large for the approximation to be useful. For example, for $r = 0$ and $n = 10$, the exact upper 95% confidence limit calculated from equation (16.5) is 0.26. The approximation based on equation (16.7) yields 0.30. When $n = 25$, the exact result is 0.11 and the approximation yields 0.12. So, as a general rule, to approximate the upper 95% confidence bound on an unknown proportion, which yields 0 responses (or failures) out of $n$ trials, use $3/n$. This estimate is sometimes helpful in drafting stopping guidelines for toxicity or impressing colleagues with your mental calculating ability. Some one-sided values are given in the first column of Tables 16.2 and 16.3.

Example 16.7. The space shuttle flew successfully 24 times between April 12, 1981 and the January 28, 1986 Challenger disaster. What is the highest probability of failure consistent with this series of 24 successes?

Example 16.8. We could calculate a one-sided 95% confidence interval on 0 events out of 24 trials. The solution to the exact equation (16.5) yields $\hat{p} = 0.117$. Equation (16.7) gives $\hat{p} = 3/24 = 0.125$, close to the exact value. The shuttle data did not rule out a high failure rate before the Challenger accident. Up to 2003 at the time of the Columbia accident, there were 106 successful space shuttle flights. If no improvements in launch procedures or equipment had been made (a highly doubtful assumption), an exact two-sided 95% binomial confidence interval on the probability of failure would have been (0.002–0.05). Thus, we could not exclude a moderately high failure rate before Columbia (e.g., the data are not inconsistent with a 5% failure rate). Two events in 106 trials yield $\hat{p} = 0.02$ (0.006–0.066). Some other interesting problems related to the Challenger can be found in Agresti ([12], p. 135) and Dalal, Fowlkes, and Hoadley [331]. At the close of the space shuttle flights in July 2011, there were 135 successful flights and two accidents as mentioned. The empirical failure probability is 0.015. Exact binomial confidence limits for the probability of failure on these flights are 0.005–0.053. Because the shuttle flights constitute a finite series rather than a sample from an infinite series of such flights, the confidence limits may not be meaningful.
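Equations (16.5) and (16.6) can be solved numerically; an equivalent and convenient form uses quantiles of the beta distribution, which is how many programs implement the exact (Clopper–Pearson) limits. A short sketch, with the 0-out-of-n special case alongside:

    from scipy.stats import beta

    def exact_binomial_ci(r, n, conf=0.95):
        # beta-quantile form of the exact tail-area equations (16.5) and (16.6)
        a = 1 - conf
        lower = 0.0 if r == 0 else beta.ppf(a / 2, r, n - r + 1)
        upper = 1.0 if r == n else beta.ppf(1 - a / 2, r + 1, n - r)
        return lower, upper

    print(exact_binomial_ci(3, 19))   # about (0.034, 0.396), as in Example 16.6
    print(1 - 0.05 ** (1 / 24))       # one-sided upper bound for 0/24: about 0.117
    print(3 / 24)                     # the 3/n approximation: 0.125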
Precision Helps Detect Improvement
Suppose there is an agreed landmark or success rate for outcomes on some standard therapy, which will be denoted by 𝑝0 . This landmark might come from recent historical experience or could represent a clinical consensus regarding an uninteresting success
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
SIMPLE ESTIMATION DESIGNS
445
FIGURE 16.1 Conceptual framework for comparing the proportion of successes on a new therapy with a reference value external to the experiment.
rate. It is taken as a known constant external to the experiment. Our new therapy would have to exceed 𝑝0 by some clinically significant amount to be considered worthwhile. A second design specification is the expected performance of our new therapy, denoted by 𝑝1 , which is subject to sampling variability from the study cohort. On a success rate scale between 0 and 1 we would expect 𝑝1 > 𝑝0 . Equivalently, we could specify the clinically significant improvement 𝛿 required to sustain interest in the therapy, so that 𝑝1 = 𝛿 + 𝑝0 . The essence of this design is to demonstrate convincingly that 𝑝1 > 𝑝0 by having the lower confidence bound for 𝑝1 exceed 𝑝0 at the conclusion of the experiment. Figure 16.1 shows the set-up for the study parameters. The design can now be made essentially equivalent to the previous discussion motivated by precision. The required precision is 𝑤 = 𝑝1 − 𝑝0 = 𝛿. Attempting to resolve small differences between old and new treatments will quickly lead to infeasible sample sizes. Example 16.9. Standard therapy for a bladder infection produces a success rate of 70%. A new antibiotic is expected to increase this rate to 85%. To have the lower two-sided 95% (exact binomial) confidence bound exceed 70% when the true success rate is 85%, we would have to observe 34 or more successes in 40 subjects. If the success rate is higher than 85%, other data combinations satisfy the confidence bound criterion. Examples are 31 successes in 35 subjects and 15 successes out of 16 subjects. Power Up to now, I have omitted any consideration of the type II error in the implied comparison. But it is worth pointing out that the power of the procedure just outlined is only 50%. If 𝑝1 is the true performance of our new treatment and the study is designed for 𝑤 = 𝑝1 − 𝑝0 , the lower confidence bound on the observed proportion will exceed 𝑝0 only if we are lucky enough that success rate in our study sample equals or exceeds 𝑝1 . This will happen only half the time. To increase power, we must narrow the confidence interval by increasing the sample size. Then the lower confidence bound for some observed success rates less than 𝑝1 will also exceed 𝑝0 . Heuristically, this is equivalent to increasing 𝑍 √𝛼 , and can be accomplished by specifying our confidence interval as 𝑝1 ± (𝑍𝛼 + 𝑍𝛽 ) 𝑝1 (1 − 𝑝1 )∕𝑛, where 𝑍𝛽 has the role for type II error analogous to 𝑍𝛼 . This quantity must still satisfy the width requirement
Piantadosi
Date: July 27, 2017
446
Time: 4:45 pm
SAMPLE SIZE AND POWER
so that
√ 𝑤 = 𝑝1 − 𝑝0 = (𝑍𝛼 + 𝑍𝛽 )
𝑝(1 − 𝑝) , 𝑛
or 𝑛=
(𝑍𝛼 + 𝑍𝛽 )2 4(𝑝1 − 𝑝0 )2
,
(16.8)
where 𝑝1 (1 − 𝑝1 ) has been conservatively taken to be 14 . This is essentially a sample size formula for a one-sample binomial comparison, and gives results within 10% of some standard approaches. 16.4.5
Bayesian Binomial Confidence Intervals
Bayesian intervals have been much less frequently used than the classical formulation. However, they generally have good statistical properties, are flexible, and may be more appropriate for some purposes [791, 1523, 1524]. A Bayesian formulation for a binomial confidence interval can be derived as discussed below. We should be a little careful regarding terminology, because the Bayesian typically refers to credible intervals rather than confidence intervals. This arises from the Bayesian view that the parameter of interest is a random quantity, rather than a fixed constant of nature as in the frequentist formulation. Thus, a Bayesian credible interval is a true probability statement about the parameter of interest. Before any observations are taken, we might represent our knowledge of 𝑝 as a uniform prior distribution on the interval [0, 1]. In other words, any value of 𝑝 is equally likely. This assumption is subjective but leads to a workable solution. With a uniform prior, the posterior cumulative distribution for 𝑝 is 𝑝
𝐹 (𝑝) =
∫0 𝑢𝑟 (1 − 𝑢)𝑛−𝑟 𝑑𝑢 1
∫0 𝑢𝑟 (1 − 𝑢)𝑛−𝑟 𝑑𝑢
=
𝑝 (𝑛 + 1)! 𝑢𝑟 (1 − 𝑢)𝑛−𝑟 𝑑𝑢. 𝑟!(𝑛 − 𝑟)! ∫0
Therefore, by the same reasoning as for the frequentist approach, the lower confidence bound, 𝑝𝐿 , satisfies ) 𝑛+1 ( 𝑝𝐿 ∑ (𝑛 + 1)! 𝛼 𝑛+1 𝑘 𝑟 𝑛−𝑟 = 𝑢 (1 − 𝑢) 𝑑𝑢 = 𝑝𝐿 (1 − 𝑝𝐿 )𝑛+1−𝑘 . 2 𝑟!(𝑛 − 𝑟)! ∫0 𝑘 𝑘=𝑟+1
(16.9)
Similarly, the upper confidence bound satisfies ) 𝑟 ( 1 ∑ (𝑛 + 1)! 𝑛+1 𝑘 𝛼 = 𝑢𝑟 (1 − 𝑢)𝑛−𝑟 𝑑𝑢 = 𝑝𝑈 (1 − 𝑝𝑈 )𝑛+1−𝑘 . 𝑘 2 𝑟!(𝑛 − 𝑟)! ∫𝑝𝑈 𝑘=0
(16.10)
Confidence intervals based on this approach tend to be less conservative (narrower) than those based on the “classical” formulation. The difference is not large, and some may consider the additional conservativeness of the usual method to be desirable for designing trials. In any case it is important to see the different frameworks, and to recognize the technically superior performance of the Bayesian intervals. Tables 16.5–16.7
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
447
SIMPLE ESTIMATION DESIGNS
TABLE 16.5 Exact Bayesian Binomial 𝟗𝟓% Confidence Limits for 𝒓 Responses Out of 𝒏 Subjects 𝑟 0𝑎
𝑛 5
0.393 6 0.348 7 0.312 8 0.283 9 0.259 10 0.238 11 0.221 12 0.206
1
2
3
4
5
0.043 0.641 0.037 0.579 0.032 0.527 0.028 0.482 0.025 0.445 0.023 0.413 0.021 0.385 0.019 0.360
0.118 0.777 0.099 0.710 0.085 0.651 0.075 0.600 0.067 0.556 0.060 0.518 0.055 0.484 0.050 0.454
0.184 0.816 0.157 0.755 0.137 0.701 0.122 0.652 0.109 0.610 0.099 0.572 0.091 0.538
0.212 0.788 0.187 0.738 0.167 0.692 0.152 0.651 0.139 0.614
0.234 0.766 0.211 0.723 0.192 0.684
𝑎
One-sided confidence interval for 𝑟 = 0, two-sided for all other cases. For each 𝑛, the first row is the lower bound; the second row is the upper bound.
show Bayesian confidence limits for some cases that can be compared to those from the classical approach. A useful tabulation of Bayesian confidence limits for a binomial proportion is given by Lindley and Scott [943]. 16.4.6
A Bayesian Approach Can Use Prior Information
Consider the problem of planning a safety and activity study when some information is already available about efficacy, like the success probability or response rate, for the new treatment. In the case of cytotoxic drug development, objective sources for this evidence may be animal tumor models, in vitro testing, DF studies, or trials of pharmacologically related drugs. The evidence could be used to plan a safety and activity trial to yield more precise information about the true response rate for the treatment. Thall and Simon [1470, 1471] discuss an approach for the quantitative design of such trials. Here, I consider a rudimentary case using binomial responses to illustrate further a Bayesian approach. At the start of the trial, the Bayesian paradigm summarizes prior information about response probability in the form of a binomial probability distribution. For the confidence intervals presented above, the prior distribution on 𝑝 was taken to be uninformative, meaning any value for 𝑝 was considered equally likely. Suppose that the prior is taken to be equivalent to evidence that there have been 𝑟1 responses out of 𝑛1 subjects on the treatment. This information about the number of responses, 𝑟, or equivalently the true response rate, 𝑝, could be summarized as a binomial distribution ( ) 𝑛1 𝑟1 𝑓 (𝑟) = 𝑝 (1 − 𝑝)𝑛1 −𝑟1 . 𝑟1
Piantadosi
Date: July 27, 2017
448
Time: 4:45 pm
SAMPLE SIZE AND POWER
TABLE 16.6 Exact Bayesian Binomial 𝟗𝟓% Confidence Limits for 𝒓 Responses Out of 𝒏 Subjects 𝑟 𝑛
0𝑎
13 0.193 14 0.181 15 0.171 16 0.162 17 0.153 18 0.146 19 0.139 20 0.133
1
2
3
4
5
0.018 0.339 0.017 0.319 0.016 0.302 0.015 0.287 0.014 0.273 0.013 0.260 0.012 0.249 0.012 0.238
0.047 0.428 0.043 0.405 0.040 0.383 0.038 0.364 0.036 0.347 0.034 0.331 0.032 0.317 0.030 0.304
0.084 0.508 0.078 0.481 0.073 0.456 0.068 0.434 0.064 0.414 0.060 0.396 0.057 0.379 0.055 0.363
0.128 0.581 0.118 0.551 0.110 0.524 0.103 0.499 0.097 0.476 0.091 0.456 0.087 0.437 0.082 0.419
0.177 0.649 0.163 0.616 0.152 0.587 0.142 0.560 0.133 0.535 0.126 0.512 0.119 0.491 0.113 0.472
𝑎
One-sided confidence interval for 𝑟 = 0, two-sided for all other cases. For each 𝑛, the first row is the lower bound; the second row is the upper bound.
TABLE 16.7 Exact Bayesian Binomial 𝟗𝟓% Confidence Limits for 𝒓 Responses Out of 𝒏 Subjects r 𝑛
6
7
8
9
10
13
0.230 0.711 0.213 0.677 0.198 0.0646 0.184 0.617 0.173 0.590 0.163 0.566 0.154 0.543 0.146 0.522
0.266 0.734 0.247 0.701 0.230 0.671 0.215 0.643 0.203 0.616 0.191 0.592 0.181 0.570
0.278 0.722 0.260 0.692 0.244 0.665 0.231 0.639 0.218 0.616
0.289 0.711 0.272 0.685 0.257 0.660
0.298 0.702
14 15 16 17 18 19 20
For each 𝑛, the first row is the lower bound; the second row is the upper bound.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
449
SIMPLE ESTIMATION DESIGNS
The current best estimate of the response rate is 𝑝̂ = 𝑟1 ∕𝑛1 . The goal of the trial is to increase the precision of the estimate of 𝑝̂ using additional observations. When the experiment is completed, the response rate can be estimated by 𝑝̂ = 𝑟∕𝑛, where there are 𝑟 responses out of 𝑛 trials, and a confidence interval can be calculated from equations (16.5 and 16.6). To assist with sample size determination, an alternate parameterization of equations (16.5 and 16.6) for binomial tail areas could be used, )𝑘 ( )𝑛−𝑘 𝑟 ( )( [𝑛𝑝] [𝑛𝑝] 𝛼 ∑ 𝑛 = −𝑤 − 𝑤) , 1−( 2 𝑘=0 𝑘 𝑛 𝑛
(16.11)
)𝑘 ( )𝑛−𝑘 𝑛 ( )( [𝑛𝑝] [𝑛𝑝] 𝛼 ∑ 𝑛 = +𝑤 + 𝑤) , 1−( 2 𝑘=𝑟 𝑘 𝑛 𝑛
(16.12)
and
where 𝑤 is the width of the confidence interval and [⋅] denotes the nearest integer function. Specifying a value for 𝑤 at the completion of the study and assuming a value for 𝑝̂, will allow calculation of a total sample size by solving equations (16.11 and 16.12) for 𝑛. This problem is somewhat more tractable when using the incomplete beta function, 𝑛 ( ) ∑ 𝑛 𝑘 𝐵𝑝 (𝑎, 𝑛 − 𝑎 + 1) = 𝑝 (1 − 𝑝)𝑛−𝑘 , 𝑘 𝑘=𝑎
and its inverse, which I will denote by 𝐵 −1 (𝑝, 𝑎, 𝑏). Equations (16.11 and 16.12) then become ( ( ) ) 𝛼 𝛼 𝑝 − 𝑤 = 𝐵 −1 , 𝑟, 𝑛 − 𝑟 + 1 = 𝐵 −1 , 𝑛𝑝, 𝑛 − 𝑛𝑝 + 1 (16.13) 2 2 and ) ) ( ( 𝛼 𝛼 𝑝 − 𝑤 = 𝐵 −1 1 − , 𝑟 + 1, 𝑛 − 𝑟 = 𝐵 −1 1 − , 𝑛𝑝 + 1, 𝑛 − 𝑛𝑝 , 2 2
(16.14)
which are more obscure but easier to solve numerically using standard software. Example 16.10. Suppose that our prior information is 𝑛1 = 15 observations with 𝑟1 = 5 responses so that 𝑝̂ = 1∕3. Assume that our final sample size should yield a 99% credible interval with width 𝑤 = 0.15 and the true response rate is near 1∕3. Equations (16.13 and 16.14) will be satisfied by 𝑛 = 63 and 𝑛 = 57 (with 𝑟 ≈ 20). This solution will also be approximately valid for 𝑝̂ near 1∕3. Therefore, 42–48 additional observations in the study will satisfy the stated requirement for precision. An approximate solution can also be obtained from the normal approximation as in Section 16.4.2 with 𝑍𝛼 = 2.57. Then 𝑛 ≈ 0.33 × 0.66 × (2.57∕0.15)2 = 64, which compares favorably with the calculation based on exact confidence limits.
Piantadosi
Date: July 27, 2017
450
16.4.7
Time: 4:45 pm
SAMPLE SIZE AND POWER
Likelihood-Based Approach for Proportions
The binomial likelihood function is 𝑒(𝑝) = 𝑝𝑘 (1 − 𝑝)𝑛−𝑘 , where 𝑝 is the probability of success, 𝑘 is the number of successes, and 𝑛 is the number of subjects. The likelihood ratio or relative evidence for comparing the observed success rate, 𝑝1 , to a hypothetical value, 𝑝0 , is 𝑒 = Λ
𝑝𝑘1 (1 − 𝑝1 )𝑛−𝑘 𝑝𝑘0 (1 − 𝑝0 )𝑛−𝑘
,
so that Λ = 𝑘 log
𝑝1 1 − 𝑝1 + (𝑛 − 𝑘) log . 𝑝0 1 − 𝑝0
As a likelihood ratio test, 𝑘 and 𝑛 represent data in this equation (see also Section 18.4.3), but a slight change in perspective allows us to imagine the data required to produce some specified relative evidence. Solving for 𝑛 yields 𝑛=
Λ 𝑝0 log(Θ) + log
1−𝑝1 1−𝑝0
=
Λ , 𝑝 log 𝑝1 − (1 − 𝑝1 ) log(Θ)
(16.15)
0
where Θ is the odds ratio for 𝑝1 versus 𝑝0 . Examples are shown in Table 16.8.
TABLE 16.8 Sample Sizes for One-Sample Binomial Comparisons Using Relative Evidence and Equation (16.15) 𝑝0 𝑝1 0.5
0.4
0.3
0.2
Λ
0.05
0.10
0.15
0.20
0.25
0.30
0.35
0.40
0.45
8 32 64 128 8 32 64 128 8 32 64 128 8 32 64 128
3 4 5 6 4 6 7 9 6 11 13 15 15 25 30 35
4 7 8 9 7 11 13 16 14 23 27 32 47 78 94 109
6 10 12 14 11 19 23 26 29 48 58 67 230 384 460 537
9 16 19 22 20 33 40 46 74 123 148 172
14 24 29 34 38 64 77 90 325 541 650 758
24 40 48 56 92 153 184 215
44 73 88 103 386 643 772 901
102 170 204 238
414 690 828 966
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
EVENT RATES
16.5
451
EVENT RATES
A similar approach to a screening design can be taken when the outcomes are event or failure times rather than proportion of successes. Failure or event rates are the number of events per unit of follow-up time. We can conceptually replace the success rate by the failure rate (per person time) in the previous discussion, with two changes. First, the confidence bounds are a function of the number of events (rather than sample size explicitly) because of censoring. Second, we might place the comparison on a logarithm scale to compensate for the skewness in failure rate distributions. √ A confidence interval for a failure rate 𝜆 is best obtained as log(𝜆) ± 𝑍𝛼 ∕ 𝑑, where 𝑑 is the number of observed events. This is also based on a normal approximation. If √ 𝑤 is our precision requirement on the log scale, we have 𝑤 = 𝑍𝛼 ∕ 𝑑, and can see immediately the inverse square relationship between precision and number of events, which then translates into sample size. Precision on the log scale may seem unintuitive at first, but the log scale is natural for event rate or hazard ratios. Reducing the failure rate by 20%, for example, would imply 𝑤 = 0.2. A 95% confidence interval with this specification requires 𝑑 = (𝑍𝛼 ∕𝑤)2 = (1.96∕0.2)2 = 96 events. In other words, 96 events provide a precision in the estimate of the failure rate of ±20%. It is critical to note that it may take many more subjects to yield 96 events in some specified length of time. We may know from recent historical experience that the failure rate on standard therapy is 𝜆0 , which we can take as a known constant. Suppose the rate on a new treatment is expected to be 𝜆1 and we intend for our trial to resolve the difference by excluding 𝜆0 from the confidence interval for 𝜆1 . On the log scale we have 𝑤 = log(𝜆1 ) − log(𝜆0 ) = log(𝜆1 ∕𝜆0 ). If 𝜆1 ∕𝜆0 = 0.75, which is a substantial clinical effect, log(0.75) = −0.29 and 𝑑 = (𝑍𝛼 ∕𝑤)2 = (1.96∕0.29)2 = 46 events. Again this sketch has not considered power of this procedure, which is 50% for the same reasons as above. The observed failure rate has to be less than 𝜆1 to assure that the upper confidence bound is less than 𝜆0 . If 𝜆1 is the true failure rate, this happens only 50% of the time. If we follow the same heuristic √ fix-up as above, the confidence interval might be constructed as log(𝜆) ± (𝑍𝛼 + 𝑍𝛽 )∕ 𝑑. Then, √ 𝑤 = log(𝜆1 ∕𝜆0 ) = (𝑍𝛼 + 𝑍𝛽 )∕ 𝑑 or 𝑑=
(𝑍𝛼 + 𝑍𝛽 )2 (log(Δ))2
,
which is essentially the normal approximation as in Section 16.4.2. It is also similar to equation (16.29), the difference being that this result is for a one-sample test. 16.5.1
Confidence Intervals for Event Rates Can Determine Sample Size
Some safety and efficacy trials employ time-to-event measurements with censoring, such as the death, recurrence, or overall failure rate in the population under study. This type of study design is becoming more common in cancer trials as new agents are developed that do not work through a classical cytotoxic mechanism. For example, some new drugs
Piantadosi
Date: July 27, 2017
452
Time: 4:45 pm
SAMPLE SIZE AND POWER
FIGURE 16.2
Screening design for improved failure rate.
target growth factor receptors or inhibit angiogenesis. These may not shrink tumors but might improve survival. The classical middle development trial using response or tumor shrinkage as the primary design variable will not work for such agents. However, many of the concepts that help determine sample size for proportions also carry over to event rates. Absolute Precision As in the discussion above for proportions, the confidence interval or precision required for a failure rate on an absolute scale can be used to estimate sample size. This situation is shown in Figure 16.2—note the similarity to Figure 16.1 for proportions. The reference failure rate is 𝜆0 , which might be obtained from earlier cohort studies or clinical knowledge. On treatment, we expect the failure rate to be 𝜆1 . If the improvement is real, we would like for the trial to reliably resolve the difference between the two failure rates, that is, the confidence interval around 𝜆1 should exclude 𝜆0 . A study of this design will most likely include a fixed period of accrual and then a period of follow-up, during which the cohort can be observed for additional events. Sometimes accrual can continue to the end of the study, which shortens the length of the trial. However, subjects accrued near the end of the study may not contribute much information to the estimate of the failure rate. Assume that accrual is constant at rate 𝑎 per unit time over some interval 𝑇 , and that a period of continued follow-up is used to observe additional events. Also assume that the failure rate is constant over time, that is exponential event times, and that there are no losses to follow-up. At the end of the study, we summarize the data using methods presented in Chapter 20, so that the estimated failure rate is 𝑑 𝜆̂ = ∑ , 𝑡𝑖 where 𝑑 is the number of events or failures and the sum is over all follow-up times 𝑡𝑖 . An approximate confidence interval for the failure rate is 𝜆 𝜆̂ ± 𝑍1−𝛼∕2 √ . 𝑑
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
EVENT RATES
453
Alternatively, an approximate confidence interval for log(𝜆) is 𝑍1−𝛼∕2 ̂ ± √ log(𝜆) . 𝑑 Because of the design of this study, 𝑑 is actually a function of time. Anticipating results from Section 9.3.2, the number of events depends on time such that { 𝑎 𝑎0 𝑡 − 𝜆0 (1 − 𝑒−𝜆𝑡 ) if 𝑡 ≤ 𝑇 , 𝑑(𝑡) = (16.16) 𝑎0 −𝜆(𝑡−𝑇 ) 𝑎0 𝑇 − 𝜆 (𝑒 − 𝑒−𝜆𝑡 ) if 𝑡 > 𝑇 , where 𝑡 is the total study duration. Thus, the confidence interval for 𝜆 depends on time. Note that for fixed 𝑇 , as 𝑡 → ∞, the number of failures approaches 𝑎0 𝑇 , the total number of subjects accrued. We could express the approximate confidence interval as ( ) 𝑍1−𝛼∕2 𝜆̂ 1 ± √ . 𝑑(𝑡) If 𝑤 is the desired width of the confidence interval expressed as a fraction of 𝜆, the study design parameters must satisfy ) ( 𝑍1−𝛼∕2 2 , 𝑑(𝑡) = 𝑤 ̂ which has a familiar form. A similar derivation could be used for log(𝜆). Relative Precision Suppose that the hypothesized event rate is 𝜆 and we require an estimate that is ±𝑝% of that value. In other words, we would like to be fairly certain that the event rate lies ̂ − 𝑝) and 𝜆𝑈 = 𝜆(1 ̂ + 𝑝). On the logarithmic scale, an approximate between 𝜆𝐿 = 𝜆(1 √ ̂ ± 𝑍1−𝛼∕2 ∕ 𝑑, where 𝑑 is the number of events observed. confidence interval is log(𝜆) √ √ This means that log(1 − 𝑝) = 𝑍1−𝛼∕2 ∕ 𝑑 and log(1 + 𝑝) = 𝑍1−𝛼∕2 ∕ 𝑑 ′ , where 𝑑 does not necessarily equal 𝑑 ′ . Thus, we are led to choosing a sample size based on the maximum of ( )2 𝑍1−𝛼∕2 𝑑= , log(1 ± 𝑝) which is again an inverse square relationship between precision and events (sample size). If we intend for a two-sided 95% confidence interval on the failure rate to yield [ ]2 ±20%, then 𝑑 = 1.96∕ log(1.2) = 116 events. The number of study subjects required to yield 116 events is likely to be substantially higher than 116 due to censoring, and considerations regarding accrual dynamics are also relevant here. Example 16.11. In patients with malignant gliomas, the historical median failure time is 8 months (𝜆̂ = log(2)∕8 = 0.087 events per patient-month). A safety and efficacy trial using a new agent intends to estimate the failure rate with a precision of ±25% (95% confidence interval) so that it can be compared to the historical rate. A promising
Piantadosi
Date: July 27, 2017
454
Time: 4:45 pm
SAMPLE SIZE AND POWER
TABLE 16.9 Relative Precision in Confidence Intervals for Event Rates 95% 𝑑 25 50 75 100 125 150 175 200 225 250
90%
95%
(
)
(
)
0.68 0.76 0.80 0.82 0.84 0.85 0.86 0.87 0.88 0.88
1.48 1.32 1.25 1.22 1.19 1.17 1.16 1.15 1.14 1.13
0.77 0.83 0.86 0.88 0.89 0.90 0.91 0.91 0.92 0.92
1.29 1.20 1.16 1.14 1.12 1.11 1.10 1.09 1.09 1.08
𝑑 275 300 325 350 375 400 425 450 475 500
90%
(
)
(
)
0.89 0.89 0.90 0.90 0.90 0.91 0.91 0.91 0.91 0.92
1.13 1.12 1.11 1.11 1.11 1.10 1.10 1.10 1.09 1.09
0.93 0.93 0.93 0.93 0.94 0.94 0.94 0.94 0.94 0.94
1.08 1.08 1.07 1.07 1.07 1.07 1.06 1.06 1.06 1.06
improvement in the failure rate will motivate a CTE trial. The required number of events for this degree of precision is (1.96∕0.25)2 = 61. The accrual rate for the trial is expected to be nine subjects per month. For the expected accrual and failure rates, one can calculate that the following combinations of accrual periods and total study durations yield 61 events (all times are in months): (6.8, ∞); (9, 21); (12, 16); (15.25, 15.25). Many other combinations would also satisfy the design equation, but the shortest total trial duration possible is 15.25 months. Relative precision in event rates can assist in another way to determine the size of a trial cohort. Using the formula above for an approximate confidence interval on the log √ ̂ scale, log(𝜆) ± 𝑍1−𝛼∕2 ∕ 𝑑, observe that on the natural scale the confidence bounds will be ̂
√ 𝑑
𝑒log(𝜆)±𝑍1−𝛼∕2 ∕
√
̂ ±𝑍1−𝛼∕2 ∕ = 𝜆𝑒
𝑑
. √ ̂ To obtain the Therefore, exp(±𝑍1−𝛼∕2 ∕ 𝑑) are the confidence bounds relative to 𝜆. actual confidence bounds, the relative precision is multiplied by the estimated event rate. Some relative confidence bounds for event rates calculated in this way are shown in Table 16.9. We can see there that a relative precision of around ±20% in the 95% confidence interval requires 100–125 events as in the previous example, and doubling that number to 200 events yields a relative precision of 13–15%. A 90% confidence interval must necessarily be narrower than a 95% interval, and this is reflected in the respective column of Table 16.9 . In either case, the inverse square law yields diminishing returns in precision quickly. It is very expensive to attain a precision of ±10% in event rate estimation. 16.5.2
Likelihood-Based Approach for Event Rates
Assuming a normal model for log(𝜆), the likelihood is ( )2 ⎛ ̂ − log(𝜆) ⎞ log( 𝜆) ⎟ ⎜ 𝑒(𝜆) = exp ⎜− ⎟, 2∕𝑑 ⎟ ⎜ ⎠ ⎝
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
STAGED STUDIES
455
where 𝜆 is the true hazard and 𝜆̂ is the observed hazard. This represents the relative evidence for 𝜆̂ versus 𝜆 so that ( )2 ̂ − log(𝜆) log(𝜆) Λ= 2∕𝑑 or 2Λ 2Λ 𝑑=( )2 = log(Δ)2 , ̂ − log(𝜆) log(𝜆) where Δ is the hypothesized hazard ratio. The previous discussion regarding accrual dynamics pertains to this equation also. To connect this with hypothesis tests, the evidence produced by our trial can be partitioned into any appropriate allocation of type I and type II errors, 𝑑=
(𝑍𝛼 + 𝑍𝛽 )2 2Λ 𝑍2 = = . log(Δ)2 log(Δ)2 log(Δ)2
(16.17)
However, the focus here is on controlling the total error, a distinction that has some practical consequences. Example 16.12. Returning to the glioma example, if we expect a new treatment to reduce the hazard of failure by a factor of 1.75 (a large effect given the clinical circumstances), modest evidence (Λ = 8) regarding this hazard ratio can be produced by 𝑑 = 16∕ log(1.75)2 = 51 events.
16.6
STAGED STUDIES
The idea of performing a study in stages to allow an early decision to discard a new therapy has been around for a long time. Staging apples to the individual study in the same way that it can improve efficiency in the overall pipeline as discussed in Chapter 10. This idea is well suited to middle development where the emphasis is often to learn as quickly as possible that new therapies are not promising. A staged procedure was proposed by Gehan [571] for screening cancer therapies. It also relies on confidence bounds. The basic idea was not to develop treatments if their response rate was below 0.2. How many individual subject failures would convince us that the true response rate is less than 0.2? A way to address this question is to construct an upper confidence bound for the success probability when no successes have been observed. The confidence interval method above will not work in this circumstance. However, exact binomial confidence bounds can be constructed for observed frequencies of zero. 16.6.1
Ineffective or Unsafe Treatments Should Be Discarded Early
The purpose of middle development is to discard unsafe treatments or those that are inactive or less active than existing therapy (Section 13.3). Middle development studies
Piantadosi
Date: July 27, 2017
456
Time: 4:45 pm
SAMPLE SIZE AND POWER
in oncology assess the response or tumor shrinkage proportion of subjects treated with new drugs or modalities. If the response rate is low or there is too high an incidence of side effects, investigators would like to terminate the trial as early as possible. This has led to the widespread use of “early stopping rules” for these studies that permit accrual to be terminated before the fixed sample size end. A detailed look at this issue is given in Chapter 18, but some designs are introduced here because they are often used to determine the size of safety and activity trials. An historic approach to this problem is the following: Suppose that 0.2 is the lowest response rate that investigators consider acceptable for a new treatment. The exact binomial upper 95% confidence limit on 0 responses out of 12 tries is 0.22. Therefore, if none of the first 12 study participants respond, the treatment can be discarded because it most likely has a response rate less than 0.22. This rule was proposed by Gehan [571] and is still used in some middle development trials because it is simple to employ and understand. It is easily modified using exact binomial confidence limits for target proportions other than 0.2 (Tables 16.2–16.4). A more flexible approach to deal with this problem is outlined in the next section.
16.6.2
Two-Stage Designs Increase Efficiency
Optimal two-stage designs for middle development trials are discussed by Simon [1398]. They originated in the oncology context, but employ dichotomous outcomes that may be suitable for many circumstances. The suggested designs depend on two clinically important response rates 𝑝0 and 𝑝1 and type I and II error rates 𝛼 and 𝛽. If the true probability of response is less than some clinically uninteresting level, 𝑝0 , the chance of accepting the treatment for further study is set to be 𝛼. If the true response rate exceeds some interesting value, 𝑝1 , the chance of rejecting the treatment is set to be 𝛽. The study terminates at the end of the first stage only if the treatment appears ineffective. The design does not permit stopping early for efficacy. The first stage (𝑛1 subjects) is relatively small. If a small number (≤ 𝑟1 ) of responses are seen at the end of the first stage, the treatment is abandoned. Otherwise, the trial proceeds to a second stage (𝑛 subjects total). If the total number of responses after the second stage is large enough (> 𝑟), the treatment is accepted for further study. Otherwise, it is abandoned. The size of the stages and the decision rules can be chosen optimally to test differences between two response rates of clinical interest. The designs can be specified quantitatively in the following way: Suppose that the true probability of response using the new treatment is 𝑝. The trial will stop after the first stage if 𝑟1 or fewer responses are seen in 𝑛1 subjects. The chance of this happening is
𝐵(𝑟1 ; 𝑝, 𝑛1 ) =
𝑟1 ( ) ∑ 𝑛1 𝑘 𝑝 (1 − 𝑝)𝑛1 −𝑘 , 𝑘 𝑘=0
(16.18)
which is the cumulative binomial mass function given above in equation (16.5). At the end of the second stage the treatment will be rejected if 𝑟 or fewer responses are seen. The chance of rejecting a treatment with true response probability 𝑝 is the chance of stopping at the first stage plus the chance of rejecting at the second stage if the first stage
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
COMPARATIVE TRIALS
457
is passed. We must be careful to count all of the ways of passing the first stage but not passing the second. The probability of rejecting the treatment is ) 𝑛1 𝑘 𝑝 (1 − 𝑝)𝑛1 −𝑘 𝐵(𝑟 − 𝑘; 𝑝, 𝑛2 ), 𝑘
min(𝑛1 ,𝑟) (
𝑄 = 𝐵(𝑟1 ; 𝑝, 𝑛1 ) +
∑
𝑘=𝑟1 +1
(16.19)
where 𝑛2 = 𝑛 − 𝑛1 . When the design parameters, 𝑝0 , 𝑝1 , 𝛼, and 𝛽 are specified, values of 𝑟1 , 𝑟2 , 𝑛1 , and 𝑛2 can be chosen that satisfy the type I and type II error constraints. The design that yields the smallest expected sample size under the null hypothesis is defined as optimal. Because a large number of designs could satisfy the error constraints, the search for optimal ones requires an exhaustive computer algorithm. Although many useful designs are provided in the original paper and some are tabulated here, a computer program for finding optimal designs is described in Appendix A. Trial designs derived in this way are shown in Tables 16.10 and 16.11. Multistage designs for middle development studies can be constructed in a similar way. Three-stage designs have also been proposed [427]. However, two-stage designs are simpler and nearly as efficient as designs with more stages, or those that assess treatment efficacy after each subject. These designs are discussed in Chapter 18. It is important to recognize that many combinations of sample sizes and decision rules will satisfy the error criteria specified for a two-stage design. Sensible definitions of optimal, such as minimum expected sample size, help us select a single design. It is possible that the best definition of optimal would change under special circumstances. Example 16.13. Suppose that we intend to detect a success rate of 40% compared to a background rate of 20% with 𝛼 = 0.05 and 𝛽 = 0.1. Table 16.10 indicates that the decision rules and stage sizes should be 4 out of 19 and 15 out of 54. For total sample sizes less than or equal to 54, there are 610 designs that satisfy the error criteria, and for total sample sizes less than or equal to 75, there are 4330 designs that satisfy the error criteria. Designs with decision rules based on larger denominators will have smaller type I and type II error rates. The design that allows a decision with the smallest number of successes (not listed in Table 16.10) is 2 out of 15 and 13 out of 45.
16.7
COMPARATIVE TRIALS
For comparative trials, the discussion that follows emphasizes an approach based on a planned hypothesis test when the trial is completed. This is a convenient and frequently used perspective for determining the size of comparative trials, and motivates the use of the term power. The null hypothesis usually represents equivalence between the treatments. The alternative value for the hypothesis test is chosen to be the smallest difference of clinical importance between the treatments. Following this, the size of the study is planned to yield a high probability of rejecting the null hypothesis if the alternative hypothesis is true. Therefore, the test statistic planned for the analysis dictates the exact form of the power equation. Even though the power equation depends on the test being employed, there is a similar or generic form for many different statistics.
Piantadosi
Date: July 27, 2017
458
Time: 4:45 pm
SAMPLE SIZE AND POWER
TABLE 16.10 Optimal Two-Stage Designs for Middle Development Trials for 𝒑𝟏 − 𝒑𝟎 = 𝟎.𝟐𝟎 and 𝜶 = 𝟎.𝟎𝟓 𝑝1
𝛽
𝑟1
𝑛1
𝑟
𝑛
𝐸{𝑛 ∣ 𝑝0 }𝑎
0.05
0.25
0.10
0.30
0.20
0.40
0.30
0.50
0.40
0.60
0.50
0.70
0.60
0.80
0.70
0.90
0.2 0.1 0.2 0.1 0.2 0.1 0.2 0.1 0.2 0.1 0.2 0.1 0.2 0.1 0.2 0.1
0 0 1 2 3 4 5 8 7 11 8 13 7 12 4 11
9 9 10 18 13 19 15 24 16 25 15 24 11 19 6 15
2 3 5 6 12 15 18 24 23 32 26 36 30 37 22 29
17 30 29 35 43 54 46 63 46 66 43 61 43 53 27 36
12 17 15 23 21 30 24 35 25 36 24 34 21 30 15 21
𝑝0
𝑎
Gives the expected sample size when the true response rate is 𝑝0 .
TABLE 16.11 Optimal Two-Stage Designs for Middle Development Trials for 𝒑𝟏 − 𝒑𝟎 = 𝟎.𝟏𝟓 and 𝜶 = 𝟎.𝟎𝟓 𝑝0
𝑝1
0.05
0.20
0.10
0.25
0.20
0.35
0.30
0.45
0.40
0.55
0.50
0.65
0.60
0.75
0.70
0.85
0.80
0.95
𝑎
𝛽
𝑟1
𝑛1
𝑟
𝑛
𝐸{𝑛 ∣ 𝑝0 }𝑎
0.2 0.1 0.2 0.1 0.2 0.1 0.2 0.1 0.2 0.1 0.2 0.1 0.2 0.1 0.2 0.1 0.2 0.1
0 1 2 2 5 8 9 13 11 19 15 22 17 21 14 18 7 16
10 21 18 21 22 37 27 40 26 45 28 42 27 34 19 25 9 19
3 4 7 10 19 22 30 40 40 49 48 60 46 64 46 61 26 37
29 41 43 66 72 83 81 110 84 104 83 105 67 95 59 79 29 42
18 27 25 37 35 51 42 61 45 64 44 62 39 56 30 43 18 24
Gives the expected sample size when the true response rate is 𝑝0 .
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
COMPARATIVE TRIALS
16.7.1
459
How to Choose Type I and II Error Rates?
Convention holds that most clinical trials should be designed with a two-sided 𝛼-level set at 0.05 and 80 or 90% power (𝛽 = 0.2 or 0.1, respectively). To be more thoughtful, the type I and II error rates should be chosen to reflect the consequences of making each type of error. For example, suppose that a standard therapy for a certain condition is effective and associated with few side effects. When testing a competing treatment, we would probably want the type I error rate to be small, especially if it is associated with serious side effects, to reduce the chance of a false positive. We might allow the type II error rate to be higher, indicating the lower seriousness of missing an effective therapy because a good treatment already exists. Circumstances like this are commonly encountered when developing cytotoxic drugs for cancer treatment. In contrast, suppose that we are studying prevention of a common disease using safe agents such as diet or dietary supplements. There would be little harm in the widespread application of such treatments, so the consequences of a type I error are not severe. In fact, some benefits might occur, even if the treatment was not preventing the target condition. In contrast, a type II error would be more serious because a safe, inexpensive, and possibly effective treatment would be missed. In such cases there is a rationale for using a relaxed definition of statistical significance, perhaps 𝛼 = 0.10, and a higher power, perhaps 𝛽 = 0.01. Special attention to the type I and II error rates may be needed when designing trials to demonstrate equivalence or noninferiority of two treatments. This topic is discussed in Section 16.7.10. 16.7.2
Comparisons Using the t-Test Are a Good Learning Example
Suppose that the endpoint for a comparative clinical trial is a measured outcome so that the treatment comparison consists of testing the difference of the estimated means of the two groups. Assume that the true means in the treatment groups are 𝜇1 and 𝜇2 and the standard deviation of the measurement in each subject is 𝜎. Let the treatment difference be Δ = 𝜇1 − 𝜇2 . The null hypothesis is 𝐻0 : Δ = 0. Investigators would reject the null hypothesis if |Δ| exceeds the critical value, 𝑐, where 𝑐 = 𝑍𝛼 × 𝜎Δ , and 𝜎Δ is the standard deviation of Δ (Fig. 16.3). In other words, if the estimated difference between the treatment means is too many standard deviations away from 0, we would disbelieve that the true difference is 0. Under the alternative hypothesis, the distribution of Δ is centered away from 0 (the right-hand curve in Fig. 16.3). The power of our statistical test is the area under the alternative distribution to the right of 𝑐, that is, the probability of rejecting 𝐻0 when the alternative hypothesis is true. This area can be calculated by standardizing 𝑐 with respect to the alternative distribution −𝑍𝛽 =
𝑐 − Δ 𝑍𝛼 × 𝜎Δ − Δ = , 𝜎Δ 𝜎Δ
that is, by subtracting the mean of the alternative distribution and dividing by its standard deviation. The minus sign for 𝑍𝛽 comes from the fact that areas are tabulated from the
Piantadosi
Date: July 27, 2017
460
Time: 4:45 pm
SAMPLE SIZE AND POWER
FIGURE 16.3 Sampling distribution of an estimate under the null and alternative hypotheses. Vertical lines are drawn at the null Δ = 0 and the critical value 𝑐 = 1.96 as explained in the text.
left tail, whereas we are taking the area of the right tail of the distribution (equation (16.1)). Thus, −𝑍𝛽 =
𝑍𝛼 × 𝜎Δ − Δ Δ = 𝑍𝛼 − 𝜎Δ 𝜎Δ
or 𝑍𝛼 + 𝑍 𝛽 = Now
√ 𝜎Δ =
Δ . 𝜎Δ
𝜎2 𝜎2 + =𝜎 𝑛1 𝑛2
(16.20)
√ 1 1 + , 𝑛1 𝑛2
assuming that the groups are independent and of sizes 𝑛1 and 𝑛2 . Substituting into equation (16.20) and squaring both sides yields 1 Δ2 1 + = . 𝑛1 𝑛2 (𝑍𝛼 + 𝑍𝛽 )2 𝜎 2
(16.21)
Now suppose that 𝑛1 = 𝑟𝑛2 (𝑟 is the allocation ratio) so that equation (16.21) becomes Δ2 1 1 𝑟+1 1 = + = 𝑟𝑛2 𝑛2 𝑛2 𝑟 (𝑍𝛼 + 𝑍𝛽 )2 𝜎 2 or 𝑛2 =
2 2 𝑟 + 1 (𝑍𝛼 + 𝑍𝛽 ) 𝜎 . 𝑟 Δ2
(16.22)
The denominator of the right-hand side of equation (16.22), expressed as (Δ∕𝜎)2 , is the square of the number of standard deviations between the null and alternative treatment
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
COMPARATIVE TRIALS
461
TABLE 16.12 Approximate Total Sample Sizes for Comparisons Using the 𝒕-Test and Equal Group Sizes 𝛽 = 0.1
𝛽 = 0.2
Δ∕𝜎
𝛼 = 0.05
𝛼 = 0.10
Δ∕𝜎
𝛼 = 0.05
𝛼 = 0.10
0.25 0.50 0.75 1.00 1.25 1.50
672 168 75 42 28 18
548 138 62 34 22 16
0.25 0.50 0.75 1.00 1.25 1.50
502 126 56 32 20 14
396 98 44 24 16 12
Δ is the difference in the treatment group means and 𝜎 is the standard deviation. See equation (16.22).
means. All of the factors that affect the statistical power are evident from equation (16.22): the variance of an observation, the difference we are trying to detect, the allocation ratio, and the type I error level for the statistical comparison. Some convenient values that solve equation (16.22) are shown in Table 16.12. A small amount of ambiguity remains, however. Equation (16.22) gives the sample size in one group, and its solution is not necessarily an integer. Table 16.12 rounds to the nearest integer and multiplies by 2 to obtain a total sample size. Other strategies are possible, such as rounding to the next higher integer. It is probably best to round first so that the total sample size is divisible by 2. Equation (16.22) can be solved for the type II error, √ Δ 𝑁 𝑍𝛽 = − 𝑍𝛼 , 2𝜎 where I have set 𝑟 = 1 and 𝑛1 = 𝑛2 = 𝑁∕2. Then the power is ) ( √ Δ 𝑁 1−𝛽 =Φ − 𝑍𝛼 , 2𝜎 where I have used the notational conventions indicated at the end of Section 16.2.2. Although this power equation has been derived assuming normal distributions, the so called z-test, it yields values nearly correct for the t-test. One could increase accuracy by using quantiles from the t-distribution in place of 𝑍𝛼 and 𝑍𝛽 . However, when using the t-test, the power calculations are made more difficult by the need to evaluate the noncentral t-distribution. Also, we have assumed that the variance is known. In some situations the variance will be estimated from the observed data. This slightly increases the sample size required. However, the effect is small, amounting to an increase of only one or two subjects for sample sizes near 20 [1410, 1411]. Consequently, the correction can be ignored for most clinical trials. Although many important test statistics for clinical trials fit assumptions of normality, at least approximately, some important cases do not. For example, power and sample size equations for analyses of variance involve F-distributions that are computationally more cumbersome than Gaussian distributions. Because these are uncommon designs for clinical trials, details are not given here. Approaches to this problem can be found in
Piantadosi
Date: July 27, 2017
462
Time: 4:45 pm
SAMPLE SIZE AND POWER
Winer [1568] and [1569]. Some important nonnormal cases more commonly encountered in clinical trials are discussed below. 16.7.3
Likelihood-Based Approach
Suppose that we observe 𝑛 independent values that arise from a normal distribution with unknown mean 𝜇. For simplicity, the variance, 𝜎 2 , will be assumed known. The likelihood, 𝑒(𝐗|𝜇) (where (𝐗|𝜇) is the log likelihood) is the probability of observing the data under the normal model, 𝑒(𝐗|𝜇) = √
1
𝑛 ∏
2𝜋𝜎
𝑖=1
exp(−(𝑥𝑖 − 𝜇)2 ∕2𝜎 2 ).
The relative evidence comparing two hypothetical values for 𝜇 is the likelihood ratio 𝑛 ∏
𝑒 = Λ
𝑖=1 𝑛
exp(−(𝑥𝑖 − 𝜇𝑎 )2 ∕2𝜎 2 ) =
∏
exp(−
∑𝑛
1 (𝑥𝑖
∑𝑛
− 𝜇𝑎 )2 ∕2𝜎 2 )
exp(− 1 (𝑥𝑖 − 𝜇𝑏 )2 ∕2𝜎 2 ) exp(−(𝑥𝑖 − 𝜇𝑏 )2 ∕2𝜎 2 ) 𝑖=1 ) ( 𝑛 1 ∑ 2 2 (𝑥 − 𝜇𝑏 ) − (𝑥𝑖 − 𝜇𝑎 ) . = exp 2𝜎 2 𝑖=1 𝑖
After some rearranging of the right-hand side, it can be shown that Λ=
) 𝑛(𝜇𝑎 − 𝜇𝑏 ) ( 𝑥 − 𝜇𝑎𝑏 , 2 𝜎
where Λ is the log(LR), 𝑥 is the observed mean, and 𝜇𝑎𝑏 is the midpoint of the hypothetical means. Therefore, 𝑛=
Λ𝜎 2 . (𝜇𝑎 − 𝜇𝑏 )(𝑥 − 𝜇𝑎𝑏 )
The preceding equation has all of the intuitive properties of a sample size relationship for testing a mean. Sample size is increased by a higher standard of evidence (larger Λ), hypothetical values that are close to one another, a larger person-to-person variance, and an observed mean that lies close to the middle of the range in question. If we compare the observed mean to some value as in a typical hypothesis test, we can set 𝜇𝑎 = 𝑥 and obtain 𝑛=
2Λ𝜎 2 . (𝜇𝑎 − 𝜇𝑏 )2
(16.23)
From the fact derived above that 2Λ = 𝑍 2 , it can be seen that 𝑛=
(𝑍𝛼 + 𝑍𝛽 )2 𝜎 2 (𝜇𝑎 − 𝜇𝑏 )2
,
(16.24)
which is the common sample size formula for testing a mean when the variance is known.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
COMPARATIVE TRIALS
463
For a two sample problem as in a comparative trial, the same basic approach applies. Then 𝑥𝑖 represents the difference of two means, the variance of which is 2𝜎 2 , so equations (16.23 and 16.24) become 𝑛=
4Λ𝜎 2 (𝜇𝑎 − 𝜇𝑏 )2
(16.25)
and 𝑛=
2(𝑍𝛼 + 𝑍𝛽 )2 𝜎 2 (𝜇𝑎 − 𝜇𝑏 )2
,
(16.26)
which is identical to equation (16.22).
16.7.4
Dichotomous Responses Are More Complex
When the outcome is a dichotomous response, the results of a comparative trial can be summarized in a 2 × 2 table: Treatment Success
A
B
Yes
𝑎
𝑏
No
𝑐
𝑑
The analysis essentially consists of comparing the proportion of successes or failures in the groups, for example, 𝑎∕(𝑎 + 𝑐) versus 𝑏∕(𝑏 + 𝑑). The full scope of methods for determining power and sample size in this situation is large. A review of various approaches is given by Sahai and Khurshid [1326]. Here I discuss only the basics. The usual analysis of such data would employ Fisher’s exact test or the 𝜒 2 test, with or without continuity correction. The exact test assumes that 𝑎 + 𝑏, 𝑐 + 𝑑, 𝑎 + 𝑐, and 𝑏 + 𝑑 are fixed by the design of the trial. However, in a trial, 𝑎 and 𝑏 are random variables, indicating that the 𝜒 2 test without continuity correction is appropriate. However, a fixed sample size with random treatment assignment leads to the exact test or 𝜒 2 test with continuity correction. The sample size required for a particular trial can be different, depending on which perspective is taken [1592]. A derivation similar to the t-test above for comparing two proportions, 𝜋1 and 𝜋2 , without continuity correction yields ( 𝑛2 =
𝑍𝛼
√
(𝑟 + 1)𝜋(1 − 𝜋) + 𝑍𝛽
√
𝑟𝜋1 (1 − 𝜋1 ) + 𝜋2 (1 − 𝜋2 )
𝑟Δ2
)2 ,
(16.27)
where 𝜋 = (𝜋1 + 𝑟𝜋2 )∕(𝑟 + 1) is the (weighted) average proportion, Δ = 𝜋1 − 𝜋2 , and 𝑛1 = 𝑟𝑛2 . The total sample size required is 𝑛1 + 𝑛2 = 𝑛2 (𝑟 + 1). Convenient values that
Piantadosi
Date: July 27, 2017
464
Time: 4:45 pm
SAMPLE SIZE AND POWER
solve equation (16.27) for 𝑟 = 1 are given in Tables 16.13 and 16.14. When 𝑟 = 1, equation (16.27) can be approximated by 𝑛2 =
) ( (𝑍𝛼 + 𝑍𝛽 )2 𝜋1 (1 − 𝜋1 ) + 𝜋2 (1 − 𝜋2 ) Δ2
.
(16.28)
The calculated sample size must be modified when planning to use the 𝜒 2 test with continuity correction. The new sample size must satisfy
𝑛∗2
𝑛 = 2 4
√
( 1+
)2 2(𝑟 + 1) 1+ 𝑟𝑛2 Δ
,
where 𝑛2 is given by equation (16.27). It is noteworthy that equation (16.22) could be solved algebraically for any single parameter in terms of the others. However, equation (16.27) cannot be solved simply for some parameters. For example, if we wish to determine the treatment difference that can be detected with 90% power, a sample size of 100 per group, equal treatment allocation, and a response proportion of 0.5 in the control group, equation (16.27) must be solved using iterative calculations, starting with an initial guess for 𝜋2 and using a standard method such as Newton’s iterations to improve the estimate. The need for iterative solutions is a general feature of sample size equations. Consequently, good computer software is essential for performing such calculations that may have to be repeated many times before settling on the final design of a trial. 16.7.5
Hazard Comparisons Yield Similar Equations
Comparative clinical trials with event time endpoints require similar methods to estimate sample size and power. To test equality between treatment groups, it is common to compare the ratio of hazards, defined below, versus the null hypothesis value of 1.0. In trials with recurrence or survival time as the primary endpoint, the power of such a study depends on the number of events—recurrences or deaths, for example. Usually, there is a difference between the number of subjects placed on study and the number of events required for the trial to have the intended statistical properties. In the following sections, I give several similar appearing sample size equations for studies with event time outcomes. All will yield similar results. The sample size equations can be classified into those that use parametric forms for the event time distribution and those that do not (nonparametric). The equations use the ratio of hazards in the treatment groups, Δ = 𝜆1 ∕𝜆2 , where 𝜆1 and 𝜆2 are the individual hazards. Exponential If event times are exponentially distributed, some exact distributional results can be used. Suppose that 𝑑 observations are uncensored and 𝜆̂ is the maximum likelihood estimate of the exponential parameter (see Chapter 20), 𝑑 𝜆̂ = ∑ , 𝑡𝑖
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
465
COMPARATIVE TRIALS
TABLE 16.13 Sample Sizes per Group for Comparisons Using the 𝝌 𝟐 Test without Continuity Correction with Equal Group Sizes Determined From Equation (16.27) Δ = 𝜋2 − 𝜋1 𝜋1
0.05
0.10
0.15
0.20
0.25
0.30
0.05
435 582 686 918 905 1212 1094 1465 1251 1675 1376 1843 1470 1969 1533 2053 1565 2095 1565 2095
141 188 199 266 250 335 294 393 329 440 356 477 376 503 388 519 392 524 388 519
76 101 100 133 121 161 138 185 152 203 163 217 170 227 173 231 173 231 170 227
49 65 62 82 73 97 82 109 89 118 93 125 96 128 97 130 96 128 93 125
36 47 43 57 49 65 54 72 58 77 61 81 62 82 62 82 61 81 58 77
27 36 32 42 36 47 39 52 41 54 42 56 43 57 42 56 41 54 39 52
0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50
In all cases, 𝛼 = 0.05. For each pair of 𝜋1 and 𝜋2 , the upper number corresponds to 𝛽 = 0.20 and the lower number corresponds to 𝛽 = 0.10.
where the denominator is the sum of all follow-up times. Then 2𝑑𝜆∕𝜆̂ has a 𝜒 2 distribution with 2𝑑 degrees of freedom [428, 669]. A ratio of 𝜒 2 random variables has an 𝐹 distribution with 2𝑑1 and 2𝑑2 degrees of freedom [314]. This fact can be used to construct tests and confidence intervals for the hazard ratio. For example, a 100(1 − 𝛼)% confidence interval for Δ = 𝜆1 ∕𝜆2 is ̂ 2𝑑 ,2𝑑 ,1−𝛼∕2 < Δ < Δ𝐹 ̂ 2𝑑 ,2𝑑 ,𝛼∕2 . Δ𝐹 1 2 1 2 See Lee [911] for some examples. Power calculations can be simplified somewhat using a log transformation, as discussed in the next section. Other Parametric Approaches Under the null hypothesis, log(Δ) is approximately normally distributed with mean 0 and variance 1.0 [583]. This leads to a power/sample size relationship similar to equation (16.22): 𝐷=4
(𝑍𝛼 + 𝑍𝛽 )2 [log(Δ)]2
,
(16.29)
where 𝐷 is the total number of events required and 𝑍𝛼 and 𝑍𝛽 are the normal quantiles for the type I and II error rates. It is easy to verify that to have 90% power to detect a hazard ratio of 2.0 with a two-sided 0.05 𝛼-level test, approximately, 90 total events are
Piantadosi
Date: July 27, 2017
466
Time: 4:45 pm
SAMPLE SIZE AND POWER
TABLE 16.14 Sample Sizes per Group for Comparisons Using the 𝝌 𝟐 Test without Continuity Correction with Equal Group Sizes Determined From Equation (16.27) Δ = 𝜋2 − 𝜋1 𝜋1
0.35
0.40
0.45
0.50
0.55
0.60
0.65
0.70
0.05
22 28 25 33 27 36 29 39 31 40 31 41 31 41 31 40 29 39 27 36
18 23 20 26 22 28 23 30 24 31 24 31 24 31 23 30 22 28 20 26
15 19 16 21 17 23 18 24 19 24 19 24 18 24 17 23 16 21 15 19
12 16 14 17 14 19 15 19 15 19 15 19 14 19 14 27 12 16 11 14
11 14 11 15 12 15 12 16 12 16 12 16 12 15 11 14 10 12 8 10
9 12 10 12 10 13 10 13 10 13 10 12 9 12 8 11 7 9 6 8
8 10 8 10 8 11 8 11 8 10 8 10 7 9 7 8 6 7 4 5
7 8 7 9 7 9 7 9 7 8 6 8 6 7 5 6 4 5 3 3
0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50
In all cases, 𝛼 = 0.05. For each pair of 𝜋1 and 𝜋2 , the upper number corresponds to 𝛽 = 0.20 and the lower number corresponds to 𝛽 = 0.10.
required. It is useful to remember one or two such special cases because the formula may be difficult to recall. A more general form for equation (16.29) is [log(Δ)]2 1 1 + = , 𝑑1 𝑑2 (𝑍𝛼 + 𝑍𝛽 )2
(16.30)
which is useful because it shows the number of events required in each group. Note the similarity to equation 16.21. Ideally, the subjects should be allocated to yield equal numbers of events in the two groups. Usually, this is impractical and not much different from allocating equal numbers of subjects to the two groups. Equations (16.29 and 16.30) are approximately valid for nonexponential distributions as well, especially those with proportional hazards such as the Weibull. Nonparametric Approaches To avoid parametric assumptions about the distribution of event times, a formula given by Freedman [522] can be used. This approach is helpful when it is unreasonable to make assumptions about the form of the event time distribution. Under fairly flexible assumptions, the size of a study should satisfy 𝐷=
(𝑍𝛼 + 𝑍𝛽 )2 (Δ + 1)2 (Δ − 1)2
,
where 𝐷 is the total number of events needed in the study.
(16.31)
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
467
COMPARATIVE TRIALS
Example 16.14. Besides this formula, to detect a hazard rate of 1.75 as being statistically significantly different from 1.0 using a two-sided 0.05 𝛼-level test with 90% power requires (1.96 + 1.282)2 (1.75 + 1)2 = 141 𝑒𝑣𝑒𝑛𝑡𝑠. (1.75 − 1)2 A sufficient number of subjects must be placed on study to yield 141 events in an interval of time appropriate for the trial. Suppose that previous studies suggest that approximately 30% of subjects will remain event free at the end of the trial. The total number of study subjects would have to be 141 = 202. 𝑛= 1 − 0.3 This number might be further inflated to account for study dropouts. 16.7.6
Parametric and Nonparametric Equations Are Connected
Interestingly, equations (16.30 and 16.31) can be connected directly in the following way: For Δ > 0, a convergent power series for the logarithmic function is log(Δ) =
∞ ∑ 𝑖=1
2 𝜓 2𝑖−1 , 2𝑖 − 1
(16.32)
where 𝜓 = (Δ − 1)∕(Δ + 1). Using only the first term of equation (16.30), we have log(Δ) ≈ 2𝜓. Substituting this into equation (16.32), we get equation (16.31). The quantity 2(Δ − 1)∕(Δ + 1) gives values closer to 1.0 than log(Δ) does, which causes equation (16.31) to yield higher sample sizes than equation (16.30). The ratio of sample sizes given by the two equations, 𝑅, is √
𝑅=
1 2
log(Δ) 𝜓
Taking the first two terms of the sum, √
=
∞ ∑ 𝑖=1
1 𝜓 2𝑖−2 . 2𝑖 − 1
1 𝑅 ≈ 1 + 𝜓 2. 3
For Δ = 2, 𝜓 = 13 , and √
𝑅=1+
1 , 27
or 𝑅 ≈ 1.08. Thus, equation (16.31) should yield sample sizes roughly 8% larger than equation (16.30) for hazard ratios near 2.0. This is in accord with Table 16.15. 16.7.7
Accommodating Unbalanced Treatment Assignments
In some circumstances it is useful to allocate unequally sized treatment groups. This might be necessary if one treatment is very expensive, in which case the overall cost of
Piantadosi
Date: July 27, 2017
468
Time: 4:45 pm
SAMPLE SIZE AND POWER
TABLE 16.15 Number of Total Events for Hazard Rate Comparisons Using the Log Rank Test 𝛽 = 0.10
𝛽 = 0.20
𝛼 = 0.05
𝛼 = 0.10
𝛼 = 0.05
𝛼 = 0.10
1.25
844 852
688 694
630 636
496 500
1.50
256 262
208 214
192 196
150 154
1.75
134 142
110 116
100 106
80 84
2.00
88 94
72 78
66 70
52 56
2.25
64 72
52 58
48 54
38 42
2.50
50 58
42 48
38 42
30 34
Δ
Δ is the hazard ratio. The upper row is for the exponential parametric assumption (equation (16.29)) and the lower row is a nonparametric assumption (equation (16.31)). All numbers are rounded so as to be divisible evenly by 2, although this is not strictly necessary.
the study could be reduced by unequal allocation. In other cases we might be interested in a subset of subjects on one treatment group. If 𝑑1 and 𝑑2 are the the required numbers of events in the treatment groups, define 𝑟 = 𝑑2 ∕𝑑1 to be the allocation ratio. Then equation (16.30) becomes 𝑑1 =
2 𝑟 + 1 (𝑍𝛼 + 𝑍𝛽 ) . 𝑟 [log(Δ)]2
(16.33)
From equation (16.31) the corresponding generalization is 𝑑1 =
2 2 𝑟 + 1 (𝑍𝛼 + 𝑍𝛽 ) (Δ + 1) 𝑟 4(Δ − 1)2
and 𝑑1 + 𝑑2 =
2 2 (𝑟 + 1)2 (𝑍𝛼 + 𝑍𝛽 ) (Δ + 1) . 𝑟 4(Δ − 1)2
(16.34)
When 𝑟 = 1, we recover equation (16.31). The effect of unequal allocations can be studied from equation (16.33), for example. Suppose that the total sample size is held constant: 𝐷 = 𝑑1 + 𝑑2 = 𝑑1 + 𝑟𝑑1 = 𝑑1 (𝑟 + 1)
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
COMPARATIVE TRIALS
FIGURE 16.4
469
Power versus allocation ratio for event time comparisons.
or 𝑑1 = 𝐷∕(𝑟 + 1). Then equation (16.33) becomes 𝐷=
2 (𝑟 + 1)2 (𝑍𝛼 + 𝑍𝛽 ) [ ]2 . 𝑟 log(Δ)
From this we have
(√ power = Φ
𝐷𝑟 log(Δ) − 𝑍𝛼 𝑟+1
(16.35)
) .
A plot of power versus 𝑟 using this equation is shown in Figure 16.4. As the allocation ratio deviates from 1.0, the power declines. However, this effect is not very pronounced for 0.5 ≤ 𝑟 ≤ 2. Thus, moderate imbalances in the treatment group sizes can be used without great concern about loss of power or the need to increase total sample size. 16.7.8
A Simple Accrual Model Can Also Be Incorporated
Often it is critical to estimate the length of time required to complete a clinical trial with event time endpoints while also satisfying the other design requirements mentioned above. This is the case in studies of many chronic diseases like cancer, cardiovascular disease, and AIDS. Consider a trial that accrues subjects over an interval of time from 0 to 𝑇 . After time 𝑇 the study does not terminate, but continues while those subjects already on study are actively followed for events (Fig. 5.2). The additional follow-up period is advantageous because it increases the number of events that will be observed from subjects accrued near 𝑇 . Without this additional follow-up, the study would end, and those subjects accrued near the end would mostly be administratively censored. This design increases the information available to the trial at the cost of only additional follow-up.
Piantadosi
Date: July 27, 2017
470
Time: 4:45 pm
SAMPLE SIZE AND POWER
TABLE 16.16 Accrual Times That Satisfy Equation (16.36) with Equal Group Sizes 𝜆1
Event Rates 𝜆2
20
30
40
Accrual Rate 50
75
100
200
0.10
0.15 0.20 0.25
20.4 9.9 6.9
15.6 7.8 5.5
13.0 6.6 4.7
11.3 5.8 4.1
8.9 4.6 3.3
7.5 3.9 2.8
5.1 2.7 2.0
0.15
0.20 0.25 0.30
31.2 12.9 8.5
22.7 9.9 6.6
18.3 8.2 5.5
15.6 7.2 4.9
11.8 5.6 3.9
9.8 4.8 3.3
6.4 3.2 2.2
0.20
0.30 0.35 0.40
16.9 10.4 7.7
12.5 7.9 5.9
10.2 6.6 4.9
8.8 5.7 4.3
6.8 4.5 3.4
5.7 3.8 2.9
3.8 2.5 2.0
𝛼 = 0.05 (two-sided), 𝛽 = 0.10, and 𝜏 = 0.
Under the assumptions of Poisson accrual rates and exponentially distributed failure times, the study parameters in this case should satisfy [ ]2 ( )2 2 ∑ log(Δ) 2 𝜆∗𝑖 1 = , (16.36) 𝑛𝜆𝑖 𝜆∗𝑖 𝑇 − 𝑒−𝜆∗𝑖 𝜏 (1 − 𝑒−𝜆∗𝑖 𝑇 ) (𝑍𝛼 + 𝑍𝛽 )2 𝑖=1 where the subscript indicates group 1 or 2, 𝜆 is the event rate, 𝜆∗𝑖 = 𝜆𝑖 + 𝜇, which is the event rate plus 𝜇, a common loss to follow-up rate, 𝑇 is the accrual period, 𝜏 is the period of additional follow-up, 𝑛 is the accrual rate per unit of time, and the other parameters are as above [1312]. This equation must be used thoughtfully to be certain that the time scales for the event rates and accrual periods are the same (years, months, and so on). Numerical methods are required to solve it for the event rates, accrual time, and follow-up time parameters. The other parameters can be found algebraically. Values of accrual time that satisfy equation (16.36) for various other parameter values are shown in Table 16.16. Although equation (16.36) assumed exponentially distributed failure times, it yields useful values for other event time distributions such as the Weibull. Example 16.15. A randomized clinical trial is planned to detect a two-fold reduction in the risk of death following surgery for nonsmall cell lung cancer using adjuvant chemotherapy versus placebo. In this population of patients, the failure rate is approximately 0.1 per person-year of follow-up on surgery alone. Investigators expect to randomize 60 eligible subjects per year. Assuming no losses to follow-up and type I and II error rates of 5 and 10%, respectively, equation (16.36) is satisfied with 𝑇 = 2.75 and 𝜏 = 0. This is the shortest possible study because accrual continues to the end of the trial, meaning it is maximized. The total accrual required is 60 × 2.75 = 165 subjects. However, those individuals accrued near the end of the study may contribute little information to the treatment comparison because they have relatively little time at risk and are less likely to have events than those accrued earlier. As an alternative design, investigators consider adding a 2-year period of follow-up on the end of the accrual period, so that individuals accrued late can be followed long enough to observe events. This type of
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
COMPARATIVE TRIALS
471
design can be satisfied by 𝑇 = 1.66 and 𝜏 = 2, meaning these parameters will yield the same number of events as those above. Here, the total accrual is 60 × 1.66 = 100 subjects and the total trial duration is 3.66 years. Thus, this design option allows trading follow-up time for up-front accruals. This strategy is often possible in event time studies. Follow-up time and accruals can be exchanged to yield a favorable balance while still producing the necessary number of events in the study cohort. In its original form equation (16.36) is somewhat difficult to use and the sample size tables often do not have the needed entries. If the exponential terms in equation (16.36) are replaced by first-order Taylor series approximations, 𝑒𝜆𝑥 ≈ 1 − 𝜆𝑥, we obtain
𝑛 ≈ 𝜆̃
(𝑍𝛼 + 𝑍𝛽 )2 𝑇 𝜏[log(Δ)]2
,
where 𝜆̃ is the harmonic mean of 𝜆1 and 𝜆2 . This approximation yields accrual rates that are slightly conservative. Nevertheless, it is useful in situations where the first-order approximation is valid, for example, when the event rates are low. This might be the case in disease prevention trials where the population under study is high risk but the event rate is expected to be relatively low.
16.7.9
Stratification
Stratification means that the study cohort is a discrete mixture of groups, each with a different absolute risk of the outcome. As is the rule in stratified design or analysis, we will assume that the relative treatment effect (in this case, the hazard ratio) is the same in all strata, although the absolute risks can be different. In such a case, each stratum will yield different numbers of events depending on the absolute risk. However, because the overall hazard ratio is constant across strata, the required total number of events remains the same as in a homogeneous cohort. But a heterogeneous cohort will require more subjects to produce the same number of events. In the absence of censoring, there is no effect from strata because all events will be observed and it would not matter from which stratum they came. In the presence of a proportion, 𝑝, of censoring, the number of subjects, 𝑁, is related to the number of events, 𝐷, by 𝑁 = 𝐷∕(1 − 𝑝). Let us suppose that there are 𝑘 strata and the fraction of study subjects from each stratum is 𝑓𝑖 . Then the required events will be produced according to the weighted average 𝑘 𝑘 ∑ ∑ 𝐷𝑓𝑖 𝑓𝑖 𝑁= =𝐷 , 1 − 𝑝 1 − 𝑝𝑖 𝑖 𝑖=1 𝑖=1
∑ where 𝑝𝑖 is the proportion censored in the 𝑖𝑡ℎ stratum, and 𝑓𝑖 = 1. But we already know 𝐷, the required total number of events, for example, from equation (16.31) or
Piantadosi
Date: July 27, 2017
472
Time: 4:45 pm
SAMPLE SIZE AND POWER
TABLE 16.17 Sample Sizes Satisfying Equation (16.37) for Various Cohort Compositions and 𝒌 = 𝟑 𝑓1
𝑓2
𝑓3
𝑁
𝑓1
𝑓2
𝑓3
𝑁
0.1 0.2 0.3 0.4 0.1 0.2 0.3 0.4 0.1 0.2
0.1 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3
0.8 0.7 0.6 0.5 0.7 0.6 0.5 0.4 0.6 0.5
187 198 208 218 206 217 227 238 225 235
0.3 0.4 0.1 0.2 0.3 0.4 0.1 0.2 0.3 0.4
0.3 0.3 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5
0.4 0.3 0.5 0.4 0.3 0.2 0.4 0.3 0.2 0.1
246 256 244 254 265 275 263 273 284 294
Δ = 2 is the common hazard ratio, 𝑍𝛼 = 1.96, and 𝑍𝛽 = 1.645. For all cases, 𝑝1 = 0.1, 𝑝2 = 0.5, and 𝑝3 = 0.8.
other appropriate design equation. Therefore, the sample size for a stratified design will be 𝑘 (𝑍𝛼 + 𝑍𝛽 )2 (Δ + 1)2 ∑
𝑓𝑖 1 − 𝑝𝑖
(16.37)
𝑘 (𝑍𝛼 + 𝑍𝛽 )2 (Δ + 1)2 ∑ 𝑓𝑖 , = 𝑆 (Δ − 1)2 𝑖 (𝜏) 𝑖=1
(16.38)
𝑁=
(Δ −
1)2
𝑖=1
where 𝑆𝑖 (𝜏) is the respective survival fraction at the end of study, 𝜏. As a fine point, it’s useful to note that each stratum contains two groups with different event rates according to Δ. Hence, the proportion of individuals censored in each treatment group within stratum will be different, but I have provided only one parameter, 𝑝𝑖 (or 𝑆𝑖 ), to summarize them. Accordingly, we could take 𝑝𝑖 to be the average proportion censored in the two treatment groups, or the highest proportion censored. For reasonable values of Δ, there will be little consequence either way, and no reason to create 𝑘 additional parameters to model every event curve. We can view the rightmost term in equation (16.37) as an inflation factor for the sample size to account for the heterogeneous trial cohort. Some solutions of equation (16.37) are shown in Table 16.17 where a study cohort is presumed to have low, medium, and high risk subsets (𝑘 = 3), and Δ = 2. Absent heterogeneity and the need for stratification, the required number of events for a hazard ratio of 2 is 90. This number of events would require 105, 189, and 473 subjects under 0.1, 0.5, and 0.8 proportions censoring, respectively. The mixture of subsets requires a sample size between the extremes as the compositions in Table 16.17 indicate. 16.7.10
Noninferiority
Noninferiority or equivalence questions and some of their design concepts were introduced in Section 8.6.2. Because there is a tendency to accept the null whenever we fail to reject it statistically, noninferiority decisions made under the typical hypothesis testing
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
COMPARATIVE TRIALS
473
FIGURE 16.5 Relationship between the observed hazard ratio and the total number of events required for a two-sided lower confidence bound to exceed 0.75 (set A, solid) or 0.80 (set B, dashed) in a noninferiority trial. Note the logarithmic vertical axis. In each set, the upper curve is for 95%, the middle curve for 90%, and the lower curve for 80% confidence bounds.
paradigm can be anticonservative. A trial with low power might fail to reject the null, leading to a declaration of equivalence and adoption of a useless therapy [330, 904]. Based on this, there is some sense in designing such trials with the null hypothesis being “the treatments are different” and the alternative hypothesis being “the treatments are the same,” which is a reversal of the usual situation where the null is the hypothesis of no difference [158]. Reversing the null and alternative hypotheses reduces the chance of adopting a therapy simply because it was inadequately studied. Such tests are naturally one-sided because we don’t wish to prove that the new treatment is significantly better or worse, only that it is equivalent. Thus, the roles of 𝛼 and 𝛽 are reversed, and they must be selected with extra care. It might be worth a re-examination of Section 13.5 where futility designs were discussed with emphasis on the consequences of switching the null and alternative hypotheses. The impact on sample size of choosing 𝛼 and 𝛽 in this way does not have to be great. It was shown above that the quantiles for type I and II error probabilities add together directly in the numerator of many power formulas. However, the operational definition of significant may change depending on 𝛼 and 𝛽, which can be consequential. A careful consideration of the consequences of type I and II errors is useful before planning any noninferiority trial. Approaches to power and sample size for these trials are discussed by Roebruck and K¨uhn [1278] and Farrington and Manning [446]. Some of the conventional regulatory perspectives on the statistical aspects of these studies are discussed and challenged by Garrett [568]. The sample size for noninferiority trials depends strongly on the quantitative definition of equivalence, just as that for superiority trials depends on the effect size. Equivalence defined with high precision requires an extraordinarily large sample size. In equation
Piantadosi
Date: July 27, 2017
474
Time: 4:45 pm
SAMPLE SIZE AND POWER
TABLE 16.18 Approximate Total Number of Events Required to Satisfy Equivalence as Defined by Confidence Intervals and Equation (16.41) 𝑏
̂ Δ
0.75
0.80
0.85
(981) (691) (420) (463) (326) (198) 275 194 118
(4181) (2946) (1789) (1108) (781) (474) 521 367 223
0.90
0.95
0.85
(4704) (3314) (2013) 1243 875 532
0.90
0.75
0.80
0.85
0.90
1.00
186 131 80 136 96 59 105 74 45
309 218 133 208 147 89 152 107 65
582 410 249 345 243 148 232 163 99
1385 976 593 647 456 277 382 269 164
1.05
(5257) (3703) (2249)
𝑏
̂ Δ
1.10
Δ is the observed hazard ratio and 𝑏 is the value of the lower confidence bound that must be exceeded to satisfy equivalence. Within each Δ grouping, the first row is for 𝑍 = 1.96 (e.g., a two-sided 95% confidence interval), the second row for 𝑍 = 1.645, and the third row for 𝑍 = 1.282. Numbers in parentheses indicate that the upper confidence bound is less than 1.0.
(16.28) for example, the sample size increases without being bound as 𝜋1 → 𝜋2 . If we are willing to declare two proportions equivalent when ||𝜋1 − 𝜋2 || ≤ 𝛿, then equation (16.28) can be modified as ] [ (𝑍𝛼 + 𝑍𝛽 )2 𝜋1 (1 − 𝜋1 ) + 𝜋2 (1 − 𝜋2 ) 𝑛2 = . (16.39) [𝛿 − (𝜋1 − 𝜋2 )]2 Most frequently in this context, we also assume that 𝜋1 = 𝜋2 . Example 16.16. Suppose that the success rate for standard induction chemotherapy in adults with leukemia is 50% and that we would consider a new treatment equivalent to it if the success rate was between 40 and 60% (𝛿 = 0.1). The null hypothesis assumes that the treatments are unequal. Then, using an 𝛼 = 0.10 level test, we would have 90% power to reject nonequivalence with 428 subjects assigned to each treatment group. In contrast, if equivalence is defined as 𝛿 = 0.15, the required sample size decreases to 190 per group. Confidence Limits From the above discussion, it is evident that the usual hypothesis-testing framework is somewhat awkward for noninferiority questions. Either confidence intervals or likelihood methods may be a more natural way to deal with such issues. See Fleming [488] for a nice summary of this approach. To simplify the following discussion, assume that safety is not at stake, and that the benefits of standard treatment have already been established. Noninferiority will be assessed in terms of a survival hazard ratio. Other outcomes can be handled in an analogous way. Suppose that a trial of 𝐴 versus placebo shows a convincing reduction in risk of ̂ 𝐴𝐵 = 1.0. some magnitude, and a later comparison of 𝐴 versus 𝐵 yields a hazard ratio Δ Although the point estimate from the second study suggests that 𝐵 is not inferior to 𝐴, we need to be certain that the data are not also consistent with substantial loss of benefit, for
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
COMPARATIVE TRIALS
475
example, if the study were underpowered. One assurance is to check that the confidence ̂ 𝐴𝐵 does not strongly support low values less than 1, perhaps below interval around Δ some tolerance, 𝑏. A small study with a wide confidence interval around the estimated hazard ratio might not convincingly rule out values below 𝑏. As an example, 𝑏 might be taken to be 0.75, that is, requiring confidence that three-quarters of the treatment benefit is preserved to declare noninferiority. ̂ 𝐴𝐵 can be obtained from An approximate confidence interval for Δ 𝑍 ̂ 𝐴𝐵 ± √ 𝛼 , log Δ 𝐷∕2
(16.40)
where 𝐷 is the total number of events observed on study (assumed to be evenly divided between the two treatment groups) and 𝑍𝛼 is the two-sided normal distribution quantile. If we require that the entire confidence interval exceed the tolerance, 𝑏, 𝑍 ̂ 𝐴𝐵 − √ 𝛼 . log 𝑏 = log Δ 𝐷∕2 From this we obtain ( 𝐷=
2𝑍𝛼 ̂ 𝐴𝐵 ∕𝑏) log(Δ
)2 (16.41)
as the required number of events in our study, assuming that it yields a hazard ratio ̂ 𝐴𝐵 . Studies that satisfy this design criterion could end up being quite large for several Δ reasons. First, we would require relatively high precision or narrow confidence limits. Second, a high fraction of observations could be censored. Third, we cannot always expect ̂ 𝐴𝐵 ≥ 1. Figure 16.5 shows the relationship between total events required, the hazard Δ ratio observed in the study cohort, and the width of the confidence interval. Table 16.18 provides the number of events calculated from this equation for a few cases. The sample size does not have to be large relative to a superiority trial, as Table 16.18 indicates. The ̂ 𝐴𝐵 ≥ 1. As for most event bottom two panels show more modest sample sizes when Δ time outcomes, the number of subjects needed for the study will be substantially larger than the number of events required. This method has some deficiencies. If the trial yields a confidence interval for the hazard ratio that extends slightly below 𝑏, the decision rule indicates that we should reject noninferiority. The principal flaw in such an approach is that not all points in the confidence interval are equally supported by the data. The data support the center of the interval most strongly, but the decision rule is sensitive to the behavior of points at the end that are not well supported by the data. Furthermore, there is as much support for the upper end of the confidence interval as for the lower, but only the lower end strongly affects the decision. This situation is logically consistent with hypothesis tests and the common interpretation of 𝑝-values, as a cutoff of 0.05. Rigidity is inappropriate, as stated, because it presupposes the equivalence of all points in the confidence or rejection interval. In
Piantadosi
Date: July 27, 2017
476
Time: 4:45 pm
SAMPLE SIZE AND POWER
TABLE 16.19 Approximate Total Number of Events Required to Satisfy Equivalence as Defined by Likelihood Ratios and Equation (16.42) 𝑏
̂ Δ
0.75
0.80
0.85
2478 1770 1062 1168 834 500 695 496 298
10561 7544 4526 2798 1999 1199 1314 939 563
0.90
0.95
0.85
11881 8486 5092 3138 2241 1345
0.90
0.75
0.80
0.85
0.90
1.00
469 335 201 343 245 147 265 189 113
780 557 334 525 375 225 382 273 164
1470 1050 630 869 621 373 584 417 250
3497 2498 1499 1634 1167 700 964 689 413
1.05
13278 9485 5691
𝑏
̂ Δ
1.10
Δ is the observed hazard ratio and 𝑏 is the value of the relative hazard that must be exceeded to satisfy equivalence. Within each Δ grouping, the first row is for 𝑅 = 128, the second row for 𝑅 = 32, and the third row for 𝑅 = 8.
reality, the choice between nearly equivalent treatments will depend on additional outcomes such as toxicity, quality of life, or convenience. The discussion here is intended to be illustrative, particularly of the large sample sizes required, and not necessarily an endorsement of this as being the optimum approach to noninferiority. Example 16.17. Suppose a randomized clinical trial intends to compare oral versus injectable vaccines for a certain infectious disease. The hazard ratio will be defined as the infection rate in the injection group divided by the rate in the oral group. The oral formulation will be considered noninferior if the hazard ratio is at least 0.95 and the lower 95% confidence bound exceeds 0.80. This means that the estimated risk of infection using the injectable vaccine is 95% or higher than that of the oral vaccine, and that the 95% confidence interval does not support risk ratios below 0.80. Approximately, inverting this means that the oral vaccine does not yield more than 105% of the estimated infection risk of the injectable, and confidently less than 120% of the risk. Using Table 16.7.10 ̂ = 0.95 and 𝑏 = 0.80, 1300 events would be required. Because infections would with Δ be somewhat uncommon with a preventive vaccine, the study size required to support this number of events could be huge. Relative Precision We can use a similar approach to express the relative precision in confidence intervals for the hazard ratio as a function of the number of events. This was done in Section 16.5.1 for event rates. Using equation (16.40), approximate confidence bounds for a hazard ratio, Δ, on the natural scale are ̂
√
𝑒log(Δ)±𝑍𝛼 ×2∕
𝐷
√
̂ ±𝑍𝛼 ×2∕ = Δ𝑒
𝐷
,
̂ is the observed hazard ratio, 𝐷 is the total number of events, and 𝑍𝛼 is the where Δ two-sided normal distribution quantile for the intended coverage. The relative precision √ ±𝑍 ×2∕ 𝐷 ̂ 𝛼 , that being the part that does not depend on Δ. in the confidence interval is 𝑒 On the natural scale, the confidence bounds are asymmetric.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
477
COMPARATIVE TRIALS
TABLE 16.20 Relative Precision in Confidence Intervals for Hazard Ratios Events 𝐷 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800
95%
90%
(
)
(
)
0.68 0.68 0.73 0.76 0.78 0.80 0.81 0.82 0.83 0.84 0.85 0.85 0.86 0.86 0.87 0.87
1.74 1.48 1.38 1.32 1.28 1.25 1.23 1.22 1.20 1.19 1.18 1.17 1.17 1.16 1.15 1.15
0.70 0.77 0.81 0.83 0.85 0.86 0.87 0.88 0.89 0.89 0.90 0.90 0.90 0.91 0.91 0.91
1.44 1.29 1.23 1.20 1.18 1.16 1.15 1.14 1.13 1.12 1.12 1.11 1.11 1.10 1.10 1.09
Events 𝐷 850 900 950 1000 1050 1100 1150 1200 1250 1300 1350 1400 1450 1500 1550 1600
95%
90%
(
)
(
)
0.87 0.88 0.88 0.88 0.89 0.89 0.89 0.89 0.90 0.90 0.90 0.90 0.90 0.90 0.91 0.91
1.14 1.14 1.14 1.13 1.13 1.13 1.12 1.12 1.12 1.11 1.11 1.11 1.11 1.11 1.10 1.10
0.92 0.92 0.92 0.92 0.92 0.93 0.93 0.93 0.93 0.93 0.93 0.93 0.93 0.94 0.94 0.94
1.09 1.09 1.09 1.08 1.08 1.08 1.08 1.08 1.08 1.07 1.07 1.07 1.07 1.07 1.07 1.07
Some values for relative precision are given in Table 16.20. A relative precision for a 95% confidence interval within ±20% of the estimated hazard ratio can be attained with 300–500 events. To increase precision to ±10% requires 1500 events. Values in Table 16.20 are independent of the estimated hazard ratio. Example 16.18. Suppose we must estimate the hazard ratio in a randomized trial for two treatments for glioblastoma with a precision of ±15%. A 95% confidence interval for the true hazard ratio will be used. Using Table 16.20, a relative lower bound of 15% can be attained with 550 events and a relative upper bound of 15% with 750 events. Thus, the trial should employ 550–750 events. The exact number may depend on which is the more interesting end of the confidence interval. Example 16.19. As in the previous example, suppose the hazard ratio must be less than 1.2 to meet the definition of noninferiority. If the point estimate of the hazard ratio can be as high as 1.1, we would then require a precision of less than ±10% for the upper bound to remain below 1.2. From Table 16.20, the required number of events has to be 1550 to keep the upper 95% confidence bound at 10%. Note that with this large number ̂ ≈ 1.1, we would be close to demonstrating convincing inferiority, that of events and Δ is the lower confidence bound might exceed 1.0. This is the potential consequence of narrowing both the upper and lower confidence bounds as the number of events increases. Likelihood Based An approach to sample size for noninferiority trials based on likelihood ratios may be slightly more natural than confidence intervals. Simply stated, we would require the same ̂ ≥ 𝑏 in a noninferiority trial relative evidence (measured by the likelihood ratio) that Δ ̂ ̂ as for Δ ≥ 1 in a superiority trial. Using the fact that log(Δ) is approximately distributed
Piantadosi
Date: July 27, 2017
478
Time: 4:45 pm
SAMPLE SIZE AND POWER
as 𝑁(log(Δ), 𝑑1 + 𝑑1 ), where 𝑑1 and 𝑑2 are the numbers of events in the two treatment 1 2 groups, the normal likelihood yields )2 ( )2 ( ⎛ ̂ − log(Δ1 ) ⎞ ̂ − log(Δ2 ) log( Δ) log( Δ) ⎟ ⎜ − 𝐿(Δ1 )∕𝐿(Δ2 ) = exp ⎜ ⎟, 8∕𝐷 8∕𝐷 ⎟ ⎜ ⎠ ⎝ where we have assumed that 𝑑1 ≈ 𝑑2 and 𝑑1 + 𝑑2 = 𝐷. In the usual superiority situation ̂ is we would reject the null value (Δ = 1) if we observed sufficient evidence that Δ different. Thus, ( ̂ 𝑅 = 𝐿(Δ)∕𝐿(1) = exp
̂ 2 log(Δ) 8∕𝐷
) ,
where 𝑅 is a specified likelihood ratio. The number of events required is 𝐷=
8 log(𝑅) . ̂ 2 log(Δ)
For example, suppose that we need moderate to strong evidence against the null if ̂ = 2. When 𝑅 = 128 corresponding to strong evidence, 𝐷 = 82, and when 𝑅 = 32 Δ corresponding to moderate evidence, 𝐷 = 58, in accord with sample sizes produced by conventional hypothesis test-based designs. ̂ versus the tolerance 𝑏 defined above For noninferiority, the relative evidence for Δ yields 𝐷= (
8 log(𝑅)
8 log(𝑅) )2 = ( )2 , ̂ ̂ log(Δ) − log(𝑏) log(Δ∕𝑏)
(16.42)
which could have been more directly derived from equation (16.41) because 𝑅 = ̂ ≤ 1. 𝑒𝑥𝑝(𝑍 2 ∕2). Again, the principal difficulty for trial size will arise when 𝑏 ≤ Δ The relative evidence implied by the 95% confidence interval is moderate to weak: exp(1.962 ∕2) = 6. 8. Hence, the sample sizes implied by Table 16.18 are actually relatively small. Table 16.19 is based on likelihood ratios of 8, 32, and 128, up to strong evidence, resulting in numbers of events to demonstrate noninferiority that are very large.
16.8
EXPANDED SAFETY TRIALS
ES (phase IV) studies are large safety trials, typified by postmarketing surveillance, and are intended to uncover and accurately estimate the frequency of uncommon side effects that may have been undetected in earlier studies. The size of such studies depends on how low the event rate of interest is in the cohort under study and how powerful the study needs to be.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
EXPANDED SAFETY TRIALS
16.8.1
479
Model Rare Events with the Poisson Distribution
Assume that the study population is large and that the probability of an event is small. Also assume that all subjects are followed for approximately the same length of time. The number of events, 𝐷, in such a cohort would follow the Poisson distribution. The probability of observing exactly 𝑟 events is Pr[𝐷 = 𝑟] =
(𝜆𝑚)𝑟 𝑒−𝜆𝑚 , 𝑟!
where 𝜆 is the event rate, and 𝑚 is the size of the cohort. The chance of observing 𝑟 or fewer events is the tail area of this distribution or Pr[𝐷 ≤ 𝑟] =
𝑟 ∑ (𝜆𝑚)𝑘 𝑒−𝜆𝑚 𝑘=0
𝑘!
.
The study should be large enough to have a high chance of seeing at least one event when the event rate is 𝜆. Example 16.20. The chance of seeing at least one event in the Poisson distribution is 𝛽 = 1 − Pr[𝐷 = 0] = 1 − 𝑒−𝜆𝑚 , or 𝑚 = −log(1 − 𝛽)∕𝜆. If 𝜆 = 0.001 and 𝛽 = 0.99, then 𝑚 = 4605. In other words, if we studied 4605 subjects and observed no events, we would have a high degree of certainty that the event rate was lower than 0.1%. Confidence intervals for 𝜆 can be calculated from tails areas of the distribution in a fashion similar to those for the binomial above. Some results are given in Table 16.21, where 𝑑 represents the number of events observed. The numerical values for the upper and lower limits provided in Table 16.21 must be divided by the cohort size to obtain the actual confidence bounds. Example 16.21. If 1 event is observed in a cohort of size 10, from Table 16.21 the 95% confidence bounds on 𝜆 are (0.024, 0.557). If 2 events are observed in a cohort of 100, the 95% confidence bounds on 𝜆 are (0.006, 0.072). Finally, if 11 events are observed in a cohort of 250, the bounds are (6.201∕250, 19.68∕250) = (0.025, 0.079). In some cases comparative trials are designed with discrete outcomes having a Poisson distribution. Power computations in these cases are discussed by Gail [553]. 16.8.2
Likelihood Approach for Poisson Rates
Some useful insights can be obtained by considering the relative evidence for an observed Poisson event rate, 𝜆. The likelihood ratio for 𝜆 versus 𝜇 is ( )𝑟 𝜆 𝑒Λ = 𝑒−𝑚(𝜆−𝜇) , 𝜇
Piantadosi
Date: July 27, 2017
480
Time: 4:45 pm
SAMPLE SIZE AND POWER
TABLE 16.21 Exact Two-Sided 95% Poisson Confidence Limits 𝑑
(
)
𝑑
(
)
𝑑
(
)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
0.242 0.619 1.090 1.623 2.202 2.814 3.454 4.115 4.795 5.491 6.201 6.922 7.654 8.395 9.145 9.903 10.67 11.44 12.22 13.00 13.79 14.58 15.38 16.18 16.98
5.572 7.225 8.767 10.24 11.67 13.06 14.42 15.76 17.08 18.39 19.68 20.96 22.23 23.49 24.74 25.98 27.22 28.45 29.67 30.89 32.10 33.31 34.51 35.71 36.90
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
17.79 18.61 19.42 20.24 21.06 21.89 22.72 23.55 24.38 25.21 26.05 26.89 27.73 28.58 29.42 30.27 31.12 31.97 32.82 33.68 34.53 35.39 36.25 37.11 37.97
38.10 39.28 40.47 41.65 42.83 44.00 45.17 46.34 47.51 48.68 49.84 51.00 52.16 53.31 54.47 55.62 56.77 57.92 59.07 60.21 61.36 62.50 63.64 64.78 65.92
51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
38.84 39.70 40.57 41.43 42.30 43.17 44.04 44.91 45.79 46.66 47.54 48.41 49.29 50.17 51.04 51.92 52.80 53.69 54.57 55.45 56.34 57.22 58.11 58.99 59.88
67.06 68.19 69.33 70.46 71.59 72.72 73.85 74.98 76.11 77.23 78.36 79.48 80.60 81.73 82.85 83.97 85.09 86.21 87.32 88.44 89.56 90.67 91.79 92.90 94.01
When 𝑑 = 0, the one-sided upper bounds are 3.689 (95%) and 4.382 (97.5%).
where 𝑟 events are observed with cohort size (exposure) 𝑚. Then, ( ) 𝜆 Λ = 𝑟 log − 𝑚(𝜆 − 𝜇) 𝜇
(16.43)
or 𝑚=
𝑟 log( 𝜇𝜆 ) − Λ 𝜆−𝜇
.
(16.44)
Equation (16.44) can be viewed as an event-driven sample size relationship and might have some utility if our design called for observing a fixed number of events. For example, if we anticipate observing 0 events, we can only consider hypothetical rates 𝜇 > 𝜆. If we need to generate relative evidence against 𝜇 close to 𝜆, more events are needed, and consequently a larger cohort. Equation (16.43) also yields 𝑚=
Λ 𝜆 log( 𝜇𝜆 )
+ (𝜇 − 𝜆)
,
(16.45)
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
OTHER CONSIDERATIONS
481
TABLE 16.22 Sample Sizes for Comparison of 𝝀 versus 1.0 from Equation (16.45). Λ 𝜆 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
8
32
64
128
6 10 16 26 42 73 142 346 1493
23 40 64 102 166 289 565 1383 5970
46 80 127 203 332 578 1130 2766 11940
92 159 254 405 663 1155 2259 5531 23879
which has a more familiar structure and is rate-driven in the sense that 𝜆 = 𝑟∕𝑚. Some example sample sizes from equation (16.45) rounded to the next highest integer are shown in Table 16.22. In all cases the reference event rate is 𝜇 = 1.
16.9 16.9.1
OTHER CONSIDERATIONS Cluster Randomization Requires Increased Sample Size
The effect of clustered experimental units is to reduce efficiency or increase variance in measured summaries. Intuitively this seems justified because observations within a cluster are more similar than those between clusters. This correlation acts to reduce the effective sample size relative to independent units. In the case of comparing means, as in Section 16.7.2, for 𝑘 clusters of 𝑚 individuals in each treatment group, the mean of either treatment group has variance var{𝑋} =
𝜎2 (1 + (𝑚 − 1)𝜌) , 𝑘𝑚
where 𝜌 is the intracluster correlation coefficient. The inflation factor, 1 + (𝑚 − 1)𝜌, can be applied in a straightforward way to obtain sample sizes for clustered designs. Also 𝜎 2 can be replaced by 𝑝(1 − 𝑝) for dichotomous outcomes to yield an appropriate inflation factor for the usual sample size formulas for those outcomes. Many other considerations must also be applied for clusterrandomized studies besides sample size. For example, the analysis of observations that are not independent of one another requires special statistical procedures. See Donner and Klar [388, 389] and Murray [1073] for thorough discussions of cluster randomized trials. Example 16.22. If we randomize in clusters of size 5, and the intracluster correlation coefficient is 0.25, the inflation factor is 1 + 4 × 0.25 = 2. Thus, for an effect size of 1.0 when comparing means (90% power, two-sided 𝛼 = 0.05), the sample size must be increased from 21 per group to 42 per group to compensate for the clustering.
Piantadosi
Date: July 27, 2017
482
16.9.2
Time: 4:45 pm
SAMPLE SIZE AND POWER
Simple Cost Optimization
It is possible to perform a simple cost optimization using unbalanced allocation and information about the relative cost of two treatment groups. For the t-test case as an example, equation (16.22) indicates that the total sample size must satisfy 𝑛2 + 𝑛 1 =
(
) (𝑍𝛼 + 𝑍𝛽 )2 𝜎 2 𝑟+1 +𝑟+1 , 𝑟 Δ2
where 𝑟 is the allocation ratio and 𝑛1 = 𝑟𝑛2 . Suppose that the relative cost of subjects on group 2 (treatment) compared to group 1 (control) is 𝐶. Then the total cost of the study, 𝑀, is proportional to 𝑀 ∝𝐶
𝑟 + 1 𝑟2 + 𝑟 + . 𝑟 𝑟
The minimum cost with respect to the allocation ratio can be found from 𝜕𝑀 𝐶 = 1 − 2, 𝜕𝑟 𝑟 which, after setting the derivative equal to zero, yields √ 𝑟min = 𝐶. Thus, the allocation ratio that minimizes total cost should be the square root of the relative costs of the treatment groups. This result is fairly general and does not depend explicitly on a particular statistical test. It might be worth comparing this square root rule to the one derived in Section 17.6.4 for unequal allocation ratios. Rather than allocation ratio, 𝑟, we could consider the fractional allocation, 𝑓 = 𝑟∕(𝑟 + 1), for which the minimum is 𝑓min =
1 1+
1 √ 𝐶
.
Example 16.23. Suppose that a treatment under study costs√2.5 times as much as control therapy. The allocation ratio that minimizes costs is 𝑟 = 2.5 = 1.58. Therefore, 1.6 times as many subjects should be enrolled on the control therapy √ compared to the experimental treatment. The fractional allocation is 𝑓 = 1∕(1 + 1∕ 2.5) = 0.61 or approximately 61:39 control to experimental. Note that the power loss for this unbalanced allocation would be quite modest. 16.9.3
Increase the Sample Size for Nonadherence
Frequently, subjects on randomized clinical trials do not comply with the treatment to which they were assigned. Here we consider a particular type of nonadherence, called drop-in, where subjects intentionally take the other (or both) treatment(s). This might happen in an AIDS trial, for example, where study participants want to receive all treatments on the chance that one of them would have a large benefit. A more general discussion of the consequences of nonadherence is given by Schechtman and Gordon [1337]. Nonadherence, especially treatment crossover, always raises questions about how best to analyze the data. Such questions are discussed in Chapter 19.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
OTHER CONSIDERATIONS
483
Drop-ins diminish the difference between the treatment groups, requiring a larger sample size than one would need to detect the same therapeutic effect in perfectly complying subjects. Investigators frequently inflate the sample size planned for a trial on an ad hoc basis to correct for this problem. For example, if 15% of subjects are expected to fail to comply, one could increase the sample size by 15% as though the analysis would be based on compliant subjects only. This strategy is helpful because it increases the precision of the trial, but gives credence to the incorrect idea that noncompliers can be removed from the analysis. It is possible to approximate quantitatively the consequence of nonadherence on the power of a comparative clinical trial. Under somewhat idealized conditions, suppose that the trial endpoint is survival and that a fixed proportion of study participants in each group cross over to the opposite treatment. For simplicity, we assume that the compliant proportion is 𝑝, and is the same in both groups, and that subjects only drop-in to the other treatment. Denote the hazard ratio in compliant subjects by Δ and in partially compliant ′ subjects by Δ . Then, for the same type I and II error rates, the required number of events ′ are 𝐷 and 𝐷 , respectively. Hazards By equation (16.31), the sample size inflation factor, 𝑅, that corrects for different hazard ratios satisfies ′
(Δ − 1)2 (Δ + 1)2 𝐷 = . 𝐷 (Δ + 1)2 (Δ′ − 1)2 ′
𝑅= ′
Expressing Δ in terms of Δ will put this equation in a usable form. ′ ′ ′ ′ ′ To accomplish this, assume that Δ = 𝜆1 ∕𝜆2 , where 𝜆1 and 𝜆2 are the composite hazards in the two treatment groups after noncompliant subjects are taken into account. In the first treatment group, 𝑝 subjects have hazard 𝜆1 and 1 − 𝑝 have hazard 𝜆2 . The “average” hazard in this group is the weighted harmonic mean, 2
′
𝜆1 =
𝑝 𝜆1
+
1−𝑝 𝜆2
,
and in the second treatment group, 2
′
𝜆2 =
1−𝑝 𝜆1
+
𝑝 𝜆2
.
Therefore, ′
Δ =
1 + 𝑝(Δ − 1) . Δ − 𝑝(Δ − 1)
From this, it is straightforward to show that ′
Δ +1 Δ+1 1 = . Δ′ − 1 Δ − 1 2𝑝 − 1
Piantadosi
Date: July 27, 2017
484
Time: 4:45 pm
SAMPLE SIZE AND POWER
Substituting into the equation defining 𝑅 yields ′
(Δ − 1)2 (Δ + 1)2 𝐷 1 𝑅= = = , ′ 2 2 𝐷 (Δ + 1) (Δ − 1) (2𝑝 − 1)2 ′
which gives the inflation factor as a function of the adherence rate. It is not obvious why the harmonic mean rather than an ordinary mean should be used for aggregate failure rates. A brief justification for this is as follows: Suppose that 𝑑 failure times are ranked from smallest to largest. The total exposure time is 𝑑 ∑ 𝑖=1
𝑡𝑖 =
𝑑 ∑ 𝑖=1
(𝑡𝑖 − 𝑡𝑖−1 )(𝑑 − 𝑖 + 1) =
𝑑 ∑ 𝑖=1
𝛿𝑖 (𝑑 − 𝑖 + 1),
where 𝑡0 = 0 and 𝛿𝑖 = 𝑡𝑖 − 𝑡𝑖−1 . There is one event in each interval so that the interval hazard rates are 𝜆𝑖 =
1 . 𝛿𝑖 (𝑑 − 𝑖 + 1)
The overall hazard is 𝑑 𝜆 = ∑𝑑
𝑑
= ∑𝑑
𝑖=1 𝑡𝑖
𝑖=1 𝛿𝑖 (𝑑
− 𝑖 + 1)
𝑑 = ∑𝑑
1 𝑖=1 𝜆𝑖
.
Thus, the overall hazard is the harmonic mean of the interval hazards. Alternatively, the problem can be viewed from the perspective of the median even time. Median event times could be averaged in the usual manner to estimate an overall median. However, the median event time is the reciprocal hazard, again indicating the need for the harmonic mean of the hazards. Specifically, if log(2) , 𝜆𝑖
𝜏𝑖 =
where 𝜆𝑖 is the 𝑖𝑡ℎ hazard and 𝜏𝑖 is the 𝑖𝑡ℎ median event time, then the mean of the median event times is 𝑛
𝜏=
𝑛
1∑ 1 ∑ log(2) 𝜏𝑖 = 𝑛 𝑖=1 𝑛 𝑖=1 𝜆𝑖 𝑛
=
log(2) ∑ 1 . 𝑛 𝑖=1 𝜆𝑖
Therefore, 𝑛 ∑𝑛
1 𝑖=1 𝜆𝑖
=
log(2) = 𝜆, 𝜏
where 𝜆 represents the mean hazard, that is, the hazard that yields the overall median event time. It is simply the harmonic mean of the individual hazards.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
OTHER CONSIDERATIONS
485
Means Returning to the underlying problem of nonadherence, the same inflation factor also applies to power and sample size considerations for differences in means. Suppose that when all participants adhere to their assigned treatments, the treatment difference is 𝛿 = 𝜇1 − 𝜇2 . As above, assume that a fraction, 𝑝, of subjects in each group crossover to the other treatment. The observed treatment difference would then be a consequence of the weighted average means in the treatment groups, which would yield 𝛿 ′ = 𝜇1′ − 𝜇2′ = 𝜇1 (1 − 𝑝) + 𝜇2 𝑝 − 𝜇1 𝑝 − 𝜇2 (1 − 𝑝) = (𝜇1 − 𝜇2 )(1 − 2𝑝). The inflation factor, 𝑅, is the ratio of sample sizes required by 𝛿 versus 𝛿 ′ (e.g., from equation (16.26)), or 𝑅=
1 . (2𝑝 − 1)2
Proportions For dichotomous outcomes or proportions, the same inflation factor applies, at least approximately. To see this, note that the weighted proportions in the treatment groups, accounting for noncompliance, are 𝜋1′ = 𝜋1 𝑝 + 𝜋2 (1 − 𝑝) and 𝜋2′ = 𝜋1 (1 − 𝑝) + 𝜋2 𝑝, so Δ′ = 𝜋1′ − 𝜋2′ = (𝜋1 − 𝜋2 )(2𝑝 − 1). As can be seen from equation (16.28), if differences in the numerator as a consequence of 𝜋1′ and 𝜋2′ are ignored, the inflation factor will again be approximately 𝑅 = (2𝑝 − 1)−2 . When all subjects comply, 𝑝 = 1 and 𝑅 = 1. If 95% of subjects in each group comply (5% “drop-in” to the opposite group), 𝑅 = 1∕(1.9 − 1)2 = 1.23. In other words, to preserve the nominal power when there is 5% nonadherence (crossover) of this type in each treatment group, the number of events has to be inflated by 23%. If the nonadherence is 10% in each group, the sample size inflation factor is 56%, illustrating the strong attenuation of the treatment difference that crossover can cause. Similar equations can be derived for dropouts or when the drop-in rates are different in the two groups. 16.9.4
Simulated Lifetables Can Be a Simple Design Tool
Sometimes one does not have a computer program available but still wishes to have a rough idea of how long a trial might take to complete. Assuming that we can calculate the number of events needed from an equation like (16.29 or 16.31), a basic lifetable can be constructed to show how long such a study might take. The only additional piece of information required is the accrual rate. The procedure is fairly simple. For each interval of time, calculate (i) the number of subjects accrued, (ii) the total number at risk, (iii) the number of events in the interval produced by the overall event rate, and (iv) the cumulative number of events. When the
Piantadosi
Date: July 27, 2017
486
Time: 4:45 pm
SAMPLE SIZE AND POWER
TABLE 16.23 Simulated Lifetable to Estimate Study Size and Duration Assuming Exponential Event Times Time Interval 1 2 3 4 5 6 7 ⋮
Number Accrued
Number on Study
Events in Interval
Number Event Free
Cumulative Events
30 30 30 30 30 30 30 ⋮
30 56 78 97 113 127 139 ⋮
4 8 11 14 16 18 19 ⋮
26 48 67 83 97 109 120 ⋮
4 12 23 37 53 71 90 ⋮
cumulative number of events reaches its target, the trial will stop. As an example, suppose that we wish to know how long a trial will take that requires 90 events and has event rates of 0.1 and 0.2 per person-year in the two treatment groups. Accrual is estimated to be 30 subjects per year. Table 16.23 shows the calculations. Here I have assumed the overall event rate is 𝜆 = 2∕(
1 1 2 = 0.14, + )= 𝜆1 𝜆2 10 + 5
that is, the harmonic mean of the two event rates. In the seventh year of the lifetable, the cumulative number of events reaches 90. Thus, this study would require approximately seven years to complete. In the event that accrual is stopped after five years, a similar calculation shows that the trial would take nine years to yield 90 events. This method is approximate because it assumes that all subjects accrued are at risk for the entire time interval, there are no losses from the study, and both the event rate and accrual rate are constant. Also, not all accruals are always accounted for at the end of the table because of round-off error. These problems could be corrected with a more complex calculation, but this one is useful because it is so simple. 16.9.5
Sample Size for Prognostic Factor Studies
The generalization of sample size to cases where the treatment groups are unbalanced suggests a simple method to estimate sample sizes for prognostic factor studies (PFS). Prognostic factor studies are discussed in Chapter 21. Somewhat simplistically, consider designing a prognostic factor study to detect reliably a hazard ratio, Δ, attributable to a single dichotomous variable. If there is more than one factor, we might consider Δ to be the hazard ratio adjusted for all the other covariate effects. If somehow we could make our PFS like a comparative trial, we would “assign” half of the subjects at random to “receive” the prognostic factor. The power of such a study would then be assessed using the methods just outlined. The distribution of the prognostic factor of interest will not be 1:1 in the study cohort in general. Instead, it will be unbalanced, perhaps severely. However, equation (16.34) might be used, for example, to estimate the size of the cohort required. When the imbalance is severe, we know to expect sample sizes in excess of those required to detect the same size effect in a designed comparative trial.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
OTHER CONSIDERATIONS
487
Example 16.24. Consider the surgical staging of lung cancer, which is based on both the size of the tumor and lymph node involvement. Suppose that a molecular marker can detect tumor DNA in lymph nodes that appear negative using conventional pathology evaluation. Up-staging such tumors would place the subject in a higher risk category for recurrence or death following surgical resection. Assume that the frequency of a positive molecular marker in a subject thought to be negative by conventional pathology is 15%. What size cohort would be needed to reliably detect a two-fold increase in risk attributable to the up-staging? The imbalance in the risk factor is 6:1. Assuming that we want 90% power to detect a hazard ratio of 2 using a one-sided 0.025 𝛼-level test, then equation (16.34) indicates that approximately 193 events are required. Here the relative hazard is assumed to be constant across all levels of other risk factors. Such a study could be done retrospectively on banked specimens, assuming that clinical follow-up is available. The number of specimens needed might exceed this, if some follow-ups are censored. The study could also be done prospectively, where considerably more subjects might need to be entered in the study cohort to yield the required number of events in a reasonable time. For a prospective study the production of the required events depends on the absolute event rate in the study cohort, the accrual rate, and the length of follow-up. In either case the sample size exceeds that required for a balanced trial to detect the same magnitude effect, where approximately 90 events are needed.
16.9.6
Computer Programs Simplify Calculations
The discussion in this chapter demonstrates the need for flexible and fast, user-friendly computer software to alleviate hand calculations. Most statisticians who routinely perform such calculations have their own programs, commercial software, tables, or other labor-saving methods. Good software packages can greatly simplify these calculations, saving time, increasing flexibility, and allowing the trialist to focus on the conceptual aspects of design. The clinician does not need software as much as he or she needs a working relationship with a statistical expert. Many simple programs are on the web, but may be variably documented. One computer program to perform some of the calculations described in this chapter is available as described in Appendix A. Apart from the interface, a new program is not necessarily better, because the basic formulas remain valid. In recent years there seems to have been a proliferation of programs on the web, and no sensible way to compare them. Most of the comparative reviews that have been done are older [613, 779]. One program noteworthy for its large scoope is Power and Sample Size (PASS v.14) by NCSS Software [1111].
16.9.7
Simulation Is a Powerful and Flexible Design Alternative
It is not uncommon to encounter circumstances where one of the numerous sample size equations is inadequate or based on assumptions known to be incorrect. In these situations, it is often possible to study the behavior of a particular trial design by simulating the data that might result and then analyzing it. If this process is repeated many times with data generated from the appropriate model with a random component, the distribution of various trial outcomes can be seen. The process of simulating data, sometimes called the Monte Carlo method, is widely used in statistics to study analytically intractable
Piantadosi
Date: July 27, 2017
488
Time: 4:45 pm
SAMPLE SIZE AND POWER
problems. Simulation will be required routinely to assess the operating characteristics of adaptive methods. An example of a situation in which the usual assumptions might be inaccurate occurs in prevention clinical trials studying the effects of treatments such as diet or lifestyle changes on the time to development of cancer. In this circumstance it is likely that the effect of the treatment gradually phases in over time. Thus, the risk or hazard ratio between the two treatment groups is not constant over time. However, a constant hazard ratio is an assumption of all of the power and sample size methods presented above for event time endpoints. It is also difficult to deal analytically with time-varying hazard ratios. Additionally, the loss rates or drop-in rates in the treatment groups are not likely to be constant over time. An appropriate way to study the power of such a trial might be to use simulation. Simulations can facilitate studies in other situations. Examples include studying dose escalation algorithms in DF trials, the behavior of flexible grouped designs, and making decisions about early stopping. Reasonable quantitative estimates of trial behavior can be obtained with 1000–10,000 replications. Provided the sample size is not too great, these can usually be performed in a reasonable length of time on a microcomputer.
16.9.8
Power Curves Are Sigmoid Shaped
I emphasized earlier the need to include the alternative hypothesis when discussing the power of a comparative trial. It is often helpful to examine the power of a trial for a variety of alternative hypotheses (Δ) when the sample size is fixed. Alternatively, the power for a variety of sample sizes for a specific Δ is also informative. These are power curves and generally have a sigmoidal shape with increasing power as sample size or Δ increases. It is important to know if a particular trial design is a point on the plateau of such a curve, or lies more near the shoulder. In the former case, changes in sample size, or in Δ, will have little effect on the power of the trial. In the latter case, small changes can seriously affect the power. As an example, consider the family of power curves shown in Figure 16.6. These represent the power of a clinical trial using the log rank statistic to detect differences in hazard ratios as a function of Δ. Each curve is for a different number of total events and all were calculated from equation (16.31) as 1 − 𝛽 = Φ(𝑍𝛽 ) = Φ
(√
𝑑
) Δ−1 − 𝑍𝛼 . Δ+1
For any sample size, the power increases as Δ increases. In other words, even a small study has a high power against a sufficiently large alternative hypothesis. However, only large trials have a high power to detect smaller and clinically realistic alternative hypotheses, and retain power, even if some sample size is lost. Suppose that we find ourselves having designed a study with a power of 0.8 on the Δ = 2 curve (Fig. 16.6). If difficulties with the trial, such as losses to follow-up, reduce the number of events observed, the power loss will be disproportionately severe compared to if we had designed the trial initially with a power of over 0.9. Thus, reduction of the type II error is not the only benefit from high power. It also makes a trial less sensitive to inevitable inaccuracies in other design assumptions.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
SUMMARY
489
FIGURE 16.6 Power versus number of events for the logrank test.
16.10
SUMMARY
The size of a clinical trial must be motivated by precision, power, or relative evidence, all of which can be formally related. Size quantifications, although approximate, are useful and necessary for designing trials and planning for resource utilization. A hypothesistesting framework is often adopted for sample size considerations. Important design parameters besides sample size and power include the intended type I error rate, the number of events in an event-time study, accrual rate and duration, losses to follow-up, allocation ratio, total study duration, and the smallest treatment difference of clinical importance. The practical use of each of these depends on the type of trial and its specific design. The sample size needed for a phase I trial is usually an outcome of the study, although it is frequently less than 20–25 subjects. The exact sample size cannot usually be specified in advance because the study stops only after prespecified clinical outcome criteria are met. For dose-finding, fairly simple and familiar study designs typically suffice, although newer methods such as the continual reassessment method have better performance characteristics. Designs that explicitly incorporate pharmacokinetic information into the dose escalation may make these studies more efficient. Developmental studies, like safety and activity trials, which look for evidence of treatment efficacy, can employ a fixed sample size. The sample size can be determined as a consequence of the precision required to estimate the response, success rate, or failure rate of the treatment. These trials usually require 25–50 subjects. When faced with evidence of low efficacy, investigators wish to stop a middle development trial as early as possible. This motivates the use of quantitative early stopping rules, sequential, or staged designs that minimize the number of subjects given an unpromising treatment. When evidence about efficacy is available from previous studies, a Bayesian design may be useful.
Piantadosi
Date: July 27, 2017
490
Time: 4:45 pm
SAMPLE SIZE AND POWER
Sample size and power relationships for CTE trials depend on the particular test statistic used to compare the treatment groups. Sample size increases as the type I and type II error rates decrease. Sample size decreases as the treatment difference (alternative hypothesis) increases or as the variance of the treatment difference decreases. For event-time studies, the currency of design is the number of events required to detect a particular hazard ratio. The actual sample size needed to yield the required number of events depends on the degree of censoring and the duration of the trial and its followup. Nonadherence with assigned treatment may increase the required number of study participants dramatically. Noninferiority designs generally require very large sample sizes because evidence is needed to demonstrate that at least a high fraction of the treatment benefit from standard therapy is preserved. Operationally, this requires relatively narrow confidence intervals (high precision), increasing the sample size compared to superiority designs. Besides analytic power and sample size equations, statistical simulation may be a useful way to study the quantitative properties of some trial designs. These can be as simple as a single hypothetical life table constructed under the assumptions of the trial, accrual, and treatment alternatives. More complex simulations could help quantify the effects of nonadherence or other analytically intractable problems in trials. Specialized, flexible computer programs are necessary for performing the required calculations efficiently. Depending on the shape of the power curve, small changes in design parameters can have large or small effects on the power of the study against a fixed alternative.
16.11
QUESTIONS FOR DISCUSSION
1. How large a study is required to estimate a proportion with a precision of ±0.1 (95% confidence limits)? How large a study is required to estimate a mean value with precision ±0.1 standard deviation (95% confidence limits)? 2. How many space shuttle flights, in the presence of two failures, would it take to provide a high degree of confidence that the true failure rate is ≤ 0.1%? Assume that you intend for the upper two-sided 95% confidence bound to have the value 0.001. 3. What is the largest proportion consistent with 0 successes out of 5 binomial trials? Define what you mean by “consistent with.” What about 0 out of 10, 12, 15, and 20 trials? 4. Consider equation 16.21 and the upper confidence bound when the point estimate for a proportion is 0. Derive an analogous rule for hazard rates, assuming that the rate is the intensity of a Poisson distribution. 5. With the normal approximation, confidence limits for a proportion are symmetric around the point estimate. Exact binomial confidence limits can also be constructed to be symmetric around the point estimate. Give some numerical examples for 𝑝 ≠ 0.05, and discuss the advantages and disadvantages of this approach. 6. Suppose that a series of binomial trials is stopped when the first “success” occurs. What distribution should be used for confidence intervals on the probability of success? Give some numerical examples. 7. A middle development trial is planned with a fixed sample size of 25 subjects, expecting 5 (20%) of them to “respond.” Investigators will stop the study if 0 of the
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
QUESTIONS FOR DISCUSSION
8.
9.
10.
11.
12.
13.
14.
15. 16.
17.
18. 19.
491
first 15 subjects respond. What is the chance that a treatment with a true response rate of 20% will be discarded by such an approach? What about a treatment with a true response rate of 5%? Discuss the pros and cons of this simple rule for early stopping. Standard treatment produces successes in 30% of subjects. A new treatment will be pursued only if it seems to increase the success rate by 15%. Write a formal statistical methods section for a protocol using an appropriate design for this situation. A staged middle development trial design uses the following stopping rules: 0/5, 1/10, 2/15 and 5/5, 9/10, 13/15. These correspond approximately to lower and upper 95% confidence bounds on 𝑝 = 0.5. If the study stops early, the estimated proportion of successes will be biased. For example, when the true 𝑝 = 0.5 and the 5/5 upper boundary is hit, the apparent 𝑝 is 0.75. When the lower 2/15 boundary is hit, the apparent 𝑝 is 0.13. Discuss ways to systematically investigate this bias. Suppose that preliminary data are available regarding the mean value of some clinically important measurement from subjects in a middle development trial. If 𝑎 and 𝑏 are the minimum and maximum data values, assume a uniform distribution for the preliminary data and derive a sample size equation based on requiring a precision of 𝑤% of the mean. If 𝑎 = 5, 𝑏 = 35, and 𝑤 = 10%, what sample size is required? A CTE randomized trial is investigating the reduction in mortality attributable to reduced dietary fat. Investigators choose 𝛼 = 0.05 (two-sided) and 𝛽 = 0.10 for the trial design. Discuss the appropriateness of these choices. Suppose the trial in question 11 is studying the use of a synthetic fat substitute instead of dietary modification. Does your opinion of 𝛼 = 0.05 and 𝛽 = 0.10 change? Why or why not? A CTE randomized trial will compare the means of the two equally sized treatment groups using a t-test. How many subjects are required to detect a difference of 0.5 standard deviations using 𝛼 = 0.05 (two-sided) and 𝛽 = 0.20? How does the sample size change if twice as many subjects are assigned to one treatment as the other? Suppose that investigators wish to detect the difference between the proportion of successes of 𝑝1 = 0.4 and 𝑝2 = 0.55, using equal treatment groups with 𝛼 = 0.05 (two-sided) and 𝛽 = 0.10. How large must the trial be? Suppose that the allocation ratio is 1.5:1 instead of 1:1? Investigators will compare survival on two treatments using the logrank statistic. What sample size is required to detect a hazard ratio of 1.5, assuming no censoring? Five-year survival on standard treatment is 40%. If a new treatment can improve this figure to 60%, what size trial would be required to demonstrate it using 𝛼 = 0.05 (two-sided) and 𝛽 = 0.10? How does this estimated sample size compare with one obtained using equation (16.27) for the difference of proportions? Discuss. Compare the sample sizes obtained from equation (16.31) with those obtained by recasting the problem as a difference in proportions and using equation (16.27). Discuss. Considering equation (16.30), what is the “optimal” allocation of subjects in a trial with a survival endpoint? One could employ a Bayesian or frequentist approach to specifying the size of a safety and activity clinical trial. Discuss the strengths and weaknesses of each and when they might be used.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
17 TREATMENT ALLOCATION
17.1 INTRODUCTION A physician in practice uses the diagnosis, prognosis, characteristics of the individual patient, medical knowledge, opinion, patient preference, and circumstantial constraints to recommend a treatment expected to be optimal. These clinical judgments do not directly facilitate assessing relative efficacy among the therapies employed because so many factors are confounded with treatment selection, producing selection bias. Only when differences in outcomes between therapies are very large, or if incremental improvements produce a large change, the confounding may be irrelevant. This physician-autonomous method of treatment selection is seldom employed in clinical trials, despite it being the high art of medicine. A large well described experience of this sort is merely happenstance data. One slight exception to this principle is when subjects are assigned to physiciandriven choice as a control arm in a larger trial. In that case we would probably not dissect the control arm to assess differences among its therapies. Using only a single therapy or algorithm for a disease is a second method of treatment assignment. Exclusive use of one therapy can be the influence of expert opinion, ignorance, tradition, dogma, economic incentives, or individual or institutional preferences. A single therapy is also employed in some single cohort safety and efficacy trials, but such studies are conducted with a high degree of standardization, explicit eligibility criteria, a defined outcome, and can invalidate the therapy. Practice-based use of a single therapy does not control selection bias or confounders, shortcomings evident when comparing cohorts from different times, institutions, or clinics. Treatment assignments made in this way are haphazard with respect to assessing treatment differences, and are also not a consideration in comparative clinical trials. A slight exception here might be in cluster
Clinical Trials: A Methodologic Perspective, Third Edition. Steven Piantadosi. © 2017 John Wiley & Sons, Inc. Published 2017 by John Wiley & Sons, Inc. Companion website: www.wiley.com/go/Piantadosi/ClinicalTrials3e
492
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
INTRODUCTION
493
randomized trials where one clinic, practice, hospital, or city as the unit of randomization would be assigned to a single treatment. A third method for making treatment choices controls alternatives and the allocation process with the purpose of unbiased comparison. Treatment assignments made in this way are the domain of true experiments because extraneous influences are removed, causally isolating the effect of interest. Deterministic (nonrandom) allocation methods, such as alternating assignments, can also remove extraneous influences and validate treatment comparisons, but they are potentially discoverable and therefore subject to bias or manipulation. Properly managed random schemes are not subject to discovery and are reliable and convincing tools to eliminate selection bias. This chapter is only about this third method of treatment allocation. There are three considerations when choosing the technical method of treatment allocation for a clinical trial: eliminating selection bias, producing balanced comparison groups, and quantifying errors attributable to chance. The allocation method should satisfy all three objectives. Operationally, this requires that control over individual subject assignment is removed from the investigator, and is well planned. Credible control requires the assignment process to be nondiscoverable and auditable. 17.1.1
Balance and Bias Are Independent
There is a common serious misconception regarding unbiased treatment assignments for clinical trials. The misconception is that unbiased assignment must yield covariate balanced groups, or conversely that unbalanced groups signify a lack of comparability. An extension of this idea is that imbalance invalidates comparisons even when assignments are unbiased. In truth, bias and balance do not necessarily reflect on each other. The goals for assignments in a comparative trial in order of priority should be: (i) remove selection bias by breaking the natural correlation between treatment assignment and prognosis, (ii) make the treatment assignment process nondiscoverable, and (iii) induce covariate balance. Elimination of selection bias occurs when subject characteristics cannot differentially influence treatment assignment. It is possible and acceptable for chance alone to yield unbalanced groups after selection bias has been eliminated. For example, an unbiased coin can yield an arbitrarily long sequence of heads purely by chance. Just as importantly, excess heads (or tails) would be the rule at all except a very few points in a long series of random coin tosses. Investigators must constrain chance or remove much of its influence to avoid these small annoyances. It may be most appropriate to ignore imbalances that result from randomized assignments, as will be explained below. Moreover, treatment groups can appear balanced when variables are compared marginally (one at a time), but joint or multivariate distributions of factors can be unbalanced. Investigators tend to look almost exclusively at marginal summaries in trials, and they seem to prefer balance more than they value the literal use of randomization. We often prefer to make some treatment assignments nonrandomly rather than cope with the ugliness of chance imbalances. Randomization assures that we can obtain a valid test of the null hypothesis. That test does not depend on any outcome, such as prognostic factor balance. There are several different methods of treatment allocation commonly labeled as “randomization” [816]. Some computation considerations are given by Carey and Gentleman [227].
Piantadosi
Date: July 27, 2017
494
17.2
Time: 4:45 pm
TREATMENT ALLOCATION
RANDOMIZATION
There are three potential problems that motivate the use of randomized treatment assignments in comparative clinical trials. The first is the wide variation in clinical outcomes relative to the magnitude of effects or differences that treatments are likely to produce. Good study design, including randomization, helps control and quantify this variability. The second motive arises because of “confounding by indication,” a type of selection bias [1039]. This is a sort of reverse causality, wherein prognosis determines the treatment— individuals with certain characteristics are more likely to receive particular treatments. This confounds the effects of the therapy with prognosis. Without randomization, selection of subjects for particular treatments could yield differences of a clinically important size purely as a consequence of this bias. Third, researchers are usually unable to characterize why individuals receive a particular treatment or define homogeneous risk strata for nonrandomized comparisons. If we could characterize quantitatively the relationship between risk factors and treatment choice outside a trial, the effect of selection could be undone, yielding an unbiased treatment comparison. Alternatively, if we knew all prognostic factors, we could define homogeneous risk strata and make treatment comparisons within each. The treatment differences could then be pooled across strata to yield a valid overall estimate. The most practical and convincing way to avoid these problems is to use random treatment assignment. In some cultures, chance phenomena were a means of communicating with the gods or were associated with chaos and darkness. Unpredictability was associated with danger. In other cultures, randomness has been a time-honored method to apportion blame and remove human bias. As Barrow [119] says in his brief but interesting discussion of this topic in a cosmological context, … dabbling with random devices was serious theological business, not something to be trifled with or merely studied for the fun of it.
In contemporary Western society, we use randomization more to demonstrate a convincing lack of bias than for superstitious reasons; familiar examples are coin flips before sporting contests and drawing straws for unpleasant tasks. In clinical trials researchers use randomization for its objectivity in removing bias. In Western science, randomization as an experimental method was suggested formally by R. A. Fisher in the 1920s [176] and used in medical studies by Bradford Hill and Richard Doll in Great Britain in the 1940s. In the United States, randomization was advocated by early trialists such as Tom Chalmers and Paul Meier. It continues to be widely used for preventing bias in allocating treatments in comparative clinical trials and is popular because of its simplicity and reliability.

In an early single-masked tuberculosis treatment clinical trial in the United States, randomization was used slightly differently than it is today. The investigators stated:

    The 24 patients were then divided into two approximately comparable groups of 12 each. The cases were individually matched, one with another, in making this division. … Then, by a flip of the coin, one group became identified as group I (sanocrysin-treated) and the other as group II (control) [26].
Today investigators seldom have treatment groups already composed and ready for randomization as a whole. Instead, each individual subject is usually randomized independently; the two procedures are logically equivalent. Even when the experimental unit is a group rather than an individual, we may not be able to randomize all clusters at once. Current practice and theory associated with randomization are discussed by Lachin, Matts, and Wei [880] and [878, 879]. For a discussion of the timing of randomization, see Durrleman and Simon [397].
17.2.1 Heuristic Proof of the Value of Randomization
Questions surrounding the use of randomized treatment assignments seem to arise more from practical and ethical concerns than from scientific legitimacy. In Chapter 2, I noted the objections to randomization by Abel and Koch [2, 3] and Urbach [1499], and indicated the worth of studying their concerns and errors. They reject randomization as a (i) means to validate certain statistical tests, (ii) basis for causal inference, (iii) facilitation of masking, and (iv) method to balance comparison groups. They omit the principal benefit of randomization often discussed in the methodologic literature and emphasized here, namely, the control over unknown prognostic factors.

The value of randomization is not theoretical in the sense of being difficult to appreciate in the real world. Randomization has value both for the theoretical justification of what we do and for practical application. For example, how randomization combined with stratification and blocking induces balanced treatment groups is discussed in Section 17.3.2. Facilitation of masking is obvious and needs no further clarification. With respect to theory justifying action, how randomization does indeed validate an important class of statistical tests is discussed in Section 17.5.3. Randomization as a foundation for causal inference is an important and complex epistemological question discussed by many authors [473, 727, 728, 1304–1306]. Exactly how randomization mitigates concern over unobserved prognostic factors also has roots in theory. Here, I sketch an informal justification for such claims. This discussion does not constitute formal proof, but should provide some clarity on why and how randomization has great value.

Consider a possibly nonrandomized study, where the risk of failure or death observed in a cohort depends on prognostic factors and treatment administered. An individual’s hazard is not identifiable, but the aggregate failure rate in the study cohort is observable. Suppose that the correct model for the estimated overall hazard or failure rate, $\hat{\lambda}$, in the cohort while accounting for an arbitrary number, $k$, of prognostic factors is
$$\log(\hat{\lambda}) = \beta_0 + \beta_1 \bar{X}_1 + \beta_2 \bar{X}_2 + \cdots + \beta_k \bar{X}_k + \theta \bar{T},$$
where each $\bar{X}_i$ indicates the average value for the $i$th prognostic factor or covariate, $\beta_i$ is the strength or effect of that factor assumed constant for everyone, $\bar{T}$ indicates aggregate treatment choice, and $\theta$ is the true effect of treatment. This model corresponds to an exponential failure time process with multiplicative effects on the baseline hazard, $e^{\beta_0}$. No assumption is necessary regarding the numerical values or scales of measurement of the prognostic factors.
If we compare two cohorts denoted by $X$ and $Y$, the hazard ratio, $\Delta = \lambda_X / \lambda_Y$, is estimated by
$$\log(\hat{\Delta}) = \theta(\bar{T}_X - \bar{T}_Y) + \sum_{i=1}^{k} \beta_i (\bar{X}_i - \bar{Y}_i),$$
where the $\beta_0$ intercept term cancels out. We can take $T_i$ to be an indicator variable that has the value 1 if the treatment of interest was used and 0 if the control treatment was used, and assume that the $X$ group received treatment whereas $Y$ was given control therapy. Then $\bar{T}_X = 1$ and $\bar{T}_Y = 0$. Taking expected values, denoted by $E\{\cdot\}$, the expected log hazard ratio estimate is
$$E\{\log(\hat{\Delta})\} = \theta + \sum_{i=1}^{k} \beta_i E\{\bar{X}_i - \bar{Y}_i\}, \qquad (17.1)$$
where each $\beta_i$ is a constant. We see from equation (17.1) that the general circumstance is a problem analytically because the hazard ratio we expect to observe is a combination of both the true treatment effect $\theta$ and the different prognostic factor composition of the groups being compared. This well-known fact is restated statistically by equation (17.1), where the expected value of an arbitrary prognostic factor effect, $E\{\bar{X}_i - \bar{Y}_i\}$, is neither zero nor negligible. The presence of such confounding terms is elucidated by, but not a consequence of, the analytical model. We cannot know $k$, or every $\beta_i$, any of which could be larger in magnitude than $\theta$. Some $\beta_i$’s will be negligible and can be ignored even if the corresponding group imbalance is large. Correlations between individual covariate values and treatment will surely be present if clinicians function as they have been taught. A large $\beta_i$ could render even a small difference between comparison groups influential.

One approach to fixing these problems is to model the covariate effects on the outcome, essentially removing their influence from equation (17.1). Assume for a moment that we have the correct covariate model, and that we have actually measured some of the important factors, allowing such an approach. Even then a serious problem remains: what if we now take $k$ to be the number of unknown covariates? By the same reasoning as above, the covariate model will no longer reliably isolate the true effect of treatment. Remember the fallacy of omnimetrics [633]? From a practical perspective, using this sort of covariate adjustment merely stimulates debate as to the validity of the model and the completeness of measuring confounders. At least in theory, both are deficient.

Luckily, there is a simple way to remove all covariate effects, known and unknown, from equation (17.1), and eliminate the need for the extra assumptions of covariate models. Randomized treatment assignment guarantees that all the expected values on the right side of equation (17.1) will be zero, $E\{\bar{X}_i - \bar{Y}_i\} = 0$, so that $E\{\log(\hat{\Delta})\} = \theta$. In removing all influences except chance on the allocation of individuals to treatment, correlations between prognostic factors and treatment choice are eliminated, and we expect the aggregate composition of comparison groups to be equal for every covariate. Correlations among covariates remain but do not influence treatment choice. Consequently, randomization justifies the expectation of an unbiased estimate of the true treatment effect. Taking control over the treatment allocation process is the hallmark of a true experiment, which then yields the ability to control covariate influences definitively, even
if those prognostic factors are as yet undiscovered. This is the enormous, exclusive, and rarely disputed value of randomization.

Sampling variation can yield covariate imbalances in an actual randomized experiment, just as a series of flips of a fair coin can yield more heads than tails. This does not invalidate the theory supporting the argument just sketched. Imbalances can be reduced by constraining the random process to induce close balance on key factors, while still preventing covariates from differentially influencing treatment assignment. As an example, treatments could be assigned according to a strictly alternating scheme, 𝐴𝐵𝐴𝐵𝐴𝐵𝐴𝐵𝐴 …. This is “effectively random” because it eliminates the influence of prognostic factors on treatment assignment, which is all that is required to zero the expectations in equation (17.1). Additionally, the sequence is always within one assignment of exact balance, so it could be used within strata to balance covariates. The reason we don’t literally do this is not because of a theoretical deficiency, but because of logistical concerns—such assignment schemes are too easily discovered and might allow manipulation of subject entry. Better methods for accomplishing the same benefits are discussed in Section 17.3.2.

As it turns out, we cannot force balance everywhere someone could look. There is an unlimited supply of potential prognostic factors. The best we can do is balance a few key variables while guaranteeing an unbiased estimate of treatment effect using randomization. What are the consequences of the inevitable unbalanced covariates? The answer is that we have already planned for such effects in the error probabilities for the trial. We are entitled to be wrong by chance, and errors due to randomness, including random influential covariate imbalances, are controlled at an acceptable level by design as demonstrated elsewhere in this book. In other words, random error is random error whether it is inexplicable or mediated by chance imbalances. The discussion of randomization tests in Section 17.5.3 might be helpful in understanding this point.
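The heuristic argument above is easy to check numerically. The following sketch is not from the text and uses assumed parameter values: it simulates a single strong unmeasured prognostic factor and shows that when prognosis influences treatment choice (confounding by indication), the crude group comparison is biased away from the true null treatment effect, whereas randomized assignment recovers it on average.

```python
import numpy as np

rng = np.random.default_rng(17)
n, true_theta, beta = 100_000, 0.0, 1.5   # assumed values: null treatment effect, strong covariate

def estimated_effect(randomized):
    x = rng.normal(size=n)                          # unmeasured prognostic factor
    if randomized:
        t = rng.binomial(1, 0.5, size=n)            # assignment ignores prognosis
    else:
        # confounding by indication: worse prognosis makes treatment more likely
        t = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
    log_hazard = beta * x + true_theta * t
    # crude comparison of group means on the log-hazard scale
    return log_hazard[t == 1].mean() - log_hazard[t == 0].mean()

print("nonrandomized:", round(estimated_effect(False), 3))   # biased away from 0
print("randomized:   ", round(estimated_effect(True), 3))    # close to 0
```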
17.2.2 Control the Influence of Unknown Factors
Randomization is an effective means for reducing bias in treatment selection because it guarantees that treatment assignment will not be based on subjects’ prognostic factors. This prevents the investigators from consciously or unconsciously assigning better-prognosis subjects to a treatment that they hope will be superior. Because selection bias can influence outcomes as strongly as many treatment effects, preventing it is an important benefit of randomization. Selection bias is not the only bias that plagues clinical studies. Treatment administration, outcomes assessment, and counting endpoints can all be biased, despite random treatment assignment.

Example 17.1. Consider an unmasked trial with randomization between treatment and no treatment. The placebo effect will contribute to the apparent efficacy of the treatment. From a purely biological perspective we might say it adds a bias to the estimated treatment effect. If the same trial employed a placebo control, the estimated treatment difference would not be biased. Thus, randomization does not guarantee complete objectivity in a trial and must be combined with other design strategies to reduce bias.
As just discussed, the far-reaching benefit of randomization is that it prevents confounding, even if the investigator is unaware that the effects exist and/or has not measured them. The extent to which confounding can be controlled through analysis alone depends on two additional assumptions: (i) the investigators are aware of, and have measured, all the important confounders in the experimental subjects, and (ii) the assumptions underlying the statistical models or other adjustment procedures are known to be correct. Randomization obviates these problems. Thus, randomization corrects many of the important limitations of studies with nonrandomized controls. It prevents confounding by unknown prognostic factors and eliminates differential treatment selection effects, whether due to patient choice or physician judgment. Randomization facilitates defining the start of treatment and controls time trends in the disease, subject population, diagnostic methods, and ancillary care. Interesting examples where investigators did not make the mistake of claiming definitive treatment differences in a nonrandomized comparison, for reasons similar to those just outlined, are given by Green and Byar [633]. One example is based on the report by Byar et al. [217].

There are methodologic problems and biases that randomization cannot eliminate. One concern surrounds the external validity of a trial. Eligibility restrictions designed to reduce variability in trials also limit the spectrum of subjects studied, and consequently can reduce the external validity of findings. This is a conventional view with some legitimacy. However, this perspective understates our ability to generalize treatment effects based on biological similarity, and the scarcity of large treatment–covariate interactions. A second potential problem with randomization is the limited treatment algorithm that most clinical trials study. The experimental structure simplifies and restricts interventions to control them. This may not fully reflect how the treatments are used in actual practice, or how they ultimately evolve. A third problem that randomization alone cannot eliminate is bias in the ascertainment of outcomes. Treatment masking and other design features are needed to eliminate this bias.
17.2.3 Haphazard Assignments Are Not Random
In many nonexperimental study designs the methods by which subjects came to receive their treatments are unknown. Presumably the treating physicians used information at hand in the usual way to select treatments that they felt were best. At worst, the physicians were ineffective at this, and the treatments were assigned in a haphazard fashion. Even in such a case, treatment comparisons based on these data lack credibility and reliability compared with those from a randomized trial. Haphazard assignments are not random and cannot be relied upon to be free of bias. When reviewing the results of a comparative trial, we want to be convinced that accidents of selection and other biases are not responsible for the differences observed. If one could be certain that subjects presented themselves for study entry in a purely random fashion and that investigators would use no prognostic information whatsoever to assign treatments, then virtually any allocation pattern would be effectively random. For example, a trial might employ alternating treatment assignments, under the assumption that the subjects available are a temporally random sample of those with the disease under study. While this might be credible, it relies on the additional assumption of random arrival in the clinic. Investigators would prefer to control randomization convincingly instead of leaving it to chance!
17.2.4 Simple Randomization Can Yield Imbalances
Simple randomization makes each new treatment assignment without regard to those already made. In other words, a simply randomized sequence has no memory of its past history. While this has important advantages, it can also produce some unwanted effects in a clinical trial. The most common concern is an imbalance in the number of subjects assigned to one of the treatments. Our natural expectation that random assignments should yield numerical balance is incorrect. As unintuitive as it sounds, a random series of assignments spends the majority of its time favoring one group or the other. Returns to exact balance are relatively rare, although they are guaranteed to occur. Gamblers take note.

For example, suppose we are studying two treatments, A and B, and we randomize a total of $2N$ subjects with probability $p = 1/2$ to each treatment group. When the trial is over, the size of the two treatment groups will not, in general, be equal. The chance of finishing $2N$ allocations with an excess in one group that equals exactly $2r$ assignments is
$$\Pr\{\mathrm{excess} = 2r\} = \binom{2N}{N + r} 2^{-2N},$$
for $0 \le r \le N$. For example, when $2N = 100$, the chance of the randomization yielding exactly 50 subjects per group ($r = 0$) is only about 8%. It is informative to consider the probability of imbalances greater than or equal to some specified size. If properties of the binomial distribution are used, the expected number of assignments to each of two groups is $E\{N_A\} = E\{N_B\} = Np$, where $N$ is the total number of assignments, and the variance of the number of assignments is $\mathrm{var}\{N_A\} = \mathrm{var}\{N_B\} = Np(1-p)$. When there are 100 total assignments and $p = 1/2$, the variance of $N_A$ is 25. An approximate 95% confidence bound on the number of assignments to treatment A is $Np \pm 1.96 \times \sqrt{25} \approx 50 \pm 10$. Thus, we can expect more than a 60/40 imbalance in favor of A (or B) about 5% of the time using simple randomization.

This problem of imbalance becomes more noticeable when we must account for several prognostic factors. Even if the number of treatment assignments is balanced, the distribution of prognostic factors is unlikely to be. Suppose that there are $k$ independent dichotomous prognostic factors, each with probability 0.5 of being “positive.” Suppose we compare two treatment groups with respect to the proportion of subjects with a positive prognostic factor. Using a type I error of 5% for the comparison, the chance of balance on any one variable is 0.95. The chance of balance on all $k$ factors is $0.95^k$ (assuming independence). Thus the chance of finding at least one factor unbalanced is $1 - 0.95^k$. When $k = 5$, the chance of at least one statistically significant imbalance is 0.23, which explains the frequency with which this problem is noticed.

From a practical point of view, large imbalances in the number of treatment assignments or the distribution of covariates seem to lessen the credibility of trial results. Even though imbalances that are the products of chance are theoretically ignorable, and statistical methods can account for the effects of such imbalances, we are usually left
feeling uneasy by large or significant differences in baseline characteristics. For large, expensive, and unique trials to be as credible as possible, the best strategy is to prevent the imbalances from occurring in the first place. One way to control the magnitude of imbalances in these studies is to constrain randomization. There are two general methods of constrained randomization—blocked randomization and minimization. These will be discussed in the next sections.
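The imbalance probabilities quoted above can be verified directly from the binomial distribution. The sketch below is illustrative only; the function names are ours and are not part of the text.

```python
from math import comb

def pr_exact_balance(total):
    """Chance that simple randomization of `total` subjects ends exactly balanced."""
    return comb(total, total // 2) * 0.5 ** total

def pr_split_at_least(total, larger):
    """Chance that either group ends with at least `larger` of the `total` assignments."""
    return 2 * sum(comb(total, j) for j in range(larger, total + 1)) * 0.5 ** total

print(round(pr_exact_balance(100), 3))        # about 0.08
print(round(pr_split_at_least(100, 60), 3))   # about 0.06, near the 5% normal-approximation bound
print(round(1 - 0.95 ** 5, 2))                # chance at least 1 of 5 factors appears unbalanced
```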
17.3 CONSTRAINED RANDOMIZATION

17.3.1 Blocking Improves Balance
Randomization in blocks is a simple constraint that improves balance in the number of treatment assignments in each group [1001]. A “block” contains a prespecified number and proportion of treatment assignments. The size of each block must be an exact integer multiple of the number of treatment groups. A sequence of blocks makes up the randomization list. Within each block the order of the treatments is randomly permuted, but they are exactly balanced at the end of each block. After all the assignments in each block are made, the treatment groups are exactly balanced as intended. To see that this scheme is constrained randomization, consider two treatment groups, 𝐴 and 𝐵, and blocks of size 𝑁𝐴 + 𝑁𝐵 . During the randomization, the current number of assignments in a block made to treatment 𝐴 is 𝑛𝐴 and likewise for 𝐵. At the start of each new block, we reset 𝑛𝐴 = 𝑛𝐵 = 0. Then, the blocking constraint can be produced by setting the probability of assignment to treatment 𝐴 to Pr[𝐴] =
$$\frac{N_A - n_A}{N_A + N_B - n_A - n_B} \qquad (17.2)$$
for each assignment. When $N_A$ assignments have been made in a block, $n_A = N_A$ and the probability of getting A is 0. When $N_B$ assignments have been made, the probability of getting A is 1. This is a useful way to generate blocked assignments by computer. To make each new assignment, we compare a random number, $u$, uniformly distributed on the interval (0,1), to $\Pr[A]$. If $u \le \Pr[A]$, the assignment is made to A. Otherwise, it is made to B.

Suppose that there are two treatments, A and B, and blocks of size 4 are used. There are six possible permutations of two treatments in blocks of size 4 (Table 17.1). The assignment scheme consists of a string of blocks chosen randomly (with replacement) from the set of possible permutations. If we are unlucky enough to stop the trial halfway through a block in which the first two assignments are A’s, the imbalance will be only two extra in favor of treatment A. This illustrates that the maximum imbalance is one-half of the block size.

The number of orderings of the treatments within small blocks is not very large. Usually, it makes sense to use relatively small block sizes to balance the assignments at frequent ($2b$) intervals. Suppose there are 2 treatments and the block size is $2b$. Then, there are $\binom{2b}{b}$ different permutations of the assignments within blocks. If the trial ends exactly halfway through a block, the excess number of assignments in one group will be no greater than $b$ and is likely to be less. Blocking in a similar fashion can be applied to unbalanced allocations or to more than two treatments.
TABLE 17.1 All Possible Permutations of Two Treatments in Blocks of Size 4

Within-Block                     Permutation Number
Assignment Number      1      2      3      4      5      6
1                      A      A      A      B      B      B
2                      A      B      B      B      A      A
3                      B      A      B      A      B      A
4                      B      B      A      A      A      B
For more than two treatments, the number of permutations can be calculated from multinomial coefficients and the assignments can be generated from formulae similar to equation (17.2).

In an actual randomization, all blocks do not have to be the same size. Varying the length of each block (perhaps randomly) makes the sequence of assignments appear more random and can help prevent discovery. These sequences are not much more difficult to generate or implement than fixed block sizes. In practice, one could choose a small set of block sizes from which to select. For example, for two treatments, we could randomly choose from among blocks of length 2, 4, 6, or any multiple of 2. For three treatments, the block sizes would be multiples of 3.
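Equation (17.2), together with the uniform-draw comparison described earlier, gives a simple computer algorithm for generating blocked assignments. The sketch below is one possible implementation, not a prescribed one; it is hard coded for two groups with equal allocation.

```python
import random

def blocked_sequence(n_blocks, block_size=4, seed=None):
    """Two-arm 1:1 blocked randomization using the sequential rule of equation (17.2)."""
    rng = random.Random(seed)
    assignments = []
    for _ in range(n_blocks):
        a_left = b_left = block_size // 2
        while a_left + b_left > 0:
            pr_a = a_left / (a_left + b_left)       # equation (17.2)
            if rng.random() <= pr_a:
                assignments.append("A")
                a_left -= 1
            else:
                assignments.append("B")
                b_left -= 1
    return assignments

seq = blocked_sequence(3, seed=1)
print("".join(seq))    # exactly two A's and two B's in every block of four
```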
17.3.2 Blocking and Stratifying Balances Prognostic Factors
To balance several clinically important prognostic factors, the randomization can be blocked and stratified. Every relevant prognostic factor combination can define an individual stratum, with blocking in each. Then the treatment assignments will be balanced in each stratum, yielding balanced prognostic factors in each treatment group. To be most advantageous, stratification should be used only for prognostic factors with relatively strong effects, such as risk ratios over about 2.0. The balance resulting from blocked stratification can improve the power of trials by reducing unwanted variation [1172]. Other advantages to the balance induced by stratification and blocking include reduced type I and II error rates, higher efficiency, and improved subgroup analyses. These have been emphasized by Kernan et al. [839] who also present useful information on the current practice of stratification and suggestions for reporting its use. An example using blocks of size 4 should help make the procedure clear (Table 17.2). Suppose that there are two prognostic factors, age (young versus old) and sex. To balance both, we would form four strata: young-female, young-male, old-female, and old-male. Blocked assignments are generated for each stratum. Each subject receives the next treatment assignment from the stratum into which he or she fits and the strata do not necessarily all fill at the same rate. In a blocked stratified trial such as this, we could stop the trial when all strata are half filled, all with an excess of assignments on 𝐴 (or 𝐵). For example, the sequence of assignments depicted in Table 17.2 could end halfway through all current blocks and with the last two assignments being 𝐴𝐴 (or 𝐵𝐵). Block 2 in the old-female stratum is of this type. This situation creates the largest imbalance that could occur. For a trial with 𝑏 assignments of each treatment in every block (block size = 2𝑏) and 𝑘 strata, the size of the maximum imbalance using blocked strata is 𝑏 × 𝑘. In other words, each stratum could yield an excess of 𝑏 assignments for the same treatment. The same
TABLE 17.2 Example of Blocked Stratified Treatment Assignments with Equal Allocation in Two Groups

                                Strata
Block           Males                   Females
Number     Young      Old          Young      Old
1            A          B             B          A
1            B          B             A          B
1            B          A             B          A
1            A          A             A          B
2            A          B             A          B
2            A          A             B          B
2            B          B             A          A
2            B          A             B          A
⋮            ⋮          ⋮             ⋮          ⋮
is true if 𝑏 is the largest number of assignments possible to one treatment in a block in each of 𝑘 strata when a variable block size is used. If 𝑏𝑗 represents the largest excess number of assignments possible in a block in the 𝑗 th stratum (not necessarily the same for all strata), the largest imbalance (total excess number of assignments on one group) possible is 𝑏1 + 𝑏2 + … + 𝑏𝑘 . Admittedly, we would have to be very unlucky for the actual imbalance to be at the maximum when the study ends. For example, suppose that there are 𝑏 assignments of each type in every block and the trial ends randomly and uniformly within a block. In other words, there is no increased likelihood of stopping on either a particular treatment assignment or position within a block. Assume that all blocks are the same size. The chance of stopping in every stratum exactly halfway through a block is 𝑢=
$$u = \left(\frac{1}{2b}\right)^{k},$$
and the chance that the first half of all $k$ blocks has the same treatment assignment is
$$v = \binom{2b}{b}^{-k}.$$
Then $uv$ is the chance of ending the trial on the worst imbalance. For 2 treatments and 4 strata with all blocks of size 4 ($b = 2$), as in the example above, the maximum imbalance is $2 \times 4 = 8$. The chance of stopping with an imbalance of 8 in favor of one treatment or the other is only
$$2uv = 2 \times \left(\frac{1}{4}\right)^{4} \times \binom{4}{2}^{-4} = 6.03 \times 10^{-6}.$$
The chance of a lesser imbalance is greater.

Stratified randomization without blocking will not improve balance beyond that attainable by simple randomization. In other words, simple randomization in each of several strata is equivalent to simple randomization overall. Because a simple random process has no memory of previous assignments, stratification serves no purpose unless the
assignments are constrained within strata. Another type of constraint that produces balance using adaptive randomization is discussed below.
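The worst-case probability computed above is easy to verify by direct arithmetic. A one-off check, using the same b = 2 and k = 4 as the example:

```python
from math import comb

b, k = 2, 4                       # blocks of size 2b = 4 in each of k = 4 strata
u = (1 / (2 * b)) ** k            # every stratum stops exactly halfway through a block
v = comb(2 * b, b) ** -k          # the first half of each current block is one treatment
print(2 * u * v)                  # about 6.03e-06, as in the text
```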
17.3.3 Other Considerations Regarding Blocking
One can inadvertently counteract the balancing effects of blocking by having too many strata. For example, if each subject winds up in his or her own stratum uniquely specified by a set of prognostic factors, the result is equivalent to having used simple randomization. Plans for randomization must take this possible problem into account and use only a few strata, relative to the size of the trial. In other words, most blocks should be filled because unfilled blocks permit imbalances.

Block sizes that are too large can also counteract the balancing effects of blocking. Treatment assignments from a very large block are essentially equivalent to simple randomization, except toward the end of the block. For example, consider equation (17.2) for $\Pr[A]$ when $N_A$ and $N_B$ are very large. Then $\Pr[A] \approx 1/2$ for all the early assignments in a block. Because at least the last assignment in each block (and possibly up to half of some blocks) is determined from the previous assignments, the block size should not be revealed to clinical investigators. Otherwise, some assignments are discoverable. Random block sizes can be used, provided the sizes are small, to reduce or eliminate the chance that an investigator will acquire knowledge about future treatment assignments. For example, in a two-group trial, choosing randomly from among block sizes of 2, 4, and 6 yields assignments that would look almost purely random if revealed, except that they would be fairly closely balanced.

Blocking can also be used with unequal treatment allocation. For example, suppose that we wish to have a 2:1 allocation in favor of treatment A. Block sizes that are multiples of three could be used with two-thirds of the assignments allocated to treatment A. At the end of each block the treatment allocation will be exactly 2:1 in favor of treatment A.

One advantage of blocked assignments, with or without strata, is that they can be generated on paper in advance of the trial and placed in notebooks accessible to the staff responsible for administering the treatment allocation process. An excess number of assignments can be generated and used sequentially as accrual proceeds. The resulting stream of assignments will have the intended properties and will be easy to prepare and use. Excess assignments not used at the end of the study can be ignored. This method is simple but reliable.

The use of stratified blocked randomization is an explicit acknowledgment that the variability due to the blocking factor(s) is sizable but extraneous to the treatment effect. When analyzing a study that has used this method of treatment allocation, it is important to account for the stratification. This point has been discussed by several authors [484, 1397]. If the strata are ignored in the analysis, it is equivalent to lumping that source of variability into the error or variance term of the test statistic. As a result, the variance will be larger than necessary and the efficiency of the test will be lowered. Blocking and stratifying the randomization has virtually no drawbacks, and properly accounting for stratification in the analysis increases precision. Consequently, these are very useful devices for controlling extraneous variation, and should be considered whenever feasible.
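As a final illustration of these points, the sketch below generates advance assignment lists by stratum using randomly chosen block sizes and an optional unequal (2:1) allocation. It is illustrative only: the stratum labels and block-size choices are arbitrary, and permuting whole blocks is equivalent to the sequential rule of equation (17.2).

```python
import random

def stratified_blocked_lists(strata, n_needed, ratio=(1, 1), multiples=(1, 2, 3), seed=None):
    """Advance assignment lists by stratum with randomly varying block sizes.

    `ratio` is the A:B allocation (e.g., (2, 1) for 2:1); block sizes are random
    multiples of sum(ratio). Excess assignments at the end are simply never used.
    """
    rng = random.Random(seed)
    lists = {}
    for stratum in strata:
        seq = []
        while len(seq) < n_needed:
            m = rng.choice(multiples)
            block = ["A"] * (ratio[0] * m) + ["B"] * (ratio[1] * m)
            rng.shuffle(block)                       # random permutation within the block
            seq.extend(block)
        lists[stratum] = seq
    return lists

for stratum, seq in stratified_blocked_lists(
        ["young-male", "old-male", "young-female", "old-female"],
        12, ratio=(2, 1), seed=3).items():
    print(stratum, "".join(seq))
```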
17.4 ADAPTIVE ALLOCATION
Adaptive allocation, also discussed in Section 15.2.2, is a process in which the probability of assignment to the treatments in a clinical trial varies according to the current balance, composition, or outcomes of the groups. In a limited sense this same characteristic is true for blocked randomization. However, adaptive allocation is a more general idea and is often motivated by a desire to minimize the number of subjects entered on what will be shown to be the inferior treatment. General discussions of this method can be found in Refs [726], [406], and [1218]. Urn designs (biased coin randomization) and minimization are two types of adaptive treatment allocation. Theory behind response-adaptive randomization is discussed by Hu and Rosenberger [744].

17.4.1 Urn Designs Also Improve Balance
An alternative to blocking to prevent large imbalances in the numbers of assignments made to treatment groups is a technique based on urn models [1534]. Urn models are ubiquitous devices in probability and statistics, which are helpful in illustrating important concepts. Imagine an urn containing one red and one white ball. To make a treatment assignment, a ball is selected at random. If the ball is red, the assignment is to treatment 𝐴. A white ball yields treatment 𝐵. After use, the ball is replaced in the urn. This procedure is equivalent to simple randomization with probability of 1∕2 of receiving either treatment. To discourage imbalances, however, a slightly different procedure might be used. Suppose that we begin with one ball of each color in the urn and the first ball picked is red. Then the original ball plus an additional white ball might be placed in the urn. With the next choice there is one red ball and two white balls in the urn, yielding a higher probability of balancing the first treatment assignment. With each draw the chosen ball and one of the opposite color are placed in the urn. Suppose that 𝑛𝐴 and 𝑛𝐵 represent the current number of red and white balls, respectively, in the urn. Then, for each assignment, Pr[𝐴] =
$$\frac{n_A}{n_A + n_B}. \qquad (17.3)$$
When the number of draws is small, there tends to be tighter balance. As the number of draws increases, the addition of a single ball is less important and the process begins to behave like 1:1 simple randomization. At any point, when equal numbers of assignments have been made for each group, the probability of getting a red or white ball on the next draw is 1∕2. This process could also be used for stratified assignments. This urn procedure might be useful in trials where the final sample size is going to be small and tight balance is needed. These assignments can also be generated in advance and stratified, if necessary, to balance a prognostic factor. For example, Table 17.3 shows the first few treatment assignments from a two-group, two-stratum sequence. Any sequence is possible, but balanced ones are much more likely.
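The urn scheme is equally simple to program. This sketch follows the description above: one ball of each type to start, and a ball of the opposite type added after each draw, so the lagging arm becomes more probable.

```python
import random

def urn_sequence(n, seed=None):
    """Urn (biased coin) randomization for two arms labeled A and B."""
    rng = random.Random(seed)
    balls = {"A": 1, "B": 1}
    sequence = []
    for _ in range(n):
        pr_a = balls["A"] / (balls["A"] + balls["B"])    # equation (17.3)
        arm = "A" if rng.random() < pr_a else "B"
        sequence.append(arm)
        balls["B" if arm == "A" else "A"] += 1           # add a ball of the opposite type
    return sequence

seq = urn_sequence(20, seed=7)
print("".join(seq), "  A:", seq.count("A"), " B:", seq.count("B"))   # counts tend to stay close
```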
17.4.2 Minimization Yields Tight Balance
Other schemes to restrict randomization to yield tighter balance in treatment groups can be based on minimization, as suggested by Pocock and Simon [1218] and Begg and Iglewicz [136]. To implement minimization, a measure of imbalance based on the current number of treatment assignments (or the prognostic factor composition of the
TABLE 17.3 Example of Stratified Treatment Assignments Using an Urn Model

              Stratum 1                             Stratum 2
n_a   n_b   Pr[A]   Treatment          n_a   n_b   Pr[A]   Treatment
 1     0    0.500      A                1     0    0.500      A
 1     1    0.333      B                1     1    0.333      B
 2     1    0.500      A                1     2    0.500      B
 3     1    0.400      A                1     3    0.600      B
 4     1    0.333      A                1     4    0.667      B
 4     2    0.286      B                2     4    0.714      A
 5     2    0.375      A                3     4    0.625      A
 5     3    0.333      B                4     4    0.556      A
 5     4    0.400      B                4     5    0.500      B
 6     4    0.455      A                5     5    0.545      A
 7     4    0.417      A                5     6    0.500      B
 8     4    0.385      A                6     6    0.538      A
 8     5    0.357      B                6     7    0.500      B
 8     6    0.400      B                7     7    0.533      A
 8     7    0.438      B                7     8    0.500      B
 9     7    0.471      A                7     9    0.529      B
10     7    0.444      A                7    10    0.556      B
10     8    0.421      B                8    10    0.579      A
11     8    0.450      A                8    11    0.550      B
11     9    0.429      B                9    11    0.571      A
 ⋮     ⋮      ⋮        ⋮                ⋮     ⋮      ⋮        ⋮

Pr[A] is the probability of assignment to treatment A calculated prior to the assignment.
treatment groups) is used. Before the next treatment assignment is made, the imbalance that will result is calculated in two ways: (i) assuming the assignment is made to A, and (ii) assuming it is made to B. The treatment assignment that yields the smallest imbalance is chosen. If the imbalance will be the same either way, the choice is made randomly. An example of a minimization method is given in Section 15.2.2.

Minimization can produce tighter balance than blocked strata. However, it has the drawback that one must keep track of the current measure of imbalance. This is not a problem for computer-based randomization schemes, which are frequently used today. Also, when minimization is employed, none of the assignments (except perhaps the first) is necessarily made purely randomly. This would seem to prevent the use of randomization theory in analysis. However, treating the data as though they arose from a completely randomized design is commonly done and is probably inferentially correct. At least in oncology, many multi-institutional “randomized” studies actually employ treatment allocation based on minimization.
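A minimal sketch of the idea follows. It uses a simplified marginal imbalance score rather than the full Pocock–Simon measure, and the factor names are hypothetical.

```python
import random

def minimization_assignment(history, new_factors, rng=None):
    """Assign the next subject by minimizing marginal imbalance (a simplified
    Pocock-Simon style rule). `history` is a list of (arm, factor_dict) pairs."""
    rng = rng or random.Random()

    def total_imbalance(candidate):
        score = 0
        for factor, level in new_factors.items():
            counts = {"A": 0, "B": 0}
            for arm, factors in history:
                if factors.get(factor) == level:
                    counts[arm] += 1
            counts[candidate] += 1                      # pretend we assign `candidate`
            score += abs(counts["A"] - counts["B"])
        return score

    scores = {arm: total_imbalance(arm) for arm in ("A", "B")}
    if scores["A"] == scores["B"]:
        return rng.choice(["A", "B"])                   # break ties at random
    return min(scores, key=scores.get)

history = []
for subject in [{"age": "old", "sex": "F"}, {"age": "old", "sex": "M"},
                {"age": "young", "sex": "F"}, {"age": "old", "sex": "F"}]:
    arm = minimization_assignment(history, subject, random.Random(5))
    history.append((arm, subject))
    print(subject, "->", arm)
```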
17.4.3 Play the Winner
Other methods for making treatment allocations in clinical trials have been suggested to meet certain important constraints. For example, Zelen [1604] and Wei and Durham [1533] suggested making treatment assignments according to a “play the winner” rule, to minimize the number of subjects who receive an inferior treatment in a comparative trial. One way of implementing this scheme is as follows: Suppose that randomization
to one of two groups is made by the logical equivalent of drawing balls labeled A or B from an urn, with replacement. For the first subject, the urn contains one ball of each type (or an equal number of each). The first subject is assigned on the basis of a random draw. If the assigned treatment “fails,” a ball of the other type is added to the urn. The next subject therefore has a higher probability of receiving the other treatment. If the first treatment “succeeds,” a ball of the same type is added to the urn. Thus, the next subject has a higher probability of receiving the same therapy. This method has the advantage of preferentially using what appears to be the best treatment for each assignment after the first. However, implementing it in the manner described requires being able to assess the final outcome in each study subject before placing the next subject on the trial. This may not always be feasible.

A more flexible approach is to use a multistage design with adjustment of the allocation ratio, to favor the treatment arm that appears to be doing the best at each stage. This approach can also be used with outcomes that are not dichotomous. Some investigators believe that trials designed in this way may be viewed more favorably than conventional designs by subjects and Institutional Review Boards.

A design of this type was used in a clinical trial of extracorporeal membrane oxygenation (ECMO) versus standard therapy for newborn infants with persistent pulmonary hypertension [120]. Based on developmental studies over 10 years, ECMO appeared to be a potentially promising treatment for this fatal condition when it was tested in a clinical trial in 1985. With the Wei and Durham biased coin randomization, the trial was to be stopped when the urn contained 10 balls favoring one treatment. This stopping rule was chosen to yield a 95% chance of selecting the best treatment when $p_1 \ge 0.8$ and $p_1 - p_2 > 0.4$, where $p_1$ and $p_2$ are the survival probabilities on the best and worst treatments, respectively. Some of the data arising from the trial (in the stratum of infants weighing more than 2 kilograms) are shown in Table 17.4. The first subject was assigned to ECMO and survived. The second subject received conventional treatment and died. All subsequent subjects were assigned to ECMO. The trial was terminated after 10 subjects received ECMO, but 2 additional subjects were nonrandomly assigned to ECMO. Because of the small sample size, one might consider testing the data for statistical significance using Fisher’s exact test, which conditions on the row and column totals of the 2 × 2 table. However, the analysis should not condition on the row and column totals because, in an adaptive design such as this, they contain information about the outcomes. Accounting for the biased coin allocation, the data demonstrate a marginally significant result in favor of ECMO [1532]. This trial was unusual from a number of perspectives: the investigators felt fairly strongly that ECMO was more effective than conventional therapy before the study was started, an adaptive design was used, consent was sought only for subjects assigned to experimental therapy (Section 17.7), and the stopping criterion was based on selection or ranking of results. Ware and Epstein [1528] discussed this trial, stating that from the familiar hypothesis testing perspective, the type I error rate for a trial with this design is 50%. The methodology has also been criticized by Begg [134].
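A simple simulation makes the behavior of the play-the-winner urn concrete. The sketch below is not a reconstruction of the ECMO trial: the success probabilities are assumed values, and the stopping rule is interpreted as an excess of 10 balls favoring one treatment.

```python
import random

def play_the_winner(success_prob, stop_excess=10, seed=None):
    """Play-the-winner urn: a success adds a ball of the same type, a failure adds a
    ball of the other type. Stops when one type exceeds the other by `stop_excess`."""
    rng = random.Random(seed)
    balls = {"ECMO": 1, "Control": 1}
    results = []
    while abs(balls["ECMO"] - balls["Control"]) < stop_excess:
        pr_ecmo = balls["ECMO"] / (balls["ECMO"] + balls["Control"])
        arm = "ECMO" if rng.random() < pr_ecmo else "Control"
        success = rng.random() < success_prob[arm]            # outcome known immediately
        favored = arm if success else ("Control" if arm == "ECMO" else "ECMO")
        balls[favored] += 1
        results.append((arm, success))
    return results

trial = play_the_winner({"ECMO": 0.9, "Control": 0.2}, seed=11)   # assumed survival rates
print(len(trial), "subjects;", sum(arm == "ECMO" for arm, _ in trial), "assigned to ECMO")
```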
The ECMO trial is also unusual because of the subsequent controversies that it sparked. Not all of the discussions stem from the design of the trial; much is a consequence of the clinical uncertainties. However, alternative designs may have helped resolve the clinical questions more clearly. For example, a design with an unequal treatment allocation (Section 17.6) might have reduced some of the uncertainty. For a clinical review of the
TABLE 17.4 Results from an ECMO Clinical Trial using Play the Winner Randomization

Subject Number    Treatment Assignment    Outcome
1                 ECMO                    Success
2                 Control                 Failure
3                 ECMO                    Success
⋮                 ⋮                       ⋮
10                ECMO                    Success
ECMO story, see the paper by O’Rourke [1163]. Ethical issues were discussed by Royall [1296]. A confirmatory ECMO trial was performed (Section 18.1.4), with resulting additional controversy [1526].
17.5 OTHER ISSUES REGARDING RANDOMIZATION

17.5.1 Administration of the Randomization
The beneficial effects of randomization for reducing bias can be undone if future treatment assignments are discoverable by investigators. Potentially this information could be available if all treatment assignments are listed in advance of the study in a notebook or folder and dispensed as each subject goes on study, as is commonly done. This problem could arise if clinical investigators responsible for placing subjects on study and selecting treatment are involved with, or have access to, the randomization process. To convincingly demonstrate that this has not occurred, the administration of the randomization process should be physically separated from the clinical investigators. In multicenter trials, this is commonly accomplished by designating one of the institutions, or an off-site location, to serve as a trial coordinating center. Treatment assignments are then made by telephone or by secure computer access. The concern about access to treatment assignments may be more relevant in single-institution studies, where the clinical investigators often administratively manage their own trials. It is sometimes necessary for the investigators to manage treatment assignment for reasons of accessibility or convenience. When this is required, extra care should be taken to keep the assignments secure. Example 17.2. Investigators conducted a multicenter randomized trial of low tidal volume versus standard ventilation in subjects with adult respiratory distress syndrome. Because the subjects were all ventilator-dependent and hospitalized in intensive care units (ICU), it was necessary to have the capability to randomize treatment assignments 24 hours each day. This could not be accomplished easily using an independent study coordinator. Instead, it was more efficient to designate one of the participating ICUs as the randomization center and have the nursing staff always available there perform an occasional randomization. This ICU environment is potentially “hostile” to performing such tasks rigorously. Nevertheless, a suitably written computer program helped the randomization proceed routinely. The program contained password protection for each randomization and safeguards to prevent tampering with either the program itself or the sequence of treatment assignments. Initially stratified randomized assignments from the
computer program were made with a block size of 8. Interestingly, after 3 of the first 4 assignments in one stratum were made to the same treatment group, the ICU nurses became convinced that the program was not functioning “randomly.” This illustrates the scrutiny that the sequence of assignments is likely to get. Reassuring the investigators that the program was operating correctly amounted to a strong hint about the (fixed) block size. Consequently, the randomization program was modified to use a variable block size and the staff was reassured that all was well.

This example illustrates that under special circumstances, treatment assignments can be protected from tampering, while still permitting flexibility and convenience. Of course, no safeguards are totally effective against a determined unscrupulous investigator. For example, one might be able to defeat seemingly secure randomization programs by running two copies. The second copy would permit experimentation that could help to determine the next treatment assignment. The most secure randomization for 24-hour-per-day availability is provided by centralized, automated, dial-in computer applications kept at a location remote from the clinic.

The use of sealed envelopes managed by clinical investigators is sometimes proposed as a method of randomization. Although expedient, it should be discouraged as a method of treatment allocation because it is potentially discoverable, error-prone, and will not provide as convincing evidence of lack of bias as treatment allocation methods that are independent of the clinicians. Even so, sometimes sealed envelopes are the only practical allocation method. This may be the case when trials are conducted in settings with limited resources or research infrastructure, such as some developing countries or in some clinical settings where 24-hour-per-day treatment assignment is needed. Assignments using sealed envelopes should be periodically audited against a master list maintained centrally.
17.5.2 Computers Generate Pseudorandom Numbers
I am always fascinated that some investigators seem reluctant to employ truly random methods of generating numbers in preference to computer-based (pseudorandom) methods. Consider the familiar Plexiglas apparatus filled with numbered ping-pong balls, which is often used to select winning numbers for lottery drawings. This is a reliable and convincingly random method for selecting numbers and could well be used for treatment assignments in clinical trials, except that it is slow and inconvenient. Computers, in contrast, generate pseudorandom numbers quickly. The numbers are generated by a mathematical formula that yields a completely predictable sequence if one knows the constants in the generating equation and the starting seed. The resulting stream of numbers will pass some important tests for randomness, despite the determinism, provided that the seed and constants are chosen wisely. Excellent discussions of this topic are given by Knuth [860] and Press et al. [1233]. In the recent past, a commonly used algorithm for generating a sequence of pseudorandom numbers, {𝑁𝑖 }, was the linear congruential method. This method uses a recurrence formula to generate 𝑁𝑖+1 from 𝑁𝑖 , which is 𝑁𝑖+1 = 𝑎𝑁𝑖 + 𝑐 (mod 𝑘),
where 𝑎 is the multiplier, 𝑐 is the increment, and 𝑘 is the modulus. Modular arithmetic takes the remainder when 𝑁 is divided by 𝑘. This method can produce numbers that appear random and distributed uniformly over the interval (0,1). However, it has important limitations including that the stream of numbers can be serially correlated, and it will eventually repeat after no more than 𝑘 numbers. The sequence will have a short period if 𝑎, 𝑐, and 𝑘 are not carefully chosen, so one must be cautious about using untested random number generators built-in to computer systems. Also the formula and number stream are sensitive to the hardware of the computer system. These are important points when large numbers of random values are needed, and when they must pass sophisticated tests for randomness. This is the case in statistical simulation where the linear congruential method is deficient and obsolete. Given the importance of simulation to statistics broadly, and specifically to determine the operating characteristics of complex clinical trial designs, alternatives such as those discussed at length in Press et al. [1233] and its sources must be used. Many of those methods rely on computer bitwise operations. For treatment allocation in clinical trials, only the most basic properties of randomness are needed because the operative benefit is that the assignments are made impartially. Even so there is no drawback to superior methods for random number generation. The principal lessons are that (i) computers generate pseudorandom numbers, (ii) the method of generation is subject to mistakes that could lead to pseudorandom number streams with very undesirable properties, and (iii) extra care is needed to have good pseudorandom number generators or ones that yield identical results on all computer systems. It seems we have come full circle now in terms of treatment allocation. The randomization of computer-based methods is actually a nondiscoverable, seemingly random, deterministic allocation scheme. Used with care, it is perfectly appropriate for “randomized” assignments in clinical trials.
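For illustration, a linear congruential generator is only a few lines of code. The constants below are the widely quoted Numerical Recipes choices; this is a sketch, not a recommendation for production use.

```python
def lcg(seed, a=1664525, c=1013904223, k=2**32):
    """Linear congruential generator N_{i+1} = (a*N_i + c) mod k, scaled to (0, 1)."""
    n = seed
    while True:
        n = (a * n + c) % k
        yield n / k

stream = lcg(seed=20170727)
print([round(next(stream), 4) for _ in range(5)])   # fully reproducible for a fixed seed
```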
17.5.3 Randomized Treatment Assignment Justifies Type I Errors
In addition to eliminating bias in treatment assignments, a second benefit of randomization is that it guarantees the correctness of certain statistical procedures and control of type I errors in hypothesis tests. For example, the important class of statistical tests known as permutation (or randomization) tests of the null hypothesis are validated by the random assignment of subjects to treatments. To understand this, we must discuss some aspects of analysis now, even though it is jumping ahead in the scheme of our study of clinical trials. An in-depth discussion of permutation tests is given by Good [614], and randomization tests by Edgington [401–403]. The validity of randomization tests is motivated by the exchangeability of treatment assignments in a clinical trial. Under a population model, two samples from the same population are exchangeable. This might be the view of a randomized clinical trial where, under the null hypothesis, the sample observations from the two treatments are exchangeable. Alternatively, we could motivate such tests by a randomization model as suggested by R. A. Fisher. Under this model, the outcomes from the experiment are viewed as fixed numbers and the null hypothesis states that treatment has no effect on these numbers. Then the group assignments are exchangeable under this null hypothesis.
Suppose that there are two treatments, A and B, $N$ subject assignments in total, $N/2$ subjects in each treatment group, and treatment assignments have been made by simple randomization. Let $Y_i$ be the response for the $i$th subject, $\bar{Y}_A$ and $\bar{Y}_B$ be the average responses in the two treatment groups, and $\Delta = \bar{Y}_A - \bar{Y}_B$. The null hypothesis states that there is no difference between the treatment groups, $H_0\colon \Delta = 0$. Stated another way, the treatment assignments and the responses are independent of one another and exchangeable under the null hypothesis. If so, any ordering of the treatment assignments is equally likely under randomization and all should yield a “similar” Δ. Following this idea, the null hypothesis can be tested by forming all permutations of treatment assignments independently of the $Y_i$’s, calculating the distribution of the Δ’s that results, and comparing the observed Δ to the permutation distribution. If the observed Δ is atypical (i.e., if it lies far enough in the tail of the permutation distribution), we would conclude that the distribution arising from the null hypothesis is an unlikely origin for it. In other words, we would reject the null hypothesis. This procedure is “exact,” in the sense that the distribution generated is the only one that we need to consider and can be completely enumerated by the permutation method.

Number of Permutations

The number of permutations for even modest-sized studies is very large. For example, the number of permutations for two treatment groups, A and B, using simple random allocation with $N_A$ and $N_B$ assignments in each group, is
$$M = \frac{(N_A + N_B)!}{N_A!\,N_B!} = \binom{N_A + N_B}{N_A} = \binom{N_A + N_B}{N_B}, \qquad (17.4)$$
which is the binomial coefficient for $N$ objects (treatment assignments) taken $N_A$ (or $N_B$) at a time. The formula is also valid when $N_A \ne N_B$. This number grows very rapidly with $N_A + N_B$. For more than two treatments, the number of permutations is given by the multinomial coefficient
$$M = \binom{N}{N_A, N_B, \cdots} = \frac{N!}{N_A!\,N_B!\,\cdots},$$
where $N = N_A + N_B + \cdots$ is the total number of assignments. This number also grows very rapidly with $N$. When blocking is used, the number of permutations is smaller, but still becomes quite large with increasing sample size. For two treatments with $b$ assignments in each fixed block and all blocks filled, the number is
$$M = \binom{2b}{b}^{k},$$
where the number of blocks is $k$. A more general formula for blocks of variable size is
$$M = \prod_{i=1}^{r} \binom{N_{A_i} + N_{B_i}}{N_{A_i}}^{k_i},$$
where there are $r$ different sized blocks and $k_i$ of each one. Here also the formula is valid when $N_{A_i} \ne N_{B_i}$. This last formula can also be adapted to cases where the last block
TABLE 17.5 Number of Possible Permutations of Two Treatment Assignments for Different Block and Sample Sizes

Total Sample                      Block Size
Size              2             4             8             ∞
4                 4             6             6             6
8                 16            36            70            70
16                256           1296          4900          12870
24                4096          46656         343000        2.7×10^6
56                2.68×10^8     7.84×10^10    8.24×10^12    7.65×10^15
104               4.50×10^15    1.71×10^20    9.69×10^23    1.58×10^30
The calculation assumes all blocks are filled. An infinite block size corresponds to simple randomization.
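The entries of Table 17.5 can be reproduced with a few lines of code using equation (17.4) and the blocked formula above. In this sketch an unfillable block is treated like simple randomization, which is how the smallest sample sizes appear in the table.

```python
from math import comb

def n_permutations(total, block_size=None):
    """Orderings of two balanced arms: equation (17.4) for simple randomization,
    or C(2b, b) per filled block raised to the number of blocks."""
    if block_size is None or block_size > total:    # unfillable block ~ simple randomization
        return comb(total, total // 2)
    return comb(block_size, block_size // 2) ** (total // block_size)

for total in (4, 8, 16, 24, 56, 104):
    print(total, [n_permutations(total, bs) for bs in (2, 4, 8, None)])
```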
is not filled. Examples of the number of permutations arising from two-treatment group studies of various sizes are shown in Table 17.5.

Implementation

Because the number of permutations from even modest size studies is so large, it is impractical to use exact permutation tests routinely in the analysis of clinical trials, despite the desirable characteristics of the procedure. A more economical approach, which relies on the same theory, is to sample randomly (with replacement) from the set of possible permutations. If the sample is large enough, although considerably smaller than the total number of possible permutations, the resulting distribution will accurately represent the exact permutation distribution and will be suitable for making inferences. For example, one might take 1000 or 10,000 samples from the permutation distribution. Such a procedure is called a “randomization test,” although this term could be applied to permutation tests as well. The terminology is not important, as long as one realizes that the exact permutation distribution can be approximated accurately by a sufficiently large sample.

An example of a true permutation test is shown in Table 17.6. Here there were five assignments in each of two groups, yielding a total of 252 possible permutations, conditional on the number in each group being equal. For each possible permutation, let $\Delta = X_B - X_A$. The five responses on each of A and B were
$$\mathbf{X} = \{1.1, 1.3, 1.2, 1.4, 1.2, 1.5, 2.0, 1.3, 1.7, 1.9\}.$$
From these, the Δ’s that would be obtained for each permutation of treatment assignments are shown in Table 17.6, ordered by the size of Δ. Only the upper half of the permutations are listed; the mirror-image half (i.e., A and B switching places) has Δ’s that are the negative of those in the table. Figure 17.1 shows the permutation distribution. Suppose that the observed Δ was 2.2, the sequence of assignments actually being BABBBAABAA (from permutation number 2), which lies in the upper 0.012 tail of the distribution. Thus, we would reject the null hypothesis of no treatment difference. The validity of this procedure relies on nothing except the use of randomization in allocating treatment assignments: no assumptions about statistical distributions or models have been made. Similar procedures can be applied to event data from clinical trials [521, 1142].

In a large clinical trial, it will not be practical to enumerate all permutations of the treatment assignments. However, sampling from the set of permutations will approximate the true distribution to any desired accuracy. Suppose a trial uses 20 assignments to treatment A and 23 to B. The number of permutations for this modest trial calculated
TABLE 17.6 Lower Half of the Permutation Distribution for Ten Balanced Treatment Assignments in Two Groups

Permutation    Δ      Permutation    Δ      Permutation    Δ
AAABABBABB    2.4     AAABBBBABA    1.0     ABBAAABBBA    0.4
ABAAABBABB    2.2     BAAAABBBAB    1.0     BAABAABBBA    0.4
AAAAABBBBB    2.2     AABBAABBAB    1.0     BBAAABAABB    0.4
ABABAABABB    2.0     AAABBABBAB    1.0     BBBAAABAAB    0.4
AAAABBBABB    2.0     BBABAABAAB    0.8     ABABBAAABB    0.4
AABAABBABB    2.0     ABABAABBBA    0.8     ABBBAAAABB    0.4
AAABAABBBB    2.0     AAAABBBBBA    0.8     BAAAABABBB    0.4
ABAAAABBBB    1.8     AABAABBBBA    0.8     BAAABBBABA    0.4
BAAAABBABB    1.8     ABAAABABBB    0.8     BABAABBABA    0.4
AAABBABABB    1.8     ABAABABBAB    0.8     BBAABABAAB    0.4
AABBAABABB    1.8     ABAABBBABA    0.8     AABABBAABB    0.4
ABAABABABB    1.6     ABBAABBABA    0.8     AABBAAABBB    0.4
ABABABBAAB    1.6     BAABABBABA    0.8     AAABBAABBB    0.4
ABBAAABABB    1.6     AABBABAABB    0.8     BAAABABBAB    0.4
BAABAABABB    1.6     BAAABBBAAB    0.8     BABAAABBAB    0.4
AAAABABBBB    1.6     BABAABBAAB    0.8     ABABBBBAAA    0.2
AABAAABBBB    1.6     AAABBBAABB    0.8     ABBBABBAAA    0.2
AAABABBBAB    1.6     AABBBABAAB    0.8     ABAABAABBB    0.2
BAAAAABBBB    1.4     ABBAAABBAB    0.8     ABBABABABA    0.2
ABAAABBBAB    1.4     BAABAABBAB    0.8     BAABBABABA    0.2
BBAAAABABB    1.4     AABABBBABA    0.6     BBAAAABBBA    0.2
AAABBBBAAB    1.4     BBAAABBABA    0.6     AABBBAAABB    0.2
AABABABABB    1.4     ABAABBAABB    0.6     ABABABABAB    0.2
AABBABBAAB    1.4     ABABBABABA    0.6     BABABABAAB    0.2
ABABABBABA    1.2     ABBBAABABA    0.6     BBABAAAABB    0.2
AAAABBBBAB    1.2     AAAABBABBB    0.6     ABBAAAABBB    0.2
AABAABBBAB    1.2     AAABBABBBA    0.6     BAAABBAABB    0.2
ABAABBBAAB    1.2     AABAABABBB    0.6     BAABAAABBB    0.2
ABBAABBAAB    1.2     ABBABABAAB    0.6     BABAABAABB    0.2
BAAABABABB    1.2     BAAAABBBBA    0.6     BABBAABABA    0.2
BAABABBAAB    1.2     BAABBABAAB    0.6     AABABABBBA    0.2
BABAAABABB    1.2     BABBAABAAB    0.6     AAABBBBBAA    0.2
ABABAABBAB    1.2     BBAAAABBAB    0.6     AABBABBBAA    0.2
AAABABBBBA    1.2     AABBAABBBA    0.6     AAABBBABAB    0.0
ABAAABBBBA    1.2     ABABAAABBB    0.6     AABABAABBB    0.0
ABABBABAAB    1.0     ABBAABAABB    0.6     ABAABBBBAA    0.0
ABBBAABAAB    1.0     BAABABAABB    0.6     ABBAABBBAA    0.0
ABABABAABB    1.0     AABABABBAB    0.6     BAAABABBBA    0.0
BBAAABBAAB    1.0     BBABAABABA    0.4     BAABABBBAA    0.0
AAABABABBB    1.0     AABBBABABA    0.4     BABAAABBBA    0.0
AABABBBAAB    1.0     ABAABABBBA    0.4     BBAAAAABBB    0.0
AABBABBABA    1.0     ABABABBBAA    0.4     BBAABABABA    0.0
FIGURE 17.1 Example permutation distribution from 10 treatment assignments.
The number of permutations for this modest trial, calculated from equation (17.4), is over 960 billion. To illustrate a randomization test, suppose the results on treatments A and B are samples from normal distributions with means 0 and 1, respectively, and both have variance 1. This means the true treatment difference is 1 standard deviation. Data for this circumstance are shown in Table 17.7. The randomization distribution based on 100,000 sample permutations is shown in Figure 17.2. Although this is only about 1 in 10 million of the possible permutations, the distribution is resolved quite clearly.

Permutation and randomization tests are appealing because they can be applied to virtually any test statistic, they do not depend on the sample size, and they require few assumptions beyond the use of randomization. They have not been as widely applied as many other tests, probably because of the amount of computing resources required. However, recent improvements in computing performance on all platforms make randomization tests very attractive.

TABLE 17.7 Data Used for Randomization Test in Figure 17.2

Group   Data Values^a
A       1.77, 0.43, –1.69, 0.95, 2.08, –0.61, –0.81, 0.48, –0.16, –0.66, –1.64, –1.39,
        –0.22, –0.58, 0.07, 0.84, 1.26, 0.71, 0.75, 0.71
B       0.4, 2.56, 1.45, 1.03, 0.6, 2.11, 0.35, 0.68, 2.74, 2.05, 0.54, 0.58, 1.17, 2.1,
        2.3, 1.67, 1.48, 1.95, 1.31, 1.27, 2.03, 2.00, 0.45

^a A − B = −1.38 (difference in group means).
FIGURE 17.2 Histogram of the randomization distribution under the null hypothesis, based on 100,000 samples.
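The randomization test just described is straightforward to program. The following sketch (mine, not from the text) applies it to the data of Table 17.7 using Python and NumPy; the random seed, the number of samples, and the use of the mean difference as the test statistic are illustrative choices only.

```python
# A minimal sketch of a Monte Carlo randomization test applied to the data of
# Table 17.7.  The observed mean difference is compared with the distribution of
# differences obtained by repeatedly relabeling the pooled responses into groups
# of size 20 and 23.  (Illustrative only; seed and sample count are arbitrary.)
import numpy as np

a = np.array([1.77, 0.43, -1.69, 0.95, 2.08, -0.61, -0.81, 0.48, -0.16, -0.66,
              -1.64, -1.39, -0.22, -0.58, 0.07, 0.84, 1.26, 0.71, 0.75, 0.71])
b = np.array([0.40, 2.56, 1.45, 1.03, 0.60, 2.11, 0.35, 0.68, 2.74, 2.05, 0.54,
              0.58, 1.17, 2.10, 2.30, 1.67, 1.48, 1.95, 1.31, 1.27, 2.03, 2.00, 0.45])

observed = a.mean() - b.mean()          # observed treatment difference
pooled = np.concatenate([a, b])
rng = np.random.default_rng(1)

n_samples = 100_000
diffs = np.empty(n_samples)
for i in range(n_samples):
    perm = rng.permutation(pooled)      # one random relabeling of the assignments
    diffs[i] = perm[:a.size].mean() - perm[a.size:].mean()

# Two-sided Monte Carlo p-value: fraction of relabelings at least as extreme
p_value = np.mean(np.abs(diffs) >= abs(observed))
print(f"observed difference = {observed:.2f}, randomization p-value ~ {p_value:.4f}")
```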
17.6 UNEQUAL TREATMENT ALLOCATION
Usually comparative trials employ an equal allocation of subjects to each of the treatments under study. To maximize the efficiency (power) of the primary comparison, this is often the best approach. However, the trial may be designed intentionally to employ unequal group sizes to meet important secondary objectives. Then unbalanced designs can be more efficient than equal allocation for meeting all study objectives. Examples where this can be useful include important subset analyses, large differences in the cost of treatments, and when the responses have unequal variances. For a review of unequal allocation, see Sposto and Krailo [1428].

Some investigators feel that unbalanced random assignment reflects a preference for the more frequently employed treatment. This view is not correct (at least with any sensible allocation ratio). Randomization is employed in the presence of equipoise to eliminate selection bias. This same rationale applies to unequal allocation. Furthermore, one cannot correct a lack of equipoise by appealing to a collective ethic that does less harm on average. It is either appropriate to randomize or not. The need to use an unbalanced allocation should be separated from the decision to randomize, and is more properly based on other study objectives as outlined below.
17.6.1 Subsets May Be of Interest
It may be important to acquire as much experience with a new treatment as possible, while also comparing it with standard therapy. In this case an unequal allocation of subjects, such as 2:1 in favor of the new treatment, may allow important additional experience
with the new treatment without substantially decreasing the efficiency (power) of the comparison. Similarly, subsets of subjects who receive the new treatment (e.g., older subjects) may be of interest and unequal allocation may increase their number. The sensitivity of power to the allocation ratio was discussed in Chapter 16.
17.6.2 Treatments May Differ Greatly in Cost
When one treatment is much more expensive than the other in a two-group clinical trial, the total cost of the study will be minimized by an unequal allocation of subjects, with fewer subjects allocated to the more expensive treatment. This circumstance was discussed in Section 16.9.2. There it was shown that under fairly general circumstances the cost is minimized by an allocation ratio of r = √C, where C is the relative cost of the more expensive therapy. If C = 10, for example, the cost-minimizing allocation ratio will be approximately 3:1 in favor of the less expensive treatment. When cost differences are very large, we must be attentive to the potential power loss that this unbalanced allocation implies.
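As a rough numerical check of the r = √C rule (my sketch, not from the text), the following code fixes the precision of the comparison through the value of 1/n₁ + 1/n₂ and searches for the cheapest allocation; the cost C and the precision constraint are assumed values chosen only for illustration.

```python
# A small numerical check of the cost-minimizing allocation ratio r = sqrt(C).
# We fix the required value of 1/n1 + 1/n2 (which controls the precision of the
# comparison) and search over allocation ratios for the cheapest design.
# All quantities below are illustrative assumptions.
import numpy as np

C = 10.0          # relative cost of the more expensive treatment (assumed)
harmonic = 0.02   # required value of 1/n1 + 1/n2 (assumed precision constraint)

ratios = np.linspace(1.0, 10.0, 1000)        # r = n1/n2, n1 on the cheaper arm
best = None
for r in ratios:
    n2 = (1 + 1 / r) / harmonic              # solve 1/(r*n2) + 1/n2 = harmonic
    n1 = r * n2
    cost = n1 + C * n2                       # total cost in units of the cheap arm
    if best is None or cost < best[1]:
        best = (r, cost)

print(f"cost-minimizing ratio ~ {best[0]:.2f} (theory: sqrt(C) = {np.sqrt(C):.2f})")
```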
17.6.3 Variances May Be Different
Suppose the means, μ₁ and μ₂, of two treatment groups are being compared using a test, for which the statistic is

$$ t = \frac{\mu_1 - \mu_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}, $$

where the variances, σ₁² and σ₂², are not necessarily equal. Because of this, one may wish to change the allocation ratio (i.e., n₁ and n₂) to make the test as efficient as possible. For fixed μ₁ and μ₂, the test statistic is maximized when the denominator is minimized, or when σ₁²/n₁ + σ₂²/n₂ is minimized. For N = n₁ + n₂, the minimum satisfies

$$ 0 = \frac{\partial}{\partial n_1}\left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{N - n_1}\right) $$

or

$$ \frac{\sigma_2^2}{\sigma_1^2} = \left(\frac{N - n_1}{n_1}\right)^2, $$

from which the optimal fraction in group 1 is found to be

$$ \frac{n_1}{N} = \frac{\sigma_1}{\sigma_1 + \sigma_2}. $$

Therefore, the optimal allocation ratio is

$$ r = \frac{n_1}{n_2} = \frac{\sigma_1}{\sigma_2}. \qquad (17.5) $$
In other words, more subjects should be placed in the group where the measured response is less precise.
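A brief numerical illustration of equation (17.5) may help (my sketch, not from the text): for assumed standard deviations, a grid search over the fraction n₁/N recovers the optimal allocation σ₁/(σ₁ + σ₂).

```python
# A quick numerical check that the allocation n1/N = sigma1/(sigma1 + sigma2)
# minimizes the variance of the estimated difference in means.  The standard
# deviations and total sample size below are arbitrary assumptions.
import numpy as np

sigma1, sigma2, N = 3.0, 1.0, 200          # assumed values
fractions = np.linspace(0.01, 0.99, 981)   # candidate fractions n1/N
var = sigma1**2 / (fractions * N) + sigma2**2 / ((1 - fractions) * N)

best = fractions[np.argmin(var)]
print(f"best n1/N ~ {best:.3f}  (theory: {sigma1 / (sigma1 + sigma2):.3f})")
print(f"optimal ratio r = n1/n2 ~ {best / (1 - best):.2f}  (theory: {sigma1 / sigma2:.2f})")
```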
17.6.4 Multiarm Trials May Require Asymmetric Allocation
Another circumstance that encourages unequal group allocations is a trial with one control group and more than one treatment group. Suppose there are k treatment groups, each of which is to be compared to a single control group (e.g., mean difference), and all treatment comparisons are equally important. The optimal allocation of subjects is not necessarily equality in all k + 1 groups. Suppose the control group has n subjects and the treatment groups each have m subjects, consistent with the equal importance of all primary comparisons. The variance of any one of the k mean comparisons is

$$ v = \frac{\sigma^2}{n} + \frac{\sigma^2}{m}, \qquad (17.6) $$

where σ² is the subject-to-subject variance. A sensible constraint might be to maximize overall precision, or minimize the total variance of all comparisons, which is

$$ T = k\sigma^2\left(\frac{1}{n} + \frac{1}{m}\right) = k\sigma^2\left(\frac{1}{N - km} + \frac{1}{m}\right), $$

where the last equation has taken into account the total sample size N = n + km. We can find the minimum total variance as a function of m using basic calculus by differentiating T with respect to m and equating to zero,

$$ 0 = k\sigma^2\left(\frac{k}{(N - km)^2} - \frac{1}{m^2}\right), $$

from which we see

$$ m = \frac{N}{\sqrt{k} + k}. \qquad (17.7) $$

However, we can also express the total sample size as N = n + krn = n(1 + kr), where r = m/n is the relative allocation of subjects to treatment versus control. Substituting, the last equation becomes

$$ rn = \frac{n + krn}{\sqrt{k} + k}, $$

or

$$ \sqrt{k} + k = \frac{n + krn}{rn} = \frac{1}{r} + k. $$
Thus, the optimal allocation ratio to minimize the total variability of comparisons is r = m/n = 1/√k.

A similar derivation could be used to minimize the variance of any single treatment–control difference. In equation (17.6), we can substitute n = N/(1 + kr) and m = rN/(1 + kr).
Then minimizing with respect to r yields r = 1/√k as above. Because equation (17.6) represents an arbitrary one of the k comparisons, the result is identical for all of them. In a two-armed trial, k = 1 and 1:1 allocation is best, as instinct suggests. If there are three treatments and one control, k = 3 and the optimal allocation is approximately 1 : 0.6 : 0.6 : 0.6, where 1/√3 ≈ 0.6. It may not be obvious that minimizing the total variance is the right basis for allocation, but the same result can be obtained from a multiple comparisons procedure [395]. Similar results are obtained for comparisons using log hazard ratios, since the variance is proportional to 1/d₀ + 1/dⱼ, where d₀ is the number of events in the control group and dⱼ is the number of events in the jth treatment group. Although the number of events in a treatment group depends on its hazard rate, assuming they are all nearly equal yields the same result derived above.
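The optimal ratio r = 1/√k can also be verified numerically. The sketch below (mine, not from the text) evaluates the total variance T over a grid of allocation ratios for an assumed three-treatment trial; σ and N are arbitrary assumptions.

```python
# A numerical check of the optimal control-to-treatment allocation r = 1/sqrt(k)
# for a trial with k treatment arms sharing one control arm.
import numpy as np

k, sigma, N = 3, 1.0, 400                   # assumed values
ratios = np.linspace(0.05, 2.0, 2000)       # candidate r = m/n

n = N / (1 + k * ratios)                    # control group size
m = ratios * n                              # size of each treatment group
total_var = k * sigma**2 * (1 / n + 1 / m)  # total variance over the k comparisons

best = ratios[np.argmin(total_var)]
print(f"best r ~ {best:.3f}  (theory: 1/sqrt(k) = {1 / np.sqrt(k):.3f})")
```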
17.6.5 Generalization
Minimizing total variance is not the only possible constraint on allocation. For example, we could imagine using some overall function of power to optimize the allocation ratio in a study with one control group and several treatments. Suppose we assume the objective function to be minimized has components

$$ v_i = C\left(\frac{1}{n} + \frac{1}{m_i}\right)u_i, \qquad (17.8) $$
where C is a coefficient that does not depend on treatment–control differences, m_i is the ith group size, and u_i is a group-specific factor that does not depend on m_i. The form of this objective function is motivated by variance considerations (as above) and also by power relationships (as suggested below). I will assume that u_i represents a "utility," weight, or importance assigned to the ith comparison by the investigator such that Σu_i = 1. The total objective function is

$$ T = \sum_{i=1}^{k} v_i = \sum C\left(\frac{1}{n} + \frac{1}{m_i}\right)u_i = C\left(\sum \frac{u_i}{m_i} + \frac{1}{N - \sum m_i}\right). $$

Then

$$ \frac{\partial T}{\partial m_i} = C\left(\frac{1}{\left(N - \sum m_i\right)^2} - \frac{u_i}{m_i^2}\right), $$

so that

$$ n = N - \sum m_i = \frac{m_i}{\sqrt{u_i}}, $$
or the needed allocation ratios are the square roots of the utilities,

$$ \frac{m_i}{n} = \sqrt{u_i}. \qquad (17.9) $$
The result of the previous section can be easily seen because there the explicit "utility" of each comparison was 1/k. A further result is that squaring both sides and summing up yields

$$ \sum \left(\frac{m_i}{n}\right)^2 = \sum u_i = 1, $$

so that

$$ \sum m_i^2 = n^2 = \left(N - \sum m_i\right)^2. $$
This is an interesting relationship from which the utilities have vanished. For arbitrary utilities that sum to 1, the square of the control group sample size should equal the sum of squares of the treatment group sample sizes, in a kind of generalized Pythagorean relationship. Of course, many combinations of sample sizes will satisfy this rule approximately, and which group gets which sample size depends on the utilities. The utilities need not be subjective, but could be part of the problem structurally, as suggested above for hazard ratios on which the numbers of events depend.

Another circumstance that fits into this framework is allocation that depends on the power of the comparison. For example, equation (16.21) is of the same form as equation (17.8) above with C = (Z_α + Z_β)²σ² and u_i ∝ 1/Δ_i². In a trial with one control group and two additional treatment groups, with one treatment effect being twice as large as the other, allocation ratios of 1:0.9:0.45 would then maximize the "overall power."

Quantitative thinking about optimal allocation should also be a part of the discussion about selection designs involving more than two treatment groups. A standard assumption might be equal-sized treatment groups with early termination of underperforming arms, with expectations of gains in efficiency. However, the efficiency gains can be partly offset by optimal allocation because more subjects would then be assigned to the control arm. This consideration may be unlikely to disqualify such a design, but investigators should be aware of the consequences.
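The following sketch (mine, not from the text) reproduces the 1:0.9:0.45 example from equation (17.9) and checks the generalized Pythagorean relationship; the treatment effects and the control group size are assumed values chosen only for illustration.

```python
# A small sketch of utility-based allocation from equation (17.9): treatment
# group sizes proportional to n*sqrt(u_i).  The utilities below reproduce the
# 1 : 0.9 : 0.45 example, where one treatment effect is assumed to be twice as
# large as the other (u_i proportional to 1/Delta_i^2).
import numpy as np

deltas = np.array([1.0, 2.0])        # assumed treatment effects
u = 1 / deltas**2
u = u / u.sum()                      # utilities normalized to sum to 1

ratios = np.sqrt(u)                  # m_i / n from equation (17.9)
print("allocation (control : treatments) = 1 :",
      " : ".join(f"{r:.2f}" for r in ratios))

# The "generalized Pythagorean" check: sum of m_i^2 equals n^2.
n = 100.0                            # arbitrary control group size
m = n * ratios
print("sum(m_i^2) =", round((m**2).sum(), 1), " n^2 =", n**2)
```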
17.6.6 Failed Randomization?
The term "failed" randomization is used in two ways, both somewhat incorrect. One usage is when the balance resulting from randomization is not as intended. For example, in a trial with 1:1 randomization, if the resulting balance is not reasonably close to 1:1, someone might claim that the randomization failed. This is more likely to be observed in subsets, and is inescapable if enough factors are studied for imbalance. The errors in viewing these circumstances as a failure are (i) the mistaken belief that "good" randomization must yield balance, and (ii) the failure to appreciate that the principal value of randomization is impartial treatment assignment rather than numerical balance. The breaking of the natural or physician-induced correlation between prognosis and treatment assignment is how selection bias is eliminated.

A second way that randomization might be said to fail is when something goes amiss administratively in the assignment procedures. Depending on what actually
happened, such errors can range from utterly inconsequential to invalidating the trial. An inconsequential mistake might be a programming error that produces assignments with the wrong balance (as opposed to such imbalances being produced by natural variability as above). Of course we want to find and correct such mistakes, but their impact on study results is zero. Even when randomization is constrained, we analyze the trial results as though the design was completely randomized. Minor programming errors still yield results consistent with the sampling frame of a completely randomized design. In contrast, a fatal mistake would be to allow treatment assignments to be discovered by investigators and possibly manipulated. This could reintroduce selection bias. One might then reasonably claim a management failure rather than a randomization failure. Any conclusion regarding the ability of randomized assignment to perform its function of bias removal must necessarily come from an administrative audit, and is not reliably visible in the numerical results of randomization.
17.7 RANDOMIZATION BEFORE CONSENT
Partly in response to the difficulties of getting subjects to accept randomization, Zelen [1605, 1606] suggested a "pre-randomized" scheme that works in the following way: Suppose that a new therapy is being compared with standard treatment. Eligible subjects are randomized before being approached to participate in the trial. If the treatment assignment is standard therapy, the subject is offered the standard as a matter of routine, and randomization and consent need not be discussed. If the randomized assignment is the new therapy, the subject is approached for consent. If the subject refuses, he or she is offered standard therapy. The final comparison is between the groups based on randomized assignment. A double randomized consent has also been suggested, in which those initially assigned to standard therapy are also asked if they will accept the new treatment. A review of these designs is given by Parmar [1180].

The randomization before consent strategy can increase the number of trial participants. However, it creates some difficulties that limit its use. For example, in the analysis by treatment assigned, the subjects who refuse their assignment dilute the treatment difference. If the trial is analyzed by treatment received, there is potential for bias. Most important, the design has difficulty passing ethical review because subjects assigned to standard therapy are part of a research study without being properly informed (highlighting a double standard for consent on research studies). For example, the subjects on standard therapy are not given the chance to obtain the new treatment. Some may not wish to participate in the study for reasons not directly related to treatment.

This design was used in the ECMO trial discussed above, although only a single subject was assigned to conventional therapy. A second trial of ECMO was conducted using this consent procedure [1163, 1164]. This pre-randomization was thought to be necessary because of the lack of equipoise on the part of the investigators. This trial, like the one before it and nonrandomized studies, supported the efficacy of ECMO compared with conventional therapy. This second ECMO study also raises a number of important issues about trial conduct, as discussed by Meinert [1027] and Chalmers [247]. A few other trials have used randomization before consent, including some collaborative studies in the U.S. and Europe [1180].
Because of its practical and ethical difficulties, pre-randomization should probably not be routinely used for treatment assignments in clinical trials. Although there may be special circumstances that warrant its use and it can increase accrual, traditional randomized designs seem to be better suited to the objectives of comparative trials and the limits of ethical standards.
17.8 SUMMARY
Treatment allocation in clinical trials and other true experimental designs is characterized by active control of the treatments and the process used to make assignments. The practical concerns in choosing an allocation scheme for comparative studies are reducing bias, quantifying random errors, and increasing the credibility of results. Simple or constrained randomization satisfies the practical concerns and offers important theoretical advantages over other methods of allocation. Randomization reduces or eliminates biases in treatment assignment, guarantees the expectation that unobserved confounders will be controlled, validates control over type I error, and motivates an important class of analyses.

Despite its appeal, the actual realization of a simple randomization scheme can result in chance imbalances in influential prognostic factors or the size of treatment groups. This can be prevented using constrained randomization. Constraints frequently used include blocking and stratifying, adaptive randomization, and minimization. These methods, while theoretically unnecessary, encourage covariate balance in the treatment groups, which tends to enhance the credibility of trial results. Unequal group sizes might be used when doing so minimizes the cost of a trial or facilitates secondary objectives.

Administering treatment assignments may be as important as using an unbiased allocation method. The assignments should be supervised and conducted by individuals who have no vested interest in the results of the trial. The method employed should be convincingly tamper-proof and distinct from the clinic site. The beneficial effects of randomization in reducing bias can be undone if investigators can discover future treatment assignments.
17.9 QUESTIONS FOR DISCUSSION
1. Simple randomization is used in a two-group CTE trial with 100 participants. The probability of being assigned to treatment A is p = 0.5. What is the chance that treatment A will end up having between 48 and 52 assignments? Using a 0.05 α-level test, what is the chance that the groups will be so unbalanced that investigators would conclude that p ≠ 0.5?

2. Simple randomization is used in a two-group CTE trial. The probability of being assigned to either treatment is 0.5. How many independent binary covariates would one need to test to be 50% certain of finding at least one "significantly" unbalanced in the treatment groups? How about being 95% certain?

3. A clinical trial with four groups uses treatment assignments in blocks of size 4 in each of 8 strata. What is the maximum imbalance in the number of treatment assignments that can occur? What is the chance of this happening? What is the chance that the trial will end exactly balanced?
4. Investigators have written a computer program to generate random integers between 0 and 32767. The output is tested repeatedly and found to satisfy a uniform distribution across the interval. Even integers result in assignment to group A and odd ones to group B. When the actual assignments are made, group B appears to have twice as many assignments as group A. How can this happen? Can this generator be used in any way? Why or why not?

5. A new random number generator is tested and found to satisfy a uniform distribution. Values below the midpoint are assigned to treatment A. When the actual assignments are made, there are very few "runs" of either treatment (i.e., AA, AAA, …, BB, BBB, …). How would you describe and quantify this problem? Can this generator be used after any modifications? If so, how?

6. A trial with 72 subjects in each of two groups uses randomized blocks of size 8. The data will be analyzed using a permutation test. You find a fast computer that can calculate 10 permutation results each second and start the program. How long before you have the results? Can you suggest an alternative?
18 TREATMENT EFFECTS MONITORING
18.1 INTRODUCTION

Continuing a clinical trial should entail an active affirmation by the investigators that the scientific and ethical milieu requires and permits it, rather than a passive activity based only on the fact that the study was begun. As the trial proceeds, investigators must consider ongoing aspects of ethics (risks and benefits), utility of the information being generated, data quality, precision of results, the qualitative nature of treatment effects and side effects, resource availability, and information from outside the trial in deciding whether or not to continue the study. Circumstances may require ending a treatment arm or discontinuing accrual of a subset of subjects rather than stopping the entire trial. The appropriate source of information on which to base most aspects of these decisions is the accumulating data, although sometimes information from other studies may be relevant. This explains the idea of "data-dependent stopping," also an appropriate terminology for this activity in which decisions about continuing the trial are based on the evidence currently available. However, monitoring a clinical trial touches more issues than only the decision to stop.

Correctly gathering and interpreting information accumulating during a clinical trial has been described using a variety of other terms: interim analysis, data monitoring, data and safety monitoring, treatment effects monitoring, or early stopping. It is possible to construct nuances of meaning among all these terms, but not necessary for the purposes here, where I will use them interchangeably. Strictly speaking, an interim analysis need not address the question of study termination for reasons of efficacy. It might be done for administrative or quality control reasons. However, the most interesting and difficult questions are those surrounding early termination because of treatment effects or their lack, so that will be the focus of my discussion.
Statistical reasoning plays an essential, but not exclusive, role in such deliberations. The essential tension in any interim action is between data, information or evidence, and knowledge. Data properly employed yield information or evidence, but modifications of the trial require knowledge. Knowledge is a higher standard that implies evidence that has been integrated and reconciled with other salient facts, such as clinical risk and benefit. In any case, preplanned and structured data gathering and information generation during a trial is the most reliable way to proceed for all purposes.

In times past, most clinical trials were monitored principally for data quality and safety rather than treatment effects. The need to minimize exposure of subjects to inferior treatments in comparative studies, safety concerns, and the development of statistical methods for interim inferences have facilitated the wide application of monitoring for treatment effects. Now there is the expectation that all comparative clinical trials and many other designs be monitored formally. Sponsors such as the NIH have overall formal written policies for data and safety monitoring, as well as detailed policies for each institute [1099]. Thus, there are at least 16 different policies. The FDA also has a draft guidance on the subject [502].

In this chapter, I discuss ways of minimizing errors that can result from interim looks at accumulating trial data and ways of improving decision making. Both statistical perspective and organizational and procedural considerations are presented. Methods for data-dependent stopping fall into four general categories: likelihood, Bayesian, frequentist, and others. There is a large literature base related to this topic. The best place to start an in-depth study is the recent book [418]. An issue of Statistics in Medicine contains papers from a workshop on early stopping rules in cancer clinical trials [1418]. There are many older useful reviews [146, 489, 491, 525, 528, 554, 634, 1214, 1311].

The focus of most methodologic work on monitoring is termination of the entire trial. However, practical issues sometimes force us to consider terminating a treatment arm or a subset of subjects who may be entering both treatments. This latter circumstance arose in a recent trial of lung volume reduction surgery (Section 4.6.6) and is discussed by Lee et al. [914]. The discussion that follows pertains only to two-group randomized trials and a single major endpoint. There is much less in the clinical trials literature on the monitoring and stopping of trials with more than two treatment groups. Exceptions are the papers by Hughes [746] and Proschan, Follmann, and Geller [1238]. In addition to statistical guidelines for interim analyses, I discuss some computer programs that can assist with the necessary computations.
18.1.1 Motives for Monitoring
The primary purpose for monitoring a clinical trial is to provide an ongoing evaluation of risk-benefit that addresses the uncertainty necessary to continue. This requires explicit views of both safety data (risk) and efficacy data (benefit). This might seem obvious, but there are occasional requests from study sponsors to monitor trials solely for safety. This is operationally possible, but it is seldom appropriate to isolate safety concerns from efficacy ones. Rarely, it might happen that accrual and treatment proceed very quickly relative to the accumulation of definitive efficacy outcomes, meaning that predominantly safety issues are relevant during the trial.
Aside from terminating a study early because of treatment effects, there are other reasons why investigators may want a view of trial data as they accumulate. Problems with subject accrual or recruitment may become evident and require changes in the study protocol or investigator interactions with prospective participants. In some cases incentives or recruitment programs may need to be initiated. Additionally, monitoring and judicious interim reporting can help maintain investigator interest in, and enthusiasm for, the study. It is natural for highly motivated investigators to wonder about study progress and be curious about results. Not all such curiosity can or should be satisfied by interim reports, but they can serve to maintain active interest. Finally, it may be necessary to make other adjustments in the trial design, such as sample size, on the basis of information from the early part of the study [710]. Although it is not usually phrased this way, monitoring makes a clinical trial formally adaptive to changing circumstance. Within the boundaries of human subjects protection and scientific goals, monitoring has enormous freedom. With that freedom, it may be as much of an art as a science. However, the “good practice” of monitoring is process oriented and very much a science.
18.1.2 Components of Responsible Monitoring
Making good decisions on an interim basis is not primarily a technical problem. It requires a mixture of advanced planning, skill, and experience that must consider the context of the trial, ethics, data quality, statistical reasoning, committee process and psychology, and scientific objectives. Deficiencies in any of these components can lead to unserviceable decisions.

In all that follows, I assume that timely and accurate data are available on which to base decisions. The production, management, and quality control of complex dynamic data systems is an important subject in itself, and is made even more difficult in multicenter collaborations typical of clinical trials. Good data result from a smooth integration of planning, trained personnel, system design, experience, quality control, and diligence. Interim deliberations often require unexpected views of the data and unanticipated analyses that put additional stresses on the data system and staff. Doubts about data quality will undermine critical interim decisions. Planning, guidelines, and formal decision process cannot substitute for confidence in the data sources. Thus, reliable and timely data are a foundation for all of the decision aids discussed here.

The components of monitoring are brought together in the form of a monitoring committee, often called a data and safety monitoring board (DSMB), committee (DSMC or DMC), or a treatment effects monitoring committee (TEMC). The last term is probably the most descriptive. The name used is inconsequential, but the structure, composition, competence, objectivity, and role assigned to the committee are vital. This aspect of data-dependent stopping is discussed below.
18.1.3 Trials Can Be Stopped for a Variety of Reasons
Some practical reasons for terminating a trial early are shown in Table 18.1. My use of the term “data-dependent stopping” to describe all of these aspects of trial monitoring is broader than usual because the term is applied often only to the statistical aspects of planning for study termination. Because one cannot easily separate the statistical
TABLE 18.1 Some Reasons Why a Clinical Trial Might Be Stopped Early

∙ Treatments are judged to be convincingly different by impartial knowledgeable experts.
∙ Treatments are judged to be convincingly not different, with sufficient precision, by impartial knowledgeable experts.
∙ Side effects or toxicity are too severe to continue treatment in light of potential benefits.
∙ The data are of poor quality.
∙ Accrual is too slow to complete the study in a timely fashion.
∙ Definitive information about the treatment becomes available (e.g., from a similar trial), making the study unnecessary or unethical.
∙ The scientific questions are no longer important because of other medical advances.
∙ Adherence to the treatment is unacceptably poor, preventing an answer to the basic question.
∙ Resources to perform the study are lost or no longer available.
∙ The study integrity has been undermined by fraud or misconduct.
aspects of trial monitoring from the administrative ones, I will use the term to refer to both. The simplest approach to formalizing the decision to stop a clinical trial is to use a fixed sample size. Fixed sample size designs are easy to plan, carry out, stop, and analyze, and the estimated treatment effects will be unbiased if the trial is performed and analyzed properly. Often, however, investigators have important reasons to examine and review the accumulating data while a study is ongoing. If convincing evidence becomes available during the trial about either treatment differences, safety, or the quality of the data, there may be an imperative to terminate accrual and/or stop the study before the final fixed sample size has been reached. Assessment of study progress from an administrative perspective, updating collaborators, and monitoring treatment compliance also require that the data be reviewed during the study.

There are two decision landmarks when monitoring a study: when to terminate accrual and when to disseminate results. Some investigators view these as one and the same, but it is often useful to separate them in studies with prolonged follow-up (e.g., survival studies). As sketched in Chapter 16, one can advantageously trade accrual rate, accrual time, or follow-up time for events in these types of trials.

It is common to view the stopping decision as symmetric with respect to the outcome of the trial. For example, whether A is superior to B or B is superior to A, the quantitative aspects of the stopping guidelines are often taken to be the same or similar. However, clinical circumstances are seldom symmetric. If A is standard and B is a new treatment, we would likely wish to stop a randomized trial (i) when B cannot be superior to A rather than when it is convincingly worse, or (ii) when B is convincingly better than A. Thus, stopping guidelines for efficacy should be drafted considering the asymmetric consequences of the differences implied, and supplemented with expert opinion. The presence of additional important outcomes will further complicate the asymmetry. This asymmetry will be crucial in the discussion below regarding masking.
18.1.4 There Is Tension in the Decision to Stop
Making decisions about clinical trials based on interim data is error prone. Before final data are in, monitors will get an imprecise view of most aspects of the study. If interim information is disseminated, it may affect investigators' or subjects' objectivity about the treatments, leading to unsupported conclusions or inappropriate changes in behavior. Repeated or multiple statistical hypothesis tests can increase the chance of a type I error. Despite these potential problems, investigators are vitally interested in the ongoing status of a trial, and they or their agents need summaries of the data with appropriate format and detail to stay informed without corrupting the study.

There are many pressures to terminate a trial at the earliest possible moment. Early stopping is encouraged by the need to minimize the size and duration of the trial and the number of subjects receiving an inferior treatment. Sponsors and investigators are concerned about the economy of the study, and timeliness in disseminating results. Pressures to shorten a trial are opposed by legitimate reasons to continue the study as long as possible. Aside from the benefits of adhering to the original design, larger trials increase precision and reduce errors of inference, yield higher power to account for the effects of prognostic factors, increase ability to examine or compare clinically important subgroups, increase precision for observing temporal trends, and add precision to secondary endpoints, some of which may be observed only in the context of an ongoing trial. Thus, there is a natural tension between the needs to stop a study and those tending to continue it.

Those responsible for monitoring must be cognizant of the investigators' view of the trial and the anticipated use of the evidence generated. Will the study have the needed impact on clinical practice if it is shortened or made smaller? For example, a trial used to support a regulatory decision might require stronger evidence than one conducted for less contentious reasons. In any case, there is an ethical mandate to maintain the integrity of the trial until it provides a standard of evidence appropriate for the setting. In this sense, those who watch the data and evidence evolve must be concerned with both the collective good (what will be learned from the trial) as well as the individual good (appropriate risk-benefit for participants). The tension or balance between these perspectives is different early in the trial than later, and the trade-off between them cannot be avoided.

Example 18.1. Tension can be illustrated by the second clinical trial of extracorporeal membrane oxygenation (ECMO) versus standard treatment for newborn infants with persistent pulmonary hypertension [1163, 1164]. (This study, and the first trial of ECMO, were briefly discussed in Chapter 17.) Thirty-nine newborn infants were enrolled in a trial comparing ECMO with conventional medical therapy. In this study randomization was terminated on the basis of 4 deaths in 10 infants treated with standard therapy compared with 0 of 9 on ECMO (p = 0.054). Regardless of how one evaluates the data, the evidence from such a small trial is not an ideal basis on which to make changes in clinical practice. If ECMO is a truly effective treatment, the failure to adopt it broadly because of weak evidence favoring it is also an ethical problem.
Example 18.2. The Second International Study of Infarct Survival (ISIS-2) is an example of a trial that continued past the point at which many practitioners would have been convinced by the data. In this study the streptokinase versus placebo comparison was based on 17,187 subjects with myocardial infarction [1132]. After five weeks, the death
rate in the placebo group was 12.0% and in the streptokinase group was 9.2% (p = 10⁻⁸). With regard to this outcome, convincing evidence in favor of streptokinase was available earlier in the trial. However, the impact of this study on practice is largely related to its precision, which would have been diminished by early stopping. An earlier overview of 22 smaller trials had little effect on practice, even though the estimate it produced in favor of streptokinase was similar. See also Section 20.8.5.
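The inflation of the type I error from repeated unadjusted testing, mentioned at the start of this section, can be illustrated with a small simulation (mine, not from the text); the group sizes, number of looks, and number of simulated trials are arbitrary choices made only for illustration.

```python
# A toy simulation of the type I error inflation produced by repeatedly testing
# accumulating data without any adjustment.  Both arms are drawn from the same
# distribution, so every rejection is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_trials, n_per_arm, looks = 5000, 100, (25, 50, 75, 100)

rejected = 0
for _ in range(n_trials):
    a = rng.normal(size=n_per_arm)             # both arms from the same distribution,
    b = rng.normal(size=n_per_arm)             # so the null hypothesis is true
    for n in looks:                            # naive test at each interim look
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            rejected += 1
            break

print(f"overall type I error with 4 unadjusted looks ~ {rejected / n_trials:.3f}")
```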
18.2 ADMINISTRATIVE ISSUES IN TRIAL MONITORING
Investigators use interim analyses to assess several important aspects of an ongoing trial. These include accrual, data quality, safety, and other parameters listed in Table 18.1. Deficiencies in any area could be a reason to stop the trial. Thus, the statistical guidelines for efficacy discussed below are only part of a broader trial monitoring activity. A number of authors have discussed the broad issues and offer useful perspectives for the clinician and trialist alike [577, 578, 634, 1135]. An interesting discussion in the context of cardiovascular disease is given by the Task Force of the Working Group on Arrhythmias of the European Society of Cardiology [1463].
18.2.1 Monitoring of Single-Center Studies Relies on Periodic Investigator Reporting The practical issues in trial monitoring may be somewhat different for multicenter clinical trials than for single-institution studies. The funding and infrastructure for multicenter trials usually permits a separate monitoring mechanism for each study or group of related studies. For single academic institutions conducting a variety of trials, such as cancer centers, the same oversight mechanism may be required to deal with many varied clinical studies. In such settings the responsibility for study conduct and management may reside more with the individual investigator than it does in collaborative clinical trial groups. A typical mechanism to assist single institutions in monitoring clinical trials sponsored by their faculty is an annual report. It is the responsibility of the study principal investigator to prepare the report and submit it to the oversight committee, which may be the IRB or a similar body. This report should contain objective and convincing evidence that the study is safe and appropriate to continue. At a minimum, the renewal report should address the following concerns:
· · · ·
Compliance with governmental and institutional oversight: The investigator should document that all of the necessary regulations have been satisfied. Review of eligibility: Document that there is a low frequency of ineligible subjects being placed on study. Treatment review: Most or all subjects should have adhered to the intended treatment. Frequent nonadherence may be a sign of problems with the study design, eligibility, or conduct. Summary of response: For diseases like cancer where response to treatment is a well defined and widely used measure of efficacy, investigators should summarize responses and the criteria used to judge them.
Piantadosi
Date: July 27, 2017
528
· · · ·
Time: 4:45 pm
TREATMENT EFFECTS MONITORING
Summary of survival: For most serious chronic diseases, the survival (or diseasefree survival) experience of the study cohort is important. In any case, events possibly related to adverse consequences of the treatment need to be summarized. Adverse events: Convincing evidence of safety depends on accurate, complete, and timely reporting of all adverse events related to the treatment. Investigators sometimes attribute adverse events to causes other than the investigational treatment, such as to the underlying disease. This aspect of the annual report merits careful review. Safety monitoring rules: Many study protocols have formal guidelines or statistical criteria for early termination. The data must be summarized appropriately and compared with these guides. Audit and other quality assurance reviews: Data audits examine some of the issues outlined above and other aspects of study performance. The results of such audits and other reviews of process should be available for consideration by the oversight committee.
18.2.2
Composition and Organization of the TEMC
There are multiple reasons why TEMCs have become widely used. First, they provide a workable mechanism for protecting the interests and safety of study participants, while preserving the scientific integrity of the trial. Second, the TEMC is intellectually and financially independent of the study investigators, thus providing objective assessments. Decisions will be made largely independent of academic and economic pressures. Most sponsors of trials such as NIH support or require such a mechanism [1096, 1102]. For example, the National Heart Lung and Blood Institute (NHLBI) has its own explicit guidelines for data quality assurance that emphasize the TEMC mechanism [1094]. Similar policies have been promulgated by the National Cancer Institute [1090]. Although the TEMC mechanism has strengths, it also has some weaknesses, discussed below. TEMC Charter The TEMC charter is a document that describes the role and function of the committee. Usually the charter is only a few pages. It will be prepared by the investigators or sponsor and vetted with the members of the monitoring body. This formality is needed to address composition, structure, function, relationships, and philosophy around the monitoring process. It also produces a useful record that might be helpful later if difficult decisions arise regarding study termination. A charter might document many of the specific points listed in this chapter. Typically, assembling members of the TEMC will review the draft charter to be sure they are comfortable working under it, agree with the mechanisms implied, and to recommend nay changes. Another audience for the TEMC charter is the IRB or Ethics Committee. A Charter can provide documentation that ethics concerns are being properly managed at the outset of a trial. TEMC minutes and topic areas covered by monitoring meetings will provide similar assurnaces as the trial progresses. Relationship to the Investigators One of the most important aspects of the TEMC is its structural relationship to the investigators. Two models are commonly used (Fig. 18.1). The first is having the TEMC be
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
ADMINISTRATIVE ISSUES IN TRIAL MONITORING
FIGURE 18.1
529
Important relationships in treatment effects monitoring
advisory to the trial sponsor (Fig. 18.1 A); the second is making the TEMC advisory to the study investigators, perhaps through a Steering Committee (Fig. 18.1 B). This difference is not a nuance and can have consequences for the quality of the recommendation and ethics responsibilities. Neither model is more advantageous with respect to optimal selection of committee members, internal functioning, or the requirements for accurate and timely data. When the TEMC is advisory to the study sponsor, the opinions and judgments rendered are likely to be as independent as possible from investigator opinion and bias; that is, there is potentially a premium on objectivity. This model is typified by studies performed under contract, particularly those sponsored by the NHLBI and some pharmaceutical companies. Recommendations might be more efficiently implemented, although this is not always the case. The principal disadvantage of this model is that the study sponsor may be an imperfect filter for recommended actions and transmission of information that is important to the investigators. In other words, sponsors have conflicts of interest of their own. Thus, to be optimal, this design must be supplemented by a written policy requiring investigators to be notified of relevant TEMC recommendations. This issue of reporting relationships is particularly important for ethics. Consider what might occur if the TEMC makes an ethically significant recommendation to a sponsor, who in turn disagrees and does not deem it necessary to transmit the recommendation to the investigators. The investigators are the ones who incur the ethical responsibility for representing risks and benefits to the study participants, but they can be denied relevant information from their own data. But the investigators’ obligations to the patients cannot be alleviated or abdicated to a TEMC or sponsor. Thus, a structure that does not require TEMC reporting to investigators is potentially unethical. A better model is to make the TEMC advisory to the investigators, specifically through a steering or executive committee. This model is equally simple, objective, and effective as the alternative, and is employed by most cancer cooperative clinical trial (NCTN) groups and many academic institutions. It assures that recommendations that bear on the ethics of the study will find their way reliably and efficiently to the individuals responsible for patient safety. After due process, investigators may take a course of action different than the TEMC recommendation. This is fine if appropriately justified and documented. But it is not appropriate for the sponsor to follow a similar course with the concurrence of the investigators. My discussion has articulated a principle of obligation for the TEMC to the investigators/subjects, which I will now make explicit. The TEMC has an obligation to inform the investigators of their opinions and recommendations about actions that carry ethics implications. In the first model discussed above, the TEMC obligations do not end with
Piantadosi
Date: July 27, 2017
530
Time: 4:45 pm
TREATMENT EFFECTS MONITORING
the sponsor. The investigators have their own widely recognized and discussed obligations to the subjects. The sponsor has its own obligations but is irrelevant with respect to the connection between the TEMC and the investigators. Suppose, for example, that the TEMC produces a nonunanimous vote to continue a clinical trial, meaning some members vote to terminate the trial. (The reason is irrelevant for this discussion.) We can see the potential for a double standard about TEMC function in this situation if this information is withheld from the investigators. On the one hand, we are interested in what objective knowledgeable experts have to say about the trial. On the other hand, we might refuse to represent a difference of opinion to the investigators from such experts because it might alter their thinking. Relationship to IRBs Another important but often overlooked aspect of TEMC structure is its relationship to the IRBs at participating institutions. The roles of these two entities in protecting subject safety are somewhat overlapping. However, their roles are often seen as separate and independent, and there is usually a reluctance of the TEMC to provide information to the IRB because of the emphasis on confidentiality. This can be more of a problem if the TEMC is advisory to the sponsor. I have seen this be an acute problem in a large NHLBI sponsored trial when the IRB of the institution supporting the coordinating center requested interim data on treatment risks. The sponsor resisted furnishing the data, and the IRB request was honored only when it signaled its intent to refuse renewal of the study. There is little doubt that the IRB has the right and responsibility to examine interim data relating to risks and benefits as presented to the patients in the consent documents. IRBs can easily review such data if they are properly summarized, while preserving confidentiality. Furthermore, IRBs are not intended or equipped for, nor do they desire, a more active role in the details of study monitoring. Relationship to Regulatory Authorities Investigators are personally responsible for the welfare of study participants and are supervised in this role by Insitutional Review Boards, Ethics Boards, and the like. In a real sense, the IRB abdicates its responsibility for detailed ongoing supervision of a clinical trial to the TEMC. The plans for doing so are part of the study protocol and are reviewed in advance by the IRB. IRBs are usually comfortable with the TEMC on the front line as they should be. The typically unspoken but key relationship between IRB and TEMC is sometimes interrupted by regulatory authorities. For example, I have observed both commercial sponsors and the FDA insist on intercepting a TEMC recommendation to stop a clinical trial so as to reflect on the regulatory implications of early termination before the investigators are notified and action is taken to protect human subjects. The sponsors I have queried about this seem to think it is reasonable, and I presume the FDA does also or they would not be recommending it. Investigators answer to an IRB, while sponsors and regulators do not. I do not know if Ethics Boards regularly contemplate that their agent (TEMC) may be compelled to have higher priority than the study participants and investigators obligated to protect them. Knowing that regulatory, financial, and business strategy considerations can be inserted ahead of human subject protection, one should be very aware of the formal
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
ADMINISTRATIVE ISSUES IN TRIAL MONITORING
531
relationships sketched in Figure 18.1. The TEMC should report to the investigators and IRB rather than to the sponsor. Regulatory implications and authorities are not part of the monitoring process. When a sponsor insists otherwise, there should exist in the TEMC charter a requirement that significant recommendations or dissentions within the committee be transmitted immediately to the investigators and IRB.
TEMC Membership TEMCs are usually composed of a combination of clinical, statistical, epidemiological, laboratory, data management, and ethical experts. All members of the TEMC should be knowledgeable about the circumstances surrounding the trial and should not have any vested interest in the outcome. Of course, the committee members will have a scientific interest in the outcome, but only to the extent that they would like to see a valid and well-performed trial. Individuals with any financial or other conflicts of interest should not be members of the TEMC. Monitoring committees often consist of 3–10 members, depending on the size of the trial and complexity of the issues. It is good practice to include some experienced members on every TEMC. A new committee should not be constructed entirely from inexperienced members, even if they are all experts. Multidisciplinary or multinational membership may be helpful because some issues may require these varying perspectives. Individuals who would make good TEMC members are usually known to the trial investigators. There is no formal source for such expertise except for personal contacts. Experienced trial methodologists, clinical experts, ethicists, and investigators who have conducted similar studies are good sources for TEMC members. Investigators and trialists do not agree universally on whether or not the TEMC should include an investigator from the trial. To the extent that the role of the TEMC is to protect the interests of the subjects, a clinical investigator can be a member. To the extent that the TEMC should preserve the scientific integrity and objectivity of the trial, a clinical investigator may be unnecessary or inappropriate. This question need not be very difficult to resolve. For example, a clinical investigator could participate in safety and data quality related aspects of the TEMC deliberations but be excluded from discussions regarding efficacy and possible early termination of the trial because of convincing treatment differences. An investigator or other individual with particular expertise can be a nonvoting member of the committee. There are many circumstances where detailed knowledge of study conduct, protocol, and participants is essential to TEMC deliberations. In institutions that conduct many clinical trials simultaneously, such as large comprehensive cancer centers, a single TEMC may review multiple studies or may even monitor all ongoing trials. The role or emphasis of this committee may be broader than simply safety and efficacy monitoring. For example, such a committee may help investigators with protocol development and review, setting institutional priorities for potentially competing studies, monitoring adverse events, and renewing protocols and other administrative matters. These functions are distinct from those usually performed by the Institutional Review Board. There are other questions that an impartial view of the data can help answer beside evaluating treatment or side effects. For example, the need for ancillary studies, secondary
Piantadosi
Date: July 27, 2017
532
Time: 4:45 pm
TREATMENT EFFECTS MONITORING
questions, or requests for the data might be evaluated by the TEMC. The committee will meet once or twice each year (more often if the study progress or issues require it) and review formal reports of study outcomes. Because of the potential for this interim information to affect the clinicians’ interactions with subjects, the TEMC’s meetings and the data presented to it are kept confidential. Reports typically address a number of issues vital to decision making about the trial and are not disseminated.
TEMCs Should Be Unmasked There have been two schools of thought regarding masking of the TEMC to treatment assignment when the data are reviewed. One philosophy places objectivity at a premium which motivates masking TEMC members. It is always the case that someone must be unmasked to prepare the data and materials for deliberation—this might be a statistician only. But the TEMC would remain masked unless there was observed some significant asymmetry in the data, at which point the treatment mask would be removed to assess the findings. If the trial is not stopped, in principle the subsequent meetings could be remasked, although TEMC members might well remember characteristics of the data that essentially unmask them permanently. We might refer to this approach as relying on structural objectivity because masking offers the best chance to remove personal biases. A second philosophy, which places expertise at a premium, requires unmasked data throughout the trial. This would allow to the greatest extent possible a fully informed perspective on all findings, large or small. When unmasked TEMC members might need to cope with their own preconceptions, biases, or expectations. But there would never be any ambiguity regarding what effects were attributable to which treatment. This approach relys on individual objectivity because it requires TEMC members to compensate for their biases. I raise this notion of personal objectivity because it is real, and I do not wish to convey the impression that masking is the only way to achieve objective deliberations. My experience has been that each TEMC must discuss the masking issue anew, and often has trouble deciding on the best course of action. Although it is often worthwhile to have the discussion, the resolution should almost always be in favor of unmasked review, for reasons I outline below. The circumstance where masked data monitoring might be appropriate is when two new similar drugs are being compared and the choice of superiority is truly symmetric. This is a rare circumstance. In general, the TEMC should be unmasked to treatment assignment. I might go so far as to say it is unethical to mask the experts monitoring the data. This might seem like an extreme viewpoint, but I think it is easily defended. A dominant preference for unmasking derives from (i) a typical and strong asymmetry in the decisions required during the trial, (ii) the need to incorporate clinical knowledge into the statistical assessments of data, and (iii) obligations to protect research subjects from an unfavorable balance of risk and benefit. As a general rule, I will assume that one of the treatments in our randomized trial is standard therapy, but the rationale extends to other cases as well. With regard to asymmetry in the decisions faced by a TEMC, it is never the case that we want strong evidence that a new treatment is inferior to standard. In contrast, we often require strong evidence that a new treatment is superior to standard. Thus, the decision to stop a trial in the presence of an efficacy signal is asymmetric, and cannot be made reliably under masking. Monitoring guidelines would reflect this asymmetry but cannot be implemented under masking.
Clinical knowledge is required when interpreting both safety and efficacy signals. We are always prepared to offset small safety concerns if a treatment is sufficiently superior to a slightly safer alternative. This is part of the essential risk-benefit calculus. But there will be much more safety experience with the standard arm of the trial, drawn from expert monitors and clinical experience outside the study. Thus, the risk-benefit considerations for each treatment are different, and the comparison of risk-benefits is asymmetric. Both require clinical knowledge to assess reliably, removing the legitimacy of masking.

Finally, we must consider the relationships between the research participants and the investigators, that between the investigators and the IRB or ethics board, and the obligations of the TEMC to all three constituencies. How does TEMC masking affect these relationships? It is the investigators who carry the responsibility for the protection of research participants. They have crafted the study procedures and the informed consent, obtained appropriate oversight from the IRB or ethics board, put in place the TEMC mechanism, and obtained permission from each individual research participant. But investigators are usually not “trusted” with oversight of their own data from a randomized trial. The IRB or ethics board is charged with oversight, but usually abdicates this responsibility to the TEMC and seldom if ever sees efficacy or safety data. The IRB or ethics board may also not have the topical expertise or time to monitor the trials under its purview. So the TEMC is the sole mechanism for assuring a favorable risk-benefit for the trial participants. It must remain unmasked to fulfill this obligation reliably.

Indemnification Unfortunately, the days of participating in TEMC activities without liability insurance coverage are gone. Fertile opportunities for lawsuits that might touch TEMC members arise from hungry lawyers, shareholders, and aggrieved third parties. Although uncommon, such lawsuits have taken place and can absorb enormous amounts of energy and destroy good intentions. I have seen utterly routine or even exemplary TEMC actions called into legal question, so good conduct alone offers no protection. What TEMC members should expect is to have their activities indemnified by the sponsor. This will not prevent lawsuits, but will provide for the cost of defense, which otherwise is unmanageable. The need for indemnification is independent of the stature of the sponsor—pharmaceutical, academic, and government sponsors are all very much aware of the need to indemnify their TEMC members. One should never participate on a TEMC without such protection. Aside from the obvious problems of lawsuits, one should reflect on several procedural aspects of TEMC function to minimize exposure. This points to the TEMC charter, meeting procedures, meeting minutes, emails, and other communications to members, sponsors, and investigators.

Meeting Format and Reporting Responsibilities The TEMC will often meet at the same time that trial investigators review the progress of their studies, semiannually, for example. The frequency of TEMC meetings depends on how quickly information in the trial accumulates. The deliberations of the TEMC are confidential and there should be no discussion or dissemination of any results outside the committee meetings. A fairly standard but flexible format for meetings has evolved in many trial settings.
It includes a graded degree of confidentiality, depending on the
aspect of the study being discussed. Topics from nonconfidential to most confidential are (i) relevant information from outside the trial, (ii) accrual and study progress, (iii) safety data, (iv) efficacy differences, and (v) TEMC speculations. For example, the general progress of the study (e.g., accrual) can be discussed in an open forum. Additionally, scientific information from outside the study can be brought to the attention of the committee and investigators. This might include results or progress on related trials. Attendance at this portion of the meeting can include the sponsor, study chair (PI) or other investigators, and statistical center personnel in addition to the TEMC members. A more restricted portion of the meeting typically follows this. Discussions and summary data concerning safety should probably be restricted to the sponsor, study chair, study statistician, and the TEMC. Discussions and data presented by treatment group should be closed to investigators, except for the trial statistician or designee who presents quantitative findings to the TEMC. During this portion of the meeting, outcome data on safety and efficacy by treatment group are presented and discussed. No formal decisions are made by the TEMC during this part of the meeting. Following the presentation of safety and efficacy data by treatment group, the TEMC meets alone in executive session. During these deliberations, decisions or recommendations regarding changes in the study are made. These decisions can then be transmitted to the appropriate group representative. In some clinical trials, the TEMC reports to the sponsor. This is the case, for example, in many NIH-sponsored trials. Other times, the committee reports to the principal investigator or study chair. In the AIDS Clinical Trial Group, the TEMC reports to the executive committee of the group. There are many workable models, provided the recommendations of the TEMC are considered seriously by those ultimately responsible for the conduct of the study. In rare cases the investigators may choose not to follow the course of action suggested by the TEMC. This can lead to a great deal of friction between the investigators and the TEMC, sometimes resulting in resignations from the monitoring committee. Many TEMC meetings are conducted as telephone conferences or the like. This is often appropriate, efficient, convenient, and especially economical. A preliminary TEMC meeting might be done this way, but the first formal meeting should probably be conducted in person. Also, any TEMC meeting at which a very consequential decision is to be taken, such as early stopping of a trial, should be held in person. The extra organizational time, which is often not excessive and can be used to advantage, and the improved dynamics of face-to-face deliberations make this a worthwhile rule.
TEMC Communications The formal reporting actions of the TEMC are different from the web of communications required to support its activities. Ideally, the TEMC membership communicates about the study almost exclusively among its members once data reports and documents have been received from the trial coordinating center. Again ideally, all members of the TEMC would participate in all communications. These are reasonable policies but cannot always be upheld strictly. In practice, TEMC members may have direct contact with the sponsor, investigators, or experts outside the trial. The purpose is to allow important questions to be resolved directly and efficiently. Information flow from such contacts is into the TEMC and not the reverse. This will necessarily limit the number and nature of questions that can be discussed outside the circle of confidentiality. Some technical TEMC queries need not
be made as a group. For example, the TEMC statistician might ask questions of the sponsor or trial statistician and bring reassurance rather than technical details back to the full committee. Of course it is always possible to take more formal pathways for such queries. But reasonable rigor and efficiency in the monitoring process can often be satisfied simultaneously in this way.
18.2.3 Complete Objectivity Is Not Ethical
A basic dilemma when obtaining advice of any type is achieving a balance between objectivity and expertise. This problem arises in several places when monitoring clinical trials—the relationship between the monitoring committee and the investigators, the composition of the monitoring committee itself, the use of masking, and the crafting and implementation of statistical guidelines. To avoid certain criticisms, most clinical trial investigators tend to err on the side of more objectivity and less expertise, potentially tipping the balance in the wrong direction. When this balance is lost, ethics obligations to the subjects cannot be adequately met. As a simple example, consider having a masked TEMC in a study comparing standard treatment to a new therapy. From a purely theoretical view, it may seem appropriate to consider evaluating evidence regarding treatment differences as a symmetric, objective decision. This would endorse the complete masking of the study—subjects, physicians, and TEMC. From a subject safety perspective, however, the decision is not symmetric. We would not want to prove that a new treatment is significantly worse than standard therapy. Being convinced that it is not better than standard therapy is the appropriate stopping point. Thus, the TEMC needs to know the treatment group assignments to interpret the data appropriately and make the most expert risk-benefit assessment. A recent illustration of the importance of this was the trial of hormone replacement therapy discussed in Section 8.4.2. Most of the decisions faced by a TEMC are not symmetric, and it is my opinion that masking such committees is unethical, despite it appearing to be objective. Hypertrophied objectivity can enter the monitoring process in other structural and procedural ways. Consider the optimal composition of the TEMC. Complete objectivity can only be provided by members who have little or no knowledge of the investigators, the study, the competition, the sponsor, market factors, and the regulatory issues. Such persons could not offer expert advice. This illustrates that we do not seek complete objectivity. Similarly, it is unrealistic to expect an ideal balance of objectivity and expertise in every member of the TEMC. A more appropriate goal is that the TEMC should comprise experts who are reasonably objective in a collective sense and employ objective methods. Prior opinions can be balanced in the same way that expertise is. Also bear in mind that expertise is a characteristic of the membership of the TEMC, whereas objectivity is partially a property of the membership and substantially a characteristic of the process of review. Investigator expertise is a fundamental prerequisite for human experimentation, so we should preferentially admit expertise over objectivity at every point in monitoring. I am not arguing against objectivity but against the imbalance between expertise and objectivity that is prevalent in this field today. We permit attitudinal biases in TEMC members and the sponsor but are reluctant to tolerate the biases of study investigators. Such thinking is inconsistent, and we should be prepared to utilize the expertise of investigators while
counterbalancing any lack of objectivity with process. In part this means making the TEMC advisory to the investigators, and perhaps making a membership position on the TEMC open to a study investigator. The most serious deficiency of expertise on TEMCs as typically constituted today is lack of knowledge of the study protocol, trial conduct, and patient interaction. The best source for the missing expertise is one of the principal investigators. Current dogma teaches that it is inappropriate for the TEMC to include an investigator. For example, the FDA guidance states: Knowledge of unblinded interim comparisons from a clinical trial is generally not necessary for those conducting or sponsoring the trial; further, such knowledge can bias the outcome of the study by inappropriately influencing its continuing conduct or the plan of analyses. Unblinded interim data and the results of comparative interim analyses, therefore, should generally not be accessible by anyone other than DMC members… Even for trials not conducted in a double-blind fashion, where investigators and patients are aware of individual treatment assignment and outcome at their sites, the summary evaluations of comparative unblinded treatment results across all participating centers would usually not be available to anyone other than the DMC. [498].
Some have gone further to suggest that firewalls be constructed so that even those who interact with the investigators (e.g., coordinating center staff) not be privy to interim data. I have heard of a circumstance where investigators were discouraged by NIH sponsors from looking at data from their RCT for grant-writing purposes even though recruitment was over, the treatment period was finished for all subjects, and those looking were not treating or evaluating subjects on the study. A philosophy that forces this degree of “objectivity” is nonsense. Clinical trials seem to be the only science where the prevalent view is that investigators should not see their own data. Some sponsors have put forth guidelines for trial monitoring that objectify the process by excluding investigators. There are two problems with such guidelines. First, they tend to sacrifice expertise as I have discussed. Second, they are not necessarily produced by a reasoned process of debate and review. Guidelines put forward by NIH, for example, are often taken by inexperienced investigators without criticism. The FDA draft regarding monitoring is given for public comment, but the debate is not extensive and all such regulations have a strong ossifying effect because pharmaceutical companies have so much at stake. A landmark to help navigate these difficulties is the Institutional Review Board (IRB). The conduct of research is based on the relationships between the investigator, subject, and IRB. If these relationships are disrupted, the study may lose its ethical foundation. Although assuring an appropriate balance of risk and benefit is the responsibility of the investigator as overseen by the IRB, the practicalities are now usually abdicated to a TEMC. But the sponsor and the TEMC typically do not answer to any IRB. Thus, it is not appropriate for the TEMC/sponsor to withhold knowledge from the investigator (and from the IRB) but require that investigator to assure the IRB that the study remains ethically appropriate. Furthermore, the IRB cannot be completely comfortable with the continuation of a study with only passive confirmation by the TEMC. In many cases, IRBs have requested and obtained active confirmation based on data provided by the TEMC. This is a somewhat paradoxical situation in which the investigator role is usurped by the TEMC.
18.2.4 Independent Experts in Monitoring
Responsible monitoring may call for independent experts alongside the TEMC process. Such experts could include a statistician, independent medical monitor, or medical safety monitor. An Independent Medical Monitor (IMM) can be appointed by the principal investigators if risk to the participants is more than minimal, but the study is not complex enough to require a TEMC. Thus, for low risk trials, the IMM could be the primary mechanism for monitoring. This is common in dose-ranging, dose-finding, or other early developmental trials, for example. An IMM must be independent of the study and without conflicts of interest. The IMM should function like a TEMC and review the protocol and interim activities for data integrity, protocol adherence, and safety. Such a review examines adverse events and consequent study dropouts, and could yield recommendations for continuing, modifying, or terminating the trial.

A medical safety monitor (MSM) is commonly used as an adjunct to other monitoring methods. The MSM will be a physician who monitors reports of serious adverse events in real time to identify safety concerns. The MSM may suggest protocol changes to reduce the frequency or severity of particular adverse events. The MSM remains masked as much as feasible. The MSM may advise investigators about management of adverse events, but should not be involved in other aspects of the trial. Reports concerning adverse events will go from the MSM to the principal investigator, TEMC, sponsor, regulators, and so on. If risk to participants appears elevated based on MSM assessments, the TEMC may meet to plan appropriate actions. When the MSM role is essential, plans to prevent gaps in coverage are necessary.

Another type of independent expert has been proposed for some monitoring circumstances—the independent statistician [350, 418, 470]. The role assigned to such an individual would be to prepare and present interim analyses to the TEMC. The independent statistician would not be a member of either the TEMC or the steering committee. This role overlaps with those of the statistician in the data coordinating center or steering committee and the statistician on the TEMC, but the extra role and firewalling have been proposed to help remove any conflicts of interest. A particular concern is the necessary unmasking of the lead statistician in the data center or steering committee. The independent statistician model is by no means universal at present.
18.3 ORGANIZATIONAL ISSUES RELATED TO MONITORING
The basic responsibility of the TEMC is to assure that the study is being conducted according to high scientific and ethical standards, and that participants have an appropriate risk-benefit circumstance. The TEMC is a surrogate for the IRB with respect to ongoing favorable risk-benefit details, and it is important for investigators to connect these two perspectives. To accomplish this, the specific tasks of the TEMC are as follows:
1. Assess study and clinic performance with respect to participant recruitment, retention, follow-up, and protocol adherence
2. Review the interim and final statistical analysis plan
3. Monitor interim data for safety and efficacy
4. Review protocol modifications and amendments for their impact on scientific integrity and safety
5. Monitor recruitment progress and losses to follow-up
6. Review data completeness, quality, and missingness
7. Recommend adaptations based on pre-specified decision guidelines
8. Review ancillary or companion studies proposed by investigators
9. Advise investigators and sponsors as to the continuation of an ongoing trial
10. Review and advise on publications
These points will be discussed in more detail below.
18.3.1 Initial TEMC Meeting
The first TEMC meeting usually occurs before a trial is open to accrual. Points of review include the informed consent document, the overall protocol, outcomes, analyses, adverse event reporting procedures, the monitoring and interim analysis plan and decision guidelines, and any proposed adaptive features. The proposed monitoring plan will be a particular focus of this pre-study review. It is key to establish an appropriate monitoring plan early to preserve the scientific integrity of the trial. The relationship between the TEMC and the study sponsor also needs to be made clear at this first meeting. Sometimes sponsor representatives will attend TEMC meetings as nonvoting observers. Such representatives can greatly help draft TEMC guidelines and formulate operating procedures covering meeting frequency, meeting minutes and reports, and what interim data, if any, can be released. This initial meeting is the best time to dispense with pesky requests for TEMC masking if the discussion has not already taken place.
18.3.2 The TEMC Assesses Baseline Comparability
Before recommending termination of a clinical trial because of treatment differences or side effects, the TEMC will establish that any observed differences between the treatment groups are not due to imbalances in subject characteristics at baseline. When such differences are present and thought to be of consequence for the outcome, the treatment effect typically would be “adjusted” for the imbalance, perhaps using methods discussed in Chapter 21. Without the exploratory adjusted analyses, the trial might continue under the assumption that the differences are, in fact, a consequence only of baseline imbalances. This is likely to be a risky assumption. Familiar statistical methods will usually suffice to assess baseline balance and determine its impact on the observed outcome. Baseline imbalances by themselves are not likely to be a cause for much concern. For example, in a well-conducted randomized trial, they must be due to chance. However, because imbalances can undermine the appearance or credibility of a trial, some interventions might be proposed to correct them. These could include modifications of blocking or other methods employed to create balance.
18.3.3 The TEMC Reviews Accrual and Expected Time to Study Completion
Design Assumptions Another consideration for the TEMC that reflects on the decision to continue is the rate of accrual onto the trial and the projected duration of the study based on the current trends. Often the accrual rate at the beginning of a trial is slower than that projected when the study was designed or slower than that required to finish the trial in a timely fashion. In many studies the accrual rate later increases and stabilizes. In any case, after accrual has been open at all centers for several months, investigators can get a reliable estimate of how long it will take to complete the study under the original design assumptions and the observed accrual and event rates. At this early point in the trial it may be evident that design assumptions are inaccurate. For example, the dropout rate may be higher than expected, the event rate may be lower than planned (because trial participants are often at lower risk than the general population with the same disease), or the intervention may not be applied in a manner sufficient to produce its full effect (e.g., some dietary interventions). Each of these circumstances could significantly prolong the length of time necessary to complete the study. Unless remedial actions are taken, such as increasing accrual by adding study centers, the trial may be impossible to complete. The TEMC may recommend that these studies terminate prematurely.

Resource Availability It may also happen that some of the limited resources available to conduct the clinical trial are being consumed more rapidly than planned. Money is likely to be at the top of this list, but one would have to consider shrinking human resources as well. Difficulties obtaining rare drugs are another example. Occasionally, irreplaceable expertise is lost, for example, because of the death of an investigator. Any of these factors may substantially impede the timely completion of a study.
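Returning to the accrual projection described under Design Assumptions above, a minimal sketch of such a calculation is shown below. The target sample size, subjects enrolled, and months open are hypothetical numbers chosen only for illustration, not values from any trial discussed here.

```python
# A minimal, hypothetical sketch of projecting time to complete accrual from
# the observed accrual rate; all numbers are assumptions for illustration.
target_accrual = 400      # planned total sample size (assumed)
enrolled_so_far = 60      # subjects accrued to date (assumed)
months_open = 8           # months since all centers opened (assumed)

observed_rate = enrolled_so_far / months_open                     # subjects per month
months_remaining = (target_accrual - enrolled_so_far) / observed_rate

print(f"Observed accrual: {observed_rate:.1f} subjects/month")
print(f"Projected additional time to reach {target_accrual}: {months_remaining:.0f} months")
```

The same arithmetic, applied to the observed event rate rather than accrual, indicates how an optimistic design assumption about events can stretch the projected study duration.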
18.3.4 Timeliness of Data and Reporting Lags
The TEMC cannot perform its duties without up-to-date data and the database cannot be updated without a timely and complete submission of data forms. Monitoring this activity is relatively routine for the coordinating center and is a reliable way of spotting errors or sloppy work by clinics or investigators. While there are numerous reasons why forms might be submitted behind schedule, there are not so many why they would be chronically delayed or extremely late. The percentage of forms submitted to the coordinating center within a few weeks or months of their due date should be quite high, such as over 90%. There can be ambiguities in exactly how timely the data are at the time of a TEMC review. Many interim analyses are done on a database with a given cutoff date, perhaps a month or so prior to the formal data review. This gives the impression that the data are up to date as of the cutoff date. However, because of the natural lag between the occurrence of an event and the time it is captured in the database, analyses on data as of a given cutoff date may not include events immediately prior to the cutoff date. For long-term ongoing reviews, this may not be much of an issue.
In other cases it may be essential to know that all events up to the specified cutoff date are included in the interim report. Extra data sweeps after the cutoff date are required to assure this. The TEMC should understand which circumstance pertains, especially for reviews early in a trial. Although it makes extra demands on the trial infrastructure to know that all events are captured as of the cutoff date, the TEMC may be reassured to operate under that model.
18.3.5 Data Quality Is a Major Focus of the TEMC
Even when the design assumptions and observed accrual rate of an ongoing clinical trial suggest that the study can be completed within the planned time, problems with data quality may cause the TEMC to recommend stopping. Evidence of problems with the data may become available after audits or as a result of more passive observations. An audit can be a useful device for the TEMC to assure itself that data errors do not contribute to any treatment differences that are observed.

The TEMC will routinely review subject eligibility. Minor deviations from the protocol eligibility criteria are common and are not likely to have any consequence for the internal or external validity of the trial. These minor errors are study dependent, but they might include things such as errors of small magnitude in the timing of registration or randomization or baseline laboratory values that are out of bounds. More serious eligibility violations undermine the internal validity of the trial and might include things such as randomizing subjects who have been misdiagnosed or those with disease status that prevents them from benefiting from the treatment. Usually in multicenter trials, eligibility violations occur in fewer than 10% of those accrued. If the rate is much higher than this, it may be a sign of internal quality control problems.

In nearly all trials, there is a set of laboratory or other tests that needs to be completed either as a part of eligibility determination or to assess the baseline condition of the subject. Investigators should expect that the number of such tests would be kept to the minimum required to address the study question and that 100% of them are performed on time. One must not assume that missing data are “normal.” Failure of study centers or investigators to carry out and properly record the results of these tests may be a sign of serious shortcomings that demand an audit or closing accrual at the clinic. The TEMC will be interested in the completion rate of these required tests.

Treatment compliance or adherence is another measure of considerable importance in the evaluation of data quality. Viewing trials from the intention to treat (ITT) perspective (see Chapter 19), the only way to ensure that the ITT analysis yields an estimate of treatment effect that is also the effect of actually receiving the treatment is to maintain a high rate of treatment adherence. When adherence breaks down it might be a sign of one or more of the following: (i) serious side effects, (ii) subjects too sick to tolerate the therapy or therapies, (iii) poor quality control by the investigators, or (iv) pressures from outside the study (e.g., other treatments). In any case, poor adherence threatens the validity of the study findings and may be cause for the TEMC to stop the trial.
18.3.6 The TEMC Reviews Safety and Toxicity Data
Safety and toxicity concerns are among those most carefully considered by the TEMC. There are three characteristics of side effects that are relevant: the frequency of side effects, their intensity or seriousness, and whether or not they are reversible. Frequent side effects of low intensity that are reversible by dose reduction or other treatment modifications are not likely to be of much consequence to the subjects or concern to the investigators. In contrast, even a rarely occurring toxicity of irreversible or fatal degree could be intolerable in studies where subjects are basically healthy or have a long life expectancy. In diseases like cancer or AIDS, serious but reversible toxicities are likely to be common side effects of treatment and are frequently accepted by subjects because of the life-threatening nature of their illness. In fact, in the cytotoxic drug treatment of many cancers, the therapeutic benefit of the treatment may depend on, or be coincident with, serious but reversible toxicity. This idea is especially important when considering the design of phase I trials in oncology. Here experienced investigators may realize that the implications of serious toxicity are not as far-reaching as they would be in many other diseases.

18.3.7 Efficacy Differences Are Assessed by the TEMC
Efficacy, or the lack of it, is the most familiar question to evaluate when considering the progress of a trial. The importance of learning about efficacy as early as possible has ethical and resource utilization motivations. Before attributing any observed differences in the groups to the effects of treatment, the TEMC will be certain that the data being reviewed are of high quality and that baseline differences in the treatment groups cannot explain the findings. Statistical stopping rules like the ones outlined later in this chapter may be of considerable help in structuring and interpreting efficacy comparisons. It is possible that efficacy and toxicity findings during a trial will be discordant, making the question of whether or not to stop the study more difficult. For example, suppose that a treatment is found to offer a small but convincing benefit in addition to a small but clinically important increase in the frequency and severity of side effects. On balance, these findings may not be persuasive for terminating the trial. In this situation, it may be important to continue the study to gather more precise information or data about secondary endpoints.

18.3.8 The TEMC Should Address Some Practical Questions Specifically
From a practical perspective, the TEMC needs to answer only a few questions to provide investigators with the information required to complete the experiment.

Should the Trial Continue? The most fundamental question, and the one most statistical monitoring guidelines are designed to help answer, is “should the study be stopped?” When the efficacy and toxicity information is convincing, the TEMC will likely recommend stopping. However, there is room for differences of opinion about what is, to a large degree, a subjective assessment. Clinical trials yield much more information than just a straightforward assessment of treatment and side effect differences. There are many secondary questions that are
often important motivations for conducting the trial. Also the database from the study is a valuable resource that can be studied later to address ancillary questions and generate new hypotheses. Much of this secondary gain from a completed clinical trial can be lost when the study is stopped early. If the window of opportunity for performing comparative studies closes, as it might if the TEMC terminates the trial, some important questions will remain unanswered. Therefore, the TEMC should weigh the decision to stop carefully and in the light of the consequences of losing ancillary information.
Should the Study Protocol Be Modified? If the trial is to continue, the TEMC may ask if the study protocols should be modified on the basis of the interim findings. A variety of modifications may be prudent, depending on the clinical circumstances. For example, side effects may be worrisome enough to adjust the frequency or timing of diagnostic tests, but not serious (or different) enough to stop the trial. If more than one treatment comparison is permitted by the study design (e.g., in a factorial trial), convincing stopping points may be reached for some treatment differences but not for others. Thus, the structure of part of the trial could be changed, while still allowing the remaining treatment comparisons to be made. Numerous other types of modifications may be needed after an interim look at the data. These include changes in the consent documents or process, improvements in quality control of data collection, enhancing accrual resources, changes in treatment to reduce dropouts or nonadherence, or changes in the eligibility criteria or their interpretation. In some cases one or more treatments or treatment schedules will have to be modified, hopefully in ways that preserve the integrity of the biological question being asked.
Does the TEMC Require Other Views of the Data? Another practical question for the TEMC is whether or not the data have been presented in sufficient detail and proper format to examine the monitoring questions. Remaining questions could be answered by additional views of the data, analysis of certain subsets, or presentation of supporting details. These could be important considerations for two reasons. First, preparation of additional analyses and supporting documentation may not be a trivial task for the statistical office. It could require extra quality control effort and verification of interim data. Second, continuation of the trial may strongly depend on subtle findings or those in a small percentage of the study participants. In such cases it is likely that some ways of viewing the results will not display the influential findings. An unfortunate encounter with fraudulent or fabricated data would illustrate this.
Should the TEMC Meeting Schedule Be Altered? Interesting trends in the data might prompt the TEMC to meet more often than originally planned. Such a decision could create problems for some statistical monitoring plans, which is a reason to use as flexible an approach as possible. In the same way, slower accrual or a lower event rate than originally projected could prolong the interval between TEMC meetings. In terms of “information time” for the trial, the meeting may occur at the recommended intervals. However, in terms of calendar time, the frequency could change.
Can Certain Data Be Released? Investigators frequently approach the TEMC to ask for access to study data that are needed for planning new trials, or whose release has scientific value and would not damage the goals of the study. This is most common in large multicenter collaborations with dozens of trials ongoing at any one time. An example of a data release request is access to control arm data for a substudy with endpoints unrelated to the primary one. Another example would be for data pooled across treatment groups to assess event rates so a new trial can be planned, or even to modify the accrual and follow-up period of the current trial. Sometimes data might be requested by the TEMC from a different trial to enhance its oversight. The hardest questions are when a true public release is requested. In most cases it is clear whether or not the data release can damage the integrity of the trial, or if the study is over for all practical purposes. The decision can be made accordingly without a mindset of strict confidentiality to the end. As a general principle, it is relatively easy to expand the circle of confidentiality around the TEMC. Also, it is never a problem for the TEMC to request special analyses such as for accrual rates, event rates, or study projections for any reason. A well-constituted and informed TEMC will likely make a good decision regarding the timing of a publication.

Are There Other Recommendations from the TEMC? Finally, there are other recommendations that may result from the TEMC review. An example would be changes in the committee itself, perhaps adding expertise in specific areas. If recruitment goals are not being met, the TEMC might recommend initiating efforts to improve recruitment, such as interactions with target populations through community leaders or organizations. Sometimes for multicenter trials a meeting of the investigators may increase enthusiasm and clear up minor problems with accrual. In any case, the TEMC should be viewed and should act as objective experts who have in mind the best interests of all parties associated with the study. The committee should thus feel free to recommend any course of action that enhances the safety and quality of the clinical trial.

Documentation There are several targets for TEMC documents including investigators, IRBs, sponsors, critics, and the historical record. The TEMC reports must strike a balance between providing evidence of due diligence and not revealing too much information prematurely. Investigators and IRBs need to be certain that objective experts are still uncertain regarding the primary therapeutic questions of the study. They also want to be reassured that key points of risk/benefit for participants have been examined and that the trial is appropriate to continue. Sponsors may also focus on the implications for additional resources, alternate views of the data, and planning for the future. It is usually necessary for the TEMC to generate several types of meeting documentation. A simple brief report or letter to the sponsor, investigators, and IRBs usually suffices during the actual conduct of the trial, provided it answers specific relevant questions such as those discussed below. Detailed minutes of the TEMC meeting should be taken but may be held in total confidence until after the study is finished. The detail of minutes should be sufficient to demonstrate in retrospect that key issues of data findings, study management, and risk/benefit were addressed. There may be other specific requests
of the TEMC (e.g., release of some administrative data) that might generate additional responses. TEMC records should explicitly list the documents that have been reviewed at each meeting. These will typically include the TEMC charter, study protocol and amendments, efficacy and safety report, model consent forms, correspondence, formal queries, relevant literature, and manuscripts in preparation. In some cases it may be necessary also to review the study’s own historical record.
18.3.9 The TEMC Mechanism Has Potential Weaknesses
Although TEMCs can satisfy the need for informed impartial decision making in clinical trials, the mechanism has potential shortcomings, some of which were discussed earlier in this chapter. Trial investigators are often not permitted to be part of the TEMC so that this vital perspective may be underrepresented during deliberations. Although clinical investigators are in a state of equipoise when a trial begins, experience, knowledge, and opinion gained during the study can eliminate it. The TEMC may preserve the state of equipoise, either because members do not have the clinical experiences of working with the treatments and patients, or because they employ somewhat abstract, statistically based termination criteria (discussed below). The stopping guidelines typically consider only a single outcome, whereas the decision to stop a trial may depend on several factors. The partially isolated character of the TEMC can sometimes be advantageous, but it is generally undesirable. The relationship between the TEMC and trial sponsor can sometimes be an issue as discussed earlier. In some trials sponsored by NIH, for example, the TEMC reports formally to the sponsoring institute only, and not to the study investigators or local IRBs. The sponsor has no written obligation to inform the investigators about TEMC recommendations. It is possible (though unlikely) that the sponsor would not agree with TEMC concerns and not convey recommendations to the study investigators. However, the investigators are the ones who carry ethical obligations to the subjects. Thus, a policy that does not require TEMC recommendations to be transmitted explicitly and formally to study investigators does not honor all ethical obligations. The relationship between TEMCs and IRBs or Ethics Boards is sometimes strained. IRBs may operationally abdicate to a TEMC some of their responsibilities to assure subject safety, creating a potential for dissonance. The multidisciplinary expertise needed to monitor a large trial may not be present on IRBs, requiring them to rely on a TEMC. However, TEMC confidentiality often prevents details of interim safety analyses from reaching the IRB. This circumstance can be quite contentious but can usually be managed by transmitting adequate information from the TEMC to the IRB to satisfy the latter’s obligations. Because of their composition and function, TEMCs can emphasize impartiality over expertise. Overzealous adherence to statistical stopping criteria could be a manifestation of this. Some TEMCs remain masked to treatment assignments to increase impartial assessments of differences that may become evident. As mentioned previously, it is hard to see how a TEMC can adequately assure subject safety while masked. Finally, there is some concern that studies employing TEMCs tend to stop too soon, leaving important secondary questions unanswered.
18.4 STATISTICAL METHODS FOR MONITORING

18.4.1 There Are Several Approaches to Evaluating Incomplete Evidence
To facilitate decision making about statistical evidence while a trial is still ongoing, several quantitative methods have been developed. These include likelihood-based methods, Bayesian methods, decision-theoretic approaches, and frequentist approaches (e.g., fixed sample size, fully sequential, and group sequential methods). Evaluating incomplete evidence from a comparative trial is one area of statistical theory and practice that tends to highlight differences between all of these, but particularly Bayesian and frequentist approaches. No method of study monitoring is completely satisfactory in all circumstances, although certain aspects of the problem are better managed by one method or another. Frequentist methods appear to be the most widely applied in practical situations. For a general discussion of monitoring alternatives, see Gail [555]. Some practical issues are discussed in DeMets [349]. Statistical guidelines for early stopping can be classified further in two ways: those based only on the current evidence at the time of an interim analysis and those that attempt to predict the outcome after additional observations have been taken (e.g., if the trial continued to its fixed sample size conclusion). Frequentist procedures that use only the current evidence include sequential and group sequential methods, alpha spending functions, and repeated confidence intervals. Stochastic curtailment is a frequentist method that uses predictions based on continuation of the trial. Corresponding Bayesian methods are based on the posterior distribution or a predictive distribution. Regardless of the method used to help quantify evidence, investigators must separate the statistical guidelines and criteria used for stopping boundaries from the qualitative and equally important aspects of monitoring a clinical trial. Although the statistical criteria are useful tools for addressing important aspects of the problem, they suffer from several shortcomings. They tend to oversimplify the information relevant to the decision and the process by which it is made. The philosophy behind, and the results of, the various statistical approaches are not universally agreed upon, and they are sometimes inconsistent with one another. These problems set the stage for formal ways of introducing expert opinion into the monitoring process, discussed below.
Similarities These different statistical methods to facilitate early termination of trials do have several things in common. First, all methods require the investigators to state the questions clearly and in advance of the study. This demands a priority for the various questions that a trial will address. Second, all methods for early stopping presuppose that the basic study is properly designed to answer the research questions. Third, all approaches to early stopping require the investigator to provide some structure to the problem beyond the data being observed. For example, fixed sample size methods essentially assume that the interim data will be inconclusive. The frequentist perspective appeals to the hypothetical repetition of a series of identical experiments and the importance of controlling type I errors. Bayesian approaches require specifying prior beliefs about treatment effects, and decision methods require constructing quantitative loss functions. Any one of these approaches may be useful in specific circumstances or for coping with specific problems.
Fourth, except for using a fixed sample size, the various approaches to early stopping tend to yield quantitative guidelines with similar practical performance on the same data. Perhaps this is not surprising because they are all ways of reformulating evidence from the same data. In any specific study, investigators tend to disagree more about the nonstatistical aspects of study termination, making differences between quantitative stopping guidelines relatively unimportant. Fifth, investigators pay a price for terminating trials early. This is true whether explicit designs are used to formalize decision making or ad hoc procedures are used. Unless investigators use designs that formalize the decision to stop, it is likely that the chance for error will be increased or the credibility of the study will suffer. Studies that permit or require formal data-dependent stopping methods as part of their design are more complicated to plan and carry out, and when such trials are stopped early, the estimates of treatment differences can be biased. Also, accurate data may not be available as quickly as needed. Good formal methods are not available to deal with multiple endpoints such as event times, safety, and quality of life or to deal effectively with new information from outside the trial itself. Finally, there is a difficult to quantify but very important question of how convincing the evidence will be if the study is terminated early. No method is a substitute for judgment. Investigators should keep in mind that the sequential boundaries are guidelines for stopping. The decision to stop is considerably more complex than capturing information in a single number for a single endpoint would suggest. Although objective review and consensus regarding the available evidence is needed when a stopping boundary is crossed, the decision cannot be made on the basis of the statistic alone. Mechanisms for making such decisions are discussed later in this chapter.
Differences and Criticisms There are substantial differences between the various approaches to study monitoring and stopping, at least from a philosophical perspective. Decision-theoretic approaches require specifying a loss function that attempts to capture the consequences of incorrect conclusions. This usually requires an oversimplification of losses and the consequences of incorrect conclusions. These approaches use the idea of a “patient horizon,” which represents those who are likely to be affected by the trial. The magnitude of the patient horizon has a quantitative effect on the stopping boundary. Bayesian approaches rely on choosing a prior distribution for the treatment difference, which summarizes prior evidence and/or belief. This can be a weakness of the method because a prior probability distribution may be difficult to specify, especially for clinicians. Apart from this, a statistical distribution may be a poor representation of one’s knowledge (or ignorance). Even if we ignore these problems, different investigators are likely to specify a different prior distribution. Its exact form is somewhat arbitrary. The frequentist approach measures the evidence against the null hypothesis using the 𝛼-level of the hypothesis test. However, hypotheses with the same 𝛼-level may have different amounts of evidence against them and Bayesian methods reject this approach [68, 300, 301]. For monitoring, frequentist methods have three additional problems. First, they lack the philosophical strength and consistency of Bayesian methods. Second, they are difficult to use if the monitoring plan is not followed. Finally, the estimated treatment effects at the end of the trial are biased if the study is terminated early according to one of these plans. This is discussed below.
18.4.2 Monitoring Developmental Trials for Risk
Quantitative precision is not possible when monitoring developmental trials even when there is clinical exactness. Great flexibility is required for decision aids, which will have excellent properties only when true risks are low or high. Between these extremes, where most circumstances arise, we must cope with the high likelihood of error stemming from inevitable weak evidence. Some informal suggestions for quantitative monitoring aids are discussed in this section. The critical difference between risk and safety as discussed in Section 5.3.10 remains an issue. Stopping with a declaration of safety is not a reasonable expectation despite its desirability. More likely, a developmental trial would draw attention because risk events are occurring above a prespecified clinical threshold. This requires that the events of interest are well defined, recognized in a timely fashion, and properly attributed to the treatment, which can be a challenge if the disease causes events of the same or similar nature. This also implies that investigators have carefully defined a threshold frequency for unacceptable clinical events. With varying degrees of statistical rigor, guidelines for stopping a trial in this setting could be based on intuitive clinical tolerances, confidence intervals, credible regions, repeated looks at accumulating data, and so on. Whatever the method, it is key to determine the operating characteristics (OC) of the suggested boundary through calculation or simulation to be certain it behaves in the manner intended. The OC is simply the average behavior of the rule when implemented in an assumed but realistic circumstance.

Example 18.3. Suppose we have zero tolerance for treatment-related side effects in an early developmental trial, meaning that the study will end if such an event is ever observed. Assume that otherwise the trial will continue until 20 participants are enrolled. This is a strict stopping criterion, but might be appropriate for subjects whose conditions are low risk and who are treated with low risk therapy. The chance of observing $n$ participants with no events is $P_n = (1 - \epsilon)^n$, where $\epsilon$ is the probability of a treatment-related side effect (Fig. 18.2). The curves in Figure 18.2 are the operating characteristics of a design with the specified true event rate. If the true treatment-related side effect frequency is 0.07, the bottom curve in Figure 18.2 indicates that there is only about a 50% chance of continuing beyond the 10th subject before the stopping rule is triggered. It is a matter of clinical judgment as to whether or not these or any other operating characteristics are reasonable.
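As a concrete check of this operating characteristic, the short sketch below simply evaluates $P_n = (1-\epsilon)^n$ for a few assumed event rates; it is an illustration only, not code from this book.

```python
# Sketch: operating characteristic of the zero-tolerance rule in Example 18.3.
# P_n = (1 - eps)^n is the chance that the first n subjects have no
# treatment-related events, i.e., the chance the trial continues past n.

def prob_continuing(n, eps):
    """Probability of no treatment-related events among the first n subjects."""
    return (1.0 - eps) ** n

for eps in (0.01, 0.03, 0.05, 0.07):      # assumed true event rates
    row = "  ".join(f"n={n}: {prob_continuing(n, eps):.2f}" for n in (5, 10, 15, 20))
    print(f"eps = {eps:.2f}  ->  {row}")
# For eps = 0.07 the value at n = 10 is about 0.48, matching the roughly
# 50% chance of continuing beyond the 10th subject quoted in the text.
```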
Longitudinal Boundaries In a single cohort trial, investigators frequently have tolerability thresholds for unacceptable events that depend on the current number of study participants. The trial might be paused briefly at planned landmarks to compare observed events with the boundary. The probability of passing through any landmark and continuing the trial (or the chance of stopping) must account for having successfully passed earlier landmarks. Suppose the decision boundary is characterized as a set of integer pairs {𝑟𝑖 , 𝑛𝑖 } where 𝑟𝑖 represents the largest number of unacceptable events we can tolerate out of 𝑛𝑖 subjects in the cohort at that time. The exact values of 𝑟𝑖 and 𝑛𝑖 are chosen to yield a boundary with quantitative properties appropriate to the clinical setting.
FIGURE 18.2 Probability of no events in $n$ subjects for event rates from 0.01 (top) to 0.07 (bottom).
Assume that the underlying probability of an unacceptable event, $p$, is constant for all study participants. The chance of continuing the trial past the first boundary point is the probability of having $r_1$ or fewer events in $n_1$ subjects, or
\[
Q_1 = \Pr[X \le r_1 \mid n_1], \tag{18.1}
\]
where $X$ is a random variable that represents the number of events observed. To simplify the notation, I have omitted $p$ from the probability expression. The chance of stopping at the first boundary point is $1 - Q_1$. At the second interim, the total number of events must be less than or equal to $r_2$ to continue, and we must account for having passed the first landmark. The first decision point could have been passed with $0, 1, \ldots, r_1$ events. Doing the accounting for each of those possibilities yields the chance of passing the first and second point,
\[
Q_2 = \sum_{i=0}^{r_1} \Pr[X = i \mid n_1] \Pr[Y \le r_2 - i \mid n_2 - n_1]. \tag{18.2}
\]
The first term in the sum represents events at the first decision point, and the second term represents passing the second landmark after the incremental number of study subjects. The chance of stopping at or before the second landmark is $1 - Q_2$, and the chance of stopping exactly at the second landmark is $Q_1 - Q_2$. The probability of passing the first three decision points accounts for all possible ways of passing the first and second. Hence,
\[
Q_3 = \sum_{i=0}^{r_1} \sum_{j=0}^{r_2 - i} \Pr[X = i \mid n_1] \Pr[Y = j \mid n_2 - n_1] \Pr[Z \le r_3 - i - j \mid n_3 - n_2], \tag{18.3}
\]
and so on with later boundary points. The chance of stopping exactly at the third landmark is $Q_2 - Q_3$.
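As a sketch of how these passing probabilities can be computed in practice (an illustration under the stated binomial assumption, not code from this book), the following function generalizes equations (18.1)–(18.3) to an arbitrary boundary $\{r_i, n_i\}$:

```python
# Sketch (not from the book): passing probabilities Q_1, Q_2, ... from
# equations (18.1)-(18.3), for an arbitrary boundary {(r_i, n_i)}, assuming a
# constant binomial event probability p.
from scipy.stats import binom

def passing_probabilities(boundary, p):
    """boundary: list of (r_i, n_i) pairs with n_i increasing.
    Returns [Q_1, Q_2, ...], the probability of passing each landmark."""
    dist = {0: 1.0}            # Pr[j events so far and all earlier landmarks passed]
    n_prev = 0
    q = []
    for r_i, n_i in boundary:
        m = n_i - n_prev       # incremental number of subjects at this landmark
        new_dist = {}
        for j, pj in dist.items():
            # add k new events, keeping the running total at or below r_i
            for k in range(r_i - j + 1):
                new_dist[j + k] = new_dist.get(j + k, 0.0) + pj * binom.pmf(k, m, p)
        dist, n_prev = new_dist, n_i
        q.append(sum(dist.values()))   # Q_i = Pr[passing landmarks 1..i]
    return q

boundary = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 25)]
print([round(x, 2) for x in passing_probabilities(boundary, p=0.05)])
# expected output: [0.98, 0.97, 0.97, 0.97, 0.97]
```

Run with $p = 0.05$ and the boundary $\{(1,5), (2,10), (3,15), (4,20), (5,25)\}$, this reproduces the first row of Table 18.2 below; searching over candidate values of $n_1$, $n_2$, and $n_3$ in the same way is one route to the denominators found in Example 18.4.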
TABLE 18.2 Probabilities of Passing Stages in a Trial With Stopping Boundary $\{r_i, n_i\} = \{(1, 5), (2, 10), (3, 15), (4, 20), (5, 25)\}$

p       Q1      Q2      Q3      Q4      Q5
0.05    0.98    0.97    0.97    0.97    0.97
0.10    0.92    0.89    0.87    0.87    0.87
0.15    0.84    0.76    0.71    0.70    0.69
0.20    0.74    0.61    0.52    0.49    0.47
0.25    0.63    0.46    0.36    0.31    0.27
0.30    0.53    0.33    0.22    0.17    0.13
0.35    0.43    0.22    0.13    0.08    0.05
0.40    0.34    0.14    0.06    0.03    0.02
0.45    0.26    0.08    0.03    0.01    0.00
0.50    0.19    0.04    0.01    0.00    0.00
The most appropriate probability distribution to implement these calculations is the binomial with an assumed value for $p$. As suggested above, we could empirically list the set $\{r_i, n_i\}$ and determine the stopping probabilities, or specify boundary probabilities at $n_i$ and calculate a set of $r_i$ that satisfies them. An example of the first strategy is shown in Table 18.2. It may also be useful to specify by design the probability of continuing (or stopping) for a set of landmarks assuming some true frequency of unacceptable events. The boundary values $\{r_i, n_i\}$ that are consistent with those requirements can then be determined. For a small cohort, sensible numerators would be $r = \{1, 2, 3, \ldots\}$, because we would want to consider stopping after each unacceptable event. That leaves only the $n_i$ to be determined numerically from equations (18.1)–(18.3).

Example 18.4. We plan to consider stopping a trial after each of the first three unacceptable events that occur. Assume that events occur with a true frequency of 0.05, and we want a probability of 95% of continuing past the first landmark, 90% of continuing past the second, and 87.5% of continuing past the third. As above, the event counts are to be $r = \{1, 2, 3, \ldots\}$. We then calculate the number of subjects to include in each interim analysis to be consistent with these design requirements. To do so, we can simultaneously solve the three equations $\{Q_1 = 0.95,\ Q_2 = 0.90,\ Q_3 = 0.875\}$ for the denominators $n_1$, $n_2$, and $n_3$, respectively. Doing so, and rounding noninteger values down to be conservative, yields $n_1 = 7$, $n_2 = 20$, and $n_3 = 31$. More than 1, 2, or 3 events, respectively, at any of these cohort sizes would then stop the trial with the implication that the true event frequency exceeds 0.05.

Bayesian Viewpoint Suppose we view the probability of an unacceptable event, $p$, as a random variable. If the accumulating evidence from our clinical trial suggests a high likelihood that $p$ exceeds a critical threshold, we should stop the trial. For example, in a high-risk cohort we might not wish to tolerate more than a 25% chance that the probability of some serious or life-threatening side effect is greater than 0.2. In a low-risk population, we might tolerate only a 5% chance that serious side effects occur with frequency 0.1, for example. A sensible probability model for $p$ is the beta distribution [796] because, among other reasons, all of its values lie in the range (0,1). The beta distribution has two parameters,
Piantadosi
Date: July 27, 2017
550
Time: 4:45 pm
TREATMENT EFFECTS MONITORING
𝑎 and 𝑏. It is usually written mathematically in a way such that 𝑎 − 1 represents the number of successes, and 𝑏 − 1 the number of failures out of a series of 𝑎 + 𝑏 − 2 trials. Hence, when 𝑎 = 𝑏 = 1, the beta is simply a uniform distribution, implying that when evidence is lacking, any value for 𝑝 is equally probable. Event and nonevent counts (data from our trial) are parameters of a beta distribution, which then becomes a probability model for the true frequency of events. If the data suggest that the true frequency is high because a large fraction of the distribution exceeds a tolerability threshold, we would suspend the trial. The data and decision can be updated at any time. Furthermore, this method can explicitly incorporate prior data if available. If we are willing to put prior belief regarding the true value of 𝑝 in the form of data, that can also be accomodated. The remainder of this discussion will assume that no prior evidence or opinion exists. In monitoring a developmental trial, we might ask what combinations of 𝑎 and 𝑏 always maintain a small upper tail area above our tolerability threshold. In other words, how many events, 𝑎 − 1, and nonevents 𝑏 − 1, are consistent with a strong chance that the event probability is below our prespecified clinical tolerance? This is a question about the upper tail area of the most likely distribution for 𝑝, given the data. For example, when there are zero events and 15 nonevents in our cohort, the upper tail area is 18.5% above 0.1, 7.4% above 0.15, and 2.8% above 0.2. We might characterize this as strong evidence that the event rate is less than 0.2, moderate evidence that it is less than 0.15, and weak evidence that it is less than 0.1. More generally, we would specify a tolerability threshold and an acceptable error (chance of exceeding it), and calculate 𝑎 and 𝑏 consistent with them. If we represent the beta probability density function at 𝑥 by 𝐵(𝑥|𝑎, 𝑏), the clinical tolerability threshold by 𝑡, and the error (upper tail area) by 𝛼, the requirement is 𝑡
1−
∫0
1
𝐵(𝑥|𝑎, 𝑏)𝑑𝑥 =
∫𝑡
𝐵(𝑥|𝑎, 𝑏)𝑑𝑥 ≤ 𝛼.
(18.4)
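As a minimal illustration (mine, not the author's), equation (18.4) can be evaluated with the beta survival function in scipy and solved for a continuous 𝑏, which is then rounded down to a count as described in the text. The parameterization 𝑎 = events + 1, 𝑏 = nonevents + 1 (a uniform prior) and the search bounds are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import beta
from scipy.optimize import brentq

def boundary_pair(events, t, alpha, b_upper=1000.0):
    """Solve equation (18.4) for b given a = events + 1, then round b down
    to an integer as suggested in the text.  Returns (a - 1, b - 1),
    i.e., the event and nonevent counts."""
    a = events + 1
    f = lambda b: beta.sf(t, a, b) - alpha   # upper tail area above t, minus target
    b_exact = brentq(f, 1.0, b_upper)        # tail area decreases as b grows
    return events, int(np.floor(b_exact)) - 1

# A couple of entries for checking against Table 18.3
for ev in (0, 1):
    print(boundary_pair(ev, t=0.05, alpha=0.05))  # should reproduce (0, 57) and (1, 90)
```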
This equation has to be solved numerically. When doing so for relatively small values of 𝑎 and 𝑏, it makes sense to index 𝑎 at integers 0, 1, 2, …, because it represents a count of events. The value of 𝑏 that then satisfies the equation may not be an integer, but because it also represents a count, it can be conservatively rounded down. Some values of 𝑎 and 𝑏 that solve equation (18.4) in the manner indicated for specified 𝑡 and 𝛼 are shown in Table 18.3.

TABLE 18.3 Values of a and b That Solve Equation (18.4) for Specified t and α

  t      α      (a − 1, b − 1)
  0.05   0.05   (0, 57), (1, 90)
  0.05   0.10   (0, 27), (1, 43)
  0.05   0.15   (0, 17), (1, 27)
  0.05   0.20   (0, 12), (1, 19)
  0.05   0.25   (0, 9), (1, 15)
  0.10   0.05   (0, 43), (1, 74)
  0.10   0.10   (0, 20), (1, 35)
  0.10   0.15   (0, 13), (1, 22)
  0.10   0.20   (0, 9), (1, 15)
  0.10   0.25   (0, 7), (1, 12)
  0.15   0.05   (0, 35), (1, 64)
  0.15   0.10   (0, 17), (1, 30)
  0.15   0.15   (0, 10), (1, 19)
  0.15   0.20   (0, 7), (1, 13)
  0.15   0.25   (0, 5), (1, 10)
  0.20   0.05   (0, 30), (1, 56)
  0.20   0.10   (0, 14), (1, 26)
  0.20   0.15   (0, 8), (1, 16)
  0.20   0.20   (0, 6), (1, 11)
  0.20   0.25   (0, 4), (1, 8)

As we accrue the cohort on a developmental trial, we may want periodic assurance that the most likely distribution for 𝑝 lies mostly below a prespecified threshold. A series of these calculations yields a boundary that provides this evidence. The OC of a given boundary can then be determined by calculation or simulation over a range of true event frequencies. Table 18.3 covers cases with a low tolerance for events, but the same approach is relevant for high thresholds. The situation is also symmetric in 𝑎 and 𝑏, depending on how we define events and the tail area.

Example 18.5. A developmental trial is planned with up to 30 participants. Investigators consider it unacceptable if there is ever more than a 25% chance that the true frequency of adverse events exceeds 0.20. Solving equation (18.4) for 𝑎 and 𝑏 produces the following pairs consistent with the specified tolerance: {0, 5}, {1, 10}, {2, 15}, {3, 20}, {4, 25}, {5, 29}. If any of the first five subjects on the trial experience an adverse event, there is greater than a 25% chance that the event
frequency exceeds 0.2 and the study would stop. The next decision point is at 1 + 10 = 11 subjects, where the trial would be stopped if more than one person experiences an adverse event. In 17 subjects, only 2 adverse events can be tolerated, and so on. The operating characteristics for the boundary given in Example 18.5 can be calculated using equations (18.1–18.3). One extra detail must be taken into account. The given boundary is pairs of event and nonevent counts (𝑎𝑖 and 𝑏𝑖). The operating characteristics equations (18.1–18.3) are written in terms of events and total subjects (𝑟𝑖 and 𝑛𝑖). To reconcile, 𝑟𝑖 = 𝑎𝑖 and 𝑛𝑖 = 𝑎𝑖 + 𝑏𝑖. After this correction, the OC is shown in Table 18.4. We can see there that when the true probability of an event is low, the trial is unlikely to hit a stopping boundary. As the true probability of an event approaches the threshold of 0.2, there is a substantial chance the study will be terminated early. For higher probabilities of an event, the trial will almost certainly stop very early. This boundary has generally the correct properties, but does not perform sharply for intermediate event probabilities. As indicated above, this is unavoidable in small studies.
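Continuing the earlier sketch (again illustrative rather than from the text, and assuming the passing_probabilities() function defined above is in scope), the conversion 𝑟𝑖 = 𝑎𝑖, 𝑛𝑖 = 𝑎𝑖 + 𝑏𝑖 can be applied directly and the stopping probabilities 1 − 𝑄𝑘 computed for comparison with Table 18.4.

```python
# Boundary from Example 18.5 as (events a_i, nonevents b_i) pairs
ab_pairs = [(0, 5), (1, 10), (2, 15), (3, 20), (4, 25), (5, 29)]

# Reconcile with equations (18.1)-(18.3): r_i = a_i and n_i = a_i + b_i
rn_boundary = [(a, a + b) for a, b in ab_pairs]

# Stopping probabilities 1 - Q_k over a few true event probabilities,
# reusing passing_probabilities() from the earlier sketch
for theta in (0.05, 0.10, 0.20):
    Q = passing_probabilities(rn_boundary, theta)
    print(theta, [round(1 - q, 2) for q in Q])   # compare with Table 18.4
```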
TABLE 18.4 Operating Characteristics as a Function of the True Probability of an Event, θ, for the Boundary Given in Example 18.5

  θ      1 − Q1   1 − Q2   1 − Q3   1 − Q4   1 − Q5
  0.01   0.05     0.05     0.05     0.05     0.05
  0.02   0.10     0.10     0.10     0.10     0.10
  0.03   0.14     0.15     0.15     0.15     0.15
  0.04   0.18     0.20     0.21     0.21     0.21
  0.05   0.23     0.25     0.26     0.26     0.26
  0.06   0.27     0.30     0.31     0.32     0.32
  0.07   0.30     0.35     0.36     0.37     0.37
  0.08   0.34     0.39     0.41     0.42     0.42
  0.09   0.38     0.44     0.46     0.47     0.48
  0.10   0.41     0.48     0.51     0.52     0.53
  0.11   0.44     0.52     0.55     0.57     0.58
  0.12   0.47     0.55     0.59     0.61     0.62
  0.13   0.50     0.59     0.63     0.66     0.67
  0.14   0.53     0.62     0.67     0.70     0.71
  0.15   0.56     0.66     0.70     0.73     0.75
  0.16   0.58     0.69     0.73     0.77     0.78
  0.17   0.61     0.71     0.76     0.80     0.81
  0.18   0.63     0.74     0.79     0.82     0.84
  0.19   0.65     0.76     0.82     0.85     0.87
  0.20   0.67     0.79     0.84     0.87     0.89

18.4.3 Likelihood-Based Methods

Likelihood methods for monitoring trials are based principally or entirely on the likelihood function constructed from an assumed probability model. The likelihood function is frequently used for testing hypotheses, but it also provides a means for summarizing evidence from the data without a formal hypothesis test. Here we consider a likelihood method originally developed for testing, but use it as a pure likelihood-based assessment of the data. Likelihood methods represent nice learning examples, although they are not used in practice as often as the frequentist methods discussed below.
Fully Sequential Designs Test after Each Observation
One could assess the treatment difference after each experimental subject is accrued, treated, and evaluated. Although this approach would allow investigators to learn about treatment differences as early as possible, it can be impractical for studies that require a long period of observation after treatment before evaluating the outcome. (An alternative perspective on this view is offered below.) This approach is definitely useful in situations where responses are evident soon after the beginning of treatment. It is broadly useful to see how quantitative guidelines for this approach can be constructed. The first designs that used this approach were developed by Wald [1518, 1519] to improve reliability testing and are called sequential probability (or likelihood) ratio tests (SPRT). As such, they were developed for testing specific hypotheses using the likelihood ratio test, and could be classified as frequentist procedures. More generally, as suggested above, the likelihood function is a summary of all the evidence in the data, and the likelihood ratio does not have to be given a hypothesis-testing interpretation. Therefore, I have classified the SPRT as a likelihood method, in spite of it being a “test.” Frequentist methods are discussed below.

SPRT Boundaries for Binomial Outcomes
Consider a binary response from each study subject that becomes known soon after receiving treatment. We are interested in knowing if the estimated probability of response favors 𝑝1 or 𝑝2, values of clinical importance chosen to facilitate decisions about early stopping. We will test the relative evidence in favor of 𝑝1 or 𝑝2 using the likelihood ratio test [835], a commonly used statistical procedure.
Because the response is a Bernoulli random variable for each subject, the likelihood function is binomial,
$$\mathcal{L}(p, \mathbf{X}) = \prod_{i=1}^{m} p^{r_i}(1 - p)^{1 - r_i} = p^{s}(1 - p)^{m - s},$$
where 𝐗 = {𝑟1, 𝑟2, 𝑟3, …, 𝑟𝑚} is the data vector, 𝑟𝑖 is 1 or 0 depending on whether the 𝑖th subject responded or not, 𝑠 is the total number of responses, and 𝑚 is the number of subjects tested. The likelihood is a function of both the data, 𝐗, and the unknown probability of success, 𝑝. The relative evidence in favor of 𝑝1 or 𝑝2 is the likelihood ratio, which was also given in Section 16.4.7 as
$$\Lambda = \frac{\mathcal{L}(p_1, \mathbf{X})}{\mathcal{L}(p_2, \mathbf{X})} = \left(\frac{p_1}{p_2}\right)^{s}\left(\frac{1 - p_1}{1 - p_2}\right)^{m-s}.$$
If Λ ≫ 1, the evidence favors 𝑝1, and if Λ ≪ 1, the evidence favors 𝑝2.

$$\Pr\{\Delta > 2.0\} = 1 - \Phi\!\left(\frac{\log(2.0) - \mu_p}{0.228}\right),$$
where 𝜇𝑝 is the posterior mean of the log hazard ratio under the skeptical prior, Φ(⋅) is the cumulative normal distribution function, and Δ is the true hazard ratio. In other words, there is no convincing evidence to terminate the trial on the basis of this skeptical prior distribution.
Enthusiastic Prior
For comparison, consider what happens when investigators are subjectively enthusiastic about the new treatment. They quantify their opinion in the form of a different prior distribution, centered at the alternative hypothesis, but with the same variance as the skeptical prior. If true, random samples from this distribution would yield hazard ratios in excess of 2.0, 50% of the time. Following the calculations now, the posterior mean, Figure 18.5, is
$$\mu_p' = \frac{0.693 \times 32 + 0.811 \times 45}{32 + 45} = 0.762,$$
with the same standard deviation as that obtained above. Investigators would then estimate the chance that the true hazard ratio exceeds 2.0 as
$$\Pr\{\Delta > 2.0\} = 1 - \Phi\!\left(\frac{\log(2.0) - 0.762}{0.228}\right) = 1 - \Phi(-0.302) = 0.618.$$
Furthermore, the chance that the true hazard ratio exceeds 1.75 is 81%, which might lead some investigators to want the trial terminated because of efficacy. This example illustrates both the Bayesian method and its dependence on the choice of a prior distribution for the treatment effect. In many actual cases, one would not choose either of these distributions as a prior for monitoring purposes, and would instead let the evidence from the trial stand more on its own by employing a “less informative” density function. An example of this might be a normal distribution that is made to be very flat by virtue of having a large variance. This would correspond to having few prior observations. Such a distribution might be chosen by investigators who prefer not to influence monitoring decisions greatly with information from outside the trial.
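For readers who want to reproduce these numbers, the following minimal sketch (my own illustration, not from the text) applies the normal approximation on the log hazard ratio scale. The posterior mean is the event-weighted average used above; taking the posterior standard deviation as √(4/total events) is an assumption of the sketch, chosen because it reproduces the 0.228 used here. The pairing of 32 events with the prior and 45 events with the observed data is read off the calculation above.

```python
import math

def posterior_tail(prior_mean, prior_events, obs_loghr, obs_events, hr_cut):
    """Normal approximation on the log hazard ratio scale.  The prior is
    treated as worth `prior_events` events; the posterior mean is an
    event-weighted average and the posterior SD is sqrt(4 / total events)
    (an assumption consistent with the 0.228 above).
    Returns Pr(true hazard ratio > hr_cut)."""
    total = prior_events + obs_events
    post_mean = (prior_mean * prior_events + obs_loghr * obs_events) / total
    post_sd = math.sqrt(4.0 / total)
    z = (math.log(hr_cut) - post_mean) / post_sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))   # 1 - Phi(z)

# Enthusiastic prior centered at log(2.0), worth 32 events; observed log
# hazard ratio 0.811 based on 45 events (the figures used above).
print(posterior_tail(math.log(2.0), 32, 0.811, 45, hr_cut=2.0))   # about 0.62
print(posterior_tail(math.log(2.0), 32, 0.811, 45, hr_cut=1.75))  # about 0.81
```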
18.4.5 Decision-Theoretic Methods
One can approach the problem of stopping or continuing a clinical trial at the time of an interim analysis as a decision-theory question. Given a prior distribution for the treatment difference and a utility function, it is possible to construct optimal group sequential tests for terminating the trial. Similarly, any particular group sequential test is optimal under some assumed prior and utility, which can be determined using this approach. Like the likelihood and Bayesian methods, decision-theoretic approaches do not control the type I error properties of the stopping procedure. Fixing the probability of a type I error in advance does not lead to stopping guidelines with optimal decision-theoretic properties. Many of the designs that have been developed using this approach have been impractical and have not been applied to actual clinical trials. A principal difficulty is that the exact designs are sensitive to the “patient horizon,” which is the number of patients who stand to benefit from the selected therapy. It is usually difficult or impossible to specify this number. Despite this shortcoming, it is useful to examine the general properties of this approach. With it, subjective judgments are isolated and formalized as utility functions and prior distributions. The utility function quantifies the benefit from various outcomes and the prior distribution is a convenient way to summarize knowledge and uncertainty about the treatment difference before the experiment is conducted. Using these tools, the behavior of decision rules can be quantified and ranked according to their performance.
Standard group sequential stopping rules (discussed below) have been studied from a decision-theoretic view and found to be lacking [695]. When the utility functions and prior distributions are symmetric, the group sequential methods commonly employed are optimal. However, when a trial is judged by how well it improves the treatment of future patients, symmetric utilities may not be applicable, and standard group sequential stopping rules perform poorly. Lewis and Berry [933] discussed trial designs based on Bayesian decision theory but evaluated them as classical group sequential procedures. They show that the clinical trial designs based on decision theory have smaller average costs than the classical designs. Moreover, under reasonable conditions, the mean sample sizes of these designs are smaller than those expected from the classical designs.
18.4.6 Frequentist Methods
Sequential and group sequential clinical trial designs have been developed from the perspective of hypothesis testing, which permits trial termination when a test of the null hypothesis of no difference between the treatments rejects. As interim assessments of the evidence (e.g., tests of the null hypothesis) are carried out, the test statistic is compared with a pre-specified set of values called the “stopping boundary.” If the test statistic exceeds or crosses the boundary, the statistical evidence favors stopping the trial. By repeatedly testing accumulating data in this fashion, the type I error level can be increased. Constructing boundaries that preserve the type I error level of the trial, but still permit early termination after multiple tests, is part of the mechanics of frequentist sequential trial design. A simplified example will illustrate the effect.

Example 18.6. Suppose that we take the data in 𝑛 nonoverlapping batches (independent of one another) and test treatment differences in each one using 𝛼 = 0.05. If the null hypothesis is true, the chance of not rejecting in each batch is 1 − 0.05 = 0.95. Because of independence, the chance of not rejecting in all batches is $0.95^n$. Thus, the chance of rejecting in one or more batches (i.e., the overall type I error) is $1 - 0.95^n$. The overall type I error, 𝛼*, would be $\alpha^* = 1 - (1 - \alpha)^n$, where 𝛼 is the level of each test.

This example is only slightly deficient. In an actual trial the test is performed on overlapping groups as the data accumulate, so that the tests are not independent of one another. Therefore, the type I error may not increase in the same way as it would for independent tests, but it does increase nonetheless. One way to correct this problem is for each test to be performed using a smaller 𝛼 to keep 𝛼* at the desired level.

Triangular Designs Are “Closed”
A complete discussion of the theory of sequential trial designs is beyond the scope of this book. This theory is extensively discussed by Whitehead [1549] and encompasses fixed sample size designs, SPRT designs, and triangular designs. A more concise summary is also given by Whitehead [1548]. Computer methods for analysis are discussed by Duan-Zheng [1589]. Here, I sketch sequential designs for censored survival data while trying to avoid some of the complexities of the theory. For a clinical application of the triangular design, see Moss, Hall, Cannom, et al. [1065]. As in the case of the fixed sample size equations, assume that the measure of difference between treatment groups is Δ = log(𝜆1/𝜆2), the logarithm of the hazard ratio. At the time of the 𝑖th failure, we have observed 𝑑𝑖1 and 𝑑𝑖2 failures in the treatment groups from 𝑛𝑖1 and 𝑛𝑖2 subjects who remain at risk. Consider the logrank statistic, which measures the excess number of events on the control treatment over that expected:
$$Z = \sum_{i=1}^{N} \frac{n_{i1} d_{i2} - n_{i2} d_{i1}}{n_i},$$
where 𝑛𝑖 = 𝑛𝑖1 + 𝑛𝑖2. The variance of this quantity is
$$V = \sum_{i=1}^{N} \frac{d_i (n_i - d_i)\, n_{i1} n_{i2}}{(n_i - 1)\, n_i^2},$$
where 𝑑𝑖 = 𝑑𝑖1 + 𝑑𝑖2. 𝑉 is a measure of the amount of information in the data. For equal-sized treatment groups and a small proportion of events, 𝑉 ≈ 𝐷/4, where 𝐷 is the total number of events. To monitor the trial, 𝑍𝑗 and 𝑉𝑗 will be calculated at the 𝑗th monitoring point. It is necessary to assume that 𝑉𝑗 does not depend on 𝑍1, …, 𝑍𝑗−1 so that the stopping rules cannot be manipulated. This can be accomplished by using a fixed schedule of interim analysis times. At the 𝑗th interim analysis, if 𝑍𝑗 is more extreme than a specified boundary point, the trial will be stopped. The boundaries depend on 𝑉1, …, 𝑉𝑗. It is customary to draw them on a plot of 𝑍 versus 𝑉. Figure 18.6 shows an example for the triangular test. For discrete rather than continuous monitoring, the boundaries must be modified slightly, yielding a “Christmas tree” shape. The mathematical form of the boundaries can be derived from two considerations. The first is the usual power requirement, which states that the probability of rejecting the null hypothesis when the hazard ratio is Δ𝑎 should be 1 − 𝛽. The second consideration relates the upper boundary, 𝑍𝑢, and the lower boundary, 𝑍ℓ, to 𝑉 as
$$Z_u = a + cV, \qquad Z_\ell = -a + 3cV.$$
In the very special case when 𝛽 = 𝛼/2, we have
$$a = \frac{-2\log(\alpha)}{\Delta_a} \quad\text{and}\quad c = \frac{\Delta_a}{4}.$$
However, the restriction on 𝛼 and 𝛽 that produces this simplification is not generally applicable, and one must resort to numerical solutions for other cases.
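As a small numerical illustration (mine, not the author's), the special-case formulas above can be coded directly; the choices of 𝛼, the alternative hazard ratio, and the grid of 𝑉 values below are arbitrary.

```python
import math

def triangular_constants(alpha, hr_alt):
    """Constants a and c for the triangular-test boundaries Z_u = a + cV and
    Z_l = -a + 3cV in the special case beta = alpha/2 (formulas above)."""
    delta_a = math.log(hr_alt)              # log hazard ratio under the alternative
    a = -2.0 * math.log(alpha) / delta_a
    c = delta_a / 4.0
    return a, c

a, c = triangular_constants(alpha=0.05, hr_alt=1.75)
for V in (5, 10, 20, 40):                   # V is roughly D/4, i.e., events / 4
    print(V, round(a + c * V, 2), round(-a + 3 * c * V, 2))
```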
FIGURE 18.6 Triangular test stopping boundaries.
Group Sequential Designs May Be Easier to Construct and Apply
In many circumstances investigators do not need to assess treatment differences after every subject is accrued. When monitoring large (multicenter) trials, it is more common for data about efficacy to be available only at discrete times, perhaps once or twice each year. When assessments are made at discrete intervals but frequently enough, the trial can terminate nearly as early as if it were monitored continually. This is the idea of group sequential methods. Typically, only a handful of interim looks need to be performed. The statistical approach to group sequential boundaries is to define a critical value for significance at each interim analysis so that the overall type I error criterion will be satisfied. Suppose that there is a maximum of 𝑅 interim analyses planned. If the values of the test statistics from the interim analyses are denoted by 𝑍1, 𝑍2, …, 𝑍𝑅, the boundary values for early stopping can be denoted by the points 𝐵1, 𝐵2, …, 𝐵𝑅. At the 𝑗th analysis, the trial stops with rejection of the null hypothesis if
$$Z_j \ge B_j \quad\text{for } 1 \le j \le R.$$
Many commonly used test statistics have “independent increments,” so that the interim test statistic can be written
$$Z_j = \frac{\sum_{i=1}^{j} Z_i^*}{\sqrt{j}},$$
where $Z_i^*$ is the test statistic based on the 𝑖th group's data. This greatly simplifies the calculation of the necessary interim significance levels. Although the boundaries are constructed in a way that preserves the overall type I error of the trial, one may construct boundaries of many different shapes, which satisfy the overall error criterion.
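The independent-increments representation makes it easy to simulate the behavior of a candidate boundary under the null hypothesis. The sketch below is illustrative only (not from the text); it takes the constant two-sided critical value 2.289 for 𝑅 = 3 from Table 18.5 and checks by simulation that the overall type I error is near 0.05.

```python
import numpy as np

rng = np.random.default_rng(0)
R, n_trials = 3, 200_000
B = 2.289                        # Pocock critical value for R = 3 (Table 18.5)

# Independent standard normal group statistics Z_i* under the null hypothesis
z_star = rng.standard_normal((n_trials, R))
# Interim statistics Z_j = sum_{i<=j} Z_i* / sqrt(j)
z_interim = np.cumsum(z_star, axis=1) / np.sqrt(np.arange(1, R + 1))
# A trial "rejects" if any interim |Z_j| meets or exceeds the boundary (two-sided)
reject = (np.abs(z_interim) >= B).any(axis=1)
print(reject.mean())             # should be close to the overall alpha of 0.05
```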
However, there are a few boundary shapes that are commonly used. These are shown in Figure 18.7, and the numerical values are given in Table 18.5.

FIGURE 18.7 Group sequential stopping boundaries from Table 18.5.

Specific Group Sequential Boundaries
The Pocock boundary uses the same test criterion for all tests performed, 𝑍𝑖 = 𝑍𝑐 for 1 ≤ 𝑖 ≤ 𝑅, where 𝑍𝑐 is calculated to yield the desired overall type I error rate. With this boundary, it is relatively easy to terminate a trial early. However, the procedure suffers from the undesirable feature that the final test of significance is made with a larger critical value (smaller p-value) than that conventionally used for a fixed sample size trial. In theory, the final p-value after three analyses could lie between, say, 0.0221 and 0.05 (Table 18.5). If one had not used a group sequential procedure, the trial would show a significant difference. However, because the sequential procedure was used, the result is not “significant.” This is an uncomfortable position for investigators.

The Haybittle–Peto boundary corrects this problem because the final significance level is very close to that conventionally employed, that is, near 0.05. It is harder to terminate the trial early, but the final analysis resembles the hypothesis test that would have been used in a fixed sample size design. It is a very convenient and workable design. Similarly, the O'Brien–Fleming design uses boundaries that yield nearly conventional levels of significance for the final analysis but makes it hard to terminate the trial early. For these designs, $Z_j = Z_c\sqrt{R/j}$, where again, 𝑍𝑐 is calculated to control the overall type I error rate [1129]. One needs very strong evidence to stop early when using this boundary.

TABLE 18.5 Some Frequently Used Group Sequential Stopping Boundaries, with Z Scores and Significance Levels for Different Numbers of Interim Analyses

  Interim       O'Brien–Fleming       Haybittle–Peto        Pocock
  Analysis      Z        p            Z        p            Z        p
  R = 2
  1             2.782    0.0054       2.576    0.0100       2.178    0.0294
  2             1.967    0.0492       1.960    0.0500       2.178    0.0294
  R = 3
  1             3.438    0.0006       2.576    0.0100       2.289    0.0221
  2             2.431    0.0151       2.576    0.0100       2.289    0.0221
  3             1.985    0.0471       1.960    0.0500       2.289    0.0221
  R = 4
  1             4.084    5 × 10−5     3.291    0.0010       2.361    0.0158
  2             2.888    0.0039       3.291    0.0010       2.361    0.0158
  3             2.358    0.0184       3.291    0.0010       2.361    0.0158
  4             2.042    0.0412       1.960    0.0500       2.361    0.0158
  R = 5
  1             4.555    5 × 10−6     3.291    0.0010       2.413    0.0158
  2             3.221    0.0013       3.291    0.0010       2.413    0.0158
  3             2.630    0.0085       3.291    0.0010       2.413    0.0158
  4             2.277    0.0228       3.291    0.0010       2.413    0.0158
  5             2.037    0.0417       1.960    0.0500       2.413    0.0158

Interim analyses of accumulating data are usually spaced evenly in calendar time or information time. This means that some analyses are planned when the trial data are
relatively immature. It is quite difficult to meet stopping guidelines with less than about 50% of the information available. It may not be worthwhile to look prior to that unless we firmly believe that the truth lies outside the range of the null and alternative hypotheses. For example, suppose that survival is the primary outcome and the trial is powered for an alternative hazard ratio of 1.75. Very early analyses are not likely to be useful (result in stopping) unless the true hazard ratio is less than 1.0 or greater than 1.75.

Alpha Spending Designs Permit More Flexibility
One of the principal difficulties with the strict application of group sequential stopping boundaries is the need to specify the number and location of points for interim analysis in advance of the study. As the evidence accumulates during a trial, this lack of flexibility may become a problem. It may be more useful to be able to adjust the timing of interim analyses and, for the frequentist, to use varying fractions of the overall type I error for the trial. Such an approach was developed and called the “alpha spending function” [351, 888]. See also Hwang, Shih, and DeCani [757]. Suppose that there are 𝑅 interim analyses planned for a clinical trial. The amount of information available during the trial is called the information fraction and will be denoted by 𝜏. For grouped interim analyses, let 𝑖𝑗 be the information available at the 𝑗th analysis, 𝑗 = 1, 2, …, 𝑅, and let 𝜏𝑗 = 𝑖𝑗/𝐼 be the information fraction, where 𝐼 is the total information. For studies comparing mean values based on a final sample size of 𝑁, 𝜏 = 𝑛/𝑁, where 𝑛 is the interim accrual. For studies with time-to-event endpoints,
𝜏 ≈ 𝑑/𝐷, where 𝑑 is the interim number of events and 𝐷 is the planned final number. The alpha spending function, 𝛼(𝜏), is a smooth function of the information fraction such that 𝛼(0) = 0 and 𝛼(1) = 𝛼, the final type I error rate desired. The alpha spending function must be monotonically increasing. If the values of the test statistics from the interim analyses are denoted by 𝑍1, 𝑍2, …, 𝑍𝑗, the boundary values for early stopping are the points 𝐵1, 𝐵2, …, 𝐵𝑗, such that
$$\Pr\{|Z_1| \ge B_1 \text{ or } |Z_2| \ge B_2 \text{ or } \cdots \text{ or } |Z_j| \ge B_j\} = \alpha(\tau_j).$$
By this definition, the O'Brien–Fleming boundary discussed above is
$$\alpha_{OF}(\tau) = 2\left[1 - \Phi\!\left(\frac{Z_\alpha}{\sqrt{\tau}}\right)\right],$$
where Φ(𝑥) is the cumulative standard normal distribution function. The Pocock boundary discussed above is
$$\alpha_P(\tau) = \alpha \log(1 + \tau(e - 1)).$$
Many other spending functions have been constructed [757].

Early Stopping May Yield Biased Estimates of Treatment Effect
One drawback of sequential and group sequential methods is that, when a trial terminates early, the estimates of treatment effect will be biased. This was illustrated semi-quantitatively in Section 6.3.1, but can also be understood intuitively. If the same study could be repeated a large number of times, chance fluctuations in the estimated treatment effect in the direction of the boundary would be more likely to result in early termination than fluctuations away from the boundary. Therefore, when the boundary is touched, these variations would not average out equally, biasing the estimated treatment effect. The sooner a boundary is hit, the larger is the bias in the estimated treatment effect. This effect can create problems with the analysis and interpretation of a trial that is terminated early [423]. For example, consider a randomized comparative group sequential trial with four analyses using an O'Brien–Fleming stopping boundary. If the true treatment effect is 0.0, there is a small chance that the stopping boundary will be crossed at the third analysis. This would be a type I error. This probability is 0.008 for the third analysis, upper boundary. Among trials crossing the upper boundary there, the average 𝑍 score for the treatment effect is approximately 2.68. When the true treatment effect is 1.0, 50% of the trials will cross the upper boundary. Among those crossing at the third interim analysis, the average 𝑍 score is 2.82. When the true treatment effect is 2.0, 98% of trials will terminate on the upper boundary. Among those crossing at the third interim analysis, the average 𝑍 score is 3.19. These results, each determined by statistical simulation of 10,000 trials, illustrate the bias in estimated treatment effects when a boundary is crossed. One could consider correcting for this bias using a Bayesian procedure that modifies the observed treatment effect by averaging over all possible “prior” values of the true treatment effect. However, such a procedure will be strongly affected by the properties of the assumed prior distribution and therefore may not be a good solution to the problem of bias. In other words, the true prior distribution of treatment effects is unknown, making it impossible to reliably correct for the bias.
Because of this bias the trialist must view these sequential designs as “selection designs” that terminate when a treatment satisfies the stopping criteria. In contrast, fixed sample size trials might be termed “estimation designs” because they will generally provide unbiased estimates of the treatment effect or difference. One should probably not employ a selection design when it is critical to obtain unbiased estimates of the treatment effect. A fixed sample size design would be more appropriate in such a situation. Conversely, when early stopping is required for efficient use of study resources or ethical concerns, a selection design should be used in place of the fixed sample size alternative.
18.4.7 Other Monitoring Tools
Futility Assessments and Conditional Power
The monitoring methods discussed above assume that the trial should be terminated only when the treatments are significantly different. However, we should also consider stopping a clinical trial when the interim result is unlikely to change after accruing more subjects (futility). There really are two circumstances related to futility, each of which needs a slightly different perspective. The first circumstance is when the interim evidence suggests that the final evidence will not support rejection of the null hypothesis. In particular, we may eventually fail to reject the null hypothesis when the observed interim treatment effect is substantially smaller than that planned in the original design. Assuming we designed wisely for a minimal clinically important treatment effect, we should be comfortable with smaller effects yielding statistically nonsignificant results. A second, slightly different circumstance (discussed below) arises when the interim result is essentially null. We can calculate the power of the study to reject the null hypothesis at its end, given the interim results. This is called conditional power [885, 886] because the power calculation will depend on how much information is left to collect and a projected trend to the end of the trial. Some practical aspects of this method are discussed by Andersen [38].

Example 18.7. Suppose that we are testing a questionable coin to see if it yields heads 50% of the time when flipped. If 𝜋 is the true probability of heads, we have 𝐻0: 𝜋 = Pr(heads) = 0.5 or 𝐻𝑎: 𝜋 = Pr(heads) > 0.5. A fixed sample size plan might be to flip the coin 500 times and reject 𝐻0 if
$$Z = \frac{n_h - 250}{\sqrt{500 \times 0.5 \times 0.5}} \ge 1.96,$$
where 𝑛ℎ is the number of heads obtained and 𝛼 = 0.025. Stated another way, we reject 𝐻0 if 𝑛ℎ ≥ 272. Suppose that we flip the coin 400 times and observe 272 heads. It would be futile to conduct the remaining 100 trials because we are already certain to reject the null hypothesis. Alternatively, suppose that we observe 200 heads after 400 trials. We can reject the null only if 72 or more of the remaining 100 trials yield heads. From a fair coin, which this appears to be, this event is unlikely, being more than two standard deviations away from the null. Thus, with a fairly high certainty, the overall null will not be rejected and continuing the experiment is futile.
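The arithmetic behind this futility judgment is easy to check. The sketch below (mine, not part of the text) computes the exact binomial probability of observing at least 72 heads in the remaining 100 tosses of a fair coin, which is the conditional chance of still rejecting under the null trend.

```python
from scipy.stats import binom

# At least 72 heads are needed among the remaining 100 tosses of a fair coin
p_reject = binom.sf(71, 100, 0.5)      # Pr(heads >= 72)
print(p_reject)                        # conditional "power" is essentially zero
```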
Partial Sums
To illustrate how conditional power works in a one-sample case, suppose that we have a sample of size 𝑁 from a 𝑁(𝜇, 1) distribution and we wish to test 𝐻0: 𝜇 = 0 versus 𝐻𝑎: 𝜇 > 0. Furthermore, assume that we have a test statistic, 𝑍(𝑁), which is based on the mean of a sample from a Gaussian distribution,
$$Z_{(N)} = \frac{\sum_{i=1}^{N} x_i}{\sqrt{N}},$$
where we reject 𝐻0 if 𝑍(𝑁) > 1.96. Using a standard fixed sample size approach (equation 16.22), we solve $\theta = \sqrt{N}\mu = Z_\alpha + Z_\beta$ for 𝑁 to obtain the sample size. Part way through the trial, when 𝑛 subjects have been accrued, the test statistic is
$$Z_{(n)} = \frac{\sum_{i=1}^{n} x_i}{\sqrt{n}} = \frac{S_n}{\sqrt{n}},$$
where 1 ≤ 𝑛 ≤ 𝑁 and 𝑆𝑛 is the partial sum. It is easy to show that 𝐸{𝑆𝑛} = 𝑛𝜇 and var{𝑆𝑛} = 𝑛. More important, using 𝑆𝑛, the increments are independent, meaning 𝐸{𝑆𝑁 ∣ 𝑆𝑛} = 𝑆𝑛 + 𝐸{𝑆𝑁−𝑛} and var{𝑆𝑁 ∣ 𝑆𝑛} = 𝑁 − 𝑛. The expected value of 𝑆𝑛 increases linearly throughout the trial. Therefore, we can easily calculate the probability of various outcomes conditional on observing certain interim results. This is represented in Figure 18.8 by the plot of 𝑆𝑛 versus 𝑛 (and the information fraction, 𝑓) and the following example:
FIGURE 18.8 Projected partial sums during a trial.
TABLE 18.6 Projected Partial Sums During a Trial under Three Assumptions about the Trend from Example 18.8

  Case   μ      E{S_144 ∣ S_100}                   var{S_144 ∣ S_100}
  C1     0.25   35 + (144 − 100) × 0.25 = 46.0     144 − 100 = 44
  C2     0.35   35 + (144 − 100) × 0.35 = 50.4     44
  C3     0.00   35 + (144 − 100) × 0.00 = 35.0     44

Example 18.8. Suppose 𝜇 = 0.25 and $\sqrt{N}\mu = (1.96 + 1.04) = 3$, so that 𝑁 = 144. After 𝑛 = 100 subjects, suppose 𝑆100 = 35 (𝑍100 = 35/√100 = 3.5). There are at least three scenarios under which we would like to project the end of the trial given the interim results: (C1) the original alternative hypothesis continues to the end of the trial, (C2) the current trend continues to the end, and (C3) the original null hypothesis continues to the end of the trial. These are shown in Figure 18.8. Other scenarios could also be important to consider. The method of calculation is shown in Table 18.6. For C1,
$$\Pr\{Z_{(144)} \ge 1.96 \mid Z_{(100)} = 3.5, \mu = 0.25\} = \Pr\{S_{144} \ge 23.52 \mid S_{100} = 35, \mu = 0.25\}$$
$$= \Pr\left\{\frac{S_{144} - 46.0}{\sqrt{44}} \ge \frac{23.52 - 46.0}{\sqrt{44}} \;\Big|\; S_{100} = 35, \mu = 0.25\right\} = 1 - \Phi(-3.39) = 0.99965.$$
Similarly, for C2,
$$\Pr\{Z_{(144)} \ge 1.96 \mid Z_{(100)} = 3.5, \mu = 0.35\} = \Pr\left\{\frac{S_{144} - 50.4}{\sqrt{44}} \ge \frac{23.52 - 50.4}{\sqrt{44}}\right\} = 1 - \Phi(-4.05) = 0.99997,$$
and for C3,
$$\Pr\{Z_{(144)} \ge 1.96 \mid Z_{(100)} = 3.5, \mu = 0.0\} = \Pr\left\{\frac{S_{144} - 35.0}{\sqrt{44}} \ge \frac{23.52 - 35.0}{\sqrt{44}}\right\} = 1 - \Phi(-1.73) = 0.95818.$$
Because of the high chance of rejecting the null hypothesis regardless of assumptions about the trend, investigators would probably terminate this study early.
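A small sketch of these conditional probability calculations (my own illustration, not from the text), using the partial-sum quantities just defined:

```python
import math

def conditional_power(S_n, n, N, z_alpha, mu):
    """Pr{Z_(N) >= z_alpha | S_n}, assuming the remaining N - n observations
    have mean mu and unit variance (the partial-sum setup above)."""
    s_crit = z_alpha * math.sqrt(N)            # S_N needed to reject at the end
    mean = S_n + (N - n) * mu                  # E{S_N | S_n}
    sd = math.sqrt(N - n)                      # sqrt of var{S_N | S_n}
    z = (s_crit - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0)) # 1 - Phi(z)

for mu in (0.25, 0.35, 0.0):                   # scenarios C1, C2, C3 of Example 18.8
    print(mu, round(conditional_power(35, 100, 144, 1.96, mu), 5))
# expect roughly 0.99965, 0.99997, and 0.95818
```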
B-values
Conditional power can be made independent of the sample size [887]. To do so, we define the information fraction of the trial as 𝑓 = 𝑛/𝑁. Then for a comparative study we define
$$Z_n = \frac{\bar{X}_1 - \bar{X}_2}{s_n\sqrt{1/n_1 + 1/n_2}},$$
where 𝑛1 ≈ 𝑛2, and
$$B(f) = Z_n\sqrt{f}.$$
The expected value of 𝐵(𝑓) is 𝐸[𝐵(𝑓)] = 𝐸(𝑍𝑁)𝑓 = 𝜃𝑓, a linear function of 𝑓 that is useful for extrapolations. Also 𝐵(1) = 𝐵(𝑓) + [𝐵(1) − 𝐵(𝑓)], from which we can see: 𝐵(𝑓) and (𝐵(1) − 𝐵(𝑓)) are independent and normal; 𝐸[𝐵(1) − 𝐵(𝑓)] = 𝜃(1 − 𝑓); var[𝐵(𝑓)] = 𝑓; and var[𝐵(1) − 𝐵(𝑓)] = 1 − 𝑓. From these facts the conditional power given 𝜃 can be expressed as
$$Z_{CP} = \frac{Z_\alpha - B(f) - \theta(1 - f)}{\sqrt{1 - f}} \qquad (18.13)$$
and 𝑃(𝜃) = 1 − Φ(𝑍CP). If the null trend pertains to the remainder of the trial,
$$P(0) = 1 - \Phi\!\left(\frac{Z_\alpha - B(f)}{\sqrt{1 - f}}\right),$$
whereas if the observed trend continues,
$$P(\theta) = 1 - \Phi\!\left(\frac{Z_\alpha - B(f)/f}{\sqrt{1 - f}}\right).$$
Example 18.8 above can be reworked in terms of B-values as follows.

Example 18.9. After 𝑛 = 100 subjects, 𝑓 = 100/144 = 0.694, 𝑍100 = 3.5, and 𝐵(0.694) = 3.5√0.694 = 2.92. The three scenarios under which we would like to project the end of the trial are (C1) 𝜃 = 3, (C2) 𝜃 = 2.92/0.694 = 4.20, and (C3) 𝜃 = 0. These can be calculated from equation (18.13). For C1,
$$Z_{CP} = \frac{Z_\alpha - B(f) - \theta(1 - f)}{\sqrt{1 - f}} = \frac{1.96 - B(0.694) - 3 \times 0.306}{\sqrt{0.306}} = \frac{1.96 - 2.92 - 0.918}{0.553} = -3.396,$$
and 𝑃(3) = 1 − Φ(−3.396) = 0.99965.
Similarly, for C2,
$$Z_{CP} = \frac{1.96 - 2.92 - 4.2 \times 0.306}{0.553} = -4.05, \qquad P(4.2) = 1 - \Phi(-4.05) = 0.99997,$$
and for C3,
$$Z_{CP} = \frac{1.96 - 2.92 - 0 \times 0.306}{0.553} = -1.73, \qquad P(0) = 1 - \Phi(-1.73) = 0.95818.$$
The conclusions are as above. When the interim trial results are literally null, the conditional power approach discussed above is still relevant, but some additional thought is helpful. For an interim null result and a projected null trend, the “power” will be the type I error rate, for example, 5%. Seeing this, we might be inclined to stop the trial, but there can be compelling reasons to continue. If the risk–benefit balance in both groups is the same, as it may well be under the null, the trial should probably continue to its planned end to yield as strong evidence as possible that the treatments do not differ. Strong evidence or narrow confidence intervals around a risk ratio of 1 will be useful for scientific, policy, and economic reasons. This emphasizes that we should rely neither on single tools or perspectives nor strictly on a hypothesis testing framework when making important decisions about trial termination.
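Equation (18.13) translates directly into code. The sketch below (illustrative only, not from the text) reproduces the three scenarios of Example 18.9.

```python
import math

def conditional_power_b(z_n, f, theta, z_alpha=1.96):
    """Conditional power from equation (18.13), with B(f) = Z_n * sqrt(f)."""
    b_f = z_n * math.sqrt(f)
    z_cp = (z_alpha - b_f - theta * (1.0 - f)) / math.sqrt(1.0 - f)
    return 0.5 * math.erfc(z_cp / math.sqrt(2.0))   # 1 - Phi(z_cp)

f = 100 / 144
for theta in (3.0, 3.5 / math.sqrt(f), 0.0):        # C1, C2 (observed trend), C3
    print(round(conditional_power_b(3.5, f, theta), 5))
# expect approximately 0.99965, 0.99997, and 0.95818
```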
18.4.8 Some Software
A recurring problem in implementing new or difficult methods of design and analysis is the availability of reliable, flexible, and reasonably priced computer software to facilitate the computations. For the Bayesian and likelihood-based methods discussed above, the calculations are simple enough to be performed by hand. To my knowledge, there is no commercially available software to accomplish those calculations. For calculating group sequential boundaries, the program East [328] is available. Although the user interface is somewhat awkward, the program is accurate and reasonably priced. It also permits including stopping boundaries for both the null hypothesis and the alternative hypothesis.
18.5 SUMMARY
It is appropriate, if not ethically imperative, for investigators to examine the accumulating results from a clinical trial in progress. Information from the trial may convince investigators to close the study early for reasons of subject safety. Reasons why a trial might be stopped earlier than initially planned include: the treatments are convincingly
different (or equivalent); side effects are too severe or adherence is too low; the data are of poor quality; or needed resources are lost or are insufficient. The tendency to stop a study for these or other reasons is balanced by an imperative to gather complete and convincing evidence about all the objectives. The process used to make these decisions must balance expertise and objectivity. It is relatively easy to compose an advisory committee with the required expertise, and somewhat harder to assure objectivity of the monitoring process. By their nature, experts tend to be less than completely objective, but monitoring policies and procedures can compensate for it. Strong separation of the monitoring process from the expertise of the investigators (e.g., firewalling) is seldom truly necessary or wise. Making inferences about treatments based on incomplete data can be error prone. Investigators must place additional structure on the interim analyses beyond the usual end-of-study plans to minimize errors. There are several approaches to accomplishing this, including frequentist methods that control the type I error, likelihood-based methods, Bayesian approaches, and decision theory. The product of each of these approaches to early stopping is a set of quantitative guidelines that help investigators evaluate the strength of the available evidence and decide if the trial should be terminated. All methods require timely and accurate interim reporting of data. Frequentist methods for constructing early stopping guidelines (boundaries) have gained widespread acceptance. They control the overall type I error of the trial and are relatively simple and flexible to implement and interpret. However, they have the drawback that the same data can yield different inferences depending on the monitoring plans of the investigators. Likelihood methods base the decision to stop a trial early on achieving the same strength of evidence (measured by likelihood ratios) as one would obtain for the final analysis. Bayesian methods have similar characteristics but allow subjective notions of strength of evidence to play a part in the decision to stop. In the commonly used group sequential boundaries, the decision to stop or not is taken at a small number of discrete points during the trial, typically once or twice each year. Additional flexibility in the timing and type I error control can be gained by using an alpha spending function approach. Fully sequential frequentist methods, likelihood methods such as the SPRT, and most Bayesian monitoring methods permit assessment of interim data continuously. Besides stopping when the “null hypothesis” appears to be false, trials may be terminated early if, after some point, there is little chance of finding a difference. Conditional power calculations are one means to facilitate this. Statistical guidelines for the primary endpoint are not the only consideration when assessing the interim evidence from a clinical trial. Side effects, unanticipated events, data quality, secondary objectives, and evidence from outside the trial may all reflect on the decision to continue. Also, the sponsor or investigators may have other obligations or interests that can conflict with objective evaluation of trial data. Because monitoring can be complicated, many clinical trials use a formal Treatment Effects Monitoring Committee (TEMC) to assist with the task and assure subject safety. 
After reviewing interim data, the TEMC could recommend a variety of actions including stopping the trial, modifying the study protocol, examining additional data, or making a new assessment of the evidence sooner than originally planned.
18.6 QUESTIONS FOR DISCUSSION
1. Investigators plan a safety and activity clinical trial to study the efficacy and side effects of a new genetically engineered treatment vaccine against prostate cancer. The toxicity is likely to be low and the potential treatment effect, based on animal studies, could be high. The clinicians are anxious to finish the trial as quickly as possible because of cost and scientific priorities. Sketch and defend a trial design that you believe would be appropriate under these circumstances.
2. The probability of success on standard therapy for a disease is 50%. A new treatment is being tested and investigators would like to end the trial if evidence supports a success rate of 75%. Construct, graph, and explain what you believe to be appropriate statistical monitoring guidelines for this trial.
3. In a clinical trial comparing treatments to improve survival following AIDS, investigators plan to adopt a new treatment if the evidence suggests a hazard ratio of 1.75 (new treatment is superior). Investigators prefer a Bayesian approach to monitoring and analysis. Prior to the trial, some data are available indicating a hazard ratio of 1.6 based on 50 events. During the trial, a hazard ratio of 2.0 is observed with 60 and 30 events in the treatment groups. What is your quantitative assessment of the evidence, and would you favor continuing the trial?
4. Two statisticians are helping to conduct, monitor, and analyze a trial with a single planned interim analysis and a final analysis. At the interim, the statisticians perform some calculations independently and agree that the study should be continued. (Because they agree to recommend continuing the trial, the statisticians do not discuss details of their interim analyses.) At the final analysis, the p-value turns out to be 0.035. To their consternation, the statisticians discover that they have used different boundaries for the analyses and one can declare the final result “significant” while the other cannot. Discuss how you can help them out of this problem.
5. A trial is planned to accrue a maximum sample size of 80 subjects. After 60 have been accrued, 𝑍 = 25/√60 = 3.23. Should the trial be continued? Justify your recommendation.
6. For the previous problem, calculate the information fraction, 𝑓, and the B-value, 𝐵(𝑓).
19 COUNTING SUBJECTS AND EVENTS
19.1 INTRODUCTION

Counting events properly in a clinical trial can be more difficult than it sounds. Common problems contributing to ambiguity include missing and incomplete observations, treatment dropouts, treatment crossovers, eligibility errors, uncounted or misclassified events, and lack of adherence to planned schedules [556]. These and other data imperfections occur universally in clinical trials for several reasons. First, human participants do not precisely follow experimental protocols because they are independent and autonomous, and because of evolving clinical circumstances. Every participant is entitled to an advantageous position, the nature of which can change with their medical condition. Second, imperfections arise when investigators themselves do not adhere to the research plan. Investigators may deviate to follow their conscience or the spirit of the study, or make other well-intentioned modifications after the fact. Trouble counting subjects or events often results. Investigators may also fail to follow experimental protocols because of their independence and autonomy, trial complexity or ambiguity, and unforeseen clinical circumstances. The commonest flaw attributable to investigators may be the entry of technically ineligible participants into a trial. Third, many clinical trials are extremely large and complex undertakings. It is not uncommon for comparative trials for regulatory objectives to involve thousands of participants, dozens or hundreds of clinical centers, and many countries. Such endeavors can take years to complete. Human error and legitimate differences of interpretation in such circumstances are unavoidable. Accounting for those differences and errors can yield imperfect data. While we always strive for perfect data, there is no general rule as to the consequences of various defects. Depending on design, the validity of a clinical trial can be sensitive
to certain imperfections, how frequently they occur, or how they are resolved, while remaining unaffected by others. Trial designs and outcomes are chosen to minimize the impact of expected flaws. Fortunately, good design can make the inevitable human error in large complex trials essentially inconsequential. Censoring of longitudinal observations is probably the most common type of incomplete data. But censoring is usually anticipated and compensated for in the design and analysis. The result is some loss of efficiency, but preserved accuracy and validity of treatment effect estimates. Not everyone takes their oral medication, giving rise to the classic adherence problem. Although measuring medication adherence is itself an echinate problem, let's assume we can do it and find it to be well below 100%. Should this imperfection be ignored? Should only the adherent subjects be analyzed? Should the treatment effect be adjusted for adherence, and how? Dealing with these two ubiquitous data imperfections takes the investigator down very different paths. One of the scariest imperfections arises when participants use a therapy other than the one prescribed for them by the study protocol. The best approach to this circumstance has been hotly debated, especially for comparative trials. Properly counting the denominator in a treatment group is nearly as basic as counting the number of participants on the trial. We can obtain satisfactory answers only by understanding the true and achievable purpose of a trial, and how we can preserve its validity. In developmental trials there is great temptation to purify the cohort, even after the fact and contrary to technical eligibility. The rationale for excluding eligible accrued participants is usually biologically based and intuitively justifiable. But counting is fairly empirical, and too much biology can interfere with it.
19.2 IMPERFECTION AND VALIDITY
The challenge is how best to reconcile imperfections with the experimental structure upon which valid inference depends. This problem has been approached from two perspectives. One view has been called “explanatory” and the other “pragmatic,” terms coined 45 years ago [1348–1350]. Other terms used to describe the same distinction are “efficacy” versus “effectiveness” (Chapter 9) and “explanatory” versus “management” [1322]. The explanatory perspective emphasizes acquiring information, while the pragmatic perspective focuses on making decisions. These viewpoints have practical implications for how the investigator deals with both trial design issues as well as data imperfections. This chapter emphasizes the pragmatic perspective. Suppose that investigators are studying the effect of pre-operative chemotherapy in subjects with early lung cancer. They plan a randomized trial with the treatment groups consisting of surgery (𝑆) compared to chemotherapy (𝐶) plus surgery. A biologist might view the question like a laboratory experiment, and could isolate the effect of chemotherapy by scheduling surgery at the same time following randomization in both treatment groups. Thus, subjects in the 𝑆 group would wait a period of time before having surgery. The waiting time in the 𝑆 group corresponds to the chemotherapy period in the 𝐶 + 𝑆 group. This design attempts to estimate the effect of chemotherapy. In contrast, a clinician might view that design as unrealistic, noting that physicians and patients will be unwilling to wait for surgery. In practice, patients receiving surgery alone would have their operations immediately after diagnosis. This leads to a different,
practical design, where surgery is scheduled at once in the 𝑆 group but after chemotherapy in the 𝐶 + 𝑆 group. Although this design does not, strictly speaking, isolate the effect of chemotherapy, it is pragmatic and attempts to select the superior treatment as used in actual practice. The two trial designs are superficially the same, but demonstrate an important conceptual difference. The explanatory trial attempts to estimate what has been called “method-effectiveness” [1025, 1373], while the pragmatic one addresses “use-effectiveness.” One cannot say which is the superior or correct approach, only that both questions are relevant and important and that both types of queries cannot always be answered by the same trial. Similarly, when resolving data imperfections arising from nonadherence with the study protocol, explanatory and pragmatic views may each suggest methods based on relevant and important questions. However, as the design problem just outlined suggests, there is no guarantee that the questions can be answered from the existing data. The final resolution depends on the specific circumstances. Distinguishing between explanatory and pragmatic approaches is a useful device to explain different philosophies for coping with protocol nonadherence and other data imperfections, as well as some design issues. However, it is not possible to label all approaches to such problems in this way. Moreover, it is not possible to say exactly what methods the explanatory or pragmatic views will emphasize in every circumstance. In any case, I will continue to use the labels in this chapter as a descriptive device.
19.3 TREATMENT NONADHERENCE
There has been considerable investigation into the problems surrounding treatment nonadherence. Most work has focused on ways of measuring or preventing nonadherence and proposed ways of improving statistical estimates of treatment effects when adherence is a problem. There is a chronic debate about the advantages and problems of analyses based on treatment assigned compared with those based on treatment received in comparative trials. Several reviews are available [595, 932, 1116]. This section will highlight some of the issues surrounding this “intention to treat” debate. 19.3.1
Intention to Treat Is a Policy of Inclusion
Intention to treat (ITT) is the idea, often stated as a principle, that subjects on a randomized clinical trial should be analyzed as part of the treatment group to which they were assigned, even if they did not actually receive the intended treatment. The term “intention to treat” appears to have been originated by Hill [716]. It can be defined generally as the analysis that includes all randomised subjects in the groups to which they were randomly assigned, regardless of their adherence with the entry criteria, regardless of the treatment they actually received, and regardless of subsequent withdrawal from treatment or deviation from the protocol [468].
Thus, ITT is an approach to several types of protocol nonadherence. “Treatment received” (TR) is the idea that subjects should be analyzed according to the treatment actually given, even if the randomization called for something else. I can think of no other issue
which perpetually generates so much disagreement between clinicians, some of whom prefer treatment received analyses, and statisticians, who usually prefer intention to treat approaches. Actually, there is a third type of analysis that often competes in this situation, termed “adherers only.” This approach discards all those subjects who did not comply with their treatment assignment. Most investigators would instinctively avoid analyses that discard information. I will not discuss “adherers only” analyses for this reason and also because doing so will not help illuminate the basic issues. Suppose that a trial calls for randomization between two treatments, 𝐴 and 𝐵. During the study, some subjects randomized to 𝐴 actually receive 𝐵. I will denote this group of subjects by 𝐵𝐴 . Similarly, subjects randomized to 𝐵 who actually receive 𝐴 will be designated 𝐴𝐵 . ITT calls for the analysis groups to be 𝐴 + 𝐵𝐴 compared with 𝐵 + 𝐴𝐵 , consistent with the initial randomization. TR calls for the analysis groups to be 𝐴 + 𝐴𝐵 compared with 𝐵 + 𝐵𝐴 , consistent with the treatment actually received. Differences in philosophy underlying the approaches and the results of actually performing both analyses in real clinical trials fuel debate as to the correct approach in general. With some careful thought and review of empirical evidence, this debate can be decided in favor of the ITT principle, at least as a foundational approach to the analysis. One of the most useful perspectives of an RCT is as a test of the null hypothesis of no treatment difference, a circumstance in which the ITT analysis yields the best properties. However, it is a mistake to adhere rigidly to either view. It is easy to imagine circumstances where following one principle or the other too strictly is unhelpful or inappropriate. Any utility of TR analyses would best be established on a study-by-study basis as part of the design and analysis plan. Then investigators will not claim that either one of the TR or ITT results is correct solely because it yields the desired outcome.
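The distinction between the two groupings is easy to see in code. The toy sketch below is entirely hypothetical (made-up data and column names) and simply shows that the same records are summarized under two different grouping variables.

```python
import pandas as pd

# Hypothetical records: treatment assigned by randomization vs. actually received
subjects = pd.DataFrame({
    "assigned": ["A", "A", "A", "B", "B", "B"],
    "received": ["A", "A", "B", "B", "A", "B"],   # one crossover in each direction
    "response": [1, 0, 1, 0, 1, 0],
})

# Intention to treat: analyze by randomized assignment (A + B_A vs. B + A_B)
print(subjects.groupby("assigned")["response"].mean())

# Treatment received: analyze by treatment actually given (A + A_B vs. B + B_A)
print(subjects.groupby("received")["response"].mean())
```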
19.3.2 Coronary Drug Project Results Illustrate the Pitfalls of Exclusions Based on Nonadherence

The potential differences between ITT and TR analyses have received attention in the clinical trials literature since the early work by the Coronary Drug Project (CDP) Research Group [304]. The CDP trial was a large randomized, double-blind, placebo-controlled, multicenter clinical trial, testing the efficacy of the cholesterol-lowering drug clofibrate on mortality. The overall results of the CDP showed no convincing benefit of clofibrate (effect size = 0.60, 𝑝 = 0.55) (Table 19.1, bottom row). There was speculation that subjects who adhered to the clofibrate regimen received a demonstrable benefit, while those who did not comply would have a death rate similar to the placebo group. The relevant analysis is also shown in Table 19.1, where adherence is defined as those subjects taking 80% or more of their medication during the first 5 years or until death. Additional mortality rates in Table 19.1 have been adjusted for 40 baseline prognostic factors. Within the clofibrate group, the results appear to support the biological hypothesis that treatment adherence yields a benefit from the drug (effect size = 3.86, 𝑝 = 0.0001). However, an identical analysis of subjects receiving placebo demonstrates why results based on adherence can be misleading. Subjects who adhered to placebo show an even larger benefit than those adhering to clofibrate (effect size = 8.12, 𝑝 < 0.0001).

In one study comparing the two approaches, the ITT analysis showed no significant survival difference (𝑝 > 0.99), whereas the TR approach showed a survival advantage for the surgical therapy group (𝑝 < 0.001). Based on these findings and simulations, the authors conclude that ITT analyses should be the standard for randomized clinical trials. Similar conclusions were obtained in a second study [1192]. Lagakos, Lim, and Robins [881] discuss the similar problem of early treatment termination in clinical trials. They conclude that ITT analyses are best for making inferences about the unconditional distribution of time to failure. The size of ITT analysis tests is not distorted by early treatment termination. However, a loss of power can occur. They propose modifications to ordinary logrank tests that would restore some of the lost power without affecting the size of the test.
19.3.4 Trials Are Tests of Treatment Policy
It is unfortunate that investigators conducting clinical trials cannot guarantee that participants will definitely complete (or even receive) the treatment assigned. This is, in part, a consequence of the ethical principle of respect for individual autonomy. Many factors contribute to participants’ failure to complete the intended therapy, including severe side effects, disease progression, subject or physician strong preference for a different treatment, and a change of mind. In nearly all circumstances failure to complete the assigned therapy is partially an outcome of the study and therefore can produce a bias if used to define subsets. From this perspective, a clinical trial is a test of treatment policy, not a test of treatment received. ITT analyses avoid bias by testing policy or programmatic effectiveness.
From a clinical perspective, post-entry exclusion of eligible participants is analogous to using information from the future. For example, when selecting a therapy for a new patient, the clinician is primarily interested in the probability that the treatment will benefit the new patient. Because the clinician has no knowledge of whether or not the patient will complete the treatment intended, he or she has little use for inferences that condition on events in the patient’s future. By the time treatment adherence is known, the clinical outcomes may also be known. Consequently, at the outset the investigator will be most interested in clinical trial results that do not depend on adherence or other events in the future. If the physician wishes to revise the prognosis when new information becomes available, then an analysis that depends on some intermediate success might be relevant.

19.3.5 ITT Analyses Cannot Always Be Applied
The limitations of the ITT approach have been discussed by Feinstein [452] and Sheiner and Rubin [1373]. A breakdown in the experimental paradigm can render an analysis plan based on ITT irrelevant for answering the biological question. This does not mean that TR analyses are the best solution because they are subject to errors of their own. There may be no entirely satisfactory analysis when the usual or ideal procedures are inapplicable because of unanticipated complications in study design or conduct. On the other hand, when the conduct of the trial follows the experimental paradigm fairly closely, as is often the case, analyses such as those based on ITT are frequently the most appropriate. Example 19.1. Busulfan is a preparative regimen for the treatment of hematological malignancies with bone marrow transplantation. In a small proportion of subjects, the drug is associated with veno-occlusive disease (VOD) of the liver, a fatal complication. Clinical observations suggested that the incidence of VOD might be eliminated with appropriate dose reduction to decrease the area under the time–concentration curve (AUC) of the drug, observable after subjects are given a test dose. Furthermore, the efficacy of the drug might be improved by increasing the dose in subjects whose AUC is too low. These considerations led to a randomized trial design in which subjects were assigned to two treatments consisting of the same drug but used in different ways. Group A received a standard fixed dose, while group B received a dose adjustment, up or down, to achieve a target AUC, based on the findings of a test dose. Partway through the trial the data were examined to see if dose adjustment was reducing the incidence of VOD. Interestingly, none of the subjects assigned to B actually required a dose adjustment because, by chance, their AUCs were all within the targeted range. On treatment A, some subjects had high AUCs and a few experienced VOD. The intention to treat analysis would compare all those randomized to B, none of whom were dose adjusted, with those on A. Thus, the ITT analysis could not carry much information about the efficacy of dose adjustment. On the other hand, when the trial data were examined in conjunction with preexisting data, the clinical investigators felt ethically compelled to use dose adjustment, and the trial was stopped. In special circumstances TR analyses can yield estimated treatment effects that are closer to the true value than those obtained from the ITT analyses. However, this improved performance of the TR approach depends upon also adjusting for the covariates
responsible for crossover. Investigators would have to know factors responsible for subjects failing to get their assigned treatment, and incorporate those factors in correct statistical models describing the treatment effect. This is usually not feasible because investigators do not know the reasons why, or the covariates associated with, subjects failing to complete their assigned treatment. Even if the factors that influence nonadherence are known, their effect is likely to be more complex than simple statistical models can capture. Thus, the improved performance of TR methods in this circumstance is largely illusory. Sommer and Zeger [1415] present an alternative to TR analyses that permits estimating efficacy in the presence of nonadherence. This method employs an estimator of biological efficacy that avoids the selection bias that confounds the comparison of compliant subgroups. This method can be applied to randomized trials with a dichotomous outcome measure, regardless of whether a placebo is given to the control group. The method compares the compliers in the treatment group to an inferred subgroup of controls, chosen to avoid selection bias. Efron and Feldman [407, 408] discuss a statistical model that uses adherence as an explanatory factor and apply their method to data from a randomized placebo-controlled trial of cholestyramine for cholesterol reduction. Their method provides a way to reconstruct dose–response curves from adherence data in the trial. This and similar approaches based on models are likely to be useful supplements to the usual conservative ITT analyses and can recover valid estimates of method effectiveness when nonadherence is present [1373]. Treatment effects in the presence of imperfect adherence can be bounded using models of causal inference that do not rely on parametric assumptions. Examples of this are the work by Robins [1989], Manski [983], and Balke and Pearl [105]. These methods permit estimating the extent to which treatment effect estimates based on ITT can differ from the true treatment effect. These methods show promise but have not been widely applied in clinical trials. Another promising method for coping with treatment nonadherence is based on the idea of principal stratification [511, 512]. This method permits a recovery of the effect of treatment, as opposed to the effect of randomization, under fairly general assumptions. For an example of its application, see the study of exercise in cancer subjects by Mock [1048].
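A rough numerical sketch of the complier-focused idea may help. The counts below are invented, and the estimator shown is only one simple formulation in the spirit of Sommer and Zeger [1415]: it assumes that randomization makes the arms comparable and that nonadherers would have the same event risk under either assignment.

```python
# Hedged sketch of a complier-focused efficacy estimate; all counts are invented.
def complier_efficacy(n_trt_compliers, events_trt_compliers,
                      n_trt_noncompliers, events_trt_noncompliers,
                      n_control, events_control):
    n_trt = n_trt_compliers + n_trt_noncompliers
    p_comply = n_trt_compliers / n_trt                        # complier fraction
    risk_trt_compliers = events_trt_compliers / n_trt_compliers
    risk_noncompliers = events_trt_noncompliers / n_trt_noncompliers
    risk_control = events_control / n_control
    # infer the event risk among control subjects who *would have* complied
    risk_ctl_compliers = (risk_control - (1 - p_comply) * risk_noncompliers) / p_comply
    return risk_trt_compliers, risk_ctl_compliers

rt, rc = complier_efficacy(800, 16, 200, 10, 1000, 40)
print(f"risk in treated compliers          : {rt:.4f}")
print(f"inferred risk in control compliers : {rc:.4f}")
print(f"estimated relative risk (efficacy) : {rt / rc:.2f}")
```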
19.3.6 Trial Inferences Depend on the Experiment Design
The best way to reconcile the legitimate clinical need for a good biological estimate of treatment efficacy and the statistical need for unbiased estimation and correct error levels is to be certain that subjects entered on the trial are very likely to complete the assigned therapy. In other words, the eligibility criteria should exclude subjects with characteristics that might prevent them from completing the therapy. This is different from excluding subjects solely to improve homogeneity. For example, if the therapy is lengthy, perhaps only subjects with good performance status should be eligible. If the treatment is toxic or associated with potentially intolerable side effects, only subjects with normal function in major organ systems would be likely to complete the therapy. It is a fact that the potential inferences from a clinical trial and the potential correctness of those inferences are a consequence of both the experiment design and the methods of
analysis. One should not employ a design with particular strengths and then undo that design during the data analysis. For example, if we are certain that factors associated with treatment decisions are known, there might be very little reason to randomize. One could potentially obtain correct inferences from a simple database. However, if there are influential prognostic factors that are unknown, then a randomized comparative trial offers considerable advantages. These advantages should not be weakened by ignoring the randomization in the analysis. Finally, there may be legitimate biological questions that one cannot answer effectively using rigorous designs. In these circumstances it is not wise to insist on an ITT analysis. Instead, an approximate answer to a well-posed biological question will be more useful than the exact answer to the wrong question.
19.4 PROTOCOL NONADHERENCE

19.4.1 Eligibility
Ineligible subjects are a form of missing data with respect to the external validity of the trial. We do not usually think about eligibility in these terms but it is precisely the concern raised regarding inclusiveness or representation in the study cohort. Potentially, this problem can be ameliorated by using a large heterogeneous study cohort. However, this may not always be possible because of resource limitations. This problem is interesting but peripheral to the focus of this chapter. Inclusion and exclusion criteria are imperfect filters. Most trials are vulnerable to technically ineligible participants being placed on study—a common type of protocol nonadherence. Eligibility criteria are often sharply demarcated, as in cutoffs for lab values or other quantitative assessments. Investigators will naturally feel compelled to enter a potential participant who falls just outside the acceptable range, perhaps in a single laboratory value. A retrospective review of eligibility will find and flag this error even if the spirit of the study was to include such individuals. The lesson is to create boundaries that you can live with. Eligibility criteria serve two purposes. One is to define the study cohort, accomplished by specifying diagnoses and attributes of the disease or condition under study. These could be very important, especially in the genomic setting where a seemingly small characteristic can be definitional. A second purpose is to mitigate risk to the participant by specifying characteristics of general well being or adequate organ system and functional reserve so a stressful therapy can be tolerated. It may be a problem that both purposes are entwined in eligibility criteria. A better arrangement might be to separate definitional parameters from safety guidelines. Safety guidelines could be written in a flexible and interpretive way. I am not aware of any clinical trial that has actually done this. From a wide perspective, imperfect adherence to safety criteria is probably inconsequential to most therapeutic assessments. The criteria themselves might reasonably have been written differently to permit all the actual accruals. More importantly, the treatment effects are likely estimated correctly unless there is a strong interaction between the treatment and the eligibility factor and value in question. This would also have to happen frequently to perturb the effect estimate. This seems extremely unlikely. So we have to expect the impact of nonadherence to safety parameters to be primarily an administrative and management issue rather than a scientific one. As such, it can be important if participant risk is elevated.
Usually, eligibility criteria are defined objectively and are based only on information available before study entry or randomization. Objective criteria include quantitative measurements such as age, the results of laboratory tests, and some categorical factors such as sex. Partly subjective criteria include some measures like extent of disease, histological type, and functional capacity. These criteria are clinically well defined but require expertise to specify them. Subjective criteria are those based on self-report or solely on physician judgment. These may not be appropriate eligibility criteria, although they may be the gold standard for some assessments such as quality of life and pain. Ineligible participants can be analyzed in either of two ways: included with the eligible subjects from the trial cohort (pragmatic approach), or removed from the analysis (explanatory approach). In a randomized comparative trial, if the eligibility criteria are objective and determined from pre-study criteria, neither of these approaches will create a bias in the treatment effect estimate. However, excluding subjects from any study can diminish the external validity of the results. If there is any potential for the eligibility determination to be applied retroactively because of subjective interpretation of the criteria or methodologic error in the protocol design, the pragmatic and explanatory approaches have quite different properties. In particular, excluding subjects in this situation can affect the treatment groups in a randomized comparison differently and produce a bias. The principal difficulty is that exclusions may be confounded with outcome. For example, suppose that a randomized trial compares medical versus surgical therapy for the same condition. Subjects randomized to surgery may need evaluations and criteria beyond those required for subjects to receive medical therapy. Simple examples are normal electrocardiograms and good pulmonary function. If retroactive exclusions are based on such tests, they can differentially affect the treatment groups, yielding subjects with a better prognosis in the surgical arm. This problem could be avoided by requiring all trial participants to pass the same (pre-surgical) criteria. However, such a policy would exclude many subjects eligible for medical therapy, reducing the generalizations that can be made from the trial.
19.4.2 Treatment
Participants can fail to adhere to nearly any aspect of a treatment algorithm. Common problems are reduced or missed doses and improper scheduling. All treatment modalities are subject to such errors, including drugs, biologicals, radiotherapy, surgery, lifestyle changes, diet, and psychosocial interventions. Failure to comply with, or complete, the intended treatment can be a rationale to remove subjects from the analysis. Evaluability criteria, mentioned above, are one way to formalize this. Removing eligible but nonadherent participants from the analysis can create serious biases. From the earlier perspective that a clinical trial is a test of treatment policy, failure to adhere to nuances of the treatment plan is irrelevant. If these deviations are pervasive, it may reflect important shortcomings in study design or conduct. In the extreme we may know that the policy for treatment A failed, but we could remain unenlightened as to whether this is the result of biology or administration.
19.4.3 Defects in Retrospect
Key aspects of a clinical trial are sometimes revisited as part of analyses, regulation, misconduct, or controversy. Problems can always be found because of inevitable human error, the superior lens of hindsight, or improved methods. Assessing the implications of uncovered defects is not difficult in principle, but the setting in which such circumstances arise may sustain controversy. Again the simplest example is eligibility, where a retrospective look may reveal that a few percent of participants were ineligible. This is virtually routine for randomized trials. The implications of this are determined by comparing the two analyses, and it is rare for them to disagree. A trial might be re-analyzed because of misconduct as in the National Surgical Adjuvant Breast Project [1067] also discussed in Section 26.4.2. In that circumstance, the scientific implications were found to be small according to re-analyses, but the overall controversy and public perceptions were bitter. These strongly divergent views seem always to catch scientists by surprise. Rarely, it may be necessary to revisit assessments or classifications of events on a trial, such as attribution of side effects to study medications or assigned causes of death in a mortality study. We always expect to find some discordancies between original assessments and retrospective ones whenever the assessment method involves subjective interpretation. More complex assessment algorithms would be expected to yield a higher frequency of discordant findings. We know for example that radiologists have only a moderate concordance with themselves when re-reading images ref. We have to expect similar results from any complex subjective task. An example might help clarify this issue: Example 19.2. Suppose we have a large randomized clinical trial with a mortality outcome under regulatory scrutiny. Imagine that a comprehensive rigorous review of major events on the trial by new experts using the original algorithm for attribution finds additional deaths that should be classified as related to treatment. Also imagine that this increases the total number of deaths by 1% and that all new events are distributed randomly in the treatment groups. When the trial is re-analyzed, the results do not change qualitatively or quantitatively. Most of us would likely be willing to accept either the original or revised analysis as accurate and recognize the inevitable and inconsequential nature of these “errors.” What would be our reaction if a regulatory agency viewed the discrepancies as unacceptable and wanted to discard the entire clinical trial? Stepping back from the details, this could create a very odd circumstance indeed. Should discordancies of interpretation that we expect in any complex clinical trial, but almost never look for, invalidate a study if we actually have occasion to observe them? Is there a threshold for random misclassification error that invalidates treatment effects? It is important to recognize that there are elements of good study design that support the robustness of a trial to misclassifications, whether those errors are made in the original execution or as part of a putative correction. It is in my view an open question as to which classification is the “correct” one. In any case, randomization, masking or masked review, definitive outcomes, and well defined assessments strongly diminish the chance that error will differentially influence the treatment groups. 
Random errors can create noise and reduce precision but are not expected to bias estimates of treatment effect.
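A trivial simulation in the spirit of Example 19.2 (invented data) illustrates the claim: randomly distributed reclassifications add a little noise and pull the estimate slightly toward the null, but leave the treatment comparison essentially where it was.

```python
# Sketch only: reclassify ~1% of subjects as deaths, at random and independent of arm.
import numpy as np

rng = np.random.default_rng(3)
n_per_arm = 2000
deaths_a = rng.random(n_per_arm) < 0.10      # arm A, true risk 10%
deaths_b = rng.random(n_per_arm) < 0.15      # arm B, true risk 15%

def risk_ratio(a, b):
    return a.mean() / b.mean()

print("original risk ratio:", round(risk_ratio(deaths_a, deaths_b), 3))

flip_a = rng.random(n_per_arm) < 0.01        # random "new" events found on review
flip_b = rng.random(n_per_arm) < 0.01
print("revised  risk ratio:", round(risk_ratio(deaths_a | flip_a, deaths_b | flip_b), 3))
```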
19.5 DATA IMPERFECTIONS
Not all imperfections involve literally missing data. Imperfections can result from inappropriately “correcting” data that are properly missing. For example, subjects who are too ill to comply with a demanding course of treatment should not be included in the study cohort because they are not likely to adhere to the treatment. Data from such participants may be present but incompatible with the intended design. Eligible participants who become too ill to complete treatment or follow-up cannot be discarded as if they were never present. The resulting data would appear complete but would again be imperfect with respect to the intended design. Unavoidable data imperfections result from human error, differences of expert opinion, participants lost to follow-up, and lost data when participants miss test or clinic visits due to illness or exacerbations. Data imperfections that are in principle avoidable can be a consequence of poor study methodology, chance, or lack of protocol adherence. These topics are discussed in the following sections.

19.5.1 Evaluability Criteria Are a Methodologic Error
Protocols sometimes contain improper plans that can create or exacerbate imperfections in the data. A common example is evaluability criteria, by which some investigators attempt to define what it means to receive treatment. Already there is a potential problem because the definition of “receiving treatment” would not be important unless exclusions based on it were planned. Investigators might define “evaluable” subjects as those who receive most or all planned courses of therapy. A biological rationale supports removing inevaluable subjects from the analysis because the treatment did not have an opportunity to work. This circumstance may be common in early and middle development. Suppose that among subjects accrued to a safety and activity trial 𝑁𝐸 are evaluable and 𝑅𝐸 of those have a favorable outcome. 𝑁𝐼 are inevaluable, of whom 𝑅𝐼 have a favorable outcome, as in Table 19.2 (usually 𝑅𝐼 = 0). The estimate of benefit among all subjects is 𝑃 = (𝑅𝐸 + 𝑅𝐼)∕(𝑁𝐸 + 𝑁𝐼) = 𝑅∕𝑇, and that for evaluable subjects is 𝑃𝐸 = 𝑅𝐸∕𝑁𝐸. It appears that 𝑃𝐸 has a firm biological basis because the exclusions are described in advance and are predicated on the persuasive fact that treatment cannot work unless a subject receives it. Attempts to isolate and estimate biological effects of treatment are explanatory, as compared with pragmatic analyses for which evaluability is immaterial. Evaluability criteria can create missing data, although the explanatory perspective sees them as irrelevant. There are fundamental problems with this approach. Evaluability criteria define inclusion retroactively, that is, on the basis of treatment adherence, which is an outcome.

TABLE 19.2 Evaluability and Outcomes in a Single Cohort Trial

Outcome     Evaluable: Yes    Evaluable: No    All
Positive    𝑅𝐸                𝑅𝐼               𝑅
Negative    —                 —                —
All         𝑁𝐸                𝑁𝐼               𝑇
Although treatment adherence is also a predictor of subsequent events, exclusions based on it incorrectly assume that it is a baseline factor. These exclusions create bias, as would other retroactive definitions or selections based on other outcomes. For example, suppose that comparison groups are defined using outcome or future events such as tumor response in cancer studies. Subjects who respond must have lived long enough to do so, whereas subjects who did not respond may have survived a shorter time. Therefore, survival or other event time comparisons based on such categorizations can be biased. This problem is discussed by Anderson et al. [39]. The pragmatic perspective is a better one for coping with questions of evaluability. The pragmatist recognizes that the trial does not guarantee an assessment of biological effect. Even if it did, 𝑃𝐸 does not necessarily estimate anything because it is confounded with adherence. Furthermore, the concept of biological effect degenerates if subjects cannot adhere to the treatment. The trial does assure an unbiased estimate of treatment benefit if all eligible subjects are included in the analysis. The way to assure that the pragmatic estimate of benefit is close to the explanatory estimate of biological effect is to select treatments and design and conduct the trial so that there is high adherence. Another perspective on this issue can be gained by considering the difference between conditional and unconditional estimates of treatment effect. Unconditional estimates of effect do not rely on adherence (nor any other factor) and are always applicable to a new patient at baseline. Effect estimates that condition on other factors, even those measurable at baseline, create potential problems. If the estimate conditions on events from the patient’s future (adherence), it will not represent the true biological effect and is likely to be uninterpretable. Evaluability criteria may be defined only implicitly by the analytic plan. This mistake can be subtle and hard to correct. Exactly this type of error plagues many uncontrolled comparisons of a stressful therapy compared with a nonstressful one (e.g., surgery versus medical management), which can create a survivors’ bias. However, the same bias can occur in a randomized comparison unless an outcome is defined for every subject regardless of vital status. A good example of the problem and its fix can be seen in the discussion of lung volume reduction surgery in Section 4.6.6. In that randomized trial, comparisons of mean functional measures (FEV1, exercise capacity, quality of life) by treatment group were potentially biased because the surgical therapy produced a higher short-term mortality capable of spuriously raising the group average. To correct this effect, an outcome was defined based on achieving a prespecified degree of improvement. Missing or deceased subjects were classified as unimproved, removing the survivors’ bias. It is surprising how resistant some clinicians were to this remedy.
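The remedy is simple to implement. The sketch below uses invented data and an arbitrary improvement threshold; the essential step is that every randomized subject gets an outcome, with deceased or missing subjects counted as unimproved.

```python
# Sketch only: invented functional data with early deaths removing follow-up values.
import numpy as np

rng = np.random.default_rng(4)
n = 500
died_early = rng.random(n) < 0.08                     # e.g., operative mortality
fev1_change = rng.normal(loc=0.05, scale=0.2, size=n)
fev1_change[died_early] = np.nan                      # no follow-up measurement

# Survivors-only summary (biased if sicker subjects die before follow-up)
mean_among_survivors = np.nanmean(fev1_change)

# Pragmatic outcome: improvement of at least 0.1 (arbitrary threshold),
# with deceased or missing subjects counted as unimproved
improved = np.nan_to_num(fev1_change, nan=-1.0) >= 0.1
print("mean change among survivors          :", round(mean_among_survivors, 3))
print("proportion improved (all randomized) :", round(improved.mean(), 3))
```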
19.5.2 Statistical Methods Can Cope with Some Types of Missing Data
Missing data happen 100% of the time in clinical trials. This occurrence should not be viewed as a failure unless (i) it is the product of sloppiness, bad design, or poor choice of outcomes, or (ii) it is handled unwisely. Usually, missing data result from accident, autonomy of human subjects, or as a partial consequence of a participant’s prognosis. In this latter case, a trial participant may become too ill to provide intended follow-up data. This overall topic is large and complex and can only be covered superficially here. More detail can be found in contemporary references [944, 1307].
There are three generic ways to cope with missing values: disregard the observations that contain a missing value; disregard the variable if it has a high frequency of missing values; replace the missing data by some appropriate values. This last approach, called imputation of missing values, is discussed below. The consequences of missing data can range from trivial to catastrophic. There will inevitably be some loss of precision, but also the potential for strong bias in estimated treatment effects. Suppose that the sample size for a trial is robust, missing data occur randomly, and represent only a small fraction of observations. Then we can safely say that the consequences will be trivial. Two of these conditions can be observed in the data, but the criterion of “randomly missing” is not so obvious. It means that there is statistical independence between the outcome and missing data. In other words, random missingness implies that the missing data carry no information about outcome. Human error during data entry might be an example of this. Random missingness is a strong assumption. When we know or suspect that the reason for missing data is informative with respect to the outcome, more thoughtful methods must be applied. For example, if subjects are unable or unwilling to provide follow-up data as an illness advances, the study cohort might yield a longitudinally biased view of treatment. In a randomized comparison, if one treatment has a higher short-term mortality (major surgery versus medical therapy, for example), functional comparisons will be biased if sicker subjects are less likely to survive the stressful therapy. This is a so-called survivor’s bias. These circumstances are much more concerning than any loss of precision from missing data because they can yield incorrect estimates of treatment effect. The choices for coping with missing data are limited. When the fraction of missing data is small it might make sense to ignore it. But this is also the circumstance in which replacing or imputing values might be most feasible and appropriate. If a high fraction of data are missing, it can be difficult to justify either the basis or results of imputation. To compensate for missing data we employ (i) reasonable but potentially unverifiable assumptions, (ii) nonmissing data, and (iii) computations or procedures to replace or impute missing values. There is no cure from within the observed data for observations that are entirely missing. However, when a record is partially missing, correlations among remaining variables motivate using observed data with reasonable models or procedures for imputation of the missing pieces.
Old Ways A somewhat old-fashioned method of imputation is last observation carried forward (LOCF), used when recent longitudinal data are missing and we employ earlier values from the same subject as replacements. LOCF implicitly assumes that early values are appropriate representatives for later missing ones. This is not plausible for progressive serious illnesses, where missingness may be correlated with prognosis and the measured outcome reflects declining function. It is not possible to say generally whether LOCF is conservative or anti-conservative. LOCF continues to be used as a simple tool but is inferior to newer methods. Another simple method for imputation was to replace each missing value with the mean of the nonmissing ones. This seems to be attractive because it preserves the overall
mean of each variable. But it implicitly assumes that the missing data are very similar to the nonmissing data, in fact even more typical, and it does not account for random variation. Another approach uses other variables to predict the missing values according to some reasonable model, like a linear multiple regression. This can be accomplished while preserving the overall correlation structure of the data. The prediction model itself might depend on imputed values, leading to an iterative procedure. Validity of such methods depends on the model and the presence of appropriate predictors. It also does not fully account for variation. In practice, more than one approach might be tried, and the sensitivity of conclusions to each can be assessed. If all approaches yield similar answers, we would have more confidence in the conclusions. Otherwise, we might not consider any approach to be reliable. Imputation is not a good solution for data with a high frequency of missing values, and is certainly not a substitute for thoughtful design, active ascertainment of outcomes, and an adequate infrastructure to support the study.
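The two simple methods just described are easy to demonstrate. The toy longitudinal measurements below are invented for illustration; this is a sketch of the mechanics only, not a recommendation to use either approach.

```python
# Sketch only: LOCF and mean imputation on invented visit data.
import numpy as np
import pandas as pd

visits = pd.DataFrame(
    {"week0": [10.0, 12.0, 9.0, 11.0],
     "week4": [11.0, np.nan, 9.5, 12.0],
     "week8": [np.nan, np.nan, 10.0, 12.5]},
    index=["s1", "s2", "s3", "s4"])

# Last observation carried forward: fill later visits from earlier ones
locf = visits.ffill(axis=1)

# Mean imputation: replace each missing value with that visit's mean
mean_imputed = visits.fillna(visits.mean())

print(locf, "\n")
print(mean_imputed)
```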
New Ways A problem with all simple methods, aside from model assumptions, is that the imputed values are then analyzed essentially as though they were never missing. However, we should attach more uncertainty to imputed data than to observed data because we are less certain that imputed values are correct. Otherwise, estimates of variance, confidence intervals, and 𝑝-values may be more favorable than they should be. One way to account appropriately for the uncertainty inherent in resolving missingness is via multiple imputation. It proceeds via a simple but reasonable assumption: an observation with missing elements might be similar to other observations without the missing elements that share values, or at least are close to it in multidimensional space. Such observations can be said to be similar. Suppose we then sample randomly (with replacement) from the similar existing data and replace missing values. The overall analysis will differ slightly with each such replacement, leading ultimately to a distribution of effect estimates, rather than a single estimate. The distribution of effect estimates indicates the increased uncertainty that results from imputation. Some extra care will then be required in the reporting and interpretation of results, but the representation of slightly increased uncertainty will be appropriate. The notion of what it means for observations to be similar to one another can be statistically well-defined according to the problem. A simple example would be to take the closest observation using a distance measure in the multidimensional space containing the observations. In any case, the general approach of multiple imputation is probably the best way to cope with partially missing data. Some data analysis software has built-in procedures to facilitate it. The following general circumstances might support imputation as a worthwhile analytic strategy: (i) the frequency of missingness is relatively small, for example, less than 10–15% of the data, (ii) the variable containing missing values is especially important clinically or biologically to the research question, (iii) reasonable (conservative) base assumptions and technical strategy for imputation exist, and (iv) sensitivity of the conclusions to different imputation strategies can be determined. Then the effect of imputation of missing values will be to shift the debate away from concern about the
possible influence of missing observations to concern about the effect of assumptions and methods. Because the latter can be studied, this shift of focus can be extremely helpful.
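A bare-bones version of the resampling idea described above is sketched here with invented data; "similar" is defined, arbitrarily, as the ten complete records closest on a fully observed covariate, and the "analysis" is just a regression slope. Real applications would use purpose-built multiple imputation software and standard rules for combining the results across imputations.

```python
# Sketch only: hot-deck style multiple imputation from "similar" complete records.
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)                        # fully observed covariate
y = 0.5 * x + rng.normal(scale=1.0, size=n)   # outcome, partially missing
missing = rng.random(n) < 0.10
y_obs = np.where(missing, np.nan, y)
donors = ~missing                             # records with observed outcomes

def impute_once(rng):
    y_imp = y_obs.copy()
    for i in np.where(missing)[0]:
        # "similar" = the 10 complete records closest in x; draw one at random
        order = np.argsort(np.abs(x[donors] - x[i]))[:10]
        y_imp[i] = rng.choice(y_obs[donors][order])
    return y_imp

estimates = []
for _ in range(20):                           # 20 imputed datasets
    slope = np.polyfit(x, impute_once(rng), 1)[0]
    estimates.append(slope)

print("mean slope over imputations:", round(np.mean(estimates), 3))
print("between-imputation spread  :", round(np.std(estimates), 4))
```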
Censoring The explanatory way of coping with unobserved events because they are seemingly unrelated to the outcome of interest (e.g., noncardiac deaths in a cohort with time to myocardial infarction as the primary study outcome) might be to censor such observations. If events seemingly unrelated to the outcome of interest are in fact independent of the outcome, this might be appropriate. In other words, not counting or censoring the seemingly unrelated events would address a useful question about the cause-specific event rate and would not be subject to bias. In clinical trials with longitudinal components such as time-to-event studies, some subjects are likely to be lost to follow-up. In other words, the follow-up period may not be long enough for investigators to observe events in all subjects. This can happen when a study participant is no longer accessible to the investigators because the subject does not return for clinic visits or has moved away. It can also happen when the follow-up period is shortened due to limited resources or error. Follow-up information is also lost when the earliest event time is the subject’s death, which prevents observing all later event times such as disease progression or recurrence. In this case, the competing risks of death and disease progression result in lost data. This can be a problem in studies of chronic disease, especially in older populations. If losses to follow-up occur for reasons not associated with outcome, they have little consequence for affecting the study result, except to reduce precision. If investigators know that losses to follow-up are independent of outcome, the explanatory and pragmatic views of the problem are equivalent. In a trial with survival time as the main outcome, subjects might stop coming to clinic for scheduled follow-up visits because they are too ill or have, in fact, died. In this case, being lost to follow-up is not a random event but carries information about the outcome. Such problems could affect the treatment groups of a randomized trial differently, producing a bias. In any case, studies designed with active follow-up and active ascertainment of endpoints in the participants will be less subject to this problem than those that rely on passive methods of assessing individual outcomes. When losses occur frequently, even if they are not associated with the outcome, the external validity of the trial might be open to question. In studies using survival or disease recurrence or progression time as the major outcome, these losses to follow-up should occur in less than 5% of the trial participants when the study is conducted by experienced researchers. Often these types of missing data are not correctable but are preventable by designs and infrastructure that use active follow-up and ascertainment of events. Investigators cannot reliably assume that losses are random events and conduct analyses that ignore them, particularly if they occur often. Every effort should be made to recover lost information rather than assuming that the inferences will be correct “as though all subjects were followed completely.” Survival status can
sometimes be updated through the National Death Index or local sources. Active efforts to obtain missing information from friends or family members are also frequently successful. There are many clinical rationalizations why investigators permit uncounted events in the study cohort. Suppose that a drug trial is being conducted in subjects at high risk of death due to myocardial infarction. Because these subjects are likely to be older than low-risk individuals, some study participants might die from other diseases before the end of the trial. Some investigators might prefer not to count these events because they do not seem to carry information about the effect of treatment on the target cause of death. Failure to count all such events can bias estimates of treatment effect. There is some sensible advice on this point in [1324]. It is also worth reading the discussion of the National Emphysema Treatment Trial in Section 4.6.6 for an illustration of the potential consequences of ignoring missing data that arise from differential stresses on the treatment groups in a randomized trial. Different causes of failure compete with one another and failure of one kind usually carries some information about other types of failure. Being lost to follow-up may be associated with a higher chance of disease progression, recurrence, or death, for example. Death from suicide, infection, or cardiovascular causes may be associated with recurrence or progression of chronic disease such as cancer or AIDS. In cases such as these one cannot expect to estimate a cause-specific event rate without bias by censoring the seemingly unrelated events. Instead, the pragmatist would rely on a composite and well-defined endpoint such as time to disease recurrence or death from any cause (disease-free survival). Alternatively, time to death from any cause (overall survival) might be used. Censoring seemingly unrelated event times is usually a correctable error.
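The construction of such a composite endpoint is mechanically simple, as the invented example below shows: every subject contributes the first of recurrence, death, or administrative censoring, so deaths are counted as events rather than censored away.

```python
# Sketch only: invented follow-up times (years); 3 years of scheduled follow-up.
import numpy as np

rng = np.random.default_rng(6)
n = 300
t_recur = rng.exponential(scale=5.0, size=n)    # time to disease recurrence
t_death = rng.exponential(scale=6.0, size=n)    # time to death from any cause
t_admin = 3.0                                   # administrative end of follow-up

# Composite endpoint (disease-free survival): first of recurrence or death
t_dfs = np.minimum(np.minimum(t_recur, t_death), t_admin)
e_dfs = t_dfs < t_admin

# "Cause-specific" view that censors deaths as if they were uninformative
e_recur_only = t_recur < np.minimum(t_death, t_admin)

print("events counted with composite endpoint :", int(e_dfs.sum()))
print("events counted censoring deaths        :", int(e_recur_only.sum()))
```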
19.6 SUMMARY
Clinical trials are characterized by imperfect data as a consequence of protocol nonadherence, methodologic error, and incomplete observations. Inferences from a trial can depend on how the investigators resolve data imperfections. Two approaches to such questions have been called “explanatory” and “pragmatic.” The pragmatic approach tends to follow the statistical design of an experiment more closely. The explanatory approach may make some assumptions to try to answer biological questions. Data imperfections that have the most potential for influencing the results of a trial arise from subjects who do not adhere to the assigned treatment or other criteria. Nonadherence encourages some investigators to remove subjects from the analysis. If the reasons for exclusion are associated with prognosis or can affect the treatment groups differently, as is often the case, the trial results based on exclusions may not be valid. Intention to treat is the principle that includes all subjects for analysis in the groups to which they were assigned, regardless of protocol adherence. This approach is sometimes at odds with explanatory views of the trial, but usually provides a valid test of the null hypothesis. Approaches based on analyzing subjects according to the treatment they actually received may be useful for exploring some clinical questions but should not be the primary analysis of a randomized clinical trial. Investigators should avoid any method that removes eligible subjects from the analysis.
19.7 QUESTIONS FOR DISCUSSION
1. In some retrospective cohort studies subjects who have not been followed for a while in clinic can be assumed to be alive and well. Discuss how such an assumption can create bias. 2. Generalizing the idea of intention to treat to developmental trials, one might require that all subjects who meet the eligibility criteria should be analyzed as part of the treatment group. Discuss the pros and cons of this approach. 3. Discuss specific circumstances in a comparative trial that might make it difficult to apply the intention to treat approach. 4. Read the paper by Stansfield et al. [1436] and discuss the inclusion/exclusion properties of the analysis. 5. Apply the method of Sommer and Zeger [1415] to the data from the Coronary Drug Project, assuming that the placebo compliance data were not available. How does the method compare to the actual estimated relative risk? Discuss.
20 ESTIMATING CLINICAL EFFECTS
20.1 INTRODUCTION

Understanding the connection between estimation and study design is crucial for clinical trials. Because this book is not primarily about analyzing trials, the discussion here will only briefly cover some clinically relevant estimates from data. Thorough knowledge of this topic requires good design, experience, and expert statistical collaborators. Some useful advice in the clinical context is given by Pocock [1222]. For the reader with sufficient background and ambition, additional technical details for analysis can be found in a number of books [82, 221, 436]. How best to report findings is discussed in Chapter 25. A more complete introduction to the analysis of survival data can be found in Kleinbaum [853]. The issue of simultaneous analysis of treatment effects and covariates, or adjusted analyses, is discussed in Section 21.3. Most clinical trials use several of the types of outcomes discussed in Chapter 5, and therefore require qualitatively different summaries of the data to address all objectives. Informative analyses of data depend on the clinical context in which questions are embedded, relevant biological knowledge, study design, proper counting of subjects and events, numeracy, and familiarity with probability, statistical models, and inference. Applying these diverse concepts to clinical trials requires a broad range of skill and experience. Many of the same methods of data analysis or estimation are used for both experiments and nonexperimental studies. Although the spectrum of such statistical tools extends from simple description to analytical models, good experiment design usually permits valid inference from descriptive methods alone. Here analytical refers not to data analysis, but to the methods of quantifying relationships among observables. The true value of experiment design is to free the investigator from statistical modeling, conditions, and
other strong assumptions necessary for the analysis of happenstance data. The validity of conclusions from experiments relies on design more than on analysis. Data analysis is fascinating because it is dynamic, technological, and proximate to final results. Sometimes it even seems a bit mysterious. So analysis can appear to be more important than the study design which was long since completed. However it appears, analysis of experiments is always subordinate to design. A virtuous data analysis cannot reliably correct flaws in the design of a trial. This is true for both experiments and nonexperimental studies. Furthermore, without a foundation of design, even clever analytic models reduce to purely descriptive devices.
20.1.1 Invisibility Works Against Validity
With modern computerized workflows it is possible, if not likely, that investigators will never see the actual data values that support conclusions. Despite automated validity and bounds checking with electronic data capture, human error can enter and propagate through data systems. Sometimes, only human scrutiny can correct those errors. Computerized transformation of data values can yield errors that may then be greatly amplified in number or scope. Poorly designed data capture, which sometimes seems to be flexible, may be the most vulnerable. An example is codifying free text, or natural language processing, which is often a challenge even for literate humans. This type of application is going to be common in the future. Looking at raw data listings from a study may seem primitive and pointless, but is actually quite valuable. Defects that are immediately visible by inspection include missing values, range errors, inconsistencies across records, and miscoded values. These types of mistakes can be fatal to final conclusions, but could be missed by data coordinators who may not have as deep an understanding as the investigator of the measurements. For small trials, the entire workfile can easily be examined prior to analysis. Some subtle errors may not be visible on inspection, but no errors are visible unless we look. To minimize drudgery and the seeming inefficiency of reviewing source data, one might imagine looking at tabulations and data summaries. Unfortunately, there are no summaries of data that can assure the integrity of individual data elements.
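The kind of scrutiny described above can be partly automated. The sketch below applies a few simple checks to a stand-in data listing; the variable names, plausible ranges, and values are hypothetical, and such checks supplement rather than replace looking at the listing itself.

```python
# Sketch only: simple validity checks on a hypothetical raw data listing.
import pandas as pd

listing = pd.DataFrame({
    "id": [1, 2, 3],
    "creatinine": [1.1, 44.0, None],            # 44.0 looks like a unit or entry error
    "entry_date": ["2016-01-10", "2016-02-02", "2016-03-15"],
    "death_date": [None, "2015-12-01", None],   # precedes entry: an inconsistency
})

print(listing.isna().sum())                                                    # missing values
print(listing[(listing["creatinine"] < 0.2) | (listing["creatinine"] > 15)])   # range errors
bad = pd.to_datetime(listing["death_date"]) < pd.to_datetime(listing["entry_date"])
print(listing[bad])                                                            # inconsistent records
```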
20.1.2 Structure Aids Internal and External Validity
Design of a clinical experiment imposes structure on the data that will ultimately validate analysis and interpretation. Other structural elements, such as a biological or population model, can represent knowledge from outside the trial, and also contribute to valid and generalizable conclusions. For example in pharmacological studies, blood samples are used to construct time–concentration curves that describe drug distribution and metabolism. These rely on the sampling design. But empirical data are usually augmented with a pharmacokinetic model that represents knowledge external to the trial. Data and model together allow estimating clinically relevant parameters such as half-life, elimination rate, and area under the curve. The model facilitates estimation of physiologic effects, smoothing and interpolation of data, quantitative assessments of side effect risks, and reliable inferences about future patients. Extrapolation to individuals yet to be treated is based on biological knowledge embodied in the model, which originates outside the actual experiment.
In middle development trials, investigators are usually interested in clinical responses and side effects. The experiment permits estimating the unconditional probability of response or risk in those who met the eligibility criteria. These measures of benefit and risk not only summarize the effects of treatment but also often generalize outside the trial to a substantial degree based on biological similarity. By adding structural models, subject characteristics can be connected quantitatively to outcomes. These risk or prognostic models are explicitly intended to generalize because they estimate effects usually assumed to be population relevant and not modulated by subject characteristics. Dose finding and safety and activity trials would not have a key role in development if they did not permit a reasonable degree of generalizability. This in no way contradicts restrictions derived from deterministic effects such as genetic variants or expressions. Comparative trials yield formal comparisons regarding various relevant clinical benefits and risks. Design components that isolate the interesting effects and remove extraneous factors serve to increase validity. The causal relationship between treatment and outcome supported by the design of these trials is one such fundamental generalization. New questions or hypotheses can be suggested by other structural components such as risk models, subset definitions, or aggregated data from several studies. One very important question is whether supplemental questions are specified in advance or driven by findings in the data. The key is how investigators describe these findings. Questions specified in advance of collecting data, though perhaps secondary, deserve to be described as a priori hypotheses. Often they are answered with lower precision than a primary question. Data exploration may suggest comparisons that we wish we had been smart enough to pre-specify. Data-driven comparisons are less reliable because they may well be supported more by chance than by design. Findings unsupported by design structure internal to the trial should never be represented as a primary result. Example 20.1. Suppose that a series of participants meeting pre-defined eligibility criteria are given a six-week course of a new antihypertensive drug. The drug produces side effects that require dose reduction or discontinuation of the treatment in some individuals. The design of this study suggests a primary interest in estimating the unconditional probability of benefit from the new drug and the proportion of individuals who cannot tolerate the therapy. Suppose that investigators also observe that participants who remain on the new drug have “significantly” lower blood pressure after six weeks than those who resumed their previous treatment. Although there may be biological reasons why the new treatment represents an advance, the study design does not permit a reliable comparison with standard therapy. The participants who experience side effects on the new treatment may be different in clinically important ways from those who tolerate the drug.
20.1.3 Estimates of Risk Are Natural and Useful
Risk is always important to the patient. Both safety and efficacy can be summarized in terms of the risk of experiencing relevant events, as discussed in Chapter 5. Often risk is the best or only way to quantify these effects. The clinician scientist is invariably interested in estimating the magnitude of risk differences. Depending on the setting, risk differences can be expressed in absolute magnitude or on a relative scale, as discussed below. A major flaw in populist interpretation revolves around perceived risk as compared to actual or quantified risk. The confounding of perceived with actual risk applies both
to understanding specific therapies, and to risks and benefits inherent in clinical trials. A second flaw arises from distorted metrics applied to assess risk. A potential trial participant should generally be interested in absolute probabilities or risks, but is seldom given sufficient information to understand them appropriately. In contrast, the clinical investigator is typically focused on relative risk when comparing therapies. However, when assessing safety, the clinical investigator must also have a comprehension of the relevant absolute probabilities. Odds and odds ratios are also frequently used for assessing relative risks. The ordinary betting odds is a widely cited measure of risk and is often used as a wagering summary or guide. At the track, it does not formally relate to the chances of a particular horse winning, but merely reflects the current allocation of wagers. In contrast, the use of odds in study design and interpretation is always a reflection of risk probabilities. In football and some other sports contests, expected performance differences are often quantified as a point spread, which is generally understood to equalize the probability of winning a bet on either side. Although widely understood, such measures of risk are unhelpful in medical studies. Frequently, investigators are interested in comparing the probabilities of successful outcomes or failures, measured longitudinally. Relative risks are helpful in this circumstance. One often assumes that the risk ratio remains constant over time, which might be the case even when absolute risks change. When comparing event times such as survival or disease recurrence or progression, groups can be compared by the difference in the absolute probabilities of failure. Alternatively, differences in median event times could be used. Hazard or risk ratios are more natural and efficient summaries of such differences. It is essential to understand these risk summaries to be conversant with clinical trials. Additional details are provided later in this chapter. To illustrate the simplicity that design affords us when comparing risk or other events in a trial, consider a simple approach proposed by Pocock [1215]. A balanced randomized design would allow us to compare directly the event counts in the two treatment groups, which I will denote by 𝑛𝑎 and 𝑛𝑏. Assuming events are relatively uncommon and denominators do not differ greatly, the treatment difference is Δ𝑎𝑏 = 𝑛𝑎 − 𝑛𝑏. A statistical test of equality can be based on the ratio

\[
Z = \frac{n_a - n_b}{\sqrt{n_a + n_b}},
\]

which is approximately a standard normal deviate under the null. Both formulas are composed only from the event counts in the treatment arms. This statistical test has the same form as McNemar’s test for matched pairs [13]. It may seem mysterious as to why the variance of 𝑛𝑎 − 𝑛𝑏 is 𝑛𝑎 + 𝑛𝑏, but consideration of the Poisson distribution will clarify. This is a nice illustration of how design affords us the simplest but valid assessment of treatment differences. Some discipline is needed to generate the most useful summaries of risk. Generally, unconditional estimates are best. For example, in safety assessments or when estimating the chance of benefit in a single cohort study, unconditional estimates are most appropriate. Eligible study subjects should not be removed from the denominator of risk calculations, even if doing so appears to address some useful post hoc clinical question.
Risk calculations that condition on selection, other interim outcomes, or treatment adherence, do not estimate a pure biological effect because they are confounded by the unknown conditioning effects.
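For completeness, the event-count comparison described above is a one-line computation; the counts below are invented for illustration.

```python
# Direct computation of Z = (n_a - n_b) / sqrt(n_a + n_b) from event counts.
from math import sqrt
from scipy.stats import norm

n_a, n_b = 28, 45                          # event counts in two balanced arms
z = (n_a - n_b) / sqrt(n_a + n_b)
p = 2 * norm.sf(abs(z))                    # two-sided p-value
print(f"Z = {z:.2f}, two-sided p = {p:.3f}")
```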
20.2 DOSE-FINDING AND PHARMACOKINETIC TRIALS
Obtaining clinically useful information from dose-finding trials such as phase I studies depends on (i) evaluation of preclinical data, (ii) knowledge of the physical and chemical properties of the drug and related compounds, (iii) modeling drug absorption, distribution, metabolism, and elimination, and (iv) judgment based on experience. Instead of a comprehensive mathematical presentation, I will discuss a few interesting points related to modeling and obtaining estimates of important pharmacokinetic parameters. More complete discussions of modeling and inferences can be found in Carson, Cobelli, and Finkelstein [235] or Rubinow [1308, 1309] and many other places [400, 604, 752, 1061, 1285]. Physiological models are discussed in Keener [829]. Similar models might be used to extract parameters from quantitative imaging studies.

20.2.1 Pharmacokinetic Models Are Essential for Analyzing DF Trials
One of the principal objectives of phase I studies is to assess the distribution and elimination of drug in the body. Some specific parameters of interest are listed in Table 20.1.

TABLE 20.1 Outcomes of Clinical Interest in Dose-Finding Studies
∙ Optimal biological dose (e.g., MTD)
∙ Absorption rate
∙ Elimination rate
∙ Area under the time–concentration curve
∙ Peak concentration
∙ Half life
∙ Correlation between plasma levels and side effects
∙ Proportion of subjects who demonstrate evidence of efficacy

Pharmacokinetic (PK) or compartmental models are a useful tool for summarizing and interpreting data from DF trials such as phase I studies. These models can also be helpful in study design, for example, to suggest the best times for blood or other samples to be taken. During analysis the model is essential as an underlying structure for the data, permitting quantitative estimates of drug elimination. PK models do have limitations because they are idealizations. No model is better than the data on which it is based. However, when properly used, the model may yield insights that are difficult to gain from raw data. Some other objectives of DF studies may not require modeling assumptions. Examples include secondary objectives, such as exploring the association between plasma drug levels and severity of side effects, or looking for evidence of drug efficacy. In any case, PK models are at the heart of using quantitative information from DF studies. Drug absorption, distribution, metabolism, and excretion is not generally a simple process. However, relatively simple PK models can yield extremely useful information,
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
DOSE-FINDING AND PHARMACOKINETIC TRIALS
595
even though they do not capture all of the complexities of the biology. In the next section, I consider a simple but realistic PK model that facilitates quantitative inferences about drug distribution and elimination. An important consideration in assessing toxicity, especially in oncology studies, is the area under the time–concentration curve in the blood (AUC). The model discussed below illustrates one method of studying the AUC. 20.2.2
A Two-Compartment Model Is Simple but Realistic
Consider a drug administered at a constant rate by continuous intravenous infusion. We assume that the drug is transferred from blood to a tissue compartment (and vice-versa) with first order kinetics, and eliminated directly from the blood, also by first order kinetics (Fig. 20.1). This situation can be described by a two-compartment linear system in the following way. Suppose that 𝑋(𝑡) and 𝑌 (𝑡) are the drug concentrations within blood and tissue, respectively, at time 𝑡 and the drug is infused into the blood compartment with a rate of 𝑔(𝑡). The drug is transported from compartment 𝑋 to 𝑌 at a rate 𝜆, from 𝑌 back to 𝑋 at a rate 𝜇, and eliminated from 𝑋 at a rate 𝛾. All coefficients are positive. The rate equations for the system are 𝑑𝑋(𝑡) = −(𝜆 + 𝛾)𝑋(𝑡) + 𝜇𝑌 (𝑡) + 𝑔(𝑡), 𝑑𝑡 𝑑𝑌 (𝑡) = 𝜆𝑋(𝑡) − 𝜇𝑌 (𝑡), 𝑑𝑡
(20.1)
𝑑 denotes the derivative with respect to time. This is a nonhomogeneous linear where 𝑑𝑡 system of differential equations and can be solved using standard methods [1395]. The general solution of this system is
𝑋(𝑡) = 𝑐1 (𝑡)𝑒𝜉1 𝑡 + 𝑐2 (𝑡)𝑒𝜉2 𝑡 , 𝜉 +𝜆+𝛾 𝜉 𝑡 𝜉 +𝜆+𝛾 𝜉 𝑡 𝑌 (𝑡) = 𝑐1 (𝑡) 1 𝑒 1 + 𝑐2 (𝑡) 2 𝑒2, 𝜇 𝜇
FIGURE 20.1
A simple two-compartment model for drug distribution.
(20.2)
Piantadosi
Date: July 27, 2017
596
Time: 4:45 pm
ESTIMATING CLINICAL EFFECTS
where 𝜉1,2 =
−(𝜆 + 𝜇 + 𝛾) ±
√
(𝜆 + 𝜇 + 𝛾)2 − 4𝜇𝛾 2
(20.3)
and 𝑐1 (𝑡) and 𝑐2 (𝑡) satisfy 𝜉 +𝜆+𝛾 𝑑𝑐1 (𝑡) = 2 𝑔(𝑡)𝑒−𝜉1 𝑡 , 𝑑𝑡 𝜉2 − 𝜉1 𝜉 +𝜆+𝛾 𝑑𝑐2 (𝑡) =− 1 𝑔(𝑡)𝑒−𝜉2 𝑡 . 𝑑𝑡 𝜉2 − 𝜉1
(20.4)
To get an explicit solution for the system, we must specify 𝑔(𝑡), the infusion rate as a function of time, and initial conditions 𝑋(0) and 𝑌 (0). An interesting special case is constant infusion for a fixed time period, { 𝑔0 , 𝑡 ≤ 𝑡0 , (20.5) 𝑔(𝑡) = 0, 𝑡 > 𝑡0 . Then, substituting equations (20.3–20.5) into equation (20.2), ⎧ 𝑔0 𝜉2 +𝜆+𝛾 𝑔0 𝜉 𝑡 ⎪ 𝑟 + 𝜉2 −𝜉1 ( 𝜉1 + 𝑋(0))𝑒 1 ⎪ 𝜉 +𝜆+𝛾 𝑔 − 𝜉1 −𝜉1 ( 𝜉0 + 𝑋(0))𝑒𝜉2 𝑡 , ⎪ 2 2 ⎪ 𝑋(𝑡) = ⎨ 𝑔0 −𝜉 𝑡 ⎪ 𝜉2 +𝜆+𝛾 𝑔0 1 0 )𝑒𝜉1 𝑡 ⎪ 𝜉2 −𝜉1 ( 𝜉1 + 𝑋(0) − 𝜉1 𝑒 ⎪ 𝑔 𝜉 +𝜆+𝛾 𝑔 − 1𝜉 −𝜉 ( 𝜉0 + 𝑋(0) − 𝜉0 𝑒−𝜉2 𝑡0 )𝑒𝜉2 𝑡 , ⎪ 2 1 2 2 ⎩
𝑡 ≤ 𝑡0 , (20.6) 𝑡 > 𝑡0 ,
where the initial condition 𝑌 (0) = 0 has been incorporated. Here we have used the facts that 𝜉1 + 𝜉2 = −(𝜆 + 𝜇 + 𝛾) and 𝜉1 𝜉2 = 𝜇𝛾. Hence, the area under the curve (AUC) for the first compartment is 𝐴𝑈 𝐶𝑥 =
𝑡0
∫0
∞
𝑋(𝑡)𝑑𝑡 +
∫𝑡0
𝑋(𝑡)𝑑𝑡 =
𝑔0 𝑡0 + 𝑋(0) . 𝛾
(20.7)
Sometimes it is helpful to express these models in terms of amount of drug and volume of distribution. Equation (20.7) can be rewritten 𝐴𝑈 𝐶𝑥 =
𝐷𝑡0 + 𝑊 (0) , 𝛾𝑉
where 𝐷 is the dose of drug, 𝑉 is the volume of distribution (assumed to be constant), 𝑊 (𝑡) = 𝑉 × 𝑋(𝑡) is the amount of drug, and 𝛾𝑉 is the “clearance.” However, expressing 𝐴𝑈 𝐶𝑥 this way represents only a change of scale, which is not necessary for this discussion. Because drug “dose” is commonly expressed as weight of drug, weight of drug per kilogram of body weight, or weight of drug per square meter of body surface area, we can refer to 𝑔0 𝑡0 + 𝑋(0) as a dose, even though formally it is a concentration.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
DOSE-FINDING AND PHARMACOKINETIC TRIALS
597
When the drug is infused constantly from time 0 to 𝑡0 and 𝑋(0) = 0, 𝐴𝑈 𝐶𝑥 = 𝑔0 𝑡0 ∕𝛾. This is the ratio of total dose to the excretion rate. Another interesting case is when the drug is given as a single bolus, meaning 𝑡0 = 0, in which case 𝐴𝑈 𝐶𝑥 = 𝑋(0)∕𝛾, which is also a ratio of total dose over excretion rate. The transport parameters 𝜆 and 𝜇 do not affect the AUC in the first compartment. With similar calculations we can find the solution for the second compartment, ⎧ 𝑔0 𝜆 𝑔0 𝜆 𝜉 𝑡 ⎪ 𝜇𝛾 − 𝜉2 −𝜉1 ( 𝜉1 + 𝑋(0))𝑒 1 ⎪ 𝑔 𝜆 + 𝜉 −𝜉1 ( 𝜉0 + 𝑋(0))𝑒𝜉2 𝑡 , ⎪ 2 2 ⎪ 𝑌 (𝑡) = ⎨ ⎪ − 𝜆 ( 𝑔0 + 𝑋(0) − 𝑔0 𝑒−𝜉1 𝑡0 )𝑒𝜉1 𝑡 ⎪ 𝜉2 −𝜉1 𝜉1 𝜉1 ⎪ 𝑔0 𝑔 𝜆 + 𝜉 −𝜉1 ( 𝜉 + 𝑋(0) − 𝜉0 𝑒−𝜉2 𝑡0 )𝑒𝜉2 𝑡 , ⎪ 2 2 2 ⎩
𝑡 ≤ 𝑡0 , (20.8) 𝑡 > 𝑡0 ,
and 𝐴𝑈 𝐶𝑌 =
𝜆𝑔0 𝑡0 + 𝜆𝑋(0) 𝜆 = 𝐴𝑈 𝐶𝑋 . 𝜇𝛾 𝜇
(20.9)
It is directly related to AUC in the first compartment and the transport rates 𝜆 and 𝜇. The behavior of this model is shown in Figure 20.2. Values for both compartments determined by numerical integration are shown with the tissue compartment peak lagging behind the vascular compartment peak as one would expect. When the infusion stops, the concentration in the blood begins to decline abruptly, whereas the tissue curve shows
FIGURE 20.2 Time concentration curves and data from a two-compartment model. 𝑇0 = 10, 𝜇 = 0.1, 𝛾 = 0.1, 𝜆 = 0.1, and 𝑔0 = 1.0.
Piantadosi
Date: July 27, 2017
598
Time: 4:45 pm
ESTIMATING CLINICAL EFFECTS
an inflection point. Data points (with simulated error) that might be obtained from such a system are also shown in Figure 20.2. 20.2.3
PK Models Are Used By “Model Fitting”
It is common in actual DF studies to have data that consist of samples from only the vascular compartment taken at various time points during the uptake or infusion of drug and during its elimination. The model in the form of equations (20.6 and 20.8) can be fitted to such data to obtain estimates of the rate constants which are the clinical effects of interest. An example of this for a continuous intravenous infusion of the antitumor drug cyclophosphamide in four subjects is shown in Figure 20.3. For each study subject, the infusion was to last 60 minutes, although the exact time varied. The value of 𝑔0 was fixed at 10.0. Because of measurement error in the infusion time and dosage, it is sometimes helpful to estimate both 𝑡0 and 𝑔0 to improve model fit. The fitted curves show a good agreement with the measured values. The estimated parameter values are shown in Table 20.2, along with the estimated AUCs. Of course, the quality of inference about such a drug would be improved by examining the results from a larger number of study subjects.
FIGURE 20.3 Sample data and model fits to time–concentration values from four subjects in a DF trial. The triangles are observed serum levels and the solid lines are model fits.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
599
MIDDLE DEVELOPMENT STUDIES
TABLE 20.2 Estimated Rate Constants from Phase I Clinical Trial Data #
𝑡𝟎
𝜆̂
̂ 𝛾
𝜇̂
̂ 𝐀𝐔𝐂
1
85 60
3
50
4
50
0.015 (0.0014) 0.007 (0.0004) 0.012 (0.0068) 0.007 (0.0044)
0.042 (0.0040) 0.057 (0.0065) 0.196 (0.1027) 0.061 (0.0275)
53,475
2
0.284 (0.0436) 0.250 (0.0190) 0.782 (0.9200) 0.239 (0.1684)
78,876 40,251 63,509
Estimated standard errors are shown in parentheses.
20.3
MIDDLE DEVELOPMENT STUDIES
A common primary objective of safety and activity studies is to estimate the frequency of side effects and the probability of success in treating subjects with a new drug or combination. In oncology, the investigator is also usually interested in grading the toxicity seen and estimating overall length of survival. Often the outcome assessments for response and toxicity are dichotomous, involving yes–no variables. Meeting these types of objectives requires estimating absolute probabilities. 20.3.1
Mesothelioma Clinical Trial Example
To illustrate some clinically useful summaries of information from safety and activity trials, consider data of the kind shown in Table 20.3. The subjects on this (phase II) trial were all diagnosed with malignant mesothelioma, an uncommon lung tumor strongly related to asbestos exposure. Depending on the extent of disease at diagnosis, subjects underwent one of three types of surgery: biopsy, limited resection, or extrapleural TABLE 20.3 Data from Mesothelioma Middle Development Clinical Trial Age 60 59 51 73 74 39 46 71 69 49 69 72 ⋮
Sex
PS
Hist
Wtchg
Surg
PFS
Prog
Surv
Event
1 1 0 1 1 0 1 1 1 1 1 1 ⋮
1 0 0 1 0 0 1 0 0 0 0 0 ⋮
136 136 130 136 136 136 131 136 136 131 131 131 ⋮
1 2 1 1 2 1 1 1 1 1 1 1 ⋮
3 3 1 3 1 1 3 1 1 1 2 1 ⋮
394 1338 184 320 168 36 552 133 175 327 0 676 ⋮
1 0 1 0 0 1 1 1 1 0 0 1 ⋮
823 1338 270 320 168 247 694 316 725 327 0 963 ⋮
1 0 1 1 1 1 0 1 0 1 1 0 ⋮
PFS is progression free time; Surv is survival time. Prog and Event are censoring indicator variables for progression and death, respectively. Performance status (PS) is dichotomized as high versus low.
Piantadosi
Date: July 27, 2017
600
Time: 4:45 pm
ESTIMATING CLINICAL EFFECTS
pneumonectomy (EPP), a more extensive operative procedure [1316]. The goals of the trial were to determine the feasibility of performing EPP and to document the natural history of the disease. The complete data on 83 subjects are presented in Appendix A. Possibly important prognostic factors in patients with mesothelioma include sex, histologic subtype (hist), weight change at diagnosis (wtchg), performance status (ps), age, and type of surgery (surg). Disease progression is both an outcome of treatment and a potential predictor of survival. The progression and survival times are censored, meaning some subjects remained progression free or alive at the end of the study or cutoff date for analysis. 20.3.2
Summarize Risk for Dichotomous Factors
Estimating the overall proportion of subjects who progress is straightforward and will not be detailed here. Instead, we consider the relationships between disease progression as an intermediate outcome and the prognostic factors sex and performance status. In reality, progression is an event time, but will be treated as a dichotomous factor, temporarily, for simplicity. Both factors can be summarized in 2 × 2 tables (Table 20.4). The probabilities and odds of progression are shown in Tables 20.5 and 20.6. Here the 95% confidence limits on the proportions are based on the binomial distribution. For dichotomous factors like those in Table 20.4, we are sometimes interested in absolute probabilities (or proportions). However, we are often interested in estimates of risk ratios such as the relative risk or odds ratio. If 𝑝1 and 𝑝2 are the probabilities of events in two groups, the odds ratio, 𝜃, is estimated by 𝜃̂ =
𝑝2 𝑝1 𝑎𝑑 , ÷ = 1 − 𝑝1 1 − 𝑝2 𝑏𝑐
𝑎 𝑏 , 𝑝2 = 𝑏+𝑑 , and so on from the entries in a 2 × 2 table. For example, the where 𝑝1 = 𝑎+𝑐 odds ratio for progression in males versus females (Table 20.4) is (49 × 5)∕(15 × 14) = 1.17. This can also be seen by dividing the relevant odds in Table 20.5. For odds ratios,
TABLE 20.4 Progression by Sex and Performance Status (PS) for the Mesothelioma Trial Sex Progression No Yes
PS
Overall
Male
Female
0
1
20 63
15 49
5 14
10 42
10 21
TABLE 20.5 Probabilities and Odds of Progression by Sex for the Mesothelioma Trial Sex
Pr[progression] 95% CL Odds of progression
Overall
Male
Female
0.76 (0.653–0.846)
0.77 (0.643–0.862)
0.74 (0.488–0.909)
3.15
3.27
2.80
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
MIDDLE DEVELOPMENT STUDIES
601
TABLE 20.6 Probabilities and Odds of Progression by Performance Status for the Mesothelioma Trial PS 0
1
Pr[progression] 95% CL
0.81 (0.675–0.904)
0.68 (0.486–0.833)
Odds of progression
4.20
2.10
calculating confidence intervals on a log scale is relatively simple. An approximate confidence interval for the log odds ratio is √ ̂ ± 𝑍𝛼 × 1 + 1 + 1 + 1 , log{𝜃} 𝑎 𝑏 𝑐 𝑑 where 𝑍𝛼 is the point on normal distribution exceeded with probability 𝛼∕2 (e.g., for 𝛼 = 0.05, 𝑍𝛼 = 1.96). For the example above, this yields a confidence interval of [−1.02, 1.33] for the log odds ratio or [0.36, 3.85] for the odds ratio. Because the 95% confidence interval for the odds ratio includes 1.0, the male–female difference in risk of progression is not “statistically significant” using the conventional criterion. The odds ratio is a convenient and concise summary of risk data for dichotomous outcomes and arises naturally in relative risk regressions such as the logistic model. However, the odds ratio has limitations and should be applied and interpreted with thought, especially in circumstances where it is important to know the absolute risk. For example, if two risks are related by 𝑝′ = 𝑝 + 𝛿, where 𝛿 is the difference in absolute risks, the odds ratio satisfies (𝑝 + 𝛿) 𝑝 =𝜃 1−𝑝 1 − (𝑝 + 𝛿) or 𝛿=
𝑝(𝜃 − 1)(1 − 𝑝) . 1 + 𝑝(𝜃 − 1)
This means that many values of 𝑝 and 𝜃 are consistent with the same difference in risk. For example, all of the following (𝑝, 𝜃) pairs are consistent with an absolute difference in risk of 𝛿 = −0.2: (0.30, 3.86), (0.45, 2.45), (0.70, 2.33), and (0.25, 6.33). As useful as the odds ratio is, it obscures differences in absolute risks that may be biologically important for the question at hand. Similar comments relate to hazard ratios (discussed below). 20.3.3
Nonparametric Estimates of Survival Are Robust
Survival data are unique because of censoring. Censoring means that some individuals have not had the event of interest at the time the data are analyzed or the study is over. For these people we know only that they were at risk for a measured period of time and that the event of interest has yet to happen. Thus their event time is censored. Using the incomplete information from censored observations requires some special methods that give “survival analysis” its statistical niche. The basic consequence of censoring is that
Piantadosi
Date: July 27, 2017
602
Time: 4:45 pm
ESTIMATING CLINICAL EFFECTS
we summarize the data using cumulative probability distributions rather than probability density functions, which are so frequently used in other circumstances. In many clinical trials involving subjects with serious illnesses like cancer, a primary clinical focus is the overall survival experience of the cohort. There are several ways to summarize the survival experience of a cohort. However, one or two frequently used methods make few, if any, assumptions about the probability distribution of the failure times. These “nonparametric” methods are widely used because they are robust and simple to employ. See Kleinbaum [853] for a review of methods. For censored failure time data, the commonest analytic technique is the lifetable. Events (failures or deaths) and times at which they occur are grouped into convenient intervals (e.g., months or years) and the probability of failure during each interval is calculated. Here I review the product limit method [821] for estimating survival probabilities with individual failure times. It is essentially the same as lifetable methods for grouped data. To avoid grouping of events into (arbitrary) intervals or to handle small cohorts, individual failure times are used. The method is illustrated in Table 20.7 using the middle development mesothelioma trial data introduced above. First, the observed event or failure times are ranked from shortest to longest. At each event time, indexed by 𝑖, the number of subjects failing is denoted by 𝑑𝑖 and the number of subjects at risk just prior to the event is denoted by 𝑛𝑖 . By convention, censoring times that are tied with failure times
TABLE 20.7 Product Limit Estimates of Survival for Data from a Mesothelioma Clinical Trial (All Patients) Event Time 𝑡𝑖
Number of Events 𝑛𝑖
Number Alive 𝑑𝑖
Survival Probability ̂ 𝑆(𝑡 𝑖)
Failure Probability 𝑝𝑖
Survival Std. Err.
0.0 0.0 4.0 6.0 17.0 20.0 22.0 28.0 ⋮ 764.0 823.0 948.0 963.0 1029.0 1074.0 1093.0 1102.0 1123.0 1170.0 1229.0 1265.0 1338.0
1 1 1 1 1 1 1 1 ⋮ 1 1 1 0 0 0 0 0 0 0 1 1 0
83 82 81 80 79 78 77 76 ⋮ 12 11 10 9 8 7 6 5 4 3 2 1 0
1.0000 0.9880 0.9759 0.9639 0.9518 0.9398 0.9277 0.9157 ⋮ 0.2081 0.1908 0.1734 . . . . . . . 0.1156 0.0578 .
0 0.0120 0.0241 0.0361 0.0482 0.0602 0.0723 0.0843 ⋮ 0.7919 0.8092 0.8266 . . . . . . . 0.8844 0.9422 .
0 0.0120 0.0168 0.0205 0.0235 0.0261 0.0284 0.0305 ⋮ 0.0473 0.0464 0.0453 . . . . . . . 0.0560 0.0496 .
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
MIDDLE DEVELOPMENT STUDIES
603
are assumed to rank lower in the list. Unless there are tied failure times, the number of events represented by each time will be 0 (for censorings) or 1 (for failures). For an arbitrary event time, the probability of failure in the interval from the last failure time is 𝑝𝑖 =
𝑑𝑖 . 𝑛𝑖
The probability of surviving the interval is 1 − 𝑝𝑖 = 1 − probability of surviving all earlier intervals up to the ̂ 𝑆(𝑡 𝑘) =
𝑘−1 ∏ 𝑖=0
( 1−
𝑑𝑖 𝑛𝑖
) =
∏ 𝑡𝑖
𝑘th (
𝑑𝑖 . 𝑛𝑖
Therefore, the cumulative
failure time is
𝑛𝑖 − 𝑑 𝑖 𝑛𝑖
) ,
(20.10)
where the product is taken over all distinct event times. This calculation is carried through a few observations in Table 20.7. For censored events, the previous product will be multiplied by 1 and, in the absence of an event, the survival estimate remains constant, giving the curve its characteristic stepfunction appearance. Although more complicated to derive, the variance of the product limit estimator is [643] ̂ ̂ var{𝑆(𝑡 𝑘 )} = 𝑆(𝑡𝑘 )
2
𝑘−1 ∑ 𝑖=0
𝑑𝑖 . 𝑛𝑖 (𝑛𝑖 − 𝑑𝑖 )
(20.11)
The square roots of these numbers are shown in the last column of Table 20.7. Usually one ̂ versus time to obtain familiar “survival curves,” which can be done separately plots 𝑆(𝑡) for two or more groups to facilitate comparisons. For the mesothelioma data, such curves are shown in Figure 20.4. Investigators are frequently interested in the estimated probability of survival at a fixed time, which can be determined from the calculations sketched above. For example, the probability of surviving (or remaining event free) at 1 year is approximately 0.50 (Fig. 20.4). When this estimate is based on a lifetable or product limit calculation, it is often called “actuarial survival.” Sometimes clinicians discuss “actual survival,” a vague and inconsistently used term. It usually means the raw proportion surviving, for example, 25 = 0.5 or 50% at 1 year in the data above. 50 20.3.4
Parametric (Exponential) Summaries of Survival Are Efficient
To discuss nonparametric estimates of survival quantitatively, it is necessary to agree on a reference point in time or have the entire survival curve available. This inconvenience can often be avoided by using a parametric summary of the data, for example, by calculating the overall failure rate or hazard. If we assume the failure rate is constant over time (i.e., the failure times arise from an exponential distribution), the hazard can be estimated by 𝑑 𝜆̂ = ∑𝑁
𝑖=1 𝑡𝑖
,
(20.12)
Piantadosi
Date: July 27, 2017
604
Time: 4:45 pm
ESTIMATING CLINICAL EFFECTS
FIGURE 20.4 Nonparametric estimates of survival for a middle development clinical trial in subjects with mesothelioma.
where 𝑑 is the total number of failures and the denominator is the total follow-up or exposure time in the cohort. This estimate of the hazard was introduced in Chapter 16. It is the event rate per person-time (e.g., person-year or person-month) of exposure and summarizes the entire survival experience. More complicated procedures are necessary if the hazard is not constant over time or such an assumption is not helpful. Because 2𝑑𝜆∕𝜆̂ has a chi-square distribution with 2𝑑 degrees of freedom [669] a 100(1 − 𝛼)% confidence interval for 𝜆 is ̂ 2 𝜆𝜒 2𝑑,1−𝛼∕2 2𝑑
0), reducing the variance of 𝛿 𝐴 . The same applies to 𝛿 𝐵 . Example Using the gain-score approach on the data from the FAP clinical trial, the average difference in the number of polyps between baseline and 12 months were 26.3 and −18.7 in the placebo and treatment groups, respectively. That is, on placebo, the number of polyps tended to increase, whereas treatment with sulindac decreased the number of polyps. This difference was statistically significant (𝑝 = 0.03). For polyp size, the average change on placebo was −0.19 and on treatment was −1.52. This difference was also statistically significant (𝑝 = 0.05). Analysis of Covariance An alternative approach using baseline information, which is probably better than a gainscore analysis [882] is an analysis of covariance (ANCOVA). This method employs the baseline measurements as covariates, which are used to adjust the end of treatment values using a linear regression model. In a simple case the model takes the form 𝑌𝑖2 = 𝛽0 + 𝛽1 𝑌𝑖1 + 𝛽2 𝑇𝑖 + 𝜖𝑖 ,
(20.15)
where 𝑇𝑖 is the treatment indicator variable for the 𝑖th person. 𝛽2 is the treatment effect and the focus of interest for the analysis. One can show that the relative efficiency of the gain-score versus the linear model, as measured by the ratio of variances, is 2∕(1 + 𝜌) in favor of the ANCOVA [163]. In any case, proper use of additional measurements on the study subjects can improve the precision or efficiency of the trial. Example An analysis of covariance for the FAP clinical trial is shown in Table 20.10. The linear model used is as given above with the variables coded, as in Table 20.9. The interpretation of the parameter estimates depends on the coding and on which terms are included in the model. Here 𝛽0 is the sulindac effect, 𝛽1 is the effect of the baseline measurement, and 𝛽2 is the difference between the treatments. For polyp number the first model is equivalent to the t-test above: the average number of polyps at 12 months is 13.0 on sulindac and the average difference between the treatments is 64.9 polyps (𝑝 = 0.15). The second model shows that the number of polyps at 12 months depends strongly on the number at baseline (𝑝 = 0.0001) and that the difference between the treatments after accounting for the baseline number of polyps is 43.2, which is significant (𝑝 = 0.04). The results for polyp size are different, showing that the average polyp size at 12 months in the sulindac group is 1.83 and the difference in size of 1.28 attributable to the treatment is significant (𝑝 = 0.02). Also the effect of treatment on polyp size does not depend on the baseline size. 20.4.4
Comparing Counts
Summarizing comparative trials can get us very quickly into the weeds. Powerful analytic techniques are impressive and exciting, and contribute to knowledge, discussion, and sometimes even employment. Many times, however, we can rely on the good design
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
RANDOMIZED COMPARATIVE TRIALS
611
TABLE 20.10 Analyses of Covariance for the Familial Adenomatous Polyposis Clinical Trial Dependent Variable
Model Terms
Parameter Estimate
Standard Error
Polyp number
𝛽0 𝛽2
13.0 64.9
30.8 42.5
– 0.15
Polyp number
𝛽0 𝛽1 𝛽2
−21.5 1.1 43.2
14.2 0.1 18.9
– 0.0001 0.04
Polyp size
𝛽0 𝛽2
1.83 1.28
0.37 0.51
– 0.02
Polyp size
𝛽0 𝛽1 𝛽2
1.13 0.21 1.29
0.90 0.24 0.51
– 0.40 0.02
P-Value
All dependent variables are measured at 12 months.
of a trial to permit very simple data summaries or analyses to capture virtually 100% of the information regarding a single outcome. It is important especially for clinical investigators to understand why, how, and when this is appropriate if for no other reason than to yield an intuitive check on results from more complex calculations. One circumstance where this occurs is in 1:1 randomized designs with events or frequency counts as outcomes. In Section 20.1.3, I explained how risk estimates are natural and useful in summarizing therapeutic effects. Risk is invariably estimated by the ratio of a numerator event or frequency count divided by a denominator summary of exposure time, population size, or other aggregate. The trial design may equalize (or nearly so) the denominators in the treatment groups, leaving differences to manifest themselves only in the numerators. For example, the failure rate, risk of failure, or hazard rate in a cohort is given by equation (20.12). The ratio of such risks between cohorts, known as the hazard ratio, is the preferred comparator of treatment effects for event-time studies. If the cohorts being compared have equal exposure times (denominators) as might be expected in a 1:1 randomized trial, the hazard ratio will be simply the ratio of event counts. This permits a quick estimate of the critical ratio from simple summaries of the trial. Unfortunately, a quick estimate of the variability of the hazard ratio is not as easy, so it may not be possible to do the corresponding statistical test in one’s head. However, Pocock [1215] has pointed out that in this same circumstance a quick formal test can be performed using only the numerators. The test relies on two additional assumptions. One is that the events arise according to a Poisson distribution, and the second is that difference between counts roughly follows a normal distribution. Including the idea that the denominators are approximately equal, we have now three assumptions into a quick but formal test. Luckily these are all fairly reasonable. Rather than a ratio, we can also compare treatments 𝐴 and 𝐵 by the difference of counts, 𝑑𝐴 − 𝑑𝐵 . If these are Poisson counts, the respective variances are estimated also by the counts, and the variance of the difference is the sum of the variances, 𝑑𝐴 + 𝑑𝐵 . A
Piantadosi
Date: July 27, 2017
612
Time: 4:45 pm
ESTIMATING CLINICAL EFFECTS
test statistic with an approximate normal distribution is therefore 𝑑 − 𝑑𝐵 𝑍 = √𝐴 . 𝑑𝐴 + 𝑑𝐵 Values of 𝑍 more extreme than about ±2 suggest 𝑝-values less than 0.05. Values exceeding 3 or 4 indicate strong 𝑝-values. This rapid simple path to a 𝑝-value for some studies should not be taken as too strong an endorsement. This is the formula for McNemar’s test [1004], classically used for assessment of matched pairs. Ideal pairing occurs when comparing two diagnostic tests performed in the same individuals, for example, in which case 𝑑𝐴 and 𝑑𝐵 would represent the number of +∕− and −∕+ discordant tests in the cohort. As an example, in the lung cancer chemotherapy trial from above, there were 54 and 70 recurrence events in the RT+CAP versus RT groups, respectively. The corresponding simple test statistic is 70 − 54 𝑍=√ = 1.44. 70 + 54 This suggest a statistical significance worse than 5%, which does not agree with the more refined analysis below.
20.4.5
Nonparametric Survival Comparisons
Above, I discussed the utility of nonparametric estimates of event time distributions. These are useful summaries because they are relatively simple to calculate, can be generalized to more than one group, can be applied to interval grouped data, and require no assumptions about the distribution giving rise to the data. Methods that share some of these properties are widely used to compare event time distributions in the presence of censored data. One of the most common methods is the logrank statistic [985]. To understand the workings of this statistic, consider a simple two-group (𝐴 versus 𝐵) comparison of the type that might arise in a randomized clinical trial. As in the product limit method discussed above, the data are sorted by the event time from smallest to largest. At each failure time, a 2 × 2 table can be formed: Status: Event No Event Group:
A
𝑑𝑖𝐴
𝑛𝑖𝐴 − 𝑑𝑖𝐴
B
𝑑𝑖𝐵
𝑛𝑖𝐵 − 𝑑𝑖𝐵
where 𝑖 indexes the failure times, the 𝑑’s represent the numbers of events in the groups, the 𝑛’s represent the number of subjects at risk, and either 𝑑𝑖𝐴 or 𝑑𝑖𝐵 is 1 and the other is 0 if all of the event times are unique. These tables can be combined over all failure times in the same way that 2 × 2 tables are combined across strata in some epidemiologic studies. In particular, we can calculate an overall “observed minus expected” statistic for
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
RANDOMIZED COMPARATIVE TRIALS
613
group 𝐴 (or 𝐵) as a test of the null hypothesis of equal event rates in the groups. This yields 𝑂𝐴 − 𝐸𝐴 =
𝑁 ∑ 𝑛𝑖𝐴 𝑑𝑖𝐵 − 𝑛𝑖𝐵 𝑑𝑖𝐴 𝑖=1
𝑛𝑖
,
(20.16)
where 𝑛𝑖 = 𝑛𝑖𝐴 + 𝑛𝑖𝐵 and 𝑑𝑖 = 𝑑𝑖𝐴 + 𝑑𝑖𝐵 . The variance can be shown to be 𝑉𝐴 =
𝑁 ∑ 𝑑𝑖 (𝑛𝑖 − 𝑑𝑖 )𝑛𝑖𝐴 𝑛𝑖𝐵
(𝑛𝑖 − 1)𝑛2𝑖
𝑖=1
.
(20.17)
Then the test statistic can be calculated as 𝑍=
𝑂𝐴 − 𝐸𝐴 . √ 𝑉𝐴
Z will have a standard normal distribution under the null hypothesis. Example For the lung cancer clinical trial introduced above, investigators were interested in testing the difference in disease recurrence rates and survival on the two treatment arms (Fig. 20.6 and 20.7). When eligible subjects are analyzed, the logrank statistic for survival calculated from equations (20.16 and 20.17) is 1.29 (𝑝 = 0.26) and for recurrence is 9.18 (𝑝 = 0.002).
FIGURE 20.6
Survival by treatment group during a lung cancer clinical trial.
Piantadosi
Date: July 27, 2017
614
Time: 4:45 pm
ESTIMATING CLINICAL EFFECTS
FIGURE 20.7
Disease free survival by treatment group during a lung cancer clinical trial.
20.4.6 Risk (Hazard) Ratios and Confidence Intervals Are Clinically Useful Data Summaries Although nonparametric methods of describing and comparing event time distributions yield robust ways of assessing statistical significance, they may not provide a concise clinically interpretable summary of treatment effects. For example, product-limit estimates of event times require us to view the entire recurrence curves in Figure 20.7 to have a sense of the magnitude of benefit from CAP. The problem of how to express clinical differences in event times concisely can be lessened by using hazard ratios (and confidence intervals) as summaries. These were introduced in the power and sample size equations in Chapter 16. The ratio of hazards between the treatment groups can be considered a partially parametric summary because it shares characteristics of both parametric and nonparametric statistics. The hazard ratio is usually assumed to be constant over the course of follow-up, which is a parametric assumption. However, the ratio does not depend on the actual magnitude of the event times, only on their ranking. This is typical of nonparametric methods. The hazard ratio is a useful descriptor because it summarizes the magnitude of the treatment difference in a single number. Hazard ratios that deviate from 1.0 indicate increasing or decreasing risk depending on the numerical coding of the variable. It is also relatively easy to specify the precision of the hazard ratio using confidence intervals. Assuming a constant hazard ratio is likely to be at least approximately correct for the period of observation, even in many situations where the hazards are changing with time. In other words, the ratio may remain constant even though the baseline risk fluctuates. A fixed ratio is often useful, even when it is not constant over time. Furthermore, the ratio has an interpretation in terms of relative risk and is connected to the odds ratio
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
RANDOMIZED COMPARATIVE TRIALS
615
in fundamental ways. Finally, the effects of both categorical and continuous prognostic factors can usually be expressed in the form of a hazard ratio, making it widely applicable. Confidence intervals are probability statements about an estimate and not about the true treatment effect. For example, suppose that the true hazard ratio has exactly the value we have estimated in our study. Then a 95% confidence interval indicates the region in which 95% of hazard ratio estimates would fall if we repeated the experiment. Informally, a confidence interval is a region in which we are confident that a true treatment effect or difference lies. Although incorrect, this notion is not too misleading. The value of confidence intervals is that they convey a sense of the precision with which an effect is estimated. As indicated in Chapter 16, the estimated hazard from exponentially distributed event times has a chi-square distribution with 2𝑑 degrees of freedom. A ratio of chi-square random variables, (the hazard ratio) has an 𝐹 distribution with 2𝑑1 and 2𝑑2 degrees of freedom [314]. Therefore, a 100(1 − 𝛼)% confidence interval for Δ = 𝜆1 ∕𝜆2 is ̂ 2𝑑 ,2𝑑 ,𝛼∕2 . ̂ 2𝑑 ,2𝑑 ,1−𝛼∕2 < Δ < Δ𝐹 Δ𝐹 1 2 1 2 Here again, the calculations can be made more simple by using the approximate normality of log(Δ). Example In the CAP lung cancer clinical trial we can summarize the survival difference by saying that the hazard ratio is 1.22 with a 95% confidence interval of 0.87–1.70. The fact that the hazard ratio is near 1.0 tells us that the treatment offers little overall benefit on survival. The fact that this confidence interval includes 1.0 tells us that, even accounting for the minimal improvement in survival, the difference is not statistically significant at conventional levels. The hazard ratio for disease free survival is 1.72 with 95% confidence interval 1.21–2.47. Thus, the benefit of CAP for recurrence is clinically sizeable and statistically significant at conventional levels. Because a hazard ratio is a single number that summarizes a longitudinal experience, it makes sense only when it is relatively constant over time. It is not difficult to assess the hazard ratio using logarithmic plots. When it is found to be inconstant, alternative simple summaries of group differences might be appropriate. These include differences or ratios of survival probabilities at fixed times (vertical differences between event curves), or differences or ratios of median or other quantile event times (horizontal differences between event curves). Restricted mean survival times, or restricted mean time lost between treatment groups can also be used. These alternatives are discussed in Uno [1498]. 20.4.7
Statistical Models Are Necessary Tools
Statistical models are extremely helpful devices for making estimates of treatment effects, testing hypotheses about those effects, and studying the simultaneous influence of covariates on outcome (Section 21.3). All models make assumptions about the data. Commonly used survival or relative risk models can be parametric, in which case they assume that some specific distribution is the source of the event times, or partially parametric, in which case some assumptions are relaxed. Models cannot be totally nonparametric because this would be something of an oxymoron.
Piantadosi
Date: July 27, 2017
616
Time: 4:45 pm
ESTIMATING CLINICAL EFFECTS
Here we view a small set of models principally as devices to facilitate estimating and comparing hazard ratios. The proportional hazards (PH) model [316] is probably the most well-known and most useful device. The essential feature of the model is that time, 𝑡, and the covariate vector (predictor variables), 𝐗, enter the hazard function, 𝜆(𝑡; 𝐗), in a way that conveniently factors, 𝜆(𝑡; 𝐗) = 𝜆0 (𝑡)𝑒𝛽𝐗 , where 𝜆0 (𝑡) is the baseline hazard and 𝛽 is a vector of regression coefficients to be estimated from the data. In other words, for an individual characterized by 𝐗, the ratio of their hazard to the baseline hazard is 𝑒𝛽𝐗 . We could write this relationship as { } 𝜆(𝑡) 𝑙𝑜𝑔 = 𝛽𝐗 = 𝛽1 𝑋1 + 𝛽2 𝑋2 + … 𝜆0 (𝑡) to reveal its similarity to other covariate models. Thus, 𝛽 is a vector of log hazard ratios. The model assumes that the hazard ratio is constant over time, and the covariates are also assumed to be constant over time. The effect of covariates is to multiply the baseline hazard, even for 𝑡 = 0. Estimating 𝛽 is technically complex and not the subject of primary interest here. Computer programs for parameter estimation are widely available, enabling us to focus directly on the results. One advantage of the PH model is that the estimation can be stratified on factors that we have no need to model. For example, we could account for the effects of a risk factor by defining strata based on its levels, estimating the treatment effect separately within each level of the factor, and pooling the estimated hazard ratios over all strata. In this way we can account for the effects of the factor without having to assume that its hazard is proportional because it is not entered into the model as a covariate. When interpreting 𝛽, we must remember that it represents the hazard ratio per unit change in the predictor variable. If the predictor is a dichotomous factor such as an indicator variable for treatment group, then a unit change in the variable simply compares the groups. However, if the variable is measured on a continuous scale such as age, then the estimated hazard ratio is per year of age (or other unit of measurement). For variables that are measured on a continuous scale, the hazard ratio associated with an 𝑛-unit change is Δ𝑛 , where Δ is the hazard ratio. For example, if age yields a hazard ratio of 1.02 per year increase, then a 10-year increase will have a hazard ratio of 1.0210 = 1.22. For the CAP lung cancer trial, estimated hazard ratios and 95% confidence limits for predictor variables calculated from the PH model are shown in Table 20.11. Because of differences in the exact method of calculation, these might be slightly different from estimates that could be obtained by the methods outlined earlier in this chapter.
20.5
PROBLEMS WITH P-VALUES
A discussion of estimation and analysis methods for clinical trials would not be complete without an appropriate critique of p-values. A p-value is a probability statement assuming (i) the experiment is repeated an arbitrarily large number of times, and (ii) the null hypothesis is true. Under these conditions, the p-value gives the probability of observing a result the same size as, or larger than, the one actually found. Small p-values indicate the
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
PROBLEMS WITH P-VALUES
617
TABLE 20.11 Estimated Hazard Ratios from the Proportional Hazards Model for the CAP Lung Cancer Trial Variable
Hazard Ratio
95% Confidence Bounds
P-Value
Survival Results Treat=“2” Cell type=“2” Karn=“2” T=“2” T=“3” N=“1” N=“2” Age Sex=“1” Wtloss=“1” Race=“1”
1.22 1.28 0.84 0.94 0.94 1.09 1.26 1.00 1.09 1.09 1.21
0.867–1.70 0.907–1.81 0.505–1.40 0.558–1.57 0.542–1.63 0.542–2.20 0.691–2.30 0.984–1.02 0.745–1.58 0.602–1.98 0.734–1.98
0.26 0.16 0.51 0.80 0.82 0.81 0.45 0.73 0.67 0.78 0.46
Recurrence Results Treat=“2” Cell type=“2” Karn=“2” T=“2” T=“3” N=“1” N=“2” Age Sex=“1” Wtloss=“1” Race=“1”
1.73 1.68 0.72 0.99 0.89 1.06 1.30 1.00 0.93 1.48 0.89
1.21–2.47 1.16–2.43 0.433–1.21 0.576–1.71 0.495–1.60 0.499–2.26 0.674–2.49 0.980–1.02 0.629–1.37 0.814–2.67 0.542–1.48
0.003 0.006 0.22 0.98 0.70 0.88 0.44 0.96 0.71 0.20 0.67
result is unlikely in the infinite population of such experiments, causing us to disbelieve the truth of the null hypothesis. The p-value is not a probability statement regarding the observed treatment effect. P-values are used in two related circumstances. One is for hypothesis tests specified prior to the experiment where a p-value below a preset threshold such as 5% triggers rejection of the null hypothesis. A second use of p-values is in significance tests where observed differences or effects are tested during data analysis for consistency with the null. Findings inconsistent with the null are dubbed “statistically significant” and are often labeled with the p-value. Many such tests are not specified in advance of the analysis. I will not discuss important statistical issues or controversies surrounding differences in these two uses of p-values. Many investigators do not know the origins of the 5% significance level that is both a tool and tyranny today. [471] book on experimental design was influential in setting this reference point. In the early twentieth century, psychical research (today it might be termed extrasensory perception) was a topic of serious structured investigation. The question of significance level was addressed by Fisher in that context: An observation is judged significant, if it would rarely have been produced, in the absence of a real cause of the kind we are seeking. It is a common practice to judge a result significant, if it is of such a magnitude that it would have been produced by chance not more frequently
Piantadosi
Date: July 27, 2017
618
Time: 4:45 pm
ESTIMATING CLINICAL EFFECTS
than once in twenty trials. This is an arbitrary, but convenient, level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once in every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained. He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation [472].
Here, the implication is that the test informs us what can be ignored, but acceptance requires more investigation. Unfortunately, modern behavior is not in accord with that perspective. This somewhat informal perspective has contributed to the problem we face with p-values today. 20.5.1
P-Values Do Not Represent Treatment Effects
A common error in the use and interpretation of p-values is when investigators use them to replace estimated treatment effects. This is common in casual discussions of trial results. Heuristically, a significance level is determined from the ratio 𝑍=√
Treatment effect Var(Treatment effect)
,
after which the tail area from an appropriate probability distribution for 𝑍 is taken as the p-value. We can see that in the 𝑍-score that the biologically important treatment effect is inextricably combined with its estimated variance, which is a consequence only of the size of the experiment. The 𝑍 value is then further transformed into a distributional tail area. A small p-value does not tell us whether the treatment effect is large or the experiment is large, each with very different implications. This makes it clear that p-values obscure treatment effects rather than represent them. Presenting the treatment effect jointly with its precision and a p-value is optimal—the latter actually then being redundant. Such matters are discussed in Chapter 25. 20.5.2
P-Values Do Not Imply Reproducibility
The notion that a valid finding should be independently reproducible is at the heart of science. We might anticipate that a small p-value in one experiment would imply the likelihood of a similar finding in a second such trial. Unfortunately, this is not the case. Suppose a clinical trial has produced a p-value of 0.05. Investigators are content because the significance level suggests that the treatment effect is not null, and they will attract some attention. If the clinical trial were repeated exactly, how likely is the new p-value to be smaller than 0.05? To answer this question, assume that the estimated ̂ is exactly correct. Then in any new trial the treatment effect from the first trial, 𝜃, estimated treatment effect will be a sample from an “alternative” distribution centered ̂ We can reasonably assume this distribution is normal. exactly at 𝜃. Now the question has virtually answered itself. A new trial will yield a treatment effect ̂ Half the samples will be larger and estimate sampled from a distribution centered at 𝜃. ̂ half smaller than 𝜃. Thus, half the resulting p-values will be smaller than 0.05 and half
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
PROBLEMS WITH P-VALUES
619
larger—the probability of replication is only 50%. Using the same framework, the initial p-value would have to be about 0.001 to yield a 90% chance for subsequent ones to be less than 0.05. Whatever the utility of the p-value, especially the dichotomy of significance at some arbitrary level, it cannot be considered a major purveyor of scientific information because of its poor replicative properties. 20.5.3
P-Values Do Not Measure Evidence
A framework for relative evidence based on the likelihood function has been discussed by Royall [1295, 1298]. There is no absolute measure of evidence, and p-values do not convey relative evidence because that role is restricted to the likelihood function. One of the weaknesses of p-values as even informal summaries of strength of evidence can be illustrated in a simple way using an example originated by Walter [1521]. Consider the following 2 × 2 tables summarizing binomial proportions: 𝐴 𝐴 𝐵 1 7 𝐵 13 7 and 𝐴 𝐴 𝐵 1 6. 𝐵 13 6 The column proportions are the same in both tables, although the first has more data and should provide stronger evidence than the second. For both tables we can compare the 1 proportions 14 versus 12 using Fisher’s exact test [14, 15]. Doing so yields two-sided pvalues of 0.33 and 0.26, respectively. In other words, the second table is more “statistically significant” than the first, even though it provides less evidence. This outcome is a result of discreteness and asymmetry. One should probably not make too much of it except to recognize that p-values do not measure strength of evidence. To illustrate the advantage of estimation and confidence intervals over p-values, consider the discussion over the prognostic effect of peri-operative blood transfusion in lung cancer [1201]. Several studies (not clinical trials) of this phenomenon have been performed because of firm evidence in other malignancies and diseases that blood transfusion has a clinically important immunosuppressive effect. Disagreement about results of various studies has stemmed, in part, from too strong an emphasis on hypothesis tests instead of focusing on the estimated risk ratios and confidence limits. Some study results are shown in Table 20.12. Although the authors of the various reports came to different qualitative conclusions about the risk of blood transfusion because of differing p-values, the estimated risk ratios, adjusted for extent of disease, appear to be consistent across studies. Based on these results, one might be justified in concluding that peri-operative blood transfusion has a modest adverse effect on lung cancer patients. Interestingly, a randomized trial of autologous versus allogeneic blood transfusion in colorectal cancer was reported [694]. It showed a 3.5-fold increased risk attributable to the use of allo-
Piantadosi
Date: July 27, 2017
620
Time: 4:45 pm
ESTIMATING CLINICAL EFFECTS
TABLE 20.12 Summary of Studies Examining the Peri-Operative Effect of Blood Transfusion in Lung Cancer Study
Endpoint
Hazard Ratio
95% Confidence Limits
Tartter et al., 1984
Survival
1.99
1.09–3.64
Hyman et al., 1985
Survival
1.25
1.04–1.49
Pena et al., 1992
Survival
1.30
0.80–2.20
Keller et al., 1988
Recurrence Stage I Stage II
1.24 1.92
0.67–1.81 0.28–3.57
Survival Recurrence
1.57 1.40
1.14–2.16 1.01–1.94
Moores et al., 1989
All hazard ratios are transfused versus untransfused subjects and are adjusted for extent of disease.
geneic blood transfusion with a p-value of 0.10. Additional evidence for the risk of blood transfusion can be seen in liver cancer [199, 252, 946].
20.6 20.6.1
STRENGTH OF EVIDENCE THROUGH SUPPORT INTERVALS Support Intervals Are Based on the Likelihood Function
Strength of evidence can be measured using the likelihood function [404, 405, 1298]. We can obtain a measure of the relative strength of evidence in favor of our best estimate versus a hypothetical value using the ratio of the likelihoods evaluated with each parameter value. Support intervals are quantitative regions based on likelihood ratios that quantify the strength of evidence in favor of particular values of a parameter of interest. They summarize parameter values consistent with the evidence without using hypothesis tests and confidence intervals. Because they are based on a likelihood function, support intervals are conditional on, or assume, a particular model of the data. Like confidence intervals, they characterize a range of values that are consistent with the observed data. However, support intervals are based on values of the likelihood ratio rather than on control of the type I error. The likelihood function, 𝐿(𝜃|𝐱), depends on the observed data, 𝐱, and the parameter of the model, 𝜃 (which could be a vector of parameters). We can view the likelihood as a function of the unknown parameter, conditional on the observed data. Suppose that ̂ Usually, this will be the “maximum our best estimate of the unknown parameter is 𝜃. likelihood estimate.” We are interested in the set of all values of 𝜃 that are consistent ̂ according to criteria based on the likelihood ratio. We could, for example, say with 𝜃, that the data (evidence) support any 𝜃 for which the likelihood ratio relative to 𝜃̂ is less than 𝑅. When the likelihood ratio for some 𝜃 exceeds 𝑅, it is not supported by the data. ̂ An This defines a support interval or a range of values for 𝜃 which are consistent with 𝜃. example is given below.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
STRENGTH OF EVIDENCE THROUGH SUPPORT INTERVALS
20.6.2
621
Support Intervals Can Be Used with Any Outcome
Support intervals can be constructed from any likelihood using any outcome. If the purpose of our clinical trial is to estimate the hazard ratio on two treatments, we could employ a simple exponential failure time model. Let 𝑖 denote an arbitrary study subject. We define a binary covariate, 𝑋𝑖 , which equals 1 for treatment group 𝐴 and 0 for treatment group 𝐵. The survival function is 𝑆(𝑡𝑖 ) = 𝑒−𝜆𝑖 𝑡𝑖 and the hazard function, 𝜆𝑖 , is assumed to be constant. Usually, we model the hazard as a multiplicative function of covariates, 𝜆𝑖 = 𝑒𝛽0 +𝛽1 𝑋𝑖 (or 𝜆𝑖 = 𝑒−𝛽0 −𝛽1 𝑋𝑖 ), where 𝛽0 and 𝛽1 are parameters to be estimated from the data, that is 𝜃 ′ = {𝛽0 , 𝛽1 }. The hazard for a person on treatment 𝐴 is 𝜆𝐴 = 𝑒𝛽0 +𝛽1 and for a person on treatment 𝐵 is 𝜆𝐵 = 𝑒𝛽0 . The hazard ratio is Δ𝐴𝐵 = 𝑒𝛽0 +𝛽1 ∕𝑒𝛽0 = 𝑒𝛽1 . Thus, 𝛽1 is the log hazard ratio for the treatment effect of interest. (𝛽0 is a baseline log hazard, which is unimportant for the present purposes.) To account for censoring, we define an indicator variable 𝑍𝑖 that equals 1 if the 𝑖th person is observed to have an event and 0 if the 𝑖th person is censored. The exponential likelihood is 𝐿(𝛽0 , 𝛽1 ∣ 𝐱) =
𝑁 ∏ 𝑖=1
𝑍
𝑒−𝜆𝑖 𝑡𝑖 𝜆𝑖 𝑖 =
𝑁 ∏
𝛽 +𝛽1 𝑋𝑖
𝑒−𝑡𝑖 𝑒 0
(𝑒𝛽0 +𝛽1 𝑋𝑖 )𝑍𝑖 .
𝑖=1
If 𝛽̂0 and 𝛽̂1 are the MLEs for 𝛽0 and 𝛽1 , a support interval for 𝛽̂1 is defined by the values of 𝜁 that satisfy ∏𝑁 −𝑡( 𝑒𝛽̂0 +𝛽̂1 𝑋𝑖 ) 𝛽̂ +𝛽̂ 𝑋 𝑍 𝑒 𝑖 (𝑒 0 1 𝑖 ) 𝑖 𝑅 > ∏𝑖=1 (20.18) ̂ 𝑁 −𝑡𝑖 (𝑒𝛽0 +𝜁 𝑋𝑖 ) (𝑒𝛽̂0 +𝜁𝑋𝑖 )𝑍𝑖 𝑖=1 𝑒 ( )𝑍𝑖 ∑𝑁 𝑁 𝛽̂ +𝛽̂ 𝑋 ̂ 𝑒− 𝑖=1 𝑡(𝑖 𝑒 0 1 𝑖 ) ∏ 𝑒𝛽1 𝑋𝑖 = ∑𝑁 𝜁𝑋 𝛽̂ +𝜁 𝑋 𝑒− 𝑖=1 𝑡(𝑖 𝑒 0 𝑖 ) 𝑖=1 𝑒 𝑖 = 𝑒−
∑𝑁
𝛽̂0 +𝛽̂1 𝑋𝑖 −𝑒𝛽̂0 +𝜁 𝑋𝑖 ) 𝑖=1 𝑡𝑖 (𝑒
𝑁 ∏ ̂ (𝑒(𝛽 1 −𝜁)𝑋𝑖 )𝑍𝑖 . 𝑖=1
Example Reconsider the comparative lung cancer trial for which the estimated hazard ratio for disease-free survival favored treatment with CAP chemotherapy. The log hazard ratio estimated using the exponential model is (𝛽̂1 =) 0.589 in favor of the CAP group. With some software for fitting this model, it may be necessary to re code the treatment group indicator variable as 0 versus 1, rather than 1 versus 2. Note that 𝑒0.589 = 1.80, which is similar to the hazard ratio estimated above from the PH model. The estimated baseline hazard depends on the time scale and is unimportant for the present purposes. However, from the exponential model, 𝛽̂0 = −6.92. Using these parameter estimates and applying equation (20.18) to the data, we can calculate support for different values of 𝛽1 (Fig. 20.8). The vertical axis is the likelihood ratio relative to the MLE and the horizontal axis is the value of 𝛽1 . For these data, values of 𝛽1 between 0.322 and 0.835 fall within an interval defined by 𝑅 = 10 and are thus strongly supported by the data. This interval corresponds to hazard ratios of 1.38–2.30.
Piantadosi
Date: July 27, 2017
622
Time: 4:45 pm
ESTIMATING CLINICAL EFFECTS
FIGURE 20.8 cancer trial.
Support interval for the estimated hazard ratio for disease free survival in a lung
The 95% confidence interval for 𝛽1 is 0.234–0.944, which corresponds approximately to a support interval with 𝑅 = 12. 20.7
SPECIAL METHODS OF ANALYSIS
Some biological questions, whether on the underlying disease, method of treatment, or the structure of the data, require an analytic plan that is more complicated than the examples given above. This tends to happen more in nonexperimental studies because investigators do not always have control over how the data are collected. In true experiments it is often possible to measure outcomes in such a way that simple analyses are sufficient. Even so, special methods of analysis may be needed to address specific clinical questions or goals of the trial. Some examples of situations that may require special or more sophisticated analytic methods are correlated responses, such as those from repeated measurements on the same individual over time or clustering of study subjects, pairing of responses such as event times or binary responses, covariates that change their values over time, measurement errors in independent variables, and accounting for restricted randomization schemes. Each of these can require generalizations of commonly used analytic methods to fully utilize the data. Sometimes the hardest part of being in these situations is recognizing that ordinary or simple approaches to an analysis are deficient in one important way or another. Even after recognizing the problem and a solution, computer software to carry out the analyses may not be readily available. All of these issues indicate the usefulness of consulting a statistical methodologist during study design and again early in the analysis.
Piantadosi
Date: July 27, 2017
Time: 4:45 pm
SPECIAL METHODS OF ANALYSIS
20.7.1
623
The Bootstrap Is Based on Resampling
Often, one of the most difficult tasks for the biostatistician is to determine how precise an estimate is. Stated more formally, determining the variance is usually more difficult than determining the estimate itself. For statistical models with estimation methods like maximum likelihood, there are fairly simple and reliable ways of calculating the approximate variance of an estimate. Sometimes, however, either the assumptions underlying the approximation are invalid or standard methods are not available. One example is placing confidence limits on the estimate of a median from a distribution. In situations like this, one simple and reliable way to approximate the variance is to use a resampling method called the bootstrap [409, 410]. In the bootstrap the observed data are resampled and the point estimate or other statistic of interest is calculated from each sample. The process is repeated a large number of times so that a distribution of possible point estimates is built up. This distribution, characterized by ordinary means, serves as a measure of the precision of the estimate. The sample at each step is taken “with replacement,” meaning any particular datum can be chosen more than once in the bootstrap sample. Like randomization distributions, there are a large number of possible bootstrap samples. For example, suppose that there are 𝑁 observed values in the data and each bootstrap sample consists of 𝑀 values. Then there are 𝑁 𝑀 possible bootstrap samples. Some of these may not be distinguishable because of duplicate observations or sample points. Nevertheless, there are a large number of samples in general, so complete enumeration of the sample space is not feasible. For this reason, the bootstrap distribution is usually approximated from a random sample of the 𝑁 𝑀 possibilities. A simple example will illustrate the procedure. Suppose that we have a sample of 100 event times and we are interested in placing confidence intervals on our estimate of the median failure time. The observed data plotted as a survival curve are shown in Figure 20.9. There are no censored observations in this example. Standard lifetable methods yield a median of 15.6 with a 95% confidence interval of (11.9–19.2). Samples of size 100 were taken from the event times with replacement, and the bootstrap median estimate was repeated 10,000 times. A frequency histogram of the estimated medians from the samples is shown in Figure 20.10. The mean value is 15.2, and 95% of the bootstrap median estimates fall between 11.9 and 19.2, agreeing with standard methods. Calculations such as these are greatly facilitated by computer programs dedicated to the purpose. Bootstrap methods can be used to validate modeling procedures. An example of its use in pharmacodynamic modeling is given by [1038]. The bootstrap is a very general method for approximating the variability in any derived value (statistic). A second illustration is the hypothetical trial from Table 20.13 with censored survival data. The observed hazard ratio is 1.7. Taking 2000 bootstrap samples yields the distribution of hazard ratio estimates in Figure 20.11.
20.7.2
Some Clinical Questions Require Other Special Methods of Analysis
A book such as this can do little more than mention a few of the important statistical methods that are occasionally required to help analyze data from clinical trials. Here I briefly discuss some special methods that are likely to be important to the student of clinical trials. Most of these represent areas of active biostatistical research. For many of
Piantadosi
Date: July 27, 2017
624
Time: 4:45 pm
ESTIMATING CLINICAL EFFECTS
FIGURE 20.9 Survival Curve for Bootstrap Example
the situations discussed below, there are no widely applicable guidelines about the best statistical summary of the data to answer clinical questions. This results from the lower frequency with which some of these situations arise, and from their greater complexity.

Longitudinal Measurements
Because most clinical trials involve or require observation of subjects over a period of time following treatment, investigators often record longitudinal assessments of outcomes, endpoints, and predictor variables. Using the additional information contained in longitudinal measurements can be difficult. For example, survival and disease progression are familiar longitudinal outcomes that require special methods of analysis. To the data analyst, the relationships between measurements taken in a longitudinal study are not all equivalent. For example, there is a difference between measurements taken within the same individual, which are correlated, and those taken from different individuals, which are usually uncorrelated or independent. Because most simple statistical approaches to analysis rely on the assumption of independent observations, coping with correlated measurements is an important methodologic issue. When the primary data analysis tool is the linear model, such as for analyses of variance or linear regressions, correlated measurements are analyzed according to "repeated measures" or "longitudinal data" models. These types of data, and the models used to analyze them, are increasingly common in studies of HIV and other diseases with repeated outcomes such as migraine, asthma, and seizure disorders. An in-depth discussion of statistical methods is given by Diggle, Liang, and Zeger [378, 379]. See also [1602], [937], and [1603] pertaining to the analysis of other correlated outcomes. A recent treatment of the subject is Fitzmaurice et al. [480].
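The practical consequence of ignoring within-subject correlation can be seen in a small simulation. The sketch below, with entirely hypothetical variance components, generates repeated measurements that share a subject effect and compares the naive standard error, computed as if all observations were independent, with the actual sampling variability of the overall mean.

```python
import numpy as np

rng = np.random.default_rng(7)

n_subjects, n_visits = 50, 4
sigma_subject, sigma_error = 1.0, 1.0   # hypothetical variance components

def one_study():
    subject_effect = rng.normal(0, sigma_subject, size=n_subjects)
    # each subject contributes n_visits correlated measurements
    return subject_effect[:, None] + rng.normal(0, sigma_error, (n_subjects, n_visits))

# Naive SE treats all n_subjects * n_visits values as independent
y = one_study()
naive_se = y.std(ddof=1) / np.sqrt(y.size)

# Empirical SE of the overall mean across many replicated "studies"
means = np.array([one_study().mean() for _ in range(2000)])
print(f"naive SE  = {naive_se:.3f}")
print(f"actual SE = {means.std(ddof=1):.3f}")   # larger, because observations are correlated
```

The naive calculation understates the uncertainty because correlated measurements carry less information than the same number of independent observations, which is exactly the problem that repeated-measures and longitudinal models address.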
FIGURE 20.10 Frequency of median estimates from bootstrap samples for the survival data in Figure 20.9.
Correlated Outcomes
Longitudinal measurement is not the only circumstance that gives rise to correlated observations. When the experimental unit is a cluster or group, outcomes may be correlated from individual to individual. This situation can arise if families are the experimental unit, as in some disease prevention trials. For example, consider a clinical trial to assess the efficacy of treatments to eliminate Helicobacter pylori in subjects living in endemic areas. Helicobacter pylori has a causal association with peptic ulcer disease. After treatment with antibiotics and/or bismuth, subjects may become reinfected because of environmental exposure or family contact. Therefore, it may be necessary to treat families as the experimental unit, in which case the outcomes between individuals in the same family will be correlated. In situations such as this, the analysis may need to employ methods to account for this dependency.

Other circumstances can lead to correlations, such as when individuals have more than one outcome. For example, suppose that recurrent pre-malignant or malignant lesions (e.g., skin cancers, bladder polyps, and colonic polyps) are the clinical outcome of interest. We might be interested in the interval between such events, recognizing that prolongation of the between-lesion time could be the sign of an effective secondary preventive agent. Thus, each individual on study could give rise to two or more event times correlated with each other because they arise from the same person. Correlated dichotomous outcomes can arise in a similar way.

Individual study subjects can yield more than one outcome as a consequence of the experiment design. This is the case for crossover designs (discussed in Chapter 23) where each study participant is intentionally given more than one treatment. Estimating
TABLE 20.13 Event Times From a Randomized Trial

Group A (Time, z):
(1.79, 0)  (9.02, 1)  (3.06, 1)  (6.17, 1)  (3.6, 1)   (9.09, 1)  (4.64, 1)  (0.17, 0)  (4.33, 1)  (4.69, 1)
(22.6, 1)  (2.9, 1)   (5.02, 1)  (0.69, 0)  (2.02, 1)  (3.28, 0)  (12.56, 1) (3.31, 1)  (0.39, 1)  (17.8, 1)
(12.31, 1) (11.24, 1) (7.44, 1)  (10.73, 1) (0.66, 1)  (9.14, 0)  (15.92, 1) (7.07, 1)  (22.95, 1) (1.68, 1)
(5.11, 1)  (5.12, 0)  (0.2, 1)   (18.31, 1) (11.34, 0) (10.31, 0) (3.21, 0)  (29.68, 1) (6.98, 1)  (0.14, 1)
(0.57, 0)  (1.85, 1)  (10.39, 0) (2.6, 0)   (0.06, 1)  (10.38, 0) (3.78, 1)

Group B (Time, z):
(4.03, 0)  (3.34, 1)  (7.17, 1)  (11.75, 1) (1.61, 1)  (1.66, 0)  (0.47, 1)  (4.53, 0)  (7.22, 1)  (0.17, 1)
(4.16, 0)  (11.53, 1) (6.51, 1)  (12.72, 1) (25, 1)    (12.13, 1) (7.87, 1)  (1.05, 1)  (3.52, 1)  (0.28, 1)
(4.5, 1)   (0.04, 1)  (10.5, 1)  (1.5, 1)   (1.07, 1)  (5.67, 1)  (8.37, 1)  (0.15, 1)  (0.48, 1)  (0.78, 1)
(8.48, 1)  (11.33, 0) (4.7, 1)   (2.75, 1)  (2.52, 0)  (8.9, 1)   (13.02, 0) (10.09, 1) (2.96, 1)  (1.37, 0)
(1.92, 1)  (3.91, 1)  (0.94, 1)  (3.35, 1)  (1.66, 1)  (3.55, 1)  (4.19, 1)  (0.24, 1)  (2.22, 1)  (9.37, 0)
(10.85, 1) (9.66, 1)  (2.64, 1)  (8.62, 1)  (0.39, 1)

z = 0 denotes a censored observation.
the difference in treatment effects within individuals while accounting for this dependency is the major issue in the use of crossover designs.

Time-Dependent Covariates
Most of the prognostic factors (covariates) measured in clinical trials are fixed at baseline (start of treatment) and do not change during follow-up. Examples are severity of disease at diagnosis, age at diagnosis, sex, race, treatment group assignment, and pathologic type or class of disease. The methods of accounting for the influence of such prognostic factors in statistical regression models typically assume that their effects are constant over time. Statistical models usually assume that the effect of treatment as a prognostic factor is immediate following administration and constant over the follow-up period. These assumptions are inadequate to describe the effects of all prognostic factors. First, disease intensity or factors associated with it can fluctuate following any clinical landmark such as the beginning of treatment. Second, long-term prognosis may be a direct consequence of disease intensity or other time-varying prognostic factors measured at an earlier time. Third, some time-varying prognostic factors may be associated with an outcome without being causal. In any case, we require statistical models that are flexible enough to account for the effects of predictor variables whose value changes
FIGURE 20.11 Histogram of bootstrap hazard ratio estimates for data in Table 20.13.
over time. These types of variables are called "time-dependent covariates" (TDC). TDCs are discussed in depth by Kalbfleisch and Prentice [813, 814], and others [469, 993].

It is helpful to distinguish between two basic types of TDCs. The first is external, implying that the factor can affect the individual's prognosis but does not carry information about the event time. External TDCs can be defined by a particular mechanism. An example is age, which is usually regarded as a fixed covariate (age at time 0) but may be considered as defined and time varying when prognoses change with the age of the study subject. TDCs can be ancillary, which indicates that they arise from a process not related to the individual. A second type of TDC is internal, a term that implies that a process within the individual, perhaps even the disease itself, gives rise to the prognostic factor. Extent of disease and response to treatment are examples of this. Cumulative dose of drug may also be an internal TDC if it increases because the subject survives longer.

Investigators should interpret the results of analyses employing TDCs with care. For internal TDCs in a treatment trial, the therapy may determine the value of the covariate. Adjusting on the TDC in such a case can "adjust away" the effect of treatment, making the prognostic factor appear significant and the treatment appear ineffective. This is almost never the correct description of the trial outcome.

Measurement Error
Another usual assumption of most statistical modeling methods is that the predictor variables are measured without error. Although such models explicitly incorporate the effects of random error, it is usually assumed to be associated with the response measurement rather than with the predictor variables. This paradigm is clearly not applicable in all circumstances. For example, suppose that we attempt to predict the occurrence of cancer from dietary factors, such as fat, calorie, and mineral content. It is likely that
the predictor variables will be subject to measurement error as a consequence of recall error and inaccuracies in converting from food substances to the components of interest. Accounting for errors in predictor variables complicates statistical models and is not routinely needed when analyzing clinical trials. Nonlinear models with predictor variables subject to measurement error are discussed by Carroll [233, 234].

Random versus Fixed Effects
Most widely used statistical models assume that the effects of predictor variables are fixed (nonrandom). Sometimes it makes more sense to model the influence of a predictor variable as a random effect. For example, in a multicenter study, the effect of interest may vary from institution to institution. Comparing two institutions may not be of interest. Testing the average treatment effect in different institutions also may not be of interest because of systematic differences. In such a situation investigators may be more interested in the relative size of the within- and between-institution variability. A random effects model regards the study centers as a random sample of all possible institutions and accounts for variation both within and among centers. When using linear models or analyses of variance, such models are also called variance components models. It will almost never be the case that centers participating in a trial can realistically be regarded as randomly chosen from a population of centers. However, this perspective may be useful for assessing the relative sizes of the sources of variation. An overview of random and mixed effects linear models is given by McLean, Sanders, and Stroup [1016]. Such models have also been applied to longitudinal data [883]. See Taylor, Cumberland, and Sy [1464] for an interesting application to AIDS.
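A one-way random effects decomposition can be sketched with simulated multicenter data. All quantities below are hypothetical; the method-of-moments calculation simply separates between-center from within-center variability, which is the comparison of interest in the text.

```python
import numpy as np

rng = np.random.default_rng(11)

n_centers, n_per_center = 20, 30
tau, sigma = 0.5, 2.0            # hypothetical between- and within-center SDs

center_effect = rng.normal(0, tau, size=n_centers)
y = center_effect[:, None] + rng.normal(0, sigma, (n_centers, n_per_center))

center_means = y.mean(axis=1)
grand_mean = y.mean()

# One-way ANOVA mean squares
ms_between = n_per_center * np.sum((center_means - grand_mean) ** 2) / (n_centers - 1)
ms_within = np.sum((y - center_means[:, None]) ** 2) / (n_centers * (n_per_center - 1))

# Method-of-moments variance component estimates
sigma2_within = ms_within                                          # target sigma**2 = 4.0
tau2_between = max((ms_between - ms_within) / n_per_center, 0.0)   # target tau**2 = 0.25
print(f"within-center variance  = {sigma2_within:.2f}")
print(f"between-center variance = {tau2_between:.2f}")
```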
20.8 EXPLORATORY ANALYSES

20.8.1 Clinical Trial Data Lend Themselves to Exploratory Analyses
In a true experiment, the biological or clinical question drives the methods for acquiring data. Important beneficial consequences result from the linkage and ordering of question and data in a good design, including reliability, validity, and quantification of errors. The importance of good methods in preserving these benefits cannot be overstated. But in addition, having data on a well defined cohort with good documentation and active follow-up creates an opportunity to explore ancillary or secondary questions. The data from clinical trials can and should be used to address questions in addition to those directly related to the primary objectives of the study. These exploratory analyses, as they are often called, do not generate the same quality or strength of evidence as primary questions do. Such questions may be specified in advance of the trial, or may suggest themselves after the data are seen. Despite there being a common pool of data, the experiment is usually not perfectly suited to answering secondary questions, so the quality of evidence that results is almost always inferior to that for the primary objectives. This can seem counterintuitive, but we cannot expect data to be reliable when put to purposes for which they were not designed. This represents the perpetual discomfort with analysis of happenstance data no matter how easily accomplished it may be. Problems with secondary analyses can result from the following:
1. Imprecision because the question is addressed in a subset of the study cohort or the interesting events occur infrequently.
2. An indeterminate or increased influence of random error because of multiplicity (defined below).
3. Confounding or bias despite a "randomized" design because subsets are conditionally defined or filtered by post-randomization events.
4. Lack of reliability inherent in data-driven questions as compared to question-driven data.
5. Analyses that impose strong, unlikely, or unverifiable assumptions.
6. Missing data that aggravate any condition above.

When any of these problems are likely, (i) we would not represent the finding as a primary result of the trial, and (ii) we would recognize the need for independent validation. Usually, investigators engage in this activity only after the primary questions have been answered, often in secondary publications or reports. Mistakes arise not because of exploratory analyses, but when the findings are overinterpreted. The usual calculation of type I error may be incorrect, and is definitely wrong when the data themselves suggest the hypothesis test. As a general rule, the same data should not be used both to generate a new hypothesis and to test it. Apart from the statistical pitfalls, investigators must guard against forcing the data to fit the hypotheses (data torturing) [1045].
20.8.2 Multiple Tests Multiply Type I Errors
Data, in sufficient quantity and detail, can be made to yield nearly any effect desired by the adventuresome analyst performing hypothesis tests. They will almost certainly yield some effect if studied diligently.

There once was a biased clinician,
Who rejected the wise statistician.
By flogging his data
With α and β,
He satisfied all his ambition.

A small thought experiment will illustrate effects of concern. Suppose that we generate N observations sampled from a Gaussian distribution with mean 0 and variance 1. With each observation, we also randomly generate 100 binary indicator variables, x_1, x_2, ..., x_100, which can be used to assign the observation to either of two groups. We then perform 100 "group" comparisons defined by the x's, each at a specified α-level (the nominal level), such as α = 0.05. Thus, the null hypothesis is true for all comparisons. Using this procedure, we expect 5% of the tests to reject simply by chance. Of course, the type I error rate for the entire testing procedure greatly exceeds 5%. It equals α* = 1 − (1 − α)^100 ≈ 0.99. In other words, we are virtually certain to find at least one "significant" difference based on partitioning by the x's.

What if we put aside the problem of multiplicity of tests and restrict our attention to only those differences that are large in magnitude? This corresponds to performing significance tests suggested by findings in the data. If we test only the 10 largest group differences, presumably all five of the expected "significant" differences will be in this
group. Thus, the expected type I error rate for each test will increase from the nominal 5% to 5∕10 = 50%. Investigators must be aware of these types of problems when performing exploratory analyses. Findings observed in this setting, even if supported by biological rationale, have a high chance of being incorrect. Independent verification is essential, perhaps through another clinical trial.
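The thought experiment is easy to reproduce. The sketch below simulates 100 null comparisons on the same outcome data and counts how often at least one test rejects at the nominal 0.05 level. Because the tests share the outcome values they are not exactly independent, but the family-wise error rate comes out close to the 1 − 0.95^100 ≈ 0.99 calculated above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def any_false_positive(n=200, n_tests=100, alpha=0.05):
    y = rng.normal(size=n)                  # outcome with no real group effects
    for _ in range(n_tests):
        x = rng.integers(0, 2, size=n)      # random binary "grouping" indicator
        if x.sum() in (0, n):               # skip degenerate splits
            continue
        _, p = stats.ttest_ind(y[x == 1], y[x == 0])
        if p < alpha:
            return True
    return False

reps = 1000
fwer = np.mean([any_false_positive() for _ in range(reps)])
print(f"estimated family-wise type I error = {fwer:.2f}")   # roughly 0.99
```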
20.8.3 Kinds of Multiplicity
There are four well-described sources of multiple tests that can contribute to an increased frequency of type I errors. They are multiple endpoints in a single trial, multiple interim analyses of accumulating data, tests performed on a large number of subsets, and analyses of prognostic factors (covariates) and their interactions. We can refer to these circumstances collectively as multiplicity. An extreme of any one of them can make the type I error arbitrarily high, as indicated in the numerical example above. It is sometimes surprising how rapidly the error rate can increase above the nominal level with seemingly few such tests.

Reducing errors due to a multiplicity of tests requires mainly discipline, which is to say that it can be quite difficult. The best conceptual approach to control such errors is illustrated in Chapter 18, where the frequentist framework for interim analyses is discussed. This framework is designed to control type I errors at the nominal level in interim analyses by (i) planning for multiplicity in advance, (ii) restricting the number of tests performed, (iii) performing each test in a structured fashion at a more stringent critical value, and (iv) tempering the technical recommendations with good biological judgment. These same points can be used to reduce errors when examining multiple endpoints or subset analyses. Some sensible advice on this subject in the context of cardiovascular trials is given by Parker and Naylor [1177].

Applying these ideas to subset analyses (defined below) implies that we should specify such tests in advance of obtaining the data and tightly constrain the number of comparisons performed. The nominal type I error level for such tests should not be interpreted uncritically or represented as a primary finding of the trial. If it is important to control the overall error rate, each test can be restricted using some appropriate correction factor. This point deserves careful consideration because there are disparate technical opinions about it. Last, seemingly significant findings from subset analyses must be interpreted cautiously or dismissed entirely if they are the product of abusive analyses. It is too much to ask the same data to generate a new hypothesis and simultaneously provide convincing evidence supporting it.
20.8.4 Inevitable Risks from Subgroups
Subgroups, subsets, or subpopulations are portions of the trial cohort delineated by measured factors in individuals. Subgroup members share common values of defining
discrete variables. If the defining factor is a continuous measurement, subgroup members will fall in the same category or range of values. Such factors may be measured at baseline, or in early post-treatment follow-up. Ignoring a factor when estimating a treatment effect assumes a homogeneous effect across factor levels. If the treatment effect partly depends on the factor, ignoring it will yield a larger variance than accounting for the variable. To account for a factor, we estimate differential treatment effects across its levels, which reduces overall variability. Factors subject to errors in measurement, like biological assays, can yield substantial misclassification that introduces more variability into subset analyses.

Investigators might be especially interested in subgroup analyses in three circumstances. First is when evidence for overall benefit is strong and we are looking for consistency of effect in a heterogeneous trial cohort. It can be reassuring to see similar effects across disparate subsets. Second, we sometimes see a statistically convincing but clinically marginal overall effect. Subset analyses might identify a group with a particularly favorable risk–benefit balance. In either of these first two situations, subgroup analyses may be necessary to make the best clinical inferences. Third, when the overall therapeutic effect is unconvincing, investigators sometimes seek a subset in which the treatment appears to be working. Those subset analyses are typically ill conditioned because they are not protected by the design of the trial. Control over bias and random variation is lost, making results unreliable. Even when planned, analysis of subgroups inflates errors beyond nominal levels because of multiplicity. Issues at the foundation of all subgroup analyses include the following:
· Heterogeneity: There are two perspectives. One is when seeking reassurance that heterogeneity in treatment effect is minimal and apparently random. This might pertain to the variation of effects across clinical centers in a trial. A second perspective is that we may be aware of differences attributable to specific measured factors. Either view motivates interest in subgroup analysis.
· Biological plausibility: Investigators must bring to bear evidence from outside the experiment that the subgroups in question can reflect biological principles. The general importance of biological plausibility in clinical trials is often understated compared to their empirical strengths. Its importance here connects to the discussions in Sections 2.2.1, 4.5.1, and 11.3.
· External evidence that a subgroup is well defined and important: This is of primary importance when subgroup characterization depends on a biological assay.
· Strong clinical and statistical associations: The magnitude of effect relative to disease landmarks, and the relative magnitude of effect compared to natural variation, must be taken into account. If either is small, subgroup exploration will be of little use.
· Replication: Replication is the only tool to reduce variability, and therefore control random error. Independent replication of a subgroup finding, especially one weakly designed or motivated, is crucial.
· Prior structure: Subgroup investigation demands advance specification such as stratification and preplanning of analyses. This reduces the errors of chance typical of post hoc analyses.
· Awareness of extraneous effects: Influences such as stage of disease, concomitant medications, and others can be incorrectly attributed to subgroup classification.
· Multiplicity: Too many chances for a type I error will almost certainly produce one. Searching among a large number of subgroups typifies and aggravates this problem. The only solutions are advance specification, documentation, and discipline in the number of analyses, and overall control over statistical significance levels.
· Statistical evidence: A subgroup effect is a type of treatment by covariate interaction; that is, the treatment effect depends on the level of another variable. Formal tests of interaction using models are one way to assess and summarize the statistical evidence. This does not correct for multiplicity (and may aggravate it) but does provide a framework for a rigorous statistical approach.
Failing to organize subgroup exploration around these principles increases the risk that findings are spurious. Even a disciplined approach to these questions is more error prone than the designed analyses of a trial for reasons discussed above. It is possible to design a trial explicitly for a test of interaction relevant to a subgroup, but a large sample size is required for the same reasons given for treatment–treatment interactions discussed in Chapter 22. Moreover, unlike designed experiments for interaction, subgroup questions are frequently unbalanced, further reducing efficiency.

20.8.5 Tale of a Subset Analysis Gone Wrong
One of the easiest ways for the analysis of a clinical trial to follow an inappropriate direction occurs when investigators emphasize the findings from a particular subset of subjects, especially when the results are different from the overall findings (i.e., an analysis of all randomized subjects). These interesting results may be found in a particular subset of subjects after an extensive search that is not based on any a priori biological hypothesis. Other times an accidental observation in a subset may suggest a difference, which is then tested and found to be "statistically significant." If the investigators have prejudices or reasons from outside the trial to believe the findings, these circumstances could lead to a fairly firmly held belief in the validity of the results. Besides type I errors, subset analyses are not protected by the randomization. As a result, a small bias can become amplified.

A classic cautionary example regarding subset analyses was derived from the second International Study of Infarct Survival (ISIS-2) [777]. In that trial (see also Section 18.1.4), 12 subset analyses were performed by zodiacal sign. Although, overall, aspirin was strongly superior to placebo (p < 0.00001), subjects born under the Gemini and Libra signs appeared to do better with placebo (9% ± 13%). This example has become so familiar to trialists that it has lost some of its ability to surprise and amuse. In any case, the lesson cannot be forgotten.

Another example of how subgroup analyses help create and perpetuate errors is illustrated by the enthusiasm some clinicians had in the 1980s and 1990s (and beyond) for the treatment of cancer patients with hydrazine sulfate. Hydrazine, H2NNH2 (diamide or diamine), was discovered in 1887 and synthesized in 1907. However, it was not until World War II that interest developed in the compound as a rocket fuel. It is a powerful reducing agent and readily reacts with acids to form salts. Hydrazine is a known carcinogen in rodents and probably in humans [1154]. It is
metabolized by N-acetylation [286] with varying rapidity according to acetylator phenotypes [1530]. A condensed but well referenced review of hydrazine sulfate in cancer treatment can be found in the National Cancer Institute's final IND report to the FDA [1078, 1080].

In the 1970s, there were suggestions that hydrazine sulfate could improve the survival of patients with advanced cancer [605–607]. Investigators suggested that the mechanism of action was by normalizing glucose metabolism in patients with cachexia. Cachexia is common in cancer patients and is a sign of poor prognosis. It may be due, in part, to tumor glycolysis. Although blocking glycolysis systemically is undesirable, inhibiting gluconeogenesis might be beneficial. Hydrazine sulfate is a noncompetitive inhibitor of phosphoenolpyruvate carboxykinase, the enzyme that catalyzes the conversion of oxaloacetate to phosphoenolpyruvate. This mechanism may explain observations that hydrazine sulfate inhibits tumor growth in animals [606].

Based on the possible biological actions suggested above and anecdotal case reports of human benefit, early uncontrolled trials of hydrazine sulfate were undertaken in the Soviet Union. The results were mixed. In 1990, a group of investigators reported results from a randomized clinical trial testing the effects of hydrazine sulfate in subjects with advanced lung cancer [256]. Although no statistically significant overall effect was found, a subset analysis revealed a group of subjects that appeared to benefit from the drug. The subset of subjects who seemed to have improved survival after treatment with hydrazine included those with the most advanced disease. It was not made clear how many subset analyses had to be conducted to discover the truth. The analysis and report of this study emphasized the subset findings as primary results of the trial.

Criticism of the trial report, suggesting that it might represent a type I error [1200], annoyed proponents of hydrazine sulfate. Commentary on the question in the well-known scientific forum afforded by Penthouse magazine suggested that patients were being denied a virtual cure for cancer and that unscrupulous researchers were profiting from continued denial of hydrazine's salutary effects:

If you … come down with cancer or AIDS … you will probably be denied the one drug that may offer the best possibility of an effective treatment with the least side effects. It works for roughly half of all the patients who have received it and it's being deliberately suppressed … [A] million Americans alone are being denied lifesaving benefits each year. … [T]he apparent sabotaging of federally funded clinical trials of the drug deny the public access to it [817].
The author of this article was said to be working on a book and a documentary film regarding hydrazine. Scientists with no financial interests in the outcome were performing large randomized clinical trials testing the effects of hydrazine in subjects with lung and colon cancer. One clinical trial was conducted by the Cancer and Leukemia Group B (CALGB) and two others were done by the North Central Cancer Treatment Group (NCCTG) based at the Mayo Clinic. These studies were all published in the same issue of the Journal of Clinical Oncology in 1994 [865, 866, 952, 953]. All of these studies showed hydrazine sulfate to be no better than placebo. Measurements of quality of life using standard methods of assessment also showed trends favoring placebo. The findings at a planned interim analysis from one randomized trial suggested that hydrazine sulfate was nearly significantly worse than placebo [953]. This finding could
have been due to chance imbalances in the treatment groups. In any case, the trial was terminated before its planned accrual target of 300 subjects because there was little chance that, given the interim results, hydrazine sulfate would be found to be beneficial. In addition to the rigorous design and analyses of these trials, the final reports emphasized only findings that were protected by the randomization. Even with these methodologically rigorous findings in evidence, the proponents of hydrazine would not be silenced. Returning to Penthouse, they said:

The contempt of the Romanovs for their people has almost been rivaled in the United States by key figures in the American cancer establishment. These senior physicians have demonstrated their own brand of imperviousness to the suffering of millions by systematically destroying the reputation of an anticancer drug that has already benefited thousands and made life longer and easier for many more … [818].
In 1993, because of the failure of hydrazine sulfate to demonstrate any benefit in large, randomized, double-masked, placebo-controlled clinical trials, the National Cancer Institute inactivated the IND for the drug [1080]. However, an editorial accompanying the publication of the results from the three NIH trials prophesied that the treatment might rise again [704]. This was true, for in 1994 the General Accounting Office (GAO) of the U.S. government began an investigation into the manner in which the three trials were conducted. The investigation was fueled by documents from the developer of hydrazine alleging that NCI had compromised the studies. In particular, questions were raised about the number of subjects who might have received barbiturates or related drugs, said to interfere with the beneficial effects of hydrazine. New data collected retrospectively and a re-analysis of the lung cancer trial supported the original conclusions [866].

The final GAO report on the NCI hydrazine clinical trials appeared in September 1995 [579]. After a detailed investigation of the allegations and research methods, the GAO supported conclusions by the study investigators and sponsors that hydrazine sulfate is ineffective. The GAO did criticize the failure to keep records of tranquilizer, barbiturate, and alcohol use, and the late analyses of such questions. These criticisms were dismissed by the Public Health Service in Appendix II of the GAO report. In any case, mainstream scientific and governmental opinion has converged on the view that hydrazine sulfate is not effective therapy for patients with cancer or HIV infection.

The findings of lack of efficacy continue to be explained away handily by proponents of hydrazine. Statements in 2009 by a leading advocate for the compound [608, 609] contained extensive biased rationalizations for all the study findings, positive and negative. There are also vague internet claims from 2013 of additional uncontrolled observations in humans and animals supporting hydrazine efficacy [610]. However, there seems to be little or no enthusiasm among mainstream cancer clinical researchers for further testing of this compound. Aside from the greater weight of evidence assigned by most scientists to the negative studies, the tone of conspiracy and paranoia adopted by its advocates likely continues to discourage quality investigations of hydrazine.

Lessons to Be Learned from Hydrazine Sulfate
The student of clinical trials can learn some very important general lessons by reading the scientific papers and commentaries cited above regarding hydrazine sulfate. With
regard to subset analyses, power is reduced because of the smaller sample sizes on which comparisons are based. This tends to produce a high false negative rate (see the discussion in Section 6.1.8). If investigators perform many such comparisons, the overall false positive rate will be increased. Thus, a large fraction of apparent findings will be incorrect. Explorations of the data are worthwhile and important for generating new questions. However, one cannot test a hypothesis reliably using the same data that generated the question.

The hydrazine example also sketches implicitly some of the circumstances in which we need to maintain a cautious interpretation of clinical trial results. Even when there is a priori biological justification for the hypothesis from animal studies, interpretations and extrapolations of methodologically flawed human trials should be cautious before confirmatory studies are done. Small trials or those with poor designs or analyses should be treated skeptically. The results of subset analyses should not be emphasized as the primary findings of a trial, nor should hypothesis tests suggested by the data. When exploratory analyses are conducted, statistical adjustment procedures can help interpret the validity of findings. The reader of trial reports and other medical studies should be aware of the ways in which observer bias can enter trial designs or analyses. Aside from inappropriate emphasis, those analyzing or reporting a study should not have any financial interests in the outcome. Apparent treatment–covariate interactions should be explored rigorously and fully. Finally, a look back at the discussion regarding late failures in Section 10.4 might be useful.
20.8.6 Perspective on Subgroup Analyses
Understanding the correct approach for subgroup analyses is important for several reasons. One is that we are mandated to perform such analyses for gender and minority representation, for reasons loosely linked to social justice (see Chapter 9). Second, regulators often look within subsets for consistency of treatment effects and contemplate what they see there. Third, contemporary views of subsets are increasingly derived from genetic characteristics that may be virtually deterministic with regard to treatment effects from targeted therapies.

The source of all concerns about subgroups is the unavoidable heterogeneity in clinical trial cohorts. An undisciplined approach to exploring heterogeneity will create problems even when the differences between subjects are unimportant. One classic example of this is multiplicity, or inflation of the type I error that results from performing many hypothesis tests. A second, similar problem is loss of precision (or power) resulting from the smaller size of subgroups. These problems will occur from merely looking.

Inconsequential heterogeneity is essentially random variation that should, and can, be controlled by replication (sample size). Aggregation, or ignoring subgroup distinctions, is the appropriate way to increase sample size, and all studies are designed to ignore irrelevant subsets. Variability is a concern only when it seems to be much larger than we might expect. Consequential heterogeneity is almost always derived from biological characteristics that modulate treatment effects (i.e., yield treatment–covariate interactions). Ignoring consequential heterogeneity could invalidate overall findings from a clinical trial. We
generally specify methods to cope with such concerns in the design of the study. A clinical trial may discover consequential heterogeneity, but it may be impossible to validate the finding in the same study. Genetically derived features of a disease can represent consequential heterogeneity, particularly if the treatments are targeted to the same features. On the other hand, even genetic differences may be inconsequential if the treatment mechanism of action does not depend on them.

The safest vantage point for exploring subgroup effects is through treatment–covariate interactions. It is best to anticipate these in the design of a trial because reliable investigation places large demands on sample size, in the range of fourfold. The cost is higher if subgroups are numerically unbalanced. Conversely, we typically have little power to convince ourselves that observed post hoc interactions are real. Unlike "subgroup-only" analyses, tests of interaction use all the available information in the data. Although any large interaction may be worth investigating, we tend to be most interested in qualitative ones (Chapter 9). In summary, the discipline to deal with subgroups includes the following ideas:
1. It is permissible or wise to ignore inconsequential heterogeneity.
2. Use biological knowledge and the likelihood of treatment–covariate interaction to decide if heterogeneity is consequential a priori.
3. If a subgroup question is relevant or important, design the trial accordingly.
4. Post hoc subgroup analysis may show comforting consistency of treatment effects but is not reliable beyond that.
5. Important subgroup differences will yield significant tests of treatment–covariate interaction if the design is strong.
6. Tests of interaction require roughly fourfold larger sample sizes than main effects for the same power.
7. Always control the type I error.
20.8.7 Effects the Trial Was Not Designed to Detect
Trials can only be actively powered for a single outcome, almost always the primary one. All other effect estimates produced by trial data will have their precision or power passively determined by the sample size, and to a lesser extent by the nature of the outcome measure and the analytic method. Sometimes we might power a study for an important secondary outcome, knowing that the primary outcome would then have supranominal power. In such a setting it might make sense to let the designed outcome be primary.

Uncommon events, most often related to risks, will often be accompanied by wide confidence intervals. All too often, imprecise, clinically interesting, and unexpected go hand in hand. We might call this the "principle of regrettable precision": absent design, precision is inversely proportional to interest for unanticipated effects. Similarly, we often don't pay for things we would actually like to know. Important effects that are usually too expensive for our constrained designs include safety signals, subset effects, and treatment–covariate interactions. These provide endless topics for discussion, criticism, "more research," and faulty conclusions.
Precision is not the only design protection lost in such circumstances. Most benefits of design such as control of extraneous factors, bias control, and control over random errors are lost. Unanticipated events essentially turn the study into a nonexperimental design with regard to the particular findings. We should recall the likelihood principle from Sections 7.5 and 10.4.12 that states that weak evidence is more likely to mislead than strong evidence is, and weak evidence is almost always the result when we didn’t reasonably anticipate a finding.
20.8.8 Safety Signals
There is no analysis strategy that will convert weak evidence from observed effects outside the design into strong evidence to form a basis for safety or therapeutic decisions. In Section 16.4.3 it was shown that an upper one-sided confidence bound for the true event rate when 0 events are observed in N subjects is well approximated by 3/N. If we actually observe r events, the upper confidence bound on the true event rate is clearly higher than 3/N. Confidence bounds can then be calculated by conventional means as in Sections 16.4.3 or 16.8.1. Then the essential question is whether the event rate evidence is consistent with a clinically tolerable threshold or an excess.

The implications in a regulatory context for these safety signals can be substantial. In recent years, for example, there were worrisome safety signals for cardiovascular endpoints in trials testing the benefits of cyclo-oxygenase II inhibitors on musculoskeletal conditions [73, 809, 1485]. The result was withdrawal of rofecoxib by its sponsor, and FDA issued a "black box" warning for similar compounds.

It would not be unusual to completely miss a rare treatment effect in developmental trials, but uncover weak evidence (a few events) in large comparative efficacy trials. Such trials most likely would not be designed to detect signals that were not previously seen. Event frequencies below 3% can only be ruled out with sample sizes in excess of 100. Frequencies below 1% can only be ruled out with samples exceeding 300. In such cases, there is the further assumption that the actual observed event frequencies are literally zero. If the data show 0 events out of 300 subjects, the upper one-sided exact 95% binomial confidence bound is 1%. If events occur with a low background frequency, say 0.3%, then convincing ourselves that the true rate is below 1% is even harder. With a true frequency of 0.3%, the upper two-sided 95% exact binomial confidence bound would be less than 1% only with sample sizes in excess of 750 subjects. We could then tolerate two events out of 750 subjects and be reasonably certain that the true frequency is less than 1%. A few serious events in a typical size comparative trial will often be seen only as ambiguous or weak evidence for a worrisome safety signal, unless the events are categorically unacceptable or the trial is very large.
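The exact (Clopper–Pearson) bounds quoted above are simple to compute from the beta distribution. A minimal sketch, reproducing the 0/300 and 2/750 calculations:

```python
from scipy import stats

def upper_bound(events, n, conf=0.95, two_sided=False):
    """Exact (Clopper-Pearson) upper confidence bound on an event probability."""
    q = 1 - (1 - conf) / 2 if two_sided else conf
    return stats.beta.ppf(q, events + 1, n - events)

# 0 events in 300 subjects: one-sided 95% upper bound, close to 3/N = 0.01
print(round(upper_bound(0, 300), 4))                    # about 0.010
# 2 events in 750 subjects: two-sided 95% upper bound still just below 1%
print(round(upper_bound(2, 750, two_sided=True), 4))    # about 0.0096
```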
20.8.9 Subsets
Subset analyses of clinical trials are endlessly seductive and amusing yet seldom explicitly designed. They almost represent a kind of conspiracy where trialists know that they will attract some attention in analysis and interpretation, but are not often placed on firm design footing. The highest profile subset comparisons nowadays are those derived from sex and ethnicity as discussed in Section 9.4.4. But many other subsets can be called out in the analysis of trials including those based on clinics or centers, risk factors, adherence,
age, genotypes, phenotypes, behavior, education, medical conditions, medication use or history, socioeconomic status, and so on. In any (heterogeneous) trial cohort, it seems reasonable to ask if the treatment effect is consistent across subsets, apart from the ill-conditioning guaranteed by small sample sizes and lack of planning. Homogeneity of relative effects across prognostic subsets is an implicit assumption of many comparative trial designs. Even so, it is worth pointing out that homogeneity of effect across our favorite subsets is not required for validity. It is perfectly reasonable to estimate the population average treatment effect in the presence of heterogeneity and design a trial for it. The real question is whether observed differences in effect are due to chance or can be reliably attributed to sensible biologically based characteristics, which would then yield greater insights.

The most efficient design, and therefore the typical assumption, is for homogeneity of effect. If the study cohort is known to be a mixture of heterogeneous types with respect to treatment effect, some additional quantitative reasoning is needed to select an optimal design (e.g., Section 13.8). But the perspective here is about visiting this issue after the fact. Then, the test of treatment effect by subset interaction (which is what is required, as explained in Section 9.4.4) has low power and we generally obtain weak evidence.

The U.S. Food and Drug Administration can get away with subset analyses that would embarrass most investigators, and on occasion question the validity of treatment effects if there seems to be inhomogeneity of treatment effects. The subsets in question might be institutionally defined, geographic, or otherwise not expected to give rise to noteworthy differences. The reasoning might be that outliers or large variation in subset estimates where we don't expect it reflects negatively on the validity of the overall treatment effect. This does not seem correct. In contrast, it would be suspicious if an overall positive effect were entirely due to a small minority of centers or units in a large complex trial.
20.8.10 Interactions
Treatment–covariate interaction is a unifying theme for understanding many of the atypical effects we would like to know about, and also why they are often unknowable. Subset questions are essentially interaction questions: does a certain characteristic modulate the treatment effect? Other interaction questions may relate to the structure of the trial, as in the case of geographic or clinic variation discussed above. Reliable tests of interactions require much larger sample sizes than are typically employed. When interactions are critical questions, the trial should be appropriately powered by increasing sample size accordingly. Otherwise the hint of an interaction will be based on weak evidence.

Anticipating the discussion in Chapter 22, treatment–treatment interactions typically require sample sizes that are a multiple of 4 or larger compared to main effects comparisons at the same type II error rate. A similar scale of increase can be expected for treatment–covariate interactions. For example, suppose that our study cohort is a 50–50 mixture of subject "types" and we must estimate the modifying effect of type on treatment. If X is a binary indicator for type, and T is a binary treatment indicator, we wish to estimate the T × X interaction. Then, there are four roughly equal size subsets of the study cohort that contribute to the interaction estimate (imagine a 2 × 2 table with rows T and columns X). If Ŷ_ij denotes the estimated mean effect in the jth type and ith treatment group, the estimate of the
interaction effect is I = (Ŷ_10 − Ŷ_00) − (Ŷ_11 − Ŷ_01). This quantity will differ significantly from zero when a treatment–covariate interaction is present and the sample size is sufficient. If there are n subjects in each treatment group, and the person to person variance is σ², the overall treatment effect (absent interaction) would have variance 2σ²/n. However, the variance of the interaction effect is 4σ²/(n/2) = 8σ²/n.

The saving feature of treatment–covariate interactions is that most of them probably have been therapeutically irrelevant. But in the era of genomic-based medicine, this may no longer be true. Many treatments will be targeted to specific subsets of individuals with the disease subtype, and will not work in individuals without the requisite genotype. While the treatment effect in the appropriate subset of individuals could be relatively large, a trial may require screening a big cohort to find a sufficient number of eligible participants.
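The fourfold penalty is easy to verify numerically. The sketch below uses hypothetical values and simulates repeated trials of 2n subjects split equally over treatment and a binary covariate, then compares the empirical variances of the main-effect and interaction estimates with the expressions above.

```python
import numpy as np

rng = np.random.default_rng(5)

n, sigma = 200, 1.0          # n subjects per treatment group; hypothetical SD
reps = 5000
main = np.empty(reps)
interaction = np.empty(reps)

for r in range(reps):
    # Four cells of n/2 subjects each: treatment i = 0/1, covariate (type) j = 0/1
    cell = {(i, j): rng.normal(0, sigma, n // 2).mean()
            for i in (0, 1) for j in (0, 1)}
    main[r] = (cell[1, 0] + cell[1, 1]) / 2 - (cell[0, 0] + cell[0, 1]) / 2
    interaction[r] = (cell[1, 0] - cell[0, 0]) - (cell[1, 1] - cell[0, 1])

print(f"var(main effect) ~ {main.var():.5f}   theory 2*sigma**2/n = {2 * sigma**2 / n:.5f}")
print(f"var(interaction) ~ {interaction.var():.5f}   theory 8*sigma**2/n = {8 * sigma**2 / n:.5f}")
```

Because the interaction estimate has four times the variance of the main effect, roughly four times the sample size is needed to test it with the same power, consistent with the summary points above.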
20.9 SUMMARY
The clinical effects of interest in phase I trials are usually drug distribution and elimination parameters and the association, if any, between dose and side effects. Pharmacokinetic models are essential for quantifying kinetic parameters such as elimination rate and half-life. These models are often based on simple mass action distribution of drug in a few idealized compartments. The compartments can often be identified physiologically as intravascular, extravascular, body fat, or similar tissues. Phase I trials usually provide useful dosage recommendations but usually do not provide evidence of efficacy.

Middle development clinical trials often focus on treatment feasibility and simple estimates of clinical efficacy or toxicity such as success or failure rates. More formally, these studies estimate the probability of success or failure (according to pre-defined criteria) when new patients are treated. Phase II studies in patients with life-threatening diseases can also provide estimates of survival and other event rates. Many other clinical or laboratory effects of treatment can be estimated from middle development trials.

Comparative trials estimate relative treatment effects. Depending on the outcome being used, the relative effect of two treatments might be a difference of means, a ratio, or a qualitative difference. Estimated risk ratios are important and commonly used relative treatment effects in phase III trials. Although not always required, statistical models can be useful when analyzing comparative trials to help estimate relative treatment effects and account for the possible influence of prognostic variables.

Besides statistical models, clinical trials often employ other special methods of analysis. These may be necessary for using the information in repeated measurements, when outcomes are correlated with one another, or when predictor variables change over time. The statistical methods for dealing with such situations are complex and require an experienced methodologist.

It is often important to conduct exploratory analyses of clinical trials, that is, analyses that do not adhere strictly to the experiment design. In comparative trials these explorations of the data may not be protected by the design of the study (e.g., randomization). Consequently, they should be performed, interpreted, and reported conservatively. This
is true of results based on subsets of the data, especially when many such analyses are performed.
20.10 QUESTIONS FOR DISCUSSION
1. A new drug is thought to be distributed in a single, mostly vascular, compartment and excreted by the kidneys. Drug administration will be by a single IV bolus injection. What rate (differential) equation describes this situation? What should the time–concentration curve look like? Four hours after injection, 65% of the drug has been eliminated. What is the half-life? If the drug is actually distributed in two or more compartments, what will be the shape of the time–concentration curve?
TABLE 20.14 Survival of Presidents, Popes, and Monarchs from 1690

Presidents:
Washington 10, J. Adams 29, Jefferson 26, Madison 28, Monroe 15, J.Q. Adams 23, Jackson 17, Van Buren 0, Harrison 20, Polk 4, Taylor 1, Fillmore 24, Pierce 16, Buchanan 12, Lincoln 4, A. Johnson 10, Grant 17, Hayes 16, Garfield 0, Arthur 7, Cleveland 24, Harrison 12, McKinley 4, T. Roosevelt 18, Taft 21, Wilson 11, Harding 2, Coolidge 9, Hoover 36, F. Roosevelt 12, Truman 28, Kennedy 3, Eisenhower 16, L. Johnson 9, Nixon 27, Ford 30, Carter 40+, Reagan 24, GHW Bush 28+, Clinton 25+, GW Bush 16+, Obama 8+

Popes:
Alex VIII 2, Innoc XII 9, Clem XI 21, Innoc XIII 3, Ben XIII 6, Clem XII 10, Ben XIV 18, Clem XIII 11, Clem XIV 6, Pius VI 25, Pius VII 23, Leo XII 5, Pius VIII 2, Greg XVI 15, Pius IX 32, Leo XIII 25, Pius X 11, Ben XV 8, Pius XI 17, Pius XII 19, John XXIII 5, Paul VI 15, John Paul 0, John Paul II 26, Benedict XVI 12+, Francis 4+

Monarchs:
James II 17, Mary II 6, William III 13, Anne 12, George I 13, George II 33, George III 59, George IV 10, William IV 7, Victoria 63, Edward VII 9, George V 25, Edward VIII 36, George VI 15, Elizabeth II 66+

Survival in years from first taking office; + denotes a censored observation (individual still alive). See Lunn and McNeil [960]; updated by the author in 2016.
2. For the solution of the two compartment pharmacokinetic model in Section 20.2.2, it was found that

   ξ_{1,2} = [ −(λ + μ + γ) ± √((λ + μ + γ)² − 4μγ) ] / 2.

   Prove that the expression under the radical is positive.

3. The following event times are observed on a safety and activity trial in subjects with cardiomyopathy (censored times are indicated by a +): 55, 122, 135, 141+, 144, 150, 153+, 154, 159, 162, 170, 171, 171+, 174, 178, 180, 180+, 200+, 200+, 200+, 200+, 200+. Construct a lifetable and estimate the survival at 180 days. Estimate the overall failure rate and an approximate 95% confidence interval.

4. In the mesothelioma trial, estimate the overall rate of progression and death rate. Are these rates associated with age, sex, or performance status? Why or why not?

5. In the CAP lung cancer trial, there are as many as 2 × 2 × 3 × 3 × 2 × 2 × 2 = 288 subsets based on categorical factors (cell type, performance status, T, N, sex, weight loss, and race). The number of possible ways of forming subsets is even larger. Can you suggest a simple algorithm for forming subsets based on arbitrary classifications? Use your method to explore treatment differences in subsets in the CAP trial. Draw conclusions based on your analyses.
TABLE 20.15 Results of a Double-Masked Randomized Trial of Mannitol versus Placebo in Patients with Ciguatera Poisoning

 ID   Signs and Symptoms      Treatment   Baseline   Sex   Age   Time
      Baseline   2.5 Hours    Group       Severity
  1      10          3            M          18       F     24    330
  2      10          7            M          12       M     12    270
  3      12          8            M          21       F     18    820
  4      12          6            M          24       M     11    825
 26       5          0            P           9       M     37    910
 28       4          4            P           9       M     41   2935
 33       7          7            P          12       M     44   1055
 37      10         10            P          23       M     46   1455
 42       5          5            P           6       M     54    780
 43       6          6            M           6       M     44   9430
 50      11         11            M          16       F     46    390
 51       7          7            M          12       F     27    900
 59       7          7            M          12       M     46    870
 94      12          6            M          19       M     44     80
 95      14         14            M          27       M     45    490
 98      14          7            M          34       F     34    415
116       8          7            M          12       F     25    605
117       8          5            M          20       F     18    650
118      16         11            M          27       F     16    645
501       3          0            P           3       F     27   2670
503      15         14            M          23       M     57   5850

Source: Data from Palafox [1170].
6. In the FAP clinical trial, test the treatment effect by performing gain score analyses on polyp number and polyp size. How do the results compare with analyses of covariance? Discuss. Can these analyses test or account for the effects of sex and age?

7. At one medical center, a new cancer treatment is classified as a success in 11 out of 55 subjects. At another center using the same treatment and criteria, the success rate is 14 out of 40 subjects. Using resampling methods, place a confidence interval on the difference in the success rates. What are your conclusions?

8. The data in Table 20.14 show the survival times in years measured from inauguration, election, or coronation for U.S. presidents, Roman Catholic popes, and British monarchs from 1690 to the present [960]. (I have added censored observations to the list for individuals remaining alive.) Are there differences among the groups? Discuss your methods and conclusions.

9. Do this exercise without consulting the reference. The data in Tables 20.15 and 20.16 show the results of a double-masked randomized clinical trial of mannitol infusion versus placebo for the treatment of ciguatera poisoning [1170]. The number of neurological signs and symptoms of poisoning at the start of treatment and 2.5 hours after treatment are given. Estimate the clinical effect and statistical significance of mannitol treatment. Discuss your estimate in view of the small size of the trial.
TABLE 20.16 Results of a Double-Masked Randomized Trial of Mannitol versus Placebo in Patients with Ciguatera Poisoning

  ID   Signs and Symptoms      Treatment   Baseline   Sex   Age   Time
       Baseline   2.5 Hours    Group       Severity
 504      13         13            P          22       M     42    440
 505      13         13            P          21       F     16   1460
 512      13          1            P          31       F     14    545
 513       5          0            M          29       F     30    670
 514      17          8            P          37       M     40    615
 515      25         24            M          54       F     19    645
 516      21         20            M          27       F     29   1280
 517       8          8            M          11       M     48   1500
 613      12          1            P          34       M      9    740
 615      11          8            M          45       M      9    415
 701      11          0            P          11       M     40    540
 705      12         12            M          14       F     38   1005
 706       3          3            M           4       M     39    370
 707       8          7            P           8       M     54    250
 708      11          3            P          10       F     27    795
 709       9          9            M           9       M     34      .
 737       4          3            M           4       M     31    865
 738      11          8            P          22       M     48    630
 766      11          0            P          29       M     22    630
 768      11          7            M          27       M     18    690
1000       6          6            P          11       M     50    795

Source: Data from Palafox [1170].
TABLE 20.17 Results of a Randomized Double-Masked Trial of BCNU Polymer versus Placebo for Patients with Newly Diagnosed Malignant Gliomas

  ID   Treat   Score   Sex   Age   Karn   GBM   Time   Status
 103     1       25     M     65     1      0     9.6      1
 104     1       25     M     43     1      0    24.0      0
 105     1       19     F     49     0      1    24.0      0
 107     1       17     M     60     0      1     9.2      1
 203     1       21     F     58     0      0    21.0      1
 204     1       26     F     45     1      1    24.0      0
 205     1       27     F     37     1      0    24.0      0
 301     1       23     M     60     1      1    13.3      1
 302     1       26     F     65     0      1    17.9      1
 306     1       24     F     44     0      1     9.9      1
 307     1       24     F     57     0      1    13.4      1
 402     1       27     M     68     1      1     9.7      1
 404     1       27     M     55     0      1    12.3      1
 407     1       10     M     60     0      1     1.1      1
 408     1       23     M     42     0      1     9.2      1
 412     1       27     F     48     1      0    24.0      0
 101     2       26     F     59     1      1     9.9      1
 102     2       14     F     65     0      1     1.9      1
 106     2       21     M     58     1      1     5.4      1
 108     2       21     F     51     0      1     9.2      1
 109     2       22     F     63     1      1     8.6      1
 201     2       17     F     45     0      1     9.1      1
 202     2       20     F     52     0      1     6.9      1
 303     2       26     M     62     1      1    11.5      1
 304     2       27     M     53     1      1     8.7      1
 305     2       25     F     44     1      1    17.1      1
 308     2       24     M     57     0      1     9.1      1
 309     2       25     M     36     1      1    15.1      1
 401     2       27     F     47     1      1    10.3      1
 403     2       27     M     52     1      1    24.0      0
 405     2       18     F     55     0      1     9.3      1
 406     2       26     F     63     0      1     4.8      1

Source: Data from Valtonen et al. [1504].
10. Do this exercise without consulting the reference. The data in Table 20.17 show the results of a double-masked randomized clinical trial of implantable biodegradable polymer wafers impregnated with carmustine (BCNU) versus placebo for the treatment of patients with newly diagnosed malignant gliomas [1504]. Survival time (in weeks) following surgical resection is given for each subject. Estimate the clinical effect and statistical significance of BCNU wafer treatment. Discuss your estimate in view of the small size of the trial. You may wish to compare your findings with those of Westphal et al. [1543].
21 PROGNOSTIC FACTOR ANALYSES
21.1 INTRODUCTION

Prognostic factor analyses (PFAs) are cohort studies that assess the relative importance of predictor variables (prognostic factors) on outcomes. These assessments are often made in a clinical trial cohort if it is large enough. PFAs can also be performed in cohorts assembled retrospectively, but the analysis itself always looks forward in time.

The need to prognosticate is basic to clinical reasoning. Gradations of risk as a consequence of individual characteristics can be substantial and are therefore important to both patients and physicians. We require statistical methods to account for the collective influence of multiple interrelated factors. Using appropriate statistical models, PFAs can easily account for several predictor variables simultaneously along with their interactions [81].

The terms predictor variables, independent variables, prognostic factors, and covariates are often used interchangeably. In this era of targeted therapies, however, a distinction between prognostic and predictive factors has evolved. Prognostic factors carry information about outcomes independent of treatment, whereas predictive factors are informative about outcome under a specific therapy [978–981]. For example, in breast cancer, anatomic extent is a prognostic factor indicating poorer outcomes for extensive disease. Estrogen receptor (ER) status is a predictive factor, with ER positivity indicating a 50–60% response to hormonal therapy and ER negativity less than a 10% response. Predictive factors are indicators of the likelihood of benefit from a particular therapy. This chapter will discuss only the assessment of prognostic factors, although similar methods might be used to measure the effects of predictive factors.

Prognostic factor analyses are closely related, or identical, to some methods for analyzing a clinical trial, especially those for adjusting estimated treatment differences for covariate imbalances. In fact, the treatment in a comparative experiment is simply a
prognostic factor controlled by the investigator. The effect of treatment could therefore be quantified and tested for significance in a prognostic factor model, although a randomized design permits a valid assessment of its effect with minimal assumptions, as indicated in earlier chapters. As with a trial, advance planning is needed to conduct reliable PFAs. There are a number of useful references discussing the techniques most frequently used [207, 208, 284, 580, 679, 968, 993, 1181].

PFAs are often based on data where investigators do not actively control confounders as in experiment designs. Hence, the validity of prognostic factor analyses depends on the absence of selection bias, on minimal missing data, on the correctness of the statistical models employed, and on having captured the important variables. These are stronger assumptions than those needed to estimate valid effects in true experiment designs. Not only are PFA data often the product of selection effects, but also the reasons why treatments were used may relate strongly to prognosis, which is confounding by indication. A prognostic factor analysis cannot control unobserved confounders that can bias estimates of treatment effects. These are the standard problems that plague database analyses.
21.1.1 Studying Prognostic Factors Is Broadly Useful
PFAs yield information about the future of individuals. We can learn the relative importance of multiple characteristics that affect, or are associated with, disease outcome. This is especially important for diseases that are treated imperfectly, as many serious chronic conditions are. Because a large fraction of individuals with those diseases will continue to have problems or even die, prognostication is vital for treatment planning and understanding the condition. There are many clinically useful prognostic factors, such as extent of disease measures, functional measures, and biomarkers. The utility and method of PFA is well illustrated in a 2010 study of advanced melanoma [103]. Sometimes PFAs are used to construct nomograms to assist quantitative classification of individuals and make predictions [619].

A second reason for studying prognostic factors is to improve the design of clinical trials. For example, suppose that we learn that a composite score describing the severity of recurrent respiratory papillomatosis is strongly prognostic. We might use this score to stratify on disease severity to improve balance and comparability of treatment groups in a randomized study of this condition. Other trials may focus on only high- or low-risk subjects as determined by such a score.

Knowledge of prognostic factors can sharpen analyses of randomized trials and other cohort studies. Unsightly imbalances in prognostic factors often are a source of worry in randomized trials. Analyses that adjust for imbalances can alleviate this problem, as discussed later in this chapter. We know from the discussion regarding randomization in Chapter 17 that errors associated with chance imbalances are actually controlled in the same way as type I errors.

Interactions between treatment and covariates, or between prognostic factors themselves, can be quantified using the methods of PFAs [210]. Large treatment–covariate interactions are likely to be important, as are those that indicate that treatment is helpful in one subset of patients but harmful in another, so-called qualitative interactions. Detecting interactions depends both on the nature of the covariate effects on outcome and on the scale of analysis. For example, if covariates influence outcome in a multiplicative fashion with no interactions, as is typical for nonlinear models for survival and other event time
TABLE 21.1 Hypothetical Effect Estimates for Factors A and B Illustrating Interaction on Different Scales of Measurement

                         Factor 𝐵 Present
Factor 𝐴 Present         No        Yes
No                       10        30
Yes                      20        60 (40)

See text for full explanation.
outcomes, an additive scale of analysis is likely to indicate interaction. Conversely, if the effect of covariates on outcome is additive, as in many linear models, a multiplicative scale of analysis will likely indicate interactions. These effects are illustrated in Table 21.1, which shows hypothetical outcomes in four groups indexed by two dichotomous prognostic factors. In the absence of both factors, the baseline response is 10. If the effect of factor 𝐴 is to multiply the rate by 2 and the effect of 𝐵 is to multiply the rate by 3, then a response of 60 in the yes–yes cell demonstrates no interaction. In contrast, if the effect of factor 𝐴 is to add 10 to the baseline response and the effect of 𝐵 is to add 20, then a response of 40 in the yes–yes cell demonstrates no interaction. Therefore, in either case there will be interaction on some scale of measurement, illustrating that interaction is model dependent (a short numerical sketch appears at the end of this subsection).

Prognostic factors are also useful in assessing clinical landmarks during the course of an illness and deciding if changes in treatment strategy are warranted. This could be useful, for example, when monitoring the time course of viral load in HIV positive patients. When the level exceeds some threshold, a change in treatment may be indicated. A threshold such as this could be determined by an appropriate PFA.
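The scale dependence in Table 21.1 can be checked with simple arithmetic. The following short sketch is my own illustration, not part of the original text:

```python
# Table 21.1 cells: baseline 10, factor A alone 20, factor B alone 30.
baseline, a_only, b_only = 10, 20, 30

# No interaction on a multiplicative scale: yes-yes cell = 10 * 2 * 3 = 60 ...
multiplicative_yy = baseline * (a_only / baseline) * (b_only / baseline)
# ... which looks like an interaction of +20 on the additive scale:
print(multiplicative_yy, multiplicative_yy - (a_only + b_only - baseline))

# No interaction on an additive scale: yes-yes cell = 10 + 10 + 20 = 40 ...
additive_yy = baseline + (a_only - baseline) + (b_only - baseline)
# ... which falls short of the multiplicative prediction of 60:
print(additive_yy, additive_yy / multiplicative_yy)
```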
21.1.2 Prognostic Factors Can Be Constant or Time-Varying
Numerical properties of prognostic factor measurements are the same as those for the study endpoints discussed in Chapter 5. Prognostic factors can be continuous measures, ordinal, binary, or categorical. They are usually recorded at study entry, or time 0, relative to follow-up time. These are termed baseline factors. The value of a baseline factor such as sex, treatment assigned, or functional index is usually taken to be fixed over follow-up time. The prognostic factor models most frequently employed assume that their effects apply immediately and remain constant over follow-up time.

Other prognostic factors change their value over time, as discussed in Chapter 20. They can be assessed at baseline and as longitudinal measurements during follow-up. For example, prostate specific antigen (PSA) is a reliable indicator of disease recurrence in prostate cancer. The PSA level may decrease to near zero in patients with completely treated early prostate cancer, but disease recurrence or risk of death may relate to recent PSA levels. Time-varying prognostic factors are termed time-dependent covariates (TDCs), and special methods are needed to account for their ever-changing effects on outcomes.

TDCs can be further classified as intrinsic or internal versus extrinsic or external. Intrinsic covariates are those measured in the study subject, such as the PSA example above. Extrinsic TDCs are those that exist independently of the study subject. For example, we might be interested in the risk of developing cancer as a function of environmental
levels of toxins. These levels may affect the individual’s risk but exist independently. In clinical trials, we are most interested in intrinsic factors. Both types of TDCs can be incorporated in prognostic factor models using appropriate modifications of the model equations and estimation methods.
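As a hedged illustration only (simulated data; the column names and marker behavior below are my own assumptions, not from this chapter), a time-dependent covariate is commonly handled by splitting each subject's follow-up into intervals over which the covariate is constant and fitting a suitably extended proportional hazards model:

```python
import numpy as np
import pandas as pd
from lifelines import CoxTimeVaryingFitter

rng = np.random.default_rng(0)
rows = []
for subject in range(60):
    psa = rng.uniform(0.1, 1.0)
    for start in range(0, 24, 6):                  # 6-month intervals
        psa += rng.exponential(0.3)                # marker drifts upward over time
        # Higher current PSA -> higher chance the event occurs in this interval.
        event = rng.random() < 0.05 * psa
        rows.append({"id": subject, "start": float(start), "stop": float(start + 6),
                     "psa": psa, "event": int(event)})
        if event:
            break

long_df = pd.DataFrame(rows)                       # one row per subject per interval
ctv = CoxTimeVaryingFitter().fit(long_df, id_col="id", event_col="event",
                                 start_col="start", stop_col="stop")
ctv.print_summary()
```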
21.2 MODEL-BASED METHODS
A statistical model is one of the most powerful and flexible tools for assessing the effects of more than one prognostic factor simultaneously. These models describe a plausible mathematical relationship between predictors and outcome in terms of one or more parameters that have handy clinical interpretations. To use such models effectively, the investigator must be knowledgeable about both the clinical and statistical subject matter and interpretation, and must (i) collect and verify complete data, (ii) consult an experienced statistical expert to guide technical aspects of the analysis, and (iii) plan for dealing with decision points during the analysis. Decisions during the analysis may dictate key features such as candidate predictors to include or exclude, model selection, and model building via variable selection.

Survival or time-to-event data constitute an important subset of prognostic factor information associated with clinical trials. Because of the frequency with which such data are encountered, statistical methods for analyzing them are highly evolved. Sources for the statistical theory dealing with these types of data are Lee [912], Cox and Oakes [318], and Kalbfleisch and Prentice [814]. There are also journals devoted to statistical methods for time-to-event data. The theory underlying these methods connects them to a broad class of relative risk models that find extensive use in clinical trials, PFAs, and epidemiology. These models are generally robust and most have extensive and diverse software support, but all have critical nuances that require expert statistical help.
21.2.1 Models Combine Theory and Data
A model is any construct that combines theoretical knowledge (hypothesis) with empirical knowledge (observation). In mathematical models the theoretical component is represented by one or more equations that relate the measured or observed quantities. Empirical knowledge is represented by data, that is, the measured quantities from a sample of individuals or experimental units. The behavior of a model is governed by its structure or functional form and by unknown quantities or constants of nature called parameters. One goal of the modeling process might be to estimate, or gain quantitative insight into, the parameters. Another goal might be to see if the data are consistent with the theoretical form of the model. Yet another goal might be to summarize or reduce large amounts of data efficiently, in which case the model need not be a precise representation.

Statistical models generally have additional characteristics. First, the equations represent convenient biological constructs but are usually not fashioned to be literally true as in some biological or physical models. Second, statistical models often explicitly incorporate an error structure or a method to cope with the random variability that is always present in the data. Third, the primary purpose of statistical models is often to facilitate estimating the parameters, so that relatively simple mathematical forms are most
appropriate. Even so, the models employed do more than simply smooth data [78]. A broad survey of statistical models is given by Dobson [386].

If it has been constructed so that the parameters correspond to clinically interpretable effects, the model can be a way of estimating the influence of several factors simultaneously on the outcome. In practice, statistical models usually provide a concise way for estimating parameters, obtaining confidence intervals on parameter estimates, testing hypotheses, choosing from among competing models, or revising the model itself. Models with these characteristics include linear, generalized linear, logistic, and proportional hazards models. These are widely used in clinical trials and prognostic factor analyses and are discussed in more detail below.
21.2.2 Scale and Coding May Be Important
An early step in a PFA is to code the measurements or variable values in a numerically appropriate way. All statistical models can be influenced by the numerical coding of variables. Even qualitative factors can be represented by the proper choice of numerical coding. There are no truly qualitative statistical models or qualitative variable effects. For this reason, the coding and scale of measurement that seems most natural clinically may not be the best one to use in a model. If an ordinal variable with three levels is coded 1, 2, 3 as compared with 10, 11, 12, the results of model fitting might be different. If a coding of 1, 10, 100 is used, almost certainly the results and conclusions will be different. The effects of variable coding can often be used purposefully to transform factors in ways that are clinically sensible and statistically advantageous.

Simple summary statistics for prognostic factor variables should always be inspected. This may reveal that some factors have highly skewed or irregular distributions, for which a transformation could be useful. Predictor variables do not need to have normal, or even symmetric, distributions. Categorical factors may have some levels with very few observations. Whenever possible, these should be combined so that subsets are not too small.

Ordinal and qualitative variables often need to be recoded as binary “indicator” or dummy variables that facilitate group comparisons. A variable with 𝑁 levels requires 𝑁 − 1 binary dummy variables to compare levels in a regression model. Each dummy variable implies a comparison between a specific level and the reference level, which is omitted from the model. For example, a factor with three levels A, B, and C would require two dummy variables. One possible coding is for the first variable to have the value 1 for level A and 0 otherwise. The second could have 1 for level B and 0 otherwise. Including both variables in a model compares A and B versus the reference group C. In contrast, if a single variable is coded 1 for level A, 2 for level B, and 3 for level C, for example, then including it in a regression will model a linear trend across the levels. This would imply that the A–B difference was the same as the B–C difference, which might or might not be appropriate. A set of three dummy variables for a factor with four levels is shown in Table 21.2.
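As a brief sketch of this coding (hypothetical data; the column name is illustrative, not from the text), the reference-level scheme of Table 21.2 can be produced directly:

```python
import pandas as pd

# A four-level factor; level D will serve as the reference level.
df = pd.DataFrame({"grade": ["A", "B", "C", "D", "B", "D"]})

# One indicator column per non-reference level, as in Table 21.2.
dummies = pd.get_dummies(df["grade"]).drop(columns="D").astype(int)
print(dummies)

# By contrast, a single numeric coding such as A=1, B=2, C=3, D=4 would force a
# linear trend across levels rather than separate level-versus-reference contrasts.
numeric = df["grade"].map({"A": 1, "B": 2, "C": 3, "D": 4})
```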
21.2.3 Use Flexible Covariate Models
The appropriate statistical models to employ in PFAs are dictated by the specific type of data and biological questions.
TABLE 21.2 Dummy Variables for a Four-Level Factor with the Last Level Taken as a Reference

                     Dummy Variables
Factor Level         𝑥1      𝑥2      𝑥3
A                    1       0       0
B                    0       1       0
C                    0       0       1
D                    0       0       0
Regardless of the endpoint and structure of the model, the mathematical form will always contain one or more submodels that describe the effects of multiple covariates on either the outcome or a simple function of it. For example, most relative risk regression models used in failure time data and epidemiological applications use a multiplicative form for the effects of covariates on a measure of relative risk. In the case of the proportional hazards regression model, the logarithm of the hazard ratio is assumed to be a constant related to a linear combination of the predictor variables. The hazard rate, 𝜆𝑖(𝑡), in those with covariate vector 𝐱𝑖 is assumed to satisfy

\[ \log\left\{ \frac{\lambda_i(t)}{\lambda_0(t)} \right\} = \sum_{j=1}^{k} \beta_j x_{ij} . \]
Because the covariates are additive on a logarithmic scale, the effects multiply the baseline hazard. Similarly, the widely used logistic regression model is

\[ \log\left\{ \frac{p_i}{1 - p_i} \right\} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} , \]
where 𝑝𝑖 is the probability of “success,” 𝐱𝑖𝑗 is the value of the 𝑗th covariate in the 𝑖th subject or group, and 𝛽𝑗 is the log-odds ratio for the 𝑗th covariate. This model contains an intercept term, 𝛽0, unlike the proportional hazards model where the baseline hazard function, 𝜆0(𝑡), is arbitrary. In both cases a linear combination of covariates multiplies the baseline risk.

Generalized linear models (GLMs) [1008, 1112] are powerful tools that can be used to describe the effects of covariates on a variety of outcome data from the general exponential family of distributions. GLMs have the form

\[ \eta_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} , \qquad E\{y_i\} = g(\eta_i) , \]

where 𝑔(⋅) is a simple “link function” relating the outcome, 𝑦, to the linear combination of predictors. Powerful and general statistical theory facilitates parameter estimation in this class of models, which includes one-way analyses of variance, multiple linear regression models, log-linear models, logit and probit models, and others. For a classic comprehensive review, see McCullagh and Nelder [1008].

For both common relative risk models and GLMs, the covariate submodel is a linear one. This form is primarily a mathematical convenience and yields parameter estimates
that are simple to interpret. However, more complex covariate models can be constructed in special cases, such as additive models and those with random effects.

Models with Random Effects

All statistical models contain an explicit error term that represents a random quantity, perhaps due to the effects of measurement error. Almost always, the random error is assumed to have mean zero and a variability that can be estimated from the data. Most other effects or parameters in statistical models are assumed to be fixed, which means that they estimate quantities that are constant for all subjects. Sometimes we need to model effects that, like the error term, are random but are attributable to sources other than random error. Random effects terms are a way to accomplish this, and they are used frequently in linear models. For discussions of this topic in the context of longitudinal data, see [883] and [1320].

An example where a random effect might be necessary is a clinical trial with a large number of treatment centers and a population at each center that is somewhat different from the others. In some sense the study centers can be thought of as a random sample of all possible centers. The treatment effect may be partly a function of study center, meaning there may be clinically important treatment by center interactions. The effect of study center on the trial outcome variable might best be modeled as a random effect. A fixed-effect model would require a separate parameter to describe each study center.
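A minimal sketch of the multicenter situation just described, assuming simulated data and hypothetical column names (one of several reasonable ways such a random center effect could be fit):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n_centers, per_center = 20, 30
center = np.repeat(np.arange(n_centers), per_center)
center_shift = rng.normal(0, 2.0, n_centers)[center]      # random center effects
treat = np.tile([0, 1], n_centers * per_center // 2)
y = 10 + 1.5 * treat + center_shift + rng.normal(0, 3.0, len(treat))
df = pd.DataFrame({"y": y, "treat": treat, "center": center})

# Random intercept for center; treatment remains a fixed effect.
fit = smf.mixedlm("y ~ treat", data=df, groups=df["center"]).fit()
print(fit.summary())
```

A fully fixed-effect alternative would instead add one indicator parameter per center, which becomes unwieldy as the number of centers grows.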
21.2.4 Building Parsimonious Models Is the Next Step
Quantitative prognostic factor assessment can be thought of as the process of constructing parsimonious statistical models. These models are most useful when (i) they contain a few clinically relevant and interpretable predictors, (ii) the parameters or coefficients are estimated with a reasonably high degree of precision, (iii) the predictive factors each carry independent information about prognosis, and (iv) the model is consistent with other clinical and biological data. Constructing models that meet these criteria is usually not a simple or automatic process. We can use information in the data themselves (data-based variable selection), clinical knowledge (clinically based variable selection), or both. A useful tutorial on this subject is given by Harrell, Lee, and Mark [677].

Don't Rely on Automated Procedures

With modern computing technology and algorithms, it is possible to automate portions of the model-building process, which can be dangerous. First, the criteria on which automated algorithms select or eliminate variables for inclusion in the model may not be appropriate. This is often done on the basis of significance levels (p-values) only. Second, the statistical properties of performing a large number of such tests and refitting models are poor when the objective is to arrive at a valid set of predictors. The process will be sensitive to noise or chance associations in the data. Third, an automated procedure does not provide a way to incorporate information from outside the model-building mechanics. This information may be absolutely critical for finding a statistically correct and clinically interpretable final model. Fourth, we usually want more sophisticated control over missing data values, an inevitable complication, than that afforded by automated procedures.
There are ways of correcting these deficiencies. All of them require thoughtful input into the model-building process from clinical investigators and interaction with a biostatistician who is familiar with the methods and pitfalls. One way to prevent being misled by random associations in the data is to pre-screen prognostic factors using clinical criteria. Many times prognostic factors are simply convenient, as opposed to having been collected because of a priori interest or biological plausibility. Such factors should probably not be taken seriously.

An example of the potential difficulty with automated variable selection procedures is provided by the data in Table 21.3. The complete data and computer program to analyze them are given in Appendix A. The data consist of 80 observed event times and censoring indicators. For each study subject, 25 dichotomous predictor variables have been measured. These data are similar to those available from many exploratory studies.

The first approach to building a multiple regression model employed a step-up selection procedure. Variables were entered into the regression model if the significance level for association with the failure time was less than 0.05. This process found 𝑋1 and 𝑋12 to be significantly associated with the failure time. A second model-building approach used step-down selection, retaining variables in the regression only if the p-value was less than 0.05. At its conclusion, this method found 𝑋3, 𝑋10, 𝑋11, 𝑋14, 𝑋15, 𝑋16, 𝑋18, 𝑋20, 𝑋21, and 𝑋23 to be significantly associated with the failure time (Table 21.4).

It may be somewhat alarming to those unfamiliar with these procedures that the results of the two model-building techniques do not overlap. The results of the step-down procedure suggest that the retention criterion could be strengthened. Using step-down variable selection and a p-value of 0.025 to stay in the model, no variables were found to be significant. This seems even more unlikely than the fact that ten variables appeared important after the step-down procedure. These results are typical, and should serve as a warning as to volatility when automated procedures based on significance levels are used without the benefit of biological knowledge.

TABLE 21.3 Simulated Outcome and Predictor Variable Data from a Prognostic Factor Analysis

#    Dead   Time     𝑋1–𝑋25
1    1      7.38     1111111001110100010100101
2    1      27.56    1001100111001011001001110
3    1      1.67     1000110000111000011010010
4    1      1.82     1101001111011101100101101
5    0      10.49    0001110011110011111100101
6    1      14.96    1011111011000100101001011
7    1      0.63     0111001011101000010001111
8    1      10.84    1001010111111011110100101
9    1      15.65    1110010000001001110001100
10   1      4.73     0100100000011100101100000
11   1      14.97    0010111110010111110000010
12   1      3.47     0010111111001100000000000
13   1      4.29     1100110000101100010000001
14   1      0.11     1111110110111101000011000
15   1      13.35    0010101010110011010010011
⋮    ⋮      ⋮        ⋮
TABLE 21.4 Multiple Regressions Using Automated Variable Selection Methods Applied to the Data from Table 21.3

Model   Variables   Parameter Estimate   Standard Error   Wald 𝜒²   Pr > 𝜒²   Risk Ratio
1       𝑋1          0.479                0.238            4.07      0.04      1.615
        𝑋12         −0.551               0.238            5.35      0.02      0.576
2       𝑋3          0.573                0.257            4.96      0.03      1.773
        𝑋10         −0.743               0.271            7.53      0.006     0.476
        𝑋11         0.777                0.274            8.06      0.005     2.175
        𝑋14         −0.697               0.276            6.38      0.01      0.498
        𝑋15         −0.611               0.272            5.06      0.02      0.543
        𝑋16         −0.670               0.261            6.57      0.01      0.512
        𝑋18         −0.767               0.290            7.01      0.008     0.465
        𝑋20         −0.610               0.262            5.42      0.02      0.544
        𝑋21         0.699                0.278            6.30      0.01      2.012
        𝑋23         0.650                0.261            6.20      0.01      1.916
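The following sketch shows how a dataset of this general kind can be simulated and screened. It is an illustration only, not the Appendix A program; the library calls, random seed, and censoring scheme are my own assumptions.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(20)
n, p = 80, 25
X = rng.integers(0, 2, size=(n, p))            # independent binary predictors
time = rng.exponential(scale=10.0, size=n)     # event times unrelated to any X
dead = (rng.random(n) < 0.9).astype(int)       # roughly 10% random censoring

df = pd.DataFrame(X, columns=[f"X{j + 1}" for j in range(p)])
df["time"], df["dead"] = time, dead

# Univariable screening: with alpha = 0.05 and 25 null predictors, roughly one
# spuriously "significant" association is expected by chance alone.
flagged = []
for j in range(p):
    name = f"X{j + 1}"
    cph = CoxPHFitter().fit(df[[name, "time", "dead"]],
                            duration_col="time", event_col="dead")
    if cph.summary.loc[name, "p"] < 0.05:
        flagged.append(name)
print("Nominally significant predictors:", flagged)
```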
A second lesson from this example derives from the fact that the data were simulated such that all predictor variables were both independent from one another and independent from the failure time. Thus, all statistically significant associations in Table 21.4 are purely due to chance. Furthermore, the apparent joint or multivariable association, especially in the second regression, disappears when a single variable is removed from the regression.

Resolve Missing Data

There are three alternatives for coping with missing covariate values. One is to disregard prognostic factors with missing values. This may be the best strategy if the proportion of missing data within the factor(s) is very high. Alternatively, one can remove the records or observations with missing covariates from the analysis. This may be appropriate for a small proportion of individuals who have missing values for a large fraction of the important covariates. A third strategy is to replace missing observations with an appropriate value determined by a statistical procedure with desirable properties. Some statistical packages have no way to cope with records that contain missing values for the variable being analyzed, other than to discard them.

We rarely have the luxury of analyzing perfectly complete data. Often, many records have some missing measurements, so that the number of records with complete observations is only a fraction of the total. If data are missing at random, meaning that the loss of information is not associated with outcomes or other covariates, it may be appropriate to disregard individuals with the missing variable. Loss of precision is the main consequence of this approach, but we can still assess the influence of the covariate of interest. If missing data are more likely in individuals with certain outcomes (or certain values of other covariates), there may be no approach for coping with the loss of information that avoids bias. Removing records with missing data from the analysis in such a circumstance systematically discards influences on the outcome. For example, if individuals with the most severe disease are more likely to be missing outcome data, the aggregate effect
of severity on outcome will be lessened. When we have no choice but to lose some information, decisions about how to cope with missing data need to be guided by clinical considerations rather than automated statistical procedures.

There is a third useful method of coping with missing values. It is called imputation, or replacing missing data with values calculated in a way that allows other analyses to proceed essentially unaffected. This can be a good alternative when relatively few data are missing, precisely the circumstance in which other alternatives are also workable. Imputation is discussed in Chapter 19. For example, missing values could be predicted from all the remaining data and replaced by estimates that preserve the overall covariance structure of the data. The effect of such procedures on the inferences that result must be studied on a case-by-case basis.

The way that missing data are resolved should not be driven by the effect it produces on the outcome. For example, discarding certain incomplete observations or variables may influence the outcome differently than imputation. The best strategy to use should be decided on principle rather than results. I have seen circumstances where investigators (or regulatory officials) create missing data by disregarding certain observations, regardless of completeness. While there are occasional circumstances in which this might be the correct thing to do, it can produce very strong effects on the outcome and should be discussed carefully ahead of time.

Screen Factors for Importance in Univariable Regressions

The next practical step in assessing prognostic factors is to screen all the retained variables in univariable regressions (or other analyses). It is important to examine the estimates of clinical effect (e.g., relative hazards), confidence intervals, and significance levels. Together with biological knowledge, these can be used to select a subset of factors to study in multivariable models. The most difficult conceptual part of this process is deciding which factors to discard as unimportant. The basis of this decision is threefold: prior clinical information (usually qualitative), the size of the estimated effect, and the statistical significance level. Strong factors, or those known to be important biologically, should be retained for further consideration regardless of significance levels.

The findings at this stage can help investigators check the overall validity of the data. For example, certain factors are known to be strongly prognostic in similar groups of subjects. Investigators should examine the direction and magnitude of prognostic factor effects. If the estimates deviate from those previously observed or known, the data may contain important errors.

It is appropriate to use a relaxed definition of “significance” at this screening stage. For example, one could use 𝛼 = 0.15 to minimize the chance of discarding an important prognostic factor. This screening step usually reduces the number of factors to about one-fourth of those started with. The potential errors in this process are the tendency to keep variables that are associated with the outcome purely by chance and the possibility that the modeling procedure may overestimate the effect of some factors on the outcome. The only way to prevent or minimize these mistakes is to use clinical knowledge to augment the screening process.

Build Multiple Regressions

The next practical step in a PFA is to build a series of multivariable regression models and study their relative performance.
To decrease the chance of missing important associations in the data, one should try more models rather than fewer. Consequently,
step-down model-building procedures may be better than step-up methods. In step-down, all prognostic factors that pass the screening are included in a single multiple regression. The model is likely to be overparameterized at the outset, in which case one or more factors will have to be removed to allow the estimation process to converge. Following model-fitting, we use the same evaluation methods outlined above to include or discard prognostic factors in the regression. After each change or removal of variables, the parameters and significance levels are re-estimated. The process stops when the model makes the most biological sense. It is satisfying when this point also corresponds to a local or global statistical optimum, such as when the parameter estimates are fairly precise and the significance levels are high.

Usually, step-down variable selection methods will cause us to try more models than other approaches. In fact, there may be partially (or non-) overlapping sets of predictors that perform nearly as well as each other using model-fitting evaluations. This is most likely a consequence of correlations between the predictor variables. Coping with severe manifestations of this problem is discussed below.

Some automated approaches can fit all possible subset regressions. For example, they produce all 1-variable regressions or all 2-variable ones. The number of such regressions is large. For 𝑟 predictor variables, the total number of models possible without interaction terms is

\[ N = \sum_{k=1}^{r} \binom{r}{k} . \]

For 10 predictor variables, 𝑁 = 1023. Even if all of these models can be fitted and summarized, the details may be important for distinguishing between them. Thus, there is usually no substitute for guidance by clinical knowledge.

Correlated Predictors May Be a Problem

Correlations among predictor variables can present difficulties during the model-building and interpretation process. Sometimes the correlations among predictor variables are strong enough to interfere with the statistical estimation process, in the same way that occurs with familiar linear regression models. Although models such as logistic and proportional hazards regressions are nonlinear, the parameter estimates are most often obtained by a process of iterated solution through linearization of estimating equations. Furthermore, covariates often form a linear combination in statistical models, as discussed above. Therefore, collinearity of predictors can be a problem in nonlinear models also.

Even when the estimation process goes well, correlated variables can create difficulties in model building and interpretation. Among a set of correlated predictors, any one will appear to improve the model prediction, but if more than one is included, all of them may appear unimportant. To diagnose and correct this situation, a number of models will have to be fitted and one of several seemingly good models selected as best.

Clinicians are sometimes disturbed by the fact that statistical procedures are not guaranteed to produce one regression model that is clearly superior to all others. In fact, even defining this model on statistical grounds can be difficult because of the large number of regressions that one has to examine in typical circumstances. In any case, we should not be surprised that several models fit the data and explain it well, especially when different variables carry partially redundant information. Models cannot always
be distinguished on the basis of statistical evidence because of our inability to compare nonnested models formally and the inadequacies of summary statistics like significance levels for variable selection. Nested models are those that are special cases of more complex ones. Nested models can often be compared with parent models using statistical tests such as the likelihood ratio (see Chapter 18). The only reasonable way for the methodologist to solve these difficulties is to work with clinicians who have expert knowledge of the predictor variables, based on either other studies or preclinical data. Their opinion, when guided by statistical evidence, is necessary for building a good model. Even when a particular set of predictors appears to offer slight statistical improvement in fit over another, one should generally prefer the set with the most clear clinical interpretation. Of course, if the statistical evidence concerning a particular predictor or set of predictors is very strong, then these models should be preferred or studied very carefully to understand the mechanisms, which may lead to new biological findings.
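As a small sketch of comparing nested models (simulated data and variable names of my own choosing, not from the text), the likelihood ratio statistic can be computed directly from the fitted log-likelihoods:

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
# Outcome depends on x1 only; x2 is a candidate predictor under evaluation.
p = 1 / (1 + np.exp(-(-0.5 + 1.0 * df["x1"])))
df["y"] = (rng.random(n) < p).astype(int)

reduced = smf.logit("y ~ x1", data=df).fit(disp=False)
full    = smf.logit("y ~ x1 + x2", data=df).fit(disp=False)

lr = 2 * (full.llf - reduced.llf)        # likelihood ratio statistic
p_value = stats.chi2.sf(lr, df=1)        # one added parameter
print(f"LR = {lr:.2f}, p = {p_value:.3f}")
```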
21.2.5 Incompletely Specified Models May Yield Biased Estimates
When working with linear models, it is well known that omission of important predictor variables will not bias the estimated coefficients of the variables included in the model. For example, if 𝑌 is a response linearly related to a set of predictors, 𝐗, so that

\[ E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n \]

is the true model and we fit an incompletely specified model, for example,

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon , \]

the estimates of 𝛽1, 𝛽2, and 𝛽3 will be unbiased. The effect of incomplete specification is to increase the variances but not to bias the estimates [390].

In contrast, when using certain nonlinear models, the proportional hazards model among them, omission of an important covariate will bias the estimated coefficients, even if the omitted covariate is perfectly balanced across levels of those remaining in the model [558]. The same can be said for model-derived estimates of the variances of the coefficients [559]. The magnitude of the bias that results from these incompletely specified nonlinear models is proportional to the strength of the omitted covariate. For example, suppose that we conduct a clinical trial to estimate the hazard ratio for survival between two treatments for coronary artery disease and that age is an influential predictor of the risk of death. Even if young and old subjects are perfectly balanced in the treatment groups, failure to include age in a proportional hazards regression when using it to estimate the hazard ratio will yield a biased estimate of the treatment effect. Although the existence of this bias is important for theoretical reasons, for situations commonly encountered in analyzing RCTs, there is not a serious consequence when important covariates are omitted, as they invariably are [251].

The lesson to learn from this situation is that models are important and powerful conveniences for summarizing data, but they are subject to assumptions and limitations that prevent us from blindly accepting the parameters or significance tests they yield. Even so, they offer many advantages that usually outweigh their limitations.
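A brief simulation can make the coronary artery disease example concrete. This sketch is my own (simulated data, hypothetical variable names), not an analysis from the text; it shows the treatment hazard ratio attenuating toward the null when a strong, perfectly balanced covariate is omitted from a proportional hazards model.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 20000
treat = np.tile([0, 1], n // 2)                     # randomized treatment
age_hi = np.tile([0, 0, 1, 1], n // 4)              # strong covariate, balanced across arms
log_hr_treat, log_hr_age = np.log(0.6), np.log(3.0)
lam = 0.1 * np.exp(log_hr_treat * treat + log_hr_age * age_hi)
df = pd.DataFrame({"treat": treat, "age_hi": age_hi,
                   "time": rng.exponential(1 / lam), "event": 1})

full = CoxPHFitter().fit(df, "time", "event")                       # both covariates
reduced = CoxPHFitter().fit(df.drop(columns="age_hi"), "time", "event")
print("Adjusted treatment HR:  ", np.exp(full.params_["treat"]))    # close to the true 0.60
print("Unadjusted treatment HR:", np.exp(reduced.params_["treat"])) # attenuated toward 1
```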
21.2.6 Study Second-Order Effects (Interactions)
One advantage to using models to perform PFAs is the ability to assess interactions. For example, if response depends on both age and sex, it is possible that the sex effect is different in young, compared with old, individuals. In a linear model an interaction can be described by the model

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \gamma X_1 X_2 + \epsilon , \]

where 𝛾 represents the strength of the interaction between 𝑋1 and 𝑋2. Interactions of biological importance are relatively uncommon. Some interactions that seem to be important can be eliminated by variable transformations or different scales of analysis. However, large interactions may be important, and regression models provide a convenient way to assess them.

One difficulty with using models to survey interactions is the large number of comparisons that have to be performed to evaluate all possible effects. If the model supports 6 prognostic factors, there are \(6 + \binom{6}{2} = 21\) pairwise interactions possible (each variable can interact with itself). Even if all estimated interaction effects are due only to chance, we might expect one of these to be significant using p-values as test criteria. The number of higher order interactions possible is much larger. Usually, we would not screen for interactions unless there is an a priori reason to do so or if the model does not fit the data well. Even then we should employ more strict criteria for declaring an interaction statistically significant than for main effects, to reduce type I errors.

It is important to include the main effects, or low-order terms, in the model when estimating interaction effects. Otherwise, we may misestimate the coefficients and wrongly conclude that the high-order effect is significant when it is not. This can be illustrated by a very simple example. Suppose that we have the data shown in Figure 21.1 and the two models 𝐸{𝑌} = 𝛽0 + 𝛽1𝑋1 and 𝐸{𝑌} = 𝛽1∗𝑋1. The first model has both a low-order effect (intercept) and a high-order effect (𝛽1). The second model has only the high-order effect 𝛽1∗, meaning the intercept is assumed to be zero. When fit to the data, the models yield 𝛽1 ≈ 0 and 𝛽1∗ ≠ 0. In the first case, we obtain a correct estimate of the slope (Fig. 21.1, dotted line). In the second case, we obtain an incorrect estimate of the slope because we have wrongly assumed that the intercept is zero (Fig. 21.1, solid line). An analogous problem can occur when interaction effects are estimated assuming the main effects are zero (i.e., omitting them from the model). A clinical application of interaction analyses can be seen in Section 4.6.6.
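A short sketch of this pitfall with simulated data (my own illustration; the variable names and effect sizes are arbitrary): the interaction term looks spuriously strong when the main effects are wrongly omitted.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({"age": rng.normal(60, 10, n),
                   "sex": rng.integers(0, 2, n)})
# True model: main effects only, no interaction.
df["y"] = 5 + 0.3 * df["age"] - 2.0 * df["sex"] + rng.normal(0, 3, n)

with_main = smf.ols("y ~ age + sex + age:sex", data=df).fit()
no_main   = smf.ols("y ~ age:sex - 1", data=df).fit()   # wrongly omits main effects

print(with_main.params["age:sex"], with_main.pvalues["age:sex"])  # near 0, not significant
print(no_main.params["age:sex"], no_main.pvalues["age:sex"])      # misleadingly "significant"
```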
21.2.7 PFAs Can Help Describe Risk Groups
Information from a PFA can help clinicians anticipate the future course of a patient's illness as a function of several possibly correlated predictors. This type of prognostication is often done informally on purely clinical grounds, most often using a few categorical factors such as functional classifications or anatomical extent of disease.
FIGURE 21.1 Hypothetical data for regression analyses with and without low-order effects.
Stage of disease is a good example of a measure of extent from the field of cancer. Prognostication based only on simple clinical parameters can be sharpened considerably by using quantitative model-based methods.

Suppose that we have measured two factors, 𝑋 and 𝑌, that are either present or absent in each subject and relate strongly to outcome. Subjects could be classified into one of the following four cells: 𝑋 alone, 𝑌 alone, both, or neither. It is possible that the prognoses of individuals in each of the four cells are quite different and that knowing into which risk group a subject is categorized would convey useful information about his or her future. On the other hand, it might be that cells 𝑋 and 𝑌 are similar, such that each represents the presence of a single risk factor and there are effectively only three risk levels. The three levels of risk are characterized by one risk factor, two risk factors, or none.

When there are several prognostic factors, some of them possibly measured on a continuous scale rather than being dichotomous or categorical, these ideas can be extended using the types of models discussed above. Rather than simply counting the number of factors, the actual variable values can be combined in a weighted sum to calculate risk. The resulting values can then be ranked and categorized. The best way to illustrate the general procedure is with an actual example, discussed in detail in the next section.

Example

In patients with HIV infection it is useful to have simple methods by which individual risk of clinical AIDS or death can be inferred. Many clinical parameters are known to carry prognostic information about time to AIDS and survival, including platelet count, hemoglobin, and symptoms [627]. Here I illustrate a risk set classification that is a combination of clinical and laboratory values for HIV positive patients from the Multicenter AIDS Cohort Study (MACS) [1208]. The MACS is a prospective study of the natural history of HIV infection among homosexual and bisexual men in the United States. Details of the MACS study design and methods are described elsewhere [824].
From April 1984 to March 1985, 4954 men were enrolled in four metropolitan areas: Baltimore/Washington, DC, Chicago, Pittsburgh, and Los Angeles. There were 1809 HIV seropositive men and 418 seroconverters in the MACS cohort. This risk set analysis included all HIV-1 seroprevalent men and seroconverters who used zidovudine prior to developing AIDS, and who had CD4+ lymphocyte data at study visits immediately before and after the first reported use of zidovudine. The total number of individuals meeting these criteria was 747. Prognostic factors measured at baseline included age, CD4+ lymphocyte count, CD8+ lymphocyte count, hemoglobin, platelets, clinical symptoms and signs, and white blood cell count. The clinical symptoms and signs of interest were fever (greater than 37.9 °C) for more than two weeks, oral candidiasis, diarrhea for more than four weeks, weight loss of 4.5 kg, oral hairy leukoplakia, and herpes zoster. There were 216 AIDS cases among individuals treated with zidovudine prior to the cutoff date for this analysis, and 165 deaths.

The initial step was to build a multiple regression model for prediction of time to AIDS or time to death using the proportional hazards regression model. After model building, the log relative risk for each individual patient was calculated according to the equation

\[ r_i = \sum_{j=1}^{p} \hat{\beta}_j X_{ij} , \qquad (21.1) \]
where 𝑟𝑖 is the aggregate relative risk for the 𝑖th patient, 𝛽̂𝑗 is the estimated coefficient for the 𝑗th covariate from the regression model, and 𝑋𝑖𝑗 is the value of the 𝑗th covariate in the 𝑖th patient. Data values were then ordered from smallest to largest, based on the value of 𝑟𝑖. Following this ordering based on aggregate estimated relative risk, individuals were grouped or categorized into discrete risk sets. The cut points used for categorization and the number of groups were chosen empirically. That is, they were chosen so that the groups formed would display the full range of prognoses in the cohort. Usually, this can be accomplished with three to five risk groups.

Following the formation of risk groups, survival curves were drawn for individuals in each group. Finally, predicted survival curves were generated from the proportional hazards model and drawn superimposed on the observed curves as an informal test of goodness of fit. The predicted survival curves in the risk groups cannot be obtained directly from the original covariate regression. Instead, a second regression must be fit using a single covariate representing risk group. Because the risk group classification is based only on covariates from the first regression, the second regression is a valid illustration of model performance based on the original predictors.

The units for expressing covariate values are shown in Table 21.5. The scale of measurement is important because it directly affects the interpretation of risk ratio estimates. For example, symptoms were coded as present or absent. Consequently, the risk ratio for symptoms is interpreted as the increase in risk associated with any symptoms. In contrast, change in CD4+ lymphocyte count was measured in units of 100 cells, so that its relative risk is per 100 cell increase. For the time to AIDS outcome, the final regression model shows significant effects for CD4 count, platelets, hemoglobin, and symptoms (Table 21.6). When risk set assignments are based on this model, the resulting time-to-event curves are shown in Figure 21.2.
TABLE 21.5 Coding of Covariate Values for AIDS Prognostic Factor Analysis

Variable Name     Measurement Units
CD4 number        100 cells
CD8 number        100 cells
Neopterin         mg/dl
Microglobulin     mg/dl
Platelets         25,000 cells
Hemoglobin        gm/dl
Symptoms          1 = yes, 0 = no
For time to AIDS, the first risk set was formed from individuals with the lowest risk of AIDS. This group constitutes 35% of the population. For the last risk set, the 15% with the highest risk was chosen. These individuals developed AIDS very rapidly, virtually all within the first 1.5 years after beginning zidovudine. The remaining risk sets were each formed from 25% fractions of the population. They indicate levels of risk intermediate between the most favorable and least favorable subsets. For the baseline variables model, equation (21.1) becomes

\[ r_i = -0.373 \times \mathrm{CD4} + 0.020 \times \mathrm{CD8} - 0.071 \times \mathrm{Platelets} - 0.168 \times \mathrm{Hemoglobin} + 0.295 \times \mathrm{Symptoms} . \qquad (21.2) \]
The cut points used for these risk sets were: low risk, 𝑟𝑖 < −4.25; low-intermediate risk, −4.25 ≤ 𝑟𝑖 < −3.70; high-intermediate risk, −3.70 ≤ 𝑟𝑖 < −3.13; and high risk, 𝑟𝑖 ≥ −3.13.

Regression models for survival time are shown in Table 21.7. For this outcome, the proportions of individuals assigned to each risk set are slightly different than for time to AIDS (Fig. 21.2). This was done to separate the highest and lowest risk subsets as much as possible. The first risk set for the baseline variables model was formed from the 30% of individuals with the lowest risk of death. For the last risk set, the 15% with the highest risk was chosen. These individuals had the poorest survival, living generally less than 1.5 years after beginning AZT. The remaining risk sets were each formed from 30 to 25% fractions of the population. They indicate levels of risk intermediate between the most favorable and least favorable subsets. For the baseline variables model, equation (21.1) becomes

\[ r_i = -0.440 \times \mathrm{CD4} + 0.034 \times \mathrm{CD8} - 0.070 \times \mathrm{Platelets} - 0.182 \times \mathrm{Hemoglobin} + 0.030 \times \mathrm{Symptoms} . \qquad (21.3) \]
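A compact sketch of this classification rule, using the published time-to-AIDS coefficients of equation (21.2) and the cut points quoted above, but with hypothetical patient values and column names of my own choosing:

```python
import numpy as np
import pandas as pd

coef = {"cd4": -0.373, "cd8": 0.020, "platelets": -0.071,
        "hemoglobin": -0.168, "symptoms": 0.295}

def risk_score(df):
    """Aggregate log relative risk r_i = sum_j beta_j * x_ij (equation 21.1)."""
    return sum(b * df[name] for name, b in coef.items())

def risk_group(r):
    """Categorize scores with the time-to-AIDS cut points quoted in the text."""
    bins = [-np.inf, -4.25, -3.70, -3.13, np.inf]
    labels = ["low", "low-intermediate", "high-intermediate", "high"]
    return pd.cut(r, bins=bins, labels=labels, right=False)

# Hypothetical covariate values (CD4/CD8 in 100s of cells, platelets in 25,000s,
# hemoglobin in gm/dl, symptoms coded 0/1), per the units in Table 21.5.
patients = pd.DataFrame({"cd4": [6.0, 1.5], "cd8": [9.0, 12.0],
                         "platelets": [8.0, 4.0], "hemoglobin": [15.0, 11.0],
                         "symptoms": [0, 1]})
r = risk_score(patients)
print(pd.DataFrame({"r_i": r, "risk set": risk_group(r)}))
```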
TABLE 21.6 Proportional Hazards Multiple Regressions for Time to AIDS

Variable      Relative Risk    95% CI        P-Value
CD4           0.69             0.63–0.75     0.0001
CD8           1.02             1.00–1.04     0.04
Platelets     0.93             0.89–0.97     0.0009
Hemoglobin    0.85             0.78–0.92     0.0001
Symptoms      1.34             1.08–1.67     0.008
FIGURE 21.2 Observed and predicted survival of risk sets in the MACS cohort.
The cut points used for the risk sets were: low risk, 𝑟𝑖 < −4.73; low-intermediate risk, −4.73 ≤ 𝑟𝑖 ≤ −3.95; high-intermediate risk, −3.95 < 𝑟𝑖 ≤ −3.29; and high risk, 𝑟𝑖 > −3.29.

The utility of these results lies in our ability to classify individuals into one of the risk groups on the basis of covariate values. This is fairly easily accomplished using the calculation defined by the linear combination in equation (21.2) or (21.3). A particular risk set is heterogeneous with respect to individual covariate values, but homogeneous with respect to risk. Therefore, the character of a risk set does not have a simple interpretation in terms of covariates, but it does have a simple clinical interpretation, that is, increased or decreased risk.
21.2.8 Power and Sample Size for PFAs
Sometimes we can conduct PFAs on a surplus of data and would like to know how few observations can be used to meet our objectives.

TABLE 21.7 Proportional Hazards Multiple Regressions for Time to Death

Variable      Relative Risk    95% CI        P-Value
CD4           0.64             0.59–0.71     0.0001
CD8           1.03             1.01–1.06     0.0009
Platelets     0.93             0.89–0.98     0.003
Hemoglobin    0.83             0.76–0.91     0.0001
Symptoms      1.35             1.08–1.69     0.01
Other times we wish to know in advance how large a covariate effect must be to yield statistical significance. Both of these questions can be addressed by methods for determining power and sample size in PFAs.

Calculating power for PFAs is difficult in general, but a few guidelines can be given. Suppose that we are studying a single binary predictor, denoted by 𝑋, in a time-to-event analysis analogous to the effect of treatment in a planned trial. The distribution of 𝑋 = 0 and 𝑋 = 1 in the study population will not be equal as it might have been if it actually represented an assignment to a randomized treatment group. Instead, 𝑋 is probably going to be unbalanced. Furthermore, we are likely to be interested in the effect of 𝑋 adjusted for other variables in the regression. Under these assumptions, a sample size formula such as equation (7.21) might be used to determine the approximate number of observations needed in a proportional hazards regression model. To detect an adjusted hazard ratio of Δ = 1.5 attributable to a binary covariate with 20% of the population having 𝑋 = 1, with 90% power and using a type I error of 5%, we calculate

\[ D = \frac{(r+1)^2}{r} \cdot \frac{(Z_\alpha + Z_\beta)^2}{[\log(\Delta)]^2} = \frac{(4+1)^2}{4} \cdot \frac{(1.96 + 1.282)^2}{[\log(1.5)]^2} = 240 . \]
Therefore, we have a reasonable chance of detecting an adjusted risk ratio of 1.5 in a PFA using 240 subjects where 20% of them have 𝑋 = 1 and 80% have 𝑋 = 0. More precise methods for performing these types of calculations are available in statistical packages. However, the general method is complex and not likely to be used frequently by clinical trial investigators. Consequently, it is not discussed in detail here.

In any case, prognostic factor analyses most often utilize all of the data that bear on a particular question or analysis. Missing variables, incomplete records, interest in subsets, and highly asymmetric distributions of variable values tend to limit precision even when there appears to be a large database on which to perform analyses. For survival and time-to-event outcomes, the number of events is frequently the most restrictive characteristic of the data.
21.3 ADJUSTED ANALYSES OF COMPARATIVE TRIALS
Covariates are also important purveyors of ancillary information in designed clinical trials. They facilitate validating the randomization, allow improved prognostication, can generate or test new hypotheses because of their associations with each other and with outcome, and may be used to improve, or reduce the variance of, estimates of treatment effect. The possibility that estimates of treatment effect can be influenced or invalidated by covariate imbalance is one of the main reasons for studying the results of adjusted analyses. However, analysis of covariates and treatment effects simultaneously in a clinical trial is typically not a well-conditioned exercise practically or inferentially and should be considered with care [506]. In essence, covariate adjustment is a type of subset analysis, and the tests generated contribute to problems of multiplicity (Section 20.8). This problem can be made even worse when it becomes necessary to explore interactions. The number of possible combinations of factors, and therefore the potential number of significance tests, increases dramatically with the number of covariates.
Not all clinical trial statisticians agree on the need for adjusted analyses in comparative clinical trials. From a theoretical perspective, randomization and proper counting and analysis guarantee unbiasedness and the correctness of type I error levels, even in the presence of chance imbalances in prognostic factors. In a randomized experiment, variability may still influence the results. The distinction is between the expectation of a random process and its realization. Adjustment can increase the precision of estimated treatment effects or control for the effects of unbalanced prognostic factors. Also, the difference in estimated treatment effects before and after covariate adjustment often conveys useful biological information. Thus, although not necessary for valid tests of the treatment effect, adjusted analyses may facilitate other goals of randomized clinical trials.

Suppose that the analyst thought that the data from an RCT arose from an observational study rather than from an experiment design. The analysis would likely proceed much as it was sketched above for PFAs. Other investigators, knowing the true origins of the data, might not perform covariate adjustment. The conclusions of the two analyses could be different. Although we would prefer the analysis that most closely follows the paradigm of the study, it is not guaranteed to yield the most efficient or informative estimate of the treatment effect. It seems we have little choice but to explore the consequences of covariate adjustment and emphasize the results that are most consistent with other knowledge.
21.3.1 What Should We Adjust For?
There are two sources of information to help in deciding which covariates should be used for adjusted analyses: the data themselves and biological knowledge from outside the trial. An excellent discussion of using the observed data to decide which covariates should be studied in adjusted analyses is given by Beach and Meier [126]. They conclude, on the basis of real-world examples and statistical simulations, that only covariates distributed differently in the treatment groups (disparity) and associated with a different distribution of outcomes across the covariate levels (influence) are likely candidates for adjusted analyses. Specifically, the product of 𝑍 statistics for influence and disparity appears to govern the need for covariate adjustment, at least in simple cases; a schematic version of this calculation is sketched after the list below. Investigators might use these ideas in adjusting estimated treatment effects for prognostic factors that meet one of the following criteria:

1. Factors that (by chance) are unbalanced between the treatment groups.
2. Factors that are strongly associated with the outcome, whether unbalanced or not.
3. Factors that must be shown not to have artificially created the treatment effect.
4. Factors known to be clinically important, whose modifying effects should be illustrated, quantified, or discounted.

The philosophy underlying adjustment in these circumstances is to be certain that the observed treatment effect is independent of the factors. The quantitative measures of clinical interest after adjustment are changes in relative risk parameters rather than changes in p-values. Therefore, some adjusted analyses will include statistically nonsignificant variables but will still be informative in a broad context.
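The disparity-and-influence idea can be caricatured in a few lines. The sketch below is a simplification of my own, not the Beach and Meier procedure itself: disparity is measured by a two-sample Z statistic comparing the covariate between arms, influence by a Wald Z statistic from a regression of the outcome on the covariate, and the two are simply multiplied. All data and effect sizes are invented.

```python
import numpy as np
import statsmodels.api as sm

def disparity_z(covariate, treatment):
    """Two-sample Z statistic comparing covariate means between treatment arms."""
    x0, x1 = covariate[treatment == 0], covariate[treatment == 1]
    se = np.sqrt(x0.var(ddof=1) / len(x0) + x1.var(ddof=1) / len(x1))
    return (x1.mean() - x0.mean()) / se

def influence_z(covariate, outcome):
    """Wald Z statistic for the covariate-outcome association."""
    fit = sm.OLS(outcome, sm.add_constant(covariate)).fit()
    return fit.params[1] / fit.bse[1]

rng = np.random.default_rng(1)
n = 300
treatment = rng.integers(0, 2, n)
covariate = rng.normal(size=n) + 0.25 * treatment   # mild chance imbalance
outcome = covariate + rng.normal(size=n)             # influential covariate

z_d = disparity_z(covariate, treatment)
z_i = influence_z(covariate, outcome)
print("disparity Z:", round(z_d, 2), "influence Z:", round(z_i, 2))
# A large product of the two suggests that covariate adjustment may matter
print("product:", round(z_d * z_i, 2))
```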
One should not adjust estimated treatment effects for all of the covariates that are typically measured in comparative trials. Not only are large numbers of uninteresting covariates often recorded, but also model building in that circumstance can produce spurious results due to multiplicity and collinearities. However, clinical knowledge, preclinical data, and findings in the trial data can contribute to covariate adjustment that can improve inferences from comparative trials and generate new hypotheses.
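The multiplicity hazard is easy to demonstrate. The simulation below is my own caricature, not an analysis from the text: it screens 20 pure-noise covariates against a random outcome and counts how often at least one appears "significant" at the 0.05 level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_covariates, n_screens = 200, 20, 2000
false_hits = 0

for _ in range(n_screens):
    outcome = rng.normal(size=n)
    covariates = rng.normal(size=(n, n_covariates))   # pure noise
    pvals = [stats.pearsonr(covariates[:, j], outcome)[1]
             for j in range(n_covariates)]
    if min(pvals) < 0.05:
        false_hits += 1

# With 20 independent noise covariates, about 1 - 0.95**20, or 64%,
# of screens flag at least one spuriously "significant" covariate
print(round(false_hits / n_screens, 2))
```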
21.3.2 What Can Happen?
There are many possible qualitative results when treatment effects are adjusted for covariates. Some hypothetical examples are shown in Table 21.8, where I have assumed that balance of the prognostic factors in the treatment groups is immaterial. Table 21.8 is constructed for proportional hazards models, but the behavior discussed here could be seen with other analytic models. All prognostic factors in Table 21.8 are dichotomous. Models 1–3 are univariable analyses of treatment or covariates. All effect estimates are statistically significant at conventional levels.

Model 4 gives the treatment hazard ratio (HR) adjusted for sex. There is an increase in the size and significance of the treatment HR, typical of circumstances where the covariate is important both in its own right and to the estimate of the treatment effect. Adjustment usually pushes the treatment HR toward the null, the opposite of what model 4 indicates. If the attenuation were severe and the HR were rendered nonsignificant, it might be taken as evidence that the treatment effect was spurious, namely, due to an imbalance in a strong covariate. A shift in the adjusted HR away from the null, as in model 4, seems to occur infrequently in practice.

Model 5 shows a different sort of result. When adjusted for risk level, the treatment HR is virtually unaffected. The effects of treatment and that covariate are nearly independent of one another.
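Adjusted and unadjusted treatment hazard ratios of the kind summarized in Table 21.8 come directly from standard software. The sketch below uses the Python lifelines package on invented data; the variable names, effect sizes, and numerical results are hypothetical and are not meant to reproduce the table.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 400

# Invented trial data: randomized treatment and a dichotomous risk factor
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "risk_level": rng.integers(0, 2, n),
})
rate = 0.1 * np.exp(0.5 * df["treatment"] + 0.7 * df["risk_level"])
df["time"] = rng.exponential(1.0 / rate)
df["event"] = 1   # no censoring, for simplicity

# Unadjusted treatment effect
unadjusted = CoxPHFitter().fit(df[["time", "event", "treatment"]],
                               duration_col="time", event_col="event")

# Treatment effect adjusted for the covariate
adjusted = CoxPHFitter().fit(df, duration_col="time", event_col="event")

print(unadjusted.hazard_ratios_)   # treatment HR alone
print(adjusted.hazard_ratios_)     # treatment HR adjusted for risk_level
```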
TABLE 21.8 Hypothetical Proportional Hazards Regression Models Illustrating Adjusted Treatment Effects

Model   Variable          Hazard Ratio   95% Confidence Limits   P-Value
1       Treatment group   1.74           1.51–2.01