Hybrid Frequentist/Bayesian Power and Bayesian Power in Planning Clinical Trials
Chapman & Hall/CRC Biostatistics Series
Series Editors
Shein-Chung Chow, Duke University School of Medicine, USA
Byron Jones, Novartis Pharma AG, Switzerland
Jen-pei Liu, National Taiwan University, Taiwan
Karl E. Peace, Georgia Southern University, USA
Bruce W. Turnbull, Cornell University, USA
Recently Published Titles
Statistical Design and Analysis of Clinical Trials: Principles and Methods, Second Edition
Weichung Joe Shih and Joseph Aisner
Confidence Intervals for Discrete Data in Clinical Research
Vivek Pradhan, Ashis Gangopadhyay, Sandeep Menon, Cynthia Basu and Tathagata Banerjee
Statistical Thinking in Clinical Trials
Michael A. Proschan
Simultaneous Global New Drug Development: Multi-Regional Clinical Trials after ICH E17
Edited by Gang Li, Bruce Binkowitz, William Wang, Hui Quan and Josh Chen
Quantitative Methodologies and Process for Safety Monitoring and Ongoing Benefit Risk Evaluation
Edited by William Wang, Melvin Munsaka, James Buchanan and Judy Li
Statistical Methods for Mediation, Confounding and Moderation Analysis Using R and SAS
Qingzhao Yu and Bin Li
Hybrid Frequentist/Bayesian Power and Bayesian Power in Planning Clinical Trials
Andrew P. Grieve
For more information about this series, please visit: https://www.routledge.com/Chapman--Hall-CRC-Biostatistics-Series/book-series/CHBIOSTATIS
Hybrid Frequentist/Bayesian Power and Bayesian Power in Planning Clinical Trials
Andrew P. Grieve
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
CRC Press is an imprint of Taylor & Francis Group, LLC
© 2022 Andrew P. Grieve
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.
ISBN: 978-1-032-11129-2 (hbk)
ISBN: 978-1-032-11131-5 (pbk)
ISBN: 978-1-003-21853-1 (ebk)
DOI: 10.1201/9781003218531
Typeset in Palatino by SPi Technologies India Pvt Ltd (Straive)
For Luitje
Contents
List of Figures xi
List of Tables xiii
Preface xv
Acknowledgements xix
Author xxi
List of Acronyms xxiii
1. Introduction 1
2. All Power Is Conditional Unless It's Absolute 9
  2.1 Introduction 9
  2.2 Expected, Average and Predicted Power 10
    2.2.1 Averaging Conditional Power with Respect to the Prior – Analytic Calculation 11
    2.2.2 Calculating the Probability of Achieving "Significance" – Predictive Power 14
    2.2.3 Averaging Conditional Power with Respect to the Prior – Numerical Integration 16
    2.2.4 Averaging Conditional Power with Respect to the Prior – Simulation 17
  2.3 Bounds on Average Power 18
  2.4 Average Power for a Robust Prior 21
  2.5 Decomposition of Average Power 24
  2.6 Average Power – Variance Estimated 29
    2.6.1 Bound on Average Power when the Variance Is Estimated 31
3. Assurance 33
  3.1 Introduction 33
  3.2 Basic Considerations 33
  3.3 Sample Size for a Given Average Power/Assurance 34
  3.4 Sample Size for a Given Normalised Assurance 37
  3.5 Applying Assurance to a Series of Studies 38
  3.6 A Single Interim Analysis in A Clinical Trial 44
  3.7 Non-Inferiority Trials 49
    3.7.1 Fixed Margin 52
    3.7.2 Synthesis Method 54
    3.7.3 Bayesian Methods 57
4. Average Power in Non-Normal Settings 59
  4.1 Average Power Using a Truncated-Normal Prior 59
  4.2 Average Power When the Variance Is Unknown: (a) Conditional on a Fixed Treatment Effect 60
  4.3 Average Power When the Variance Is Unknown: (b) Joint Prior on Treatment Effect and Variance 61
  4.4 Average Power When the Response Is Binary 64
  4.5 Illustrating the Average Power Bound for a Binary Endpoint 68
  4.6 Average Power in a Survival Context 69
    4.6.1 An Asymptotic Approach to Determining the AP 69
    4.6.2 The Average Power for the Comparison of One Parameter Exponential Distributions 71
    4.6.3 A Generalised Approach to Simulation of Assurance for Survival Models 72
  Note 73
5. Bayesian Power 75
  5.1 Introduction 75
  5.2 Bayesian Power 75
  5.3 Sample Size for a Given Bayesian Power 76
  5.4 Bound on Bayesian Power 77
  5.5 Sample Size for a Given Normalised Bayesian Power 79
  5.6 Bayesian Power When the Response Is Binary 80
  5.7 Posterior Conditional Success Distributions 81
    5.7.1 Posterior Conditional Success Distributions – Success Defined By Significance 82
    5.7.2 Posterior Conditional Success Distributions – Success Defined By a Bayesian Posterior Probability 84
    5.7.3 Use of Simulation to Generate Samples from the Posterior Conditional Success and Failure Distributions 85
    5.7.4 Use of the Posterior Conditional Success and Failure Distributions to Investigate Selection Bias 86
6. Prior Distributions of Power and Sample Size 87
  6.1 Introduction 87
  6.2 Prior Distribution of Study Power – Known Variance 88
  6.3 Prior Distribution of Study Power – Treatment Effect Fixed, Uncertain Variance 92
  6.4 Prior Distribution of Study Sample Size – Variance Known 94
  6.5 Prior Distribution of Sample Size – Treatment Effect Fixed, Uncertain Variance 96
  6.6 Prior Distribution of Study Power and Sample Size – Uncertain Treatment Effect and Variance 98
  6.7 Loss Functions and Summaries of Prior Distributions 99
7. Interim Predictions 101
  7.1 Introduction 101
  7.2 Conditional and Predictive Power 103
  7.3 Stopping for Futility Based on Predictive Probability 109
  7.4 "Proper Bayesian" Predictive Power 111
8. Case Studies in Simulation 113
  8.1 Introduction 113
  8.2 Case Study 1 – Proportional Odds Primary Endpoint 114
    8.2.1 Background 114
    8.2.2 The Wilcoxon Test for Ordered Categorical Data 115
    8.2.3 Applying Conditional Power to the Proportional Odds Wilcoxon Test 117
    8.2.4 Statistical Approach to Control Type I Error 118
    8.2.5 Simulation Set-Up 119
    8.2.6 Simulation Results 120
  8.3 Case Study 2 – Unplanned Interim Analysis 120
    8.3.1 Background 121
    8.3.2 Interim Data 121
    8.3.3 Model for Prediction 121
9. Decision Criteria in Proof-of-Concept Trials 127
  9.1 Introduction 127
  9.2 General Decision Criteria for Early Phase Studies 127
  9.3 Known Variance Case 128
  9.4 Known Variance Case – Generalised Assurance 134
  9.5 Bounds on Unconditional Decision Probabilities for Multiple Decision Criteria 135
  9.6 Bayesian Approach to Multiple Decision Criteria 136
  9.7 Posterior Conditional Distributions with Multiple Decision Criteria 140
  9.8 Estimated Variance Case 143
  9.9 Estimated Variance Case – Generalised Assurance 147
  9.10 Discussion 148
10. Surety and Assurance in Estimation 149
  10.1 Introduction 149
  10.2 An Alternative to Power in Sample Size Determination 151
  10.3 Should the Confidence Interval Width Be the Sole Determinant of Sample Size? 153
  10.4 Unconditional Sample Sizing Based on CI Width 156
    10.4.1 Modified Cook Algorithm 158
    10.4.2 Harris et al. (1948) Algorithm 158
  10.5 A Fiducial Interpretation of (10.14) 159
  Note 160
References 161
Appendix 1 Evaluation of a Double Normal Integral 171
Appendix 2 Besag's Candidate Formula 173
Index 175
Figures
1.1 Optimised Type I/Type II errors as a function of the NCP, if Type I errors are 4 times more important than Type II errors 4
2.1 Relationship between average power and planned power as a function of the prior information fraction (f0) 13
2.2 Conditional power function and prior density of the treatment difference for known variance: Example 2.1 14
2.3 Illustration of the von Neumann accept/reject approach to generating a random sample from an unnormalised density: (a) Generation of a random uniform value X (between MIN and MAX) and a random uniform value Y between 0 and M; (b) Random generated values which are accepted (Green) and rejected (Red) 22
2.4 Mixture approximation to the prior distribution of the log(hazard ratio) generated from the elicited prior density shown by Crisp et al. (2018) 22
2.5 Random sample from the joint distribution (2.11) with assumptions based on Example 2.1 25
3.1 Boundary plot of a group sequential design with a single interim allowing stopping for efficacy or futility 44
3.2 Sample size per arm (n1) as a function of the treatment effect (δST) to achieve a power of 90% with the assumptions of Example 3.5 56
4.1 Conditional power function and prior density of the treatment difference for unknown variance: Example 4.2, restless leg syndrome RCT 63
4.2 Contours of equal predictive probability for the pairs of responses (rC, rT) and those which give significant results: (a) Asymptotic χ2 test and (b) FET 67
5.1 Bayesian power as a function of sample size per arm for the assumptions of Example 5.2 81
5.2 Prior posterior conditional distributions of the treatment effect given either a significant outcome of the future study or a non-significant outcome 83
5.3 Prior posterior conditional distributions of the treatment effect given either a successful outcome of the future study or a non-successful outcome (posterior-based) 85
6.1 Prior power: (a) Prior density of power (n0 = 2, 100, 200, 500); (b) Prior power CDF (n0 = 2, 100, 200, 500) 90
6.2 Prior power: (a) Prior density of power based on variance prior; (b) Power CDF based on variance prior 94
6.3 Prior sample size: (a) Prior density of sample size (n1) based on treatment difference prior; (b) Power CDF based on treatment difference prior 96
6.4 Prior sample size: (a) Prior density of sample size (n1) based on the variance prior; (b) Power CDF based on variance prior 97
7.1 Relationship between predictive and conditional power as a function of the information fraction (f1) 105
7.2 Relationship between predictive power (8.4) and information fraction (f1) as a function of ZINT 107
8.1 Posterior distributions of the (a) Proportion of patients with resistant pathogens and (b) Treatment effect by pathogen types 123
8.2 Predictive inference: (a) predictive distribution of treatment effects and (b) cumulative predictive distribution of 95% lower confidence limit 124
9.1 A plot of intervals for the observed treatment effect (δˆ) giving rise to the three possible decisions 130
9.2 Operating characteristics of the design: n1 = 20, σ = 2, δLRV = 0, α0 = 0.025, δTV = 1.5, α1 = 0.3 with three potential decisions (stacked probabilities) 131
9.3 Operating characteristics of the design: n1 = 20, σ = 2, δLRV = 0, α0 = 0.025, δTV = 1.5, α1 = 0.3 with three potential decisions 132
9.4 Operating characteristics of the design: n1 = 20, σ = 2, δLRV = 0, α0 = 0.025, δTV = 1.5, α1 = 0.3 with three potential decisions 133
9.5 Operating characteristics of four design scenarios each with three potential decisions: (a) n1 = 20, σ = 2, δLRV = 0, α0 = 0.025, δTV = 1.5, α1 = 0.3. (b) n1 = 10, σ = 2, δLRV = 0, α0 = 0.025, δTV = 0.15, α1 = 0.3. (c) n1 = 25, σ = 3.2, δLRV = 0, α0 = 0.025, δTV = 1.75, α1 = 0.4. (d) n1 = 10, σ = 2, δLRV = 0, α0 = 0.025, δTV = 1.5, α1 = 0.1 133
9.6 Ternary plot of PGO,U, PNOGO,U and PPAUSE,U as a function of n1, the sample size per arm, based on (9.6)–(9.8) 136
9.7 Ternary plot of PGO,U, PNOGO,U and PPAUSE,U as a function of n1, the sample size per arm, based on Table 9.2 139
9.8 Prior posterior conditional distributions of the treatment effect given any of the decisions NOGO, PAUSE, GO 141
9.9 Decision quadrants expressed in terms of criteria (9.15) and (9.16) 143
10.1 Integration regions to define the conditional probability that the width of the CI is less than w0 given that it includes the true parameter value (ϕI(w0)) 154
A.1 Integration region associated with (A1) 171
Tables
1.1 Normal Approximations to Standard Likelihoods 2
2.1 Numerical Estimate of AP Based Upon a Mid-Point Method Applied to Example 2.1 16
3.1 Sample Sizes, Per Arm, to Give a Nominal Power Based on Average Power (3.7) 36
3.2 Sample Sizes, Per Arm, to Give a Specified Normalised Assurance (3.9/3.7) 38
3.3 Boundaries for Futility and Efficacy for a Group Design with O'Brien–Fleming Efficacy Boundaries 47
3.4 Efficacy Boundaries for a Group Design with O'Brien–Fleming Efficacy Boundaries 48
3.5 Assurance for a Non-Inferiority Trial Using a Synthesis Approach Based on (3.28) with the Assumptions of Exercise 5 for a Range of Prior Sample Sizes Per Arm 57
4.1 Data from a Trial Comparing an Active Treatment (T) with a Control (C) 64
4.2 Prior Probability (P(πT > πC)) of Treatment Superiority as a Function of αC and αT given αC + βC = αT + βT = 10 69
4.3 Average Powers for a Range of Prior Number of Events for the HCC Trial of Example 4.5 71
5.1 Sample Sizes, Per Arm, to Give a Nominal Power Based on Bayesian Power (5.7) 77
5.2 Sample Sizes, Per Arm, to Give a Specified Normalised Bayesian Power (5.9/5.7) 79
5.3 Prior Posterior Conditional Probabilities of the Treatment Effect Given Significance/Non-Significance of the Study 83
5.4 Prior Posterior Conditional Probabilities of the Treatment Effect Given Success/Failure of the Study 85
8.1 Configuration of Data in a Parallel Group Study of Placebo Compared to Active in Which the Primary Endpoint Is Multinomial 115
8.2 Simulation Results for an Adaptive Design with an Interim Which Includes Both a Test for Futility and Sample Size Re-Estimation (Scenarios A and D) 119
8.3 Interim Data: Eradication Rates in 57 Patients in the ME Population by Pathogen Type 121
9.1 Study Outcomes and Associated Decisions in a Trial with Multiple Decision Criteria 128
9.2 Bayesian Probabilities of GO, NOGO and PAUSE Based on Two Conditional Predictive Distributions and an Unconditional Predictive Distribution 138
9.3 Prior Posterior Conditional Probabilities of Exceeding a Treatment Effect Cut-Off Given One of Three Decisions NOGO, PAUSE or GO 142
9.4 Comparison of Numerical Approaches to Calculating the Probabilities of GO, NOGO and PAUSE 147
10.1 Expected Width of the 95% Confidence Interval for a Range of Sample Sizes 150
10.2 Sample Sizes Per Arm to Give a Standardised Half-Width (w0/(2σ)) of the 95% CI with Surety (1 − ψ) 152
10.3 ϕ(w0) and ϕI(w0) as a Function of Sample Size for a 95% CI with a Targeted Standardised Half-Width of 0.4 155
Preface
It is not always easy to discern what event, or perhaps which person or group of people, is responsible for a researcher beginning a course of research. In this instance, it is clear to me that my interest in the issues covered in this book goes back to my early research into Bayesian methods and prediction approaches when I was working in Basel, Switzerland in the Mathematische Applikationen (Mathematical Applications) section of Ciba-Geigy's Wissenschaftliches Rechenzentrum (Scientific Computing Centre). In our early investigations into the use of Bayesian methods in pharmaceutical R&D, we had been focused on bioequivalence testing and by 1984, we were interested in developing more efficient bioequivalence designs. At the time, it was well known that small, single bioavailability studies were unlikely to be able to demonstrate the bioequivalence of two bioequivalent formulations, and we were therefore interested in developing strategies which, while maintaining small numbers of subjects, increased the chance of demonstrating bioequivalence. One strategy involved two-stage designs. The essential features of the two-stage design we proposed were (Racine-Poon, 1987):

1) the choice of the (small) sample size for the first-stage crossover design;
2) calculation of the probability that the bioequivalence criterion, typically that the ratio of average areas under the plasma-concentration time curves (AUC) lies in a chosen range, is satisfied given the first-stage data;
3) calculation of the probability that, if the study is continued to a second stage with a given number of subjects, bioequivalence will be established with the totality of data;
4) calculation of the probability that the ratio of AUCs lies in the chosen range given the data from Stages 1 and 2.

The probability in (3) is a predictive probability with respect to the future data of the second stage of the design, which have not yet been collected.
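The flavour of the predictive probability in (3) can be illustrated by simulation under a deliberately simplified normal model for individual log(AUC) ratios with known standard deviation. This sketch is mine, not the method of Racine-Poon (1987), and every number in it (stage sizes, observed mean, SD, equivalence limits apart from the conventional 0.8–1.25 range) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented inputs (illustration only): stage-1 size, observed mean
# log(AUC ratio), known SD of an individual log-ratio, stage-2 size
n1, m1, sigma, n2 = 8, 0.05, 0.25, 12
lo, hi = np.log(0.8), np.log(1.25)   # conventional bioequivalence limits
z = 1.6449                           # 95% normal quantile (two one-sided 5% tests)

# Predictive distribution of the stage-2 mean given the stage-1 data
# (vague prior on the true mean): m2 ~ N(m1, sigma^2 * (1/n1 + 1/n2))
m2 = rng.normal(m1, sigma * np.sqrt(1 / n1 + 1 / n2), size=100_000)

# Pooled estimate from both stages and its 90% CI; bioequivalence is
# established if the whole interval lies inside (lo, hi)
m = (n1 * m1 + n2 * m2) / (n1 + n2)
half = z * sigma / np.sqrt(n1 + n2)
pred_prob = np.mean((m - half > lo) & (m + half < hi))
print(f"Predictive probability of establishing bioequivalence: {pred_prob:.2f}")
```

With these invented inputs the second stage has a high, but not certain, chance of clinching bioequivalence; varying n2 in the sketch shows the trade-off that motivates the two-stage design.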
A second area that I was involved with was the design and analysis of acute toxicity studies, not only for pharmaceutical but also for agrochemical products. At the time, the use of acute toxicity experiments to estimate the LD50 of a chemical substance was a topic of continuing controversy. There was general opposition from animal rights groups to their use, and there was declining confidence among toxicologists in the value of the LD50 as a measure of toxicity. It was my view that as the LD50 was required by regulatory authorities
as a measure of toxicity, then we as statisticians had a duty to analyse such experiments as efficiently as possible. This was a point of view to which Finney (1985) subscribed, and it is closely aligned with Fisher's grounds for developing the maximum likelihood estimate of the LD50 (Fisher Box, 1978, p. 272): "When a biologist believes there is information in an observation, it is up to the statistician to get it out". To this end, Racine et al. (1986) and Grieve (1988b) proposed a Bayesian approach aimed at providing answers to the specific questions raised by regulatory authorities.

Diazinon, a thiophosphoric acid ester, was developed in the early 1950s by Ciba-Geigy as an insecticide to control cockroaches, silverfish, ants and fleas. It had been developed to replace DDT which, as is well known, had been found to be such an environmental hazard that its use was banned for all purposes except for combating disease-vectors such as malaria-carrying insects. In 1985, a national regulatory authority, whose identity I will conceal so as not to embarrass them, claimed in contrast to results from earlier studies that a new formulation of Diazinon had an LD50 of the order of 200 mg/kg or less in rats. The earlier studies, which had provided evidence to support registration of the product, had given estimates of the LD50 in the range of 700–1050 mg/kg. Given the pressures to reduce the number of experimental animals, it was deemed inappropriate to carry out a full LD50 estimation to test this assertion, particularly as it would have been necessary to check all 10 batches of the new formulation in question. It was decided, therefore, to dose 10 rats with 200 mg/kg from each batch to test the claim. Based on data from three historical registration studies, the distributions of the number of deaths from 10 rats receiving 200 mg/kg were predicted, and the results from the new experiments were judged on the basis of these predictions, similar to Box's (1980) "predictive check".
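A predictive check of this kind can be sketched with a beta-binomial model: a Beta posterior for the death probability at 200 mg/kg, updated from the historical studies, yields the predictive distribution of deaths among 10 newly dosed rats. The pooled historical counts and the Jeffreys prior below are my own invented placeholders, not the actual Ciba-Geigy registration data:

```python
import numpy as np

# Hypothetical pooled historical counts at 200 mg/kg (illustration only)
deaths, tested = 0, 100
a, b = 0.5 + deaths, 0.5 + tested - deaths   # Jeffreys Beta(0.5, 0.5) prior

n = 10  # rats dosed from each batch of the new formulation
# Beta-binomial predictive probability of no deaths among n rats:
#   P(0 deaths) = prod_{j=0}^{n-1} (b + j) / (a + b + j)
j = np.arange(n)
p0 = np.prod((b + j) / (a + b + j))
print(f"Predictive P(at least one death out of {n}): {1 - p0:.3f}")
```

With these invented counts the predictive chance of any death in a batch of 10 is small, which is the kind of calculation that underpins the "less than 1/20" statement in the next paragraph.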
The results suggested that there was less than a 1/20 chance of seeing a single death out of 10 rats in each of the planned studies. In the event, there were no deaths amongst the 100 animals tested from the 10 batches and we therefore concluded that the new formulation of Diazinon did not have an LD50 of the order of 200 mg/kg or less. Whilst this predictive approach is appealing in terms of saving on the use of animals, one nonetheless needs to be sure that its application is appropriate. The applicability of the technique assumes that all the experiments, historical and current, were conducted in a similar fashion and under similar conditions. The importance of this assumption was underlined when it was subsequently revealed that the regulatory authority who had claimed that the LD50 of Diazinon had changed had used mice instead of rats.

In 1987, through colleagues in Ciba-Geigy Japan, I was invited to participate in a one-day satellite meeting on Biometry in Osaka, organised by the Japanese Region of the International Biometric Society as part of the 46th Session of the International Statistical Institute which was held in Tokyo. The title of my talk was "Some uses of predictive distributions in research" and
was based on four examples. In addition to the two-stage bioequivalence and LD50 prediction applications, I included a predictive approach developed by Ciba-Geigy colleagues Murray Selwyn and Nancy Hall (1984) to assess bioequivalence for studies in which a standard and a new formulation are simultaneously administered, one being tagged with a radioactive isotope. The final topic was based on the then-unpublished work of Spiegelhalter and Freedman (1988), of which I was an invited discussant (Grieve, 1988a). They proposed two methods in which account was taken of current uncertainties in making decisions about the conduct of clinical trials. The first looked at determining the sample size in planning clinical trials. The second addressed the issue of monitoring clinical trials in a series of interim analyses. The defining aspect of these methods was that they were a hybrid of Bayesian and frequentist concepts. It is these hybrid concepts which are the focus of this book.
Acknowledgements
There are many people whom I must thank for their support in the work and ideas detailed in this book. I am grateful to Bruno Lecoutre for helpful insights into the Lambda-prime and K-prime distributions; to Robb Muirhead and Joe Eaton, thanks are due for confirming my tentative understanding of their general results on prior bounds on average power and assurance. Tony O'Hagan and John Stevens put me straight on aspects of their work on assurance. I am grateful to Blake McShane for answering my naïve questions about his work. Finally, Kevin Kunzmann kindly provided insight into his recent work on the decomposition of average power.

Shuyen Ho, a UCB colleague, provided the first results detailed in Section 10.5 based on my analysis of the simpler model shown in the development and discussion of Example 10.2. The development of the computational methods was in response to a discussion with Foteini Strimenopoulou, another UCB colleague. Thanks to them both. I am grateful to both Shuyen and Foteini and UCB colleagues Ros Walley, Margaret Jones, Daniel Meddings, Seth Seegobin and Lu Cui, who read earlier drafts of selected parts of the manuscript and whose comments led to a measurable improvement. I am particularly grateful to Kristian Brock who proofread the whole manuscript.

Ros provided the motivation for writing this manuscript. It was Ros's suggestion to take what was an overly long draft paper, a weakness of mine to which I readily admit, and convert it into this monograph, and she encouraged me to explore extensions to the basic theory.

Finally, for supporting me in my formative years exploring the application of Bayesian and predictive approaches in the context of pharmaceutical drug development while working in Ciba-Geigy's Wissenschaftliches Rechenzentrum (WRZ) in Basel, I will be eternally indebted to my former colleagues Amy Racine-Poon and Hugo Flühler, and Professor Adrian Smith.
Hugo was my boss during my time in Basel and encouraged my fledgling research career by providing me with topics to investigate and report back on to the department, amongst which were Bayesian approaches to linear calibration, the analysis of crossover trials from a Bayesian perspective and multi-dimensional scaling. It was Hugo who established a research project entitled "Lernen aus Erfahrung" (Learning from Experience) with the aim of exploring the use of Bayesian approaches across the whole of chemical/pharmaceutical research with an emphasis on drug development. Amy and I led the "Lernen aus Erfahrung" project. It was during this time that my interest in the issues covered in this monograph began. Amy, brought up statistically in the high cathedral of frequentism that was the Berkeley campus of the University of California, is the most brilliantly intuitive statistician with whom I have ever worked, and together we started a career-long interest in applying Bayesian approaches to pharmaceutical R&D. Adrian Smith was a consultant to WRZ, visiting us four times a year and providing us with mentorship and friendship. It was he who encouraged me in my early research career and became my PhD supervisor. Much later, I had the honour of serving as one of Adrian's Vice Presidents during his Presidency of the Royal Statistical Society.
Author
Andrew P. Grieve is a Statistical Research Fellow in the Centre of Excellence in Statistical Innovation at UCB Pharma. He is a former Chair of PSI (Statisticians in the Pharmaceutical Industry) and a past President of the Royal Statistical Society. He has over 45 years of experience as a biostatistician working in the pharmaceutical industry and academia and has been active in most areas of pharmaceutical R&D in which statistical methods and statisticians are intimately involved, including drug discovery, pre-clinical toxicology, pharmaceutical development, pharmacokinetics and pharmacodynamics, phases I–IV of clinical development, manufacturing, health economics and clinical operations.
Acronyms

AP      Average Power
AUC     Area Under the plasma-concentration time Curve
BP      Bayesian Power
CA      Conditional Assurance
CCDF    Complementary Cumulative Distribution Function
CDF     Cumulative Distribution Function
CFU     Colony-Forming Unit
CEP     Conditional Expected Power
CI      Confidence Interval
CP      Conditional Power
CrI     Credible Interval
cUTI    complicated Urinary Tract Infections
DBP     Diastolic Blood Pressure
EGOS-I  Extended Glasgow Outcome Scale
ES      Effect Size
FDA     Food and Drug Administration
FET     Fisher's Exact Test
GSD     Group Sequential Design
ICH     The International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use
IDMC    Independent Data Monitoring Committee
IRLS    International Restless Leg Scale
KOL     Key Opinion Leader
LR      Likelihood Ratio
LRV     Lower Reference Value
MACE    Major Adverse Cardiovascular Events
MCID    Minimally Clinically Important Difference
ME      Microbiologically Evaluable
MWT     Mann-Whitney/Wilcoxon Test
NA      Normalised Assurance
NBP     Normalised Bayesian Power
NCP     Non-Centrality Parameter
OC      Operating Characteristics
PCES    Power-Calibrated Effect Size
PDF     Probability Density Function
PFR12   Progression-Free Rate at 12 weeks
PO      Proportional Odds
POC     Proof of Concept
POS     Probability of Success
PP      Predictive Power
PPCF    Prior Posterior Conditional Failure
PPCS    Prior Posterior Conditional Success
QDM     Quantitative Decision-Making
R       Rejection – if CI excludes the null parameter value (null hypothesis)
RECIST  Response Evaluation Criteria in Solid Tumors
SHELF   Sheffield Elicitation Framework
SOC     Standard Of Care
SSR     Sample Size Re-estimation
STS     Soft Tissue Sarcoma
TAP     Truncated Average Power
TBI     Traumatic Brain Injury
TV      Target Value
TPP     Target Product Profile
U       Unconditional
UA      Unconditional Assurance
V       Validity – occurs if the CI includes the true parameter value
W       Width – occurs if the CI width is less than a pre-specified value
1 Introduction

The power of a test is both a conditional and predictive concept. Conditional on the assumed statistical model, the alternative hypothesis and other parameters of the model, which may or may not be nuisance parameters, the outcome of an experiment is predicted and the proportion of predictions giving a "significant result" at a predetermined level is the power. The power of an experiment will therefore only achieve its nominal level if the conditionally assumed model and alternative hypothesis are true. In this book, it is my intention to review unconditional (absolute) approaches, essentially Bayesian in construct, which can be used both in the planning phase of a clinical trial and during an interim analysis to adjust the sample size or in the extreme case to halt the study for futility. These approaches account for the current uncertainty in our knowledge of the treatment effect and have variously been called the strength, the expected power, the average power (AP), the predictive power (PP) and the assurance of the test. In the experimental sciences, received wisdom has it that there are four factors that drive a statistical design, some, but not all, of which are in the direct control of the experimenter. These are:

1. An appropriate null hypothesis of no treatment effect;
2. The probability of committing a type I error (rejecting the null hypothesis when true), α, the significance level;
3. An alternative hypothesis representing a treatment effect magnitude that is of interest or one that is expected to be achieved – variously termed the clinically relevant difference (Lachin, 1977) or minimally clinically important difference (MCID) (Chuang-Stein et al., 2011a);
4. A sample size per experimental arm to give the probability of committing a type II error (not rejecting the null hypothesis when the alternative hypothesis is true), β.
For illustrative purposes, throughout this book, we will assume a simple two-treatment experiment in which n1 patients are randomised to each arm. The difference in sample means, δ̂, is assumed to have a normal distribution

\hat{\delta} \sim N\left(\delta, \frac{2\sigma^2}{n_1}\right) \quad (1.1)

DOI: 10.1201/9781003218531-1
Hybrid Frequentist and Bayesian Power in Planning Clinical Trials
in which δ is the true difference in population means and we assume that the variance σ² is known. With these assumptions, the sample size, n1, to test the null hypothesis H0 : δ = 0 against the alternative hypothesis HA : δ = δ0 is usually given by the formula

n_1 = \frac{2\sigma^2\left(Z_{1-\alpha} + Z_{1-\beta}\right)^2}{\delta_0^2}. \quad (1.2)
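Formula (1.2) is straightforward to script. The sketch below is a minimal illustration in plain Python; the values σ = 8, δ0 = 4, one-sided α = 0.025 and β = 0.2 are those used in Example 2.1 later in the book, and the function name is my own.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(delta0, sigma, alpha=0.025, beta=0.2):
    """Per-arm sample size n1 from (1.2) for a one-sided test at level alpha."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha)  # Z_{1-alpha}
    z_beta = nd.inv_cdf(1 - beta)    # Z_{1-beta}
    return 2 * sigma**2 * (z_alpha + z_beta)**2 / delta0**2

# Values used in Example 2.1: sigma = 8, delta0 = 4
n1 = sample_size_per_arm(delta0=4, sigma=8)
print(ceil(n1))  # 63: round up to the next whole patient per arm
```

Rounding up gives the 63 patients per arm quoted for that example.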
In all cases, we will be considering a one-sided test, but this is not a restrictive assumption. Although this approach is not appropriate in all circumstances, the central limit theorem will often allow it to be used, at least following a suitable transformation. In clinical trials, Spiegelhalter et al. (2004) suggest it is reasonable to assume in many cases that after m "effective observations" relevant to a treatment effect ϕ on a suitable scale, a summary statistic y_m exists with approximately a normal likelihood

y_m \sim N\left(\phi, \frac{\tau^2}{m}\right).
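To make the survival row of Table 1.1 concrete: combining the approximation V(y_m) ≅ 4/m with a calculation of the same form as (1.2) yields the standard event-count formula m = 4(Z_{1−α} + Z_{1−β})²/θ², where θ is the log hazard ratio. The sketch below is my own illustration; the hazard ratio of 0.75 is an assumed value, not one taken from the book.

```python
from math import ceil, log
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.025, beta=0.2):
    """Total number of events m so that a test of theta = log(HR),
    with Var(y_m) approximately 4/m, attains the stated power."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha)
    z_beta = nd.inv_cdf(1 - beta)
    theta = log(hazard_ratio)  # treatment effect on the log hazard-ratio scale
    return 4 * (z_alpha + z_beta)**2 / theta**2

m = required_events(0.75)  # hazard ratio of 0.75 is an illustrative assumption
print(ceil(m))
```

For a hazard ratio of 0.75 at one-sided α = 0.025 and 80% power this gives 380 events in total.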
Table 1.1 illustrates four examples and shows how the definition of m differs from case to case. We will show how this approach can be used in simple scenarios and that more complex models can also be dealt with.

TABLE 1.1
Normal Approximations to Standard Likelihoods

Distribution     Treatment Effect             Variance V(y_m)   Definition of m         τ²
Normal           y_m = difference in means    2σ²/m             Sample size per group   2σ²
Binary           y_m = log(odds ratio)        ≅ 4/m             Total # events          4
Poisson count    y_m = log(rate ratio)        ≅ 4/m             Total count             4
Survival         y_m = log(hazard ratio)      ≅ 4/m             Total # events          4

For many, the concept of the power of a test was introduced by Neyman and Pearson (1933a) both to define tests of specific hypotheses and as a measure for comparing different tests of the same hypothesis, for a fixed type I error. However, in truth, the basic idea had appeared four years earlier in work by Pearson and Adyanthāya (1929) on the robustness of Student's t-test to non-normal symmetrical distributions or skew distributions and
independently considered by Dodge and Romig (1929) in their introduction of consumer risks and producer risks in acceptance sampling. The type I error corresponds to the producer's risk that consumers reject a good product or service indicated by the null hypothesis. In other words, a producer introduces a good product and in doing so, he/she takes a risk that consumers will reject it. In contrast, the type II error corresponds to the consumer's risk of not rejecting a possibly worthless product or service indicated by the null hypothesis. Neyman and Pearson's proposal had little immediate impact on the design of individual clinical trials. As Campbell (2013) has commented, "even exemplary clinical trials done in the early 1940s did not use statistical arguments to justify their sample sizes", including the Medical Research Council trial of streptomycin in the treatment of tuberculosis (Marshall et al., 1948). It was not until the early 1960s that power calculations began to appear in reports of clinical trials, one of the earliest being Zubrod et al. (1960). In psychological research, Cohen (1962) was the first to seriously study the power of published studies. He found, amongst 70 studies from the Journal of Abnormal and Social Psychology, that the median power to detect a medium effect size (ES), defined as a treatment difference of 0.5 in units of the pooled standard deviation, was 0.48. This led to the recommendation that "investigators use larger sample sizes than they customarily do", not least because "the chance of obtaining a significant result was about that of tossing a head with a fair coin" (Cohen, 1992). Later, Cohen produced a "power handbook" for psychologists "to solve the problem" (Cohen, 1969). One aspect of his work was the identification of the ratio of the type II to type I error rates as a measure of the relative importance of a type I error compared to a type II error.
For example, in many studies, the type I error is set at 5% and the type II error at 20%, which Cohen says implies that "mistaken rejection of the null hypothesis is considered four times as serious as mistaken acceptance". Recently, there has been increased interest in investigating optimal type I and type II errors when the objective is to minimise the weighted sum of errors in which the weights represent the relative importance of the errors (Mudge et al., 2012; Grieve, 2015). Grieve showed that for a fixed sample size, the optimal type I and type II error rates depend upon the non-centrality parameter (NCP), which is related to Cohen's ES. Figure 1.1 displays the optimal error rates as a function of the NCP of a normal, two-arm comparative study, with known variance, when type I errors are 4 times more important than type II errors. To illustrate, suppose that the NCP, δ0 √(n1/2)/σ, takes the value 2 and we believe that type I errors are 4 times more important than type II errors. From Figure 1.1 we can read off that the optimal type I (two-sided) error is 0.090 (2×0.045) and the corresponding optimal type II error is 0.380, and their ratio is not 4:1. The optimal values are dependent on the value of the NCP and change as the NCP changes. For example, if
FIGURE 1.1 Optimised type I/type II errors as a function of the NCP, if type I errors are 4 times more important than type II errors.
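The optimisation underlying Figure 1.1 can be reproduced numerically. For a one-sided rejection region Z > Z_{1−α} and NCP θ, minimising kα + β over α gives the first-order condition φ(Z_{1−α} − θ)/φ(Z_{1−α}) = k, which solves to Z_{1−α} = (ln k + θ²/2)/θ. The sketch below is my own check of that calculation, not code from the book.

```python
from math import log
from statistics import NormalDist

nd = NormalDist()

def optimal_errors(theta, k):
    """Minimise k*alpha + beta for a one-sided normal test with NCP theta.

    The first-order condition pdf(z - theta)/pdf(z) = k solves to
    z = (log(k) + theta**2 / 2) / theta; returns (two-sided alpha, beta)."""
    z = (log(k) + theta**2 / 2) / theta  # optimal Z_{1-alpha}
    return 2 * (1 - nd.cdf(z)), nd.cdf(z - theta)

def implied_importance(alpha_two_sided=0.05, beta=0.2):
    """The weight k for which the conventional alpha and beta are optimal;
    the answer does not depend on the NCP."""
    z_alpha = nd.inv_cdf(1 - alpha_two_sided / 2)
    theta = z_alpha + nd.inv_cdf(1 - beta)  # NCP at which the design is exactly powered
    return nd.pdf(z_alpha - theta) / nd.pdf(z_alpha)

print(optimal_errors(2.0, 4))  # approximately (0.090, 0.380)
print(optimal_errors(1.5, 4))  # approximately (0.094, 0.569)
print(implied_importance())    # approximately 4.79
```

The three printed values reproduce the 0.090/0.380 and 0.094/0.569 pairs read from Figure 1.1 and the relative importance of 4.79 quoted below for the conventional 0.05/0.2 design.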
the NCP takes the value 1.5, the optimal two-sided type I error is 0.094 and the optimal type II error is 0.569. Going further, Grieve also considered the inverse problem. For what relative importance are type I (two-sided) and type II error rates of 0.05 and 0.2, respectively, optimal? He showed that the solution is independent of the NCP. A two-sided type I error of 0.05 and a type II error of 0.2 are optimal for a relative importance of 4.79, 20% larger than the nominal ratio of 4. Walley and Grieve (2021) extend the analytic results developed by Grieve (2015) to studies for which the primary analysis is planned to be Bayesian. In the context of this book in which we concentrate on two-arm trials, Walley and Grieve (2021) consider examples in which a prior is available for the treatment effect and when an informative prior is available for the response mean in the control arm with a vague prior on the treatment effect, which is related to Walley et al. (2015) and Lim et al. (2018). In 1978, Freiman et al. (1978) investigated 71 "negative" trials taken mainly from the New England Journal of Medicine, The Lancet and the Journal of the American Medical Association, restricting their attention to studies with a binary outcome. They reported that for 94% of the studies there was a greater than 10% type II error of missing a 25% therapeutic improvement and for 70% there was a greater than 10% type II error of missing a 50% therapeutic improvement. The problem was still substantial 25 years later as reported by Halpern et al. (2002), who argued that underpowered trials were unethical. Part of the problem has to do with the basis on which the sample size in a study is chosen. I have argued previously (Grieve, 2015) that whilst it is
possible to take a pragmatic decision about the sample size and to base it on what the program budget allows, this type of resource-sizing is unsatisfactory because it is associated with underpowering, as Freiman et al. (1978) and Halpern et al. (2002) have shown. From a pharmaceutical perspective, it has been recognised over the last 20 years that the high failure rate, particularly in late-phase clinical trials, with average failure rates as high as 45% (Kola and Landis, 2004) and as high as 60% in some therapeutic areas, is unsustainable, although there are some signs of recent improvement (Hay et al., 2014). It can be argued that one element of the high failure rate is the tendency for development teams to be optimistic about the likely benefit of their drug candidate. This is understandable as by the time a candidate drug reaches the late stage of drug development, some members of the team may have spent a considerable proportion of their career on the development. My experience is that there are overt, and covert, incentives for teams to be optimistic and a negative consequence of such enthusiasm is likely to be an under-powering of studies. One approach to this problem is to get teams to be more realistic by acknowledging uncertainty in their view of the likely magnitude of the benefit of their development compound. This idea has been the subject of proposals for over 80 years. Following their first publication in 1933, Neyman and Pearson (1933b) published a paper that considered the following problem. Suppose that the set of possible hypotheses governing the generation of data are H0, the null hypothesis, and m alternative hypotheses H1, …, Hm.
Suppose also that associated with each hypothesis is an a priori probability φi (i = 0, 1, …, m), then their objective was to find tests which were "independent of a priori probability laws" because "the practical statistician is nevertheless forced to recognise that the values of φi can only rarely be expressed in precise numerical form". In seeking such tests, they introduced the idea of "resultant power" defined as

\sum_{i=1}^{m} \varphi_i\, P\left(\omega \,|\, H_i\right)
in which ω is a critical region of the appropriate size. In subsequent papers, they rejected the use of resultant power fundamentally because it is not independent of the prior probabilities. Jeffreys (1948), in contrast, made the following observation: Now Pearson and Neyman proceed by working out the above risks for different values of the new parameter, and call the result the power function of the test, the test itself being in terms of the P integral. But if the actual value is unknown the value of the function is also unknown; the total risk of errors of the second kind must be compounded of the power functions over the possible values, with regard to their risk of occurrence.
This may be thought of as a limiting case of “resultant power”. A similar approach was proposed by Good and Crook (1974) who were interested in exploring compromises between Bayesian and non-Bayesian approaches to significance testing. One compromise they looked at concerned the use of a “unified” power function in the following circumstances. Suppose that we are planning to test a simple null hypothesis against a composite alternative hypothesis and that the alternative hypothesis has a considerably higher dimension than the null hypothesis. For example, in a dose-finding study with k doses, the null may specify that the effect of all doses is identical, but the alternative may allow complete freedom for the response of all doses. In such cases, the power function is difficult to visualise, or as Good and Crook put it, “to grasp intuitively as a whole” (Good and Crook, 1974). They suggested the use of a “unified” power function which they proposed to create as a weighted average of the multi-dimensional power function. Concretely, suppose the alternative hypothesis contains two sets of parameters; the first set, θ, is associated with the alternative hypothesis, and ϕ, a set of nuisance parameters with prior distribution p(ϕ). If the condition for rejecting the null hypothesis is that a given statistic T is greater than a given criterion T0, chosen to control the type I error, then the power of the test as a function of the parameters θ is
\int P\left(T > T_0 \,|\, \theta, \phi\right) p(\phi)\, d\phi
which is a weighted average of the individual, conditional, powers. In later papers, Crook and Good (1982) and Good (1992) termed this averaged power the “strength” of the test. The purpose of this book is to review the use of hybrid frequentist/Bayesian approaches to power, not only to plan studies but also to support decision making at formal and informal interim analyses during the course of a study based on PP. Such ideas have been in use for over 30 years. For example, Racine et al. (1986) utilised PP in the context of a two-stage bioequivalence study to determine whether to progress to the second stage, while Frei et al. (1987) consider a similar use which allowed them to stop a clinical trial in stroke for reasons of futility. As part of the development, we will also review similar approaches for trials in which a full Bayesian analysis is planned. In Chapter 2, we review average, expected and PP and give four different methods for their calculation for known variance. In addition, we cover bounds on power, robust priors, introduce a decomposition of the average power which leads to the suggestion that average power may not be the most appropriate measure and include the case in which the variance is estimated as an integral part of the analysis of the study.
In Chapter 3, we introduce the related idea of the assurance of a design and cover an approach to determining the sample size required to achieve a pre-specified average power or assurance. Also, we introduce the idea of conditional assurance which determines the probability of success (POS), for example, of a phase III trial given success of one, or two, phase II trials as well as applying the same idea to trials which include an interim analysis. We also consider the use of assurance in planning a non-inferiority study. In Chapter 4, we look at the use of average power in non-normal settings. Included are the cases of a truncated prior for the treatment effect with known variance, unknown variance, binomial outcomes and time-to-event endpoints involving both parametric and non-parametric models. In Chapter 5, we review similar tools based on Bayesian methods alone, this includes Bayesian power (BP). Analogous bounds on Bayesian Power to the ones developed for assurance are also covered. We introduce ideas of pre-study posterior distributions which can give insight into a study’s ability to discriminate between effective and ineffective treatments. In Chapter 6, we generalise the concept of average power by considering the full prior distribution of study power and summaries other than the expected value for a fixed sample size. We also consider the prior distribution of the sample size for a given power. In Chapter 7, we cover conditional and predicted power calculations that can be used at an interim analysis to inform decision-making related to stopping for futility and/or sample-size re-estimation. In Chapter 8, we provide two case studies of simulation. The first concerns an adaptive design in which the primary endpoint is an ordered-categorical variable and we show how to utilise conditional power (CP) calculations based on a normal model with known variance in place of simulation. 
The study includes the opportunity to stop for futility at a single interim and/or to increase the sample size based on CP. The operating characteristics (OC) of the design are investigated using clinical trial simulations. The second concerns an active controlled trial in which slow recruitment caused the sponsor to assess the likelihood that the study would provide evidence enabling it to exclude large negative treatment effects if the study were to run to its planned conclusion. In Chapter 9, we generalise some of the results from earlier chapters to studies in which multiple decision criteria, based on statistical significance and relevance, are used. Such considerations are particularly relevant in the context of proof-of-concept trials. Finally, in Chapter 10, we turn attention to the related issue of determining the sample size of a study when its primary objective is to estimate a treatment effect, rather than to test that the effect is zero. In this case, power is replaced by what we call “surety” which is the probability that the width of the confidence interval (CI) for a treatment effect is less than a pre-specified value.
2 All Power Is Conditional Unless It’s Absolute
2.1 Introduction If we assume that δ is the true difference in population means, then the conditional probability of a successful trial from (1.2) can be written as
1 - \beta(\delta) = 1 - \Phi\left(Z_{1-\alpha} - \sqrt{n_1/2}\;\delta/\sigma\right) \quad (2.1)
in which the conditionality arises since δ is unknown and therefore the conditional POS, or power, is a function of this unknown alternative value δ. For the most part, our assumption will be that σ is known, but we will investigate whether it is possible, or desirable, to relax this assumption in later chapters and we will see that in many cases the added sophistication is unnecessary since the results based on assuming the variance is known are sufficiently accurate for most decision-making purposes. One important question is, what should the value of δ be assumed to take? The International Conference on Harmonisation (ICH) guidance on statistical principles in clinical trials (ICH, 1996) makes clear that there needs to be clear justification for all the assumptions that go into the sample size calculations, including
1. the means and variances;
2. response, or event, rates; and
3. the clinically meaningful difference.
According to ICH E9, the basis for the choice of the treatment effect to be detected may be based on “a judgement concerning the minimal effect which has clinical relevance in the management of patients” or on “a judgement concerning the anticipated effect of the new treatment”. Not all statisticians agree.
DOI: 10.1201/9781003218531-2
Senn (2017) has suggested four potential approaches:

i) "the treatment difference we would like to observe";
ii) "the treatment difference we would like to 'prove' obtains";
iii) "the treatment difference we believe obtains";
iv) "the treatment difference you would not like to miss".
Senn rejects approach (i) arguing that even if the value we would like to observe is the true value, there is only a 50% chance of seeing an effect which is at least as large as this true value, irrespective of the power. This is an argument we will return to consider when investigating the possibility of sizing a clinical trial based on the width of the CI associated with the estimated treatment effect in Section 10.2. Senn also rejects (ii) because it requires testing a shifted null hypothesis, in which case any power calculation based on (2.1) is no longer of any relevance. He finds approach (iii) problematic because it implies that for drugs which we believe to have a smaller effect, we would require larger sample sizes, although he notes that it has a relationship to assurance which we consider in Chapter 3. Senn prefers option (iv) because it is the complement of “the conditional probability of cancelling an interesting project that we seek to control”. Both the alternatives (i) and (ii) appear in ICH E9 (1996) which states that the choice of the treatment effect to be detected may be based on “a judgement concerning the minimal effect which has clinical relevance in the management of patients” or on a “judgement concerning the anticipated effect of the new treatment, where this is larger”. In this book, we favour (iii) primarily because it introduces beliefs about the treatment effect and leads directly to an unconditional, absolute approach to power. In one sense, the difference between (iii) and (iv) is the difference between an experiment which is designed to investigate an uncertain event and an experiment thought of as a measuring instrument.
2.2 Expected, Average and Predicted Power In this chapter, we consider the same set-up that was introduced in Chapter 1. As before, n1 patients are in each arm and the difference in sample means, δˆ , an estimate of the treatment effect, has the distribution (1.1). A standard decision rule which is often used to indicate a positive outcome of the study requires that the estimated treatment effect satisfies the relationship
\sqrt{n_1/2}\;\hat{\delta}/\sigma > Z_{1-\alpha}
in which Z1 − α is the 1−α percentile of the standard normal distribution. This decision rule can be rewritten to indicate that we need to observe a treatment effect, δˆ, such that
\hat{\delta} > Z_{1-\alpha}\,\sigma\sqrt{2/n_1}. \quad (2.2)
Now suppose that our expected knowledge, or belief, about the treatment effect before the trial is δ0, but we acknowledge that there is uncertainty about the treatment effect. This uncertainty we express by a prior normal distribution for the treatment effect which is centred at δ0 and has variability equivalent to n0 patients. In other words, our prior belief, or prior uncertainty, about the treatment effect can be represented by the probability density function (pdf)
p(\delta) = N\left(\delta_0, 2\sigma^2/n_0\right). \quad (2.3)
Although for our purposes, in which we investigate the use of this uncertainty in planning and monitoring clinical trials, we will assume that this is a realistic expression of our pre-study beliefs about the true treatment effect, we will occasionally consider more robust alternatives. In the following sections, we examine four different approaches to determining the AP. These are analytic calculation, a predictive approach, numerical integration and simulation.
2.2.1 Averaging Conditional Power with Respect to the Prior – Analytic Calculation

The first approach to determining the AP straightforwardly calculates the expected value of 1 − β(δ) in (2.1) with respect to the prior pdf (2.3), that is
AP = \int_{-\infty}^{\infty} \left(1 - \beta(\delta)\right) p(\delta)\, d\delta,

which can be expanded to give

AP = \frac{\sqrt{n_0 n_1}}{4\pi\sigma^2} \int_{-\infty}^{\infty} \int_{Z_{1-\alpha}\sigma\sqrt{2/n_1}}^{\infty} \exp\left(-\frac{n_1\left(\hat{\delta} - \delta\right)^2}{4\sigma^2} - \frac{n_0\left(\delta - \delta_0\right)^2}{4\sigma^2}\right) d\hat{\delta}\, d\delta

= \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{Z_{1-\alpha} - \sqrt{n_1/n_0}\left(y + \sqrt{n_0/2}\,\delta_0/\sigma\right)}^{\infty} e^{-x^2/2}\, dx\; e^{-y^2/2}\, dy \quad (2.4)
Then, if we set a = Z_{1-\alpha} - \sqrt{n_0/2}\,\delta_0/\sigma and b = \sqrt{n_1/n_0}, and if we define, in terms of the NCP,

\lambda = Z_{1-\alpha} - \sqrt{n_1/2}\;\delta_0/\sigma, \quad (2.5)

we can use the results in Appendix 1 to evaluate (2.4) to give

AP = 1 - \Phi\left(\sqrt{\frac{n_0}{n_0 + n_1}}\;\lambda\right). \quad (2.6)
Ibrahim et al. (2015) suggest that the POS of a trial be defined as a weighted average of power which, when "… translated to a simple situation, … becomes an expected power with respect to a given prior for the parameter and has no relationship to the frequentist power", or in other words, (2.1). This may be true philosophically; practically it is not. To see this, express AP as 1 − Φ(z_AP) and CP as 1 − Φ(z_CP), and compare (2.1) and (2.6), showing that they are related by the expression

z_{AP} = \sqrt{f_0}\; z_{CP}, \quad (2.7)

in which

f_0 = \frac{n_0}{n_0 + n_1} \quad (2.8)

is the fraction of information in the posterior distribution represented by the prior distribution. The relationship between these two powers as a function of f0 is shown in Figure 2.1. What this figure illustrates is that if the planned power is less than 0.5 the AP exceeds it, whilst the converse is true if the planned power is greater than 0.5: the planned power exceeds the AP. In truth, it is questionable whether the former scenario is realistic because in such circumstances it is unlikely that a sponsor would pursue the development. The latter scenario is more likely and indeed in most published cases, the AP is less than the nominal power. Lan et al. (2009) have proposed an application of this result; see also Jiang (2011). Let us suppose that by design the power of a new trial is 80% and that the available prior information represents 15% of the combined information from the prior and the future study; then from (2.7) we can show that the AP is Φ(√0.15 × 0.842) ≈ 63%.
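Relationship (2.7) is easy to check numerically. The sketch below (plain Python, my own illustration, with CP written as Φ(z_CP)) reproduces the Lan et al. example and its inverse.

```python
from statistics import NormalDist

nd = NormalDist()

def average_power(planned_power, f0):
    """AP from CP via (2.7): z_AP = sqrt(f0) * z_CP, with CP = Phi(z_CP)."""
    return nd.cdf(f0**0.5 * nd.inv_cdf(planned_power))

def required_cp(target_ap, f0):
    """Invert (2.7): the conditional power needed to achieve a target AP."""
    return nd.cdf(nd.inv_cdf(target_ap) / f0**0.5)

print(round(average_power(0.80, 0.15), 2))  # 0.63, the Lan et al. example
print(round(required_cp(0.80, 0.15), 3))    # 0.985
```

For planned power below 0.5 the same function returns an AP above the planned value, matching the crossing at 0.5 seen in Figure 2.1.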
FIGURE 2.1 Relationship between Average Power and Planned Power as a Function of the Prior Information Fraction (f0).
An alternative use of this same idea could be to answer any question of the type: what CP of a study is necessary in order that the AP is 80% when the prior is worth 15% of the combined information? Then, using (2.7) again, the required CP is Φ(0.842/√0.15) = 98.5%. Both examples illustrate the remark we made that for power values greater than 0.5, the AP is less than the CP. One aspect of AP which is of critical importance to understand is its limiting value as the sample size n1 becomes large. As n1 → ∞, from the definition of AP in (2.6), we find that
AP \rightarrow \Phi\left(\sqrt{n_0/2}\;\delta_0/\sigma\right) \quad (2.9)
which from (2.3) is the prior probability of a positive treatment effect, and this provides an upper bound for the achievable AP. We will address this result in more detail in Section 2.3.

Example 2.1

Muirhead and Şoaita (2013) illustrate what they describe as a "proper Bayesian" approach using a clinical study comparing a new drug for the treatment of restless leg syndrome with a placebo. The primary endpoint of the study was the International Restless Leg Scale (IRLS) ranging from 0 to 40, with high scores indicating greater severity and inferior quality of life. A previous study gave an estimate of the standard deviation of 8 units and the study was planned to detect a
FIGURE 2.2 Conditional power function and prior density of the treatment difference for known variance: Example 2.1.
"clinically meaningful" treatment difference of 4 units. With a one-sided type I error of 0.025 and a power of 80%, (1.2) gives a sample size per arm of 63, although Muirhead and Şoaita (2013) use 64 with a slightly elevated power of 80.7%. In their Bayesian approach, Muirhead and Şoaita assume that the prior distribution for δ is N(4, 64), which implies that the prior is based on a per-arm sample size of 2 patients. In Figure 2.2, the power function for a sample size of 64 and a standard deviation of 8 is displayed together with the prior distribution for
δ. Using (2.7), the unconditional AP of the study is Φ(√0.0303 × 0.868) = 0.5601. This AP corresponds to a 30% reduction from the nominal power calculated at the expected treatment effect, which is conditional and ignores the prior uncertainty. When sizing the sample, if there is insufficient power, we would be led to consider increasing the planned sample size. Here, although an increase in n1 will increase the AP, it is not a limitless increase because, as we have seen, it is bounded above by Φ(4/8) = 0.691, which is approximately 15% less than the nominal value.
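The numbers in Example 2.1 can be verified directly from (2.7) and the limiting bound (2.9). The following sketch is my own, not code from the book.

```python
from statistics import NormalDist

nd = NormalDist()

# Example 2.1: sigma = 8, delta0 = 4, n1 = 64 per arm, prior worth n0 = 2 patients
sigma, delta0, n1, n0, alpha = 8.0, 4.0, 64, 2, 0.025

z_cp = (n1 / 2)**0.5 * delta0 / sigma - nd.inv_cdf(1 - alpha)  # 0.868: power 80.7%
f0 = n0 / (n0 + n1)                                            # 0.0303: prior information fraction
ap = nd.cdf(f0**0.5 * z_cp)                                    # average power via (2.7)
ap_limit = nd.cdf((n0 / 2)**0.5 * delta0 / sigma)              # bound (2.9) as n1 grows

print(round(ap, 4), round(ap_limit, 3))  # 0.5601 0.691
```

The two printed values match the 0.5601 average power and the 0.691 upper bound quoted above.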
2.2.2 Calculating the Probability of Achieving "Significance" – Predictive Power

The second approach to determine the AP was used both by Spiegelhalter and Freedman (1988) and Grieve (1988a, 1991a). In what follows, we make use of the unconditional predictive, or marginal, distribution of the data. We may proceed in the following way.
An alternative expression for the unconditional power AP = \int_{-\infty}^{\infty} (1-\beta(\delta))\, p(\delta)\, d\delta is made possible by using the fact that the power function can itself be written as an integral over the future data, so that AP can be written as a double integral. If we then exchange the order of integration, we have

AP = \int_{-\infty}^{\infty} \left[ \int_{Z_{1-\alpha}\sigma\sqrt{2/n_1}}^{\infty} p(\hat{\delta}\,|\,\delta)\, d\hat{\delta} \right] p(\delta)\, d\delta
= \int_{Z_{1-\alpha}\sigma\sqrt{2/n_1}}^{\infty} \left[ \int_{-\infty}^{\infty} p(\hat{\delta}\,|\,\delta)\, p(\delta)\, d\delta \right] d\hat{\delta}
= \int_{Z_{1-\alpha}\sigma\sqrt{2/n_1}}^{\infty} p(\hat{\delta})\, d\hat{\delta}.
The last term is the probability that δˆ is greater than (2.2) calculated with respect to its unconditional predictive pdf. This unconditional predictive distribution can be simply determined by noting that
\hat{\delta} = \left(\hat{\delta} - \delta\right) + \delta

and that, independent of one another,

\left(\hat{\delta} - \delta\right) \sim N\left(0, 2\sigma^2/n_1\right) \quad \text{and} \quad \delta \sim N\left(\delta_0, 2\sigma^2/n_0\right),

from which it follows that

\hat{\delta} \sim N\left(\delta_0, 2\sigma^2\left(\frac{1}{n_1} + \frac{1}{n_0}\right)\right). \quad (2.10)
There are alternative approaches for allowing us to derive this marginal predictive distribution that illustrate different aspects of prediction. We provide details of these alternatives in Section 7.2. The term
\int_{Z_{1-\alpha}\sigma\sqrt{2/n_1}}^{\infty} p(\hat{\delta})\, d\hat{\delta}

can be expanded once again to give (2.6). In many cases, this approach provides a more practical solution than the direct calculation of the average value that was used in Section 2.2.1.
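The predictive route reduces the calculation to a single normal tail probability under (2.10). A sketch, mine rather than the book's, again using the Example 2.1 settings:

```python
from statistics import NormalDist

nd = NormalDist()
sigma, delta0, n1, n0, alpha = 8.0, 4.0, 64, 2, 0.025

# Critical value (2.2) the observed difference must exceed
crit = nd.inv_cdf(1 - alpha) * sigma * (2 / n1)**0.5

# Predictive distribution (2.10): delta_hat ~ N(delta0, 2*sigma^2*(1/n1 + 1/n0))
pred_sd = (2 * sigma**2 * (1 / n1 + 1 / n0))**0.5
ap = 1 - nd.cdf((crit - delta0) / pred_sd)

print(round(ap, 4))  # 0.5601, agreeing with the analytic calculation
```

The one-line tail probability returns exactly the 0.5601 obtained analytically in Section 2.2.1.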
2.2.3 Averaging Conditional Power with Respect to the Prior – Numerical Integration

The third approach uses numerical integration. Figure 2.2 displays the CP function (2.1) together with the prior pdf for δ, (2.3). The integral of the product of these expressions with respect to δ can be approximated by a mid-point rule (Good and Gaskins, 1971), which is also the approach taken by Chuang-Stein (2006). The first step is to define two values δ_min and δ_max chosen to exclude δ values with effectively zero density. The second step is to choose the number of points for evaluating the integral, say N_PT. Then the integral of the product may be approximated by

AP \approx \frac{\delta_{\max} - \delta_{\min}}{N\_PT} \sum_{i=0}^{N\_PT} \Phi\left(Z_{\alpha} + \sqrt{n_1/2}\;\hat{\delta}_i/\sigma\right) \sqrt{\frac{n_0}{4\pi\sigma^2}} \exp\left(-\frac{n_0\left(\hat{\delta}_i - \delta_0\right)^2}{4\sigma^2}\right)
where $\delta_{i}=\delta_{\min}+i\left(\delta_{\max}-\delta_{\min}\right)/N\_PT$.

Example 2.1 (Continued)
Table 2.1 illustrates the use of the mid-point rule to determine the AP for the restless leg syndrome clinical trial in Example 2.1. What these results show is that the choice of δmin and δmax is likely to be more important than the number of points used in the mid-point rule. Whilst the accuracy achieved by the mid-point rule is adequate for most purposes, greater accuracy with a smaller computational burden is achievable using Gaussian quadrature based on Hermite polynomials (Abramowitz and Stegun, 1965). An additional advantage of this latter approach is that it can be used without the need to pre-define minimum and maximum δ values with effectively zero density. We do not pursue this approach here, waiting until Section 2.6 when using it to determine the AP when the variance is estimated from the data of the future study. A similar approach will be referred to in Section 4.5 for binary data and in Section 10.3 when considering designing studies based on the width of a CI.

TABLE 2.1
Numerical Estimate of AP Based upon a Mid-Point Method Applied to Example 2.1

N_PT   Minimum   Maximum   Estimated Average Power
500    −20       20        0.5376
500    −30       30        0.5595
500    −40       40        0.5601
5000   −20       20        0.5374
5000   −30       30        0.5595
5000   −40       40        0.5601
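A sketch of this mid-point computation (Python rather than the book's own code; the parameter values are those of Example 2.1) is given below; with the (δmin, δmax, N_PT) settings of Table 2.1 it should reproduce the tabulated values to the quoted accuracy:

```python
from math import sqrt, pi, exp
from scipy.stats import norm

def ap_midpoint(delta0, sigma, n0, n1, dmin, dmax, n_pt, alpha=0.025):
    """Approximate AP as the sum over delta_i = dmin + i*h of
    CP (2.1) times the prior density (2.3), times the spacing h."""
    h = (dmax - dmin) / n_pt
    z_a = norm.ppf(1 - alpha)
    total = 0.0
    for i in range(n_pt + 1):
        d = dmin + i * h
        cp = norm.cdf(-z_a + d * sqrt(n1 / 2) / sigma)          # conditional power (2.1)
        prior = (sqrt(n0 / (2 * sigma**2)) / sqrt(2 * pi)
                 * exp(-n0 * (d - delta0)**2 / (4 * sigma**2)))  # prior pdf (2.3)
        total += cp * prior
    return h * total

for lim in (20, 30, 40):
    print(lim, round(ap_midpoint(4, 8, 2, 64, -lim, lim, 500), 4))
```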
2.2.4 Averaging Conditional Power with Respect to the Prior – Simulation

The final approach to determine the AP uses simulation to perform the appropriate integration, an example of Stephen Senn's comment that "Much simulation is just multiple integration". Simulation is increasingly being used to determine the OC of clinical trials, including power and the type I error rate, but also assurance. In fact, O'Hagan et al. (2005) in their original application of assurance to clinical trials – which is tackled in Chapter 3 – use simulation to determine the assurance for a pair of non-standard examples. O'Hagan et al.'s (2005) first example covers the case that prior information about the variance is available and in planning the trial this uncertainty is to be allowed for. Their second example covers binary data and uses a mixture distribution for the treatment effect giving a formal prior probability that the treatment response rate is the same as placebo. The use of mixture priors has recently become of increasing interest, particularly in the context of augmented control groups based on historical control data. Mixture priors will be introduced and covered in detail in Section 2.4. In our simple set-up, there are two potential approaches to simulation. In the first approach, the treatment effect δi (i = 1, …, nsim) is simulated from the prior distribution (2.3), and then, conditional on δi, the treatment effect estimate $\hat{\delta}_{i}$ (i = 1, …, nsim) is simulated from (1.1). The proportion of simulations for which $\hat{\delta}$ is greater than (2.2) is the simulation estimate of AP. The second approach simulates $\hat{\delta}$ directly from (2.10) but is otherwise the same.
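Both simulation routes can be sketched in a few lines (Python; the constants are the Example 2.1 assumptions, while the simulation size and seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2022)
n_sim, delta0, sigma, n0, n1 = 200_000, 4.0, 8.0, 2, 64
crit = 1.959964 * np.sqrt(2 * sigma**2 / n1)                 # critical value (2.2)

# Two-stage approach: delta from the prior (2.3), then delta_hat given delta.
delta = rng.normal(delta0, np.sqrt(2 * sigma**2 / n0), n_sim)
delta_hat = rng.normal(delta, np.sqrt(2 * sigma**2 / n1))
ap_two_stage = float(np.mean(delta_hat > crit))

# Direct approach: delta_hat straight from the predictive distribution (2.10).
direct = rng.normal(delta0, np.sqrt(2 * sigma**2 * (1 / n1 + 1 / n0)), n_sim)
ap_direct = float(np.mean(direct > crit))
print(ap_two_stage, ap_direct)
```

Both estimates should sit within simulation error of the analytic value 0.5601.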
The advantage of the more direct approach is that it requires the generation of only a single value for each simulation. Nonetheless, the two-stage approach allows us to investigate approaches that combine observed data and population values, and this makes it our chosen approach. There is another, more fundamental reason for us to use the two-stage approach. Box (1980) in developing his predictive checks wrote “If the prior probability distribution of parameters is accepted as essential, then a complete statement of the entertained model at any stage of an investigation is provided by the joint density for potential data y and parameters θ”. This joint distribution of the potential, but yet unobserved data, and the relevant parameters of our model provides the framework for our simulations and allows us to illustrate the power of the simulation approach using more realistic, complex, models and prior beliefs. The pdf of the treatment estimate, δˆ , conditional on the population treatment effect δ is
2 2 p ˆ| ~ N , . n1
18
Hybrid Frequentist and Bayesian Power in Planning Clinical Trials
whilst the prior distribution of the population treatment effect is also normally distributed with pdf 2 2 p N 0 , . n0
Then, properties of the bivariate normal distribution lead to the following joint distribution of δˆ and δ n0 n1 0 2 2 ˆ p , ~ N , n1 0 n0 1
1 . 1
(2.11)
(see O’Hagan et al., 2005). In Section 2.5, we will use simulated observations from this joint distribution (2.11) to support the investigation of a decomposition of the POS as proposed by Kunzmann et al. (2021).
2.3 Bounds on Average Power

The result in equation (2.9) showed that as the sample size of a study increases, the AP is bounded above by the prior probability that the treatment effect is positive. In this section, this result is looked at more generally. Wald and Wolfowitz (1940) define a test to be consistent "…if the probability of rejecting the null hypothesis when it is false (i.e., the complement of the probability of a type II error ….) approaches one as the sample number approaches infinity". Eaton et al. (2013) show that if a test of the null hypothesis in a partitioned parameter space Θ = {Θ0| Θ1} is consistent, then
$$\lim_{n\to\infty}\int_{\Theta}\beta_{n}(\theta)\,d\pi(\theta)=\pi\left(\Theta_{1}\right)$$
in which βn(θ) is the power function for a given sample size n. In our examples, Θ = {Θ0 : δ ≤ 0| Θ1 : δ > 0} and the assumption of consistency requires that as n → ∞, βn(θ) → 1 for parameter values in Θ1 and conversely βn(θ) → 0 for parameter values in Θ0. Consistency, therefore, implies that as n → ∞, the power function tends to a step function with the vertical
part of the step occurring at the boundary point between null hypothesis space Θ0 and the alternative space Θ1. Therefore, if the step function is integrated with respect to the prior distribution of θ, the result is the prior probability of being in the space of the alternative hypothesis; in our case, this corresponds to the prior probability that the treatment effect is positive. In the cases in which we are primarily interested, π(Θ1) is the prior probability of the alternative hypothesis, that is P(δ > 0), the prior probability of a positive treatment effect. This probability is a measure of interest to drug regulators. For example, the US Food and Drug Administration (FDA), in a section of their Bayesian guidance for device trials entitled “Prior probability of the study claim”, recommends that sponsors determine, and report, the prior probability of the study claim (FDA, 2010). In other words, this is the prior probability of the claim before data in the proposed study have been collected. Furthermore, they suggest that this probability should not be “too high”, in which their judgement about whether the probability is “too high” will be made on a case-by-case basis. They “recommend” that the prior probability of the claim should not be greater than the posterior probability defining a successful trial. This is a reasonable position to take since if the prior probability of the claim is too large, it is difficult to argue that we are in a state of equipoise which in turn makes randomisation ethically difficult. In principle, the upper bound on the AP is a recognition that whilst we may have an expectation that the treatment is of sufficient magnitude, δ0, to be of clinical interest, the large uncertainty as measured by n0 implies that the likelihood of success is not as high as we might wish. 
This result should not be surprising, as it is a close cousin to results in prediction from a simple linear regression model and a random effect (Model II) one-way classification. Suppose that the relationship between a controlled variable X and a response variable Y can be described by a simple linear regression model
$$y_{i}=\alpha+\beta x_{i}+\varepsilon_{i}\qquad i=1,\ldots,n$$
where the εi are independently, identically normally distributed with mean 0 and variance σ². As Snedecor and Cochran (1980) point out, in many applications of such models the aim is to predict Y from knowledge of X. They point to three related prediction problems: a) prediction (estimation) of the expected value α + βx0 of the population regression line at the value x0; b) prediction of the value of a new observation of the population at the value x0; c) prediction of the mean of a random sample of m values for a specified value x0, of which, as they point out, a) and b) are special cases.
In each case, the predicted value is given by $\hat{\alpha}+\hat{\beta}x_{0}=\bar{y}+\hat{\beta}\left(x_{0}-\bar{x}\right)$; where they differ is in their standard errors, which are:

a) $s\sqrt{\dfrac{1}{n}+\dfrac{\left(x_{0}-\bar{x}\right)^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}}$

b) $s\sqrt{1+\dfrac{1}{n}+\dfrac{\left(x_{0}-\bar{x}\right)^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}}$

c) $s\sqrt{\dfrac{1}{m}+\dfrac{1}{n}+\dfrac{\left(x_{0}-\bar{x}\right)^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}}$

where s is the estimate of σ.
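The three standard errors above can be computed with one function by treating m as a parameter, as case c) suggests. In this Python sketch the data are hypothetical, invented purely for illustration:

```python
import numpy as np

def prediction_se(x, y, x0, m):
    """Standard error for predicting the mean of m new observations at x0
    (case c)); m = np.inf recovers case a) and m = 1 recovers case b)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xbar = x.mean()
    sxx = ((x - xbar) ** 2).sum()
    beta = ((x - xbar) * (y - y.mean())).sum() / sxx   # least-squares slope
    resid = y - (y.mean() + beta * (x - xbar))
    s2 = (resid ** 2).sum() / (n - 2)                  # residual mean square
    extra = 0.0 if np.isinf(m) else 1.0 / m
    return float(np.sqrt(s2 * (extra + 1.0 / n + (x0 - xbar) ** 2 / sxx)))

x = [1, 2, 3, 4, 5, 6]                 # hypothetical data
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
se_a = prediction_se(x, y, 3.5, np.inf)
se_b = prediction_se(x, y, 3.5, 1)
se_c = prediction_se(x, y, 3.5, 4)
print(round(se_a, 4), round(se_c, 4), round(se_b, 4))
```

Case a) (m = ∞) always gives the smallest standard error and case b) (m = 1) the largest, with c) in between for finite m > 1.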
Snedecor and Cochran (1980) point out that c) is the general case and that a) and b) are special cases in which a) m = ∞ and b) m = 1. What this result indicates is that as the number of samples constituting the sample mean that we want to predict increases, the inherent uncertainty in the parameters of the model dominates and cannot be avoided. Next, we turn to consider a simple random effects model of the type used in weight and content uniformity testing during pharmaceutical manufacturing. Suppose that we have k production batches of a drug and we take n samples from each, from which we determine the active content (by weight) of each. The content yij of the jth sample from the ith batch can be expressed as a Model II one-way classification in the form

$$y_{ij}=\mu+\beta_{i}+\varepsilon_{ij}\tag{2.12}$$

where μ is the overall mean content of the batches, βi represents the difference between the content of the ith batch and the overall mean and has distribution $N\left(0,\sigma_{b}^{2}\right)$, and εij represents the measurement error distributed as $N\left(0,\sigma^{2}\right)$ independently of the βi s. If the sample mean, $\bar{y}_{..}$, is used to estimate the overall drug content of the samples, then it can be written $\bar{y}_{..}=\mu+\bar{\beta}_{.}+\bar{\varepsilon}_{..}$, from which the variance of $\bar{y}_{..}$ as an estimator of μ is

$$V\left(\bar{y}_{..}\right)=\frac{\sigma_{b}^{2}}{k}+\frac{\sigma^{2}}{kn}.$$

As n gets large, the variance tends to $\sigma_{b}^{2}/k$, which is independent of the number of measurements per batch n. What these two examples demonstrate is that taking very large samples is not always a recipe for reducing variability indefinitely, which is what is being reflected in the upper bound on AP.
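The plateau is easy to see numerically; in this sketch the values σb² = 4, σ² = 9 and k = 5 are hypothetical:

```python
# Variance of the grand mean in the Model II one-way classification:
# V = sigma_b^2/k + sigma^2/(k*n); as n grows it plateaus at sigma_b^2/k.
sigma_b2, sigma2, k = 4.0, 9.0, 5   # hypothetical values

def v_grand_mean(n):
    """Variance of y-bar.. for k batches with n measurements per batch."""
    return sigma_b2 / k + sigma2 / (k * n)

for n in (1, 10, 100, 10_000):
    print(n, round(v_grand_mean(n), 5))
```

However many measurements are taken per batch, the between-batch component σb²/k remains.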
2.4 Average Power for a Robust Prior

The assumptions in this chapter so far are relatively simple, but we often want to use more flexible, robust prior distributions. The results derived so far allow us to utilise a class of robust priors without the necessity of deriving the corresponding APs from scratch. Increasingly, when using real data to generate a proper prior for a treatment parameter, statisticians utilise robust alternatives as one way of protecting against prior/data conflict; Lim et al. (2018) review the use of historical data, which provides the rationale for such robust priors. In many instances, these robust priors are constructed from mixtures of distributions with a minimum of two components. The first component, which is standardly derived from historical data, is a meta-analytic-predictive (MAP) prior (Spiegelhalter et al., 2004), which is itself more robust than a standard conjugate prior. This prior component is normally expressed as a mixture of standard distributions; for example, Schmidli et al. (2014), based on earlier work by Dallal and Hall (1983) and Diaconis and Ylvisaker (1985), show how to fit mixture distributions to Monte-Carlo samples from a MAP prior. The second component, which is less informative, provides additional robustness against prior/data conflict.
Example 2.2 To illustrate how an arbitrary prior distribution can be represented by a mixture distribution we use an example of a prior distribution elicited from experts. Crisp et al. (2018) provide a graphic representation of a prior distribution for the hazard ratio to be used in a cardiovascular outcomes trial which was designed to achieve a target number of major adverse cardiovascular events (MACE) for a non-inferiority assessment relative to the standard of care. The prior, shown in their Figure 4, was formed from individual priors elicited independently from 6 experts, with the consensus being achieved by taking the average, an approach that follows Spiegelhalter et al. (1994), though not a method which has universal support (Grieve, 1994). To create a mixture representation of their prior we used three steps. Step 1. A web-based tool (WebPlotDigitizer) was used to create a digital representation of the average prior providing pairs of values {fi, HRi} in which HRi is a value of the hazard ratio and fi is the height of the un-normalised density at that hazard ratio. Step 2. The accept/reject algorithm originally described by von Neumann (1951) was used to generate a random sample of 1000 values from the distribution of the log (Hazard Ratio), illustrated in Figure 2.3.
FIGURE 2.3 Illustration of the von Neumann accept/reject approach to generating a random sample from an unnormalised density: (a) generation of a random uniform value X (between MIN and MAX) and a random uniform value Y between 0 and M; (b) random generated values which are accepted (green) and rejected (red).
a) In Figure 2.3(a), the unnormalised density is bounded by Min and Max values on the Hazard Ratio axis, and by M on the vertical (density) axis; in our case, M is 1. b) A random Hazard Ratio value, say X, is generated uniformly between Min and Max.
FIGURE 2.4 Mixture approximation to the prior distribution of the log (hazard ratio) generated from the elicited prior density shown by Crisp et al. (2018).
c) The height of the un-normalised density at the value X, say f, is calculated by interpolation between the heights of the adjacent {fi, HRi} pairs which bracket X. d) A random value Y is generated uniformly between 0 and M; if Y ≤ f, then X is accepted as a draw from the density, otherwise it is rejected.
Step 3. The function automixfit in the R library RBesT was then used to calculate the parameters of a two-component mixture of normal distributions to approximate the prior distribution, which is shown in Figure 2.4.
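Step 2's accept/reject loop can be sketched as follows (Python rather than the R-based workflow of the example; the standard normal kernel is a toy stand-in for the digitised prior, and the bound M, range and sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1951)

def von_neumann_sample(f, lo, hi, m, size):
    """Accept/reject sampling from an unnormalised density f on [lo, hi],
    bounded above by m: keep X ~ U(lo, hi) whenever Y ~ U(0, m) <= f(X)."""
    draws = []
    while len(draws) < size:
        x = rng.uniform(lo, hi)
        y = rng.uniform(0.0, m)
        if y <= f(x):          # step d): accept, otherwise reject
            draws.append(x)
    return np.array(draws)

f = lambda x: np.exp(-x**2 / 2)          # toy unnormalised density
draws = von_neumann_sample(f, -4.0, 4.0, 1.0, 5_000)
print(round(float(draws.mean()), 2), round(float(draws.std()), 2))
```

With the normal kernel the accepted draws should have mean near 0 and standard deviation near 1.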
To illustrate, suppose that we generalise the prior distribution (2.3) to a mixture prior with two components. The first component is exactly (2.3), with probability ω0. The second component, with the complementary probability 1 − ω0, is $N\left(0,2\sigma^{2}/n_{0}'\right)$, in which n0′ is much smaller than n0, implying larger uncertainty. A similar prior distribution has been considered by Muirhead and Şoaita (2013). We now write the prior as

$$p\left(\delta\right)=\omega_{0}\,N\!\left(\delta_{0},\frac{2\sigma^{2}}{n_{0}}\right)+\left(1-\omega_{0}\right)N\!\left(0,\frac{2\sigma^{2}}{n_{0}'}\right)\tag{2.13}$$

from which it follows that the overall average power, AP_M, can be written as a weighted average of APs based on the individual components of (2.13):

$$AP\_M=\omega_{0}\,AP+\left(1-\omega_{0}\right)AP'$$
where

$$AP'=\Phi\!\left(-Z_{1-\alpha}\sqrt{\frac{n_{0}'}{n_{1}+n_{0}'}}\right).$$

As n1 → ∞ the limiting value of AP_M is

$$\omega_{0}\,\Phi\!\left(\frac{\delta_{0}}{\sqrt{2\sigma^{2}/n_{0}}}\right)+\frac{1-\omega_{0}}{2}.\tag{2.14}$$

If we now compare the result (2.14) with the probability of a positive treatment effect from the prior distribution (2.13), it once again demonstrates that the prior probability of a positive treatment effect provides an upper bound for the achievable AP, generalising (2.9). We will make use of this more general prior in the following chapter. Of course, the robust prior need not be restricted to two components but can be generalised to a mixture distribution with multiple components, leading to extended versions of (2.13) and (2.14).
2.5 Decomposition of Average Power

Kunzmann et al. (2021) have questioned whether the accepted definition of "success" implicit in the standard approach to AP is appropriate. Their concern is grounded on the observation that this standard definition includes rejections based on values of the treatment effect which are either "irrelevant", defined as being less than the MCID, or in some more extreme cases belong to the region of the null hypothesis. A consequence is that the AP includes a contribution from type I errors, and this might be regarded as paradoxical. To see this, Kunzmann et al. decompose the AP into three separate components as follows:
$$AP=\Pr\!\left(\hat{\delta}>Z_{1-\alpha}\sqrt{\tfrac{2\sigma^{2}}{n_{1}}}\,\wedge\,\delta\geq MCID\right)+\Pr\!\left(\hat{\delta}>Z_{1-\alpha}\sqrt{\tfrac{2\sigma^{2}}{n_{1}}}\,\wedge\,0<\delta<MCID\right)+\Pr\!\left(\hat{\delta}>Z_{1-\alpha}\sqrt{\tfrac{2\sigma^{2}}{n_{1}}}\,\wedge\,\delta\leq0\right)=\mathrm{I}+\mathrm{II}+\mathrm{III}.\tag{2.15}$$
Clearly, the probability III is calculated under the null hypothesis δ ≤ 0, and therefore AP itself is partly determined under the null. Of course, this may have little relevance if both this probability and the probability II, which is calculated for treatment effects less than the MCID, are small. Under the assumption of normality, a minimum of one double integral over the bivariate normal density $p(\hat{\delta},\delta)$, introduced in (2.11), is necessary to evaluate each of the three component parts of the AP. Before deriving analytic expressions for each of these probabilities, we illustrate how simulation can be used to obtain an empirical estimate of the probabilities.
Example 2.3
Based on the assumptions in Example 2.1, the joint distribution of $\hat{\delta}$ and δ from (2.11) is

$$\begin{pmatrix}\hat{\delta}\\ \delta\end{pmatrix}\sim N\!\left(\begin{pmatrix}4\\4\end{pmatrix},\begin{pmatrix}66&64\\64&64\end{pmatrix}\right).$$
FIGURE 2.5 Random sample from the joint distribution (2.11) with assumptions based on Example 2.1.

Additionally, we will assume that the MCID is 2 units. In Figure 2.5, we display a random sample of 10,000 observations from this bivariate distribution. Those samples which fall into each of the regions defining the probabilities I, II and III are coloured green, yellow
and red, respectively. Those samples which fail to meet the significance criterion (2.2) are coloured blue. The percentages of the sample points falling into each region provide empirical estimates of the probabilities I, II and III, respectively, as well as the probability of non-significance. In this simulation the empirical estimates are:

I: 0.5485
II: 0.0109
III: 0.0005
Non-significance: 0.4401
The sum of the first three probabilities, I, II and III, is 0.5599, which represents the empirical estimate of the POS and can be compared with the result from the analytic solution given in Example 2.1 of 0.5601. The probability III is extremely small, and we will look at what this probability represents later.
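The empirical decomposition can be reproduced with a few lines of Python (our own sketch; the seed and simulation size are arbitrary, and the constants are those of Example 2.3):

```python
import numpy as np

rng = np.random.default_rng(2021)
n_sim, mcid = 100_000, 2.0
crit = 1.959964 * np.sqrt(2.0)        # critical value (2.2) = 2.7718

# Sample from the joint distribution (2.11) under the Example 2.3 assumptions.
dhat, delta = rng.multivariate_normal([4.0, 4.0],
                                      [[66.0, 64.0], [64.0, 64.0]], n_sim).T

sig = dhat > crit
p1 = float(np.mean(sig & (delta >= mcid)))               # I  : significant, relevant
p2 = float(np.mean(sig & (delta > 0) & (delta < mcid)))  # II : significant, marginal
p3 = float(np.mean(sig & (delta <= 0)))                  # III: significant under the null
print(round(p1, 4), round(p2, 4), round(p3, 4), round(1 - float(sig.mean()), 4))
```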
The random sample that has been generated can also be used to derive samples from other distributions which might be of future interest. For example, if we were to select individual simulations for which |δ − 4| < ε, where ε is small, then the resulting sample of $\hat{\delta}$ values will represent a random sample from the predictive distribution of $\hat{\delta}$ given δ = 4 and allows the empirical verification of power calculations. Similarly, if we were to select individual simulations for which $|\hat{\delta}-3|<\varepsilon$, where again ε is small, then the resulting sample of δ values will represent a random sample from the posterior distribution of δ given a sample estimate of $\hat{\delta}=3$. In either case, of course, other values can be chosen. In subsequent chapters, we will point out how these simulated values can be used to determine other distributions of interest when appropriate. Returning to the analytic expressions for the probabilities I, II and III, the integral form of I can be written as

$$\mathrm{I}=\int_{MCID}^{\infty}\int_{Z_{1-\alpha}\sqrt{2\sigma^{2}/n_{1}}}^{\infty}p\left(\hat{\delta},\delta\right)d\hat{\delta}\,d\delta.$$
Now, if we transform the variables $\hat{\delta}$ and δ to y and x, respectively, utilising the expressions:

$$y=\frac{\hat{\delta}-\delta_{0}}{\sqrt{2\sigma^{2}\left(\frac{1}{n_{0}}+\frac{1}{n_{1}}\right)}},\qquad x=\frac{\delta-\delta_{0}}{\sqrt{2\sigma^{2}/n_{0}}}$$

then I can be written as

$$\mathrm{I}=\int_{\frac{\left(MCID-\delta_{0}\right)\sqrt{n_{0}}}{\sqrt{2}\,\sigma}}^{\infty}\int_{\sqrt{\frac{n_{0}}{n_{1}+n_{0}}}\left(Z_{1-\alpha}-Z_{1}\right)}^{\infty}g\left(x,y,\rho\right)dy\,dx\tag{2.16}$$

where

$$g\left(x,y,\rho\right)=\frac{1}{2\pi\sqrt{1-\rho^{2}}}\exp\!\left(-\frac{x^{2}-2\rho xy+y^{2}}{2\left(1-\rho^{2}\right)}\right)$$

in which Z1 is given by (2.5), $\rho=\sqrt{1-f_{0}}$ and f0 is given by (2.8).
We can use standard properties of the bivariate normal distribution to allow I to be evaluated using readily available functions in statistical packages, for example, PROBBNRM in SAS, which evaluates

$$B\left(h,k,\rho\right)=\int_{-\infty}^{h}\int_{-\infty}^{k}g\left(x,y,\rho\right)dy\,dx\tag{2.17}$$

or pmvnorm in the R library mvtnorm. So that

$$\mathrm{I}=B\!\left(\frac{\left(\delta_{0}-MCID\right)\sqrt{n_{0}}}{\sqrt{2}\,\sigma},\;\sqrt{f_{0}}\left(Z_{1}-Z_{1-\alpha}\right),\;\sqrt{1-f_{0}}\right).$$
In a similar way,

$$\mathrm{II}=B\!\left(\frac{\delta_{0}\sqrt{n_{0}}}{\sqrt{2}\,\sigma},\;\sqrt{f_{0}}\left(Z_{1}-Z_{1-\alpha}\right),\;\sqrt{1-f_{0}}\right)-B\!\left(\frac{\left(\delta_{0}-MCID\right)\sqrt{n_{0}}}{\sqrt{2}\,\sigma},\;\sqrt{f_{0}}\left(Z_{1}-Z_{1-\alpha}\right),\;\sqrt{1-f_{0}}\right)\tag{2.18}$$

and finally

$$\mathrm{III}=\Phi\!\left(\sqrt{f_{0}}\left(Z_{1}-Z_{1-\alpha}\right)\right)-B\!\left(\frac{\delta_{0}\sqrt{n_{0}}}{\sqrt{2}\,\sigma},\;\sqrt{f_{0}}\left(Z_{1}-Z_{1-\alpha}\right),\;\sqrt{1-f_{0}}\right).\tag{2.19}$$
Example 2.3 (Continued)
Returning to the assumptions of our continuing example, we have the following values:

$$n_{0}=2,\;n_{1}=64,\;\sigma^{2}=64,\;\delta_{0}=4,\;MCID=2,\;Z_{1}=2.8284,\;f_{0}=0.0303,\;\rho=0.9847,\;Z_{1-\alpha}=1.95996$$
so that the three component probabilities of AP are:

I = B(0.25, 0.15118, 0.9847) = 0.5479
II = B(0.5, 0.15118, 0.9847) − B(0.25, 0.15118, 0.9847) = 0.0116
III = Φ(0.15118) − B(0.5, 0.15118, 0.9847) = 0.00057

From these, we can determine the probability of non-significance as

1.0 − 0.5479 − 0.0116 − 0.00057 = 0.4399.
If we compare these exact values to the simulated estimates previously reported in Example 2.3, we can see that the simulated estimates are accurate enough for most practical purposes.
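These calculations can be checked with the bivariate normal CDF available in SciPy in place of PROBBNRM or pmvnorm (our own sketch of (2.17)–(2.19), using the Example 2.3 values):

```python
from math import sqrt
from scipy.stats import norm, multivariate_normal

n0, n1, sigma, delta0, mcid, alpha = 2, 64, 8.0, 4.0, 2.0, 0.025
f0 = n0 / (n0 + n1)                              # (2.8)
rho = sqrt(1 - f0)
z1 = delta0 * sqrt(n1 / 2) / sigma               # (2.5): 2.8284
k_arg = sqrt(f0) * (z1 - norm.ppf(1 - alpha))    # 0.15118

def B(h, k, r):
    """Bivariate normal CDF (2.17), the quantity PROBBNRM/pmvnorm return."""
    return multivariate_normal.cdf([h, k], mean=[0.0, 0.0],
                                   cov=[[1.0, r], [r, 1.0]])

b1 = B((delta0 - mcid) * sqrt(n0) / (sqrt(2) * sigma), k_arg, rho)   # B(0.25, ...)
b2 = B(delta0 * sqrt(n0) / (sqrt(2) * sigma), k_arg, rho)            # B(0.5, ...)
p1, p2, p3 = b1, b2 - b1, norm.cdf(k_arg) - b2   # I, II (2.18), III (2.19)
print(round(p1, 4), round(p2, 4), round(p3, 5))
```

By construction the three probabilities sum to the AP, Φ(0.15118) = 0.5601.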
We have already seen in (2.15) that the sum of the three probabilities I, II and III equals the AP. We also know from Section 2.3 that the AP is bounded above by the prior probability that the treatment effect is greater than 0. The question remains, what is the behaviour of each of these probabilities as n1 tends to infinity? Are they each individually bounded by the parameters of the prior distribution of the treatment effect (2.3)? Since the correlation between $\hat{\delta}$ and δ is $\sqrt{1-f_{0}}$, and f0 is given by (2.8), as n1 → ∞, ρ → 1, and this permits a considerable simplification of the asymptotic structure of the three probabilities. Specifically, we can exploit well-known properties of the cumulative distribution function (CDF) of the bivariate normal distribution to show that the asymptotic values of the three probabilities can be written in the form:

$$\mathrm{I}\to\Phi\!\left(\frac{\left(\delta_{0}-MCID\right)\sqrt{n_{0}}}{\sqrt{2}\,\sigma}\right)$$
$$\mathrm{II}\to\Phi\!\left(\frac{\delta_{0}\sqrt{n_{0}}}{\sqrt{2}\,\sigma}\right)-\Phi\!\left(\frac{\left(\delta_{0}-MCID\right)\sqrt{n_{0}}}{\sqrt{2}\,\sigma}\right)$$
$$\mathrm{III}\to0.$$
Clearly, each of these again depend only on properties of the prior distribution of the treatment effect, and again provide bounds on what is achievable in terms of the decomposition of AP. Kunzmann et al. (2021) argue that the conditional probability I is preferable to AP because it represents the POS, in which “success” is defined as the
rejection of the null hypothesis and that the true treatment effect is a "relevant" effect. This should not be thought of as a purist position, since in drug development we are concerned with developing drugs which have clinical value, and not with designing trials which merely allow us to clear a purely statistical hurdle. In contrast, the decomposition (2.15) shows that AP effectively defines success to include values of the treatment effect which are marginal, in that they are less than the MCID, and, more seriously, values that define the null hypothesis. Consequently, AP is inflated by the probability of a classical type I error. Practically, however, their analysis shows that the impact of these two cases is likely to be of only minimal consequence unless the prior distribution is narrowly distributed around a mode which is slightly less than the boundary defining the null hypothesis. But of course, it is almost certainly the case that those are precisely the circumstances under which we are unlikely to be pursuing the clinical development of such a drug. As a consequence of this, it is arguable that the distinction between AP and POS as defined by Kunzmann et al. (2021) is really only of theoretical interest, and indeed Example 2.3 demonstrates that in that specific example the probabilities II and III are likely to be small relative to I; we conjecture that this is likely to be more widely true. Because of this observation, in the rest of the book, we will concentrate on the traditional definition of AP (or assurance as defined in Chapter 3), but we will signpost the relationship of other approaches and their link to the decomposition (2.15).
2.6 Average Power – Variance Estimated

Thus far, we have assumed that the variance is known and fixed. In practice, we will most likely estimate the variance from the study and our best estimate of the variance will be used in planning the study. One question will therefore be whether we need the added sophistication of planning the trial based on a t-test rather than a standard normal test. Suppose that we decide to take the former approach and plan our study knowing we will conduct a t-test at the completion of the study. In this case, therefore, the decision rule (2.2) becomes
$$\hat{\delta}>t_{1-\alpha,\nu}\sqrt{\frac{2s^{2}}{n_{1}}},\quad\text{that is,}\quad\frac{\hat{\delta}}{s}\sqrt{\frac{n_{1}}{2}}>t_{1-\alpha,\nu}.\tag{2.20}$$
The probability of meeting this criterion can be calculated from the CDF of the non-central t-distribution:

$$P\!\left(\frac{\hat{\delta}}{s}\sqrt{\frac{n_{1}}{2}}>t_{1-\alpha,\nu}\right)=P\!\left(\frac{\dfrac{\hat{\delta}-\delta}{\sqrt{2\sigma^{2}/n_{1}}}+\dfrac{\delta}{\sigma}\sqrt{\dfrac{n_{1}}{2}}}{\sqrt{\dfrac{\nu s^{2}}{\nu\sigma^{2}}}}>t_{1-\alpha,\nu}\right)=P\!\left(T\!\left(\nu,\frac{\delta}{\sigma}\sqrt{\frac{n_{1}}{2}}\right)>t_{1-\alpha,\nu}\right)\tag{2.21}$$

where $T(\nu,\lambda)$ denotes a non-central t variable with ν degrees of freedom and non-centrality parameter λ.
The average power is calculated by integrating (2.21) over the prior distribution of the treatment effect δ given by (2.3) to give

$$\int_{-\infty}^{\infty}P\!\left(T\!\left(\nu,\frac{\delta}{\sigma}\sqrt{\frac{n_{1}}{2}}\right)>t_{1-\alpha,\nu}\right)\frac{\sqrt{n_{0}}}{2\sigma\sqrt{\pi}}\,e^{-\frac{n_{0}\left(\delta-\delta_{0}\right)^{2}}{4\sigma^{2}}}\,d\delta.$$

This integral is not available analytically but, since the kernel of the integral is a normal density, it can be approximated by

$$\sum_{i=1}^{N}\frac{W_{i}}{\sqrt{\pi}}\,P\!\left(T\!\left(\nu,\;Z_{i}\sqrt{\frac{2n_{1}}{n_{0}}}+\frac{\delta_{0}}{\sigma}\sqrt{\frac{n_{1}}{2}}\right)>t_{1-\alpha,\nu}\right)\tag{2.22}$$
where Zi and Wi are the zeros and weights of the Nth order Hermite polynomial (Abramowitz and Stegun, 1965).

Example 2.1 (Continued)
In this example, we use the same assumptions as previously in Example 2.1. We can use (2.21) to generate the power function based on the decision criterion (2.20). The resulting power function could be plotted together with the power function of (2.1), which was shown in Figure 2.2; however, there is little point in doing so because the two are so similar that the naked eye cannot distinguish between them. Because of the similarity between the power functions (2.1) and (2.21), we can anticipate that there will be little difference between the average powers. Based on the previous assumptions we have

$$n_{1}=64;\;\nu=126;\;\sigma=8;\;\alpha=0.0250;\;\delta_{0}=4;\;n_{0}=2.$$

Using the zeros and weights of the 100th order Hermite polynomial, the expression (2.22) gives a value for the AP of 0.5590; as predicted, this is very close to the value calculated earlier, namely 0.5601.
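A sketch of the quadrature (2.22) in Python (not the book's code; SciPy's non-central t plays the role of T(ν, ·), and the guard against non-finite values at extreme quadrature nodes is a defensive choice of ours):

```python
import numpy as np
from scipy.stats import t as t_dist, nct

n0, n1, sigma, delta0, alpha = 2, 64, 8.0, 4.0, 0.025
nu = 2 * (n1 - 1)                                  # 126 degrees of freedom
t_crit = t_dist.ppf(1 - alpha, nu)

z, w = np.polynomial.hermite.hermgauss(100)        # zeros and weights (2.22)
nc = z * np.sqrt(2 * n1 / n0) + delta0 * np.sqrt(n1 / 2) / sigma
tail = nct.sf(t_crit, nu, nc)                      # P(T(nu, nc_i) > t_crit)
tail = np.where(np.isfinite(tail), tail, 1.0)      # guard extreme nodes
ap = float(np.sum(w / np.sqrt(np.pi) * tail))
print(round(ap, 4))
```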
2.6.1 Bound on Average Power when the Variance Is Estimated

Asymptotically, the power function (2.21) tends to a step function as the sample size n1 increases. The consistency argument presented in Section 2.3 then implies that the step function has its vertical component at the boundary of the null and alternative spaces, in this case, zero. Consequently, when the step function is integrated with respect to the prior distribution of θ, the result is once again the prior probability that the treatment effect is positive,

$$\Phi\!\left(\frac{\delta_{0}\sqrt{n_{0}}}{\sqrt{2}\,\sigma}\right).$$
3 Assurance
3.1 Introduction

O’Hagan and Stevens (2001) coined the term “assurance” to describe the unconditional probability that a trial will give rise to a specific outcome in the context of cost-effectiveness trials. This is equivalent to AP as they explicitly acknowledge, “Bayesian assurance, …. is a kind of AP”. In O’Hagan et al. (2005), they apply assurance to efficacy trials. The concept of assurance need not be restricted to simple applications in which the unconditional probability of attaining “statistical significance” based on a traditional test and critical region is required. As O’Hagan et al. (2005) point out, the idea allows for its use in more complex applications such as attaining significance when the test drug is genuinely superior; multiple outcomes; a sequence of trials; and joint decisions around marketing approval and market access. In this chapter, we restrict ourselves to O’Hagan et al.’s (2005) original application to standard clinical efficacy studies, returning to other more complex applications in later chapters.
3.2 Basic Considerations

Like us, O’Hagan et al. (2005) considered a trial with two treatments. They modify the set-up of the trial compared to ours by allowing both different sample sizes (m1 and m2) and variances (σ1² and σ2²) in each treatment arm. With this change, (2.1) and (2.3) become

$$1-\beta\left(\delta\right)=\Phi\!\left(-Z_{1-\alpha}+\frac{\delta}{\tau}\right)\tag{3.1}$$
and
$$p\left(\delta\right)=\frac{1}{\sqrt{2\pi}\,\omega}\exp\!\left(-\frac{1}{2\omega^{2}}\left(\delta-\delta_{0}\right)^{2}\right),\tag{3.2}$$

respectively, in which $\tau^{2}=\sigma_{1}^{2}/m_{1}+\sigma_{2}^{2}/m_{2}$. Combining (3.1) and (3.2), utilising the unconditional predictive distribution of the data, they show that
$$\text{Assurance}=\Phi\!\left(\frac{\delta_{0}-Z_{1-\alpha}\tau}{\sqrt{\tau^{2}+\omega^{2}}}\right).\tag{3.3}$$
If we rewrite (3.3) as

$$\text{Assurance}=\Phi\!\left(\sqrt{\frac{\tau^{2}}{\tau^{2}+\omega^{2}}}\left(\frac{\delta_{0}}{\tau}-Z_{1-\alpha}\right)\right),\tag{3.4}$$

then we see that (2.7) still holds, where now

$$f_{0}=\frac{\tau^{2}}{\tau^{2}+\omega^{2}}.$$
Additionally, as the sample sizes m1 and m2 get large, the variance τ² → 0, so that from (3.3)

$$\text{Assurance}\to\Phi\!\left(\frac{\delta_{0}}{\omega}\right)\tag{3.5}$$
which now provides an upper bound to the achievable assurance. We will consider the consequences of this bound, and its counterpart which was pointed to in Section 2.2.1, in greater detail in Chapter 5.
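A sketch of (3.3) and the bound (3.5) (Python; the numerical check re-uses the Example 2.1 quantities as the special case τ² = 2σ²/n1, ω² = 2σ²/n0):

```python
from math import sqrt
from scipy.stats import norm

def assurance(delta0, tau, omega, alpha=0.025):
    """Assurance (3.3): Phi((delta0 - z*tau) / sqrt(tau^2 + omega^2))."""
    return norm.cdf((delta0 - norm.ppf(1 - alpha) * tau)
                    / sqrt(tau**2 + omega**2))

print(round(assurance(4, sqrt(2 * 64 / 64), sqrt(2 * 64 / 2)), 4))  # Example 2.1 case
print(round(assurance(4, 1e-6, 8), 4))   # tau -> 0 approaches the bound (3.5)
```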
3.3 Sample Size for a Given Average Power/Assurance

For given sample sizes, (3.3) and (3.4) can be used to determine the corresponding assurance. However, when planning a study, it is often the case that we are more interested in knowing the required sample size, in each arm, to deliver a pre-specified assurance (or AP).
If that is the case, then it is possible to proceed as follows. First, we can rewrite (3.3) as

$$Z_{1-\alpha}\tau-\delta_{0}=-Z_{AP}\sqrt{\tau^{2}+\omega^{2}},$$

where $Z_{AP}=\Phi^{-1}\left(AP\right)$, a quadratic equation in τ whose solution is

$$\tau=\frac{Z_{1-\alpha}\delta_{0}-Z_{AP}\sqrt{\delta_{0}^{2}+\omega^{2}\left(Z_{1-\alpha}^{2}-Z_{AP}^{2}\right)}}{Z_{1-\alpha}^{2}-Z_{AP}^{2}},\qquad Z_{AP}<\frac{\delta_{0}}{\omega}.\tag{3.6}$$
In this equation, the constraint is determined by the bound (3.5). In Chapter 2, we considered a special case in which $\sigma_{1}^{2}=\sigma_{2}^{2}=\sigma^{2}$ and m1 = m2 = n1, so that $\tau^{2}=2\sigma^{2}/n_{1}$, and defined the prior variance to be ω² = 2σ²/n0. If these values are substituted into (3.6), it follows that the required sample size per arm is

$$n_{1}=\frac{n_{0}\left(Z_{1-\alpha}^{2}-Z_{AP}^{2}\right)^{2}}{\left(Z_{1-\alpha}Z_{0}-Z_{AP}\sqrt{Z_{0}^{2}+Z_{1-\alpha}^{2}-Z_{AP}^{2}}\right)^{2}},\qquad Z_{AP}<Z_{0},\tag{3.7}$$

where

$$Z_{0}=\sqrt{\frac{n_{0}}{2}}\,\frac{\delta_{0}}{\sigma}.\tag{3.8}$$
The result (3.7) was first derived by McShane and Böckenholt (2016) in what they term the power-calibrated effect size (PCES) approach to the choice of sample size. Their argument is that the PCES method of determining the sample size is attractive precisely because it takes account of uncertainty and “…is properly calibrated so that sample sizes determined using the PCES approach provide the desired level of power on average”. This leads them to suggest that by using it we are less likely to waste resources by having sample sizes which are either too small or too large. Further, they also argue that it is simple to implement. A similar result was derived by Jiang (2011) who expressed n0 as a function of n1 in the context of two studies a phase II (n0 patients per arm) and a phase III trial (n1 patients per arm). There has been a recent increase in the number of studies in which the ratio of patients randomised to the test treatment compared to control is not 1:1 but is higher. The rationale is to increase the exposure to the new drug primarily to meet regulatory needs for evidence of the experimental drug’s
safety. There are consequences of this approach because such designs are less efficient than balanced designs. To illustrate, suppose that we plan a study for which there is k : 1 randomisation in favour of the new treatment. Then it is straightforward to demonstrate that the ratio of the unbalanced total sample size, nu, to the corresponding balanced total, nb, is nu/nb = (k + 1)2/(4k) giving the relative increase in sample size required in the unbalanced case compared to the balanced case. If k = 2, the value is 1.125, indicating a 12.5% increase in sample size. In planning of trials, drug development teams should consider other ways of achieving the exposure requirements; an open-label extension might be considered. Suppose that despite efficiency considerations, a k-fold unbalanced trial is planned, then as a consequence, the standard deviation τ takes the value
\tau = \sqrt{ \frac{(k+1)^2}{4k} \cdot \frac{2\sigma^2}{n_1} }.

We can proceed as before, and if this value is substituted
into (3.6), it follows that the required sample size per arm can now be written in the form
n_1 = \frac{(k+1)^2}{4k} \cdot \frac{2\sigma^2 \left( Z_{1-\alpha}^2 - Z_{1-AP}^2 \right)^2}{\omega^2 \left[ Z_{1-\alpha} Z_0 - Z_{1-AP} \sqrt{Z_0^2 + Z_{1-\alpha}^2 - Z_{1-AP}^2} \right]^2}, \qquad Z_{1-AP} < Z_0,   (3.8)
where Z0 = δ0/ω.

Example 3.1
Continuing with Example 2.1, assume that what we require is the necessary sample size to give an AP of 50% to detect a "clinically meaningful" treatment difference of 4 units. The previous assumptions imply that Z0 = 0.5 and from (3.7), we can calculate the required sample size as n1 = 31 per arm.

TABLE 3.1
Sample Sizes, Per Arm, to Give a Nominal Power Based on Average Power (3.7)

Average Power    Sample Size (3.7)
0.50             31
0.51             35
0.52             39
0.53             43
0.54             49
0.55             56
This result is sensitive to small changes in the assumptions. Table 3.1 shows the sample sizes per arm for APs of 0.51–0.55. The sample sizes range from 31 to 56, so that in this range a 10% increase in AP results in approximately an 80% increase in sample size. This outcome is related to the AP bounds of equations (2.9), (2.14) and (3.5). As the sample size increases, the AP curve becomes increasingly flat, requiring ever larger increases in sample size for diminishing marginal increases in AP.
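As a minimal sketch (not the book's own code), the calculation behind Table 3.1 can be reproduced from (3.8); with k = 1 it reduces to (3.7). The function name and the numerical assumptions (δ0 = 4, σ² = 64, ω² = 64, one-sided α = 0.025, as implied by Examples 2.1 and 3.1) are this sketch's, and scipy is assumed available for the normal quantiles.

```python
import math
from scipy.stats import norm

def sample_size_ap(ap, delta0=4.0, sigma2=64.0, omega2=64.0, alpha=0.025, k=1):
    """Per-arm sample size from (3.8) for a target average power `ap`;
    k = 1 recovers the balanced-design result (3.7)."""
    z_a = norm.ppf(1 - alpha)          # Z_{1-alpha}
    z_ap = norm.ppf(ap)                # Z_{1-AP}
    z0 = delta0 / math.sqrt(omega2)    # Z_0 = delta0 / omega
    if z_ap >= z0:
        raise ValueError("AP must lie below its upper bound Phi(Z0)")
    root = math.sqrt(z0**2 + z_a**2 - z_ap**2)
    n1 = ((k + 1)**2 / (4 * k)) * (2 * sigma2 * (z_a**2 - z_ap**2)**2
                                   / (omega2 * (z_a * z0 - z_ap * root)**2))
    return math.ceil(n1)

# Reproduces Table 3.1: 31, 35, 39, 43, 49, 56
print([sample_size_ap(ap) for ap in (0.50, 0.51, 0.52, 0.53, 0.54, 0.55)])
# With k = 2, n1 is inflated by the factor (k+1)^2/(4k) = 1.125 discussed above
print(sample_size_ap(0.50, k=2))
```

Note the guard on Z1−AP < Z0: the target AP cannot exceed the prior probability of success, Φ(Z0).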
3.4 Sample Size for a Given Normalised Assurance

It is not usual to sample size a study to achieve a given assurance, despite the recommendation of McShane and Böckenholt (2016). The lack of uptake of the method is understandable given that assurance is bounded above by the prior probability of "success". An alternative is to use the normalised expected power introduced by Muirhead and Şoaita (2013). In their related "proper Bayesian" approach, which we will review in Chapter 5, they suggest that the Bayesian equivalent of AP should be normalised with respect to this upper bound when using it to determine the appropriate sample size. If we were to apply this idea to our concrete case, this would entail using the normalised function

NA = \frac{\Phi\left( \sqrt{f_0}\,( Z_1 - Z_{1-\alpha} ) \right)}{\Phi\left( \delta_0 \big/ \sqrt{2\sigma^2/n_0} \right)}.   (3.9)

Ciarleglio and Arendt (2017) call this the Conditional Expected Power (CEP) approach. In effect, this approach is equivalent to working with a constrained prior distribution for the treatment effect in which the constraint says that a priori, we believe that the treatment effect is positive. This idea will be followed up in Section 4.2 when we consider the proposal of Lan and Wittes (2013) to use a normal prior, truncated at zero. Noting that δ0/√(2σ²/n0) = δ0/ω = Z0, the normalised version of (3.3) for the special case σ² = σ1² = σ2² and m1 = m2 = n1 is given by

\text{Normalised Assurance} = NA = \frac{1}{\Phi(Z_0)}\,\Phi\left( \sqrt{f_0}\,( Z_1 - Z_{1-\alpha} ) \right).

Rewriting (3.9) as

NA \times \Phi(Z_0) = \Phi\left( \sqrt{f_0}\,( Z_1 - Z_{1-\alpha} ) \right),   (3.10)
we can set 1 − β(NA) = NA × Φ(Z0) and then use (3.7) to give us the appropriate sample size to achieve a given normalised assurance.

Example 3.1 (Continued)
Continuing with Example 3.1, we assume that what we require now is the necessary sample size to give an NA of 65% to detect a "clinically meaningful" treatment difference of 4 units. With these assumptions, we are interested in the sample size to give a value for 1 − β(NA) = 0.65 × 0.691 = 0.449 and from (3.7), we can calculate the required sample size as n1 = 20 patients per arm.

TABLE 3.2
Sample Sizes, Per Arm, to Give a Specified Normalised Assurance (3.9/3.7) under the Assumptions of Examples 2.1 and 3.1

Normalised Assurance    Sample Size (3.9/3.7)
0.65                    20
0.70                    27
0.75                    38
0.80                    58
0.85                    101
0.90                    220

The sensitivity to the choice of the NA value is investigated in Table 3.2, which shows the sample sizes per arm for NAs from 0.65 to 0.90 in steps of 0.05. The sample sizes range from 20 to 220 patients per arm.
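The NA-based sample sizes in Table 3.2 follow by converting the target NA into a target power via (3.10) and then inverting (3.7). A sketch (same illustrative assumptions as above; function names are this sketch's, and scipy is assumed):

```python
import math
from scipy.stats import norm

DELTA0, SIGMA2, OMEGA2, ALPHA = 4.0, 64.0, 64.0, 0.025

def sample_size_power(power):
    """Invert (3.7): per-arm n1 achieving a target (average) power."""
    z_a, z_p = norm.ppf(1 - ALPHA), norm.ppf(power)
    z0 = DELTA0 / math.sqrt(OMEGA2)
    root = math.sqrt(z0**2 + z_a**2 - z_p**2)
    return math.ceil(2 * SIGMA2 * (z_a**2 - z_p**2)**2
                     / (OMEGA2 * (z_a * z0 - z_p * root)**2))

def sample_size_na(na):
    """(3.10): the target power is NA * Phi(Z0), Phi(Z0) being the prior POS."""
    z0 = DELTA0 / math.sqrt(OMEGA2)
    return sample_size_power(na * norm.cdf(z0))

# Reproduces Table 3.2: 20, 27, 38, 58, 101, 220
print([sample_size_na(na) for na in (0.65, 0.70, 0.75, 0.80, 0.85, 0.90)])
```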
3.5 Applying Assurance to a Series of Studies In their paper applying assurance to clinical trials, O’Hagan et al. (2005) discuss practical uses of assurance, including a series of studies. For example, we might ask what is our current estimate of the POS of a Phase 2a study followed by a Phase 2b. More complex decisions might compare different scenarios for development programs based on the POS. In practice, it is not only of interest to calculate the overall POS for a set of studies, but also the POS of a study, or studies, conditional on a positive outcome of earlier studies. For example, a development team might be interested in the POS of a planned Phase 2b study, conditional on a positive outcome in a Phase 2a study. These ideas have been proposed in a recent paper by Temple and Robertson (2021) for normally distributed outcomes.
To illustrate how assurance can be applied to a development program, we assume that the drug development team is at the stage of planning a Phase 2a POC study, with the prospect of conducting a Phase 2b dose selection study and two Phase 3 studies at a later juncture. We will label the studies i = 1, …, 4 corresponding to 2a, 2b, 3(1) and 3(2), respectively, each of which will give rise to independent data δ̂i (i = 1, …, 4), all estimating a treatment effect, the difference between treatments, δ. Further, we will denote the number of patients per arm by ni (i = 1, …, 4). Continuing with the assumption that the data from each individual study are normally distributed with known variance, then from (1.1), they are distributed as δ̂i ~ N(δ, 2σ²/ni). If the prior distribution for δ is given by (2.3), then the unconditional, multivariate predictive distribution of the estimates δ̂i (i = 1, …, 4) is (Lindley and Smith, 1972)
\begin{pmatrix} \hat\delta_1 \\ \hat\delta_2 \\ \hat\delta_3 \\ \hat\delta_4 \end{pmatrix} \sim N_4\left[ \begin{pmatrix} \delta_0 \\ \delta_0 \\ \delta_0 \\ \delta_0 \end{pmatrix},\; 2\sigma^2 \begin{pmatrix} \frac{n_0+n_1}{n_0 n_1} & \frac{1}{n_0} & \frac{1}{n_0} & \frac{1}{n_0} \\ \frac{1}{n_0} & \frac{n_0+n_2}{n_0 n_2} & \frac{1}{n_0} & \frac{1}{n_0} \\ \frac{1}{n_0} & \frac{1}{n_0} & \frac{n_0+n_3}{n_0 n_3} & \frac{1}{n_0} \\ \frac{1}{n_0} & \frac{1}{n_0} & \frac{1}{n_0} & \frac{n_0+n_4}{n_0 n_4} \end{pmatrix} \right].   (3.11)
A direct consequence of the common parameter δ in each study is that in the unconditional multivariate predictive distribution, the study treatment estimates are no longer independent, each pair of studies having a predictive correlation which is given by

\rho_{ij} = \sqrt{ \frac{n_i n_j}{(n_0 + n_i)(n_0 + n_j)} }.   (3.12)
This dependence was remarked on by Carroll (2013), who described it as one of several "counterintuitive" properties of assurance. The marginal predictive density of each study's data from (3.11) is δ̂i ~ N(δ0, 2σ²(n0 + ni)/(n0 ni)) and therefore the assurance for each individual study in the program is

1 - \beta_i = \Phi\left( \sqrt{f_0(i)}\,\big( Z_1(i) - Z_{1-\alpha_i} \big) \right)   (3.13)

where Z_{1-\alpha_i} = \Phi^{-1}(1-\alpha_i), f_0(i) = n_0/(n_0 + n_i) and Z_1(i) = \delta_0 \sqrt{n_i/(2\sigma^2)}.
Example 3.2
Temple and Robertson (2021) illustrate an extension to assurance using the following example. They consider the same structure as ours with the following four designs.
Phase 2a: a placebo-controlled study with two arms of 60 patients per arm, in which success is defined as at least an 80% posterior probability of a positive treatment effect based on a vague prior.
Phase 2b: a placebo-controlled dose-finding study with 5 arms, the placebo arm and the top dose having 100 patients each, the other arms having 50 patients each. Success is defined as at least a 90% posterior probability of a positive treatment effect based on a vague prior.
Phase 3: two identical placebo-controlled two-arm studies of 250 patients per arm, in which success is defined as a significant p-value at the 5% two-sided level.
To illustrate their work, Temple and Robertson (2021) assume that the value of the known variance, σ², is 1 and that the design prior for the treatment effect is bimodal with 50% weight on a zero treatment effect with a standard deviation of 0.01 and 50% on a treatment effect of 0.2 with a standard deviation of 0.1. From these assumptions, the prior distribution for the treatment effect is given by (2.13) in which ω0 = 0.5, δ0 = 0.2, n0 = 200 and n0′ = 20,000. With these assumptions, we can use (3.13) to give the following individual assurance values:
Phase 2a: 0.5 × 0.588 + 0.5 × 0.200 = 0.394
Phase 2b: 0.5 × 0.543 + 0.5 × 0.101 = 0.322
Phase 3: 0.5 × 0.573 + 0.5 × 0.026 = 0.299.
Temple and Robertson (2021) obtained comparable assurances based on 100,000 simulations. For the Phase 2a study, they report 0.39, for the Phase 2b study, they report 0.32 and for Phase 3, they report 0.21.
This latter probability is the probability of a successful phase 3 program, so that Temple and Robertson (2021) are requiring both Phase 3 studies to meet the significance criteria, which is logical as a regulatory authority would require that too.
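The individual assurances above can be computed directly from (3.13), one prior component at a time. A sketch (not the authors' code; function names are this sketch's, scipy assumed), noting that an 80% or 90% posterior probability of a positive effect under a vague prior is equivalent to a one-sided test at αi = 0.2 or 0.1, and the Phase 3 criterion to αi = 0.025:

```python
import math
from scipy.stats import norm

def assurance(n_i, alpha_i, delta0, n0, sigma2=1.0):
    """(3.13): Phi( sqrt(f0(i)) * (Z_1(i) - Z_{1-alpha_i}) ) for one prior component."""
    f0 = n0 / (n0 + n_i)
    z1 = delta0 * math.sqrt(n_i / (2 * sigma2))
    return norm.cdf(math.sqrt(f0) * (z1 - norm.ppf(1 - alpha_i)))

def mixture_assurance(n_i, alpha_i):
    # 50:50 mixture prior: (delta0 = 0.2, n0 = 200) and (delta0 = 0, n0' = 20000)
    return 0.5 * assurance(n_i, alpha_i, 0.2, 200) + 0.5 * assurance(n_i, alpha_i, 0.0, 20000)

# Phase 2a (n = 60, alpha_i = 0.2), Phase 2b (n = 100, alpha_i = 0.1),
# Phase 3 (n = 250, alpha_i = 0.025): approximately 0.394, 0.322, 0.299
print([round(mixture_assurance(n, a), 3) for (n, a) in [(60, 0.2), (100, 0.1), (250, 0.025)]])
```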
From (3.11), the predictive distribution of δ̂3 and δ̂4 is

\begin{pmatrix} \hat\delta_3 \\ \hat\delta_4 \end{pmatrix} \sim N_2\left[ \begin{pmatrix} \delta_0 \\ \delta_0 \end{pmatrix},\; 2\sigma^2 \begin{pmatrix} \frac{n_0+n_3}{n_0 n_3} & \frac{1}{n_0} \\ \frac{1}{n_0} & \frac{n_0+n_4}{n_0 n_4} \end{pmatrix} \right],
a result also given by Zhang and Zhang (2013). From this bivariate normal distribution, we can determine the probability that both Phase 3 trials are significant as

\mathrm{Prob}\left( \bigcap_{i=3}^{4} \left[ y_i = \frac{\hat\delta_i - \delta_0}{\sqrt{2\sigma^2 (n_0+n_i)/(n_0 n_i)}} > c_i \right] \right) = \int_{c_3}^{\infty} \int_{c_4}^{\infty} p(y_3, y_4)\, dy_3\, dy_4   (3.14)

where p(y3, y4) is the standard bivariate normal density (2.16) with correlation ρ = ρ34 (from (3.12)) and c_i = \sqrt{f_0(i)}\,\big( Z_{1-\alpha_i} - Z_1(i) \big).
As noted in Section 2.4, the probability (3.14) can be evaluated using PROBBNRM in SAS, or pmvnorm in the R library mvtnorm.

Example 3.2 (Continued)
Returning to the Temple and Robertson (2021) illustration, we can use (3.14) for both components of the prior distribution and combine them to get the overall assurance.

Component 1: c3 = c4 = −0.184; ρ34 = 0.556; B(c3, c4, ρ34) = 0.420
Component 2: c3 = c4 = 1.948; ρ34 = 0.012; B(c3, c4, ρ34) = 0.0007.

Consequently, the assurance of a successful Phase 3 program is 0.5 × 0.420 + 0.5 × 0.0007 = 0.210, which we can compare to the value of 0.21 which Temple and Robertson (2021) determined by simulation.
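In place of SAS's PROBBNRM or R's pmvnorm, the same joint probability can be sketched in Python using scipy's multivariate normal CDF; by symmetry of the centred normal, the upper-orthant probability B(h, k, ρ) = P(Y1 > h, Y2 > k) equals the CDF evaluated at (−h, −k). Function names here are this sketch's.

```python
import math
from scipy.stats import multivariate_normal, norm

def B(h, k, rho):
    """Upper-tail bivariate normal probability P(Y1 > h, Y2 > k)."""
    return multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]]).cdf([-h, -k])

def cutoff(n_i, alpha_i, delta0, n0, sigma2=1.0):
    """Standardised success cut-off c_i = sqrt(f0(i)) * (Z_{1-alpha_i} - Z_1(i))."""
    f0 = n0 / (n0 + n_i)
    return math.sqrt(f0) * (norm.ppf(1 - alpha_i) - delta0 * math.sqrt(n_i / (2 * sigma2)))

def rho_pair(n_i, n_j, n0):
    return math.sqrt(n_i * n_j / ((n0 + n_i) * (n0 + n_j)))  # (3.12)

# Both Phase 3 studies (n = 250 per arm, one-sided alpha = 0.025), two prior components
parts = [B(cutoff(250, 0.025, d0, n0), cutoff(250, 0.025, d0, n0), rho_pair(250, 250, n0))
         for (d0, n0) in [(0.2, 200), (0.0, 20000)]]
program_assurance = 0.5 * parts[0] + 0.5 * parts[1]
print(program_assurance)   # approximately 0.210
```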
Equation (3.14) provides the joint probability that the two Phase 3 studies meet the statistical criteria for success. As the Phase 3 sample sizes → ∞, ρ34 → 1 and c3 = c4 → −Z0. Consequently, the assurance not only of each individual study but also of a successful Phase 3 program is bounded above by the prior POS, Φ(Z0). Carroll (2013) has described this property as "counterintuitive".
More generally,

Q_{ij} = \mathrm{Prob}\left( y_i > c_i \wedge y_j > c_j \right) = \int_{c_i}^{\infty} \int_{c_j}^{\infty} g(y_i, y_j)\, dy_i\, dy_j   (3.15)
gives the probability that any pair of studies will meet their statistical criteria. Analogously,

Q_{ijk} = \mathrm{Prob}\left( y_i > c_i \wedge y_j > c_j \wedge y_k > c_k \right) \quad \text{and} \quad Q_{ijkl} = \mathrm{Prob}\left( y_i > c_i \wedge y_j > c_j \wedge y_k > c_k \wedge y_l > c_l \right),

where the integrals are over
the 3-dimensional and 4-dimensional standard normal distributions, respectively, with pairwise correlations given by (3.12). Wang et al. (2013) investigate the probability of program success (POSS) by considering the POS of the whole program for multiple studies with binary endpoints. This approach can be generalised to encompass what Temple and Robertson (2021) characterise as conditional assurance (CA). Building on work by Walley et al. (2015), they argue that CA allows planners to ask questions about how success in the next study will alter beliefs about the POS of subsequent studies. For example, in planning a Phase 2a study followed by a Phase 2b study, we may be interested in determining the assurance for the Phase 2b study conditional on the Phase 2a study being successful. For this simplest of cases, if our prior has the form (2.3), then

CA(Study 2 | Study 1 successful) = \frac{Q_{12}}{AP_1},

where Q12 can be calculated from (3.15). Alternatively, if the mixture prior distribution, (2.13), is a more appropriate representation of our belief about the treatment effect, then

CA(Study 2 | Study 1 successful) = \frac{\omega_0 Q_{12} + (1-\omega_0) Q'_{12}}{\omega_0 AP_1 + (1-\omega_0) AP'_1},   (3.16)

where Q12, AP1 and Q′12, AP′1 are evaluated under the first and second components of the prior, respectively.

Example 3.2 (Continued)
Concentrating on the numerator in (3.16), we have

Component 1: c1 = −0.222; c2 = −0.108; ρ12 = 0.277; B(c1, c2, ρ12) = 0.363
Component 2: c1 = 0.840; c2 = 1.278; ρ12 = 0.004; B(c1, c2, ρ12) = 0.020.
Consequently, the overall numerator is 0.5 × 0.363 + 0.5 × 0.020 = 0.192. The denominator was previously shown to be 0.394 so that overall the CA (Study 2 |Study 1 successful) = 0.192/0.394 = 0.486. Again, we can compare this result to the value of 0.48 which Temple and Robertson (2021) determined by simulation.
What this shows is that if the Phase 2a study turns out to be positive, our CA in the success of the Phase 2b study increases by 50% as compared to the direct assurance of the study. For more complex CAs, (3.16) can be generalised in an obvious way. For example, suppose we are interested in the CA of a successful Phase 3 program given that both the Phase 2a and Phase 2b studies are successful. Assuming again that (2.13) is our prior,

CA(Study 3 ∧ Study 4 | Study 1 ∧ Study 2) = \frac{\omega_0 Q_{1234} + (1-\omega_0) Q'_{1234}}{\omega_0 Q_{12} + (1-\omega_0) Q'_{12}}.

Using the R function pmvnorm, we calculate Q1234 = 0.229 and Q′1234 = 0.000016, so that overall CA(Study 3 ∧ Study 4 | Study 1 ∧ Study 2) = (0.5 × 0.229 + 0.5 × 0.000016)/0.192 = 0.598. There are other CAs available. For example, Temple and Robertson (2021) also consider both CA(Study 3 ∧ Study 4 | Study 1) and CA(Study 3 ∧ Study 4 | Study 2). For their example, we find that CA(Study 3 ∧ Study 4 | Study 1) = 0.389 and CA(Study 3 ∧ Study 4 | Study 2) = 0.470, which from their simulations Temple and Robertson give as 0.39 and 0.47, respectively.
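A sketch of these conditional-assurance calculations, with scipy's multivariate normal CDF playing the role of pmvnorm (the orthant probability P(all Yi > ci) equals the CDF at −c by symmetry). Function names and code organisation are this sketch's.

```python
import math
import numpy as np
from scipy.stats import multivariate_normal, norm

STUDIES = [(60, 0.2), (100, 0.1), (250, 0.025), (250, 0.025)]  # (n_i, alpha_i) for studies 1..4

def q_upper(idx, delta0, n0, sigma2=1.0):
    """Q for the studies in `idx`: P(all y_i > c_i) under one prior component."""
    cuts, dim = [], len(idx)
    for i in idx:
        n_i, a_i = STUDIES[i]
        f0 = n0 / (n0 + n_i)
        cuts.append(math.sqrt(f0) * (norm.ppf(1 - a_i) - delta0 * math.sqrt(n_i / (2 * sigma2))))
    cov = np.eye(dim)
    for r in range(dim):
        for s in range(r + 1, dim):
            ni, nj = STUDIES[idx[r]][0], STUDIES[idx[s]][0]
            cov[r, s] = cov[s, r] = math.sqrt(ni * nj / ((n0 + ni) * (n0 + nj)))  # (3.12)
    return multivariate_normal(mean=np.zeros(dim), cov=cov).cdf(-np.array(cuts))

def mix(idx):
    # 50:50 mixture prior: (delta0 = 0.2, n0 = 200) and (delta0 = 0, n0' = 20000)
    return 0.5 * q_upper(idx, 0.2, 200) + 0.5 * q_upper(idx, 0.0, 20000)

ca_2_given_1 = mix([0, 1]) / mix([0])             # approximately 0.486
ca_34_given_12 = mix([0, 1, 2, 3]) / mix([0, 1])  # approximately 0.598
print(ca_2_given_1, ca_34_given_12)
```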
The use of assurance for assessing different scenarios for developing a drug was an integral part of the development of assurance by O’Hagan et al. (2005). In their case, they consider a maximum of three studies, in three potential scenarios. In the first two scenarios, they consider a “preliminary phase 3 study” followed by a full phase 3 study which included both a futility analysis and an interim analysis. In the full phase 3 study, success included not only a statistically significant efficacy outcome, but also meeting a safety criterion. What differentiated these scenarios was the use of different doses. The third scenario included running a phase 2b study before the two phase 3 studies. Comparison of the scenarios included success at either the interim or the final analysis. In the next section, we illustrate how conditional assurance can be used to assess the impact of interim analyses and how passing a futility test at an interim can increase our assurance of ultimate success. Nixon et al. (2009) also provide an example of simulating a series of complex scenarios to determine the most appropriate scenario for developing a drug for Rheumatoid Arthritis out of 72 potential scenarios. In both publications, the authors made use of simulation, which, in many cases, is the most efficient approach.
3.6 A Single Interim Analysis in a Clinical Trial

In Chapter 7, we give a fuller treatment of conditional and PP concepts in sequential trials. In this section, we apply the ideas that were introduced in the previous section to trials with a single interim at which decisions about futility and/or efficacy can be taken. Our work here follows the suggestion by Temple and Robertson (2021) that the concept of conditional assurance can be used to evaluate the value of an interim, or interims, in a drug development program. To illustrate the ideas, we suppose that it is planned to run a group sequential design (GSD) with a single interim analysis which allows us to stop early for efficacy (superiority) or for futility. If the study fails to stop early, then we would continue to a final inference at the completion of the study. In Figure 3.1, we illustrate such a design in which the interim occurs at an information fraction of 50% and with an efficacy boundary based on an O'Brien/Fleming type rule (O'Brien and Fleming, 1979). If the standardised test statistic at the interim exceeds the cut-off ZS(1), the study can be stopped and efficacy claimed. On the other hand, if it is less than ZF, the study can be stopped for futility. If neither outcome occurs, the study continues to the end
[Figure 3.1: boundary plot of Z-statistic boundaries against information fraction, showing ZS(1) and ZS(2) on the efficacy boundary, ZF on the futility boundary, and the first and second stages.]
FIGURE 3.1 Boundary plot of a group sequential design with a single interim allowing stopping for efficacy or futility.
and its success, or not, can be judged by the final standardised test statistic being either greater or less than the second cut-off ZS(2). If we again assume that the prior distribution for δ is given by (2.3), then using the same approach that we took in the previous section, the unconditional, bivariate predictive distribution of the stagewise estimates δˆ1 and δˆ2 based on n1 and n2 patients per arm, respectively, is
\begin{pmatrix} \hat\delta_1 \\ \hat\delta_2 \end{pmatrix} \sim N_2\left[ \begin{pmatrix} \delta_0 \\ \delta_0 \end{pmatrix},\; 2\sigma^2 \begin{pmatrix} \frac{n_0+n_1}{n_0 n_1} & \frac{1}{n_0} \\ \frac{1}{n_0} & \frac{n_0+n_2}{n_0 n_2} \end{pmatrix} \right].   (3.17)
If we denote the interim and final estimates by

\hat\delta_I = \hat\delta_1, \qquad \hat\delta_F = \frac{n_1 \hat\delta_1 + n_2 \hat\delta_2}{n_1 + n_2},

then they can be expressed as the following simple linear transformation of (3.17)

\begin{pmatrix} \hat\delta_I \\ \hat\delta_F \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \frac{n_1}{n_1+n_2} & \frac{n_2}{n_1+n_2} \end{pmatrix} \begin{pmatrix} \hat\delta_1 \\ \hat\delta_2 \end{pmatrix}
giving the bivariate normal distribution
\begin{pmatrix} \hat\delta_I \\ \hat\delta_F \end{pmatrix} \sim N_2\left[ \begin{pmatrix} \delta_0 \\ \delta_0 \end{pmatrix},\; 2\sigma^2 \begin{pmatrix} \frac{n_0+n_1}{n_0 n_1} & \frac{n_0+n_1+n_2}{n_0 (n_1+n_2)} \\ \frac{n_0+n_1+n_2}{n_0 (n_1+n_2)} & \frac{n_0+n_1+n_2}{n_0 (n_1+n_2)} \end{pmatrix} \right]   (3.18)
with the correlation
\rho_{IF} = \sqrt{ \frac{n_1 (n_0 + n_1 + n_2)}{(n_1 + n_2)(n_0 + n_1)} }.
There are two one-sided tests at the interim for superiority and inferiority (futility) of the test treatment compared to the standard, respectively. The test of superiority at the interim requires that
\frac{\hat\delta_I}{\sqrt{2\sigma^2/n_1}} > Z_{S(1)}.
The probability of this event is

P_1 = \int_{Z_{S(1)}\sqrt{2\sigma^2/n_1}}^{\infty} \sqrt{ \frac{n_0 n_1}{4\pi\sigma^2 (n_0+n_1)} }\; \exp\left( -\frac{n_0 n_1}{4\sigma^2 (n_0+n_1)} \left( \hat\delta_I - \delta_0 \right)^2 \right) d\hat\delta_I

which can be written as

P_1 = 1 - \Phi\left( \sqrt{ \frac{n_0}{n_0+n_1} } \left( Z_{S(1)} - \delta_0 \sqrt{ \frac{n_1}{2\sigma^2} } \right) \right).   (3.19)
Similarly, the probability that futility is demonstrated at the interim is

P_2 = \Phi\left( \sqrt{ \frac{n_0}{n_0+n_1} } \left( Z_F - \delta_0 \sqrt{ \frac{n_1}{2\sigma^2} } \right) \right).   (3.20)
If neither test is positive, achieving success at the final analysis requires that

Z_F \sqrt{ \frac{2\sigma^2}{n_1} } < \hat\delta_I < Z_{S(1)} \sqrt{ \frac{2\sigma^2}{n_1} } \quad \text{and} \quad \hat\delta_F > Z_{S(2)} \sqrt{ \frac{2\sigma^2}{n_1+n_2} }.
The probability of this composite event is

P_3 = \int_{Z_F\sqrt{2\sigma^2/n_1}}^{Z_{S(1)}\sqrt{2\sigma^2/n_1}} \int_{Z_{S(2)}\sqrt{2\sigma^2/(n_1+n_2)}}^{\infty} p\left( \hat\delta_I, \hat\delta_F \right) d\hat\delta_F\, d\hat\delta_I.
The probability of the composite event is structurally identical to the probability denoted by II in Section 2.5 and therefore using the same approach can be written in the form
P_3 = B\left( \sqrt{ \frac{n_0}{n_0+n_1+n_2} } \left( Z_{S(2)} - \delta_0 \sqrt{ \frac{n_1+n_2}{2\sigma^2} } \right),\; \sqrt{ \frac{n_0}{n_0+n_1} } \left( Z_F - \delta_0 \sqrt{ \frac{n_1}{2\sigma^2} } \right),\; \rho_{IF} \right) - B\left( \sqrt{ \frac{n_0}{n_0+n_1+n_2} } \left( Z_{S(2)} - \delta_0 \sqrt{ \frac{n_1+n_2}{2\sigma^2} } \right),\; \sqrt{ \frac{n_0}{n_0+n_1} } \left( Z_{S(1)} - \delta_0 \sqrt{ \frac{n_1}{2\sigma^2} } \right),\; \rho_{IF} \right)   (3.21)
where B(h, k, ρ) is defined in (2.17).

Example 3.3
In illustrating the use of conditional assurance approaches to a GSD, we have chosen to consider a design that matches closely the single-stage design we have consistently used. The design uses one-sided 0.025 O'Brien–Fleming boundaries and an overall power of 78%, the latter chosen so that the total sample size matched the total sample size of 128 patients for the single-stage design. The boundaries for both futility and efficacy are given in Table 3.3. With the same prior distribution for the treatment effect, δ ~ N(4, 64), we can use (3.19), (3.20) and (3.21) to give the following unconditional probabilities:

P1 = 0.4298
P2 = 0.3796
P3 = 0.1259

so that the total unconditional POS is 0.4298 + 0.1259 = 0.5557. This probability is very close to the value of 0.5601, which we calculated in Example 2.1.
TABLE 3.3
Boundaries for Futility and Efficacy for a Group Design with O'Brien–Fleming Efficacy Boundaries

Stage    Information (%)    Sample Size Per Arm    Futility Boundary    Efficacy Boundary
1        50                 32                     0.73642              2.72892
2        50                 32                     1.92964              1.92964
One question that remains open is how we should view the likely success of a study that progresses to the second stage without stopping either for efficacy or futility. The unconditional probability of that occurring is simply given by 1 − P1 − P2, so that the conditional probability of final success given that the study does not stop early is simply

P(\text{Final Success} \mid \text{No Early Stopping}) = \frac{P_3}{1 - P_1 - P_2}.   (3.22)
Example 3.3 (Continued)
Continuing with this example, we have already determined each individual element in (3.22), so that the conditional POS of success given that the study fails to stop at the interim is given by

0.1259/(1 − 0.4298 − 0.3796) = 0.6609.

This implies that, absent the information from the interim, the POS is 0.5557, which increases to 0.6609 when we are given the information that the study continues after the interim. Suppose that we design the study without the option of stopping for futility. The boundaries for efficacy then change and are given in Table 3.4. The design now has an overall power of 80% at an alternative of 4 units. The unconditional probability of stopping at the interim is still given by (3.19) and therefore P1 = 0.4234, which, as might have been expected, is little changed from the previous design. The probability of continuing after the interim and finishing successfully can be calculated by setting ZF = −∞ in (3.21) to give P3 = 0.1361, slightly higher than in the previous design, and the overall POS is again slightly higher at 0.5595. In contrast, the conditional POS given that the study fails to stop at the interim is now given by

0.1361/(1 − 0.4234) = 0.2361,
TABLE 3.4
Efficacy Boundaries for a Group Design with O'Brien–Fleming Efficacy Boundaries

Stage    Information (%)    Sample Size Per Arm    Efficacy Boundary
1        50                 32                     2.79651
2        50                 32                     1.97743
which is considerably lower than in the previous design. This is not unexpected because the futility stops which would have occurred at the interim in the first design now occur at the final analysis, which reduces the conditional POS. The advantage of the interim is in terms of early stopping for futility and in the knowledge that if the study continues after the interim, the conditional POS is higher than the POS at the beginning of the study.
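Example 3.3 can be reproduced by evaluating (3.19)–(3.22) directly. A sketch (function names this sketch's, scipy assumed), taking σ² = 64 so that the prior δ ~ N(4, 64) corresponds to n0 = 2σ²/ω² = 2, and using the boundaries that reproduce the quoted probabilities (efficacy 2.72892/1.92964, futility 0.73642):

```python
import math
from scipy.stats import multivariate_normal, norm

SIGMA2, DELTA0, N0 = 64.0, 4.0, 2.0   # prior delta ~ N(4, 64) => n0 = 2*sigma^2/omega^2 = 2
N1 = N2 = 32                          # patients per arm at each stage
ZS1, ZS2, ZF = 2.72892, 1.92964, 0.73642

def std_cut(z_bound, n):
    """sqrt(n0/(n0+n)) * (z_bound - delta0*sqrt(n/(2 sigma^2)))."""
    return math.sqrt(N0 / (N0 + n)) * (z_bound - DELTA0 * math.sqrt(n / (2 * SIGMA2)))

def B(h, k, rho):
    """Upper-orthant bivariate normal probability P(Y1 > h, Y2 > k)."""
    return multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]]).cdf([-h, -k])

rho_IF = math.sqrt(N1 * (N0 + N1 + N2) / ((N1 + N2) * (N0 + N1)))
P1 = 1 - norm.cdf(std_cut(ZS1, N1))        # (3.19): stop early for efficacy
P2 = norm.cdf(std_cut(ZF, N1))             # (3.20): stop early for futility
cS2 = std_cut(ZS2, N1 + N2)
P3 = B(cS2, std_cut(ZF, N1), rho_IF) - B(cS2, std_cut(ZS1, N1), rho_IF)  # (3.21)

print(P1, P2, P3)                 # approximately 0.4298, 0.3796, 0.1259
print(P3 / (1 - P1 - P2))         # conditional POS (3.22), approximately 0.6609
```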
A similar idea was proposed by Broglio et al. (2014). They investigated the feasibility of predicting the results of ongoing clinical trials based solely on publicly available information, such as a press release by a sponsor following a meeting of an Independent Data Monitoring Committee (IDMC). Their approach was to predict the future outcome of a study based on such information releases. They illustrated their approach by predicting the eventual success of a clinical trial in stage II and III colon cancer patients after the third interim analysis. At that point, their predicted probabilities were 48% for eventual success, 7.4% for stopping early for futility and 44.5% for continuing to the end without statistical significance. In the event, the exercise proved the value of the method since the study did indeed end without statistical significance. One issue that needs to be addressed in these circumstances is whether the stopping boundaries are binding or non-binding. If the boundaries are non-binding, then merely reporting that the IDMC has recommended continuation does not of itself allow a proper assessment of future success or failure, since we cannot necessarily assume the range of values for the efficacy estimate which continuation implies.
3.7 Non-Inferiority Trials In the development in the previous section, we concentrated on trials for which the objective was to determine the superiority of a new drug compared to a control, which could be a placebo or standard of care (SOC). We turn now to a simple non-inferiority trial. In the 1960s and 1970s, it was not uncommon that if a clinical trial with a positive control was found to give a statistically non-significant result, a claim would be made that the test drug was as good as the control. This has obvious deficiencies. For example, if the approach were valid, then we could make the claim simply by choosing a very small number of patients, say 3, in each group. Statistically, a small number of patients means that there is little chance of a positive outcome to the trial and therefore we would claim equality.
In the early 1970s, the idea of showing equivalence of different formulations of the same drug simply by testing their difference was challenged, not least because by choosing an inadequate sample size a sponsor might claim that showing no difference meant that the two drugs were equivalent. As a result, it was suggested that to demonstrate equivalence, we should know what we mean by equivalence. In other words, how big a difference between formulations would be necessary before we had shown that they were non-equivalent. Such considerations led to the introduction of so-called bounds of equivalence within which the two formulations were deemed to be the same. Equivalence was shown by calculating a confidence interval for the measure of difference and if these limits were wholly within the pre-specified bounds, equivalence was established (Westlake, 1972, 1976, 1979). Alternatively, from a Bayesian perspective, equivalence was shown if the posterior probability of the measure of difference being within the limits was sufficiently large (Rodda and Davis, 1980; Mandallaz and Mau, 1981; Selwyn et al., 1981; Flühler et al., 1983). In the context of clinical trials, however, it is unlikely that we would normally wish to restrict ourselves to equivalence if the test drug was very much more efficacious than the comparator. An exception to this is the approach of Bauer and Bauer (1994), who consider simultaneous bounds for the difference in mean efficacy and the ratio of variances in the treatment and control arms (see also Grieve, 1998). In the 1980s, the idea of one-sided equivalence or non-inferiority was developed precisely for such scenarios (Blackwelder, 1982). In this case, only one bound, the so-called non-inferiority margin, needs to be defined and if the confidence interval for the difference measure lies wholly to the right of this margin, non-inferiority is deemed to be established. Again, a Bayesian equivalent based on posterior probabilities can be considered.
The margin is chosen such that if the test compound is non-inferior to the control, we would be able to infer that the test compound would have beaten the placebo had we compared them. However, the choice of this margin has not been without some controversy. A non-inferiority trial is designed to show that a new treatment is no more than a small amount less effective than a given reference treatment, generally the current SOC, which is sometimes referred to as being “not acceptably different”. This type of trial design is most often used when the new treatment is in some way cheaper, safer or in its form, more convenient than the SOC. For example, the new treatment may be given QD rather than BID and might therefore be preferable if not appreciably less effective. The starting point for a non-inferiority trial is to define what is meant by the “small amount less effective” which can be thought of as a margin, most often called the non-inferiority margin, that we will denote by −Δ. There are four main approaches to determining the margin: consensus, the 95-95 method, the point estimate method and the synthesis method.
(i) Consensus
This approach envisages asking key opinion leaders (KOLs), clinicians and patients, or their representatives, to think about what level of efficacy they would be prepared to lose in exchange for whatever benefits are anticipated. One way is to establish a panel of KOLs who would be able to address, at the population level, the trade-offs that patients might be willing to make and propose a credible non-inferiority margin. Their views could be augmented by the views of patient groups, or they could be used in such a panel as well. An example of such a so-called Delphi process was described by Acuna et al. (2018). The conclusion of such a process is a fixed non-inferiority margin.
(ii) The 95-95 Method
This approach begins by determining an estimate of the complete effect of the active control or SOC relative to placebo, which is often denoted M1 (FDA, 2016). This is generally obtained from a meta-analysis of historic studies from which an overall estimate with associated 95% CI is calculated. As a conservative estimate of the difference, the lower 95% limit is used to define M1. Following on from M1, a smaller margin, M2, is defined so that it corresponds to preserving a predetermined fraction of the estimated SOC-placebo difference, often 50 or 75%. M2 can be interpreted as the largest loss of effect that is clinically acceptable in the comparison of the new drug compared to SOC, and again details of this idea can be found in the FDA guidance (FDA, 2016). Snapinn (2004) and Snapinn and Jiang (2008a, 2008b) give two different interpretations of the retained fraction. In the first, the fraction is thought of as a discounting of the effect of SOC to acknowledge any potential reduction in its effect over time. In that sense, it is an attempt to reduce any biases resulting from a violation of the so-called constancy assumption, by which it is assumed that the effect of the SOC remains constant over time.
In the second, the fraction takes the role of a threshold for demonstrating non-inferiority: the effect of the new drug must be greater than the specified fraction of the SOC's effect, since otherwise the evidence that the new drug would be superior to a putative placebo is insufficient. The purpose of non-inferiority trials in a regulatory context is primarily to provide evidence that the experimental therapy (T) is efficacious, in the sense that it would have been shown to be superior to placebo had a comparative study against placebo been run. If μT and μP are the parameters representing the true benefit of T and P on an appropriate scale, then a standard alternative hypothesis would be μT > μP or vice versa depending on context. If μS is the parameter representing the true benefit of the SOC against which T is to be tested
in a non-inferiority trial, then an equivalent hypothesis would be μT − μS > − (μS − μP), which would suggest a non-inferiority margin of μS − μP = ∆. If λ is the desirable fraction of the effect of S compared to P to be retained, then the relevant hypothesis is μT − μP > λ(μS − μP) or equivalently μT − μS > − (1 − λ)(μS − μP), implying that the non-inferiority margin is (1 − λ)(μS − μP) = ∆. In this approach, the margin M2 is regarded as fixed and a non-inferiority trial is defined to be a success if the lower 95% confidence limit of the difference between the new drug and SOC is greater than M2, if this is the appropriate direction of the effect.
(iii) Point Estimate Method
In the point estimate method, the determination of the margin is based on a meta-analysis of studies comparing the SOC to placebo. As in the 95-95 method, the margin is based on the point estimate of the treatment effect of the SOC compared to placebo and the fraction of that treatment effect that it is desirable to preserve.
(iv) Synthesis Method
In both the 95-95 method and the point estimate method, the analysis of the subsequent non-inferiority study regards the derived margin as fixed, with no uncertainty. One aspect of uncertainty that this approach ignores is study-to-study variability, which may lead to either under- or over-estimation. The 95-95 method overcomes this by defining the margin in terms of the "smallest effect size" from historical trials, which is defined by the 95% confidence limit closest to the null effect. In contrast, in the synthesis method, it is unnecessary to pre-specify either a margin or an effect of the SOC. What it does require is that we pre-specify the fraction of the effect of the SOC which we want the new drug to retain.
The test of the non-inferiority hypothesis combines an estimate of the standard error for the comparison of active control with placebo, obtained from historical data, with the estimate and SE for the comparison of the new treatment with the active control in the current study.

3.7.1 Fixed Margin

To illustrate the role of assurance in this context, we again use the special case for which σ² = σ1² = σ2², m1 = m2 = n1 and ω² = 2σ²/n0. If Δ represents the fixed non-inferiority margin, obtained by consensus, from the 95-95 method or from the point estimate method, then in analogy to the decision criterion (2.2), a success in the non-inferiority trial occurs if

\hat\delta + \Delta > Z_{1-\alpha} \sqrt{ \frac{2\sigma^2}{n_1} }.
The POS can then be calculated from the marginal predictive distribution given by (2.10) to give

\text{Assurance} = \Phi\left( \sqrt{f_0} \left( Z_1 - Z_{1-\alpha} + \Delta \sqrt{ \frac{n_1}{2\sigma^2} } \right) \right)   (3.23)

where Z1 is given in (2.5). Writing (3.23) in the form

\text{Assurance} = \Phi\left( \sqrt{f_0} \left( (\delta_0 + \Delta) \sqrt{ \frac{n_1}{2\sigma^2} } - Z_{1-\alpha} \right) \right),   (3.24)

we can see that it is equivalent to (2.6) with prior expectation δ0 + Δ rather than δ0. That being the case, we can use (3.7) and (3.8) with δ0 replaced by δ0 + Δ to determine the appropriate sample size to achieve a given assurance, or (3.10), again replacing δ0 with δ0 + Δ, to determine the appropriate sample size to achieve a given normalised assurance.
Example 3.4 Continuing with Example 2.1, for purely illustrative purposes, we now assume that instead of wanting to establish superiority against placebo, we are content to demonstrate non-inferiority against a comparator with a margin of Δ = 1 unit, all other assumptions remaining unchanged. From these assumptions, we derive the following values:
    Z1 = 2.8284, f0 = 0.0303, √(n1/2) Δ/σ = 0.7071,

which in (3.16) implies that Assurance = 0.6081.
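As a quick check of this calculation, the sketch below evaluates the fixed-margin non-inferiority assurance of (3.24). The settings n1 = 64, n0 = 2, δ0 = 4 and σ = 8 are assumptions inferred from Example 2.1: they reproduce the quoted values Z1 = 2.8284, f0 = 0.0303 and √(n1/2)Δ/σ = 0.7071.

```python
from math import erf, sqrt

def phi(x):
    # Standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ni_assurance(n1, n0, delta0, sigma, margin, z_alpha=1.959964):
    # Fixed-margin non-inferiority assurance, equation (3.24):
    # 1 - Phi( sqrt(f0) * (Z_{1-alpha} - sqrt(n1/2) * (delta0 + margin) / sigma) )
    f0 = n0 / (n0 + n1)
    return 1.0 - phi(sqrt(f0) * (z_alpha - sqrt(n1 / 2.0) * (delta0 + margin) / sigma))

# Settings assumed to match Examples 2.1 and 3.4
print(round(ni_assurance(n1=64, n0=2, delta0=4.0, sigma=8.0, margin=1.0), 4))  # → 0.6081
```

Setting margin = 0 recovers the superiority assurance of (2.6); a wider margin raises the assurance, as expected.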
Additionally, as the sample size n1 → ∞, the value of assurance given by (3.24) tends to

    Assurance = Φ( √(n0/2) (δ0 + Δ)/σ ),    (3.25)

which provides an upper bound on the achievable assurance of non-inferiority and, as in the previous cases we have considered, is equivalent to the prior probability of the event of interest, in this case, non-inferiority.
3.7.2 Synthesis Method
The role of the synthesis method in non-inferiority testing has been investigated by many authors (e.g., Holmgren 1999; Wang et al. 2002; Rothmann et al. 2003; Snapinn 2004; Hung et al. 2005; Snapinn and Jiang 2008a, 2008b; Schumi and Wittes 2011; Yu et al. 2019). The various proposals have a similar structure: the hypothesis that T preserves more than 100f% of the effect of SOC compared to P, on whatever scale is deemed appropriate (log(OR), log(relative risk), absolute risk reduction, etc.), can be tested using the statistic

    Z = ( δ̂TS − (1 − f) δ̂SP ) / √( 2σ²/n1 + (1 − f)² s²SP )

where δ̂TS is the effect of T compared to SOC (S) in the current non-inferiority trial, δ̂SP is the effect of S compared to P in a series of historical trials, 2σ²/n1 is the variance of δ̂TS and s²SP is the estimated variance of δ̂SP. This statistic is compared to Zα and if it is less than Zα, non-inferiority can be concluded. To understand the power and the required sample size for a non-inferiority trial using the synthesis method, the first step is to rewrite this decision rule in the form

    δ̂TS < (1 − f) δ̂SP − Z_{1−α} √( 2σ²/n1 + (1 − f)² s²SP ).    (3.26)
At the time of planning, δ̂TS ~ N(δTS, 2σ²/n1). Rothmann et al. (2012) show that, for a given power 1 − β, the standard error of δ̂TS, σn1 = √(2σ²/n1), satisfies the quadratic equation

    Z_{1−β} σn1 = (1 − f) δ̂SP − δTS − Z_{1−α} √( σn1² + (1 − f)² s²SP ),

whose solution gives the sample size per arm in the form

    n1 = 2σ² (Z²_{1−β} − Z²_{1−α})² / [ Z_{1−β} ( (1 − f) δ̂SP − δTS ) − Z_{1−α} √( ( (1 − f) δ̂SP − δTS )² + (Z²_{1−β} − Z²_{1−α}) (1 − f)² s²SP ) ]².    (3.27)
Example 3.5
To illustrate the use of this formula, we utilise a hypothetical example from Rothmann et al. (2012) in which the primary endpoint is a continuous variable with smaller values being preferable. They make the following assumptions:

    One-sided type I error                       α       0.025
    Power                                        1 − β   0.90
    Estimated mean difference between S and P    δ̂SP    4.5 units
    Standard error of δ̂SP                       sSP     0.6 units
    Retention fraction of the effect of S        f       0.6
    Known population variance                    σ²      100
    Targeted difference between T and S          δTS     −0.5 units
With these assumptions, the sample size per arm is given by (3.27) as

    n1 = 2 × 100 × (1.960² − 1.282²)² / [ 1.282 × (0.4 × 4.5 + 0.5) − 1.960 × √( (0.4 × 4.5 + 0.5)² + (1.282² − 1.960²) × 0.4² × 0.6² ) ]² ≈ 427 per arm.
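The arithmetic of (3.27) is easy to mechanise. The following sketch implements the sample-size formula with the assumptions of Example 3.5; the default critical values correspond to one-sided α = 0.025 and 90% power.

```python
from math import sqrt

def synthesis_n_per_arm(sigma2, f, d_sp, s_sp, d_ts, z_alpha=1.959964, z_beta=1.281552):
    # Sample size per arm for the synthesis method, equation (3.27)
    a = (1.0 - f) * d_sp - d_ts              # (1 - f) * delta_SP_hat - delta_TS
    b2 = ((1.0 - f) * s_sp) ** 2             # (1 - f)^2 * s_SP^2
    num = 2.0 * sigma2 * (z_beta ** 2 - z_alpha ** 2) ** 2
    den = (z_beta * a - z_alpha * sqrt(a * a + (z_beta ** 2 - z_alpha ** 2) * b2)) ** 2
    return num / den

# Assumptions of Example 3.5
n1 = synthesis_n_per_arm(sigma2=100.0, f=0.6, d_sp=4.5, s_sp=0.6, d_ts=-0.5)
print(n1)  # about 426.4, i.e. 427 per arm after rounding up
```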
Rothmann et al. (2012) show that if δTS ≥ (1 − f)(δ̂SP − Z_{1−α} sSP), the "conditional power" of the trial will always be less than α. Their use of the term "conditional power" comes from the fact that all the calculations are conditional on δ̂SP and sSP, which have already been determined from historical studies.

Example 3.5 (Continued)
The above result determined by Rothmann et al. (2012) implies that as the true treatment effect δTS tends to (1 − f)(δ̂SP − Z_{1−α} sSP), the sample size necessary to achieve the required power gets larger and larger, as illustrated in Figure 3.2, which shows the sample size required to achieve 90% power for a range of assumed values of δTS. The asymptote for the sample size occurs at (1 − f)(δ̂SP − Z_{1−α} sSP) = 0.4 × (4.5 − 1.960 × 0.6) = 1.330.
FIGURE 3.2 Sample Size per Arm (n1) as a function of the Treatment Effect (δTS) to achieve a power of 90% with the assumptions of Example 3.5.
The assurance of a non-inferiority trial utilising the synthesis method can be calculated from the marginal predictive distribution given by (2.10) to determine the probability of achieving the criterion (3.26), giving

    Assurance = Φ( √f0 [ √(n1/2) ( (1 − f) δ̂SP − δ0 )/σ − Z_{1−α} √( 1 + n1 (1 − f)² s²SP/(2σ²) ) ] ).    (3.28)
Example 3.5 (Continued)
In this example, we use the same assumptions as we did previously in Example 3.5. Additionally, we need to consider the prior distribution for the effect of the new treatment, T, compared to the SOC, S. For illustrative purposes, we will assume that the prior mean treatment effect is δ0 = −0.5, which was one of the scenarios considered by Rothmann et al. (2012). To complete the prior specification, we assume that the prior is worth n0 = 100(50)300 patients per arm. Table 3.5 shows the assurance for the non-inferiority trial under consideration for this range of prior sample sizes per arm.
TABLE 3.5 Assurance for a Non-Inferiority Trial Using a Synthesis Approach Based on (3.28) with the Assumptions of Example 3.5 for a Range of Prior Sample Sizes Per Arm

    Prior Sample Size Per Arm   100     150     200     250     300
    Assurance                   0.596   0.631   0.659   0.682   0.702
3.7.3 Bayesian Methods
In Alderson et al. (2005), Susan Ellenberg commented:

    I think that the area of noninferiority trials is one that really lends itself to the Bayesian approach. After all, what we are doing is trying to understand our prior information when we develop our noninferiority margin. The Bayesian approach allows us to formally incorporate the variability among the different studies that give rise to the development of the margin in a way that really makes a lot of sense to me. We don't have to just eyeball it and say we will use the average of these studies, but in fact they are all over the map, so we really shouldn't rely on any particular estimate.
In this context, Hung and Wang (2013) propose a "…viable approach using a Bayesian view of the active control effect as a random parameter to allow for statistical uncertainty to be accounted for…". Before addressing this proposal, it is worth noting that a number of authors have proposed a Bayesian approach to non-inferiority trials, beginning with Simon (1999), who proposes a model in which the response y of a patient is modelled as y = μP + δSP x + δTS z. Hung and Wang's proposal can be achieved by using information from historical studies to construct a prior distribution for ∆, and then taking the expectation of (3.16) with respect to this prior distribution to give what is essentially an unconditional assurance (UA) of success. If we suppose that the outcome of a systematic review providing a posterior distribution for ∆ can be represented by a normal density with mean ∆0 and variance 2σ²/ng, then the UA is
    UA = ∫ Φ( √f0 ( Z1 − Z_{1−α} + √(n1/2) Δ/σ ) ) √( ng/(4πσ²) ) exp( −ng (Δ − Δ0)²/(4σ²) ) dΔ
       = (1/2π) ∫_{−∞}^{∞} ∫_{−∞}^{ √f0 ( Z1 − Z_{1−α} + √(n1/2) Δ0/σ + √(n1/ng) y ) } e^{−x²/2} e^{−y²/2} dx dy.    (3.29)
Then, using the result in Appendix 1 with a = √f0 ( Z1 − Z_{1−α} + √(n1/2) Δ0/σ ) and b = √f0 √(n1/ng), (3.29) can be expressed as
    UA = 1 − Φ( √( n0 ng/(n0 ng + n1 ng + n0 n1) ) ( Z_{1−α} − Z1 − √(n1/2) Δ0/σ ) )
       = Φ( √( n0 ng/(n0 ng + n1 ng + n0 n1) ) ( √(n1/2) (δ0 + Δ0)/σ − Z_{1−α} ) ).    (3.30)
The asymptotic properties of (3.30) are of interest, and there are two cases to consider. First, as ng → ∞, (3.30) converges to the assurance (3.24) with ∆ = ∆0, which corresponds to regarding the fixed margin as equal to the prior mean of ∆. Second, as n1 → ∞, (3.30) tends to

    Φ( √( n0 ng/(n0 + ng) ) (δ0 + Δ0)/(√2 σ) ),

which can be expressed as the prior probability that μT − μS is greater than a weighted average of −∆0 and δ0, in which the weights are √(ng/(n0 + ng)) and 1 − √(ng/(n0 + ng)), respectively.
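The closed form (3.30) can be checked against a direct Monte Carlo average of the fixed-margin assurance (3.24) over the prior for Δ. The numerical settings below (n1 = 64, n0 = 2, ng = 20, δ0 = 4, Δ0 = 1, σ = 8) are purely illustrative assumptions in the spirit of Examples 2.1 and 3.4.

```python
import random
from math import erf, sqrt

def phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ua_closed_form(n1, n0, ng, delta0, margin0, sigma, z_alpha=1.959964):
    # Unconditional assurance, equation (3.30)
    w = sqrt(n0 * ng / (n0 * ng + n1 * ng + n0 * n1))
    return phi(w * (sqrt(n1 / 2.0) * (delta0 + margin0) / sigma - z_alpha))

def ua_monte_carlo(n1, n0, ng, delta0, margin0, sigma, z_alpha=1.959964,
                   nsim=200_000, seed=1):
    # Average the fixed-margin assurance (3.24) over Delta ~ N(margin0, 2*sigma^2/ng)
    rng = random.Random(seed)
    f0 = n0 / (n0 + n1)
    total = 0.0
    for _ in range(nsim):
        margin = rng.gauss(margin0, sigma * sqrt(2.0 / ng))
        total += phi(sqrt(f0) * (sqrt(n1 / 2.0) * (delta0 + margin) / sigma - z_alpha))
    return total / nsim

args = dict(n1=64, n0=2, ng=20, delta0=4.0, margin0=1.0, sigma=8.0)  # illustrative
print(ua_closed_form(**args), ua_monte_carlo(**args))
```

The two values agree to Monte Carlo accuracy, confirming the integration step leading to (3.30).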
4 Average Power in Non-Normal Settings
4.1 Average Power Using a Truncated-Normal Prior
The results so far have used the standard assumption that both the likelihood and the prior are normal. Lan and Wittes (2013) explore other potential assumptions, including the use of a truncated normal prior in place of the standard normal, whilst at the same time holding the assumed data model fixed. The resulting power, which they denote TAP, can be expressed as

    TAP = [ ∫_0^∞ ( ∫_{−∞}^{ √(n1/2) δ/σ − Z_{1−α} } (1/√(2π)) e^{−w²/2} dw ) √( n0/(4πσ²) ) e^{−n0 (δ − δ0)²/(4σ²)} dδ ] / [ ∫_{−Z0}^{∞} (1/√(2π)) e^{−w²/2} dw ]    (4.1)
in which the denominator ensures that the prior distribution integrates to 1. Again, we can make use of the transformation y = √(n0/2) (δ − δ0)/σ, so that

    TAP = (1/Φ(Z0)) ∫_{−Z0}^{∞} Φ( Z1 − Z_{1−α} + √(n1/n0) y ) (1/√(2π)) e^{−y²/2} dy,

and from standard properties of bivariate normal distributions, this expression can be written in the form

    TAP = B( Z0, √f0 (Z1 − Z_{1−α}), √(1 − f0) ) / Φ(Z0)    (4.2)
in which Z1 was defined in (2.5), f0 in (2.8), Z0 in (3.8), and B(h, k, ρ) in (2.17). Lan and Wittes (2013) used numerical integration and an approximation to evaluate (4.1) for two values of δ0 and a range of total "sample sizes" n1 + n0.
Using the SAS function PROBBNRM, (4.2) replicates the probabilities in their Table 5 to four significant digits.

Example 4.1
Returning to the assumptions of Example 2.1, with the exception that the prior for the treatment effect (2.3) is truncated at zero, we have the following values:

    Z1 = 2.8284, f0 = 0.0303, Z0 = 0.5,

which in (4.2) implies that TAP = 0.5601/0.6915 = 0.810.
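The sketch below evaluates (4.1) directly by a midpoint rule, without recourse to the bivariate normal representation. The settings n1 = 64, n0 = 2, δ0 = 4, σ = 8 are assumptions inferred from Example 2.1 (they reproduce Z1 = 2.8284, f0 = 0.0303 and Z0 = 0.5); the result should agree with the value TAP = 0.810 above to the accuracy of the book's rounding.

```python
from math import erf, exp, pi, sqrt

def phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def tap_numeric(n1, n0, delta0, sigma, z_alpha=1.959964, steps=20_000):
    # Midpoint-rule evaluation of (4.1): normal prior truncated at zero
    sd0 = sigma * sqrt(2.0 / n0)          # prior standard deviation
    h = (delta0 + 10.0 * sd0) / steps     # integrate over [0, delta0 + 10 sd]
    num = 0.0
    for i in range(steps):
        d = (i + 0.5) * h
        power = phi(sqrt(n1 / 2.0) * d / sigma - z_alpha)
        prior = sqrt(n0 / (4.0 * pi * sigma ** 2)) * exp(-n0 * (d - delta0) ** 2 / (4.0 * sigma ** 2))
        num += power * prior * h
    z0 = sqrt(n0 / 2.0) * delta0 / sigma
    return num / phi(z0)                  # renormalise the truncated prior

tap = tap_numeric(n1=64, n0=2, delta0=4.0, sigma=8.0)
print(round(tap, 3))
```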
The result in (4.2) is related to the assurance of the composite event δ̂ > Z_{1−α} σ √(2/n1) and δ > 0 determined by O'Hagan et al. (2005). TAP is the conditional POS given that the treatment effect is positive, whilst the numerator of (4.2) is precisely O'Hagan et al.'s result and is also the sum of (2.18) and (2.19), or (2.18) for δMCID = 0. In (4.2), as n1 → ∞,

    TAP → B(Z0, Z0, 1)/Φ(Z0) = Φ(Z0)/Φ(Z0) = 1.    (4.3)
This result is related to a result by Muirhead and Şoaita (2013). They show that as the sample size increases, their normalised POS index tends to 1 as can be seen by considering n1 → ∞ in (4.3). Lan and Wittes (2013) also consider a Gamma prior for the mean of the normal likelihood. This does not give rise to an analytic solution using either method in Sections 2.2.1 or 2.2.2 and therefore requires simulation.
4.2 Average Power When the Variance Is Unknown: (a) Conditional on a Fixed Treatment Effect
If the variance σ² is unknown and s² is its estimate, then to obtain a "positive result", we need to observe a treatment effect δ̂ that is greater than

    t_{1−α,ν1} s √(2/n1)    (4.4)

in which ν1 = 2(n1 − 1) and t_{α,ν} is the α percentile of a t-distribution with ν degrees of freedom. The conditional probability of achieving (4.4) can then be rewritten in the form
    P( δ̂ > t_{1−α,ν1} s √(2/n1) ) = P( δ̂/(s √(2/n1)) > t_{1−α,ν1} ) = P( T( ν1, √(n1/2) δ/σ ) > t_{1−α,ν1} )    (4.5)

where T(ν, ϕ) is a non-central t-distribution with ν degrees of freedom and non-centrality parameter ϕ. This power function is also conditional. In this case, the conditionality in the power function depends on the parameter δ/σ, Cohen's (1969) standardised effect size. The unconditional power can be obtained by taking the expectation of (4.5) with respect to the prior distribution of δ/σ.
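Since the conditional power (4.5) only involves the non-central t-distribution, it can be checked by simulating T(ν1, ϕ) = (Z + ϕ)/√(χ²_{ν1}/ν1) directly. The values δ/σ = 0.5 and n1 = 64 below are illustrative assumptions in line with Example 2.1, and the critical value 1.979 is t_{0.975,126}.

```python
import random
from math import sqrt

def conditional_power_mc(effect_size, n1, t_crit, nsim=400_000, seed=2):
    # Simulate T(nu1, phi) = (Z + phi) / sqrt(chi2_nu1 / nu1), cf. (4.5),
    # with non-centrality phi = sqrt(n1/2) * (delta/sigma)
    rng = random.Random(seed)
    nu1 = 2 * (n1 - 1)
    ncp = sqrt(n1 / 2.0) * effect_size
    hits = 0
    for _ in range(nsim):
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(nu1 / 2.0, 2.0)   # chi-squared with nu1 df
        if (z + ncp) / sqrt(chi2 / nu1) > t_crit:
            hits += 1
    return hits / nsim

cp = conditional_power_mc(0.5, 64, 1.979)  # illustrative: delta/sigma = 0.5, n1 = 64
print(cp)
```

With 126 degrees of freedom the result is only slightly below the known-variance power Φ(Z1 − Z_{1−α}) = 0.807.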
4.3 Average Power When the Variance Is Unknown: (b) Joint Prior on Treatment Effect and Variance
Suppose a conjugate prior distribution is available, for example, the posterior distribution from a pilot study using a Jeffreys' prior, so that

    p(δ | σ²) ~ N( δ0, 2σ²/n0 ) and σ² ~ ν0 s0² χ_{ν0}^{−2}.    (4.6)

In the absence of a prior study, based on an algorithm developed by Martz and Waller (1982), Grieve (1987) provides a means of eliciting a conjugate prior for the variance from an expert (see also Grieve, 1991b) by assessing two quantiles from the prior. An example of this approach will be given in Section 10.4. Given the prior in (4.6), the prior for δ/σ is

    p( δ/σ | δ0, s0² ) ~ Λ′_{ν0}( √(n0/2) δ0/s0, 2/n0 )    (4.7)

in which the lambda-prime distribution Λ′_{ν}(a, b²) was developed by Lecoutre (1984, 1999).¹
A mid-point rule (Section 2.2.3) could be used to determine the unconditional POS. Alternatively, the predictive distribution of the t-statistic for n1 future observations per arm is

    t ~ K′_{ν0, ν1}( √(n1/2) δ0/s0, (n0 + n1)/n0 )

in which the K-prime distribution K′_{q0, q1}(a, b²) was also developed by Lecoutre (1984, 1999). The first parameter is √(n1/n0) times δ0/√(2s0²/n0), the value of the t-statistic based on the prior data/distribution (4.7). The unconditional power is then given by

    AP = P( K′_{ν0, ν1}( √(n1/2) δ0/s0, (n0 + n1)/n0 ) ≥ t_{1−α, ν1} ).    (4.8)
Grouin et al. (2007) show that as n1 → ∞, the unconditional power in (4.8) tends to

    AP → 1 − P( t_{ν0} ≥ δ0/√(2 s0²/n0) ) = 1 − p0    (4.9)
in which t_{ν0} has a t-density with ν0 degrees of freedom and p0 is the one-sided p-value based on the prior distribution. This is analogous to the results on upper bounds reported in previous sections.

Example 4.2
Continuing with Example 2.1, we assume now that the variance is unknown. Given s0 = 8 based on ν0 = 20 degrees of freedom, the corresponding prior distribution for δ/σ is Λ′_20(0.5, 1). Figure 4.1 displays the CP function (4.5) together with this prior density. The latter can be calculated using the LePAC software (Lecoutre and Poitevineau, 2020) or the function plambdap in the R library sadists. Comparing Figures 2.2 and 4.1, there is little difference in their shapes, although they are not directly comparable as the horizontal scales are different. In fact, in Figure 2.2, if we were to plot the power as a function of δ/σ and convert the prior to one for δ/σ, the plots are practically indistinguishable. This is confirmed when we use (4.8) to determine the AP, giving

    AP = P( K′_{20,126}( 2.8284, 33 ) ≥ 1.979 ) = 0.5564,

which is little different to the result achieved assuming the variance is known, namely 0.5601.
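The K-prime value above can be checked by brute force: simulate the joint prior (4.6), then the trial data, and count how often criterion (4.4) is met. The settings below are the Example 4.2 assumptions together with values inferred from Example 2.1 (n1 = 64, n0 = 2); the simulated AP should be close to 0.5564.

```python
import random
from math import sqrt

def ap_unknown_variance_mc(n1, n0, nu0, s0, delta0, t_crit, nsim=400_000, seed=3):
    # Draw from the joint prior (4.6), then simulate the trial data and
    # count how often criterion (4.4) is met: delta_hat > t_crit * s * sqrt(2/n1)
    rng = random.Random(seed)
    nu1 = 2 * (n1 - 1)
    hits = 0
    for _ in range(nsim):
        sigma2 = nu0 * s0 * s0 / rng.gammavariate(nu0 / 2.0, 2.0)  # nu0*s0^2*inv-chi2
        delta = rng.gauss(delta0, sqrt(2.0 * sigma2 / n0))
        d_hat = rng.gauss(delta, sqrt(2.0 * sigma2 / n1))
        s2 = sigma2 * rng.gammavariate(nu1 / 2.0, 2.0) / nu1       # pooled variance
        if d_hat > t_crit * sqrt(2.0 * s2 / n1):
            hits += 1
    return hits / nsim

ap = ap_unknown_variance_mc(n1=64, n0=2, nu0=20, s0=8.0, delta0=4.0, t_crit=1.979)
print(ap)  # should be close to 0.5564
```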
FIGURE 4.1 Conditional power function and prior density of the treatment difference for unknown variance: Example 4.2, restless leg syndrome RCT.
It is, perhaps, surprising that the known and unknown variance cases give results which are so close to one another, a result shown also by Chen and Ho (2017) and which Walley et al. (2015) confirm: "…experience of integrating over a prior for σ has been that there is little difference to using a fixed best estimate". To further investigate this result, we compared the distribution of δ/σ for the known variance case, which is N(0.5, 1), with Λ′_20(0.5, 1). There is very little difference between the two, the densities differing by no more than 10%, and that only in the extremities where it is effectively irrelevant. Even if we were to reduce the degrees of freedom of the lambda-prime distribution to values as low as 5, there would still be little difference between the two densities. Given that there is only a minimal difference between the APs in the known and unknown variance cases, and furthermore because it is a difference that is unlikely to be of any practical importance, this example challenges the need to consider the unknown variance case at all when determining sample size. In truth, it is almost certainly an irrelevant sophistication. The results which we have outlined here echo a remark by Grouin et al. (2007) that "solutions based on the known variance are generally very close, especially when the required sample sizes are high which is usually the case in confirmatory phase III trials". Nonetheless, they go on to say "…since computations remain easily tractable, the assumption of a known variance can be relaxed without any inconvenience". Of course, this should not be surprising since it mirrors what normally happens in practice: it is rare for development teams to use a t-distribution to sample size a trial; rather, they assume that the variance is known and use the normal distribution.
4.4 Average Power When the Response Is Binary
In this section, we consider the same two-armed randomised trial comparing a control with an active treatment, the only difference being that at the end of the study, a patient is designated as either having responded, or not, by whatever definition is appropriate for the disease under consideration. The configuration of the data is shown in Table 4.1, in which rj of the nj (j = C, T) patients respond.

TABLE 4.1 Data from a Trial Comparing an Active Treatment (T) with a Control (C)

    Group        Response    Non-Response    Total
    Control      rC          nC − rC         nC
    Treatment    rT          nT − rT         nT
    Total        rC + rT     N − rC − rT     N = nC + nT

The basic building blocks of any inference about the difference between the effects of the two treatments are the observed rates, or proportions, of responding patients, pj = rj/nj, which are estimates of the population response rates πj. In designing this type of trial, the first decision is the primary measure of treatment effect: the absolute difference δ = πT − πC, the relative rate ϕ = πT/πC, or the odds ratio θ = πT(1 − πC)/((1 − πT)πC). The second decision is the statistic that will be used to test the null hypothesis that the treatment and control effects are the same. There are any number of options, which include:

(i) One-sided test:
    Z = (pT − pC) / √( p̄(1 − p̄)(1/nC + 1/nT) ), where p̄ = (rC + rT)/N;

(ii) Yates' corrected one-sided test:
    ZYates = ( pT − pC − N/(2 nC nT) ) / √( p̄(1 − p̄)(1/nC + 1/nT) );

(iii) Chi-square test:
    χ² = N ( rT(nC − rC) − rC(nT − rT) )² / ( nC nT (rC + rT)(N − rC − rT) ) = Z²;

(iv) Yates' corrected chi-square test:
    χ²Yates = Z²Yates;

(v) Likelihood ratio (LR) test:
    χ²LR = 2 Σ (−1)^sign(x) x ln(x),
where the x are the numerical elements of Table 4.1 and sign(x) takes the value 1 for the row and column sums and 0 otherwise (see Woolf, 1957);

(vi) Fisher's Exact Test (FET):
    PF = Σ_{k ∈ A} Pk,
where A is made up of the set of all 2 × 2 tables with the same marginals and with probabilities of the data configuration less than or equal to the probability of the observed data. The probability of the observed data is given by the hypergeometric distribution:

    PObserved = C(nC, rC) C(nT, rT) / C(N, rC + rT).
There are other possibilities; for example, Lancaster (1961) proposed the mid-p-value, defined as half the conditional probability of the observed statistic plus the conditional probability of more extreme values, given the marginal totals. Brown et al. (1987) tackle the problem and use a numerical approach (see Section 2.2.3) to determine AP. In particular, they use a 10-point Gaussian quadrature based on Legendre polynomials (Abramowitz and Stegun, 1965) to perform the necessary integrals, ensuring that the zeros of the Legendre polynomials covered the important part of the distribution by integrating over four separate regions defined by the mean ± 3 standard deviations. We use the predictive approach introduced in Section 2.2.2. For binomial data, the standard conjugate prior is a beta-distribution, so that the posterior distribution is a beta-distribution too. To illustrate, consider the control arm and let the prior for the response rate be
    p(πC) = πC^{αC−1} (1 − πC)^{βC−1} / B(αC, βC),
whose expected value is αC/(αC + βC). If prior data are available so that the expected value of πC is say θ based on m patients’ worth of data, then we can set αC = mθ and βC = m(1 − θ). The treatment arm can be treated in a similar way with parameters αT and βT. The predictive distribution of future data in the control arm is given by
    p(rC | nC) = ∫_0^1 C(nC, rC) πC^{rC} (1 − πC)^{nC−rC} [ πC^{αC−1} (1 − πC)^{βC−1} / B(αC, βC) ] dπC
               = C(nC, rC) B(αC + rC, βC + nC − rC) / B(αC, βC),
a beta-binomial distribution; see, for example, Spiegelhalter et al. (1986). A similar result applies to the treatment arm, so that the joint predictive distribution of the data for the study is the product of these distributions:

    p(rC, rT | nC, nT) = [ C(nC, rC) B(αC + rC, βC + nC − rC) / B(αC, βC) ] × [ C(nT, rT) B(αT + rT, βT + nT − rT) / B(αT, βT) ].    (4.10)
The AP can then be determined from the following simple algorithm:

a) For all pairs of potential data rC = 0, 1, …, nC; rT = 0, 1, …, nT, calculate the predictive probability p(rC, rT | nC, nT) from (4.10).
b) For each pair rC and rT, calculate the appropriate statistics from (i) to (vi), and determine whether each individual statistic meets the criterion for statistical significance.
c) For each statistic considered in (b), sum the predictive probabilities from (a) over the pairs rC and rT which would be significant; these sums are the associated APs.

Example 4.3
In this example, we suppose that we are designing a two-arm clinical trial to compare a Verum to placebo and that the primary endpoint is a response rate, for example, tumour response. Further, we suppose that our prior belief in the response rate of patients treated with placebo is modelled by a beta distribution with parameters αC = 2, βC = 8 and that the corresponding parameters for Verum are αT = 5, βT = 5. Each prior corresponds to 10 patients' worth of data with expected response rates of πC = 0.2 and πT = 0.5, respectively. For a trial to achieve 90% power with a two-sided type-I error rate of 5% to detect the expected difference of 0.3 with the given control rate, a standard calculation shows that a sample size of 58 per arm is required if we use FET as the primary analysis. Based on this sample size, the power for the asymptotic χ² test of the null hypothesis would be 93.4%. With these assumptions, the panels of Figure 4.2 display smoothed contours of equal predictive probability using (4.10), the contours corresponding to 5% (10%) 95% of the maximum probability. The yellow areas within each square show pairs (rC, rT) which give a "statistically significant" result based on the asymptotic χ² test in panel (a) and based on the FET in panel (b).
In both cases, there are substantial portions of the predictive distribution which do not give rise to a significant outcome, and consequently, we would expect that the APs would be substantially reduced compared to the standard powers. The APs for χ2 and FET are 0.761 and 0.734, respectively.
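The algorithm in steps (a)-(c) is straightforward to implement. The sketch below uses the asymptotic χ² test as the significance criterion and, as a simplification, counts significance in either direction. The second call, with the same prior means but one hundred times the prior data, illustrates how the AP approaches the conditional power as the priors concentrate.

```python
from math import exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_choose(n, r):
    return lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)

def beta_binom(r, n, a, b):
    # One factor of the joint predictive distribution (4.10)
    return exp(log_choose(n, r) + log_beta(a + r, b + n - r) - log_beta(a, b))

def chi2_stat(rC, rT, nC, nT):
    N = nC + nT
    den = nC * nT * (rC + rT) * (N - rC - rT)
    if den == 0:
        return 0.0
    return N * (rT * (nC - rC) - rC * (nT - rT)) ** 2 / den

def average_power(nC, nT, aC, bC, aT, bT, crit=3.841459):
    # Steps (a)-(c): sum the predictive probabilities over all (rC, rT)
    # pairs yielding a significant chi-squared statistic
    ap = 0.0
    for rC in range(nC + 1):
        pC = beta_binom(rC, nC, aC, bC)
        for rT in range(nT + 1):
            if chi2_stat(rC, rT, nC, nT) > crit:
                ap += pC * beta_binom(rT, nT, aT, bT)
    return ap

ap_diffuse = average_power(58, 58, 2, 8, 5, 5)               # priors of Example 4.3
ap_concentrated = average_power(58, 58, 200, 800, 500, 500)  # same means, 100x the data
print(ap_diffuse, ap_concentrated)
```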
FIGURE 4.2 Contours of equal predictive probability for the pairs of responses (rC, rT) and those which give significant results: (a) Asymptotic χ2 test and (b) FET.
4.5 Illustrating the Average Power Bound for a Binary Endpoint
We suppose in this section that we are designing a clinical trial to compare an active treatment (T) to a control (C). We also suppose, as we did in Section 4.4, that our prior beliefs in the response rates of patients treated with control (πC) and active treatment (πT) can both be modelled by beta priors with parameters {αC, βC} and {αT, βT}, respectively. If all the prior parameters are integers, then our prior belief that the active treatment is better than control, that is P(πT > πC), can be written in the form

    P(πT > πC) = [ 1/( B(αT, βT) B(αC, βC) ) ] ∫_0^1 πT^{αT−1} (1 − πT)^{βT−1} ( ∫_0^{πT} πC^{αC−1} (1 − πC)^{βC−1} dπC ) dπT

    = Σ_{j=0}^{αT−1} B(αC + j, βC + βT) / [ (βT + j) B(1 + j, βT) B(αC, βC) ]    (4.11)

    = 1 − Σ_{j=0}^{αC−1} B(αT + j, βT + βC) / [ (βC + j) B(1 + j, βC) B(αT, βT) ].    (4.12)
The choice between using (4.11) or (4.12) depends upon the magnitudes of αC and αT relative to αC + βC and αT + βT, respectively. If αC and αT are relatively small, then it is preferable to use (4.12), either directly or by interchanging the roles of T and C. In contrast, if they are relatively large, (4.11) is preferable. The expressions for the probabilities in (4.11) and (4.12) are the tails of a beta-binomial or hypergeometric waiting-time distribution; Altham (1969) shows how they can be written in terms of the hypergeometric distribution, and Grieve (2016) provides some history of these distributions for uniform priors for the response probabilities of each arm in the study.

Example 4.4
In Table 4.2, we show P(πT > πC) for a small selection of parameter values with the restriction that αC + βC = αT + βT = 10, to reflect situations in which, whilst prior information is available, it is not overwhelming. The highlighted prior probabilities show that under this set of assumptions it would be necessary for the prior expected response rates αC/(αC + βC) and αT/(αT + βT) to differ by at least 20% in order that the bound on the AP exceeds 80%.
TABLE 4.2 Prior Probability P(πT > πC) of Treatment Superiority as a Function of αC and αT given αC + βC = αT + βT = 10

          αT=1    2       3       4       5       6       7       8       9
    αC=1  0.500   0.765   0.897   0.959   0.985   0.995   0.999   1.000   1.000
      2           0.500   0.712   0.853   0.934   0.975   0.992   0.998   1.000
      3                   0.500   0.690   0.833   0.923   0.972   0.992   0.999
      4                           0.500   0.681   0.827   0.923   0.975   0.995
      5                                   0.500   0.681   0.833   0.934   0.985
      6                                           0.500   0.690   0.853   0.959
      7                                                   0.500   0.712   0.897
      8                                                           0.500   0.765
      9                                                                   0.500
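The closed-form sum (4.11) is simple to implement, and a Monte Carlo check using draws from the two beta priors provides reassurance. The calls below reproduce two entries of Table 4.2.

```python
import random
from math import exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def prob_T_beats_C(aT, bT, aC, bC):
    # Closed-form sum (4.11); requires integer aT
    total = 0.0
    for j in range(aT):
        total += exp(log_beta(aC + j, bC + bT)
                     - log_beta(1 + j, bT) - log_beta(aC, bC)) / (bT + j)
    return total

def prob_T_beats_C_mc(aT, bT, aC, bC, nsim=200_000, seed=4):
    # Direct simulation from the two beta priors
    rng = random.Random(seed)
    return sum(rng.betavariate(aT, bT) > rng.betavariate(aC, bC)
               for _ in range(nsim)) / nsim

print(prob_T_beats_C(5, 5, 2, 8))  # Table 4.2 entry: 0.934
print(prob_T_beats_C(3, 7, 2, 8))  # Table 4.2 entry: 0.712
```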
4.6 Average Power in a Survival Context
Whilst clinical trials with a survival endpoint will generally be analysed by either a parametric survival model, for example, a Weibull distribution (Weibull, 1939), or a non-parametric approach such as Cox's proportional hazards model (Cox, 1972), planning them often involves simplifying assumptions, as exemplified by Spiegelhalter et al. (1994) and Spiegelhalter et al. (2004). In general, these simplifying assumptions are based on asymptotic normality. In this section, we look at both an asymptotic approach based on the log of the hazard ratio and a simple parametric model based on exponential survival distributions.

4.6.1 An Asymptotic Approach to Determining the AP
Suppose that π1 and π2 are the chances of surviving up to a fixed time point for two treatments which are to be compared in a study. If we are prepared to assume that the hazard ratio (HR) is constant over time, which is the proportional hazards assumption, then

    δ = log(HR) = log( log(π1)/log(π2) ).    (4.13)
Tsiatis (1981) has shown that for trials which are sufficiently large, if Lm is the log-rank test statistic based on m total events, then ym = 4Lm/m is an approximate estimate of log(HR) and ym ~ N(log(HR), 4/m). Spiegelhalter et al. (2004) show that a consequence of this asymptotic structure is that the number of events necessary to give a power of 1 − β at a one-sided significance level of α is

    m = 4 (Z_{1−α} + Z_{1−β})² / log(HR)².    (4.14)
Example 4.5
Liu et al. (2016) report on an RCT for patients with hepatocellular carcinoma (HCC). The optimal treatment for HCC is liver transplant, but the demand for livers to transplant is far greater than the supply, and therefore other treatments are required. The current trial proposed to compare partial hepatectomy with transcatheter arterial chemoembolisation (TACE) followed by radiofrequency ablation (RFA). Planning for the study assumed that the 4-year survival rates would be 49% for TACE+RFA and 68% for hepatectomy, which from (4.13) is equivalent to a log(HR) of 0.615. From (4.14), the number of events required to give 80% power at a one-sided type I error of 2.5% is

    m = 4 × (1.960 + 0.842)² / 0.615² ≈ 83,

and therefore, based on an average 4-year survival rate of 60%, it would be necessary to recruit approximately 71 patients per arm to give the necessary number of events. There are minor discrepancies between these numbers and those reported by Liu et al. (2016), which were 88 events and 75 patients per arm. For our purposes, these differences are not relevant.
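The event calculation can be sketched as follows; the critical values 1.960 and 0.842 correspond to one-sided α = 0.025 and 80% power.

```python
from math import log

def log_hazard_ratio(surv1, surv2):
    # Equation (4.13): log(HR) = log( log(pi1) / log(pi2) )
    return log(log(surv1) / log(surv2))

def events_required(log_hr, z_alpha=1.959964, z_beta=0.841621):
    # Equation (4.14): m = 4 * (Z_{1-alpha} + Z_{1-beta})^2 / log(HR)^2
    return 4.0 * (z_alpha + z_beta) ** 2 / log_hr ** 2

lhr = log_hazard_ratio(0.49, 0.68)  # 4-year survival: TACE+RFA vs hepatectomy
m = events_required(lhr)
print(round(lhr, 3), round(m))  # → 0.615 83
```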
To determine the AP for this model, we first rewrite (4.14) to mirror (2.1):

    1 − β = Φ( √(m/4) log(HR) − Z_{1−α} ).    (4.15)

Spiegelhalter et al. (2004) give an example of eliciting a prior distribution for a trial of surgery for gastric cancer, based on work reported by Fayers et al. (2000), which culminates in a prior for log(HR) of the form

    log(HR) ~ N( δ0, 4/m0 ).    (4.16)
Combining (4.15) and (4.16), following the procedure in Section 2.2.1, gives the AP in the form

    AP = 1 − Φ( √( m0/(m + m0) ) ( Z_{1−α} − √(m/4) δ0 ) ).
Example 4.5 (Continued)
Suppose that the log(HR) value of −0.615 represents our optimistic view of the likely reduction in the efficacy of using TACE+RFA compared to hepatectomy. Suppose that we represent uncertainty in this value by a range of prior event numbers m0 = 10 (2) 20. In Table 4.3, we show the resulting AP values, which indicate that the POS based on AP, with the given assumptions, is reduced by 15-20% compared to the nominal power value.

TABLE 4.3 Average Powers for a Range of Prior Numbers of Events for the HCC Trial of Example 4.5

    Prior Number of Events (m0)   10      12      14      16      18      20
    Average Power                 0.609   0.618   0.625   0.632   0.639   0.645
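The AP calculation for the asymptotic survival model can be sketched as below, with m = 83 events, log(HR) = 0.615 and the range of prior event numbers of Table 4.3.

```python
from math import erf, sqrt

def phi(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def survival_ap(m, m0, log_hr, z_alpha=1.959964):
    # AP for the asymptotic log(HR) model:
    # 1 - Phi( sqrt(m0/(m + m0)) * (Z_{1-alpha} - sqrt(m/4) * log_hr) )
    return 1.0 - phi(sqrt(m0 / (m + m0)) * (z_alpha - sqrt(m / 4.0) * log_hr))

for m0 in range(10, 21, 2):  # prior numbers of events, as in Table 4.3
    print(m0, round(survival_ap(83, m0, 0.615), 3))
```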
4.6.2 The Average Power for the Comparison of One-Parameter Exponential Distributions
Determining the AP for parametric survival models is more complex than the cases we have covered so far. One simple model for which some analytic progress can be made is an exponential survival model. Suppose we have samples of survival times xi : i = 1, …, n and yj : j = 1, …, n from two exponential distributions with means μx and μy, respectively. Then standard properties of exponential distributions mean that the ratio of sample means w = x̄/ȳ is distributed as (μx/μy) F_{2n,2n}, where

    F_{ν,ν0}(w) = [ Γ((ν + ν0)/2) / ( Γ(ν/2) Γ(ν0/2) ) ] (ν/ν0)^{ν/2} w^{ν/2 − 1} ( 1 + ν w/ν0 )^{−(ν + ν0)/2}    (4.17)

is the F-distribution with ν and ν0 degrees of freedom. Under a proportional hazards model, the hazard ratio is μx/μy = e^δ. Clearly, under the null hypothesis H0 : μx = μy, w = x̄/ȳ is distributed as F_{2n,2n}(w). This latter result allows us to calculate the critical value for the test of the ratio of means; suppose this is F_{1−α} for a one-sided test. Then the power of the test is

    1 − β = 1 − ∫_0^{ (μy/μx) F_{1−α} } F_{2n,2n}(w) dw.    (4.18)
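The power (4.18) can be explored by simulation when an F-quantile function is not to hand: estimate the null critical value for w = x̄/ȳ empirically, then count rejections under the alternative. The values n = 30 and a ratio of means of 2 are illustrative assumptions.

```python
import random

def simulate_ratio(n, mu_x, mu_y, rng):
    # Ratio of sample means of two exponential samples of size n
    x_bar = sum(rng.expovariate(1.0 / mu_x) for _ in range(n)) / n
    y_bar = sum(rng.expovariate(1.0 / mu_y) for _ in range(n)) / n
    return x_bar / y_bar

def exponential_test_power(n, ratio_of_means, alpha=0.05, nsim=20_000, seed=5):
    # Power of the one-sided test based on w = x_bar/y_bar, cf. (4.18).
    # The null critical value F_{1-alpha} of F_{2n,2n} is estimated by
    # simulation; in practice an F-quantile function would be used.
    rng = random.Random(seed)
    null = sorted(simulate_ratio(n, 1.0, 1.0, rng) for _ in range(nsim))
    crit = null[int((1.0 - alpha) * nsim)]
    hits = sum(simulate_ratio(n, ratio_of_means, 1.0, rng) > crit
               for _ in range(nsim))
    return hits / nsim

power = exponential_test_power(n=30, ratio_of_means=2.0)  # illustrative values
print(power)
```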
A conjugate prior for the mean of the exponential distribution is an inverse-gamma distribution, and it is not difficult to see that if we were to use an inverse-gamma prior for both μx and μy, then the marginal prior for μx/μy is again an F-distribution. The expectation of (4.18) with respect to this second F-distribution can be carried out numerically using a mid-point rule, as described in Section 2.2.3, or by using Gaussian quadrature, this time based on Jacobi polynomials (Abramowitz and Stegun, 1965). Alternatively, the simulation approach outlined in Section 2.2.4 can be used.

4.6.3 A Generalised Approach to Simulation of Assurance for Survival Models
An alternative simulation approach has been proposed by Ren and Oakley (2014). There are two major differences between their approach and the approach outlined in the previous section. First, they use an asymptotic normal test statistic to test the null hypothesis of equal median survival in the two treatment arms of the study. The specific asymptotic test they use was first proposed by Schoenfeld and Richter (1982) and is based on the maximum likelihood estimate of the log(hazard ratio). Second, they elicit a prior for (i) the survival probability of one treatment arm at a fixed time point and (ii) the difference between the survival probabilities in the two treatment arms at that same time point. They assume that these two quantities are independent of each other. The elicitation process is facilitated by using the SHELF methodology (Oakley and O'Hagan, 2010), which is finding increasing use in pharmaceutical development (Dallow et al., 2018). From these elicited priors, Ren and Oakley simulate values for the two parameters, from which values for the hazard rates λx = 1/μx and λy = 1/μy are generated. This leads directly to a simulated value for the power, and the average of these power values estimates the AP (or assurance).
In further work, they extend this approach to a Weibull model and subsequently to the proportional hazards and non-parametric survivor function model that we considered in Section 4.6.1. They generalise our approach to cover limited recruitment and follow-up periods.
Note 1 Historical Note: The lambda prime distribution was originally introduced by Fisher (1990) as the fiducial distribution of the coefficient of variation.
5 Bayesian Power
5.1 Introduction

In this chapter, we consider what Muirhead and Şoaita (2013) have termed a "proper Bayesian" approach that does not mix frequentist and Bayesian ideas. For a Bayesian, given the likelihood which we specified in Chapter 1 and the prior specified in Section 2.1, the appropriate analysis combines them using Bayes' theorem to provide the posterior distribution

$$p(\delta\,|\,\hat{\delta}) \sim N\!\left(\frac{n_1\hat{\delta}+n_0\delta_0}{n_1+n_0},\;\frac{2\sigma^2}{n_1+n_0}\right).\qquad(5.1)$$

It is standard practice in drug development to pre-define a decision criterion indicating success. In this chapter, our Bayesian criterion defining a successful trial is

$$P(\delta>0\,|\,\hat{\delta})=\Phi\!\left(\frac{n_1\hat{\delta}+n_0\delta_0}{\sigma\sqrt{2(n_1+n_0)}}\right)\ge 1-\alpha\qquad(5.2)$$
for some small α > 0. In other words, the posterior probability that the treatment difference is greater than zero must be higher than a pre-specified “large” value.
5.2 Bayesian Power

In analogy to frequentist power, Bayesian power can be expressed as the predictive probability of achieving (5.2). In terms of the yet-to-be-observed data δ̂, success occurs if

$$\hat{\delta} > \frac{Z_{1-\alpha}\,\sigma\sqrt{2(n_1+n_0)}}{n_1}-\frac{n_0\delta_0}{n_1}.$$

There are then two approaches which we can take. First, we can assume that we know the "true" value of the treatment effect and denote it by δ, in which case the distribution of the future data is $\hat{\delta}\sim N(\delta,\,2\sigma^2/n_1)$ and therefore success will occur with conditional Bayesian power (CBP)

$$\mathrm{CBP}(\delta)=\Phi\!\left(-Z_{1-\alpha}\sqrt{\frac{n_1+n_0}{n_1}}+\frac{n_0\delta_0+n_1\delta}{\sigma\sqrt{2n_1}}\right).\qquad(5.3)$$
In practice, the “true” treatment effect δ is unknown. Our current expectation of the treatment effect is δ0 and at this value CBP is
$$\mathrm{CBP}(\delta_0)=\Phi\!\left(-Z_{1-\alpha}\sqrt{\frac{n_1+n_0}{n_1}}+\frac{(n_0+n_1)\delta_0}{\sigma\sqrt{2n_1}}\right)=\Phi\!\left(\frac{-Z_{1-\alpha}+Z_0/\sqrt{f_0}}{\sqrt{1-f_0}}\right)\qquad(5.4)$$

in which f0 was defined in (2.8) and Z0 in (3.8). The Bayesian version of assurance makes use of the unconditional predictive distribution (2.10) to calculate the predictive probability, or BP, as

$$\mathrm{BP}=\Phi\!\left(-Z_{1-\alpha}\sqrt{\frac{n_0}{n_1}}+\frac{\delta_0}{\sigma}\sqrt{\frac{n_0(n_0+n_1)}{2n_1}}\right)=\Phi\!\left(\frac{-Z_{1-\alpha}\sqrt{f_0}+Z_0}{\sqrt{1-f_0}}\right)\qquad(5.5)$$
It is clear from (5.4) and (5.5) that

$$\mathrm{BP}=\Phi\!\left(\sqrt{f_0}\,\Phi^{-1}\!\big(\mathrm{CBP}(\delta_0)\big)\right)\qquad(5.6)$$
and this relationship mirrors the relationship between traditional CP and AP shown in (2.7).
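Relationship (5.6) is easy to check numerically. The sketch below uses the planning values of Example 2.1 (δ0 = 4, σ = 8, n0 = 2, n1 = 64) together with the one-sided α = 0.025 used in the chapter's examples; the code itself is only an illustrative check, not part of the book's material.

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()
alpha, n0, n1, delta0, sigma = 0.025, 2, 64, 4.0, 8.0   # Example 2.1-type values
z_a = nd.inv_cdf(1 - alpha)
z0 = delta0 * sqrt(n0 / 2) / sigma                      # Z_0
f0 = n0 / (n0 + n1)                                     # f_0 as in (2.8)

# CBP at the prior mean, equation (5.4)
cbp0 = nd.cdf((-z_a + z0 / sqrt(f0)) / sqrt(1 - f0))
# BP directly, equation (5.5)
bp = nd.cdf((-z_a * sqrt(f0) + z0) / sqrt(1 - f0))
# relationship (5.6): BP = Phi( sqrt(f0) * Phi^{-1}(CBP(delta_0)) )
bp_via_56 = nd.cdf(sqrt(f0) * nd.inv_cdf(cbp0))

print(round(bp, 4), round(bp_via_56, 4))  # the two values agree
```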
5.3 Sample Size for a Given Bayesian Power

If we require the sample size per arm, then following the approach taken in Section 3.3, we can use (5.5) to determine the value of n1 which is necessary to achieve a specified BP. If we rewrite (5.5) in the following way,

$$\sqrt{\frac{n_1}{n_0}}\,Z_{1-\mathrm{BP}}=-Z_{1-\alpha}+\sqrt{1+\frac{n_1}{n_0}}\,Z_0,$$

in which $Z_{1-\mathrm{BP}}=\Phi^{-1}(\mathrm{BP})$, the analogue of $Z_{1-\beta}$, the required sample size is the solution of the equation

$$n_1=\frac{n_0\left(Z_{1-\mathrm{BP}}\,Z_{1-\alpha}+Z_0\sqrt{Z_{1-\alpha}^2+Z_{1-\mathrm{BP}}^2-Z_0^2}\right)^2}{\left(Z_{1-\mathrm{BP}}^2-Z_0^2\right)^2},\qquad Z_{1-\mathrm{BP}}\neq Z_0.\qquad(5.7)$$
Example 5.1

Continuing with the assumptions of Example 2.1, suppose that we are now interested in determining the sample size to give a BP of 50–55% of detecting a "clinically meaningful" treatment difference of 4 units. Again assuming Z0 = 0.5, we use (5.7) to calculate the sample sizes shown in Table 5.1; these can be compared to the corresponding sample sizes required to achieve a given AP. These sample sizes range from 29 to 53. A couple of points are of interest. First, for a 50% nominal power, the sample sizes based on (3.7) and (5.7) differ by exactly the prior sample size, n0, and this can be seen by comparing (3.7) for $Z_{1-\mathrm{AP}}=0$ and (5.7) for $Z_{1-\mathrm{BP}}=0$. Second, the difference in sample sizes based on (3.7) and (5.7) increases as the targeted power increases, with this difference being a function of α, δ0 and σ. The increased difference, relative to n0, remains constant.
TABLE 5.1
Sample Sizes, Per Arm, to Give a Nominal Power Based on Bayesian Power (5.7) under the Assumptions of Examples 2.1 and 3.1

Bayesian Power    Sample Size (5.7)
0.50              29
0.51              32
0.52              36
0.53              41
0.54              46
0.55              53
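Equation (5.7) can be sketched as a short helper which reproduces Table 5.1. The one-sided α = 0.025 is an assumption consistent with the chapter's other examples; the function name and rounding-up convention are ours, not the book's.

```python
from math import ceil, sqrt
from statistics import NormalDist

def bp_sample_size(bp, alpha, n0, delta0, sigma):
    """Per-arm sample size for a target Bayesian power via equation (5.7).

    One-sided level alpha, prior mean delta0, known sigma and prior
    effective sample size n0; the result is rounded up to an integer."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha)                 # Z_{1-alpha}
    z_b = nd.inv_cdf(bp)                        # Z_{1-BP} = Phi^{-1}(BP)
    z0 = delta0 * sqrt(n0 / 2) / sigma          # Z_0
    num = n0 * (z_b * z_a + z0 * sqrt(z_a**2 + z_b**2 - z0**2))**2
    return ceil(num / (z_b**2 - z0**2)**2)

# Reproduce Table 5.1 (Example 5.1: alpha = 0.025, n0 = 2, delta0 = 4, sigma = 8)
sizes = [bp_sample_size(bp, 0.025, 2, 4.0, 8.0)
         for bp in (0.50, 0.51, 0.52, 0.53, 0.54, 0.55)]
print(sizes)  # [29, 32, 36, 41, 46, 53]
```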
5.4 Bound on Bayesian Power

As the sample size increases there is an upper bound to the attainable BP, in the same way that there is an upper bound to AP that was derived in Chapter 2. To see this, as n1 → ∞ in (5.5), f0 → 0 and consequently BP → Φ(Z0). This is the prior probability of a positive treatment effect and is identical to the upper bound that was shown for AP in (2.9). The normalised Bayesian power (NBP) proposed by Muirhead and Şoaita (2013) is
$$\mathrm{NBP}=\frac{\Phi\!\left(-\sqrt{\dfrac{n_0}{n_1}}\,Z_{1-\alpha}+\sqrt{1+\dfrac{n_0}{n_1}}\,Z_0\right)}{\Phi(Z_0)}.\qquad(5.8)$$
In Section 2.3, we summarised the work of Eaton et al. (2013) on bounds for AP as the sample size increased in a general setting. The starting point for their work was a "proper Bayesian" context in which they showed that under conditions of π-consistency, BP converges to the prior probability of the region of parameter space in which we have an interest.

The BP bound has a second interpretation. In (5.2), the posterior probability which is the basis of the success criterion can be expressed as

$$P_B=\Phi\!\left(\frac{n_1\hat{\delta}+n_0\delta_0}{\sigma\sqrt{2(n_1+n_0)}}\right)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\left(n_1\hat{\delta}+n_0\delta_0\right)/\left(\sigma\sqrt{2(n_1+n_0)}\right)}e^{-x^2/2}\,dx$$

whose expectation with respect to (2.10) can be written as

$$E\left(P_B\right)=\frac{1}{2\pi}\int_{-\infty}^{\infty}\left\{\int_{-\infty}^{\sqrt{\frac{n_0+n_1}{2}}\frac{\delta_0}{\sigma}+\sqrt{\frac{n_1}{n_0}}\,y}e^{-x^2/2}\,dx\right\}e^{-y^2/2}\,dy.$$

Then, using the result in Appendix 1 with $a=\sqrt{\frac{n_0+n_1}{2}}\,\frac{\delta_0}{\sigma}$ and $b=\sqrt{\frac{n_1}{n_0}}\,y$, we have

$$E\left(P_B\right)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\sqrt{\frac{n_0}{2}}\frac{\delta_0}{\sigma}}e^{-y^2/2}\,dy=\Phi\!\left(\sqrt{\frac{n_0}{2}}\,\frac{\delta_0}{\sigma}\right)=\Phi(Z_0),$$

which is again the prior probability of a positive treatment effect. In other words, absent any other information, "we always expect our belief that the new drug is worse than the old to be the same as it is now" (Spiegelhalter and Freedman, 1988).
5.5 Sample Size for a Given Normalised Bayesian Power

In the same way that we can find the appropriate sample size to achieve a given NAP, in this section, we show how we can do this for a given NBP. Rewriting (5.8) as
$$\mathrm{NBP}\times\Phi(Z_0)=\Phi\!\left(-\sqrt{\frac{n_0}{n_1}}\,Z_{1-\alpha}+\sqrt{1+\frac{n_0}{n_1}}\,Z_0\right),\qquad(5.9)$$
we can set 1 − β(NBP) = NBP × Φ(Z0) and then use (5.7) to give us the appropriate sample size to achieve a given NBP.
Example 5.1 (Continued) Continuing with Example 5.1, we assume that a development team requires the necessary sample size to give an NBP of 65% to detect a “clinically meaningful” treatment difference of 4 units. With these assumptions from (5.9), we require the sample size to give a value of 1 − β(NBP) = 0.65 × 0.691 = 0.449 and from (5.7), we can calculate the necessary sample size as 18 patients per arm. In Table 5.2, we investigate the sensitivity of the choice of NBP on the resulting sample size. For NBP values in the range 0.65 to 0.90, the sample size is in the range of 18–211 patients per arm, corresponding to between 2 and 9 patients fewer per arm compared to the sample size based on Normalised Assurance (Table 3.2).
TABLE 5.2
Sample Sizes, Per Arm, to Give a Specified Normalised Bayesian Power (5.9/5.7) under the Assumptions of Examples 2.1 and 3.1

Normalised Bayesian Power    Sample Size (5.9/5.7)
0.65                          18
0.70                          25
0.75                          36
0.80                          55
0.85                          96
0.90                          211
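The two-step NBP calculation — convert NBP to a target BP via 1 − β(NBP) = NBP × Φ(Z0), then apply (5.7) — can be sketched as below. The one-sided α = 0.025 is assumed as elsewhere; with full-precision quantiles the last entry can differ from the table by one patient, depending on rounding.

```python
from math import ceil, sqrt
from statistics import NormalDist

nd = NormalDist()
alpha, n0, delta0, sigma = 0.025, 2, 4.0, 8.0   # Example 2.1-type assumptions
z_a = nd.inv_cdf(1 - alpha)
z0 = delta0 * sqrt(n0 / 2) / sigma              # Z_0 = 0.5

def n1_for_target(target_bp):
    """Per-arm sample size from (5.7) for a target (unnormalised) BP."""
    z_b = nd.inv_cdf(target_bp)
    num = n0 * (z_b * z_a + z0 * sqrt(z_a**2 + z_b**2 - z0**2))**2
    return ceil(num / (z_b**2 - z0**2)**2)

# step 1: 1 - beta(NBP) = NBP * Phi(Z0); step 2: feed that target into (5.7)
sizes = [n1_for_target(nbp * nd.cdf(z0)) for nbp in (0.65, 0.70, 0.75, 0.80, 0.85, 0.90)]
print(sizes)  # matches Table 5.2 to within rounding
```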
5.6 Bayesian Power when the Response Is Binary

For the case of binary outcomes, the likelihood function based on the data in Table 5.1 is proportional to

$$\pi_C^{r_C}\left(1-\pi_C\right)^{n_C-r_C}\pi_T^{r_T}\left(1-\pi_T\right)^{n_T-r_T}.$$

Combined with the prior from Section 4.5, the posterior distribution is the product of beta distributions

$$p\left(\pi_C,\pi_T\,|\,r_C,n_C-r_C,r_T,n_T-r_T\right)=\frac{\pi_C^{\alpha'_C-1}\left(1-\pi_C\right)^{\beta'_C-1}}{B\left(\alpha'_C,\beta'_C\right)}\times\frac{\pi_T^{\alpha'_T-1}\left(1-\pi_T\right)^{\beta'_T-1}}{B\left(\alpha'_T,\beta'_T\right)}$$

in which $\alpha'_C=\alpha_C+r_C$, $\beta'_C=\beta_C+n_C-r_C$, $\alpha'_T=\alpha_T+r_T$, $\beta'_T=\beta_T+n_T-r_T$. If the Bayesian criterion defining a positive outcome is given by

$$P\left(\pi_T>\pi_C\,|\,r_C,n_C-r_C,r_T,n_T-r_T\right)\ge 1-\alpha,$$
the posterior probability can be determined from either (4.12) or (4.13). The result of this can be used in step (b) of the algorithm in Section 4.5.

Example 5.2

We return to the assumptions of Example 4.2, in which the prior distribution for the response rate of patients treated with placebo is a beta with parameters αC = 2, βC = 8 and for the active treatment a beta with parameters αT = 5, βT = 5. These assumptions correspond to 10 patients' worth of data in each arm with expected response rates of πC = 0.2 and πT = 0.5. With these prior assumptions, the BP for an alpha of 0.025 is 0.797. What sample size is required? We know from the general result in Eaton et al. (2013) that the BP is asymptotically bounded above by the same value as the AP, which was calculated in Example 4.3 as 0.934. Given that the bound is not much larger than 0.9, it is reasonable to expect that the required sample size to achieve 90% BP will be large. In Figure 5.1, we display the relationship between BP and sample size per arm. The figure shows that the convergence to the asymptote is quite slow. Consequently, the sample size per arm required for 90% BP is 644.
FIGURE 5.1 Bayesian power as a function of sample size per arm for the assumptions of Example 5.2.
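The algorithm behind these calculations can be sketched as follows. This is a Monte Carlo stand-in under stated assumptions: the posterior probability P(πT > πC | data) is itself estimated by sampling from the beta posteriors rather than computed exactly via (4.12)/(4.13), and the numbers of trial and posterior draws are illustrative.

```python
import random

def bayesian_power_binary(n_per_arm, alpha=0.025, n_trials=2000, n_post=2000, seed=1):
    """Monte Carlo sketch of Bayesian power for a binary endpoint.
    Beta priors are those of Example 5.2 (control: 2, 8; active: 5, 5)."""
    random.seed(seed)
    a_c, b_c, a_t, b_t = 2, 8, 5, 5
    successes = 0
    for _ in range(n_trials):
        # (a) draw "true" response rates from the priors
        p_c = random.betavariate(a_c, b_c)
        p_t = random.betavariate(a_t, b_t)
        # (b) simulate trial data
        r_c = sum(random.random() < p_c for _ in range(n_per_arm))
        r_t = sum(random.random() < p_t for _ in range(n_per_arm))
        # (c) estimate posterior P(pi_T > pi_C) by sampling the beta posteriors
        post = sum(random.betavariate(a_t + r_t, b_t + n_per_arm - r_t) >
                   random.betavariate(a_c + r_c, b_c + n_per_arm - r_c)
                   for _ in range(n_post)) / n_post
        successes += post >= 1 - alpha
    return successes / n_trials

print(bayesian_power_binary(50))
```

Increasing `n_per_arm` traces out the curve of Figure 5.1, with the estimate approaching the asymptotic bound as the sample size grows.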
5.7 Posterior Conditional Success Distributions

Walley et al. (2015) question the utility of assurance in making judgements about the ability of a design to address the study's objectives. In a sense, what they are questioning is the absolute value of assurance, although acknowledging the value it has in allowing us to compare projects within a portfolio. As an alternative to assurance on its own, Walley et al. propose that the "pre-posterior distribution" for the treatment effect under the assumption that the study is a success is a useful tool for understanding an individual study's ability to discriminate between compounds which provide sufficient activity and compounds that do not. They describe their approach as providing "… what one would believe at the end of the study, if only told the study was successful; any further information on the size of the treatment effect, other than the prior, being withheld". Unfortunately, the term "pre-posterior distribution" has a specific meaning in the context of Bayesian analysis, having been coined by Raiffa and Schlaifer (1961) to describe the distribution of the as-yet-unknown posterior mean, renamed by their students the pre-posterous distribution (Fienberg,
2008). For this reason, and to avoid any confusion, we will term these distributions prior posterior conditional success (PPCS) distributions.

5.7.1 Posterior Conditional Success Distributions – Success Defined By Significance

Walley et al. (2015) define success in Bayesian terms as the posterior probability that the treatment effect is greater than a threshold value Δ. Before considering their approach, we suppose that we are using the traditional approach to determine the success of the study based on (2.2), or, in other words, success is defined as a positive outcome of a significance test. Then, it follows by the definition of conditional probability, see, for example, Box and Tiao (1973) and their use of conditional probability to account for parameter constraints, that the posterior distribution of the treatment effect given only the information that the study had achieved success can be expressed as

$$p(\delta\,|\,\text{significant})=\frac{\Pr(\text{significant}\,|\,\delta)\,p(\delta)}{\Pr(\text{significant})}.$$
This expression can also be derived by expressing the joint distribution of δ and 'significant' in two different ways and re-arranging. The first term in the numerator of the right-hand side (RHS) is the power function of the study, given by (2.1), and the denominator of the RHS is the POS, defined by (2.6). This allows us to write the PPCS distribution of the study in the form

$$p(\delta\,|\,\text{significant})=\frac{\text{Power}(\delta)\times\text{Prior}(\delta)}{\text{POS}}.\qquad(5.10)$$
In a similar way, the posterior distribution of the treatment effect we would expect, given only the information that the study had failed, in this case the prior posterior conditional failure (PPCF) distribution, is defined as

$$p(\delta\,|\,\text{non-significant})=\frac{\beta(\delta)\times\text{Prior}(\delta)}{1-\text{POS}}\qquad(5.11)$$

where now β(δ) is the type II error as a function of the treatment effect.

Example 5.3

Using the assumptions of Example 2.1, we can determine the PPCS and PPCF distributions based on (5.10) and (5.11), which are displayed in Figure 5.2.

FIGURE 5.2 Prior posterior conditional distributions of the treatment effect given either a significant outcome of the future study or a non-significant outcome.

One interesting characteristic is that the PPCS distribution is
positively skewed, while in contrast, the PPCF distribution is negatively skewed. Consequently, for the PPCS distribution, the mean > median > mode, whilst for the PPCF the converse is true, the mean < median < mode. For this example, the two means are −3.06 and 9.51 units for the PPCF and PPCS distributions, respectively. Whilst these distributions overlap, the overlap is not overly excessive, and this suggests that the study design is capable of discriminating between effective and ineffective drugs. From each of these distributions, we can calculate the prior posterior probability that the true treatment difference is greater than some pre-determined value: 0, for example, if we are interested in any evidence of a positive treatment effect, or 2 units if that is our current MCID and we want evidence that the true treatment effect is greater than the MCID. In Table 5.3, we provide the prior posterior conditional probabilities, calculated from both the PPCS and PPCF distributions, that the true treatment effect is greater than a series of cut-offs. In practical terms, these probabilities demonstrate that the design has reasonable properties and can be used to differentiate between effective and ineffective therapies.

TABLE 5.3
Prior Posterior Conditional Probabilities of the Treatment Effect Given Significance/Non-Significance of the Study

Treatment Effect    Study Significant    Study Non-Significant
0                   0.999                0.298
1                   0.994                0.201
2                   0.978                0.114
3                   0.940                0.050
4                   0.878                0.016
5.7.2 Posterior Conditional Success Distributions – Success Defined By a Bayesian Posterior Probability

The same approach can be taken if we are using a Bayesian approach based on (5.2). In this case, utilising the same conditional probability argument,

$$p(\delta\,|\,\text{success})=\frac{\Pr(\text{success}\,|\,\delta)\,p(\delta)}{\Pr(\text{success})}$$

which can be expressed as

$$p(\delta\,|\,\text{success})=\frac{\mathrm{CBP}(\delta)\,p(\delta)}{\mathrm{BP}}.\qquad(5.12)$$

Similarly, in the case of failure of the study,

$$p(\delta\,|\,\text{failure})=\frac{\left(1-\mathrm{CBP}(\delta)\right)p(\delta)}{1-\mathrm{BP}}\qquad(5.13)$$
where CBP(δ) and BP are defined in (5.3) and (5.5).

Example 5.3 (Continued)

Continuing with the assumptions of Example 2.1, the PPCS and PPCF distributions based on (5.12) and (5.13) are displayed in Figure 5.3. As before, both distributions are skewed, the PPCS distribution to the right and the PPCF distribution to the left. Using the posterior distribution, the means of the PPCF and PPCS distributions are −3.12 and 9.47 units, respectively, which are essentially no different to the previous calculations; this can be explained by the relatively small amount of information that is being input through the prior distribution. Given the closeness of the distributions in the two approaches to defining a study's success, it should not be surprising if the prior posterior conditional probabilities that the true treatment effect is greater than the series of cut-offs that we defined in Table 5.3 are very similar. The probabilities in Table 5.4 confirm this conjecture for both the PPCS and PPCF distributions. In practical terms, these probabilities demonstrate that the design has reasonable properties and can be used to differentiate between effective and ineffective therapies.
FIGURE 5.3 Prior posterior conditional distributions of the treatment effect given either a successful outcome of the future study or a non-successful outcome (posterior-based).
TABLE 5.4
Prior Posterior Conditional Probabilities of the Treatment Effect Given Success/Failure of the Study

Treatment Effect    Study Successful    Study Failed
0                   0.999               0.291
1                   0.993               0.194
2                   0.976               0.108
3                   0.936               0.047
4                   0.873               0.015
5.7.3 Use of Simulation to Generate Samples from the Posterior Conditional Success and Failure Distributions

Both the PPCS and PPCF distributions, whether they are generated from a significance test or from the posterior distribution, can simply be derived from the type of simulation which we used to illustrate Kunzmann et al.'s decomposition of AP in Section 2.5. First, we think about the PPCS and PPCF based on a significance test. Starting from the sample of pairs δ and δ̂ which were simulated, we select only those pairs for which $\hat{\delta}>Z_{1-\alpha}\,\sigma\sqrt{2/n_1}$; then, the corresponding values of δ provide a random sample from the PPCS distribution, with the remaining δ values providing a random sample from the PPCF distribution.
In Figure 2.3, the samples coloured red, yellow and green provide the samples in PPCS, while the samples coloured blue provide the samples in PPCF. The same method can be applied if we plan to use the posterior distribution to define the study's success or failure. In this case, values of δ̂ for which $\hat{\delta}>Z_{1-\alpha}\,\sigma\sqrt{2(n_1+n_0)}/n_1-n_0\delta_0/n_1$, which was shown in Section 5.2 to provide the boundary between success and failure, are used to split the δ values into the PPCS and PPCF distributions.
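The significance-based selection can be sketched directly. The planning values below (δ0 = 4, σ = 8, n0 = 2, n1 = 64, one-sided α = 0.025) are taken from the running example; the sample means are Monte Carlo estimates of the PPCF and PPCS means, not the exact moments.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(2)
nd = NormalDist()
delta0, sigma, n0, n1, alpha = 4.0, 8.0, 2, 64, 0.025
crit = nd.inv_cdf(1 - alpha) * sigma * sqrt(2 / n1)     # significance threshold for delta_hat

ppcs, ppcf = [], []
for _ in range(100_000):
    delta = random.gauss(delta0, sigma * sqrt(2 / n0))      # delta drawn from the prior
    delta_hat = random.gauss(delta, sigma * sqrt(2 / n1))   # future estimate given delta
    (ppcs if delta_hat > crit else ppcf).append(delta)

# the PPCS sample sits above the prior mean, the PPCF sample below it
print(round(mean(ppcf), 1), round(mean(ppcs), 1))
```

Replacing `crit` with the posterior-based boundary from Section 5.2 gives samples from the success/failure versions (5.12) and (5.13) instead.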
5.7.4 Use of the Posterior Conditional Success and Failure Distributions to Investigate Selection Bias

Recently, Wiklund and Burman (2021) have also investigated properties of the PPCS distribution, which they term the "efficacy density given progression". Their principal interest lies in the potential selection bias associated with drugs that enter phase 3. Such biases are due to the use of decision criteria which promote projects that show good efficacy in phase 2. In their work, they compare the PPCS with the corresponding prior distribution to understand how the selection mechanism can influence the "treatment effect distribution". Wiklund and Burman (2021) consider three classes of prior distributions: a mixture distribution with a point mass at zero with a fixed probability q and a normal density with probability 1 − q; an exponential distribution; and a log-normal distribution. Additionally, they consider a range of sample sizes for the phase 3 study. They show that selection is not perfect and, as we indicated, there may be a substantial probability that selection will falsely choose treatments with efficacy that is lower than what was anticipated or desired.
6 Prior Distributions of Power and Sample Size
6.1 Introduction

The primary difference between the conditional and unconditional power functions covered in Chapters 2 and 3 is that the former fixes the treatment effect and determines the power conditional on that effect, whilst the latter accounts for the uncertainty in the current knowledge of the treatment effect as measured by the prior and calculates the expected power with respect to the prior distribution. An alternative approach is to consider the whole distribution of prior power for which the AP, or assurance, is its expected value. This then opens up the possibility of other summaries, for example, the median prior power. In some ways, the difference between the AP and this new approach is analogous to the difference between tolerance intervals of β-expectation and β-content (see, for example, Guttman, 1970). A similar approach can be used to study the prior distribution of the sample size per arm, or the probability that the sample size lies between pre-specified limits. In both cases, it will become apparent that there is considerable uncertainty at the planning stage about both the study power, for a fixed sample size, and the study sample size, for a fixed power.

In Chapter 2, four different methods were introduced for calculating the expected power: analytical, via the predictive distribution of the data, numerical and simulation. There is a fifth method that opens the possibility of investigating other approaches for choosing the sample size. Remembering that AP is defined by the integral

$$\mathrm{AP}=\int\left(1-\beta(\delta)\right)p(\delta)\,d\delta$$

where p(δ) is defined in (2.3), we can think of the expression

$$1-\beta(\delta)=\Phi\!\left(-Z_{1-\alpha}+\delta\sqrt{n_1/2}\,/\,\sigma\right)$$

as providing a transformation of δ into power for a fixed sample size n1, or alternatively as providing a transformation of δ into the sample size for a fixed power. The mean of the prior power distribution is then the expected, or average, power. Of course, since we now have a complete distribution of power, we are not constrained to looking only at the prior mean power. We can, for example, consider the prior median power or any other quantile of the prior power distribution. Alternatively, we could define a range within which the power should lie with high prior probability.
6.2 Prior Distribution of Study Power – Known Variance

Starting with the prior distribution of the treatment effect (2.3), for fixed n1, the prior distribution of power, $\varphi$, can be derived from the transformation

$$\varphi=\Phi\!\left(-Z_{1-\alpha}+\delta\sqrt{n_1/2}\,/\,\sigma\right)\qquad(6.1)$$

with Jacobian

$$\frac{d\delta}{d\varphi}=\frac{\sigma\sqrt{2/n_1}}{\phi\!\left(\Phi^{-1}(\varphi)\right)},$$

where $\phi(\cdot)$ denotes the standard normal density, giving

$$p(\varphi)=\sqrt{\frac{n_0}{n_1}}\exp\left\{-\frac{n_0}{2n_1}\left(Z_{1-\alpha}+\Phi^{-1}(\varphi)-\sqrt{\frac{n_1}{n_0}}\,Z_0\right)^2+\frac{1}{2}\left[\Phi^{-1}(\varphi)\right]^2\right\}\qquad(6.2)$$
in which Z0 was defined in (3.8). This result was first given as equation (5) in Rufibach et al. (2016), although there is a difference: we have concentrated on positive treatment effects, whilst they are interested in negative treatment effects. Spiegelhalter et al. (2004) show how simple transformations can be utilised to change such a formula to cover other cases, for example, by reversing the definition of success and failure or by utilising more general alternative thresholds in which zero treatment effect is replaced by a non-zero treatment effect δA > 0.
To determine the CDF of $\varphi$ we can make use of the following lemma.

Lemma 6.1

If Z = g(X) is a one-to-one increasing transformation of the variable X and we wish to determine the CDF of Z in terms of the CDF of X, then

$$F_Z(z)=P(Z\le z)=P\left(g(X)\le z\right)=P\left(X\le g^{-1}(z)\right)=F_X\left(g^{-1}(z)\right).$$

In the present context, X is the treatment effect δ with prior density (2.3) and associated CDF

$$F(\delta)=\Phi\!\left(\frac{\delta-\delta_0}{\sigma\sqrt{2/n_0}}\right)$$

and Z is the study power defined by the transformation (6.1), which has the following inverse

$$\delta=\sigma\sqrt{2/n_1}\left(Z_{1-\alpha}+\Phi^{-1}(\varphi)\right).$$

With these inputs, Lemma 6.1 gives

$$F(\varphi)=\Phi\!\left(\sqrt{\frac{n_0}{n_1}}\left(Z_{1-\alpha}+\Phi^{-1}(\varphi)\right)-Z_0\right).\qquad(6.3)$$

From (6.3), the prior probability that the power is greater than any given value, the complementary cumulative distribution function (CCDF), is

$$\bar{F}(\varphi)=1-\Phi\!\left(\sqrt{\frac{n_0}{n_1}}\left(Z_{1-\alpha}+\Phi^{-1}(\varphi)\right)-Z_0\right).\qquad(6.4)$$
These distribution functions allow us to determine the prior probability of any statement concerning the power of the study. For example, we might say there is a 70% chance that the study's power is greater than 80%. Two properties of (6.3) and (6.4) are of interest.
First, the prior median power is the solution of $F(\varphi_{\mathrm{med}})=\bar{F}(\varphi_{\mathrm{med}})=0.5$, implying that $\sqrt{n_0/n_1}\left(Z_{1-\alpha}+\Phi^{-1}(\varphi_{\mathrm{med}})\right)-Z_0=0$, from which the prior median power is $\varphi_{\mathrm{med}}=\Phi\left(Z_1-Z_{1-\alpha}\right)$, where Z1 is as defined in (2.5). This relationship is identical to (2.1), so that the prior median power is the power at the prior expected treatment effect, independent of the "prior sample size", n0.

Second, as $n_1\to\infty$, $\sqrt{n_0/n_1}\left(Z_{1-\alpha}+\Phi^{-1}(\varphi)\right)\to 0$, so from (6.4), $\bar{F}(\varphi)\to 1-\Phi(-Z_0)=\Phi(Z_0)$, which is the AP upper bound previously derived in Chapter 2 (Equation 2.9).
Example 6.1

We return to the assumptions of Example 2.1 given by Muirhead and Şoaita (2013). In their Bayesian approach, they assumed that the prior distribution for the treatment difference on the IRLS is N(4, 64), where σ² = 64 and n0 = 2. The planned sample size was n1 = 64 per arm, based on 80% power to detect a difference of 4 points. Based on these assumptions, Figures 6.1(a) and (b) display the prior distribution of power and its associated CDF for n0 = 2, 100, 200, 500. Concentrating for the moment on the base case n0 = 2, the extreme shape
FIGURE 6.1 Prior power: (a) Prior density of power (n0 = 2, 100, 200, 500); (b) Prior power CDF (n0 = 2, 100, 200, 500).
with an anti-mode just below 50% prior power, reminiscent of a beta-distribution with both parameters less than 1, is due to the extreme uncertainty in the prior for the treatment effect, whose prior credible interval (CrI) is approximately −16 to 20. For values of δ below −4 and above 10, the power for a fixed treatment effect is effectively 0 and 1, respectively, hence the infinite asymptotes at 0 and 1. Looking at the cases of increased prior knowledge, for 50 < n0 < 85, the prior power distribution has a single asymptote at φ = 1, and for n0 > 85 the prior has a single, non-asymptotic mode which converges to the power at the prior expected effect as n0 → ∞. Figure 6.1(b) also illustrates the property of the CDF and CCDF discussed previously, namely that the prior median power equals the power at the prior expected treatment effect independent of n0, which is indicated by the common point where the four CDFs meet.
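A minimal sketch of the transformation view, again under the Example 6.1 assumptions (δ0 = 4, σ = 8, n0 = 2, n1 = 64 and the one-sided α = 0.025 assumed throughout): drawing δ from its prior and mapping each draw through (6.1) gives a sample from the prior power distribution, whose median can be compared with the power at the prior expected effect.

```python
import random
from math import sqrt
from statistics import NormalDist, median

random.seed(3)
nd = NormalDist()
delta0, sigma, n0, n1, alpha = 4.0, 8.0, 2, 64, 0.025
z_a = nd.inv_cdf(1 - alpha)

# map prior draws of delta through the transformation (6.1) into power values
powers = [nd.cdf(-z_a + random.gauss(delta0, sigma * sqrt(2 / n0)) * sqrt(n1 / 2) / sigma)
          for _ in range(100_000)]

# the prior median power equals the power at the prior expected effect,
# whatever the value of n0
power_at_mean = nd.cdf(-z_a + delta0 * sqrt(n1 / 2) / sigma)
print(round(median(powers), 3), round(power_at_mean, 3))
```

The sample mean of `powers` estimates the AP, while quantiles of the same sample give the prior power CDF of Figure 6.1(b).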
We have seen that the use of a normal prior for the treatment effect, particularly for small values of the effective prior sample size, n0, gives rise to problems with the derived prior for the study power. Rufibach et al. (2016) investigate conditions under which the prior power distribution is u-shaped. They prove that if n0 > n1, the prior power distribution is no longer u-shaped. This condition implies that the information about the treatment effect contained in the prior is greater than the information contained in the study itself. As they point out, this is unlikely to be the case in drug development; in such circumstances it is arguable whether a new study is even necessary, or ethical.

Part of the problem with a normal prior for the treatment effect is that it can give rise to negative values of the treatment effect, as we saw in Example 6.1. Negative values of the treatment effect correspond to power values less than α. It is arguable whether we should be interested in such values at the planning stage of a clinical trial, a point which is related to the decomposition of the AP discussed in Section 2.5, since such values give rise to type I errors and not power. One approach to overcome this problem is to condition on positive values of the treatment effect. In Example 6.1, this implies truncating the prior power distribution at α and re-normalising with respect to the prior probability of a positive treatment effect, Φ(Z0). The expected value of this truncated prior power distribution is precisely the same as TAP defined in (4.2). This approach is also related to the decomposition of AP introduced in Section 2.5; in particular, the AP can be expressed as the sum of (2.18) and (2.19), or (2.18) for δMCID = 0.

A second approach would be to develop a prior based on a distribution whose support is the positive real line. One example would be a gamma distribution

$$p(\delta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\delta^{\alpha-1}e^{-\beta\delta},\qquad \delta>0,\ \alpha>0,\ \beta>0,$$
as proposed by Lan and Wittes (2013) and mentioned in Section 4.2. Alternatively, we could define a prior based on a scaled and shifted beta-distribution

$$p(\delta)=\frac{\left(\delta-a\right)^{\alpha-1}\left(b-\delta\right)^{\beta-1}}{B\left(\alpha,\beta\right)\left(b-a\right)^{\alpha+\beta-1}},\qquad a\le\delta\le b,\ \alpha>0,\ \beta>0.$$
6.3 Prior Distribution of Study Power – Treatment Effect Fixed, Uncertain Variance

In the previous section, we used (6.1) to transform the treatment effect, δ, into power with the variance, σ², treated as known. Alternatively, several authors have assumed the treatment effect is fixed and have used (6.1) to transform the variance into power (Sims et al., 2007; Shieh, 2017). This contrasts with Browne (1995), Kieser and Wassmer (1996) and Shieh (2013), who investigate the use of an upper confidence limit for the variance from a pilot study to calculate the sample size. Their approaches ensure that there is a reasonable chance that the study achieves its desired power. Suppose that our knowledge concerning the variance σ² can be expressed in the form of a scaled inverse-χ² distribution

$$p\left(\sigma^2\right)=\frac{\left(\nu_0 s_0^2/2\right)^{\nu_0/2}}{\Gamma\left(\nu_0/2\right)}\left(\sigma^2\right)^{-\left(\nu_0/2+1\right)}\exp\left(-\frac{\nu_0 s_0^2}{2\sigma^2}\right)\qquad(6.5)$$
with scale parameter s0² and df ν0. The parameters can be determined from a pilot study with a vague prior, previous studies in the same population and indication, or by elicitation, see Section 11.4. Then (6.1) as a function of σ² has Jacobian (in absolute value)

$$\left|\frac{d\sigma^2}{d\varphi}\right|=\frac{n_1\delta_0^2}{\left(Z_{1-\alpha}+\Phi^{-1}(\varphi)\right)^3\,\phi\!\left(\Phi^{-1}(\varphi)\right)}$$

allowing us to derive the prior distribution of power in the form

$$p(\varphi)=\frac{2}{\Gamma\left(\nu_0/2\right)}\left(\frac{\nu_0 s_0^2}{n_1\delta_0^2}\right)^{\nu_0/2}\frac{\left(Z_{1-\alpha}+\Phi^{-1}(\varphi)\right)^{\nu_0-1}}{\phi\!\left(\Phi^{-1}(\varphi)\right)}\exp\left\{-\frac{\nu_0 s_0^2\left(Z_{1-\alpha}+\Phi^{-1}(\varphi)\right)^2}{n_1\delta_0^2}\right\}.\qquad(6.6)$$
In this case, X is the variance, σ², with prior CDF

$$F\left(\sigma^2\right)=Q\!\left(\frac{\nu_0}{2},\frac{\nu_0 s_0^2}{2\sigma^2}\right)$$

in which Q(ν, a) denotes the regularised upper gamma function with ν df

$$Q(\nu,a)=\frac{1}{\Gamma(\nu)}\int_a^{\infty}t^{\nu-1}e^{-t}\,dt.$$

As in Section 6.2, Z is the study power defined by the transformation (6.1), but in this instance, it is thought of as a function of σ², so that its inverse transformation is

$$\sigma^2=\frac{n_1\delta_0^2}{2\left(Z_{1-\alpha}+\Phi^{-1}(\varphi)\right)^2}.\qquad(6.7)$$

With these inputs, and noting that power is a decreasing function of σ², so that $F(\varphi)=1-F\left(\sigma^2(\varphi)\right)$, Lemma 6.1 gives

$$F(\varphi)=P\!\left(\frac{\nu_0}{2},\frac{\nu_0 s_0^2\left(Z_{1-\alpha}+\Phi^{-1}(\varphi)\right)^2}{n_1\delta_0^2}\right)\qquad(6.8)$$

and in this case, the CCDF is

$$\bar{F}(\varphi)=Q\!\left(\frac{\nu_0}{2},\frac{\nu_0 s_0^2\left(Z_{1-\alpha}+\Phi^{-1}(\varphi)\right)^2}{n_1\delta_0^2}\right)$$

in which P(·, ·) denotes the regularised lower gamma function with ν df

$$P(\nu,a)=1-Q(\nu,a)=\frac{1}{\Gamma(\nu)}\int_0^{a}t^{\nu-1}e^{-t}\,dt.$$
Example 6.2

In this case, we modify our basic example by assuming (1) that the treatment effect we are interested in is fixed at 4 units and (2) that the pre-trial uncertainty about the variance can be described by a scaled inverse-χ² distribution with parameters s0² = 64 and ν0 = 20. Despite the increased uncertainty, we assume that the planned sample size remains at n1 = 64 per arm, which we remember was based on a nominal 80% power to detect the difference of 4 points.
FIGURE 6.2 Prior power: (a) Prior density of power based on variance prior; (b) Power CDF based on variance prior.
Based on these assumptions, Figures 6.2(a) and (b) display the prior distribution of power (6.6) and its associated CDF (6.8), respectively. From these figures, we can determine that there is a prior probability of 52% that the power is less than 80%, which indicates that it is as likely as not that we can achieve the power on which the study has been sized. Also, we can determine that the highest prior density CrI for the power is 54–98%, indicating great uncertainty.
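The 52% figure can be approximated by simulation under the Example 6.2 assumptions (treatment effect fixed at 4, s0² = 64, ν0 = 20, n1 = 64, one-sided α = 0.025 assumed): a scaled inverse-χ² draw is generated as ν0s0²/χ²(ν0), and each draw of σ² is mapped through (6.1).

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(4)
nd = NormalDist()
delta0, n1, alpha = 4.0, 64, 0.025     # fixed treatment effect, planned sample size
s0_sq, nu0 = 64.0, 20                  # scaled inverse-chi^2 prior for sigma^2
z_a = nd.inv_cdf(1 - alpha)

powers = []
for _ in range(100_000):
    # sigma^2 ~ nu0 * s0^2 / chi^2_{nu0}; chi^2_{nu0} is Gamma(nu0/2, scale=2)
    sigma_sq = nu0 * s0_sq / random.gammavariate(nu0 / 2, 2.0)
    powers.append(nd.cdf(-z_a + delta0 * sqrt(n1 / 2) / sqrt(sigma_sq)))

prob_below_80 = sum(p < 0.8 for p in powers) / len(powers)
print(round(prob_below_80, 2))   # close to the 52% quoted in the text
```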
6.4 Prior Distribution of Study Sample Size – Variance Known

In this section, the case we consider has both the power and the variance fixed, and we determine the prior distribution for the sample size based on a prior for the treatment effect.
We begin with the prior distribution of the treatment effect in (2.3), δ ~ N(δ0, 2σ²/n0), from which the treatment effect, standardised by the prior standard deviation $\sigma\sqrt{2/n_0}$, is distributed as $N\!\left(\delta_0\sqrt{n_0}/(\sigma\sqrt{2}),\,1\right)$. It therefore follows that the square of the standardised treatment effect

$$Y=\frac{n_0\delta^2}{2\sigma^2}$$

has a non-central χ²-distribution with 1 df and non-centrality parameter $Z_0^2=n_0\delta_0^2/(2\sigma^2)$. The pdf of the non-central χ²-distribution with 1 df is

$$p(Y)=e^{-Z_0^2/2}\sum_{j=0}^{\infty}\frac{\left(Z_0^2/2\right)^j}{j!}\,f_{2j+1}(Y)$$

where

$$f_{\nu}(X)=\frac{X^{\nu/2-1}e^{-X/2}}{2^{\nu/2}\Gamma\left(\nu/2\right)}$$

is the central χ²-distribution with ν df. Therefore, the prior distribution of $n_1=n_0\left(Z_{1-\alpha}+Z_{1-\beta}\right)^2/Y$ is an inverse non-central χ² with 1 df with pdf

$$p\left(n_1\right)=\frac{n_0\left(Z_{1-\alpha}+Z_{1-\beta}\right)^2}{n_1^2}\,e^{-Z_0^2/2}\sum_{j=0}^{\infty}\frac{\left(Z_0^2/2\right)^j}{j!}\,f_{2j+1}\!\left(\frac{n_0\left(Z_{1-\alpha}+Z_{1-\beta}\right)^2}{n_1}\right)\qquad(6.9)$$

and CDF

$$F\left(n_1\right)=1-e^{-Z_0^2/2}\sum_{j=0}^{\infty}\frac{\left(Z_0^2/2\right)^j}{j!}\,P_{2j+1}\!\left(\frac{n_0\left(Z_{1-\alpha}+Z_{1-\beta}\right)^2}{n_1}\right)\qquad(6.10)$$

in which $P_{2j+1}(\cdot)$ denotes the CDF of the central χ²-distribution with 2j + 1 df.
Example 6.1 (Continued)

Returning to the previous assumptions: the prior distribution for the treatment difference δ is N(4, 64), where σ² = 64 and n0 = 2, and the target power is 80%. Based on these assumptions, Figures 6.3(a) and (b) show the prior distribution of the sample size and its associated CDF based on (6.9) and (6.10). These exhibit extreme uncertainty about the appropriate sample size based on our prior uncertainty about the treatment difference. To illustrate, the prior probability that the sample size is greater than 200 is approximately 0.2; the corresponding sample sizes for probabilities of 0.1 and 0.05 are 774 and 3109. Similarly, the 95% equal-tailed prior CrI
FIGURE 6.3 Prior sample size: (a) Prior density of sample size (n1) based on treatment difference prior; (b) Power CDF based on treatment difference prior.
for sample size is 2.5–12,449. The highest prior density interval for the sample size is 0.5–3109, essentially a one-sided prior CrI, indicating the extreme skewness of the prior and the great uncertainty we have regarding the appropriate sample size.
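The tail probabilities quoted for the sample size can be checked by evaluating the CDF (6.10) numerically. The sketch below uses only the standard library, building the central χ² CDF for odd df recursively from the df = 1 case; the one-sided α = 0.025 and 80% power inputs are assumptions consistent with the example:

```python
import math

# Evaluating the prior sample-size CDF (6.10) at the quoted tail points.
# Example 6.1 settings: delta0 = 4, sigma^2 = 64, n0 = 2 (assumed power
# inputs: one-sided alpha = 0.025, target power 80%).
delta0, sigma_sq, n0 = 4.0, 64.0, 2
z_sum = 1.95996 + 0.84162                    # Z_{1-alpha} + Z_{1-beta}
Z0_sq = n0 * delta0**2 / (2 * sigma_sq)      # non-centrality parameter = 0.25
A = n0 * z_sum**2                            # n1 = A / Y

def chi2_cdf(x, df):
    """CDF of a central chi-square with odd df, built up from df = 1."""
    F = math.erf(math.sqrt(x / 2))           # df = 1 case
    for k in range(1, df - 1, 2):
        # identity: F_{k+2}(x) = F_k(x) - (x/2)^{k/2} e^{-x/2} / Gamma(k/2 + 1)
        F -= (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
    return F

def sample_size_cdf(n1, terms=30):
    """Prior CDF (6.10) of the sample size n1."""
    s = sum((Z0_sq / 2) ** j / math.factorial(j) * chi2_cdf(A / n1, 2 * j + 1)
            for j in range(terms))
    return 1 - math.exp(-Z0_sq / 2) * s

for n1 in (200, 774, 3109):
    print(n1, round(1 - sample_size_cdf(n1), 2))   # tail probabilities ~0.2, 0.1, 0.05
```

The three upper-tail probabilities agree with the 0.2, 0.1 and 0.05 figures quoted in the example.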
6.5 Prior Distribution of Sample Size – Treatment Effect Fixed, Uncertain Variance

In this section, we start with the standard sample size formula (1.2) and, in this case, think of it as a simple linear transformation from σ² to n1, with δ fixed at our prior expectation δ0. Applying this transformation to the prior density of the variance (6.5), the prior density of the sample size is an inverse-χ² distribution with scale parameter $2s_0^2\left(Z_{1-\alpha}+Z_{1-\beta}\right)^2/\delta_0^2$ and df ν0:
$$p(n_1) = \frac{\left(\dfrac{2\nu_0 s_0^2\left(Z_{1-\alpha}+Z_{1-\beta}\right)^2}{\delta_0^2}\right)^{\nu_0/2}}{2^{\nu_0/2}\,\Gamma(\nu_0/2)\; n_1^{\nu_0/2+1}}\; \exp\!\left(-\frac{\nu_0 s_0^2\left(Z_{1-\alpha}+Z_{1-\beta}\right)^2}{n_1\,\delta_0^2}\right). \qquad (6.11)$$

It follows that the prior CDF of n1 is

$$F(n_1) = Q\!\left(\nu_0,\; \frac{2\nu_0 s_0^2\left(Z_{1-\alpha}+Z_{1-\beta}\right)^2}{n_1\,\delta_0^2}\right) \qquad (6.12)$$

where $Q(\nu_0, x)$ denotes the upper-tail probability $P(\chi^2_{\nu_0} > x)$ of the central χ²-distribution with ν0 df.
Example 6.2 (Continued)

Returning to the basic assumptions of Example 6.2: the treatment effect of interest is fixed at 4 units, and our pre-trial uncertainty about the variance is again described by an inverse-χ² distribution with parameters s0² = 64 and ν0 = 20. The targeted power of 80% is to detect a treatment effect of 4 points. Based on these assumptions, Figures 6.4(a) and (b) display the prior distribution (6.11) of the sample size and its associated CDF (6.12). From these figures, we can determine that there is a 90% prior probability that the sample size is less than 100, and the 95% highest prior density interval is 32–118, a factor of approximately 4. Additionally, the prior median sample size is 65.

FIGURE 6.4 Prior sample size: (a) Prior density of sample size (n1) based on the variance prior; (b) Sample size CDF based on variance prior.
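These quantities follow directly from the prior CDF (6.12); a quick numerical check, in which the power inputs (one-sided α = 0.025 with 80% power) are assumptions consistent with the earlier examples:

```python
import math

# Numerical check of the Example 6.2 quantities via the prior CDF (6.12).
# Settings: s0^2 = 64, nu0 = 20 (even df), delta0 = 4; assumed power inputs:
# one-sided alpha = 0.025, target power 80%.
s0_sq, nu0, delta0 = 64.0, 20, 4.0
z_sum = 1.95996 + 0.84162                 # Z_{1-alpha} + Z_{1-beta}
c = 2 * z_sum**2 / delta0**2              # sample size formula (1.2): n1 = c * sigma^2

def chi2_sf(x, df):
    """Upper-tail probability of a central chi-square; exact for even df."""
    m = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** k / math.factorial(k) for k in range(m))

def prior_cdf(n1):
    """Prior CDF (6.12) of the sample size n1."""
    return chi2_sf(nu0 * s0_sq * c / n1, nu0)

print(round(prior_cdf(100), 2))           # ~0.9, as reported in the text

# prior median sample size: bisection on the monotone CDF
lo, hi = 1.0, 1000.0
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if prior_cdf(mid) < 0.5 else (lo, mid)
print(round((lo + hi) / 2))               # 65, matching the prior median in the text
```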
6.6 Prior Distribution of Study Power and Sample Size – Uncertain Treatment Effect and Variance

In Section 5.4, we examined AP when a prior is available for both the treatment effect, δ, and the variance, σ². The approach involved determining the prior for the standardised parameter δ/σ. The resulting prior, given by (5.7), can be used to determine the prior distribution of power, again using (6.1) but thinking of it as providing a transformation from δ/σ to the power φ for fixed n1, or to n1 for a given φ. In the latter case, the prior distribution of the sample size n1 can be derived as follows. We rewrite (6.1) as

$$n_1 = \frac{2\left(Z_{1-\alpha}+Z_{1-\beta}\right)^2}{(\delta/\sigma)^2} = \frac{A}{(\delta/\sigma)^2}. \qquad (6.13)$$

Then, to determine the CDF of n1, Lemma 6.1 no longer applies because the transformation is not one-to-one. The CDF of n1 can be written as

$$F_{n_1}(n_1) = P(N_1 \le n_1) = P\!\left((\delta/\sigma)^2 \ge \frac{A}{n_1}\right) = 1 - F_{\delta/\sigma}\!\left(\sqrt{\frac{A}{n_1}}\right) + F_{\delta/\sigma}\!\left(-\sqrt{\frac{A}{n_1}}\right). \qquad (6.14)$$

From (6.14), the pdf of n1 is derived as

$$f_{n_1}(n_1) = \frac{d}{dn_1}\left[1 - F_{\delta/\sigma}\!\left(\sqrt{\frac{A}{n_1}}\right) + F_{\delta/\sigma}\!\left(-\sqrt{\frac{A}{n_1}}\right)\right] = \frac{\sqrt{A}}{2n_1^{3/2}}\left[f_{\delta/\sigma}\!\left(\sqrt{\frac{A}{n_1}}\right) + f_{\delta/\sigma}\!\left(-\sqrt{\frac{A}{n_1}}\right)\right] \qquad (6.15)$$

where $f_{\delta/\sigma}(\cdot)$ is the prior density of δ/σ given by (5.7), with parameters δ0/s0, ν0 and n0.

In practice, distribution (6.15) is very similar to the distribution of the sample size, for a fixed σ², derived in Section 6.4. This should not be surprising given the closeness of the known-variance and unknown-variance cases noted in Section 5.4. There is no reason to believe that this is any less likely to be true for the power. For this reason, we will not pursue either of these cases any further.
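Because (5.7) is not reproduced here, a self-contained consistency check of (6.14) and (6.15) can still be made with a stand-in prior for δ/σ; the N(0.5, 0.25²) choice below is purely illustrative and is not the prior from (5.7):

```python
import math

# Numerical consistency check of (6.14) and (6.15): the pdf of n1 should be
# the derivative of its CDF. An assumed N(0.5, 0.25^2) prior for delta/sigma
# is used purely as a stand-in for (5.7).
A = 2 * (1.95996 + 0.84162) ** 2             # A = 2(Z_{1-alpha} + Z_{1-beta})^2

def F_ds(x, mu=0.5, sd=0.25):                # CDF of the stand-in delta/sigma prior
    return 0.5 * (1 + math.erf((x - mu) / (sd * math.sqrt(2))))

def f_ds(x, mu=0.5, sd=0.25):                # pdf of the stand-in delta/sigma prior
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def F_n1(n1):                                # CDF (6.14): both tails of delta/sigma
    r = math.sqrt(A / n1)
    return 1 - F_ds(r) + F_ds(-r)

def f_n1(n1):                                # pdf (6.15)
    r = math.sqrt(A / n1)
    return math.sqrt(A) / (2 * n1**1.5) * (f_ds(r) + f_ds(-r))

n1, h = 60.0, 1e-5
deriv = (F_n1(n1 + h) - F_n1(n1 - h)) / (2 * h)
print(abs(deriv - f_n1(n1)) < 1e-6)          # True
```

The central-difference derivative of (6.14) matches (6.15), confirming that the two-branch transformation has been handled correctly.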
6.7 Loss Functions and Summaries of Prior Distributions

This analysis raises a question: which is the preferable summary of the power of a study from its prior distribution? The mean is the AP, which we know to be smaller than the planned power, while the median is the planned power based on the expected treatment effect. In Bayesian estimation theory, the posterior mean is the Bayes estimate under quadratic loss. What is of interest is whether Bayesian estimation theory can be applied in a prior context. First, we need to define a Bayes estimate. If $\tilde\phi$ is an estimate of the true value of φ with prior distribution p(φ) and loss function $\ell(\tilde\phi, \phi)$, then a Bayes estimate with respect to the loss function and prior minimises the expected loss (EL)

$$\int \ell(\tilde\phi, \phi)\, p(\phi)\, d\phi.$$

There are three loss functions that are standardly considered: quadratic loss, absolute error loss and 0–1 loss.

(i) Quadratic Loss. If we define the EL as $EL = \int (\tilde\phi - \phi)^2\, p(\phi)\, d\phi$, then the minimum EL is given by

$$\frac{d\,EL}{d\tilde\phi} = 0 \;\Rightarrow\; 2\int (\tilde\phi - \phi)\, p(\phi)\, d\phi = 0 \;\Rightarrow\; \tilde\phi = \int \phi\, p(\phi)\, d\phi,$$

which is the prior mean.

(ii) Absolute Error Loss. In this case, we define the EL as

$$EL = \int |\tilde\phi - \phi|\, p(\phi)\, d\phi = \int_{-\infty}^{\tilde\phi} (\tilde\phi - \phi)\, p(\phi)\, d\phi + \int_{\tilde\phi}^{\infty} (\phi - \tilde\phi)\, p(\phi)\, d\phi,$$

then the minimum EL is given by

$$\frac{d\,EL}{d\tilde\phi} = 0 \;\Rightarrow\; \int_{-\infty}^{\tilde\phi} p(\phi)\, d\phi - \int_{\tilde\phi}^{\infty} p(\phi)\, d\phi = 0,$$

from which, adding $\int_{-\infty}^{\tilde\phi} p(\phi)\, d\phi$ to each side, we obtain

$$2\int_{-\infty}^{\tilde\phi} p(\phi)\, d\phi = 1,$$

which defines the median.

(iii) 0–1 Loss. First, we define the 0–1 loss function as $\ell(\tilde\phi, \phi) = 1 - \delta(\tilde\phi - \phi)$, where δ(x) is the Dirac delta function. Then the EL is defined as

$$EL = \int \left[1 - \delta(\tilde\phi - \phi)\right] p(\phi)\, d\phi = 1 - \int \delta(\tilde\phi - \phi)\, p(\phi)\, d\phi = 1 - p(\tilde\phi).$$

In this case, therefore, minimising the expected loss is equivalent to maximising the prior density $p(\tilde\phi)$, which occurs when $\tilde\phi$ is the prior mode.

We noted in Section 5.7 that for positively skewed unimodal distributions the relationship mean > median > mode holds, with the reverse relationship holding for negatively skewed unimodal distributions. Bernardo and Smith (1994) argue that unless there is a compelling reason for believing that one loss function is more relevant than the others, giving a single measure to summarise the uncertainty indicated by the prior "may be extremely misleading as a summary of the information available about", in this case, the power. Additionally, not all estimators constructed in this way are invariant under (possibly non-linear) re-scaling of the data. They go on to say that their comment "acquires even greater force" if the prior "is multimodal or otherwise 'irregular'". The prior distribution for the power based on n0 = 2 in Figure 6.1 is most certainly not regular, and it may be wise to resist providing single measures of these prior distributions.
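To illustrate how far these summaries can diverge for a skewed prior distribution of power, a Monte Carlo sketch under the Example 6.1 settings (n1 = 63 is an assumed sample size giving roughly 80% power at δ0; the mean is the Bayes estimate under quadratic loss, the median under absolute error loss):

```python
import math, random

# Monte Carlo comparison of prior-power summaries under assumed Example 6.1
# settings: delta0 = 4, sigma = 8, n0 = 2, n1 = 63, one-sided alpha = 0.025.
random.seed(1)
delta0, sigma, n0, n1, z_a = 4.0, 8.0, 2, 63, 1.95996
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))

powers = sorted(
    Phi(math.sqrt(n1 / 2) * random.gauss(delta0, sigma * math.sqrt(2 / n0)) / sigma - z_a)
    for _ in range(200_000)
)
mean = sum(powers) / len(powers)          # Bayes estimate under quadratic loss: the AP
median = powers[len(powers) // 2]         # Bayes estimate under absolute error loss
print(round(mean, 3), round(median, 3))   # mean (the AP) well below the median (~0.56 vs ~0.80)
```

The AP sits far below the median, which is essentially the planned power at the expected treatment effect, illustrating why a single summary of this skewed prior can mislead.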
7 Interim Predictions
7.1 Introduction

The earliest sequential test procedure in which the number of observations is a random variable goes as far back as Dodge and Romig (1929). They developed a double sampling procedure in which the decision as to whether a second sample should be taken depends on the outcome of the observations in the first sample. Bartky (1943) generalised the approach to multiple stages, and both have the advantage that on average they require fewer observations than traditional single sampling schemes. Whilst the motivation behind the development of sequential designs and analyses was in industrial production and was largely economic, similar ideas have been developed in medical research where the motivation is ethical (Ghosh, 1991). As an illustration, consider the Simon design (Simon, 1989). A Simon design is a single-arm, phase II design used in oncology in which the primary response variable is binary (responder/non-responder). Suppose a responder is a patient with at least 50% shrinkage in the size of a tumour. If π is the true population response rate, then large values of π indicate efficacy of the treatment. Let π0 be a population response rate which is not sufficiently large to be of interest, and π1 (π1 > π0) be the desirable response rate. The null hypothesis H0 : π < π0 (ineffective treatment) is tested against the alternative hypothesis H1 : π > π1 (effective treatment). The Simon design is conducted in two stages:
1. In Stage 1, the null hypothesis is not rejected if the number of responders is less than k1 out of n1 patients evaluated, and the trial is stopped.
2. If the number of responders is at least k1 out of n1 patients, then n2 additional patients are enrolled in Stage 2.
3. In Stage 2, if the total number of responders is less than k out of n = n1 + n2, then the null hypothesis is not rejected;
4. otherwise, H0 is rejected in favour of H1.
DOI: 10.1201/9781003218531-7
The characteristic of these designs is that the sample size is not fixed, and this provides the means to control the sample size. Simon originally proposed two criteria, minimax and optimal, for selecting the pairs (n1, n) and their respective critical cut-offs (k1, k). In the minimax design, the maximum sample size under H0 is minimised, and in the optimal design, the expected sample size under H0 is minimised.

Example 7.1

Sleijfer et al. (2009) report the results of a study investigating the tolerability and anti-tumour activity of pazopanib in patients with relapsed or refractory advanced soft tissue sarcoma (STS). The study ran as a set of four independent Simon two-stage designs with patients recruited to four cohorts: adipocytic sarcomas, leiomyosarcomas, synovial sarcomas and a group of patients with the other eligible STS entities. The primary endpoint was the progression-free rate at 12 weeks (PFR12) defined by the Response Evaluation Criteria in Solid Tumours (RECIST) guidelines (Therasse et al., 2000). Patients alive without progression after 12 weeks were defined as a treatment success, whilst patients who suffered progression, who had an unknown progression status or died were defined as a treatment failure. The settings for an optimal two-stage Simon design were π0 = 0.2, π1 = 0.4, α = 0.1 and β = 0.1, which gives n1 = 17 with k1 = 4 and n = 37 with k = 11. The expected sample size under H0 is 26, and the probability of stopping at the end of Stage 1 under H0 is 0.55. What is immediately apparent is that in extreme cases it is possible that the number of responses seen in Stage 1 is already larger than the cut-off for the whole study. For example, if there are 13 responders out of 17 patients at the end of Stage 1, then in theory there is no need to run Stage 2. Sleijfer et al. (2009) report that the adipocytic sarcoma cohort was closed at the end of Stage 1 with only 3 of the first 17 patients being assessed as successes.
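The operating characteristics quoted for this design can be verified directly from the binomial distribution:

```python
from math import comb

# Checking the Example 7.1 optimal Simon design under H0 (pi0 = 0.2):
# Stage 1 stops if fewer than k1 = 4 responders are seen among the first
# n1 = 17 patients; otherwise n2 = 20 more are enrolled (n = 37 in total).
p0, n1, k1, n2 = 0.2, 17, 4, 20

def binom_cdf(k, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

pet = binom_cdf(k1 - 1, n1, p0)          # probability of early termination under H0
exp_n = n1 + (1 - pet) * n2              # expected sample size under H0
print(round(pet, 2), round(exp_n))       # 0.55 26, as stated in the text
```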
The remaining three cohorts ran to completion of Stage 2 with success rates of 18/41, 18/37 and 16/41 patients for leiomyosarcomas, synovial sarcomas and eligible sarcoma types, respectively. These response numbers indicate that recruitment to all three of these cohorts could have been halted because at some point the cut-off to reject the null hypothesis had already been achieved. This is an example of curtailed inspection in which inspection, in this case, patient recruitment, is halted short of the required sample size because the accept/reject decision has been determined regardless of the results of the remaining patients. Curtailed inspection can be used for both acceptance and rejection and on each stage of a double or multiple sampling plan.
A similar approach could be used in a two-stage trial with a binary endpoint comparing a Test Treatment and Control in 20 patients per arm. After allocating 10 patients to each treatment, suppose the success response rates
are 4/10 and 8/10 in the treatment and control arms, respectively. The minimum control rate at the end of the study is 8/20, and we can determine that the only results in the treatment arm that would be statistically significant at the nominal 5% level are 15, 16, 17, 18, 19 or 20 out of 20. None of these results is possible, and therefore it is futile to continue. If the rate in the treatment arm had been 5/10, a significant result could be achieved, but only if 10/10 patients treated with the Test Treatment in the second half of the study respond while none receiving control do, which is highly unlikely given the current results. To assess the likelihood, we need to be able to calculate the appropriate probability, and while this is simple for this example, it is more complex for continuous endpoints. This is the topic of the next section, in which we investigate three alternatives: the CP based on the original power assumption, the CP based on the currently observed treatment effect, and the predictive power based on the observed treatment effect and its uncertainty.
7.2 Conditional and Predictive Power

To illustrate the difference between these three approaches, we consider, as before, the same simple clinical trial. In each of two arms n1 patients are treated, and the posterior distribution of δ, the treatment effect, is $N(\hat\delta_1, 2\sigma^2/n_1)$, where $\hat\delta_1$ is the difference in sample means, the estimated treatment effect, and the variance σ² is known (at this point we have assumed a vague prior for δ). Then, the posterior probability that δ > 0 is given by

$$P_{n_1}^{B} = P\left(\delta > 0 \mid \hat\delta_1\right) = \int_{0}^{\infty} \sqrt{\frac{n_1}{4\pi\sigma^2}}\, \exp\!\left(-\frac{n_1}{4\sigma^2}\left(\delta - \hat\delta_1\right)^2\right) d\delta = \Phi\!\left(\sqrt{\frac{n_1}{2}}\,\frac{\hat\delta_1}{\sigma}\right).$$

This probability is connected to a one-sided p-value, $P_{n_1}^{F}$, by the relationship $P_{n_1}^{F} = 1 - P_{n_1}^{B}$. By analogy, the posterior probability that the treatment effect is positive after n1 + n2 patients in each arm is

$$P_{n_1+n_2}^{B} = \Phi\!\left(\sqrt{\frac{n_1+n_2}{2}}\,\frac{\hat\delta_{12}}{\sigma}\right)$$

where $\hat\delta_{12} = (n_1\hat\delta_1 + n_2\hat\delta_2)/(n_1+n_2)$ and $\hat\delta_2$ is the difference in means based on a further n2 patients per arm. As before, $P_{n_1+n_2}^{F} = 1 - P_{n_1+n_2}^{B}$.

If we define a study to be successful when the final one-sided p-value $P_{n_1+n_2}^{F}$ is less than α, then simple substitution implies that $\hat\delta_2$ needs to be greater than

$$\frac{\sqrt{2(n_1+n_2)}\,\sigma Z_{1-\alpha} - n_1\hat\delta_1}{n_2},$$

where $Z_{1-\alpha}$ is the appropriate critical value from a standard normal distribution. If the true treatment effect is δ, then $\hat\delta_2$ has a normal density with mean δ and variance 2σ²/n2, implying that the probability, or the CP (conditional on δ), of obtaining a $\hat\delta_2$ of sufficient magnitude to achieve success can be calculated in the same way as before:

$$CP(\delta) = P\!\left(\hat\delta_2 > \frac{\sqrt{2(n_1+n_2)}\,\sigma Z_{1-\alpha} - n_1\hat\delta_1}{n_2} \,\Big|\, \delta\right) = \Phi\!\left(\frac{n_1\hat\delta_1 + n_2\delta}{\sigma\sqrt{2n_2}} - \sqrt{\frac{n_1+n_2}{n_2}}\, Z_{1-\alpha}\right). \qquad (7.1)$$

The result (7.1) should be compared to Spiegelhalter et al.'s (1994) equation 14, which points to the fact that these types of results form the basis of classical stochastic curtailment. Of course, the assumed value δ may have little support from the observed difference in means $\hat\delta_1$, the estimated treatment effect. In such cases, it may be preferable to use the observed, as opposed to the assumed, value. For $\delta = \hat\delta_1$ the estimated CP is

$$CP\left(\hat\delta_1\right) = \Phi\!\left(\frac{(n_1+n_2)\hat\delta_1}{\sigma\sqrt{2n_2}} - \sqrt{\frac{n_1+n_2}{n_2}}\, Z_{1-\alpha}\right). \qquad (7.2)$$

Alternatively, we can use the predictive distribution of $\hat\delta_2$ given the current result $\hat\delta_1$, which has the form

$$N\!\left(\hat\delta_1,\; 2\sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right), \qquad (7.3)$$

derivable from $p(\hat\delta_2 \mid \hat\delta_1) = \int p(\hat\delta_2 \mid \delta)\, p(\delta \mid \hat\delta_1)\, d\delta$, or from Besag's (1989) candidate formula (Appendix 2), or by noting that $\hat\delta_1 \sim N(\delta, 2\sigma^2/n_1)$ and $\hat\delta_2 \sim N(\delta, 2\sigma^2/n_2)$, implying that $\hat\delta_2 - \hat\delta_1 \sim N(0,\, 2\sigma^2/n_1 + 2\sigma^2/n_2)$, from which (7.3) immediately follows (Armitage, 1988, 1989). The statistic $\hat\delta_2 - \hat\delta_1$ is termed a pivotal function, whose fundamental characteristic is that its distribution does not depend on the unknown parameter δ. Using (7.3), the predictive POS can be calculated as

$$PP = P\!\left(\hat\delta_2 > \frac{\sqrt{2(n_1+n_2)}\,\sigma Z_{1-\alpha} - n_1\hat\delta_1}{n_2} \,\Big|\, \hat\delta_1\right) = \Phi\!\left(\sqrt{\frac{n_1(n_1+n_2)}{2n_2}}\,\frac{\hat\delta_1}{\sigma} - \sqrt{\frac{n_1}{n_2}}\, Z_{1-\alpha}\right). \qquad (7.4)$$

As the sample size of the second stage of the study, n2, increases relative to n1, PP converges in the limit to $\Phi\left(\sqrt{n_1/2}\,\hat\delta_1/\sigma\right) = 1 - P_{n_1}^{F}$.
If we define f1 = n1/(n1 + n2), the information fraction, 1 − f1 = n2/(n1 + n2), $Z_{INT} = \sqrt{n_1/2}\,\hat\delta_1/\sigma$ and $Z_2 = \sqrt{n_2/2}\,\delta/\sigma$, the NCP for the second part of the study, and write $Z_\alpha = -Z_{1-\alpha}$, then equations (7.1), (7.2) and (7.4) can be written as

$$CP(\delta) = \Phi\!\left(\frac{Z_{\alpha} + Z_{INT}\sqrt{f_1} + Z_2\sqrt{1-f_1}}{\sqrt{1-f_1}}\right), \qquad (7.5)$$

$$CP\left(\hat\delta_1\right) = \Phi\!\left(\frac{Z_{\alpha}\sqrt{f_1} + Z_{INT}}{\sqrt{f_1(1-f_1)}}\right), \qquad (7.6)$$

$$PP = \Phi\!\left(\frac{Z_{\alpha}\sqrt{f_1} + Z_{INT}}{\sqrt{1-f_1}}\right). \qquad (7.7)$$

Comparison of (7.6) and (7.7) demonstrates that

$$PP = \Phi\!\left(\sqrt{f_1}\,\Phi^{-1}\!\left(CP\left(\hat\delta_1\right)\right)\right). \qquad (7.8)$$

Given this relationship, any decision rule based on PP has an equivalent based on $CP(\hat\delta_1)$. Figure 7.1 provides contours of $CP(\hat\delta_1)$ as a function of PP and the information fraction f1 and illustrates a point made by Lan et al. (2009): the relationship between them implies that if $CP(\hat\delta_1)$ is less than 0.5, PP exceeds $CP(\hat\delta_1)$, whilst if it is greater than 0.5, $CP(\hat\delta_1)$ exceeds PP; see also Spiegelhalter et al. (2004), Proschan et al. (2006) and Wassmer and Brannath (2016).

FIGURE 7.1 Relationship between predictive and conditional power as a function of the information fraction (f1).

Example 7.2

The basis of this example is a clinical trial of glycerol, glycerol + dextran and placebo in the treatment of acute stroke reported by Frei et al. (1987). Despite, or perhaps because of, the inconsistent outcomes delivered by separate trials of glycerol and of dextran, the sponsor proposed to randomise 200 patients to each of the three arms to investigate the hypothesis that "the combined use of both agents could be beneficial", in which beneficial was defined as an additional 10 points of benefit over placebo. The primary endpoint of the study was the change from baseline to 24 weeks in the Matthews neurological rating scale. The study was designed as a GSD with a single interim after 100 patients with a Pocock boundary. Due to poor recruitment, an interim analysis was conducted after complete data were available on 52 patients. There was little evidence of any difference between the three arms with respect to the primary endpoint. The authors reported that the "probability of eventually achieving experimental significance with a total of 200 patients was then easily calculated and yielded a chance of only 6%". The study was then stopped with 61 patients recruited. The data at the interim were not published, but only the final data on 61 patients. We will use these final data to illustrate the conditional and PP calculations. Although the primary endpoint was to be analysed non-parametrically, the predictive calculations assumed normality. The comparison of glycerol + dextran with placebo gave an estimated treatment effect of $\hat\delta_1 = 1.45$ units with a standard deviation of 22.0 units. When the study was stopped, the sample sizes in the two arms were m11 = 20 and m12 = 23, respectively.
We could make an approximate analysis by using the average sample size at the interim of 22 patients per arm, assuming 44 patients per arm for the second half of the study, and then use equations (7.5) to (7.7) in their original form. However, it is simple to modify the equations to accommodate unequal sample sizes by replacing 2/n1 with 1/m11 + 1/m12 and 2/n2 with 1/m21 + 1/m22 in each equation, where m21 = 46 and m22 = 43 to achieve 66 per arm at the end of the study. Also f1 is redefined as the stage-1 fraction of the total information, (1/m11 + 1/m12)⁻¹/[(1/m11 + 1/m12)⁻¹ + (1/m21 + 1/m22)⁻¹]. With these changes, the relevant inputs to (7.5)–(7.7) are Zα = −1.96; f1 = 0.325; ZINT = 0.215; δ = 10; Z2 = 2.143, and the three probabilities are

$$CP(\delta_0) = 0.463; \qquad CP\left(\hat\delta_1\right) = 0.027; \qquad PP = 0.136.$$
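A sketch reproducing these three probabilities from (7.5)–(7.7) with the unequal-sample-size substitutions described for Example 7.2:

```python
import math

# Example 7.2 interim quantities via (7.5)-(7.7) with unequal per-arm sizes:
# stage-1 sizes m11 = 20, m12 = 23; stage-2 sizes m21 = 46, m22 = 43;
# delta_hat1 = 1.45, sigma = 22, assumed effect delta = 10, Z_alpha = -1.96.
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
m11, m12, m21, m22 = 20, 23, 46, 43
d1_hat, sigma, delta, z_alpha = 1.45, 22.0, 10.0, -1.96

v1 = 1 / m11 + 1 / m12                   # replaces 2/n1
v2 = 1 / m21 + 1 / m22                   # replaces 2/n2
f1 = (1 / v1) / (1 / v1 + 1 / v2)        # information fraction, ~0.325
z_int = d1_hat / (sigma * math.sqrt(v1))         # ~0.215
z2 = delta / (sigma * math.sqrt(v2))             # NCP of the 2nd stage, ~2.143

cp_delta = Phi((z_alpha + math.sqrt(f1) * z_int + math.sqrt(1 - f1) * z2)
               / math.sqrt(1 - f1))                                          # (7.5)
cp_hat = Phi((math.sqrt(f1) * z_alpha + z_int) / math.sqrt(f1 * (1 - f1)))   # (7.6)
pp = Phi((math.sqrt(f1) * z_alpha + z_int) / math.sqrt(1 - f1))              # (7.7)
print(round(cp_delta, 3), round(cp_hat, 3), round(pp, 3))  # 0.463 0.027 0.136
```

Note that the three values also satisfy (7.8): Φ⁻¹(0.136) ≈ √0.325 · Φ⁻¹(0.027).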
These probabilities indicate that even under an optimistic scenario in which glycerol + dextran gave a 10-point improvement over placebo, continuing to the end would be like tossing a coin: the chance of ultimate success is approximately 50%. They also illustrate the previous result that, because the CP at the estimated treatment effect is less than 0.5, PP exceeds it.

If the CP at the estimated treatment effect is small, we may potentially consider taking a second sample, this time of size n2 per arm. Decisions as to whether such a second sample should be taken can be based on

$$P_{n_2}^{Pred} = P\!\left(P_{n_1+n_2}^{F} \le \alpha\right).$$

What can we say about $P_{n_2}^{Pred}$? From the Markov inequality,

$$P\!\left(1 - P_{n_1+n_2}^{F} \ge 1 - \alpha\right) \le \frac{E\!\left(1 - P_{n_1+n_2}^{F}\right)}{1 - \alpha}. \qquad (7.11)$$
Given $\hat\delta_2$,

$$1 - P_{n_1+n_2}^{F} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\sqrt{(n_1+n_2)/2}\,\left(n_1\hat\delta_1 + n_2\hat\delta_2\right)/\left((n_1+n_2)\sigma\right)} e^{-x^2/2}\, dx,$$

so that the expected value of $1 - P_{n_1+n_2}^{F}$ can be written as

$$E\!\left(1 - P_{n_1+n_2}^{F}\right) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-y^2/2} \int_{-\infty}^{\hat\delta_1\sqrt{(n_1+n_2)/2}/\sigma \,+\, y\sqrt{n_2/n_1}} e^{-x^2/2}\, dx\, dy.$$

Then, using the result in Appendix 1 with $a = \hat\delta_1\sqrt{(n_1+n_2)/2}/\sigma$ and $b = \sqrt{n_2/n_1}$, we have

$$E\!\left(1 - P_{n_1+n_2}^{F}\right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\hat\delta_1\sqrt{n_1/2}/\sigma} e^{-u^2/2}\, du = 1 - P_{n_1}^{F}.$$

Using either of the two approaches which led to (7.9) and (7.10), we find that $E\!\left[P\left(\delta > 0 \mid \hat\delta_1, \hat\delta_2\right)\right] = P_{n_1}^{B}$. Therefore, using (7.11),

$$P\!\left(P\left(\delta > 0 \mid \hat\delta_1, \hat\delta_2\right) \ge 1 - \alpha\right) \le \frac{P_{n_1}^{B}}{1 - \alpha},$$

a result pointed out by Dawid (1986). What these two results show is that while we expect the POS to remain the same following the gathering of more data, the distribution of that probability may change considerably and is bounded (see also Spiegelhalter et al., 1994). One of the implications of the latter result is that continuation to a second stage of an experiment will only make sense if we are feeling rather optimistic about the outcome after the first stage.
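Both results can be checked by simulation from the predictive distribution (7.3); the numerical settings below (n1 = n2 = 50, σ = 1, δ̂1 = 0.2, α = 0.025) are illustrative assumptions:

```python
import math, random

# Monte Carlo check of E(1 - P^F_{n1+n2}) = 1 - P^F_{n1} and the Markov bound.
# Assumed settings: n1 = n2 = 50, sigma = 1, delta_hat1 = 0.2, alpha = 0.025.
random.seed(3)
n1, n2, sigma, d1, alpha = 50, 50, 1.0, 0.2, 0.025
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))

pb_n1 = Phi(math.sqrt(n1 / 2) * d1 / sigma)       # 1 - P^F_{n1}
sd_pred = sigma * math.sqrt(2 / n1 + 2 / n2)      # predictive sd of delta_hat2, (7.3)
draws = []
for _ in range(100_000):
    d2 = random.gauss(d1, sd_pred)                # future estimate, from (7.3)
    d12 = (n1 * d1 + n2 * d2) / (n1 + n2)
    draws.append(Phi(math.sqrt((n1 + n2) / 2) * d12 / sigma))  # 1 - P^F_{n1+n2}

mean_pb = sum(draws) / len(draws)
frac_above = sum(p >= 1 - alpha for p in draws) / len(draws)
print(abs(mean_pb - pb_n1) < 0.01,                # expectation is preserved
      frac_above <= pb_n1 / (1 - alpha))          # Markov bound (7.11) holds
# True True
```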
7.4 "Proper Bayesian" Predictive Power

What is meant by "proper Bayesian" PP? The idea is simply to say that for a Bayesian whose prior distribution for the treatment effect is (2.3), the appropriate posterior distribution after the first stage of the study upon which to base predictions is (5.1). The final posterior probability is a generalisation of $P_{n_1+n_2}^{B}$ and can be written as

$$P_{n_0+n_1+n_2}^{B} = \Phi\!\left(\sqrt{\frac{n_0+n_1+n_2}{2}}\,\frac{\hat\delta_{012}}{\sigma}\right)$$

in which $\hat\delta_{012} = \left(n_0\delta_0 + n_1\hat\delta_1 + n_2\hat\delta_2\right)/(n_0+n_1+n_2)$. If the decision rule used to indicate a positive outcome of the study is $P_{n_0+n_1+n_2}^{B} \ge 1 - \alpha$, then the future estimated treatment effect needs to satisfy the relationship

$$\hat\delta_2 > \frac{\sqrt{2(n_0+n_1+n_2)}\,\sigma Z_{1-\alpha} - n_0\delta_0 - n_1\hat\delta_1}{n_2},$$

and the probability of achieving this can be determined from the predictive distribution

$$p\left(\hat\delta_2 \mid \hat\delta_1\right) \sim N\!\left(\frac{n_0\delta_0 + n_1\hat\delta_1}{n_0+n_1},\; 2\sigma^2\left(\frac{1}{n_0+n_1} + \frac{1}{n_2}\right)\right)$$

to be

$$PP' = \Phi\!\left(\frac{Z_{\alpha}\sqrt{f_2} + Z_{INT}}{\sqrt{1-f_2}}\right)$$

in which now

$$Z_{INT} = \frac{n_0\delta_0 + n_1\hat\delta_1}{\sigma\sqrt{2(n_0+n_1)}}$$

and f2 = (n0 + n1)/(n0 + n1 + n2). By analogy, the estimated CP is

$$CP\left(\hat\delta\right) = \Phi\!\left(\frac{Z_{\alpha}\sqrt{f_2} + Z_{INT}}{\sqrt{f_2(1-f_2)}}\right).$$

In this Bayesian case, the relationship between PP′ and $CP(\hat\delta)$ is

$$Z_{PP'} = \sqrt{f_2}\, Z_{CP(\hat\delta)}, \qquad (7.12)$$

which has the same structure as (2.7), (5.6) and (7.8).
8 Case Studies in Simulation
8.1 Introduction

Apart from Section 2.2.4, we have only considered analytic solutions thus far. As a result, we have largely chosen to restrict attention to the simple case of a normal distribution for the treatment effect and known variance. There are two main reasons for this. Firstly, in many cases, the Central Limit Theorem can be applied, as we argued in Chapter 1. A consequence of this is that the assumptions of normality and known variance can be applied to other data structures and models, and therefore they are in no sense restrictive. Secondly, whilst simulation can be used to determine the AP quite easily, as Stephen Senn's alter ego Guernsey McPearson has remarked, "…simulation is just mathematics by other means…" Sometimes, of course, the mathematics is either too hard or intractable, and simulation is our only option, but in many cases approximation, and large sample sizes, allow the results we have derived to be used.

Despite these arguments, there have been simulation approaches in which a sample from the prior distribution of power has been generated. We referred previously to the paper of Sims et al. (2007), who considered the treatment effect to be fixed and used the prior distribution of the variance, σ², to generate the prior distribution of power by simulation. Their approach was in the context of psychology, and a further example from behavioural research was reported by Du and Wang (2016). These authors investigated a simulation procedure for "assessing the probability or 'assurance' level of achieving a given power level or higher" and applied their approach to the comparison of two means and to testing a null hypothesis of zero correlation. It is important to appreciate that Du and Wang's use of "assurance" is different from our usage. Their use describes a single point on the prior power CDF, which we considered under multiple assumptions in Chapter 5.
As we noted in Section 6.7, a Bayes estimator is an estimator, or decision rule, that minimises the posterior expected value of a loss function. The common use of the posterior mean as an estimator is justified by using squared error as a loss function. We pointed out that a problem with this loss function, and with others, is a lack of invariance under (possibly non-linear) re-scaling of the data. The AP is itself an estimate based on the prior distribution and may not necessarily be optimal for what is effectively a quantile. Providing the whole prior distribution analytically, or a sample from it by simulation, allows us to consider other estimates upon which to base the decision about the appropriate sample size. Simulation is a powerful tool to support such investigations.

DOI: 10.1201/9781003218531-8
8.2 Case Study 1 – Proportional Odds Primary Endpoint

We saw in Chapter 1 that a normal approximation could be used for many simple cases. In this section, we illustrate that normal approximations can be used for interim analyses of more complex examples. The context is a two-arm clinical trial in which the primary endpoint, an ordered categorical response, is to be analysed using a proportional odds (PO) model. We show how a Mann–Whitney/Wilcoxon test (MWT) can be used to test the null hypothesis of no treatment effect, and we show how to determine the OC using clinical trial simulation for an adaptive design in which the decisions at the interim are either to stop for futility or to increase the sample size.

8.2.1 Background

A placebo-controlled Phase III study was planned to demonstrate the efficacy of a Verum on the clinical outcome of patients suffering from moderate and severe Traumatic Brain Injury (TBI). The primary efficacy endpoint, measured at 6 months, was the interview-based extended Glasgow Outcome Scale (EGOS-I), which is an ordered categorical scale with 8 categories. The primary analysis was to be based on the MWT, and the study was planned to have at least 90% power for an odds ratio of 2.3 with a two-sided type I error of 5%. The design of the study was based on the following assumptions:

1. The placebo response probabilities for categories 1–8 are 0.09, 0.10, 0.10, 0.15, 0.30, 0.25, 0.005, 0.005, based on a smooth estimate from placebo data in a previous study.
2. The assumption that the odds ratio is 2.3 implies that the Verum response probabilities are 0.041, 0.051, 0.058, 0.104, 0.298, 0.424, 0.011, 0.011.
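The Verum probabilities in assumption 2 follow from the placebo probabilities in assumption 1 under the proportional odds model; a short check (the direction of the shift, dividing the placebo cumulative odds by 2.3, is inferred from the reported values):

```python
# Deriving the Verum response probabilities from the placebo probabilities
# under a proportional odds model with odds ratio 2.3 (assumed direction:
# Verum cumulative odds = placebo cumulative odds / 2.3).
placebo = [0.09, 0.10, 0.10, 0.15, 0.30, 0.25, 0.005, 0.005]
odds_ratio = 2.3

cum, gamma1 = 0.0, []
for p in placebo[:-1]:
    cum += p
    gamma1.append(cum)                   # placebo cumulative probabilities

# transform each cumulative probability through the odds scale
gamma2 = [(g / (1 - g) / odds_ratio) / (1 + g / (1 - g) / odds_ratio)
          for g in gamma1]
verum = [round(b - a, 3) for a, b in zip([0.0] + gamma2, gamma2 + [1.0])]
print(verum)   # [0.041, 0.051, 0.058, 0.104, 0.298, 0.424, 0.011, 0.011]
```

The computed cell probabilities reproduce assumption 2 exactly (to three decimal places).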
The study was planned with an unblinded interim analysis after 50% of the patients had completed their final assessment, with three potential outcomes based on the recommendation of an IDMC:

i. Stop the study because of futility.
ii. Continue the study as planned.
iii. Continue the study but increase the number of patients – sample size re-estimation (SSR).

It was not planned to stop the study at the interim if a significant efficacy signal was observed. An increase in sample size could be recommended by the IDMC, guided by a conditional power of 90% to detect an odds ratio of 1.5. The maximum increase in sample size that was allowed was 50%.

8.2.2 The Wilcoxon Test for Ordered Categorical Data

Consider two multinomial populations, each with c ordered categories. Let π1 = (π11, π12, …, π1c) and π2 = (π21, π22, …, π2c) denote the two vectors of response probabilities for populations 1 (placebo) and 2 (active), respectively. Assume there are xi. patients in population i (i = 1, 2) and let xij be the number of these patients that fall into ordered category j, j = 1, 2, …, c. Finally, let x.j = x1j + x2j and x1. + x2. = N. The layout of the data is shown in Table 8.1.

The PO model expresses the difference between the two vectors of response probabilities in terms of a single location-shift parameter θ. Specifically, for j = 1, 2, …, c,

$$\log\!\left(\frac{\gamma_{1j}}{1-\gamma_{1j}}\right) = \log\!\left(\frac{\gamma_{2j}}{1-\gamma_{2j}}\right) + \theta$$

TABLE 8.1 Configuration of Data in a Parallel Group Study of Placebo Compared to Active in Which the Primary Endpoint Is Multinomial

Arm        1      2      …      c      Total
Placebo    x11    x12    …      x1c    x1.
Verum      x21    x22    …      x2c    x2.
Total      x.1    x.2    …      x.c    N
in which γij = πi1 + πi2 + … + πij and θ is the log(PO). Whitehead (1993) shows that the score statistic for testing the null hypothesis H0 : θ = 0 is

$$Z = \frac{1}{N}\sum_{i=1}^{c} x_{2i}\left(L_{1i} - U_{1i}\right)$$

in which

$$L_{ji} = \sum_{k=1}^{i-1} x_{jk}\;\; (i = 2, \ldots, c); \qquad U_{ji} = \sum_{k=i+1}^{c} x_{jk}\;\; (i = 1, \ldots, c-1); \qquad L_{j1} = U_{jc} = 0.$$

This is the Mann–Whitney form of the Wilcoxon test. Asymptotically, Whitehead (1993) also demonstrates that Z ~ N(θV, V), where

$$V = \frac{x_{1.}\,x_{2.}\,N}{3(N+1)^2}\left(1 - \sum_{j=1}^{c} \bar\pi_j^3\right)$$

and

$$\bar\pi_j = \frac{x_{1.}\pi_{1j} + x_{2.}\pi_{2j}}{x_{1.} + x_{2.}}$$

is the assumed proportion of subjects in category j. The estimated sample version is

$$\hat V = \frac{x_{1.}\,x_{2.}\,N}{3(N+1)^2}\left(1 - \sum_{j=1}^{c} \left(\frac{x_{.j}}{N}\right)^3\right).$$

Further, $Z/\sqrt{V} \sim N\left(\theta\sqrt{V}, 1\right)$, and clearly under the null hypothesis $Z/\sqrt{V} \sim N(0, 1)$. This is the Mann–Whitney version of the Wilcoxon test taking into account the possibility of ties. Whitehead (1993) shows that, under equal allocation, for which x1. = x2. = N/2, if θ = θA, then H0 will be rejected with a type I error of α with probability 1 − β if the sample size N satisfies

$$\frac{N^3}{(N+1)^2} = \frac{12\left(Z_{1-\alpha/2} + Z_{1-\beta}\right)^2}{\theta_A^2\left(1 - \sum_{j=1}^{c} \bar\pi_j^3\right)}.$$

For large N, N/(N + 1) ≈ 1, so that the required sample size is

$$N = \frac{12\left(Z_{1-\alpha/2} + Z_{1-\beta}\right)^2}{\theta_A^2\left(1 - \sum_{j=1}^{c} \bar\pi_j^3\right)}. \qquad (8.1)$$
We take the assumptions in Section 8.2.1 and plug them into (8.1); using these values in the expression for N above, for a power of 92.2% and a one-sided type I error of 0.025, gives N = 218, which for the purposes of the protocol was increased to 220. This sample size is also enough to demonstrate significance with 80% power for an odds ratio of 2.0, a requirement of the sponsor. The maximum sample size of the study is set at 330 patients, which is based on the rule that sample size re-estimation should not increase the sample size by more than 50%.

8.2.3 Applying Conditional Power to the Proportional Odds Wilcoxon Test

We can apply the general approach to CP in Section 7.2 to the MWT, which gives the following. If m0 is the total sample size in the 1st stage and m1 is the corresponding total sample size in the 2nd stage – the value here will depend upon whether or not SSR has been conducted – then we can use (7.5) in which f = m0/(m0 + m1),
$$Z_{INT} = \frac{\dfrac{1}{m_0}\displaystyle\sum_{i=1}^{c} x_{2i}\left(L_{1i} - U_{1i}\right)}{\sqrt{\dfrac{m_0^3}{12(m_0+1)^2}\left(1 - \displaystyle\sum_{j=1}^{c}\left(\frac{x_{.j}}{m_0}\right)^3\right)}}$$

and

$$Z_0 = \theta_0\,\sqrt{\frac{m_1^3}{12(m_1+1)^2}\left(1 - \sum_{j=1}^{c} \bar\pi_j^3\right)},$$

where θ0 is the log odds ratio at which the CP is to be evaluated (here log(1.5)).
The same formulae can be used for SSR – the only change that is necessary is to the value m1, which is searched between the planned 2nd-stage sample size and the maximum sample size allowed.
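A sketch of the interim calculation with invented stage-1 counts; the data below are hypothetical, and π̄j is estimated from the stage-1 column totals, an assumption not spelled out in the text:

```python
import math

# Hypothetical interim illustration of Section 8.2.3: the Mann-Whitney score,
# its variance, and the conditional power (7.5) at theta0 = log(1.5).
# x1 = placebo counts, x2 = Verum counts over c = 8 ordered categories
# (invented numbers for m0 = 110 interim patients, 55 per arm).
x1 = [5, 6, 6, 8, 17, 12, 0, 1]
x2 = [3, 3, 4, 6, 16, 21, 1, 1]
m0 = sum(x1) + sum(x2)
c = len(x1)

L1 = [sum(x1[:i]) for i in range(c)]        # L_{1i}: placebo counts below category i
U1 = [sum(x1[i + 1:]) for i in range(c)]    # U_{1i}: placebo counts above category i
score = sum(x2[i] * (L1[i] - U1[i]) for i in range(c)) / m0

xdot = [a + b for a, b in zip(x1, x2)]      # column totals x_.j
tie_term = 1 - sum((x / m0) ** 3 for x in xdot)
V = m0**3 * tie_term / (12 * (m0 + 1) ** 2)
z_int = score / math.sqrt(V)                # standardised interim statistic

# conditional power (7.5) for a further m1 = 110 patients at theta0 = log(1.5),
# with pi-bar estimated from the stage-1 column totals (an assumption)
m1, theta0 = 110, math.log(1.5)
z0 = theta0 * math.sqrt(m1**3 * tie_term / (12 * (m1 + 1) ** 2))
f = m0 / (m0 + m1)
Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))
z_alpha = -1.95996                          # one-sided alpha = 0.025
cp = Phi((z_alpha + math.sqrt(f) * z_int + math.sqrt(1 - f) * z0) / math.sqrt(1 - f))
print(round(z_int, 2), round(cp, 2))        # interim Z and conditional power
```

With these invented counts the CP falls below the 0.9 threshold, so the design's SSR step would be triggered.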
8.2.4 Statistical Approach to Control the Type I Error

The use of SSR can have a negative impact on the type I error. There are different ways to control the type I error in these circumstances, one of which is the combination principle (Bauer and Köhne, 1994; Lehmacher and Wassmer, 1999). The basic idea behind this principle is to analyse the data before and after the interim analysis separately, determining the p-values p1 and p2 from the appropriate tests of the null hypothesis, and then to combine them. Bauer and Köhne (1994) use Fisher's combination method based on summing the −log(p-values), while Lehmacher and Wassmer (1999) proposed the inverse-normal approach, which was the approach used in this study.

It was planned that the interim analysis would be conducted after half the patients had completed their 6-month assessment (110 patients). The analysis was to proceed as follows:

1. The MWT compared the arms and the associated p-value, p1, was calculated.
2. If the p-value > 0.5, the study stopped – this condition corresponds to an odds ratio less than 1, so that placebo patients out-perform treated patients.
3. If the p-value < 0.5, the CP based on an odds ratio of 1.5 is calculated (Section 8.2.3). If it is > 0.9, the study continues unchanged.
4. If the CP < 0.9, SSR determines a new second-stage sample size and, if this is less than 220 patients (so that the overall sample size is less than 330 patients), the study continues with the new sample size. If the overall sample size would be greater than 330 patients, the study will stop.

If the study continues to the end, the analysis proceeds as follows:

1. The MWT compares the arms based on data from patients randomised in the 2nd stage, and the associated p-value, p2, is calculated.
2. The p-values, p1 and p2, are combined using the formula

$$\rho(p_1, p_2) = 1 - \Phi\!\left(\omega_1\,\Phi^{-1}(1-p_1) + \omega_2\,\Phi^{-1}(1-p_2)\right)$$

in which $\omega_1 = \omega_2 = \sqrt{0.5}$. If ρ(p1, p2) is less than 0.025, then significance can be claimed. The results of Bauer and Köhne (1994) and Lehmacher and Wassmer (1999) show that this approach controls the type I error. It is important to note that the ω-values are not altered in the light of the SSR, which is necessary to ensure control of the type I error.
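The inverse-normal combination rule can be sketched as follows; the stage-wise p-values used here are hypothetical:

```python
import math
from statistics import NormalDist

# Sketch of the inverse-normal combination rule (Lehmacher & Wassmer, 1999)
# with equal weights w1 = w2 = sqrt(0.5); the stage p-values are hypothetical.
nd = NormalDist()
w1 = w2 = math.sqrt(0.5)

def combined_p(p1, p2):
    z = w1 * nd.inv_cdf(1 - p1) + w2 * nd.inv_cdf(1 - p2)
    return 1 - nd.cdf(z)

# e.g. two one-sided stage-wise p-values of 0.05 and 0.04 (hypothetical):
p = combined_p(0.05, 0.04)
print(round(p, 4), p < 0.025)    # 0.0082 True: significance can be claimed
```

Note that neither stage p-value is below 0.025 on its own, yet the combined p-value is, illustrating how evidence from the two stages accumulates while the pre-fixed weights protect the type I error.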
Case Studies in Simulation
8.2.5 Simulation Set-Up

In the actual simulation study, four different scenarios were considered:

A. Neither a test for futility nor SSR is carried out at the interim.
B. Only a test for futility is carried out at the interim.
C. Only SSR is carried out at the interim.
D. Both a test for futility and SSR are carried out at the interim.

In each of these scenarios, ordinal data were simulated using the placebo category proportions given in Section 8.1. The following odds ratios from the PO model were simulated: 1.0 and 1.1 (0.2) 2.3, that is, 1.1 to 2.3 in steps of 0.2, in which 1.0 corresponds to the null hypothesis. The 1st stage of the study used a total sample size of 110 patients; if unchanged, the 2nd stage also used 110 patients. SSR allows up to 330 patients in total in the 1st and 2nd stages combined.
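The simulation of a single arm's ordinal data can be sketched as follows. This is an illustrative Python sketch, not the code actually used; the placebo category proportions shown are placeholders for the values given in Section 8.1, and the seed is arbitrary:

```python
import random

def po_category_probs(placebo_probs, odds_ratio):
    """Category probabilities for the treated arm under a proportional
    odds (PO) model: every cumulative odds of the placebo distribution
    is multiplied by `odds_ratio` (the direction of benefit depends on
    how the ordinal categories are coded)."""
    cum, q = 0.0, []
    for p in placebo_probs[:-1]:
        cum += p
        odds = odds_ratio * cum / (1.0 - cum)
        q.append(odds / (1.0 + odds))
    q.append(1.0)
    return [q[0]] + [q[j] - q[j - 1] for j in range(1, len(q))]

def simulate_arm(probs, n, rng):
    """Draw n ordinal responses (category indices 0, 1, ...)."""
    return rng.choices(range(len(probs)), weights=probs, k=n)

# Placeholder proportions; the actual values are those of Section 8.1
placebo = [0.3, 0.3, 0.2, 0.1, 0.1]
treated = po_category_probs(placebo, 1.5)
rng = random.Random(2022)
stage1 = simulate_arm(treated, 55, rng)  # one arm of the 110-patient 1st stage
```

With an odds ratio of 1.0 the treated-arm probabilities reduce to the placebo probabilities, which is the null hypothesis of the simulation grid.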
TABLE 8.2
Simulation Results for an Adaptive Design with an Interim Including a Test of Futility and SSR (Scenarios A and D): Estimated Percentages

Test for Futility: No; SSR: No (Scenario A)
Odds Ratio   Futility Stop   Sample Size > Max   Significant Study   Nonsignificant Study   Average Sample Size
1.0          –               –                   2.5                 97.5                   220
1.1          –               –                   5.9                 94.2                   220
1.3          –               –                   19.4                80.6                   220
1.5          –               –                   39.4                60.7                   220
1.7          –               –                   59.6                40.5                   220
1.9          –               –                   75.4                24.6                   220
2.1          –               –                   86.2                13.8                   220
2.3          –               –                   92.8                7.2                    220

Test for Futility: Yes; SSR: Yes (Scenario D)
Odds Ratio   Futility Stop   Sample Size > Max   Significant Study   Nonsignificant Study   Average Sample Size
1.0          50.8            45.9                0.9                 2.5                    116
1.1          38.5            54.9                2.8                 3.7                    122
1.3          22.4            63.6                9.8                 4.3                    135
1.5          11.9            62.6                22.2                3.4                    154
1.7          6.3             55.4                35.9                2.4                    174
1.9          3.3             45.4                50.3                1.1                    191
2.1          1.5             35.4                62.5                0.6                    207
2.3          0.8             27.6                71.4                0.2                    216
Hybrid Frequentist and Bayesian Power in Planning Clinical Trials
8.2.6 Simulation Results

Table 8.2 summarises the results of the simulations of scenarios A and D. The type I error of the study is maintained: the estimated type I error is either 2.46% (scenario A) or 0.86% (scenario D). The conservatism of these results is due to both the test of futility and the SSR procedure. Under the null, and for small departures from the null, the power of the study is very small. For example, in scenario A and for odds ratios of 1.5 or less, the power is less than 40%; to achieve 90% CP, a large increase in sample size is necessary and, in most cases, would exceed the maximum allowed. The simulation results confirm the power calculations based on the asymptotic approximation. For example, in scenario A the power for an odds ratio of 2.3 is estimated at 92.8%. For odds ratios of 1.9 and 2.1 – straddling the value 2.0 specified by the sponsor – the powers are 75.4% and 86.2%, respectively, consistent with 80% power for an odds ratio of 2. When using a futility stop there is some loss of power, but it is not large. This could be controlled, but would have implications for the sample size of the study.

For none of these scenarios is SSR particularly effective. This is most probably due to the requirement that the CP be determined for an odds ratio of 1.5. The approach that we are using effectively tests on the log(odds ratio) scale. The study sample size was based on an odds ratio of 2.3, which on the log scale has the value 0.833. The corresponding log value for an odds ratio of 1.5 is 0.405, less than half of the previous value. Halving the effect to be detected increases the sample size by approximately a factor of 4, which is much larger than the maximum allowed.
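The factor-of-4 claim can be checked directly from the two log odds ratios quoted in the text:

```python
import math

# n is proportional to 1/log(OR)^2, so moving the detectable effect from
# an odds ratio of 2.3 down to 1.5 inflates the required sample size by:
log_or_design = math.log(2.3)   # 0.833 on the log scale
log_or_cp = math.log(1.5)       # 0.405, the odds ratio used for the CP
inflation = (log_or_design / log_or_cp) ** 2
print(round(inflation, 1))      # -> 4.2
```

An original total of 220 patients would thus require roughly 930, far above the 330-patient cap, which is why SSR rarely helps under these requirements.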
8.3 Case Study 2 – Unplanned Interim Analysis

When recruitment to clinical trials is difficult, a circumstance which is not uncommon in drug development, a development team may need to make an assessment as to whether recruitment to the trial should continue, or whether the trial should be halted, which may have implications for the development program. Whilst the final decision is likely to be based not only on clinical considerations but also on commercial considerations, it is important that operational information is used in the decision-making process. One approach is to predict the final outcomes of the study and assess whether the objectives of the trial are still likely to be achievable. In the current case study, we examine an example in which the primary endpoint is binary, and in which the numbers of patients, and outcomes, in two subgroups are of interest.
8.3.1 Background

The context in this case study is an active-controlled Phase II study which was planned to compare the efficacy and safety of a novel combination antibiotic (Test) with a standard combination (Control) in hospitalised adult patients suffering from complicated urinary tract infections (cUTI). The primary efficacy endpoint was the eradication of uropathogens from ≥10⁵ colony-forming units (CFU)/mL to <10⁴ CFU/mL and no pathogens in the blood in the microbiologically evaluable (ME) population. The ME population consisted of subjects with a positive urine culture with bacteria susceptible to either antibiotic at enrolment (≥10⁵ CFU/mL, or >10⁴ CFU/mL if bacteraemia present). The study was planned to enrol 150 patients but was not powered to demonstrate non-inferiority.

8.3.2 Interim Data

After approximately 18 months, data were available on the response of 57 patients in the ME population, of whom 27 received the Test and 30 the Control. The data are displayed in Table 8.3, split by those patients with resistant (to the active treatment) uropathogens and those with non-resistant uropathogens. The sponsor was interested in assessing the likelihood that the study would provide evidence enabling it to exclude large negative treatment effects if the study ran to its conclusion. The sponsor was interested in making the assessment both for all pathogens and for the subgroup of patients with resistant pathogens. To do this, a Bayesian approach was proposed.

8.3.3 Model for Prediction

The population eradication rates for the resistant pathogens in the Test and Control arms are denoted by θ1 and θ2, respectively, and the corresponding rates for non-resistant pathogens are ϕ1 and ϕ2, respectively. Since the proportion of pathogens that are resistant is not known exactly, we need to account for this uncertainty in the model. This rate is denoted by π.
TABLE 8.3
Interim Data: Eradication Rates in 57 Patients in the ME Population by Pathogen Types

                  Test (Active)                     Control
Pathogen Type     Eradicated          Total         Eradicated          Total
Non-Resistant     r1,NR = 15 (83%)    n1,NR = 18    r2,NR = 17 (77%)    n2,NR = 22
Resistant         r1,R = 7 (78%)      n1,R = 9      r2,R = 5 (63%)      n2,R = 8
Total             r1 = 22             n1 = 27       r2 = 22             n2 = 30
Under these assumptions, the eradication rate of all pathogens in the Test arm is ψ1 = πθ1 + (1 − π)ϕ1, while the corresponding rate in the Control arm is ψ2 = πθ2 + (1 − π)ϕ2. The difference between the two eradication rates is denoted by δ = ψ1 − ψ2. Each component is assumed to have a binomial distribution. So, for example, the number of patients whose resistant pathogens were eradicated in the Test arm, r1,R, has the binomial distribution
r1,R ~ Bin(θ1, n1,R).
Also, the number of patients with resistant pathogens, n1,R + n2,R, has a binomial distribution
n1,R + n2,R ~ Bin(π, n1 + n2).
In a Bayesian analysis with small numbers of patients in each treatment-by-pathogen-type subgroup, an informative prior, unless based on prior data, could have a large impact on the inferences that can be drawn. For this analysis, therefore, the priors for the unknown parameters θ1, θ2, ϕ1, ϕ2 and π were taken to be uniform, that is, each can take any value between 0 and 1. Based on the data in Table 8.3, the posterior distributions for the resistant pathogen rate and for the treatment differences in the resistant subgroup and for all pathogens are shown in Figure 8.1(a) and (b), respectively. Clearly, because the sample sizes are small, there is considerable uncertainty in the inferences we can make, as is shown by the width of these posterior distributions. The sponsor was interested in two sets of predictions. The first set was simply to provide predictions of the two treatment effects. The second set was to provide a prediction of the lower 95% confidence limit for the treatment effects. These latter predictions were intended to allow the sponsor to assess the chance of showing non-inferiority of Test to Control. If the study were allowed to run to its natural conclusion, there would be a further 93 patients in total: 48 patients in the Test arm and 45 in the Control arm. Of the 48 in the Test arm, there will be n1,RP with resistant pathogens and 48 − n1,RP with non-resistant pathogens, and correspondingly n2,RP with resistant pathogens and 45 − n2,RP with non-resistant pathogens in the Control arm. Both of these have binomial distributions, namely
n1,RP ~ Bin(π, 48) and n2,RP ~ Bin(π, 45)
and we can predict the likely values by averaging these binomial distributions over the posterior distribution of π based on the current data (Figure 8.1(a)). In the same way, given a value for the number of resistant and non-resistant pathogens in each arm, we can look to predict the number of eradicated
FIGURE 8.1 Posterior Distributions of the (a) Proportion of Patients with Resistant Pathogens and (b) Treatment Effect by Pathogen Types
pathogens, conditional on this value from the appropriate binomial distribution whose parameter is then averaged over the appropriate posterior distribution. Finally, the number of eradicated pathogens is then averaged over the predicted distribution of the number of pathogens determined previously. For example, in the Test arm the number of eradicated “resistant pathogens”, r1, RP, can be generated from the binomial distribution
r1,RP ~ Bin(θ1, n1,RP).
Again, this binomial distribution is averaged, this time over the posterior distributions of both π and θ1. The predictive distribution of r2,RP can be similarly generated. From these distributions, the predictive distribution of the estimated treatment effect is calculated, as well as the predictive distribution of the lower 95% confidence limit for the treatment effect. A similar process leads to the predictive distribution of the estimated treatment effect for "all pathogens". The relevant calculations were carried out using the OpenBUGS system. Since the main interest is centred on the estimated treatment effects, we show their predictive distributions in Figure 8.2(a). While these distributions
FIGURE 8.2 Predictive Inference: (a) Predictive Distribution of Treatment Effects and (b) Cumulative Predictive Distribution of 95% Lower Confidence Limit
are of interest, the reality is that, in general, inferences are made about specific null hypotheses. In this case, the most appropriate null hypothesis was thought to be one of non-inferiority, in which it is postulated that the treatment difference is less than a pre-determined margin (see Section 3.7.1), although in this case such a margin had not been pre-specified. When the interim data became available, the question of interest was: how likely is it, if the study continues to the end of recruitment, that we will be able to reject a non-inferiority null hypothesis? This question can be answered by calculating the cumulative predictive distribution of the lower 95% confidence limit for the treatment effect. A high predictive probability that the lower 95% confidence limit is greater than the appropriate non-inferiority margin will then generate positive evidence in favour of the Test drug. Figure 8.2(b) displays the relevant cumulative distribution functions. The distribution function for "all pathogens" is steeper than that for "resistant pathogens", indicating that there is less uncertainty for the former. From these distribution functions, we could determine the chance of demonstrating non-inferiority if the study were to continue. What these results indicate is that, for any non-inferiority margin below −20%, there was a very high probability (>90%) of showing non-inferiority of the two treatments in terms of eradicating "all pathogens". In contrast, in the case of "resistant pathogens", the corresponding probability is only just greater than 65%. As the margin increases, these probabilities reduce: at −15%, they are ~84% and ~56%, respectively, and at −10%, they are ~68% and ~43%, respectively. The sponsor decided to stop enrolment into this study. This decision was taken not solely on the basis of the statistical analysis, but primarily on operational and commercial grounds.
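The prediction scheme of Section 8.3.3 can be sketched without OpenBUGS by exploiting conjugacy: with uniform priors, each binomial rate has a Beta(r + 1, n − r + 1) posterior. The following Python sketch (structure and seed are illustrative, not the code actually used) draws from the predictive distribution of the estimated "all pathogens" treatment effect at study end:

```python
import random

rng = random.Random(2022)  # illustrative seed

def post_draw(r, n):
    """One draw from the posterior of a binomial rate under a uniform
    prior: Beta(r + 1, n - r + 1)."""
    return rng.betavariate(r + 1, n - r + 1)

def one_prediction():
    """One draw from the predictive distribution of the estimated
    'all pathogens' treatment effect at study end (75 per arm)."""
    # Rates drawn from their posteriors, using the Table 8.3 counts
    p = post_draw(9 + 8, 27 + 30)                   # pi, proportion resistant
    t1, t2 = post_draw(7, 9), post_draw(5, 8)       # theta1, theta2
    f1, f2 = post_draw(15, 18), post_draw(17, 22)   # phi1, phi2
    # Resistant / non-resistant split of the 93 future patients
    n1r = sum(rng.random() < p for _ in range(48))
    n2r = sum(rng.random() < p for _ in range(45))
    # Future eradications, added to the interim counts (22 per arm)
    r1 = 22 + sum(rng.random() < t1 for _ in range(n1r)) \
            + sum(rng.random() < f1 for _ in range(48 - n1r))
    r2 = 22 + sum(rng.random() < t2 for _ in range(n2r)) \
            + sum(rng.random() < f2 for _ in range(45 - n2r))
    return r1 / 75 - r2 / 75

draws = sorted(one_prediction() for _ in range(4000))
median, lower_5pc = draws[2000], draws[200]
```

The same loop, applied to the resistant subgroup only, or to a lower confidence limit computed from each simulated final data set, yields the other predictive distributions discussed above.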
9 Decision Criteria in Proof-of-Concept Trials
9.1 Introduction

Several pharmaceutical companies have recently begun to move away from the use of a single statistical test for making GO/NOGO decisions in early phase clinical trials. Pfizer (Lalonde et al., 2007; Walley et al., 2015; Chuang-Stein et al., 2011b), AstraZeneca (Frewer et al., 2016), MedImmune (Pulkstenis et al., 2017) and Novartis (Fisch et al., 2015) have all developed and implemented evidence-based decision-making frameworks which go beyond significance testing and a single p-value to support decision-making. The purpose of this chapter is to relate the concepts of AP and assurance to this more general decision framework.
9.2 General Decision Criteria for Early Phase Studies

Quantitative decision criteria are rules which are used to decide on the appropriate course of action, or decision, to be taken after a clinical trial has been completed and pertinent data for a compound have been collected. Increasingly, drugs that receive market authorisation are under intensive examination concerning the value they deliver to patients. Such an examination is conducted in multiple stages and comprises an appraisal of a drug's risk/benefit, its efficacy in comparison to competitor treatments and its cost. During the development of a new treatment, it is important to keep in focus the objective of the program, and this is generally achieved by the creation of a target product profile (TPP) which catalogues the desired efficacy, safety and other characteristics, for example the regimen, of the drug (Breder et al., 2017). There is no unique superior way to characterise the desired efficacy in the TPP. In some cases, a simple threshold is chosen. An alternative approach was introduced by Lalonde et al. (2007), who proposed that efficacy within
TABLE 9.1
Study Outcomes and Associated Decisions in a Trial with Multiple Decision Criteria

                                 Relevance
Minimum Requirement      Yes            No
Yes                      (A): GO        (B): PAUSE
No                       (C): PAUSE     (D): NOGO
the TPP be defined by a lower reference value (LRV) and a target value (TV). The LRV is the lowest level of efficacy which is acceptable, which some have characterised as the "dignity line" or a level that allows the sponsor to pass a "red-faced test", and is used to restrict the chance that a drug with little efficacy is progressed. In contrast, the TV is the base-case option, a level of efficacy that a sponsor would wish to achieve in order to indicate that their drug is a strong competitor to the existing standard of care, and it ensures that a commercially viable drug is progressed. These two criteria are used to enable a decision to be taken between GO, PAUSE and NOGO. In our example we consider the following criteria:

(i) If there is high confidence (1 − α0) that the effect δ, relative to placebo, is greater than δLRV (LRV), then the MINIMUM REQUIREMENT has been achieved; and
(ii) if there is moderate confidence (1 − α1) that the effect δ, relative to placebo, is greater than δTV (TV), then RELEVANCE has been achieved.

Together, these criteria allow us to create a decision table, shown in Table 9.1, based on the study outcome.
9.3 Known Variance Case

Returning to the set-up in Section 2.2, in which n1 patients are in each arm, the difference in sample means δ̂ has the distribution (1.1) and the variance is known. Under these assumptions, the requirement of excluding the LRV at confidence level (1 − α0)% can be expressed as

δ̂ − Z1−α0 √(2σ²/n1) > δLRV, that is, δ̂ > δLRV + Z1−α0 √(2σ²/n1).
Similarly, excluding the TV at confidence level (1 − α1)% can be expressed as
δ̂ − Z1−α1 √(2σ²/n1) > δTV, that is, δ̂ > δTV + Z1−α1 √(2σ²/n1).
Both conditions will be met (GO) if the estimated treatment effect δ̂ is greater than

MAX = max( δLRV + Z1−α0 √(2σ²/n1), δTV + Z1−α1 √(2σ²/n1) ).   (9.1)
Similarly, both conditions are unmet (NOGO) if δ̂ is less than

MIN = min( δLRV + Z1−α0 √(2σ²/n1), δTV + Z1−α1 √(2σ²/n1) ).   (9.2)
Finally, only one condition is met (PAUSE) if δ̂ satisfies the relationship MIN < δ̂ < MAX.
If the true treatment difference is δ, the sampling distribution of the estimated treatment difference is δ̂ ~ N(δ, 2σ²/n1), from which

PGO = √(n1/(2π·2σ²)) ∫MAX^∞ exp( −n1(δ̂ − δ)²/(2·2σ²) ) dδ̂ = 1 − Φ( (MAX − δ)/(σ√(2/n1)) ),   (9.3)

PNOGO = √(n1/(2π·2σ²)) ∫−∞^MIN exp( −n1(δ̂ − δ)²/(2·2σ²) ) dδ̂ = Φ( (MIN − δ)/(σ√(2/n1)) ),   (9.4)

PPAUSE = √(n1/(2π·2σ²)) ∫MIN^MAX exp( −n1(δ̂ − δ)²/(2·2σ²) ) dδ̂ = 1 − PGO − PNOGO.   (9.5)
Example 9.1

Suppose we are designing a proof-of-concept (POC) study to generate evidence (a) that a new drug is efficacious, by showing statistical significance compared to placebo, and (b) that there is some evidence that the new drug is superior to a potential competitor.
The primary endpoint is a continuous measure with an assumed standard deviation of 2 units, and a sample size of n1 = 20 patients per arm is proposed. Since the first objective is to show significance over placebo, the LRV (δLRV) is 0 and we assume that α0 = 0.025 is deemed appropriate. If the potential competitor has shown a difference to placebo of 1.5 units, this is the TV (δTV = 1.5), and suppose that the sponsor decides that α1 = 0.3 provides a reasonable level of evidence. To illustrate the determination of the three probabilities PGO, PNOGO and PPAUSE, suppose that the true difference of the new drug compared to placebo is 1 unit (Figure 9.1). With these assumptions, we can determine from (9.1) and (9.2) that

MAX = max( 0 + 1.96√(2·2²/20), 1.5 + 0.524√(2·2²/20) ) = 1.832

and

MIN = min( 0 + 1.96√(2·2²/20), 1.5 + 0.524√(2·2²/20) ) = 1.240.

Substituting these values in (9.3)–(9.5) gives

PGO = 0.094, PNOGO = 0.648, PPAUSE = 0.258.
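These calculations can be reproduced directly from (9.1)–(9.5); a small Python sketch (standard library only):

```python
from statistics import NormalDist
import math

N = NormalDist()

def decision_probs(delta, n1, sigma, d_lrv, alpha0, d_tv, alpha1):
    """P(GO), P(NOGO) and P(PAUSE) from (9.1)-(9.5) for a true effect delta."""
    se = sigma * math.sqrt(2 / n1)  # SD of the estimated treatment effect
    t_lrv = d_lrv + N.inv_cdf(1 - alpha0) * se
    t_tv = d_tv + N.inv_cdf(1 - alpha1) * se
    hi, lo = max(t_lrv, t_tv), min(t_lrv, t_tv)   # MAX and MIN
    p_go = 1 - N.cdf((hi - delta) / se)
    p_nogo = N.cdf((lo - delta) / se)
    return p_go, p_nogo, 1 - p_go - p_nogo

# Example 9.1: delta = 1, n1 = 20, sigma = 2, LRV = 0, TV = 1.5
p_go, p_nogo, p_pause = decision_probs(1.0, 20, 2.0, 0.0, 0.025, 1.5, 0.3)
# -> approximately (0.094, 0.648, 0.258)
```

The same function, evaluated over a grid of `delta` values, generates the operating-characteristic curves shown below.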
Based on these assumptions, it is unlikely that the trial will turn out to be successful, as the probability of a GO decision is less than 10%. Furthermore, it is probable that at the end of the trial we will make the
FIGURE 9.1 A plot of intervals for the observed treatment effect (δ̂) giving rise to the three possible decisions.
131
Decision Criteria in Proof-of-Concept Trials
decision not to proceed with the development of the compound. It is important to remember that decisions about the cessation, or continuation, of a development project are not taken solely on the basis of the outcome of the primary endpoint. If multiple secondary endpoints, particularly longer-term clinical endpoints, show favourable outcomes while the primary outcome is a shorter-term biomarker, a sponsor may feel justified in continuing development even if the prespecified criteria have not been achieved.
In practice, when designing a study, it is insufficient to look only at one specific value of the true treatment effect, and a trial statistician will investigate the OC of the trial over a range of assumed true treatment effects. In analogy to traditional power calculations, we can calculate the decision probabilities as a function of potential true treatment effects covering a range of values which are of interest to the project team. There are two ways of displaying these functions.

Example 9.1 (Continued)

First, in Figure 9.2, we display the OC of the design in the form of what has come to be called a "Lalonde plot" (Lalonde et al., 2007). In a Lalonde plot, the probability of being within each decision interval as a function of the true treatment effect is displayed. This is a cumulative plot in the sense that the boundaries between the decision regions are defined by the values PGO and PGO + PPAUSE. This plot shows that if the true treatment effect is greater than approximately 2.35 units, the probability of
FIGURE 9.2 Operating characteristics of the design: n1 = 20, σ = 2, δLRV = 0, α0 = 0.025, δTV = 1.5, α1 = 0.3 with three potential decisions (stacked probabilities).
FIGURE 9.3 Operating characteristics of the design: n1 = 20, σ = 2, δLRV = 0, α0 = 0.025, δTV = 1.5, α1 = 0.3 with three potential decisions.
continuing the project is at least 80%. For a 90% probability of continuing, the true treatment effect needs to be at least 2.65 units. Not everyone finds it easy to interpret Lalonde plots, nor simple to read off the values of the probabilities. An alternative is to display the three decision probabilities individually, as in Figure 9.3. One interesting aspect of this graph is that PPAUSE has a maximum value which occurs when the true treatment effect is at the mid-point of the interval (MIN, MAX). As a consequence of the structure of the decision criteria, at this value of the true treatment effect, PGO = PNOGO.

Figure 9.2 displays the three decision probabilities for a range of true treatment effect values with the sample size fixed. In contrast, Figure 9.4 shows the probabilities for a true treatment effect of two units over a range of sample sizes. The apparent crossover point in this graph arises because it is possible to find a sample size for which PPAUSE = 0. To see this, note that (9.5) will be identically zero when MAX = MIN and, from (9.1) and (9.2), this implies

n1 = 2σ²(Z1−α1 − Z1−α0)² / (δLRV − δTV)².
This can be thought of as the sample size required to detect the difference between the LRV and TV at the one-sided α1 level with power α0. Based on the present assumptions, this value is 7.327.
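This crossover sample size is easily verified; a short check in Python (standard library only):

```python
from statistics import NormalDist

N = NormalDist()
# n1 = 2*sigma^2*(Z_{1-alpha1} - Z_{1-alpha0})^2 / (d_LRV - d_TV)^2
sigma, d_lrv, d_tv = 2.0, 0.0, 1.5
z0, z1 = N.inv_cdf(1 - 0.025), N.inv_cdf(1 - 0.3)
n1_cross = 2 * sigma**2 * (z1 - z0) ** 2 / (d_lrv - d_tv) ** 2
print(round(n1_cross, 3))  # -> 7.327
```

Below this (non-integer) value the GO and NOGO regions overlap, so a PAUSE outcome is impossible.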
FIGURE 9.4 Operating characteristics of the design as a function of sample size per arm: δ = 2, σ = 2, δLRV = 0, α0 = 0.025, δTV = 1.5, α1 = 0.3, with three potential decisions.
The shapes of the boundaries in the Lalonde plot in Figure 9.2 are determined by the parameters of the design. In Figure 9.5, four separate scenarios are shown demonstrating that the steepness, location, and separation of the boundary curves are highly dependent on these parameters. It is good practice in designing studies to gain an understanding of the sensitivity of the boundaries to changes in these parameters. This understanding allows the project team to choose the most appropriate combination of parameters for their trial.
FIGURE 9.5 Operating characteristics of four design scenarios, each with three potential decisions: (a) n1 = 20, σ = 2, δLRV = 0, α0 = 0.025, δTV = 1.5, α1 = 0.3. (b) n1 = 10, σ = 2, δLRV = 0, α0 = 0.025, δTV = 0.15, α1 = 0.3. (c) n1 = 25, σ = 3.2, δLRV = 0, α0 = 0.025, δTV = 1.75, α1 = 0.4. (d) n1 = 10, σ = 2, δLRV = 0, α0 = 0.025, δTV = 1.5, α1 = 0.1.
FIGURE 9.5 (Continued) (c) n1 = 25, σ = 3.2, δLRV = 0, α0 = 0.025, δTV = 1.75, α1 = 0.4. (d) n1 = 10, σ = 2, δLRV = 0, α0 = 0.025, δTV = 1.5, α1 = 0.1.
9.4 Known Variance Case – Generalised Assurance

As in our previous discussion of CP and expected (unconditional or absolute) power, the probabilities PGO, PNOGO and PPAUSE are conditional probabilities, conditional on δ. Unconditional versions can be obtained by taking the expectation with respect to the prior distribution (2.3). Alternatively, since the probabilities (9.3)–(9.5) are based on the conditional predictive distribution of the treatment effect (1.1), their unconditional versions can be obtained by replacing (1.1) in (9.3)–(9.5) by the unconditional predictive distribution of the data (2.10). Taking this latter approach, the unconditional probabilities (U) of each decision are given by

PGO,U = 1 − Φ( (MAX − δ0) / (σ√(2(1/n1 + 1/n0))) ),   (9.6)

PNOGO,U = Φ( (MIN − δ0) / (σ√(2(1/n1 + 1/n0))) ),   (9.7)

PPAUSE,U = 1 − PGO,U − PNOGO,U.   (9.8)
Example 9.1 (Continued)

Starting with the assumptions from Example 9.1, we now assume that the prior distribution of the treatment effect has mean δ0 = 2 units and an effective sample size of n0 = 10 per arm. With these additional values, as well as the values of MAX and MIN previously calculated, substitution into (9.6)–(9.8) gives

PGO,U = 0.561, PNOGO,U = 0.244, PPAUSE,U = 0.195.

The corresponding conditional values of the decision probabilities, given a true treatment effect of 2 units, are PGO = 0.605, PNOGO = 0.115, PPAUSE = 0.280. The additional uncertainty introduced by utilising the prior distribution reduces our belief that the program will progress beyond this study, although in truth the belief does not change dramatically.
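These unconditional probabilities follow from (9.6)–(9.8) by swapping the conditional sampling SD for the SD of the unconditional predictive distribution; a Python sketch of the calculation:

```python
from statistics import NormalDist
import math

N = NormalDist()

def unconditional_probs(n1, n0, d0, sigma, lo, hi):
    """(9.6)-(9.8): MAX (hi) and MIN (lo) are unchanged, but the
    conditional sampling SD is replaced by the SD of the unconditional
    predictive distribution of the estimated effect."""
    sd = sigma * math.sqrt(2 * (1 / n1 + 1 / n0))
    p_go = 1 - N.cdf((hi - d0) / sd)
    p_nogo = N.cdf((lo - d0) / sd)
    return p_go, p_nogo, 1 - p_go - p_nogo

# Example 9.1 continued: prior mean 2, effective prior sample size 10,
# with the MIN and MAX computed earlier (1.240 and 1.832)
p_go_u, p_nogo_u, p_pause_u = unconditional_probs(20, 10, 2.0, 2.0, 1.240, 1.832)
# -> approximately (0.561, 0.244, 0.195)
```

Setting `n0` very large recovers the conditional probabilities at `d0`, which is the link between the two rows of results quoted in the example.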
9.5 Bounds on Unconditional Decision Probabilities for Multiple Decision Criteria

We have seen in previous chapters that the unconditional probabilities AP, assurance and BP all have a bound determined by the relevant prior distribution. The same holds true in the case of multiple decision criteria. To see this, let n1 → ∞ in (9.1), (9.2) and (9.6)–(9.8); then MAX → δTV, MIN → δLRV and

PGO,U → 1 − Φ( √n0 (δTV − δ0) / (√2 σ) ),   (9.9)

PNOGO,U → Φ( √n0 (δLRV − δ0) / (√2 σ) ),   (9.10)

PPAUSE,U → Φ( √n0 (δTV − δ0) / (√2 σ) ) − Φ( √n0 (δLRV − δ0) / (√2 σ) ),   (9.11)
which are the prior probabilities of the treatment effect being greater than TV, less than LRV and in between these two values, respectively.
Example 9.1 (Continued)
FIGURE 9.6 Ternary plot of PGO, U, PNOGO, U and PPAUSE, U as a function of n1 the sample size per arm, based on (9.6)–(9.8).
Figure 9.6 shows a ternary plot of the unconditional decision probabilities as a function of the sample size per arm (n1). This graph is worthy of comment on several counts. First, the sample size curve touches the PGO, U axis, which corresponds to PPAUSE, U = 0, and this occurs at the same sample size we noted when discussing the crossover in Figure 9.4. Second, the upper bound based on the prior distribution is indicated by the blue lines, and these lines show which regions of the simplex are achievable. Finally, as in the previous cases, the speed of convergence to the asymptotic value is very slow. For example, the value of n1 for which PPAUSE, U = 0 is ~7, whilst the corresponding value when PGO, U = 0.6 is 37, and this is still a considerable distance from the asymptote.
9.6 Bayesian Approach to Multiple Decision Criteria

In this chapter, we have considered only a standard frequentist approach to designs incorporating multiple decision criteria, but as seen in Chapter 5, there are Bayesian analogues. Under a Bayesian model with prior given by (2.3), the requirement of excluding the LRV at credible level (1 − α0)% can be expressed as

δ̂ > Z1−α0 σ√(2(n1 + n0))/n1 + ((n1 + n0)/n1) δLRV − (n0/n1) δ0 = CRIT1
137
Decision Criteria in Proof-of-Concept Trials
and similarly, the requirement of excluding the TV at credible level (1 − α1)% can be expressed as

δ̂ > Z1−α1 σ√(2(n1 + n0))/n1 + ((n1 + n0)/n1) δTV − (n0/n1) δ0 = CRIT2.

As in Section 9.3, we have the following:
(i) Both conditions will be met (GO) if δ̂ > max(CRIT1, CRIT2) = MAX;
(ii) neither condition will be met (NOGO) if δ̂ < min(CRIT1, CRIT2) = MIN;
(iii) only one of the two conditions will be met (PAUSE) if MIN < δ̂ < MAX.

In Section 5.2, we pointed to three different types of Bayesian power: (i) CBP(δ), defined by the probability of a successful study with respect to the distribution of future data δ̂ ~ N(δ, 2σ²/n1); (ii) CBP(δ0), where δ0 is our current expectation of the treatment effect; and (iii) BP, calculated from the unconditional predictive distribution (2.10). Each of these approaches can be applied to determining P(GO), P(NOGO) and P(PAUSE), and they are shown in Table 9.2. This table is interesting on two grounds. First, it clearly highlights the reason for the relationships between conditional and unconditional POS that we have already seen in (2.7), (3.4), (5.6), (7.8) and (7.12). The arguments of the standard normal integral in the second and third rows of Table 9.2 differ only by a multiple which equals √(n0/(n1 + n0)) = √f0. This difference comes solely from the difference in the variances of the corresponding predictive distributions. Second, it allows us to investigate the asymptotic behaviour of the unconditional Bayesian probabilities given in the final row. As the sample size per arm n1 → ∞, CRIT1 → δLRV and CRIT2 → δTV so that, since by definition δTV > δLRV, we have MAX → δTV and MIN → δLRV as in Section 9.5. Consequently, the probabilities in the final row of Table 9.2 tend, as n1 → ∞, exactly to the probabilities (9.9)–(9.11). This result mirrors the equivalence of the upper bounds for AP/assurance that was pointed out in Section 5.4. The three unconditional probabilities as a function of sample size per arm closely match the type of behaviour seen in the ternary plot shown in Figure 9.6, at least if the amount of prior information is small.
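The thresholds CRIT1 and CRIT2 can be computed directly; a Python sketch for the assumptions of Example 9.1 (the numerical values in the test are our own calculations from the formulas above, not quoted from the book):

```python
from statistics import NormalDist
import math

def bayes_criteria(n1, n0, d0, sigma, d_lrv, alpha0, d_tv, alpha1):
    """Thresholds CRIT1 and CRIT2 on the observed effect that exclude
    the LRV / TV at credible levels (1 - alpha0) / (1 - alpha1) under a
    normal prior with mean d0 and effective sample size n0 per arm."""
    def crit(target, alpha):
        z = NormalDist().inv_cdf(1 - alpha)
        return (z * sigma * math.sqrt(2 * (n1 + n0)) / n1
                + (n1 + n0) / n1 * target
                - n0 / n1 * d0)
    return crit(d_lrv, alpha0), crit(d_tv, alpha1)

# Example 9.1's assumptions: the optimistic prior (mean 2) pulls both
# thresholds below the frequentist MIN = 1.240 and MAX = 1.832
c1, c2 = bayes_criteria(20, 10, 2.0, 2.0, 0.0, 0.025, 1.5, 0.3)
```

With a prior mean above both the LRV and the TV, less observed evidence is needed to exclude either value, which is why the Bayesian thresholds sit below their frequentist counterparts here.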
In this case, as the sample size per arm increases, the corresponding sample size curve again touches the PGO,U axis, which as before corresponds to PPAUSE,U = 0. This time, the sample size per arm at which this occurs is given by

n1 = 2σ²(Z1−α1 − Z1−α0)² / (δLRV − δTV)² − n0,
TABLE 9.2
Bayesian Probabilities of GO, NOGO and PAUSE Based on Two Conditional Predictive Distributions and an Unconditional Predictive Distribution

Predictive distribution: δ̂ ~ N(δ, 2σ²/n1)
  P(GO)    = 1 − Φ( (MAX − δ)/(σ√(2/n1)) )
  P(NOGO)  = Φ( (MIN − δ)/(σ√(2/n1)) )
  P(PAUSE) = Φ( (MAX − δ)/(σ√(2/n1)) ) − Φ( (MIN − δ)/(σ√(2/n1)) )

Predictive distribution: δ̂ ~ N(δ0, 2σ²/n1)
  P(GO)    = 1 − Φ( (MAX − δ0)/(σ√(2/n1)) )
  P(NOGO)  = Φ( (MIN − δ0)/(σ√(2/n1)) )
  P(PAUSE) = Φ( (MAX − δ0)/(σ√(2/n1)) ) − Φ( (MIN − δ0)/(σ√(2/n1)) )

Predictive distribution: δ̂ ~ N(δ0, 2σ²(1/n0 + 1/n1))
  P(GO)    = 1 − Φ( (MAX − δ0)/(σ√(2(1/n0 + 1/n1))) )
  P(NOGO)  = Φ( (MIN − δ0)/(σ√(2(1/n0 + 1/n1))) )
  P(PAUSE) = Φ( (MAX − δ0)/(σ√(2(1/n0 + 1/n1))) ) − Φ( (MIN − δ0)/(σ√(2(1/n0 + 1/n1))) )
in which the subtraction of n0 is in one sense a recognition of the contribution of the prior density to the totality of information. On the other hand, if the amount of prior information is large, in particular if n0 is greater than 2σ²(Z1−α1 − Z1−α0)²/(δLRV − δTV)², the sample size curve no longer touches the PGO,U axis as n1 increases.
Example 9.1 (Continued)

We can illustrate this last point using the assumptions which led to the ternary plot in Figure 9.6. This time, we use the results in the final row of Table 9.2. The unconditional decision probabilities as a function of the sample size per arm (n1) are shown in a similar ternary plot in Figure 9.7. In this case, the figure shows that the sample size curve does not touch the PGO,U axis. We have already seen, in the analysis that followed Figure 9.4, that the value of the expression 2σ²(Z1−α1 − Z1−α0)²/(δLRV − δTV)² for this example is 7.327. This value is less than the value of n0 in this example, which is 10, and therefore the above equation would give a negative value for n1, which accounts for the shape of the plot.
FIGURE 9.7 Ternary plot of PGO, U, PNOGO, U and PPAUSE, U as a function of n1, the sample size per arm based on Table 9.2.
9.7 Posterior Conditional Distributions with Multiple Decision Criteria In Section 5.7, we introduced the idea of posterior conditional success and failure distributions, which are the posterior distributions we would expect given only information that the study had met its success criteria or not. These distributions were available whether a success/failure was defined based on a significance test or a Bayesian approach and are of use in assessing the ability of a chosen design to discriminate between effects which are of interest and those which are not. In this section, we apply the same idea to designs which utilise multiple decision criteria, concentrating here on designs in which a CI approach is used to assess the outcome of a study. Of course, there are times when we may wish to use a Bayesian approach to judge the outcome of a trial and we will point out how a Bayesian version can relatively easily be obtained. Once again, we use the conditional probability argument based on Box and Tiao’s (1973) analysis of parameter constraints and apply it to each of the three decision outcomes GO, NOGO and PAUSE to give
$$p(\delta\,|\,\mathrm{GO}) = \frac{P(\mathrm{GO}\,|\,\delta)\,p(\delta)}{P_{GO,U}} = \Phi\!\left(\frac{\delta-\mathrm{MAX}}{\sigma\sqrt{2/n_1}}\right)p(\delta)\Bigg/\,\Phi\!\left(\frac{\delta_0-\mathrm{MAX}}{\sigma\sqrt{2\left(1/n_1+1/n_0\right)}}\right) \quad (9.12)$$

$$p(\delta\,|\,\mathrm{NOGO}) = \frac{P(\mathrm{NOGO}\,|\,\delta)\,p(\delta)}{P_{NOGO,U}} = \Phi\!\left(\frac{\mathrm{MIN}-\delta}{\sigma\sqrt{2/n_1}}\right)p(\delta)\Bigg/\,\Phi\!\left(\frac{\mathrm{MIN}-\delta_0}{\sigma\sqrt{2\left(1/n_1+1/n_0\right)}}\right) \quad (9.13)$$

where $p(\delta)$ denotes the normal prior for δ with mean $\delta_0$ and variance $2\sigma^2/n_0$,
141
Decision Criteria in Proof-of-Concept Trials
and

$$p(\delta\,|\,\mathrm{PAUSE}) = \frac{P(\mathrm{PAUSE}\,|\,\delta)\,p(\delta)}{P_{PAUSE,U}} = \frac{\left[\Phi\!\left(\frac{\delta-\mathrm{MIN}}{\sigma\sqrt{2/n_1}}\right)-\Phi\!\left(\frac{\delta-\mathrm{MAX}}{\sigma\sqrt{2/n_1}}\right)\right]p(\delta)}{\Phi\!\left(\frac{\mathrm{MAX}-\delta_0}{\sigma\sqrt{2\left(1/n_1+1/n_0\right)}}\right)-\Phi\!\left(\frac{\mathrm{MIN}-\delta_0}{\sigma\sqrt{2\left(1/n_1+1/n_0\right)}}\right)} \quad (9.14)$$
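The three conditional posterior densities in (9.12)–(9.14) can be sketched numerically. In the sketch below every setting (σ, $n_0$, $n_1$, $\delta_0$, $\delta_{LRV}$, $\delta_{TV}$, $\alpha_0$, $\alpha_1$) is an illustrative assumption, not the book's Example 9.1 input, and MAX and MIN are taken to be the larger and smaller of the two CI-based thresholds on $\hat{\delta}$.

```python
# Sketch of (9.12)-(9.14) for the known-variance case.
# All numerical settings are illustrative assumptions.
import numpy as np
from scipy.stats import norm

sigma, n0, n1 = 2.0, 5, 40
delta0, d_lrv, d_tv = 2.0, 1.0, 2.5
alpha0, alpha1 = 0.05, 0.20

tau1 = sigma * np.sqrt(2 / n1)                  # sd of delta-hat given delta
prior = norm(delta0, sigma * np.sqrt(2 / n0))   # prior for delta

# CI-based thresholds on delta-hat; GO needs both cleared, NOGO neither
thr0 = d_lrv + norm.ppf(1 - alpha0) * tau1
thr1 = d_tv + norm.ppf(1 - alpha1) * tau1
MAX, MIN = max(thr0, thr1), min(thr0, thr1)

def p_go(d):    return norm.cdf((d - MAX) / tau1)   # P(GO | delta)
def p_nogo(d):  return norm.cdf((MIN - d) / tau1)   # P(NOGO | delta)
def p_pause(d): return 1 - p_go(d) - p_nogo(d)      # P(PAUSE | delta)

grid = np.linspace(delta0 - 10, delta0 + 10, 8001)
dx = grid[1] - grid[0]
cond = {}
for name, f in [("GO", p_go), ("NOGO", p_nogo), ("PAUSE", p_pause)]:
    num = f(grid) * prior.pdf(grid)   # numerator of (9.12)-(9.14)
    p_r = num.sum() * dx              # unconditional decision probability
    cond[name] = (p_r, num / p_r)     # normalised conditional density
```

Mixing the three conditional densities with their weights $P_{R,U}$ recovers the prior, and the GO and NOGO weights match the closed-form denominators of (9.12) and (9.13), which is a convenient numerical check.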
Example 9.1 (Continued) Returning to the assumptions of Example 9.1, we can determine the conditional posterior distributions p(δ | GO), p(δ | NOGO) and p(δ | PAUSE) based on (9.12), (9.13) and (9.14); these are displayed in Figure 9.8. As with the PPCS and PPCF distributions in Section 5.7, there is considerable overlap of these conditional posterior distributions, which is illustrated by the 95% HPD intervals:

NOGO: (−0.193, 2.272)
PAUSE: (0.662, 2.734)
GO: (1.194, 3.901)
FIGURE 9.8 Prior posterior conditional distributions of the treatment effect given any of the decisions NOGO, PAUSE and GO.
TABLE 9.3 Prior Posterior Conditional Probabilities of Exceeding a Treatment Effect Cut-Off Given One of the Three Decisions NOGO, PAUSE or GO

            Treatment Effect Cut-Off
Decision   0.0    0.5    1.0    1.5    2.0    2.5    3.0    3.5    4.0    4.5
NOGO      0.998  0.988  0.944  0.807  0.535  0.229  0.054  0.006  0.000  0.000
PAUSE     1.000  1.000  0.999  0.988  0.903  0.639  0.278  0.062  0.007  0.000
GO        1.000  1.000  1.000  0.999  0.991  0.936  0.761  0.482  0.228  0.081
As we did in Section 5.7, from these distributions we can calculate the prior posterior probability that the true treatment effect exceeds a given pre-determined value; one value of particular interest is the TV. In Table 9.3, we provide the prior posterior conditional probabilities that the true treatment effect is greater than a series of cut-offs, calculated from each of the posterior conditional distributions in Figure 9.8. For true treatment effects less than 0.5 units, the design has little discriminatory power irrespective of the decision; for values in the range 0.5–1.5 units there is little difference between the PAUSE and GO decisions; whilst for values above 1.5, a GO decision will almost certainly be associated with an "active" compound in the sense of Walley et al. (2015).
Walley et al. (2015) suggest that these conditional posterior distributions can be thought of as a continuous analogue of the positive predictive values (PPV) and negative predictive values (NPV) used in the assessment of diagnostic tests. Transferring their idea to our context, the PPV is

$$p(\delta > \delta_{TV}\,|\,\mathrm{GO}) = 0.936$$

and correspondingly the NPV is

$$p(\delta \le \delta_{TV}\,|\,\mathrm{NOGO}) = 0.771,$$

suggesting that the design is better at identifying active treatments than it is at identifying inactive ones. In such cases, it will be important to discuss with the development team whether changes to the parameters of the design, for example by altering either $\alpha_0$ or $\alpha_1$, can improve the design's ability to discriminate between active and inactive compounds. If Bayesian analogues to the prior posterior conditional distributions given in (9.12)–(9.14) are required, all that is needed is to replace the definitions of MAX and MIN with their Bayesian versions given in the previous section.
9.8 Estimated Variance Case

In general, it would not be standard practice to assume that the variance is known. Consequently, CIs will normally be based on the estimated standard deviation and the appropriate t-distribution. In this case the decision criteria, in analogy to (2.20), can be written as
$$\hat{\delta} - t_{1-\alpha_0,\nu}\,s\sqrt{\frac{2}{n_1}} > \delta_{LRV} \;\Longleftrightarrow\; s < \frac{\sqrt{n_1}\,(\hat{\delta}-\delta_{LRV})}{\sqrt{2}\,t_{1-\alpha_0,\nu}} \quad (9.15)$$

$$\hat{\delta} - t_{1-\alpha_1,\nu}\,s\sqrt{\frac{2}{n_1}} > \delta_{TV} \;\Longleftrightarrow\; s < \frac{\sqrt{n_1}\,(\hat{\delta}-\delta_{TV})}{\sqrt{2}\,t_{1-\alpha_1,\nu}} \quad (9.16)$$
In Figure 9.9, the estimated standard deviation is plotted on the vertical axis and the estimated treatment effect on the horizontal axis and the four decision quadrants from Table 9.1 have been expressed in terms of the decision criteria (9.15) and (9.16). Criterion (9.15) is defined by a line starting at
FIGURE 9.9 Decision quadrants expressed in terms of criteria (9.15) and (9.16).
s = 0 and $\hat{\delta} = \delta_{LRV}$ with slope $\sqrt{n_1}/(\sqrt{2}\,t_{1-\alpha_0,\nu})$, and criterion (9.16) is similarly defined by a line starting at s = 0 and $\hat{\delta} = \delta_{TV}$ with slope $\sqrt{n_1}/(\sqrt{2}\,t_{1-\alpha_1,\nu})$. The intersection of these two lines occurs at the point

$$\hat{\delta}^* = \frac{\delta_{TV}\,t_{1-\alpha_0,\nu}-\delta_{LRV}\,t_{1-\alpha_1,\nu}}{t_{1-\alpha_0,\nu}-t_{1-\alpha_1,\nu}},\qquad s^* = \sqrt{\frac{n_1}{2}}\,\frac{\delta_{TV}-\delta_{LRV}}{t_{1-\alpha_0,\nu}-t_{1-\alpha_1,\nu}}.$$
In terms of the regions numbered (1)–(5) in Figure 9.9, the four decision quadrants in Table 9.1 may be defined as
(A) ≡ (3) + (4)
(B) ≡ (5)
(C) ≡ (2)
(D) ≡ (1)
The probability of meeting criterion (9.15) on its own – regions (3) + (4) + (5) – is given by the non-central t-probability
$$P\!\left(s < \frac{\sqrt{n_1}\,(\hat{\delta}-\delta_{LRV})}{\sqrt{2}\,t_{1-\alpha_0,\nu}}\right) = P\!\left(T_{\nu,\lambda_0} > t_{1-\alpha_0,\nu}\right),\qquad \lambda_0 = \sqrt{\frac{n_1}{2}}\,\frac{\delta-\delta_{LRV}}{\sigma} \quad (9.17)$$

where $T_{\nu,\lambda}$ denotes a non-central t-variate with ν degrees of freedom and non-centrality parameter λ.
Similarly, the probability of meeting criterion (9.16) on its own – regions (2) + (3) + (4) – is given by

$$P\!\left(s < \frac{\sqrt{n_1}\,(\hat{\delta}-\delta_{TV})}{\sqrt{2}\,t_{1-\alpha_1,\nu}}\right) = P\!\left(T_{\nu,\lambda_1} > t_{1-\alpha_1,\nu}\right),\qquad \lambda_1 = \sqrt{\frac{n_1}{2}}\,\frac{\delta-\delta_{TV}}{\sigma} \quad (9.18)$$
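Probabilities of the form (9.17) and (9.18) are one-line computations with a non-central t-distribution. The sketch below, under illustrative assumptions for the design inputs, also cross-checks (9.17) by simulating the t-ratio directly.

```python
# (9.17) via scipy's non-central t; all numerical settings are
# illustrative assumptions, not the book's example inputs.
import numpy as np
from scipy.stats import nct, t as t_dist

sigma, n1, delta = 2.0, 40, 1.5
alpha0, d_lrv = 0.05, 1.0
nu = 2 * (n1 - 1)

lam0 = np.sqrt(n1 / 2) * (delta - d_lrv) / sigma   # non-centrality in (9.17)
crit0 = t_dist.ppf(1 - alpha0, nu)                 # t_{1-alpha0,nu}
prob_917 = nct.sf(crit0, nu, lam0)                 # probability (9.17)

# Monte Carlo check: simulate delta-hat and s, then form the t-ratio
rng = np.random.default_rng(0)
d_hat = rng.normal(delta, sigma * np.sqrt(2 / n1), 200_000)
s = sigma * np.sqrt(rng.chisquare(nu, 200_000) / nu)
mc = np.mean(np.sqrt(n1 / 2) * (d_hat - d_lrv) / s > crit0)
```

Replacing $\delta_{LRV}$, $\alpha_0$ with $\delta_{TV}$, $\alpha_1$ gives (9.18) in the same way.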
All that remains is to calculate the probabilities that $(\hat{\delta}, s)$ lies in the regions (4) + (5) and (4). For a fixed $\hat{\delta}$,

$$P\!\left(s < \frac{\sqrt{n_1}\,(\hat{\delta}-\delta_{LRV})}{\sqrt{2}\,t_{1-\alpha_0,\nu}}\,\middle|\,\hat{\delta}\right) = P\!\left(\chi^2_\nu < \frac{\nu n_1(\hat{\delta}-\delta_{LRV})^2}{2\sigma^2 t^2_{1-\alpha_0,\nu}}\,\middle|\,\hat{\delta}\right)$$

and

$$P\!\left(s < \frac{\sqrt{n_1}\,(\hat{\delta}-\delta_{TV})}{\sqrt{2}\,t_{1-\alpha_1,\nu}}\,\middle|\,\hat{\delta}\right) = P\!\left(\chi^2_\nu < \frac{\nu n_1(\hat{\delta}-\delta_{TV})^2}{2\sigma^2 t^2_{1-\alpha_1,\nu}}\,\middle|\,\hat{\delta}\right).$$
These are conditional probabilities, and the associated unconditional probabilities are obtained by integrating them with respect to the sampling distribution of $\hat{\delta} \sim N(\delta, 2\sigma^2/n_1)$ over the appropriate range. The probability of being in the region (4) + (5) is
$$\int_{\delta_{LRV}}^{\hat{\delta}^*} P\!\left(\chi^2_\nu < \frac{\nu n_1(\hat{\delta}-\delta_{LRV})^2}{2\sigma^2 t^2_{1-\alpha_0,\nu}}\,\middle|\,\hat{\delta}\right)\frac{\sqrt{n_1}}{2\sigma\sqrt{\pi}}\exp\!\left(-\frac{n_1(\hat{\delta}-\delta)^2}{4\sigma^2}\right)d\hat{\delta},$$

where the upper limit is the intersection abscissa $\hat{\delta}^* = (\delta_{TV}\,t_{1-\alpha_0,\nu}-\delta_{LRV}\,t_{1-\alpha_1,\nu})/(t_{1-\alpha_0,\nu}-t_{1-\alpha_1,\nu})$;
the probability of being in region (4) is similarly

$$\int_{\delta_{TV}}^{\hat{\delta}^*} P\!\left(\chi^2_\nu < \frac{\nu n_1(\hat{\delta}-\delta_{TV})^2}{2\sigma^2 t^2_{1-\alpha_1,\nu}}\,\middle|\,\hat{\delta}\right)\frac{\sqrt{n_1}}{2\sigma\sqrt{\pi}}\exp\!\left(-\frac{n_1(\hat{\delta}-\delta)^2}{4\sigma^2}\right)d\hat{\delta}.$$
These integrals can be approximated by using the mid-point rule, described in Section 2.2.3, with the appropriate minimum or maximum values defined by the range of integration. Alternatively, they may be approximated by
The probability of being in region (4) + (5) may be approximated by

$$\frac{U-L}{2}\sum_{i=1}^{m}\omega_i\,P\!\left(\chi^2_\nu < \frac{\nu n_1(\hat{\delta}_i-\delta_{LRV})^2}{2\sigma^2 t^2_{1-\alpha_0,\nu}}\right)\frac{\sqrt{n_1}}{2\sigma\sqrt{\pi}}\exp\!\left(-\frac{n_1(\hat{\delta}_i-\delta)^2}{4\sigma^2}\right) \quad (9.19)$$

in which $U = \hat{\delta}^*$, $L = \delta_{LRV}$ and $\hat{\delta}_i = x_i\,\dfrac{U-L}{2}+\dfrac{U+L}{2}$,
and that of region (4) by

$$\frac{U-L}{2}\sum_{i=1}^{m}\omega_i\,P\!\left(\chi^2_\nu < \frac{\nu n_1(\hat{\delta}_i-\delta_{TV})^2}{2\sigma^2 t^2_{1-\alpha_1,\nu}}\right)\frac{\sqrt{n_1}}{2\sigma\sqrt{\pi}}\exp\!\left(-\frac{n_1(\hat{\delta}_i-\delta)^2}{4\sigma^2}\right) \quad (9.20)$$

in which $U = \hat{\delta}^*$, $L = \delta_{TV}$, $\hat{\delta}_i = x_i\,\dfrac{U-L}{2}+\dfrac{U+L}{2}$, and $x_i$, $\omega_i$ are the zeros and associated weights of the $m$th-order Legendre polynomial (Abramowitz and Stegun, 1965) which we referred to in Section 4.5. The probabilities (9.17)–(9.20) can then be used to determine the probabilities of being in the regions (A), (B), (C) and (D):
$$\begin{aligned}
P(A) &= (9.17) - (9.19) + (9.20)\\
P(B) &= (9.19) - (9.20)\\
P(C) &= (9.18) - (9.17) + (9.19) - (9.20)\\
P(D) &= 1 - (9.18) - (9.19) + (9.20)
\end{aligned} \quad (9.21)$$
from which $P_{GO}$, $P_{NOGO}$ and $P_{PAUSE}$ can be calculated. An alternative approach is to simulate data and determine the proportion of simulations that lie in the four regions. Concretely, for a given sample size per arm, $n_1$, variance, $\sigma^2$, and true treatment effect, δ, we simulate

$$\hat{\delta} \sim N\!\left(\delta, \frac{2\sigma^2}{n_1}\right) \quad\text{and}\quad s^2 \sim \frac{\sigma^2\chi^2_\nu}{\nu},\qquad \nu = 2(n_1-1).$$

It is then simple to check in which of the regions A, B, C or D the pair of generated values $(\hat{\delta}, s)$ lies. By repeating the process a sufficiently large number of times, an estimate of the probability of being in each region can be determined.
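The quadrature and simulation routes can be compared directly. The sketch below, under illustrative assumptions (not the inputs of Example 9.2), evaluates (9.17)–(9.21) with a Gauss–Legendre rule and cross-checks the four region probabilities by simulation.

```python
# Region probabilities (9.21) by quadrature and by simulation.
# All numerical settings are illustrative assumptions.
import numpy as np
from numpy.polynomial.legendre import leggauss
from scipy.stats import chi2, nct, norm, t as t_dist

sigma, n1, delta = 2.0, 40, 1.2
d_lrv, d_tv, alpha0, alpha1 = 1.0, 2.5, 0.05, 0.20
nu = 2 * (n1 - 1)
t0, t1 = t_dist.ppf(1 - alpha0, nu), t_dist.ppf(1 - alpha1, nu)
d_star = (d_tv * t0 - d_lrv * t1) / (t0 - t1)   # intersection abscissa

def region_integral(lower, d_ref, t_crit, m=50):
    """Gauss-Legendre version of the (9.19)/(9.20)-type integrals."""
    x, w = leggauss(m)
    dh = x * (d_star - lower) / 2 + (d_star + lower) / 2
    chi_arg = nu * n1 * (dh - d_ref) ** 2 / (2 * sigma**2 * t_crit**2)
    dens = norm.pdf(dh, delta, sigma * np.sqrt(2 / n1))
    return (d_star - lower) / 2 * np.sum(w * chi2.cdf(chi_arg, nu) * dens)

lam = np.sqrt(n1 / 2) / sigma
p17 = nct.sf(t0, nu, lam * (delta - d_lrv))     # (9.17)
p18 = nct.sf(t1, nu, lam * (delta - d_tv))      # (9.18)
p19 = region_integral(d_lrv, d_lrv, t0)         # (9.19)
p20 = region_integral(d_tv, d_tv, t1)           # (9.20)

pA = p17 - p19 + p20                            # (9.21)
pB = p19 - p20
pC = p18 - p17 + p19 - p20
pD = 1 - p18 - p19 + p20

# Simulation cross-check of the four regions
rng = np.random.default_rng(1)
N = 400_000
d_hat = rng.normal(delta, sigma * np.sqrt(2 / n1), N)
s = sigma * np.sqrt(rng.chisquare(nu, N) / nu)
c15 = d_hat > d_lrv + t0 * s * np.sqrt(2 / n1)  # criterion (9.15)
c16 = d_hat > d_tv + t1 * s * np.sqrt(2 / n1)   # criterion (9.16)
sim = [np.mean(c15 & c16), np.mean(c15 & ~c16),
       np.mean(~c15 & c16), np.mean(~c15 & ~c16)]
```

The four quadrature probabilities sum to one by construction, and each should agree with its simulated counterpart to Monte Carlo accuracy.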
Example 9.2 This example continues with the assumptions of Example 9.1. To illustrate the different numerical approaches, suppose for the time being that, as in Example 9.1, the true difference between the new drug and placebo is 1 unit.
TABLE 9.4 Comparison of Numerical Approaches to Calculating the Probabilities of the GO, NOGO and PAUSE Decision Areas

Methods                           A         B        C        D
Mid-point (100,000 pts)         0.094499  0.24321  2.1e−6   0.66229
Legendre Polynomials (50 pts)   0.094501  0.24321  2.7e−9   0.66229
Simulation (×1,000,000)         0.094616  0.24283  0        0.66255
We compare the mid-point method with N_PT = 100,000, the Legendre polynomial approach using a 50th-order polynomial, and a simulation based on 1,000,000 replicates. The results in Table 9.4 demonstrate that the use of Legendre polynomials is as accurate as the mid-point method based on 100,000 points and as 1,000,000 simulations. Based on these results, $P_{GO}$ = 0.095, $P_{NOGO}$ = 0.662 and $P_{PAUSE}$ = 0.243, illustrating that accounting for the uncertainty in the variance in determining the CIs has little impact, at least on average. Again, when designing a study like this, the trial statistician will determine the OC of the trial over a range of assumed true treatment effects. As we did in Section 9.3, we could show the OC of the design in a stacked probability plot similar to Figure 9.2, or as individual probabilities similar to Figure 9.3. In this case, the resulting graphs are so similar to Figures 9.2 and 9.3 that displaying them is superfluous.
9.9 Estimated Variance Case – Generalised Assurance

Each of the probabilities $P_{GO|\delta}$, $P_{NOGO|\delta}$ and $P_{PAUSE|\delta}$ is conditional on the assumed value of δ. If we again acknowledge uncertainty in δ through the prior distribution (2.3), we can determine the expected value of these probabilities, which is a form of generalised assurance. These expected probabilities are expressed through the integrals

$$P_{R,U} = \frac{1}{\sigma}\sqrt{\frac{n_0}{4\pi}}\int_{-\infty}^{\infty} P(R\,|\,\delta)\exp\!\left(-\frac{n_0(\delta-\delta_0)^2}{4\sigma^2}\right)d\delta,\qquad R \in \{\mathrm{GO},\mathrm{PAUSE},\mathrm{NOGO}\} \quad (9.22)$$

The kernel of the integral in (9.22) is a normal density and therefore it can be approximated by

$$P_{R,U} \approx \frac{1}{\sqrt{\pi}}\sum_{i=1}^{N} W_i\,P\!\left(R\,\middle|\,\delta_0+\frac{2\sigma Z_i}{\sqrt{n_0}}\right),\qquad R \in \{\mathrm{GO},\mathrm{PAUSE},\mathrm{NOGO}\} \quad (9.23)$$
where $Z_i$ and $W_i$ are the zeros and weights of the Hermite polynomials (Abramowitz and Stegun, 1965).

Example 9.3 Continuing with Example 9.2, suppose that at the planning stage the development team specifies that its belief about the treatment effect (relative to placebo) can be expressed by a normal distribution centred at 2 units with a variance of 1.6, corresponding to 5 patients per arm ($n_0$ = 5). Then, using (9.23), the unconditional probabilities of the three decisions are

$$P_{GO} = 0.547,\qquad P_{NOGO} = 0.149,\qquad P_{PAUSE} = 0.304.$$
9.10 Discussion

The use of quantitative decision-making, whilst relatively new, is finding ever more support amongst pharmaceutical sponsors. Despite this, most clinical trials continue to use traditional methods for sample-sizing based on a single test for rejecting a null hypothesis and its associated power. This is almost certainly sub-optimal, as the design of a trial is best undertaken using the analyses, or decision criteria, that are intended to be used at the end of the trial. This is no less the case in other contexts. For example, in a Bayesian clinical trial, or an adaptive clinical trial, aspects of the trial design, for example the sample size, are determined from its OC (Grieve, 2016), which in many cases will require the use of simulation. Nonetheless, Pulkstenis et al. (2017) have argued that there are several challenges that need to be overcome before these approaches become standard. These include the potential reluctance of sponsors to put into protocols information from the TPP which may have strategic value to competitors. Additionally, they may worry that the same information might instigate a lengthy discussion with regulators or ethics committees. Pulkstenis et al. (2017) also suggest that operational issues need to be considered. None of these issues is insurmountable, and we can expect to see more studies using these kinds of decision criteria in the future.
10 Surety and Assurance in Estimation
10.1 Introduction

As before, suppose we are planning to run a clinical trial to compare two treatments. We anticipate that the outcome measure can be described by a normal distribution. At the end of the study, we intend to report an estimate of the treatment difference with a 100(1 − α)% CI. Then an obvious question in planning the trial is: what sample size should be chosen to control the width of the CI to a pre-specified size? (Day, 1988; Grieve, 1989, 1990, 1991b; Julious and Patterson, 2004). If, as previously defined, $\hat{\delta}$ is the estimated treatment effect and $s^2$ is the sample estimate of the population variance $\sigma^2$, then the 100(1 − α)% CI for the treatment effect has width $w = 2t_{1-\alpha/2,\nu}\sqrt{2s^2/n_1}$, where $n_1$ is the sample size per arm, $\nu = 2(n_1-1)$ and $t_{1-\alpha/2,\nu}$ is the $1-\alpha/2$ critical value of the t-distribution with ν degrees of freedom. If the required width is $w_0$, Day (1988) suggests that the necessary sample size can be determined by setting w equal to $w_0$ and solving for $n_1$ to give

$$n_1 = \frac{8s^2 t^2_{1-\alpha/2,\nu}}{w_0^2}. \quad (10.1)$$
A concern with this formula is that it involves $s^2$, which is unknown at the planning stage. One approach is to replace $s^2$ by $\sigma^2$, since

$$E(s^2) = \sigma^2. \quad (10.2)$$

This gives
$$n_1 = \frac{8\sigma^2 t^2_{1-\alpha/2,\nu}}{w_0^2}. \quad (10.3)$$

DOI: 10.1201/9781003218531-10
This result is equivalent to equation (2.2) of Julious and Patterson (2004), which becomes clear when we note that they define the problem in terms of the half-width $w^* = w_0/2$. Because the sample size appears on both sides of (10.3), the solution is only available by search or iteration.

Example 10.1 In this example, we modify the hypothetical example used by Day (1988). Suppose now that our prime interest lies in estimating the difference in diastolic blood pressure (DBP) between two treatments, and that the sample size of the study is based on the objective that the 95% CI should have a width of no more than 8 mm Hg. Assuming σ = 10 mm Hg, then for the range of n values 45(1)54, the corresponding expected widths from (10.3) are given in Table 10.1. What this table shows is that a sample size of 50 per arm is necessary to ensure that the width of the 95% CI for the difference in reduction of DBP is no more than 8 mm Hg.

TABLE 10.1 Expected Width of the 95% Confidence Interval for a Range of Sample Sizes

Sample Size Per Group (n)   Width     Sample Size Per Group (n)   Width
45                          8.38      50                          7.94
46                          8.29      51                          7.86
47                          8.19      52                          7.78
48                          8.11      53                          7.70
49                          8.02      54                          7.63
An implication of the result (10.2) is that the sample size can be determined based on the expected width, an idea suggested by a referee of Kupper and Hafner (1989). The expected width of a CI was proposed by Pratt (1961) as a measure of the "desirability" of a CI procedure, being "a measure of the 'average extent' of the false values included". As an alternative measure of "desirability", Pratt (1961) proposed the average probability of the CI including false values and showed that this is the same as the expected width. If L and U are the lower and upper ends of the CI, then Pratt showed that

$$E(U-L) = \int_{-\infty}^{\infty} P(L \le \theta \le U)\,d\theta.$$
In this chapter, we investigate two questions. First, to what extent can we be assured that the CI at the end of the trial will meet the specification? Second, in analogy to the previous work on AP, how can we
utilise pilot or prior information about $\sigma^2$ to support the determination of the appropriate sample size for a prospective study?
10.2 An Alternative to Power in Sample Size Determination

Julious (2004) provides the following alternative equation to solve for $n_1$:

$$\sqrt{\frac{n_1 w_0^2}{8\sigma^2}} = t_{1-\alpha/2,\nu} + t_{0.5,\nu}. \quad (10.4)$$
He points out that (10.4) is equivalent to a superiority sample size calculation in which the type II error is set to 0.5. He goes on to say that "obviously as precision trials are not powered, they cannot have any type II error". Technically, this is true. However, in the context of CIs, whilst we cannot have a high probability of rejecting a null hypothesis (power), we may well want a high probability that the width of the CI is less than the pre-specified value. To distinguish it from power, we will call it "surety".¹ In proposing (10.1), Day (1988) commented that "whatever the difference observed, the precision of that difference can be assured". But in what sense can it be assured? In using (10.3), the sample size is chosen such that on average the CIs will have the given width; this follows from (10.2). In practice, of course, the value of s will not be identical to σ – sometimes it will be smaller, sometimes larger – and consequently the CI will sometimes be wider than the specification and sometimes narrower. Intuitively, because of the average property, CIs based on the sample size in (10.3) will have approximately a 50% chance of being wider than the given width, and therefore the degree of surety provided is not great, a result noted by Harris et al. (1948). Choosing the sample size so that the probability that the width of the CI is less than its prescribed value is large, say 0.8 or 0.9, increases the surety. For a surety of 1 − ψ, $n_1$ is defined by

$$P(w \le w_0) = P\!\left(2t_{1-\alpha/2,\nu}\,s\sqrt{\frac{2}{n_1}} \le w_0\right) = P\!\left(\frac{s^2}{\sigma^2} \le \frac{w_0^2 n_1}{8\sigma^2 t^2_{1-\alpha/2,\nu}}\right) = P\!\left(\chi^2_\nu \le \frac{\nu w_0^2 n_1}{8\sigma^2 t^2_{1-\alpha/2,\nu}}\right) = 1-\psi \quad (10.5)$$
which can only be solved iteratively. Kelley et al. (2003) provide the following algorithm. Suppose at the current iteration, i, the estimate of the sample size is $n_{1,i}$; then the value of $n_1$ at the next iteration is

$$n_{1,i+1} = \frac{8\sigma^2 t^2_{1-\alpha/2,\nu_i}\,\chi^2_{1-\psi,\nu_i}}{\nu_i w_0^2} \quad (10.6)$$

where $\chi^2_{1-\psi,\nu}$ is the 1 − ψ percentile of a chi-square density with ν degrees of freedom and $\nu_i = 2(n_{1,i}-1)$. The starting value is $n_{1,0} = 8\sigma^2 Z^2_{1-\alpha/2}/w_0^2$, which is based on replacing the t-critical value in (10.1) with the corresponding normal critical value. The iterations based on (10.6) continue until the absolute difference between successive iterates is less than some relevant tolerance. The final sample size per arm is rounded up. This idea is not new. Grieve (1989, 1990, 1991b), Kupper and Hafner (1989) and Beal (1989) have all proposed it, and it was also proposed much earlier by both Graybill (1958) and Guenther (1965). The proposal by Graybill (1958) is more general in that he proves a theorem that provides a solution to the problem of determining the sample size n such that the width ω of a CI for a parameter θ has a given probability of being less than w. A single estimate of a normal mean is used to illustrate the approach. The sample sizes per arm in Table 10.2 are derived from (10.5) for ranges of standardised half-width and surety values when a 95% CI is planned. What the entries show is that the requirement to have high surety that the width of the interval meets a pre-specified condition can increase the sample size by
TABLE 10.2 Sample Sizes Per Arm to Give a Standardised Half-Width ($w_0/(2\sigma)$) of the 95% CI with Surety (1 − ψ)

                         Surety (1 − ψ)
w0/(2σ)     0.5     0.8     0.9     0.95    0.99
0.05       3075    3121    3145    3165    3203
0.10        770     793     805     815     833
0.15        343     358     366     373     385
0.20        193     205     211     216     225
0.25        124     134     138     142     149
0.30         87      94      98     102     108
0.35         64      71      74      77      82
0.40         49      55      58      60      65
0.45         39      44      47      49      53
0.50         32      37      39      41      44
small amounts – approximately 4% – if the standardised half-width is narrow, up to nearly 40% if it is wide.
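The iteration (10.6) is straightforward to code. The sketch below takes σ = 1, so that the input is the standardised half-width, and reproduces entries of Table 10.2:

```python
# Iterative surety-based sample size per arm, following (10.6).
import math
from scipy.stats import chi2, norm, t as t_dist

def surety_sample_size(std_half_width, surety, alpha=0.05, tol=1e-6):
    """Kelley et al. (2003) iteration with sigma = 1 and
    w0 = 2 * std_half_width; returns the rounded-up n per arm."""
    w0_sq = (2 * std_half_width) ** 2
    n = 8 * norm.ppf(1 - alpha / 2) ** 2 / w0_sq     # normal starting value
    while True:
        nu = 2 * (n - 1)
        new = (8 * t_dist.ppf(1 - alpha / 2, nu) ** 2
               * chi2.ppf(surety, nu) / (nu * w0_sq))
        if abs(new - n) < tol:
            return math.ceil(new)
        n = new
```

For example, a standardised half-width of 0.4 with surety 0.8 gives 55 per arm, matching Table 10.2.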
10.3 Should the Confidence Interval Width Be the Sole Determinant of Sample Size?

Lehmann (1959) was one of the first to question whether, in the context of CIs, width is everything. The principal point of his argument is that narrow CIs are only of value if they contain the true parameter value. Since the CIs in (10.5) are 100(1 − α)% CIs, there is a high probability that they will contain the true parameter value; so intuitively one might expect that requiring a CI to be of a pre-specified width and to contain the true parameter value would have little impact on the required sample size. Beal (1989) and Grieve (1991b) both investigated the impact of such an approach. The probability (10.5), which we denote by $\phi(w_0)$, can be decomposed into two components as

$$\phi(w_0) = \alpha\,\phi_E(w_0) + (1-\alpha)\,\phi_I(w_0). \quad (10.7)$$
The two components of (10.7), $\phi_E(w_0)$ and $\phi_I(w_0)$, are both conditional probabilities. The former is the conditional probability that the width of the interval is less than $w_0$ given that the interval excludes the true parameter value, whilst the latter is the conditional probability that the width is less than $w_0$ given that it includes the true parameter value. Beal (1989) first calculated the joint probability that the width of the interval is less than $w_0$ and the interval includes the true parameter value, which he denoted by π. The requirement that the CI width is less than $w_0$ can equivalently be written in the form

$$s < \sqrt{\frac{n_1 w_0^2}{8\,t^2_{1-\alpha/2,\nu}}} \quad (10.8)$$

while the requirement that the true parameter value is contained in the 100(1 − α)% CI is equivalent to

$$\delta - s\sqrt{\frac{2}{n_1}}\,t_{1-\alpha/2,\nu} < \hat{\delta} < \delta + s\sqrt{\frac{2}{n_1}}\,t_{1-\alpha/2,\nu}. \quad (10.9)$$
Together, the pair of requirements, (10.8) and (10.9), define a triangle in the $(\hat{\delta}, s)$ space, which is shown in blue in Figure 10.1. To calculate the joint probability, Beal (1989) first determined the probability of (10.9) for a fixed s, and then integrated over the range of s values defined by (10.8), giving the joint probability in the form
$$\pi = \frac{1}{2^{\nu/2}\Gamma(\nu/2)}\int_0^{\nu n_1 w_0^2/(8\sigma^2 t^2_{1-\alpha/2,\nu})}\left[2\Phi\!\left(\sqrt{\frac{y}{\nu}}\,t_{1-\alpha/2,\nu}\right)-1\right]y^{\nu/2-1}\exp\!\left(-\frac{y}{2}\right)dy,$$

where $y = \nu s^2/\sigma^2$.
Following a change of variable and a change in the order of integration, Grieve (1991b) showed that this can be written as
$$\pi = \sqrt{\frac{2}{\pi}}\int_0^{\sqrt{n_1}\,w_0/(2\sqrt{2}\,\sigma)}\left[\frac{1}{2^{\nu/2}\Gamma(\nu/2)}\int_{\nu z^2/t^2_{1-\alpha/2,\nu}}^{\nu n_1 w_0^2/(8\sigma^2 t^2_{1-\alpha/2,\nu})} y^{\nu/2-1}\exp\!\left(-\frac{y}{2}\right)dy\right]\exp\!\left(-\frac{z^2}{2}\right)dz.$$
FIGURE 10.1 Integration regions to define the conditional probability that the width of the CI is less than w0 given that it includes the true parameter value (ϕI(w0)).
Furthermore, Grieve (1991b) showed that this double integral can be approximated by

$$\pi \approx \frac{w_0\sqrt{n_1}}{4\sigma\sqrt{\pi}}\sum_{i=1}^{m}\omega_i\left[X_\nu\!\left(\frac{\nu n_1 w_0^2}{8\sigma^2 t^2_{1-\alpha/2,\nu}}\right)-X_\nu\!\left(\frac{\nu n_1 w_0^2(1+x_i)^2}{32\sigma^2 t^2_{1-\alpha/2,\nu}}\right)\right]\exp\!\left(-\frac{n_1 w_0^2(1+x_i)^2}{64\sigma^2}\right) \quad (10.10)$$
where $X_\nu(\cdot)$ is the CDF of a $\chi^2$ distributed variable with ν degrees of freedom and $x_i$, $\omega_i$ are the zeros and associated weights of the mth-order Legendre polynomials introduced in Section 8.5. Finally, the conditional probability that the width of the CI is less than $w_0$, given that it includes the true parameter value, $\phi_I(w_0)$, is by definition the ratio of the probabilities of being in the blue and green areas in Figure 10.1, which has the value π/(1 − α).

Example 10.2 Using (10.10), Grieve (1991b) showed that the sample sizes required to control $\phi_I(w_0) = \pi/(1-\alpha)$ and $\phi(w_0)$ differ by at most 1. As illustration, we use the same assumptions as Example 10.1; that is, we suppose the targeted half-width is $w_0/(2\sigma) = 0.4$ and α = 0.05. Then, for a range of n values, Table 10.3 gives the probabilities $\phi(w_0)$ and $\phi_I(w_0)$. If we require a surety of 80%, both approaches indicate that a sample size of $n_1$ = 55 per arm is necessary. Similarly, for a surety of 90%, both give rise to a sample size of $n_1$ = 58 per arm. However, for a surety of 95%, controlling $\phi(w_0)$ requires a sample size of $n_1$ = 60 per arm, confirming the result in Table 10.2, but in contrast controlling $\phi_I(w_0) = \pi/(1-\alpha)$ requires a sample size of $n_1$ = 61 per arm. This difference is explained by a very small reduction in the probability of achieving the appropriate width, from 0.950 to 0.949. But of course, for all practical purposes, there is no difference.

TABLE 10.3 ϕ(w0) and ϕI(w0) as a Function of Sample Size for a 95% CI with a Targeted Standardised Half-Width of 0.4

n     ϕ(w0)   ϕI(w0)      n     ϕ(w0)   ϕI(w0)
54    0.772   0.767       60    0.950   0.949
55    0.814   0.810       61    0.964   0.963
56    0.851   0.848       62    0.975   0.974
57    0.884   0.881       63    0.983   0.982
58    0.911   0.908       64    0.988   0.988
59    0.933   0.931       65    0.992   0.992
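Both probabilities in Table 10.3 can be computed without the closed-form approximation (10.10), by integrating the joint width-and-validity probability directly over the $\chi^2$ distribution of $\nu s^2/\sigma^2$. The sketch below (with σ = 1) is a checking device, not the book's implementation:

```python
# phi(w0) from (10.5) and phi_I(w0) = pi/(1-alpha) by direct quadrature.
import numpy as np
from scipy.integrate import quad
from scipy.stats import chi2, norm, t as t_dist

def phi_probs(n, std_half_width=0.4, alpha=0.05):
    """Return (phi(w0), phi_I(w0)) for sigma = 1, w0 = 2*std_half_width."""
    w0_sq = (2 * std_half_width) ** 2
    nu = 2 * (n - 1)
    tc = t_dist.ppf(1 - alpha / 2, nu)
    y_lim = nu * n * w0_sq / (8 * tc**2)      # width requirement on y = nu s^2
    phi = chi2.cdf(y_lim, nu)                 # (10.5)
    # validity given s: P(|delta-hat - delta| < s t sqrt(2/n)) = 2 Phi(s t) - 1
    joint = quad(lambda y: (2 * norm.cdf(np.sqrt(y / nu) * tc) - 1)
                 * chi2.pdf(y, nu), 0, y_lim)[0]   # pi = P(width and validity)
    return phi, joint / (1 - alpha)                # phi_I = pi/(1-alpha)

phi60, phi_i60 = phi_probs(60)
```

At n = 60 this reproduces the 0.950 versus 0.949 comparison discussed in Example 10.2.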
Daly (1991) has argued that "in planning any investigation the question of power to detect the smallest clinically worthwhile difference must predominate over that of precision". He argues that unless power is the most important factor in determining sample size, studies may be too small to achieve meaningful results, which has been known to be a serious problem for over 25 years. Jiroutek et al. (2003) have investigated alternative approaches more formally. They proceed by defining the events rejection (R), validity (V) and width (W). The event rejection occurs if the CI excludes the null parameter value (null hypothesis); the event validity occurs if the CI includes the true parameter value; the event width occurs if the width of the CI is less than a pre-specified value. With these definitions, (10.5) is of the form P(W) > 1 − ψ, while Beal's (1989) conditional approach is of the form P(W|V) > 1 − ψ. Jiroutek et al. (2003) suggest that it is more appropriate to determine the sample size by ensuring that P(W∩R|V) > 1 − ψ; in other words, by controlling the probability that the width is less than a pre-specified value and the null hypothesis is rejected, given that the CI contains the true parameter value. The basis of their argument is that because scientists often both test hypotheses and construct interval estimates, an experiment's sample size should be determined to address both of these objectives. In truth, as Jiroutek et al. (2003) argue, the appropriate method of determining the sample size should rightly depend upon the objective of the study. If the prime objective is to test a null hypothesis, then P(R) should be used; if pure estimation is of interest, then it is preferable to use P(W); if, however, estimation and hypothesis testing are of equal importance, then P(W∩R|V) may well be a more appropriate measure to use.
10.4 Unconditional Sample Sizing Based on CI Width

The probability $\phi(w_0)$ is a conditional probability: conditional on the value of $\sigma^2$, $\phi(w_0)$ is the probability that the width of the CI is less than the pre-specified value $w_0$. If $p(\sigma^2)$ is the prior distribution of $\sigma^2$, then in analogy to average power and assurance, the corresponding unconditional probability $\phi_U(w_0)$ is the expected value of $\phi(w_0)$ with respect to $p(\sigma^2)$:

$$\phi_U(w_0) = \int_{\sigma^2} \phi(w_0)\,p(\sigma^2)\,d\sigma^2 \quad (10.11)$$
If similar studies have been conducted, the prior distribution p(σ2) can be taken to be the posterior distribution based on these studies, and with standard assumptions, this posterior will have an inverse χ2 density
$$p(\sigma^2) = \frac{\left(\nu_0 s_0^2/2\right)^{\nu_0/2}}{\Gamma(\nu_0/2)}\,(\sigma^2)^{-(\nu_0/2+1)}\exp\!\left(-\frac{\nu_0 s_0^2}{2\sigma^2}\right) \quad (10.12)$$
Rewriting (10.5), the unconditional probability defined in (10.11) can be written as

$$\phi_U(w_0) = \int_{\sigma^2}\left[\int_0^{\sqrt{n_1}\,w_0/(2\sqrt{2}\,t_{1-\alpha/2,\nu})}\frac{2(\nu/2)^{\nu/2}}{\Gamma(\nu/2)\,\sigma^\nu}\,s^{\nu-1}\exp\!\left(-\frac{\nu s^2}{2\sigma^2}\right)ds\right]p(\sigma^2)\,d\sigma^2. \quad (10.13)$$
Substituting (10.12) in (10.13), integrating out $\sigma^2$ and rearranging gives

$$\phi_U(w_0) = P\!\left(F_{\nu,\nu_0} \le \frac{n_1 w_0^2}{8 s_0^2\,t^2_{1-\alpha/2,\nu}}\right) \quad (10.14)$$

where $F_{\nu,\nu_0}$ is an F-variate with ν and $\nu_0$ degrees of freedom, given in (4.17). As the amount of a priori information that we have about $\sigma^2$ increases, that is as $\nu_0 \to \infty$ (with $s_0^2 \to \sigma^2$), Grieve (1991b) has shown that

$$P\!\left(F_{\nu,\nu_0} \le \frac{n_1 w_0^2}{8 s_0^2\,t^2_{1-\alpha/2,\nu}}\right) \to P\!\left(\chi^2_\nu \le \frac{\nu n_1 w_0^2}{8\sigma^2 t^2_{1-\alpha/2,\nu}}\right)$$
which is precisely (10.5). Following the approach of Kelley et al. (2003), an iterative solution to (10.14) can be derived. As before, suppose at the current iteration, i, the estimate of the sample size is $n_{1,i}$; then the value of $n_1$ at the next iteration is

$$n_{1,i+1} = \frac{8 s_0^2\,t^2_{1-\alpha/2,\nu_i}\,F_{1-\psi,\nu_i,\nu_0}}{w_0^2} \quad (10.15)$$

in which $F_{1-\psi,\nu,\nu_0}$ is the 1 − ψ percentile of an F-density with ν and $\nu_0$ degrees of freedom. Cook (2010) provides an algorithm for determining the parameters of a two-parameter gamma distribution based on two quantiles. If $s_1^2$ and $s_2^2$ are specified as the $p_1$ and $p_2$ ($p_1 < p_2$) quantiles of an inverse chi-square distribution for $\sigma^2$ with degrees of freedom $\nu_0$ and variance parameter $s_0^2$, then, since $1/(2\sigma^2)$ has a gamma distribution, Cook's algorithm can be used to determine the parameters of the inverse-$\chi^2$ distribution for $\sigma^2$.
10.4.1 Modified Cook Algorithm

a. Let $x_1 = 1/(2s_2^2)$ and $x_2 = 1/(2s_1^2)$, and set $\bar{p}_i = 1 - p_i$.
b. Solve for α the equation

$$\frac{G^{-1}(\bar{p}_1, \alpha, 1)}{G^{-1}(\bar{p}_2, \alpha, 1)} = \frac{x_2}{x_1}$$

where $G^{-1}(p, \alpha, \beta)$ is the inverse of the gamma distribution function with shape α and rate β. Cook (2010) proposes a bisection method; we have used a Newton–Raphson algorithm with numerical derivatives.
c. Determine β from

$$\beta = \frac{G^{-1}(\bar{p}_2, \alpha, 1)}{x_1}.$$

d. Calculate $\nu_0 = 2\alpha$ and $s_0^2 = \beta/\nu_0$.

It is interesting to note that Harris et al. (1948) proposed an alternative approach to determine $\nu_0$ and $s_0$. Their suggestion is for "the experimenter to place what he feels to be reasonable upper and lower limits on the standard deviation" with associated quantiles, essentially $s_1$, $s_2$, $p_1$ and $p_2$.

10.4.2 Harris et al. (1948) Algorithm
1. Set $s_0 = \sqrt{s_1 s_2}$;
2. interpolate for $s_2/s_1$ in a table of $\sqrt{\chi^2_{p_2,\nu_0}/\chi^2_{p_1,\nu_0}}$ for a range of $\nu_0$ values.
Example 10.3 Continuing with the assumptions of Examples 10.1 and 10.2, in which $w_0$ = 8 and α = 0.05, we suppose that an elicitation process conducted with the project development team has established that the team believes there is a 70% chance that the standard deviation lies between 8 and 13, that is, P(8 < σ < 13) = 0.7.
For the purposes of illustration, we assume that there are equal probabilities in the two tails, although this restriction is not strictly necessary. With the inputs above, Cook's algorithm delivers α = 4.89, β = 912.01, ν0 = 9.77 and s0 = 9.661 for the parameters of the inverse χ² distribution. Then, using (10.15), we calculate that a sample size of 118 patients per arm is required to give a surety of 95% that the width of the CI for the treatment difference is less than 8 mm Hg.
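Example 10.3 can be sketched end to end: first the modified Cook algorithm (as reconstructed above, with a bracketing root-finder in place of the bisection/Newton step), then the iteration (10.15). The book reports α = 4.89, ν0 = 9.77, s0 = 9.661 and n1 = 118; rather than asserting those rounded values, the sketch checks itself for internal consistency.

```python
# Reproduce Example 10.3: elicit the sigma^2 prior, then iterate (10.15).
import math
from scipy.optimize import brentq
from scipy.stats import f as f_dist, gamma, t as t_dist

# Elicited prior: P(8 < sigma < 13) = 0.7 with equal 0.15 tails
s1_sq, s2_sq, p1, p2 = 8.0**2, 13.0**2, 0.15, 0.85
x1, x2 = 1 / (2 * s2_sq), 1 / (2 * s1_sq)
pb1, pb2 = 1 - p1, 1 - p2                       # pbar_i = 1 - p_i

# Step b: shape alpha matching the gamma(alpha,1) quantile ratio to x2/x1
ratio = lambda a: gamma.ppf(pb1, a) / gamma.ppf(pb2, a) - x2 / x1
alpha_sh = brentq(ratio, 0.1, 100.0)
beta = gamma.ppf(pb2, alpha_sh) / x1            # step c (rate parameterisation)
nu0, s0_sq = 2 * alpha_sh, beta / (2 * alpha_sh)  # step d

def surety_n(w0, surety=0.95, alpha=0.05, tol=1e-6):
    """Iteration (10.15); returns the rounded-up n per arm."""
    n = 8 * s0_sq * 4 / w0**2                   # crude start (z^2 ~ 4)
    while True:
        nu = 2 * (n - 1)
        new = (8 * s0_sq * t_dist.ppf(1 - alpha / 2, nu) ** 2
               * f_dist.ppf(surety, nu, nu0) / w0**2)
        if abs(new - n) < tol:
            return math.ceil(new)
        n = new

n_req = surety_n(8.0)
```

The self-checks confirm that the fitted prior reproduces the elicited tail probabilities exactly, and that `n_req` is the smallest n satisfying the 95% surety condition (10.14).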
10.5 A Fiducial Interpretation of (10.14)

The result (10.14) is not just a Bayesian result; it was first proposed over 70 years ago. In 1946, a query appeared in what was then called the Biometrics Bulletin. The query read:

QUERY: I have a sample of 35 observations with mean, 24, and standard deviation, 6. I calculated the half interval, $st_{0.05}/\sqrt{n} = 2.061$, so that I cannot be reasonably certain of being within less than 8.6% of the population mean. What sample size do I need to be assured that my sample mean will be within 1.2 points of the population mean, 1.2 being 5% of the present sample mean?
A. M. Mood and George W. Snedecor provided the answer to the query. The first paragraph of their answer reads:

ANSWER: The formula for the prospective sample size is $N = (s/d)^2\,t^2 F$, where d is the desired half confidence interval, 1.2, t is at any specified level (0.05, say) with N − 1 degrees of freedom, and F has the degrees of freedom $n_1 = N - 1$ and $n_2 = 35 - 1 = 34$. F may be set at the tabulated point which gives you satisfactory assurance that your proposed sampling will be successful.
Mood and Snedecor (1946) here use "assurance" where we use "surety". While Mood and Snedecor do not give their derivation of the above formula, it is likely that they proceeded along the fiducial path proposed by Fisher (1935). Suppose there exists an initial estimate $s^2$ of the population variance $\sigma^2$ based on $n_2$ degrees of freedom, and that it is proposed to take a second sample of size N which will give rise to an independent estimate $s'^2$. Now, since the first estimate is distributed as $\sigma^2\chi^2_{n_2}/n_2$ independently of the second, which is distributed as $\sigma^2\chi^2_{n_1}/n_1$ (where $n_1 = N - 1$), it follows that $s'^2/s^2 \sim F_{n_1,n_2}$, so that $s'^2 \sim s^2 F_{n_1,n_2}$. This last result implies that

$$P\!\left(\frac{s' t}{\sqrt{N}} \le d\right) = P\!\left(\frac{s\sqrt{F_{n_1,n_2}}\;t}{\sqrt{N}} \le d\right),$$

which leads directly to the formula given by Mood and Snedecor. The role of
2
160
Hybrid Frequentist and Bayesian Power in Planning Clinical Trials
Fn , n is to provide the fiducial pivot between the distribution of the known s2 and the unknown s′2. Harris et al. (1948) used a slightly different approach. First, remember that the objective is to have a high probability that the half-width of the CI is less than d with high probability, say β, so that P s d n / t . Second, since 1
2
for any n2 it is known that P s Fn1 , n 2 s . Equating the right-hand sides of these two inequalities again leads to the Mood and Snedecor formula. Whilst this result was developed for a CI for an estimate from a single arm, the same process leads directly to (10.13) in the two-arm case, a result which was first presented by Harris et al. (1948); see also Ryan (2013). There is a strong link between the use of fiducial pivots and Bayesian predictive methods based on uninformative priors. For example, it was noted in Section 6.2 that Armitage (1988, 1989) could derive some of the predictive Bayesian results in Spiegelhalter and Freedman (1986) essentially from a pivotal perspective. Beal (1991) has expressed concern about a perceived conflict between the Bayesian use of a prior and the intention to analyse the experiment in a frequentist fashion. Beal recommended that one determine the sample size for several different values of σ2 and then choose one that is “somewhat” conservative. However, the advantage of the hybrid frequentist/Bayesian approach is not restricted to use by Bayesians since Fisher (1960) has shown how data from two normal samples may be combined to give the standard Bayesian result, but from a fiducial perspective. In this context, therefore, namely, one in which a prior sample is available, the Bayesian and fiducial approaches are interchangeable. The Bayesian philosophy does, however, retain the advantage that even in those cases for which a prior sample is not available a similar strategy may be pursued by assessing, subjectively, the parameters of an inverse χ2prior for σ2 as in Section 10.4. Of course, it could still be argued that Beal’s pragmatic solution of using several values for σ2and choosing the one, which is “somewhat” conservative, is sufficient. 
The definition of “somewhat” must in a sense contain an assessment of the likelihood of a particular σ2 being the right one, and that is of course precisely what the Bayesian or fiducial methods are doing. The calculus of probabilities automatically provides a weighting of the variances without having to subjectively decide on how conservative a solution is.
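The iterative use of a Mood–Snedecor-type formula can be sketched numerically. The sketch below is illustrative and not code from the book; the function name and the degrees-of-freedom conventions (t on n − 1, F on (n − 1, n₁ − 1), where n₁ is the size of the prior sample that produced s²) are assumptions made for this sketch.

```python
# Illustrative sketch (not from the book): iterate the Mood-Snedecor-type
# sample-size formula n = t^2 * F * s^2 / d^2, where s^2 comes from a prior
# sample of size n1. The df conventions (t on n-1, F on (n-1, n1-1)) are
# assumptions made for this sketch.
import math
from scipy import stats

def ci_sample_size(s2, n1, d, alpha=0.05, beta=0.80):
    """Smallest n with P(CI half-width <= d) roughly equal to beta."""
    # start from the known-variance (normal) approximation
    n = max(3, math.ceil(stats.norm.ppf(1 - alpha / 2) ** 2 * s2 / d ** 2))
    for _ in range(100):                      # fixed-point iteration
        t = stats.t.ppf(1 - alpha / 2, n - 1)
        F = stats.f.ppf(beta, n - 1, n1 - 1)
        n_new = math.ceil(t ** 2 * F * s2 / d ** 2)
        if n_new == n:
            break
        n = n_new
    return n

# inflating for variance uncertainty gives more than the naive normal-theory n
print(ci_sample_size(s2=4.0, n1=20, d=1.0))
```

Raising β (the required probability that the half-width criterion is met) inflates the F quantile and hence the sample size, which is exactly the weighting of plausible variances that the fiducial and Bayesian arguments formalise.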
Note 1: In previous work (Grieve 1989, 1990 and 1991b) this probability was called “assurance”. For obvious reasons this would cause confusion if we used the term here.
References

Abramowitz M, Stegun IA. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. New York: Dover, 1965. Acuna SA, Chesney TR, Amarasekera ST, Baxter NN. Defining non-inferiority margins for quality of surgical resection for rectal cancer: a Delphi consensus study. Annals of Surgical Oncology 2018; 25:371–378. Alderson NE, Campbell G, D’Agostino R, Ellenberg S, Lindborg S, O’Neill R, Rubin D, Siegel J. Statistical issues: a roundtable discussion. Clinical Trials 2005; 2:364–372. Altham PME. Exact Bayesian analysis of a 2x2 contingency table, and Fisher’s “exact” significance test. Journal of the Royal Statistical Society Series B 1969; 31:261–269. Armitage P. Some aspects of Phase-III trials. In Clinical Trials and Related Topics, T Okuno (ed). Excerpta Medica: Amsterdam, 1988. Armitage P. Inference and decision in clinical trials. Journal of Clinical Epidemiology 1989; 42:293–299. Bartky W. Multiple sampling with constant probability. The Annals of Mathematical Statistics 1943; 14:363–377. Bauer P, Bauer MM. Testing equivalence simultaneously for location and dispersion of two normally distributed populations. Biometrical Journal 1994; 36:643–660. Bauer P, Köhne K. Evaluation of experiments with adaptive interim analyses. Biometrics 1994; 50:1029–1041. Beal SL. Sample size determination for confidence intervals on the population mean and on the difference between two population means. Biometrics 1989; 45:969–977. Beal SL. Discussion of Grieve (1991). Biometrics 1991; 47:1602–1603. Bernardo JM, Smith AFM. Bayesian Theory. Chichester: Wiley, 1994. Besag J. A candidate’s formula: a curious result in Bayesian prediction. Biometrika 1989; 76:183. Blackwelder WC. “Proving the Null Hypothesis” in clinical trials. Controlled Clinical Trials 1982; 3:345–353. Box GEP. Sampling and Bayes’ inference in scientific modelling and robustness. Journal of the Royal Statistical Society Series A 1980; 143:383–404. Box GEP, Tiao GC.
Bayesian Inference in Statistical Analysis. Reading, MA: Addison-Wesley, 1973. Breder CD, Du W, Tyndall A. What’s the regulatory value of a target product profile? Trends in Biotechnology 2017; 35:576–579. Broglio KR, Stivers DN, Berry DA. Predicting clinical trial results based on announcements of interim analyses. Trials 2014; 15:1–8. Brown BW, Herson J, Atkinson EN, Rozel ME. Projection from previous studies: a Bayesian and frequentist compromise. Controlled Clinical Trials 1987; 8:29–44. Browne RH. On the use of a pilot sample for sample size determination. Statistics in Medicine 1995; 14:1933–1940.
Campbell MJ. Doing clinical trials large enough to achieve adequate reductions in uncertainties about treatment effects. Journal of the Royal Society of Medicine 2013; 106:68–71. Carroll KJ. Decision making from phase II to phase III and the probability of success: reassured by “assurance”. Journal of Biopharmaceutical Statistics 2013; 23:1188–1200. Chen D-G, Ho S. From statistical power to statistical assurance: it’s time for a paradigm change in clinical trial design. Communications in Statistics – Simulation and Computation 2017; 46:7957–7971. Chuang-Stein C. Sample size and the probability of a successful trial. Pharmaceutical Statistics 2006; 5:305–309. Chuang-Stein C, Kirby S, Hirsch I, Atkinson G. The role of the minimum clinically important difference and its impact on designing a trial. Pharmaceutical Statistics 2011a; 10:250–256. Chuang-Stein C, Kirby S, French J, Kowalski K, Marshall S, Smith MK, Bycott P, Beltangady M. A quantitative approach for making go/no-go decisions in drug development. Drug Information Journal 2011b; 45:187–202. Ciarleglio MM, Arendt CD. Sample size determination for a binary response in a superiority clinical trial using a hybrid classical and Bayesian procedure. Trials 2017; 18:1–21. Cohen J. The statistical power of abnormal-social psychological research: a review. Journal of Abnormal and Social Psychology 1962; 65:145–153. Cohen J. Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press, 1969. Cohen J. A power primer. Psychological Bulletin 1992; 112:155–159. Cook JD. 2010. Determining distribution parameters from quantiles. (https://www.johndcook.com/quantiles_parameters.pdf, last accessed 23 June 2021). Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society Series B 1972; 34:187–220. Crisp A, Miller S, Thompson D, Best N. Practical experiences of adopting assurance as a quantitative framework to support decision making in drug development.
Pharmaceutical Statistics 2018; 17:317–328. Crook JF, Good IJ. The powers and strengths of tests for multinomials and contingency tables. Journal of the American Statistical Association 1982; 77:793–802. Dallal S, Hall W. Approximating priors by mixtures of natural conjugate priors. Journal of the Royal Statistical Society, Series B 1983; 45:278–286. Dallow N, Best NG, Montague T. Better decision making in drug development through adoption of formal prior elicitation. Pharmaceutical Statistics 2018; 17:301–316. Dallow N, Fina P. The perils with the misuse of predictive power. Pharmaceutical Statistics 2010; 10:310–317. Daly LE. Confidence intervals and sample sizes: don’t throw out all your old sample size tables. British Medical Journal 1991; 302:333–336. Dawid AP. Discussion of Racine et al. Applied Statistics 1986; 35:132. Day SJ. Letter to the editor. Lancet 1988; ii:1427. Diaconis P, Ylvisaker D. Quantifying prior opinion. In Bayesian Statistics 2, JM Bernardo, MH DeGroot, DV Lindley, AFM Smith (eds). Elsevier: Netherlands, 1985, 133–156. Dodge HF, Romig HG. A method of sampling inspection. Bell System Technical Journal 1929; 8:613–631.
Du H, Wang L. A Bayesian power analysis procedure considering uncertainty in effect size estimates from a meta-analysis. Multivariate Behavioral Research 2016; 51:589–605. Eaton ML, Muirhead RJ, Soaita AI. On the limiting behavior of the “probability of claiming superiority” in a Bayesian context. Bayesian Analysis 2013; 8:221–232. Fayers PM, Cuschieri A, Fielding J, Craven J, Uscinska B, Freedman LS. Sample size calculation for clinical trials: the impact of clinician beliefs. British Journal of Cancer 2000; 82:213–219. Fienberg SE. The early statistical years: 1947–1967. A conversation with Howard Raiffa. Statistical Science 2008; 23:136–149. Finney DJ. The median lethal dose and its estimation. Archives of Toxicology 1985; 56:215–218. Fisch R, Jones I, Jones J, Kerman J, Rosenkranz GK, Schmidli H. Bayesian design of proof-of-concept trials. Therapeutic Innovation & Regulatory Science 2015; 49:155–162. Fisher RA. The fiducial argument in statistical inference. Annals of Eugenics 1935; 6:391–398. Fisher RA. Statistical Methods, Experimental Design, and Scientific Inference. Oxford: Oxford University Press, 1990, 126–127. Fisher RA. On some extensions of Bayesian inference proposed by Mr Lindley. Journal of the Royal Statistical Society Series B 1960; 22:299–301. Fisher Box J. R. A. Fisher, the Life of a Scientist. New York: John Wiley & Sons, 1978. Flühler H, Grieve AP, Mandallaz D, Mau J, Moser HA. Bayesian assessment of bioequivalence: an example. Journal of Pharmaceutical Sciences 1983; 72:1178–1181. Food and Drug Administration. Guidance for Industry: Non-Inferiority Clinical Trials to Establish Effectiveness; 2016. (https://www.fda.gov/media/78504/download, last accessed 23 June 2021). Food and Drug Administration. Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials; 2010. (https://www.fda.gov/media/71512/download, last accessed 23 June 2021). Frei A, Cottier P, Wunderlich P, Lüdin E.
Glycerol and dextran combined in the therapy of acute stroke. Stroke 1987; 18:373–379. Freiman JA, Chalmers TC, Smith H, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial – survey of 71 “negative” trials. New England Journal of Medicine 1978; 299:690–694. Frewer P, Mitchell P, Watkins C, Matcham J. Decision-making in early clinical drug development. Pharmaceutical Statistics 2016; 15:255–263. Gamalo MA, Tiwari RC, LaVange LM. Bayesian approach to the design and analysis of non-inferiority trials for anti-infective products. Pharmaceutical Statistics 2014; 13:25–40. Gamalo-Siebers M, Gao A, Lakshminarayanan M, Liu G, Natanegara F, Railkar R, Schmidli H, Song G. Bayesian methods for the design and analysis of noninferiority trials. Journal of Biopharmaceutical Statistics 2016; 26:823–841. Gamalo MA, Wu R, Tiwari RC. Bayesian approach to non-inferiority trials for normal means. Statistical Methods in Medical Research 2016; 25:221–240. Ghosh BK. A brief history of sequential analysis. In Handbook of Sequential Analysis, BK Ghosh, PK Sen (eds). Marcel Dekker: New York, 1991, 1–19.
Good IJ. The Bayes/non-Bayes compromise: a brief review. Journal of the American Statistical Association 1992; 87:597–606. Good IJ, Crook JF. The Bayes/non-Bayes compromise and the multinomial distribution. Journal of the American Statistical Association 1974; 69:711–720. Good IJ, Gaskins A. The centroid method of numerical integration. Numerische Mathematik 1971; 16:343–359. Graybill FA. Determining sample size for a specified width confidence interval. The Annals of Mathematical Statistics 1958; 29:282–287. Grieve AP. Discussion of Piegorsch and Gladen (1986). Technometrics 1987; 29:504–505. Grieve AP. Some uses of predictive distributions in pharmaceutical research. In Biometry - Clinical Trials and Related Topics, T Okuno (ed). Elsevier Science Publishers B.V.: Amsterdam, 1988a, 83–99. Grieve AP. A Bayesian approach to the analysis of LD50 experiments. In Bayesian Statistics 3, JM Bernardo, MH DeGroot, DV Lindley, AFM Smith (eds). Oxford University Press: Oxford, 1988b, 617–630. Grieve AP. Letter to the editor. Lancet 1989; i:337. Grieve AP. Letter to the editor. The American Statistician 1990; 44:190. Grieve AP. Predictive probability in clinical trials. Biometrics 1991a; 47:323–330. Grieve AP. Confidence intervals and sample sizes. Biometrics 1991b; 47:1597–1602. Grieve AP. Discussion of Spiegelhalter et al (1994). Journal of the Royal Statistical Society Series A 1994; 157:387–388. Grieve AP. Joint equivalence of means and variances of two populations. Journal of Biopharmaceutical Statistics 1998; 8:377–390. Grieve AP. How to test hypotheses if you must. Pharmaceutical Statistics 2015; 14:139–150. Grieve AP. Idle thoughts of a “well-calibrated” Bayesian in clinical drug development. Pharmaceutical Statistics 2016; 15:96–108. Grouin JM, Coste M, Bunouf P, Lecoutre B. Bayesian sample size determination in non-sequential clinical trials: statistical aspects and some regulatory considerations. Statistics in Medicine 2007; 26:4914–4924. Guenther WC. 
Concepts of Statistical Inference. New York: McGraw-Hill, 1965. Guttman I. Statistical Tolerance Regions: Classical and Bayesian. London: Griffin, 1970. Guttman I. A Bayesian analogue of Paulson’s lemma and its use in tolerance region construction when sampling from the multi-variate normal. Annals of the Institute of Statistical Mathematics 1971; 23:67–76. Harris M, Horvitz DG, Mood AM. On the determination of sample sizes in designing experiments. Journal of the American Statistical Association 1948; 43:391–402. Halpern SD, Karlawish JHT, Berlin JA. The continuing unethical conduct of underpowered clinical trials. Journal of the American Medical Association 2002; 288:358–362. Hay M, Thomas DW, Craighead JL, Economides C, Rosenthal J. Clinical development success rates for investigational drugs. Nature Biotechnology 2014; 32:40–51. Holmgren EB. Establishing equivalence by showing that a specified percentage of the effect of the active control over placebo is maintained. Journal of Biopharmaceutical Statistics 1999; 9:651–659. Hung HJ, Wang SJ, O’Neill R. A regulatory perspective on choice of margin and statistical inference issue in non-inferiority trials. Biometrical Journal 2005; 47:28–36.
Ibrahim JG, Chen MH, Lakshminarayanan M, Liu GF, Heyse JF. Bayesian probability of success for clinical trials using historical data. Statistics in Medicine 2015; 34:249–264. International Conference on Harmonisation. E9: Statistical Principles for Clinical Trials, 1996. (www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500002928.pdf, last accessed 23 June 2021). Jeffreys H. Theory of Probability, 2nd Edition. Oxford: At the Clarendon Press, 1948. Jiang K. Optimal sample sizes and go/no-go decisions for phase II/III development programs based on probability of success. Statistics in Biopharmaceutical Research 2011; 3:463–475. Jiroutek MR, Muller KE, Kupper LL, Stewart PW. A new method for choosing sample size for confidence interval-based inferences. Biometrics 2003; 59:580–590. Julious SA, Patterson SD. Sample sizes for estimation in clinical research. Pharmaceutical Statistics 2004; 3:213–215. Julious SA. Tutorial in biostatistics: sample sizes for clinical trials with Normal data. Statistics in Medicine 2004; 23:1921–1986. Kelley K, Maxwell SE, Rausch JR. Obtaining power or obtaining precision: delineating methods of sample-size planning. Evaluation & the Health Professions 2003; 26:258–287. Kieser M, Wassmer G. On the use of the upper confidence limit for the variance from a pilot sample for sample size determination. Biometrical Journal 1996; 38:941–949. Kola I, Landis J. Can the pharmaceutical industry reduce attrition rates? Nature Reviews Drug Discovery 2004; 3:711–715. Kunzmann K, Grayling MJ, Lee KM, Robertson DS, Rufibach K, Wason J. A review of Bayesian perspectives on sample size derivation for confirmatory trials. The American Statistician 2021; 75:424–432. Kupper LL, Hafner KB. How appropriate are popular sample size formulas? The American Statistician 1989; 43:101–105. Lachin JM. Sample size determinations for r×c comparative trials. Biometrics 1977; 33:315–324.
Lalonde RL, Kowalski KG, Hutmacher MM, Ewy W, Nichols DJ, Milligan PA, Corrigan BW, Lockwood PA, Marshall SA, Benincosa LJ, Tensfeldt TG. Model-based drug development. Clinical Pharmacology & Therapeutics 2007; 82:21–32. Lan KKG, Hu P, Proschan M. A conditional power approach to the evaluation of predictive power. Statistics in Biopharmaceutical Research 2009; 1:131–136. Lan KKG, Wittes JT. Some thoughts on sample size: a Bayesian-frequentist hybrid approach. Clinical Trials 2013; 9:561–569. Lancaster HO. Significance tests in discrete distributions. Journal of the American Statistical Association 1961; 56:223–234. Lecoutre B. L’Analyse Bayésienne des Comparaisons. Lille: Presses Universitaires de Lille, 1984. Lecoutre B. Two useful distributions for Bayesian predictive procedures under normal models. Journal of Statistical Planning and Inference 1999; 79:93–105. Lecoutre B, Poitevineau J. Le PAC Version 2.20, 2020. (https://eris62.eu/Eris.html, last accessed 23 June 2021). Lehmacher W, Wassmer G. Adaptive sample size calculations in group sequential trials. Biometrics 1999; 55:1286–1290.
Lehmann EL. Testing Statistical Hypotheses. New York: Wiley, 1959. Lim J, Walley R, Yuan J, Liu J, Dabral A, Best N, Grieve A, Hampson L, Wolfram J, Woodward P, Yong F. Minimizing patient burden through the use of historical subject-level data in innovative confirmatory clinical trials: review of methods and opportunities. Therapeutic Innovation & Regulatory Science 2018; 52:546–559. Liu H, Wang ZG, Fu SY, Li AJ, Pan ZY, Zhou WP, Lau WY, Wu MC. Randomized clinical trial of chemoembolization plus radiofrequency ablation versus partial hepatectomy for hepatocellular carcinoma within the Milan criteria. British Journal of Surgery 2016; 103(4):348–356. Lindley DV, Smith AFM. Bayes estimates for the linear model. Journal of the Royal Statistical Society Series B 1972; 34:1–41. Mandallaz D, Mau J. Comparison of different methods for decision-making in bioequivalence assessment. Biometrics 1981; 37:213–222. Marshall G, Blacklock JWS, Cameron C, Capon NB, Cruickshank R, Gaddum JH, Heaf FRG, Hill AB, Houghton LE, Hoyle JC, et al. Streptomycin treatment of pulmonary tuberculosis: a Medical Research Council investigation. British Medical Journal 1948; ii:769–782. Martz HF, Waller RA. Bayesian Reliability Analysis. New York: John Wiley & Sons, 1982. McShane BB, Böckenholt U. Planning sample sizes when effect sizes are uncertain: the power-calibrated effect size approach. Psychological Methods 2016; 21:47–60. Mood AM, Snedecor GW. Query 40. Biometrics Bulletin 1946; 2:120–122. Mudge JF, Baker LF, Edge CB, Houlahan JE. Setting an optimal α that minimizes errors in null hypothesis significance tests. PLoS ONE 2012; 7:e32734. Muirhead RJ, Şoaita AI. On an approach to Bayesian sample sizing in clinical trials. In Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton, G Jones, X Shen (eds). Institute of Mathematical Statistics: Beachwood, 2013, 126–137. Neyman J, Pearson ES.
On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society Series A 1933a; 231:289–337. Neyman J, Pearson ES. The testing of statistical hypotheses in relation to probabilities a priori. Proceedings of the Cambridge Philosophical Society 1933b; 29:492–510. Nixon RM, O’Hagan A, Oakley J, Madan J, Stevens JW, Bansback N, Brennan A. The Rheumatoid Arthritis Drug Development Model: a case study in Bayesian clinical trial simulation. Pharmaceutical Statistics 2009; 8:371–389. Oakley JE, O’Hagan A. SHELF: the Sheffield elicitation framework (version 2.0), School of Mathematics and Statistics, University of Sheffield, 2010. (http://www.tonyohagan.co.uk/shelf/index.html, last accessed 23 June 2021). O’Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979; 35:549–556. O’Hagan A, Stevens JW. Bayesian assessment of sample size for clinical trials of cost-effectiveness. Medical Decision Making 2001; 21:219–230. O’Hagan A, Stevens JW, Campbell MJ. Assurance in clinical trial design. Pharmaceutical Statistics 2005; 4:187–201. Paulson E. A note on tolerance limits. The Annals of Mathematical Statistics 1943; 14:90–93. Pearson ES, Adyanthāya NK. The distribution of frequency constants in small samples from non-normal symmetrical and skew populations. Biometrika 1929; 21:259–286.
Pratt JW. Length of confidence intervals. Journal of the American Statistical Association 1961; 56:549–567. Proschan MA, Lan KG, Wittes JT. Statistical Monitoring of Clinical Trials: A Unified Approach. New York: Springer Science & Business Media, 2006. Pulkstenis E, Patra K, Zhang J. A Bayesian paradigm for decision-making in proof-of-concept trials. Journal of Biopharmaceutical Statistics 2017; 27:442–456. Racine A, Grieve AP, Flühler H, Smith AFM. Bayesian methods in practice: experience in the pharmaceutical industry (with discussion). Applied Statistics 1986; 35:93–150. Racine-Poon A, Grieve AP, Flühler H, Smith AFM. A two-stage procedure for bioequivalence studies. Biometrics 1987; 43:847–856. Raiffa H, Schlaifer R. Applied Statistical Decision Theory. Cambridge: The M.I.T. Press, 1961. Ren S, Oakley JE. Assurance calculations for planning clinical trials with time-to-event outcomes. Statistics in Medicine 2014; 33:31–45. Rodda BE, Davis RL. Determining the probability of an important difference in bioavailability. Clinical Pharmacology and Therapeutics 1980; 28:247–252. Rothmann M, Li N, Chen G, Chi GY, Temple R, Tsou HH. Design and analysis of noninferiority mortality trials in oncology. Statistics in Medicine 2003; 22:239–264. Rothmann MD, Wiens BL, Chan ISF. Design and Analysis of Non-Inferiority Trials. Boca Raton: CRC Press, 2012. Rufibach K, Burger HU, Abt M. Bayesian predictive power: choice of power and some recommendations for its use as probability of success in drug development. Pharmaceutical Statistics 2016; 15:438–446. Ryan TP. Sample Size Determination and Power. New York: Wiley, 2013. Schmidli H, Gsteiger S, Roychoudhury S, O’Hagan A, Spiegelhalter D, Neuenschwander B. Robust meta-analytic-predictive priors in clinical trials with historical control information. Biometrics 2014; 70:1023–1032. Schoenfeld DA, Richter JR. Nomograms for calculating the number of patients needed for a clinical trial with survival as an endpoint.
Biometrics 1982; 38:163–170. Schumi J, Wittes JT. Through the looking glass: understanding non-inferiority. Trials 2011; 12:1–2. Selwyn MR, Dempster AP, Hall NR. A Bayesian approach to bioequivalence for the 2 x 2 changeover design. Biometrics 1981; 37:11–21. Selwyn MR, Hall NR. On Bayesian methods for bioequivalence. Biometrics 1984; 40:1103–1108. Senn SJ. Minimally Important Differences: definitions, ambiguities and pitfalls. Lecture given to the workshop on Minimally Important Differences at the 2nd EuroQol Academy Meeting, 8 March 2017, Noordwijk, NL. (www.slideshare.net/StephenSenn1/minimally-important-differences, last accessed 23 June 2021). Shieh G. On using a pilot sample variance for sample size determination in the detection of differences between two means: power consideration. Psicologica: International Journal of Methodology and Experimental Psychology 2013; 34:125–143. Shieh G. The equivalence of two approaches to incorporating variance uncertainty in sample size calculations for linear statistical models. Journal of Applied Statistics 2017; 44:40–56. Simon R. Optimal two-stage designs for phase II clinical trials. Controlled Clinical Trials 1989; 10:1–10.
Simon R. Bayesian design and analysis of active control clinical trials. Biometrics 1999; 55:484–487. Sims M, Elston DA, Harris MP, Wanless S. Incorporating variance uncertainty into a power analysis of monitoring designs. Journal of Agricultural, Biological, and Environmental Statistics 2007; 12:236–249. Sleijfer S, Ray-Coquard I, Papai Z, Le Cesne A, Scurr M, Schöffski P, Collin F, Pandite L, Marreaud S, De Brauwer A, van Glabbeke M. Pazopanib, a multikinase angiogenesis inhibitor, in patients with relapsed or refractory advanced soft tissue sarcoma: a phase II study from the European Organisation for Research and Treatment of Cancer–Soft Tissue and Bone Sarcoma Group (EORTC study 62043). Journal of Clinical Oncology 2009; 27:3126–3132. Snapinn SM. Alternatives for discounting in the analysis of noninferiority trials. Journal of Biopharmaceutical Statistics 2004; 14:263–273. Snapinn S, Jiang Q. Controlling the type 1 error rate in non-inferiority trials. Statistics in Medicine 2008a; 27:371–381. Snapinn S, Jiang Q. Preservation of effect and the regulatory approval of new treatments on the basis of non-inferiority trials. Statistics in Medicine 2008b; 27:382–391. Snedecor GW, Cochran WG. Statistical Methods, 7th Edition. Ames: Iowa State University Press, 1980. Spiegelhalter DJ, Freedman LS. A predictive approach to selecting the size of a clinical trial. Statistics in Medicine 1986; 5:1–13. Spiegelhalter DJ, Freedman LS, Blackburn PR. Monitoring clinical trials: conditional or predictive power? Controlled Clinical Trials 1986; 7:8–17. Spiegelhalter DJ, Freedman LS. Bayesian approaches to clinical trials. In Bayesian Statistics 3, JM Bernardo, MH DeGroot, DV Lindley, AFM Smith (eds). Oxford University Press: Oxford, 1988, 453–477. Spiegelhalter DJ, Freedman LS, Parmar MKB. Bayesian approaches to randomised trials (with discussion). Journal of the Royal Statistical Society Series A 1994; 157:357–416. Spiegelhalter DJ, Abrams KR, Myles JP. 
Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Chichester: Wiley, 2004. Temple J, Robertson J. Conditional assurance: the answer to the questions that should be asked in drug development. Pharmaceutical Statistics 2021; 22:1102–1111. Therasse P, Arbuck SG, Eisenhauer EA, Wanders J, Kaplan RS, Rubinstein L, Verweij J, Van Glabbeke M, van Oosterom AT, Christian MC, Gwyther SG. New guidelines to evaluate the response to treatment in solid tumors. Journal of the National Cancer Institute 2000; 92:205–216. Tsiatis AA. The asymptotic joint distribution of the efficient scores test for the proportional hazards model calculated over time. Biometrika 1981; 68:311–315. von Neumann J. Various techniques used in connection with random digits. Journal of Research of the National Bureau of Standards 1951; 3:36–38. Wald A, Wolfowitz J. On a test whether two samples are from the same population. The Annals of Mathematical Statistics 1940; 11:147–162. Walley RJ, Grieve AP. Optimising the trade-off between type I and II error-rates in the Bayesian context. Pharmaceutical Statistics 2021; 20:710–720.
Walley RJ, Smith CL, Gale JD, Woodward P. Advantages of a wholly Bayesian approach to assessing efficacy in early drug development: a case study. Pharmaceutical Statistics 2015; 14:205–215. Wang M, Liu GF, Schindler J. Evaluation of program success for programs with multiple trials in binary outcomes. Pharmaceutical Statistics 2013; 14:172–179. Wang SJ, Hung HJ, Tsong Y. Utility and pitfalls of some statistical methods in active controlled clinical trials. Controlled Clinical Trials 2002; 23:15–28. Wassmer G, Brannath W. Group Sequential and Confirmatory Adaptive Designs in Clinical Trials. Cham, Switzerland: Springer, 2016. Westlake WJ. Use of confidence intervals in analysis of comparative bioavailability trials. Journal of Pharmaceutical Sciences 1972; 61:1340–1341. Westlake WJ. Symmetrical confidence intervals for bioequivalence trials. Biometrics 1976; 32:741–744. Westlake WJ. Statistical aspects of comparative bioavailability trials. Biometrics 1979; 35:273–280. Whitehead J. Sample size calculations for ordered categorical data. Statistics in Medicine 1993; 12:2257–2271. Weibull W. A statistical theory of the strength of materials. Ingeniörsvetenskapsakademiens Handlingar 1939; 151:5–45. Wiklund SJ, Burman CF. Selection bias, investment decisions and treatment effect distributions. Pharmaceutical Statistics 2021; 22:1168–1182. Woolf B. The log likelihood ratio test (the G-test). Annals of Human Genetics 1957; 21:397–409. Yu B, Yang H, Sabin A. A note on the determination of non-inferiority margins with application in oncology clinical trials. Contemporary Clinical Trials Communications 2019; 16:100454. Zhang J, Zhang JJ. Joint probability of statistical success of multiple phase III trials. Pharmaceutical Statistics 2013; 12:358–365. Zubrod CG, Schneiderman M, Frei E, et al. Appraisal of methods for the study of chemotherapy in man: comparative therapeutic trial of nitrogen mustard and thiophosphoramide. Journal of Chronic Diseases 1960; 11:7–33.
Appendix 1
Evaluation of a Double Normal Integral

The double integral

$$ I=\frac{1}{2\pi}\int_{-\infty}^{\infty}e^{-x^{2}/2}\int_{a+bx}^{\infty}e^{-y^{2}/2}\,dy\,dx \tag{A1} $$

has the region of integration shown in Figure A1. We can make an orthogonal rotation

$$ u=y\sin\theta-x\cos\theta,\qquad v=y\cos\theta+x\sin\theta, $$

in which $\sin\theta=1/\sqrt{1+b^{2}}$ and $\cos\theta=b/\sqrt{1+b^{2}}$; then, the geometry shows that the length of the perpendicular from the line y = a + bx to the origin is $a/\sqrt{1+b^{2}}$ and the integral is given by

$$ I=\frac{1}{\sqrt{2\pi}}\int_{a/\sqrt{1+b^{2}}}^{\infty}e^{-u^{2}/2}\,du=1-\Phi\!\left(\frac{a}{\sqrt{1+b^{2}}}\right). $$

FIGURE A1
Integration region associated with (A1).
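The closed form can be checked numerically. The sketch below is illustrative (not code from the book) and simply compares a brute-force quadrature of (A1) with 1 − Φ(a/√(1 + b²)) for arbitrary values of a and b, assuming scipy is available.

```python
# Numerical check (illustrative, not from the book) of the Appendix 1 result:
# the probability that one standard normal exceeds a + b times another,
# independent, standard normal equals 1 - Phi(a / sqrt(1 + b^2)).
import math
from scipy import stats
from scipy.integrate import dblquad

a, b = 0.7, 1.3

# brute-force double integral of the standard bivariate density over y > a + b*x
num, _ = dblquad(
    lambda y, x: math.exp(-(x * x + y * y) / 2) / (2 * math.pi),
    -8, 8,                     # outer x limits (tails beyond +/-8 are negligible)
    lambda x: a + b * x, 20,   # inner y limits: from the line upward
)
closed = 1 - stats.norm.cdf(a / math.sqrt(1 + b * b))
print(num, closed)             # the two values agree to quadrature accuracy
```

This is the identity that turns the double integrals arising in average-power calculations into a single normal CDF evaluation.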
Appendix 2
Besag’s Candidate Formula

Besag (1989) reports a result from an examination script of a Durham University undergraduate. The starting point is the joint distribution of future data, $y_f$, and the parameters of the model, $\delta$, given the current data $y_c$. It can be expanded in two ways:

$$ p(y_f,\delta\,|\,y_c)=p(\delta\,|\,y_f,y_c)\,p(y_f\,|\,y_c) $$
$$ p(y_f,\delta\,|\,y_c)=p(y_f\,|\,\delta,y_c)\,p(\delta\,|\,y_c) $$

implying that

$$ p(y_f\,|\,y_c)=p(y_f\,|\,\delta,y_c)\,\frac{p(\delta\,|\,y_c)}{p(\delta\,|\,y_f,y_c)}. \tag{A2} $$

A similar expansion for $p(y_f,y_c\,|\,\delta)$ gives

$$ p(y_f,y_c\,|\,\delta)=p(y_f\,|\,\delta,y_c)\,p(y_c\,|\,\delta) $$

and since for independent observations

$$ p(y_f,y_c\,|\,\delta)=p(y_f\,|\,\delta)\,p(y_c\,|\,\delta), $$

it follows that $p(y_f\,|\,\delta,y_c)=p(y_f\,|\,\delta)$, which in (A2) gives

$$ p(y_f\,|\,y_c)=p(y_f\,|\,\delta)\,\frac{p(\delta\,|\,y_c)}{p(\delta\,|\,y_f,y_c)}. \tag{A3} $$

Alternatively,

$$ p(\delta\,|\,y_f,y_c)=\frac{p(y_f,y_c\,|\,\delta)\,p(\delta)}{p(y_f,y_c)}=\frac{p(y_f\,|\,y_c,\delta)\,p(y_c\,|\,\delta)\,p(\delta)}{p(y_f\,|\,y_c)\,p(y_c)}=\frac{p(y_f\,|\,y_c,\delta)\,p(\delta\,|\,y_c)}{p(y_f\,|\,y_c)}, $$

which can be re-arranged to give (A3). What is surprising is that the right-hand side of (A3) needs to be known only for a single value of $\delta$. For example, we know that

$$ p(y_f\,|\,\delta)\sim N\!\left(\delta,\frac{\sigma^{2}}{n_{2}}\right),\qquad p(\delta\,|\,y_c)\sim N\!\left(\bar{y}_c,\frac{\sigma^{2}}{n_{1}}\right)\quad\text{and}\quad p(\delta\,|\,y_f,y_c)\sim N\!\left(\frac{n_{2}y_f+n_{1}\bar{y}_c}{n_{1}+n_{2}},\frac{\sigma^{2}}{n_{1}+n_{2}}\right). $$

Substituting these three distributions into (A3) gives

$$ p(y_f\,|\,y_c)\sim N\!\left(\bar{y}_c,\,\sigma^{2}\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)\right). $$
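Because the candidate formula holds for every value of δ, it is easy to check numerically. The sketch below is illustrative (not code from the book): it assumes p(y_f | δ) = N(δ, σ²/n₂), p(δ | y_c) = N(ȳ_c, σ²/n₁) and the corresponding conjugate posterior, and verifies that the candidate ratio reproduces the predictive density N(ȳ_c, σ²(1/n₁ + 1/n₂)) at two different, arbitrary values of δ; all numerical values are made up.

```python
# Numerical check (illustrative, not from the book) of Besag's candidate
# formula p(yf|yc) = p(yf|delta) p(delta|yc) / p(delta|yf,yc), for ANY delta.
from scipy import stats

sigma2, n1, n2 = 2.0, 12.0, 5.0
ybar_c = 1.4                       # current-data mean (made-up number)
yf = 0.9                           # future mean (made-up number)

def candidate(delta):
    lik = stats.norm.pdf(yf, delta, (sigma2 / n2) ** 0.5)         # p(yf | delta)
    prior = stats.norm.pdf(delta, ybar_c, (sigma2 / n1) ** 0.5)   # p(delta | yc)
    post_mean = (n2 * yf + n1 * ybar_c) / (n1 + n2)
    post = stats.norm.pdf(delta, post_mean,
                          (sigma2 / (n1 + n2)) ** 0.5)            # p(delta | yf, yc)
    return lik * prior / post

direct = stats.norm.pdf(yf, ybar_c, (sigma2 * (1 / n1 + 1 / n2)) ** 0.5)
print(candidate(0.3), candidate(-1.7), direct)   # all three values coincide
```

The δ-independence of the ratio is exactly what makes the formula useful: the predictive density is recovered without integrating over the parameter.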
Index

Page numbers in italics refer to figures; bold refers to tables; and page numbers followed by ‘n’ refer to note numbers.

A absolute difference, 64, 152 absolute error loss, 99–100 active and inactive compounds, 81, 142 active control effect, 51, 52, 57 active controlled trial, 7, 121, 121 active treatment versus control, 64, 64, 68, 80 Acuna SA, 51 acute stroke, 106 Adyanthāya NK, 2 Alderson NE, 57 algorithms, 21, 61, 80, 152, 157–159 determination of AP, 66 alternative hypothesis, 1, 2, 5 composite, 6 prior probability, 19 sets of parameters, 6 alternative space, 19, 31 Altham PME, 68 analytic calculation, 11–14, 60, 87, 113 area under plasma-concentration time curve (AUC), xv Arendt CD, 37 Armitage P, 160 assurance, 1, 7, 10, 17, 33–58, 87, 127, 135 absolute value questioned, 81 application to series of studies, 38–43 basic considerations, 33–34 Bayesian version, 76 bounds, 34, 35, 37 equivalent to AP, 34 interim analysis in clinical trials, 44–49 non-inferiority trials, 49–58 non-standard examples, 17 practical uses, 38 sample size, 34–37 asymptotic approach
AP determination, 69–71 asymptotic χ2 test, 66, 67 average power (AP), 1, 98, 99, 114, 127, 135 bound when variance is estimated, 31 bounds, 18–20 critical aspect, 13 decomposition, 6, 24–29, 91 decomposition, 27, 60, 91 defined by integral, 87 determination (four approaches), 11–18 equivalent to “assurance”, 34 generalisation of concept, 7, 87 relationship with planned power, 13 restless leg syndrome clinical trial, 16 robust prior, 21–23 traditional definition, 29 upper bound, 13, 18, 23, 37, 78, 90 use of simulation to determine, 113 using truncated-normal prior, 59–60 variance estimation, 29–31 written as double integral, 15 average power in non-normal settings, 7, 59–73 bound for binary endpoint, 68, 69 prior number of events, 71 survival context, 69–73 when response binary, 64–66, 67 when variance unknown (conditional on fixed treatment effect), 60–61 when variance unknown (joint prior on treatment effect and variance), 61–63 B Bartky W, 101 Basel, xv, xix
Bauer MM, 50 Bauer P, 50, 118 Bayes estimator, 99–100, 113 Bayes’ theorem, 75 Bayesian approach, xv–xvi, xvii, xix, 57–58, 140, 148, 160 multiple decision criteria, 136–139 “proper”, 13–14, 37, 75, 78, 111–112 unplanned interim analysis (case study), 121–125 Bayesian estimate definition, 99 Bayesian estimation theory application in prior context, 99 Bayesian posterior probability, 84, 85, 85 Bayesian power (BP), 7, 75–86, 135, 137 asymptotically bounded above, 80 binary response, 80, 81 bounds, 7, 77–78 criterion defining positive outcome, 80 predictive probability, 75 sample size, 76–77, 77 Bayesian probabilities, 109, 138 asymptotic behaviour of unconditional, 137 Bayesian-frequentist hybrids, xvii, 6 advantage, 160 Beal SL, 152, 153–154, 156, 160 Bernardo JM, 100 Besag J, 104 Besag’s candidate formula, 104, 173–174 beta distribution, 65, 66, 80, 91, 92 beta-binomial, 65, 68 binary data, 16, 17 binary endpoint, 42, 68, 69, 102 binary outcome, 4 Bayesian power, 80, 81 binary response, 64–66, 67 binomial distribution, 122–124 bioequivalence, xv, xvii, 6 Biometrics Bulletin, 159 bivariate normal distribution, 18, 27, 28, 41, 45, 59 Böckenholt U, 35, 37 boundaries binding versus non-binding, 49 Box GEP, xvi, 17, 82, 140
Brannath W, 106 Broglio KR, 49 Brown BW, 65 Browne RH, 92 Burman CF, 86 C Campbell MJ, 3 cardiovascular outcomes trial, 21 Carroll KJ, 39, 41 central limit theorem, 2, 113 Chen D-G, 63 chi-square-density, 152 χ2 distributed variable, 155 χ2 distribution central and non-central, 95 chi-square test Yates’ corrected, 64 Chuang-Stein C, 16 Ciarleglio MM, 37 Ciba-Geigy, xvi–xvii, xix clinical trial Bayesian, 148 design, 3, 148 high failure rate, 5 interim analysis, 44–49 objective, 7, 49, 81, 120, 127, 130, 150, 156, 160 planning stage, 91, 149 significant study versus nonsignificant study, 119 simulations, 7 sizing, 10 statistical principles (ICH guidance), 9 survival endpoint, 69–73 when recruitment difficult, 7, 120 clinically relevant difference, 1, 36, 77, 79 Cochran WG, 19–20 Cohen J, 61 power handbook, 3 colon cancer, 49 colony-forming units (CFU), 121 complementary cumulative distribution function (CCDF), 91, 93 complicated urinary tract infections (cUTI), 121
composite event, 46, 60 probability, 47, 48 conditional assurance (CA), 7, 42–43, 44, 47 conditional Bayesian power (CBP), 137 conditional expected power (CEP), 37 conditional power (CP), 1, 7, 55, 76, 103–109, 112, 115, 118, 120, 134 averaging with respect to prior, 11–14, 16–18 based on currently observed treatment effect, 103 based on original power assumption, 103 estimated, 104, 105 conditional power function, 14, 62, 63, 87 conditional probability, 65, 84, 134, 145, 153, 154, 155, 156 of successful trial, 9–12, 14, 16, 30, 33, 70, 82, 90 confidence interval (CI), 7, 10, 50, 51, 140, 143, 147, 160 confidence interval width, 149–152 unconditional sample sizing, 156–159 whether should be sole determinant of sample size, 153–156 confidence levels, 128–129 confidence limit, 122, 124, 125 conjugate prior, 21, 61, 65, 72 consensus method, 50, 51 control arm, 4, 50, 65, 103, 121, 122 Cook algorithm, 157, 159 modified, 158 credible interval (CrI), 91, 94, 95–96 Crisp A, 21–23 Crook JF, 6 crossover, 132, 133, 136 cumulative distribution function (CDF), 28, 30, 88, 93–98 passim, 109–110, 155 prior power, 113 curtailed inspection, 102 D Dallal S, 21 Dallow N, 107–108 Daly LE, 156 data
potential, but yet unobserved, 17 unconditional predictive distribution, 14–15 Dawid AP, 111 Day SJ, 149, 150, 151 decision areas (ABCD), 144–147 decision criteria in proof-of-concept trials, 7, 127–148 estimated variance case, 143–147 general (early phase studies), 127–128 known variance case, 128–133, 134 decision quadrants, 143–144, 143 decision rule, 105, 112, 113 decision-making, 6, 7, 9, 107, 120 evidence-based, 127 degrees of freedom, 60–63, 72, 149, 152, 155, 157, 159 Delphi process, 51 dextran, 106–107 Diaconis P, 21 diastolic blood pressure (DBP), 150 dignity line, 128 Dirac delta function, 100 distributions unimodal (positively skewed), 83, 84, 100 Dodge HF, 3, 101 dose-finding study, 6, 40 double normal integral, 12, 58, 78, 109, 111, 171 evaluation, 171 integration region, 171 double sampling, 101, 102 single sampling, 101 drug development, xix, 29, 44, 91 Bayesian criterion, 75 decision criterion (pre-definition), 75, 109 over-optimism, 5 drug development team, 5, 36, 38–39, 63, 79, 120, 142, 158 planning stage, 148 Du H, 113 E early phase studies general decision criteria, 127–128
Eaton ML, 18, 78, 80 effect size (ES), 3, 61 efficacy, 33, 43, 44, 51, 101, 114, 115, 127–128, 129 boundaries, 47–48; see also superiority
efficacy density given progression (Wiklund and Burman), 86 Ellenberg S, 57 equivalence Bayesian perspective, 50 estimated treatment effect, 149, 154 estimated variance case, 143–147 (generalised assurance), 147–148 estimation surety and assurance, 7, 149–160 expected loss (EL), 99–100 expected power, 6, 10–11 unconditional or absolute, 134 expected value, 7, 11, 19, 65, 87, 91, 108, 109, 111, 113, 147, 156 experimental therapy (T), 51–52 exponential survival, 69, 71–72 extended Glasgow outcome scale (EGOS-I), 114 F Fayers PM, 70 fiducial approach, 159–160 Fina P, 107–108 final analysis, 43, 46, 49 final success, 48 Finney DJ, xvi Fisher RA, 73n1, 118, 160 fiducial path, 159 Fisher’s exact test (FET), 65, 66, 67 Fisher Box J, xvi fixed treatment effect, 60–61, 91 Freedman LS, xvii, 14, 109, 160 Frei A, 6 Freiman JA, 4–5 frequentist approach, xvii, xix–xx, 75, 160 designs incorporating multiple decision criteria, 136 frequentist power, 12, 75 function automixfit (R library RBesT), 23
futility, 7, 44, 45, 48–49, 103, 115 boundaries, 47 probability, 46 stopping for based on predictive probability, 109–111 futility test, 43, 119, 119, 120 G gamma distribution, 91 two-parameter, 157 gamma function upper and lower, 93 gamma prior, 60 Gaskins A, 16 Gaussian quadrature, 16, 65, 72 generalised assurance estimated variance case, 147–148 known variance case, 134–135 glycerol, 106–107 GO decision, 128–148 Good IJ, 6, 16 Graybill FA, 152 Grieve AP, xvi, 3–4, 14, 68, 152, 153–155, 157, 160n1 background, xv–xxi book purpose, 6 means of eliciting conjugate prior, 61 publications, 163, 164, 166–168 passim sample size problem, 4–5 Grouin JM, 62, 63 group sequential design (GSD), 44, 47, 106 boundary plot, 44 Guenther WC, 152 Guttman I, 87 H Hafner KB, 150, 152 half-width, 150, 152, 155, 160 Hall NR, xvii Hall W, 21 Halpern SD, 4–5 Harris M, 151, 158–160 Harris algorithm, 158–159 hazard ratio (HR), 21–22, 22, 69, 72 hepatocellular carcinoma (HCC) RCT, 70–71, 71
Hermite polynomial, 16, 30–31, 148 historical data, xvi, 17, 21, 52, 54, 55, 57 Ho S, 63 Hung HJ, 57 hypergeometric distribution, 65, 68 I Ibrahim JG, 12 independent data monitoring committee (IDMC), 49, 115 information fraction, 44, 44, 106 information fraction function predictive versus conditional power, 105 integral, 16, 30, 42, 65, 147 standard normal, 137 integration region, 154, 171 interim analysis, 7, 43, 118 in clinical trials, 7, 44–49 unplanned (simulation case study), 120–125 interim predictions, 7, 101–112 conditional and predictive power, 103–109 “proper Bayesian” predictive power, 111–112 stopping for futility based on predictive probability, 109–111 interim result (“ZINT”), 105, 106, 107, 107 International Conference for Harmonisation (ICH), xxiii, 9–10 International Restless Leg Scale (IRLS), 13, 90 International Statistical Institute, xvi–xvii inverse χ2 density, 156–157 inverse chi-square distribution, 93, 96, 157, 159 inverse χ2 prior, 160 inverse gamma, 72, 158 J Jacobian, 72, 88, 92 Jeffreys H, 5
Jeffreys’ prior, 61 Jiang K, 12, 35 Jiang Q, 51 Jiroutek MR, 156 joint distribution of future data, 173 joint prior on treatment effect and variance, 61–63 joint probability (CI width and true parameter value), 153–154 Journal of Abnormal and Social Psychology, 3 Journal of American Medical Association, 4 Julious SA, 150, 151 K Kelley K, 152, 157 key opinion leaders (KOLs), 51 Kieser M, 92 known variance case decision criteria, 128–133, 134 generalised assurance, 134–135 Köhne K, 118 Kunzmann K, 18, 24, 28–29, 85 Kupper LL, 150, 152 L Lalonde RL, 127–128 Lalonde plot, 131–133 lambda prime distribution, 61, 63, 73n1 Lan KKG, 12, 37, 59–60, 92, 106 Lancaster HO, 65 Lancet, 4 LD50, xv–xvii Lecoutre B, 61–62 LePAC software, 62 Legendre polynomials, 65, 146, 147 Lehmacher W, 118 Lehmann E, 153 likelihood, 75, 80 normal, 59, 60 likelihood ratio (LR) test, 64 Lim J, 4, 21 Lindley DV, 39 linear regression model, 19 Liu H, 70
loss functions, 99–100, 113–114 lower reference value (LRV), 128, 130, 132, 135, 136–137, 144–145
M major adverse cardiovascular events (MACE), 21 Mann-Whitney form of Wilcoxon Test (MWT), 114, 116, 117, 118 marginal distribution, 14–15 marginal predictive density, 39–40 marginal predictive distribution, 56 Markov inequality, 110, 111 Martz HF, 61 Matthews neurological rating scale, 106 maximum and minimum values, 16, 22, 23, 129–137 passim, 138, 142, 145 maximum likelihood estimate, xvi, 72 McShane BB, 35, 37 mean, 1–2, 2, 9, 10, 71, 72, 83, 84, 100, 103, 104, 113, 128, 159 mean efficacy, 50 median, 3, 72, 83, 98, 99, 100 median prior power, 87 prior median power, 88, 89, 90, 91 meta-analytic-predictive (MAP) prior, 21 microbiologically evaluable (ME) population, 121 mid-p-value (Lancaster), 65 mid-point rule, 16, 16, 62, 72, 145, 147 minimally clinically important difference (MCID), 1, 24, 27, 29, 60, 83, 156 minimum requirement test, 128, 128 mixture distribution, 21, 22 mixture prior, 17, 42 components, 23 monitoring (of clinical trials), xvii, 11 Monte-Carlo samples, 21 Mood AM, 159–160 Muirhead RJ, 13–14, 23, 37, 60, 75, 78, 90 multiple decision criteria, 7 Bayesian approach, 136–139 bounds on unconditional decision probabilities, 135–136 posterior conditional distributions, 140–142 study outcomes and associated decisions, 128 multiple sampling plan, 102 multivariate predictive distribution, 39
N negative predictive values (NPV), 142 new drug, 146–147 New England Journal of Medicine, 4 Newton-Raphson algorithm, 158 Neyman J, 2, 3, 5 95–95 method, 50, 51–52 Nixon RM, 43 NOGO decision, 128–148 nominal power, 12, 14, 36, 77 non-central t-distribution, 30 non-central t-probability, 144 non-centrality parameter (NCP), 3–4, 4, 61, 105 definition, 12, 16, 53, 59 non-inferiority, 50, 121, 122, 125 non-inferiority trials, 7, 49–58 Bayesian methods, 57–58 fixed margin, 52–54 purpose, 51–52 “small amount less effective” criterion, 50–52 starting point, 50 synthesis method, 54–57 non-normal settings average power, 7, 59–73 non-parametric approach, 7, 69, 106 non-significance probability, 25, 28 normal approximations (to standard likelihoods), 2 normal density, 30, 147 normal distribution, 113, 148, 149 normalised assurance (NA), 53, 79 sample size, 37–38, 38 normalised Bayesian power (NBP), 78, 79 sample size, 79, 79 normality assumption, 24, 113 Novartis, 127 nuisance parameters, 1, 6 null hypothesis, 2, 5, 24, 64, 72, 113, 119, 156 asymptotic χ2 test, 66, 67
score test (Whitehead), 116 shifted, 10 single test for, 148 test (in partitioned parameter space), 18 testing against composite alternative hypothesis, 6 null hypothesis rejection, 1, 3, 6, 18, 29, 101, 102, 125, 148, 151 numerator and denominator, 42–43, 59, 82 numerical integration, 11, 16, 87 example, 16 O O’Brien-Fleming type rule, 44, 47, 47–48 O’Hagan A, 17–18, 33, 38, 43, 60 Oakley JE, 72–73 observations, 2, 18, 24, 62, 101, 159, 173 observed data, 17, 65, 75 odds ratio, 64, 114, 115, 117, 118, 119, 120 one-parameter exponential distributions, 71–72 one-sided test, 2, 45 Yates’ corrected, 64 Open BUGS system, 124 operating characteristics (OC), 7, 17, 131, 131–133, 147, 148 ordered categorical data, 115–117 P p-value, 40, 118, 127 one-sided, 62, 103 parallel group study placebo versus active, 115 parameter constraints, 82, 140 parameter space, 18, 78 parameter value true, 153–156 parameters, 1, 6, 17, 20, 23, 28, 51, 65, 66, 68, 72, 80, 91–93, 97, 133, 142, 157, 159, 160, 173 unknown, 104, 122 parametric models, 7, 69 pathogen types, 121
patients per arm (of trial), 1, 10, 35, 38, 39, 40, 45, 56, 70, 79, 81, 102, 103, 106, 130, 148, 159; see also two-arm trial Patterson SD, 150 Paulson E, 109 PAUSE decision, 128–148 pazopanib, 102 Pearson ES, 2, 3, 5 Pfizer, 127 pharmaceutical industry, xxi R&D, xv, xix phase I trials, 44, 47, 102, 111, 117, 119 phase II, 7, 35, 38–49 passim, 86, 101, 102, 105, 111, 117, 118, 119, 121 phase III, 7, 35, 39–43 passim, 49, 63, 86, 114 pivotal function, 104 pivotal perspective, 160 placebo, 17, 49–53 passim, 66, 80, 106–107, 115, 118, 119, 128, 129–130, 146–147, 148 response probabilities, 114 -controlled study, 40 plambdap function, 62 planned power relationship with average power, 13 pmvnorm, 27, 43 Pocock boundary, 106 point estimate method, 50, 52 population means, 2, 9, 159 population variance, 55, 149, 159 positive predictive values (PPV), 142 posterior conditional distributions with multiple decision criteria, 140–142 posterior conditional failure distribution, 85–86, 140 posterior conditional success distribution, 81–86, 140 investigation of selection bias, 86 success defined by Bayesian posterior probability, 84, 85, 85 success defined by significance, 82–84 use of simulation to generate samples, 85–86
posterior distribution, 12, 26, 57, 65, 80, 82, 85, 103, 109, 111, 122–124, 123, 156 Bayes’ theorem, 75 posterior mean, 81, 99, 113 posterior probability, 19, 40, 50, 75, 78, 80, 82, 103 Bayesian, 84–85 final, 112 power, 1, 7, 114, 151, 156 80% target, 12, 13, 14, 37, 48, 68, 89, 90, 93–95, 97, 117, 120 bounds on, 6 conditional unless absolute, 6, 9–31 power calculation, 3, 7, 10, 14, 26, 108, 120, 131 power calibrated effect size (PCES), 35 power CDF, 94, 96–97 power function, 5, 18, 30–31 conditional versus unconditional, 87 multi-dimensional, 6 Pratt JW, 150 pre-posterior distribution, 7, 81–82 predicted power, 6, 10–11 prediction problems, 19–20 predictive approach, xvi, xvii, xix, 11, 65 predictive check, xvi, 17 predictive density, 39, 110 predictive distribution, xvi–xvii, 15, 26, 41, 87, 104, 124, 125 conditional and unconditional, 138 predictive power (PP), 1, 6, 7, 11, 14–15, 103–110 based on observed treatment effect and its uncertainty, 103 “proper Bayesian”, 111–112 relationship with information fraction, 107 predictive power (relationship with CP), 108, 112, 137 predictive probability, xv, 66, 76 contours, 67 stopping for futility, 109–111 press releases, 49 primary endpoint, 7, 13 binary, 120
continuous measure (assumed standard deviation of 2 units), 130–131 continuous variable (smaller values preferable), 55 Matthews neurological rating scale, 106 multinomial, 115 PFR12, 102 POC study, 130–131 proportional odds, 114–120 Verum-to-placebo trial, 66 prior, 4, 75, 122, 136 averaging conditional power, 11–14, 16–18 normal, 59 robust, 6 truncated-normal, 59–60 prior beliefs, 11, 17, 66, 68 prior data, 62, 65, 122 prior density, 14, 22, 63, 89, 90, 94, 96–97, 98, 100, 139 prior distribution, 12, 18, 22, 29, 31, 40, 56, 57, 62, 111, 113, 114, 134, 156 arbitrary, 21 classes (Wiklund and Burman), 86 components, 41 of power and sample size, 7, 87–100 summaries, 99–100 trial of surgery for gastric cancer, 70 prior distribution of sample size treatment effect fixed, variance uncertain, 96–98 prior distribution of study power variance known, 88–92 treatment effect fixed, variance uncertain, 92–94 prior distribution of study power and sample size uncertain treatment effect and variance, 98–99 prior distribution of study sample size variance known, 94–96 prior expected response rate, 68 prior information, 17 available, but not overwhelming, 68 large amount, 139
small amount, 137 prior information fraction, 13 prior mean, 58, 88, 99 prior median power, 88, 91 prior posterior conditional failure (PPCF), 82, 84, 86, 141 distributions, 85 prior posterior conditional probabilities, 142 treatment effect, 83 prior posterior conditional success (PPCS), 82–83, 141 distributions, 82 prior posterior probability, 83, 142 prior power CDF, 90, 91 prior power distribution U-shaped, 91 prior probability, 5, 18–19, 23, 28, 31, 58, 78 prior sample size, 56, 57, 77, 90, 91, 96–97 probability density function (PDF), 17–18, 95, 98 prior, 16 unconditional predictive, 15 probability of programme success (POSS), 42 probability of success (POS), 7, 12, 25, 28–29, 38, 41, 42, 53, 82, 111 based on AP, 71 conditional, 48–49, 60, 137 conditionality, 9 decomposition, 18 normalised, 60 predictive, 104, 105, 108, 109 ultimate posterior, 109–110 unconditional, 62, 137 PROBBNRM (statistical package), 27, 41, 60 progression-free rate at 12 weeks (PFR12), 102 proof-of-concept (POC) trials decision criteria, 7, 127–148 estimated variance case, 143–147 estimated variance case (generalised assurance), 147–148
proportional hazard model, 69, 72 proportional hazard and non-parametric survivor function model, 72–73 proportional odds (PO) model, 115, 119 proportional odds primary endpoint (simulation case study), 114–120 applying CP to Wilcoxon test, 117 background, 114–115 simulation results, 119, 120 simulation set-up, 119 statistical approach to control type I error, 118 Wilcoxon test for ordered categorical data, 115–117 Proschan MA, 106, 107 Pulkstenis E, 148 Q quadratic equations, 35, 54 quadratic loss, 99 quantiles, 61, 88, 114, 157, 158 quantitative decision-making (QDM), 127, 148 R Racine A, xvi, 6 radiofrequency ablation (RFA), 70–71 Raiffa H, 81 random effects model, 19, 20 random sample, 24, 25, 26 randomised clinical trial (RCT), 35–36 hepatocellular carcinoma, 70–71 restless leg syndrome, 63 rejection (R) event, xxiv, 156 relevance, 7, 9, 10, 24, 128, 128 Ren S, 72–73 response evaluation criteria in solid tumours (RECIST), 102 response rate, 17, 64, 65, 66, 68, 80, 101, 102 response variable, 19 binary, 101 restless leg syndrome, 13, 16, 63 resultant power, 6
definition, 5 rheumatoid arthritis, 43 Richter JR, 72 Robertson J, 38, 40–44 robust prior average power, 21–23 Romig HG, 3, 101 Rothmann MD, 54–56 Royal Statistical Society, xx, xxi Rufibach K, 88, 91 Ryan TP, 160 S sample size, xvii, 1, 3, 10, 13, 14, 18, 47, 54–56, 57, 60, 86, 105, 107, 110, 113, 116–117, 122, 132, 133, 148, 160 based on programme budget, 5 decision about appropriate, 114 effective prior, 91 for given AP, 34–37 for given Bayesian power, 76–77, 77 for given normalised assurance, 37–38, 38 for given normalised Bayesian power, 79, 79 ICH guidance, 9 inadequate, 49–50 increase, 7, 108 initial, 108 maximum, 117, 119, 120 part of problem, 4–5 POC study, 130 prior density, 97 prior distribution, 98–99 prior distribution (variance known), 94–96 prior distribution (variance uncertain), 96–98 prior distribution for given power, 7 prior median, 98 Simon design, 102 standard formula, 2, 9, 96 in study to estimate treatment effect, 7 treatment effect fixed, 96–98 unknown variance case “irrelevant sophistication”, 63
width of CI, 149–151 sample size per arm, 87, 135, 136, 137, 139, 146, 152–153, 159 determination, 149–150 whether CI width should be sole determinant of sample size, 153–156 sample size determination alternative to power, 151–153 sample-size re-estimation (SSR), 7, 108, 115, 117–120 sample size required, 77, 77, 78, 79, 98 scaled inverse-χ2 distribution, 92 Schlaifer R, 81 Schmidli H, 21 Schoenfeld DA, 72 selection bias, 86 Selwyn MR, xvii Senn SJ, 10, 17, 113 sequential test procedure, 101 Sheffield Elicitation Framework (SHELF) methodology, 72 Shieh G, 92 significance probability of achieving, 14–15 definition of success, 82–84 significance test, 85, 140 Bayesian and non-Bayesian (compromises), 6 Simon R, 57 Simon design, 101–102 Sims M, 92, 113 simulation, 11, 17–18, 26, 41, 43, 60, 87, 147, 148 of assurance (survival models), 72–73 “mathematics by other means”, 113 posterior conditional success and failure distributions, 85–86 simulation case studies, 7, 113–125 proportional odds primary endpoint, 114–120 unplanned interim analysis, 120–125 Sleijfer S, 102 Smith AFM, 39, 100 Snapinn S, 51 Snedecor GW, 19–20, 159–160 Şoaita AI, 13–14, 23, 37, 60, 75, 78, 90 soft tissue sarcoma (STS), 102
Spiegelhalter DJ, xvii, 2, 14, 21, 66, 69–70, 88, 104, 106, 109–111, 160 square error, 113–114 stacked probability, 131, 147 standard of care (SOC), 49–52, 54, 128 standard conjugate prior, 21, 65 standard deviation, 3, 13, 14, 36, 40, 65, 106, 130, 143, 154, 158, 159 standard error (SE), 52, 54, 55 standard likelihoods, 2 standard normal distribution, 11, 42, 103 statistical design four factors, 1 statistical significance, 7, 33, 43, 49, 66, 103, 127, 129 step function, 18–19, 31 Stevens JW, 33 study failure, 84, 85, 85 study power, 7, 120 prior distribution, 98–99 prior distribution (known variance), 88–92 prior distribution (treatment effect fixed, variance uncertain), 92–94 success Bayesian definition, 82 superiority, 33, 44, 49, 51, 53, 69, 129; see also efficacy superiority sample size calculation, 151 superiority test, 45–46 surety, 159, 160n1 80% target, 155 definition, 151–152 surety and assurance in estimation, 7, 149–160 survival context, 69–73 AP for comparison of one parameter exponential distributions, 71–72 asymptotic approach to determining AP, 69–71 generalised approach to simulation of assurance for survival models, 72–73 synthesis method, 50, 52, 54–57 sample size, 54–56, 57
T t-distribution, 30, 63, 143 t-statistic, 62 t-test, 2, 29 target power, 95, 97 target product profile (TPP), 127–128, 148 target value (TV), 128–130, 135, 144–146 Temple J, 38, 40–44 ternary plot, 136, 137, 139, 139 test consistency (definition), 18 test statistic, 54 asymptotic normal, 72 log-rank, 69 standardised, 44–45 test strength (Crook and Good), 6 test treatment and control arms, 102–103 Tiao GC, 82, 140 toxicity studies, xv–xvi transcatheter arterial chemoembolisation (TACE), 70–71 transformation, 98, 110 simple linear, 96 traumatic brain injury (TBI), 114 treatment effective versus ineffective, 7, 84, 101, 102 treatment difference, 14, 63, 95, 125 clinically meaningful, 14 potential approaches (Senn), 10 treatment effect, 2, 9, 17, 18–19, 24, 30, 39, 40, 55, 56, 82, 106, 108, 111–112, 122, 124, 148 conditional predictive distribution, 134 estimated, 143 estimation (determination of sample size), 7 expected, 14, 76, 137 fixed, 113 fixed (prior distribution of sample size), 96–98 fixed (prior distribution of study power), 92–94 joint prior on, 61–63 negative (exclusion), 7 negative values, 91
normal prior, 91 observed, 130 by pathogen types, 123 positive probability, 23 positive values, 91 PPCS (significant versus nonsignificant outcome), 83 primary measure (determination), 64 prior distribution, 135 prior distribution, 11, 88, 95 prior posterior conditional distributions, 141 prior probability, 28 significance or non-significance of study, 83 study success or failure, 84, 85, 85 true, 76, 84, 103–104, 129, 131–134, 135, 142, 146, 147 truncated prior, 7 uncertain (prior distribution of study power and sample size), 98–99 uncertainty in current knowledge, 87 unconditional predictive distribution, 134 zero versus non-zero, 88 treatment effect cut-off prior posterior conditional probabilities of exceeding, 142 treatment effect information greater in prior than in study, 91 truncated average power (TAP), 59 Tsiatis AA, 69 two-arm trial, 1–2, 3, 4, 10, 40, 64, 66, 114, 128, 130 active treatment versus control, 64 single-arm case, 160; see also patients per arm
two-stage bioequivalence, xvii, 6 two-stage designs advantages, 17 essential features, xv type I error, 1–3, 24, 91, 116, 120 one-sided, 14, 55, 70, 117 optimised, 4 statistical approach to control, 118
two-sided, 4, 114 type I error rate, 17 two-sided, 66 type II error, 1, 3, 18, 151 optimised, 4 two-sided, 4 U uncertainty, 11, 17, 19, 35, 87, 94, 95–96, 108, 121, 122, 147 study-to-study variability, 52 unconditional (absolute) approaches essentially Bayesian in construct, 1 unconditional assurance (UA), 57 unconditional decision probabilities, 139 bounds, 135–136 unconditional power, 15, 61, 87 unconditional predictive distribution, 14–15, 34, 76, 137 unconditional probability (U), 47, 134, 145, 157 unconditional sample sizing based on CI width, 156–159 Cook algorithm, 157–158, 159 Harris algorithm, 158–159 unified power function (Good and Crook), 6 unnormalised density, 21, 22, 23 unplanned interim analysis (simulation case study), 120–125 background, 121 interim data, 121, 121 model for prediction, 121–125 upper bounds, 13, 19, 20, 23, 34, 37, 53, 62, 77–78, 90, 136, 137 US Food and Drug Administration (FDA), 51 Bayesian guidance, 19 V validity (V) event, 156 occurs if CI includes true parameter value, xxiv variance, 2, 3, 17, 20, 29–31, 34, 35, 113 bound on average power, 31
estimated as integral part of analysis of study, 6 known and unknown cases (closeness), 62, 99 variance (known), 2, 7, 9, 14, 103, 113 prior distribution of study power, 88–92 prior distribution of study sample size, 94–96 variance (uncertain) prior distribution of sample size, 96–99 prior distribution of study power, 92–94, 98–99 variance (unknown), 7 average power, 61–63 case “irrelevant sophistication”, 63 variance prior, 94, 97 variances, 9, 33, 137, 160 ratio of, 50 Verum, 66, 115 response probabilities, 114 von Neumann J accept or reject approach, 21, 22 W Wald A, 18 Waller RA, 61 Walley RJ, 4, 42, 63, 81, 82, 142
Wang L, 113 Wang M, 42 Wang SJ, 57 Wassmer G, 92, 106, 118 WebPlotDigitizer, 21 Weibull distribution, 69 Weibull model, 72 Whitehead J, 116 width (W) event, 156 occurs if CI width less than prespecified value, xxiv Wiklund SJ, 86 Wilcoxon test, 115–117 application of CP to proportional odds, 117; see also Mann-Whitney
Wittes JT, 37, 59–60, 92 Wolfowitz J, 18 Y
Yates’ corrected tests, 64 Ylvisaker D, 21 Z Z-statistic boundaries, 44 0–1 loss, 99, 100 Zhang J and Zhang JJ, 41 Zubrod CG, 3