Innovative Statistics in Regulatory Science [1 ed.] 9780367224769, 9780429275067, 9781000710816, 9781000710038

Statistical methods that are commonly used in the review and approval process of regulatory submissions are usually referred to as statistics in regulatory science or regulatory statistics.


553 pages, 2019


Table of contents:

Preface


1. Introduction

Introduction

Key Statistical Concepts

Complex Innovative Designs

Practical, Challenging, and Controversial Issues

Aim and Scope of the Book


2. Totality-of-the-Evidence

Introduction

Substantial Evidence

Totality-of-the-Evidence

Practical and Challenging Issues

Development of Index for Totality-of-the-Evidence

Concluding Remarks


3. Hypotheses Testing Versus Confidence Interval

Introduction

Hypotheses Testing

Confidence Interval Approach

Two One-sided Tests Procedure and Confidence Interval Approach

A Comparison

Sample Size Requirement

Concluding Remarks

Appendix of Chapter 3


4. Endpoint Selection

Introduction

Clinical Strategy for Endpoint Selection

Translations Among Clinical Endpoints

Comparison of Different Clinical Strategies

A Numerical Study

Development of Therapeutic Index Function

Concluding Remarks


5. Non-inferiority Margin

Introduction

Non-inferiority Versus Equivalence

Non-inferiority Hypotheses

Methods for Selection of Non-inferiority Margin

Strategy for Margin Selection

Concluding Remarks


6. Missing Data

Introduction

Missing Data Imputation

Marginal/Conditional Imputation for Contingency

Test for Independence

Recent Development

Concluding Remarks


7. Multiplicity

General Concepts

Regulatory Perspective and Controversial Issues

Statistical Methods for Multiplicity Adjustment

Gate-keeping Procedures

Concluding Remarks


8. Sample Size

Introduction

Traditional Sample Size Calculation

Selection of Study Endpoints

Multiple-Stage Adaptive Designs

Adjustment with Protocol Amendments

Multi-Regional Clinical Trials

Current Issues

Concluding Remarks


9. Reproducible Research

Introduction

The Concept of Reproducibility Probability

The Estimated Power Approach

Alternative Methods for Evaluation of Reproducibility Probability

Applications

Future Perspectives


10. Extrapolation

Introduction

Shift in Target Patient Population

Assessment of Sensitivity Index

Statistical Inference

An Example

Concluding Remarks

Appendix of Chapter 10


11. Consistency Evaluation

Introduction

Issues in Multi-regional Clinical Trials

Statistical Methods

Simulation Study

An Example

Other Considerations/Discussions

Concluding Remarks


12. Drug Products with Multiple Components

Introduction

Fundamental Differences

Basic Considerations

TCM Drug Development

Challenging Issues

Recent Development

Concluding Remarks


13. Adaptive Trial Design

Introduction

What Is Adaptive Design

Regulatory/Statistical Perspectives

Impact, Challenges, and Obstacles

Some Examples

Strategies for Clinical Development

Concluding Remarks


14. Selection Criteria in Adaptive Dose Finding

Introduction

Criteria for Dose Selection

Practical Implementation and Example

Clinical Trial Simulations

Concluding Remarks


15. Generic Drugs and Biosimilars

Introduction

Fundamental Differences

Quantitative Evaluation of Generic Drugs

Quantitative Evaluation of Biosimilars

General Approach for Assessment of Bioequivalence/Biosimilarity

Scientific Factors and Practical Issues

Concluding Remarks


16. Precision and Personalized Medicine

Introduction

The Concept of Precision Medicine

Design and Analysis of Precision Medicine

Alternative Enrichment Designs

Concluding Remarks


17. Big Data Analytics

Introduction

Basic Considerations

Types of Big Data Analytics

Bias of Big Data Analytics

Statistical Methods for Estimation of ∆ and μ_P − μ_N

Concluding Remarks


18. Rare Disease Drug Development

Introduction

Basic Considerations

Innovative Trial Designs

Statistical Methods for Data Analysis

Evaluation of Rare Disease Clinical Trials

Some Proposals for Regulatory Consideration

Concluding Remarks


References


Subject Index

Innovative Statistics in Regulatory Science

Chapman & Hall/CRC Biostatistics Series

Series Editors:
Shein-Chung Chow, Duke University School of Medicine
Byron Jones, Novartis Pharma AG
Jen-pei Liu, National Taiwan University
Karl E. Peace, Georgia Southern University
Bruce W. Turnbull, Cornell University

Recently Published Titles:

Cancer Clinical Trials: Current and Controversial Issues in Design and Analysis
Stephen L. George, Xiaofei Wang, Herbert Pang

Data and Safety Monitoring Committees in Clinical Trials, 2nd Edition
Jay Herson

Clinical Trial Optimization Using R
Alex Dmitrienko, Erik Pulkstenis

Mixture Modelling for Medical and Health Sciences
Shu-Kay Ng, Liming Xiang, Kelvin Kai Wing Yau

Economic Evaluation of Cancer Drugs: Using Clinical Trial and Real-World Data
Iftekhar Khan, Ralph Crott, Zahid Bashir

Bayesian Analysis with R for Biopharmaceuticals: Concepts, Algorithms, and Case Studies
Harry Yang and Steven J. Novick

Mathematical and Statistical Skills in the Biopharmaceutical Industry: A Pragmatic Approach
Arkadiy Pitman, Oleksandr Sverdlov, L. Bruce Pearce

Bayesian Applications in Pharmaceutical Development
Mani Lakshminarayanan, Fanni Natanegara

Innovative Statistics in Regulatory Science
Shein-Chung Chow

For more information about this series, please visit: https://www.crcpress.com/go/biostats

Innovative Statistics in Regulatory Science

Shein-Chung Chow Duke University School of Medicine Durham, North Carolina

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2020 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

International Standard Book Number-13: 978-0-367-22476-9 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify it in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

Preface
Author

1. Introduction
  1.1 Introduction
  1.2 Key Statistical Concepts
    1.2.1 Confounding and Interaction
      1.2.1.1 Confounding
      1.2.1.2 Interaction
    1.2.2 Hypotheses Testing and p-values
      1.2.2.1 Hypotheses Testing
      1.2.2.2 p-value
    1.2.3 One-Sided versus Two-Sided Hypotheses
    1.2.4 Clinical Significance and Clinical Equivalence
    1.2.5 Reproducibility and Generalizability
      1.2.5.1 Reproducibility
      1.2.5.2 Generalizability
  1.3 Complex Innovative Designs
    1.3.1 Adaptive Trial Design
      1.3.1.1 Adaptations
      1.3.1.2 Types of Adaptive Design
    1.3.2 The n-of-1 Trial Design
      1.3.2.1 Complete n-of-1 Trial Design
      1.3.2.2 Merits and Limitations
    1.3.3 The Concept of Master Protocols
    1.3.4 Bayesian Approach
  1.4 Practical, Challenging, and Controversial Issues
    1.4.1 Totality-of-the-Evidence
    1.4.2 (1 − α) CI for New Drugs versus (1 − 2α) CI for Generics/Biosimilars
    1.4.3 Endpoint Selection
    1.4.4 Criteria for Decision-Making at Interim
    1.4.5 Non-inferiority or Equivalence/Similarity Margin Selection
    1.4.6 Treatment of Missing Data
    1.4.7 Sample Size Requirement
    1.4.8 Consistency Test
    1.4.9 Extrapolation
    1.4.10 Drug Products with Multiple Components
    1.4.11 Advisory Committee
    1.4.12 Recent FDA Critical Clinical Initiatives
  1.5 Aim and Scope of the Book

2. Totality-of-the-Evidence
  2.1 Introduction
  2.2 Substantial Evidence
  2.3 Totality-of-the-Evidence
    2.3.1 Stepwise Approach
    2.3.2 Fundamental Biosimilarity Assumptions
    2.3.3 Examples—Recent Biosimilar Regulatory Submissions
    2.3.4 Remarks
  2.4 Practical Issues and Challenges
    2.4.1 Link among Analytical Similarity, PK/PD Similarity, and Clinical Similarity
    2.4.2 Totality-of-the-Evidence versus Substantial Evidence
    2.4.3 Same Regulatory Standards
  2.5 Development of Totality-of-the-Evidence
  2.6 Concluding Remarks

3. Hypotheses Testing versus Confidence Interval
  3.1 Introduction
  3.2 Hypotheses Testing
    3.2.1 Point Hypotheses Testing
    3.2.2 Interval Hypotheses Testing
    3.2.3 Probability of Inconclusiveness
  3.3 Confidence Interval Approach
    3.3.1 Confidence Interval Approach with Single Reference
    3.3.2 Confidence Interval Approach with Multiple References
      3.3.2.1 Pairwise Comparisons
      3.3.2.2 Simultaneous Confidence Interval
      3.3.2.3 Example 1 (False Negative)
      3.3.2.4 Example 2 (False Positive)
  3.4 Two One-Sided Tests versus Confidence Interval Approach
    3.4.1 Two One-Sided Tests (TOST) Procedure
    3.4.2 Confidence Interval Approach
      3.4.2.1 Level 1 − α versus Level 1 − 2α
      3.4.2.2 Significance Level versus Size
      3.4.2.3 Sizes of Tests Related to Different Confidence Intervals
    3.4.3 Remarks
  3.5 A Comparison
    3.5.1 Performance Characteristics
    3.5.2 Simulation Studies
    3.5.3 An Example—Binary Responses
  3.6 Sample Size Requirement
  3.7 Concluding Remarks
  Appendix

4. Endpoint Selection
  4.1 Introduction
  4.2 Clinical Strategy for Endpoint Selection
  4.3 Translations among Clinical Endpoints
  4.4 Comparison of Different Clinical Strategies
    4.4.1 Test Statistics, Power and Sample Size Determination
    4.4.2 Determination of the Non-inferiority Margin
    4.4.3 A Numerical Study
      4.4.3.1 Absolute Difference versus Relative Difference
      4.4.3.2 Responders' Rate Based on Absolute Difference
      4.4.3.3 Responders' Rate Based on Relative Difference
  4.5 Development of Therapeutic Index Function
    4.5.1 Introduction
    4.5.2 Therapeutic Index Function
      4.5.2.1 Selection of ω_i
      4.5.2.2 Determination of f_i(.) and the Distribution of e
      4.5.2.3 Derivation of Pr(I_i | e_j) and Pr(e_j | I_i)
  4.6 Concluding Remarks

5. Non-inferiority/Equivalence Margin
  5.1 Introduction
  5.2 Non-inferiority versus Equivalence
    5.2.1 Relationship among Non-inferiority, Equivalence, and Superiority
    5.2.2 Impact on Sample Size Requirement
  5.3 Non-inferiority Hypothesis
    5.3.1 Regulatory Requirements
    5.3.2 Hypothesis Setting and Clinically Meaningful Margin
    5.3.3 Retention of Treatment Effect in the Absence of Placebo
  5.4 Methods for Selection of Non-inferiority Margin
    5.4.1 Classical Method
    5.4.2 FDA's Recommendations
    5.4.3 Chow and Shao's Method
    5.4.4 Alternative Methods
    5.4.5 An Example
    5.4.6 Remarks
  5.5 Strategy for Margin Selection
    5.5.1 Criteria for Risk Assessment
    5.5.2 Risk Assessment with Continuous Endpoints
    5.5.3 Numerical Studies
    5.5.4 An Example
  5.6 Concluding Remarks

6. Missing Data
  6.1 Introduction
  6.2 Missing Data Imputation
    6.2.1 Last Observation Carried Forward
      6.2.1.1 Bias-variance Trade-off
      6.2.1.2 Hypothesis Testing
    6.2.2 Mean/Median Imputation
    6.2.3 Regression Imputation
  6.3 Marginal/Conditional Imputation for Contingency
    6.3.1 Simple Random Sampling
    6.3.2 Goodness-of-Fit Test
  6.4 Test for Independence
    6.4.1 Results Under Stratified Simple Random Sampling
    6.4.2 When Number of Strata Is Large
  6.5 Recent Development
    6.5.1 Other Methods for Missing Data
    6.5.2 The Use of Estimand in Missing Data
    6.5.3 Statistical Methods Under Incomplete Data Structure
      6.5.3.1 Introduction
      6.5.3.2 Statistical Methods for 2 × 3 Crossover Designs with Incomplete Data
      6.5.3.3 A Special Case
      6.5.3.4 An Example
  6.6 Concluding Remarks

7. Multiplicity
  7.1 General Concepts
  7.2 Regulatory Perspective and Controversial Issues
    7.2.1 Regulatory Perspectives
    7.2.2 Controversial Issues
  7.3 Statistical Method for Adjustment of Multiplicity
    7.3.1 Bonferroni Method
    7.3.2 Tukey's Multiple Range Testing Procedure
    7.3.3 Dunnett's Test
    7.3.4 Closed Testing Procedure
    7.3.5 Other Tests
  7.4 Gate-Keeping Procedures
    7.4.1 Multiple Endpoints
    7.4.2 Gate-Keeping Testing Procedures
  7.5 Concluding Remarks

8. Sample Size
  8.1 Introduction
  8.2 Traditional Sample Size Calculation
  8.3 Selection of Study Endpoints
    8.3.1 Translations among Clinical Endpoints
    8.3.2 Comparison of Different Clinical Strategies
  8.4 Multiple-stage Adaptive Designs
  8.5 Sample Size Adjustment with Protocol Amendments
  8.6 Multi-regional Clinical Trials
  8.7 Current Issues
    8.7.1 Is Power Calculation the Only Way?
    8.7.2 Instability of Sample Size
    8.7.3 Sample Size Adjustment for Protocol Amendment
    8.7.4 Sample Size Based on Confidence Interval Approach
  8.8 Concluding Remarks

9. Reproducible Research
  9.1 Introduction
  9.2 The Concept of Reproducibility Probability
  9.3 The Estimated Power Approach
    9.3.1 Two Samples with Equal Variances
    9.3.2 Two Samples with Unequal Variances
    9.3.3 Parallel-Group Designs
  9.4 Alternative Methods for Evaluation of Reproducibility Probability
    9.4.1 The Confidence Bound Approach
    9.4.2 The Bayesian Approach
  9.5 Applications
    9.5.1 Substantial Evidence with a Single Trial
    9.5.2 Sample Size
    9.5.3 Generalizability between Patient Populations
  9.6 Future Perspectives

10. Extrapolation
  10.1 Introduction
  10.2 Shift in Target Patient Population
  10.3 Assessment of Sensitivity Index
    10.3.1 The Case Where ε Is Random and C Is Fixed
    10.3.2 The Case Where ε Is Fixed and C Is Random
    10.3.3 The Case Where Both ε and C Are Random
  10.4 Statistical Inference
    10.4.1 The Case Where ε Is Random and C Is Fixed
    10.4.2 The Case Where ε Is Fixed and C Is Random
    10.4.3 The Case Where ε and C Are Random
  10.5 An Example
    10.5.1 Case 1: ε Is Random and C Is Fixed
    10.5.2 Case 2: ε Is Fixed and C Is Random
    10.5.3 Case 3: ε and C Are Both Random
  10.6 Concluding Remarks
  Appendix

11. Consistency Evaluation
  11.1 Introduction
  11.2 Issues in Multi-regional Clinical Trials
    11.2.1 Multi-center Trials
    11.2.2 Multi-regional, Multi-center Trials
  11.3 Statistical Methods
    11.3.1 Test for Consistency
    11.3.2 Assessment of Consistency Index
    11.3.3 Evaluation of Sensitivity Index
    11.3.4 Achieving Reproducibility and/or Generalizability
      11.3.4.1 Specificity Reproducibility Probability for Inequality Test
      11.3.4.2 Superiority Reproducibility Probability
      11.3.4.3 Reproducibility Probability Ratio for Inequality Test
      11.3.4.4 Reproducibility Probability Ratio for Superiority Test
    11.3.5 Bayesian Approach
    11.3.6 Japanese Approach
    11.3.7 The Applicability of Those Approaches
  11.4 Simulation Study
    11.4.1 The Case of the Matched-Pair Parallel Design with Normal Data and Superiority Test
    11.4.2 The Case of the Two-Group Parallel Design with Normal Data and Superiority Test
    11.4.3 Remarks
  11.5 An Example
  11.6 Other Considerations/Discussions
  11.7 Concluding Remarks

12. Drug Products with Multiple Components—Development of TCM
  12.1 Introduction
  12.2 Fundamental Differences
    12.2.1 Medical Theory/Mechanism and Practice
      12.2.1.1 Medical Practice
    12.2.2 Techniques of Diagnosis
      12.2.2.1 Objective versus Subjective Criteria for Evaluability
    12.2.3 Treatment
      12.2.3.1 Single Active Ingredient versus Multiple Components
      12.2.3.2 Fixed Dose versus Flexible Dose
    12.2.4 Remarks
  12.3 Basic Considerations
    12.3.1 Study Design
    12.3.2 Validation of Quantitative Instrument
    12.3.3 Clinical Endpoint
    12.3.4 Matching Placebo
    12.3.5 Sample Size Calculation
  12.4 TCM Drug Development
    12.4.1 Statistical Quality Control Method for Assessing Consistency
      12.4.1.1 Acceptance Criteria
      12.4.1.2 Sampling Plan
      12.4.1.3 Testing Procedure
      12.4.1.4 Strategy for Statistical Quality Control
      12.4.1.5 Remarks
    12.4.2 Stability Analysis
      12.4.2.1 Models and Assumptions
      12.4.2.2 Shelf-Life Determination
      12.4.2.3 An Example
      12.4.2.4 Discussion
    12.4.3 Calibration of Study Endpoints in Clinical Development
      12.4.3.1 Chinese Diagnostic Procedure
      12.4.3.2 Calibration
      12.4.3.3 Validity
      12.4.3.4 Reliability
      12.4.3.5 Ruggedness
  12.5 Challenging Issues
    12.5.1 Regulatory Requirements
    12.5.2 Test for Consistency
    12.5.3 Animal Studies
    12.5.4 Shelf-Life Estimation
    12.5.5 Indication and Label
  12.6 Recent Development
    12.6.1 Introduction
    12.6.2 Health Index and Efficacy Measure
    12.6.3 Assessment of Efficacy
    12.6.4 Remarks
  12.7 Concluding Remarks

13. Adaptive Trial Design
  13.1 Introduction
  13.2 What Is Adaptive Design?
    13.2.1 Adaptations
    13.2.2 Types of Adaptive Designs
      13.2.2.1 Adaptive Randomization Design
      13.2.2.2 Group Sequential Design
      13.2.2.3 Flexible Sample Size Re-estimation (SSRE) Design
      13.2.2.4 Drop-the-Losers Design
      13.2.2.5 Adaptive Dose Finding Design
      13.2.2.6 Biomarker-Adaptive Design
      13.2.2.7 Adaptive Treatment-Switching Design
      13.2.2.8 Adaptive-Hypotheses Design
      13.2.2.9 Seamless Adaptive Trial Design
      13.2.2.10 Multiple Adaptive Design
  13.3 Regulatory/Statistical Perspectives
  13.4 Impact, Challenges, and Obstacles
    13.4.1 Impact of Protocol Amendments
    13.4.2 Challenges in By Design Adaptations
    13.4.3 Obstacles of Retrospective Adaptations
  13.5 Some Examples
  13.6 Strategies for Clinical Development
  13.7 Concluding Remarks

14. Criteria for Dose Selection
  14.1 Introduction
  14.2 Dose Selection Criteria
    14.2.1 Conditional Power
    14.2.2 Precision Analysis Based on Confidence Interval
    14.2.3 Predictive Probability of Success
    14.2.4 Probability of Being the Best Dose
  14.3 Implementation and Example
    14.3.1 Single Primary Endpoint
    14.3.2 Co-primary Endpoints
    14.3.3 A Numeric Example
  14.4 Clinical Trial Simulation
    14.4.1 Single Primary Endpoint
    14.4.2 Co-primary Endpoints
  14.5 Concluding Remarks

15. Generics and Biosimilars
  15.1 Introduction
  15.2 Fundamental Differences
  15.3 Quantitative Evaluation of Generic Drugs
    15.3.1 Study Design
    15.3.2 Statistical Methods
    15.3.3 Other Criteria for Bioequivalence Assessment
      15.3.3.1 Population Bioequivalence and Individual Bioequivalence (PBE/IBE)
      15.3.3.2 Scaled Average Bioequivalence (SABE)
      15.3.3.3 Scaled Criterion for Drug Interchangeability (SCDI)
      15.3.3.4 Remarks
  15.4 Quantitative Evaluation of Biosimilars
    15.4.1 Regulatory Requirement
    15.4.2 Biosimilarity
      15.4.2.1 Basic Principles
      15.4.2.2 Criteria for Biosimilarity
      15.4.2.3 Study Design
      15.4.2.4 Statistical Methods
    15.4.3 Interchangeability
      15.4.3.1 Definition and Basic Concepts
      15.4.3.2 Switching and Alternating
      15.4.3.3 Study Design
    15.4.4 Remarks
  15.5 General Approach for Assessment of Bioequivalence/Biosimilarity
    15.5.1 Development of Bioequivalence/Biosimilarity Index
    15.5.2 Remarks
  15.6 Scientific Factors and Practical Issues for Biosimilars
    15.6.1 Fundamental Biosimilarity Assumption
    15.6.2 Endpoint Selection
    15.6.3 How Similar Is Similar?
    15.6.4 Guidance on Analytical Similarity Assessment
    15.6.5 Practical Issues
      15.6.5.1 Criteria for Biosimilarity (in Terms of Average, Variability, or Distribution)
      15.6.5.2 Criteria for Interchangeability
      15.6.5.3 Reference Product Changes
      15.6.5.4 Extrapolation
      15.6.5.5 Non-medical Switch
      15.6.5.6 Bridging Studies for Assessing Biosimilarity
  15.7 Concluding Remarks

16. Precision Medicine
  16.1 Introduction
  16.2 The Concept of Precision Medicine
    16.2.1 Definition of Precision Medicine
    16.2.2 Biomarker-Driven Clinical Trials
    16.2.3 Precision Medicine versus Personalized Medicine
  16.3 Design and Analysis of Precision Medicine
    16.3.1 Study Designs
    16.3.2 Statistical Methods
    16.3.3 Simulation Results
  16.4 Alternative Enrichment Designs
    16.4.1 Alternative Designs with/without Molecular Targets
    16.4.2 Statistical Methods
    16.4.3 Remarks
  16.5 Concluding Remarks

17. Big Data Analytics
  17.1 Introduction
  17.2 Basic Considerations
    17.2.1 Representativeness of Big Data
    17.2.2 Selection Bias
    17.2.3 Heterogeneity
    17.2.4 Reproducibility and Generalizability
    17.2.5 Data Quality, Integrity, and Validity
    17.2.6 FDA Part 11 Compliance
    17.2.7 Missing Data
  17.3 Types of Big Data Analytics
    17.3.1 Case-Control Studies
      17.3.1.1 Propensity Score Matching
      17.3.1.2 Model Building
      17.3.1.3 Model Diagnosis and Validation
      17.3.1.4 Model Generalizability
    17.3.2 Meta-analysis
      17.3.2.1 Issues in Meta-analysis
  17.4 Bias of Big Data Analytics
  17.5 Statistical Methods for Estimation of ∆ and μ_P − μ_N
    17.5.1 Estimation of ∆
    17.5.2 Estimation of μ_P − μ_N
    17.5.3 Assumptions and Application
  17.6 Simulation Study
  17.7 Concluding Remarks

18. Rare Diseases Drug Development
  18.1 Introduction
  18.2 Basic Considerations
    18.2.1 Historical Data
    18.2.2 Ethical Consideration
    18.2.3 The Use of Biomarkers
    18.2.4 Generalizability
    18.2.5 Sample Size
  18.3 Innovative Trial Designs
    18.3.1 n-of-1 Trial Design
      18.3.1.1 Complete n-of-1 Trial Design
      18.3.1.2 Merits and Limitations
    18.3.2 Adaptive Trial Design
    18.3.3 Other Designs
      18.3.3.1 Master Protocol
      18.3.3.2 Bayesian Approach
  18.4 Statistical Methods for Data Analysis
    18.4.1 Analysis under a Complete n-of-1 Trial Design
      18.4.1.1 Statistical Model
      18.4.1.2 Statistical Analysis
      18.4.1.3 Sample Size Requirement
    18.4.2 Analysis under an Adaptive Trial Design
      18.4.2.1 Two-Stage Adaptive Design
      18.4.2.2 Remarks
  18.5 Evaluation of Rare Disease Clinical Trials
    18.5.1 Predictive Confidence Interval (PCI)
    18.5.2 Probability of Reproducibility
  18.6 Some Proposals for Regulatory Consideration
    18.6.1 Demonstrating Effectiveness or Demonstrating Not Ineffectiveness
    18.6.2 Two-Stage Adaptive Trial Design for Rare Disease Product Development
    18.6.3 Probability Monitoring Procedure for Sample Size
  18.7 Concluding Remarks

Bibliography
Index

Preface

In pharmaceutical/clinical development of a test drug or treatment, relevant clinical data are usually collected from subjects with the diseases under study in order to evaluate the safety and efficacy of the test drug or treatment under investigation. For approval of pharmaceutical products (including drugs, biological products, and medical devices) in the United States, the Food and Drug Administration (FDA) requires that substantial evidence regarding the safety and effectiveness of the test treatment under investigation be provided in the regulatory submission (Section 314 of 21 CFR). Statistics plays an important role in ensuring the accuracy, reliability, and reproducibility of this substantial evidence. Statistical methods and/or tools that are commonly used in the review and approval process of regulatory submissions are usually referred to as statistics in regulatory science, or regulatory statistics. Thus, in a broader sense, statistics in regulatory science can be defined as valid statistics employed in the review and approval process of regulatory submissions of pharmaceutical products. In addition, statistics in regulatory science also involves the development of regulatory policy and guidance, and research related to regulatory critical clinical initiatives.

To assure the validity of statistics employed in regulatory science, regulatory statistics generally follow three principles. The first principle is to provide an unbiased and reliable assessment of the substantial evidence regarding the safety and effectiveness of the test treatment under investigation. The second is to ensure the quality, validity, and integrity of the data collected to support the substantial evidence required for regulatory approval. The third is to make sure that the observed substantial evidence is not due to chance alone and is reproducible if similar studies were conducted under the same experimental conditions. Thus, to ensure the validity of regulatory statistics, it is suggested that the statistical principles for Good Statistics Practice (GSP) outlined in the ICH (International Conference on Harmonization) E9 guideline be followed (ICH, 2018).

This book is intended to be the first book entirely devoted to the discussion of statistics in regulatory science for pharmaceutical development. The scope of this book is restricted to practical issues that are commonly encountered in regulatory science in the course of pharmaceutical research and development. The book consists of 18 chapters, each dedicated to a topic related to research activities, review of regulatory submissions, recent critical clinical initiatives, and policy/guidance development in regulatory science.

Chapter 1 provides key statistical concepts, innovative designs, and analysis methods that are commonly considered in regulatory science, along with some practical, challenging, and controversial issues that are commonly seen in the review and approval process of regulatory submissions. Chapter 2 provides an interpretation of the substantial evidence required for demonstration of the safety and effectiveness of drug products under investigation; the related concepts of totality-of-the-evidence and real-world evidence are also described. Chapter 3 distinguishes the hypotheses testing and confidence interval approaches for evaluation of the safety and effectiveness of drug products, including new drugs, generic drugs, and biosimilar products, and compares the use of a 90% confidence interval approach for generics/biosimilars with the 95% confidence interval approach for new drugs. Chapter 4 deals with endpoint selection in clinical research and development, including the development of a therapeutic index function for endpoint selection in complex innovative designs such as multiple-stage adaptive trial designs. Chapter 5 focuses on non-inferiority margin selection and proposes a clinical strategy for margin selection based on risk assessment of the false positive rate. Chapters 6 and 7 discuss statistical methods for missing data imputation and multiplicity adjustment for multiple comparisons in clinical trials, respectively. Sample size requirements under various designs are summarized in Chapter 8. Chapter 9 introduces the concept of reproducible research. Chapter 10 discusses the concept of, and statistical methods for, assessment of extrapolation across patient populations and/or indications. Chapter 11 compares statistical methods for evaluation of consistency in multi-regional clinical trials. Chapter 12 provides an overview of drug products with multiple components, such as botanical drug products and traditional Chinese medicine. Chapter 13 provides an overview of adaptive trial designs that are commonly used in clinical research and pharmaceutical development. Chapter 14 emphasizes the selection and evaluation of several criteria that are commonly considered in adaptive dose finding studies. Chapter 15 compares bioequivalence assessment for generic drug products with biosimilarity assessment for biosimilar products, and proposes a general approach for assessment of bioequivalence for generics and biosimilarity for biosimilars. Chapter 16 discusses the difference between precision medicine and personalized medicine. Chapter 17 introduces the concept of big data analytics, including the types of big data analytics and their potential bias. Chapter 18 focuses on rare disease clinical development, including innovative trial designs, statistical methods for data analysis, and some commonly seen challenging issues.

From Taylor & Francis Group, I would like to thank Mr. David Grubbs for providing me with the opportunity to work on this book. I would like to thank my wife, Dr. Annpey Pong, for her understanding, encouragement, and constant support during the preparation of the book. In addition, I would like to thank colleagues from Duke University School of Medicine and from the Office of Biostatistics (OB), Office of Translational Science (OTS), Center for Drug Evaluation and Research (CDER), US FDA for their support during the preparation of this book. I wish to express my gratitude to many friends in the pharmaceutical industry for their encouragement and support. Finally, the views expressed are those of the author and not necessarily those of Duke University School of Medicine or the US FDA. I am solely responsible for the contents of this book and any errors in it. Any comments and suggestions that will lead to improvements in future revisions of the book are very much appreciated.

Shein-Chung Chow, PhD
Duke University School of Medicine
Durham, North Carolina

Author

Shein-Chung Chow, PhD, is a professor of Biostatistics and Bioinformatics at Duke University School of Medicine, Durham, NC. Dr. Chow is also a special government employee (SGE) appointed by the US Food and Drug Administration (FDA) as an advisory committee voting member and statistical advisor to the FDA. Between 2017 and 2019, Dr. Chow was on leave at the FDA as an associate director in the Office of Biostatistics (OB), Center for Drug Evaluation and Research (CDER), FDA. Dr. Chow is editor-in-chief of the Journal of Biopharmaceutical Statistics and editor-in-chief of the Biostatistics Book Series at Chapman & Hall/CRC Press, Taylor & Francis Group. Dr. Chow is a fellow of the American Statistical Association and the author or co-author of over 300 methodology papers and 30 books, including Innovative Statistics in Regulatory Science (Chapman and Hall/CRC Press).

1 Introduction

1.1 Introduction For approval of pharmaceutical products (including drugs, biological products, and medical devices) in the United States the Food and Drug Administration (FDA) requires that substantial evidence regarding the safety and effectiveness of the test treatment under investigation be provided in the regulatory submission process Section 314 of 21 Codes of Federal Regulation (CFR). The substantial evidence regarding safety and effectiveness of the test treatment under investigation will then be evaluated by the reviewers (including statistical reviewers, medical reviewers, and reviewers from other relevant disciplines) in the review and approval process of the test treatment under investigation. Statistics plays an important role to ensure the accuracy, reliability, and reproducibility of the substantial evidence obtained from the studies conducted in the process of product development. Statistical methods and/or tools that are commonly used in the review and approval process of regulatory submissions are usually referred to as statistics in regulatory science or regulatory statistics. Thus, in a broader sense, regulatory statistics can be defined as valid statistics that may be used in the review and approval of regulatory submissions of pharmaceutical products. The purpose of regulatory statistics is to provide an objective, unbiased and reliable assessment of the test treatment under investigation. Regulatory statistics generally follow several principles to ensure the validity of the statistics used in the review and approval process of regulatory submissions. The first principle is to provide unbiased and reliable assessment of the substantial evidence regarding the safety and effectiveness of the test treatment under investigation. The second principle is to ensure quality, validity and integrity of the data collected for supporting the substantial evidence required for regulatory approval. The third principle is to make sure that the observed substantial evidence is not by chance alone and it is reproducible if the same studies were conducted under similar experimental conditions. To ensure the validity of regulatory statistics, it is suggested that statistical principles for Good Statistics Practice (GSP) that outlined in the ICH (International Conference Harmonization) E9 guideline should be followed (ICH, 2018). 1


The general statistical principles (or key statistical concepts) are the foundation of GSP in regulatory science; they not only ensure the quality, validity, and integrity of the intended clinical research during the process of pharmaceutical development, but also provide an unbiased and reliable assessment of the test treatment under investigation. Key statistical concepts include, but are not limited to, confounding and interaction; hypotheses testing and p-values; one-sided versus two-sided hypotheses; clinical significance/equivalence; and reproducibility and generalizability. In practice, some challenging and controversial issues may arise in the review and approval process of regulatory submissions. These issues include totality-of-the-evidence versus substantial evidence; confusion between the use of the (1 − α) × 100% confidence interval (CI) approach for the evaluation of new drugs versus the use of the (1 − 2α) × 100% CI approach for the assessment of generics/biosimilars; endpoint selection; selection of proper criteria for decision-making at interim; non-inferiority or equivalence/similarity margin selection; treatment of missing data; the issue of multiplicity; sample size requirements; consistency tests in multi-regional trials; extrapolation; drug products with multiple components; and the role of Advisory Committees (e.g., the Oncologic Drugs Advisory Committee). In addition, there are several critical clinical initiatives recently established by the FDA. These critical clinical initiatives concern precision and/or personalized (individualized) medicine; biomarker-driven clinical research; complex innovative design (CID); model-informed drug development (MIDD); rare diseases drug development; big data analytics; real-world data and real-world evidence; and machine learning for mobile individualized medicine (MIM) and imaging medicine (IM). In Section 1.2, some key statistical concepts are briefly introduced. Section 1.3 describes some complex innovative designs and corresponding statistical methods. These complex innovative designs include adaptive trial designs, the complete n-of-1 trial design, master protocols, and the Bayesian approach. Challenging and controversial issues that are commonly encountered in the review and approval process of regulatory submissions are outlined in Section 1.4, along with an introduction to the FDA's recent critical clinical initiatives. Section 1.5 provides the aim and scope of the book.

1.2 Key Statistical Concepts

1.2.1 Confounding and Interaction

In pharmaceutical/clinical research and development, confounding and interaction effects are probably the most common distortions in the evaluation of a test treatment under investigation. Confounding effects are contributed by various factors, such as race and gender, that cannot be separated by the


design under study, while an interaction effect between factors is a joint effect of one or more contributing factors (Chow and Liu, 2013). Confounding and interaction effects are important considerations in pharmaceutical/clinical development. For example, when confounding effects are observed, we cannot assess the treatment effect because it has been contaminated. On the other hand, when interactions among factors are observed, the treatment must be carefully evaluated to isolate those effects.

1.2.1.1 Confounding

In clinical trials, there are many sources of variation that have an impact on the primary clinical endpoints for the evaluation of a new regimen or intervention. If some of these variations are not identified and properly controlled, they can become mixed in with the treatment effect that the trial is designed to demonstrate. The treatment effect is then said to be confounded by the effects of these variations. To gain a better understanding, consider the following example. Suppose that last winter Dr. Smith noticed that the temperature in the emergency room of a hospital was relatively low and caused some discomfort among medical personnel and patients. Dr. Smith suspected that the heating system might not be functioning properly and called on a technician to improve it. As a result, the temperature of the emergency room was at a comfortable level this winter. However, this winter is not as cold as last winter. Therefore, it is not clear whether the improvement (temperature control) in the emergency room was due to the improvement in the heating system or the effect of a warmer winter. In fact, the effect due to the improvement of the heating system and that due to a warmer winter are confounded and cannot be separated from each other. In clinical trials, there are many subtle, unrecognizable, and seemingly innocent confounding factors that can cause ruinous results. Moses (1985) discussed an example of a devastating result in which the confounder was the personal choice of a patient. The example concerns a polio-vaccine trial that was conducted on two million children worldwide to investigate the effect of the Salk poliomyelitis vaccine. This trial reported that the incidence rate of polio was lower in the children whose parents refused injection than in those who received the placebo after their parents gave permission (Meier, 1989). After an exhaustive examination of the data, it was found that susceptibility to poliomyelitis was related to the differences between the families who gave permission and those who did not. In many cases, confounding factors are inherent in the design of the studies. For example, dose-titration studies in escalating levels are often used to investigate the dose-response relationship of antihypertensive agents during the phase II stage of clinical development. For a typical dose-titration study, after a washout period during which previous medication stops and the placebo is prescribed, N subjects start at the lowest dose for a prespecified time interval. At the end of the interval, each patient is evaluated as a


responder to the treatment or a non-responder according to some criteria prespecified in the protocol. In a titration study, a subject will continue to receive the next higher dose if he or she fails, at the current level, to meet some objective physiological criteria, such as reduction of diastolic blood pressure by a specific amount, and has not experienced any unacceptable adverse experience. Figure 1.1 provides a graphical presentation of a typical titration study (Shih et al., 1989). Dose-titration studies are quite popular among clinicians because they mimic real clinical practice in the care of patients. The major problem with this typical design for a dose-titration study is that the dose-response relationship is often confounded with the time course and the unavoidable carryover effects from the previous dose levels, which cannot be estimated and eliminated. One can always argue that the relationship found in a dose-titration study is due not to the dose but to time. Statistical methods for binary data from dose-titration studies have been suggested under some assumptions (e.g., see Chuang, 1987; Shih et al., 1989). Because the dose level is confounded with time, estimation of the dose-response relationship based on continuous data has not yet been resolved in general. Another type of design that can induce confounding problems when conducted inappropriately is the crossover design. For a standard 2 × 2 crossover design, each subject is randomly assigned to one of the two sequences.

FIGURE 1.1 Graphical display of a titration trial. di, the ith dose level; si, the number of subjects who responded at the ith dose; wi, the number of subjects who withdrew at the ith dose; and m, the number of subjects who completed the study without a response. (From Shih, W.J. et al., Stat. Med., 8, 583–591, 1989.)


In sequence 1, subjects receive the reference (or control) treatment at the first dosing period and the test treatment at the second dosing period after a washout period of sufficient length. The order of treatments is reversed for the subjects in sequence 2. The issues in the analysis of data from a 2 × 2 crossover design are twofold. First, unbiased estimates of the treatment effect cannot be obtained from the data of both periods in the presence of a nonzero carryover effect. Second, the carryover effect is confounded with the sequence effect and the treatment-by-period interaction. In the absence of a significant sequence effect, however, an unbiased estimate of the treatment effect can be obtained from the data of both periods. In practice, it is not clear whether an observed statistically significant sequence effect (or carryover effect) is a true sequence effect (or carryover effect). This remains a major drawback of the standard 2 × 2 crossover design: the primary interest is to estimate the treatment effect, which remains an issue in the presence of a significant nuisance parameter. The sequence and carryover effects, however, are not confounded with each other in higher-order crossover designs that compare two treatments, and such designs can provide unbiased estimation of the treatment effect in the presence of a significant carryover effect (Chow and Liu, 1992a, 2000, 2008). Bailar (1992) provided another example of subtle and unrecognizable confounding factors. Wilson et al. (1985) and Stampfer et al. (1985) both reported results on the incidence of cardiovascular disease in postmenopausal women who had been taking hormones compared to those who had not. Their conclusions, however, were quite different. One reported that the incidence rate of cardiovascular disease among the women taking hormones was twice that in the control group, while the other reported a totally opposite result, in which the incidence in the experimental group was only half that of women who were not taking hormones. Although these trials were not randomized studies, both were well planned and conducted, and both had carefully considered the differences in known risk factors between the two groups in each study. As a result, the puzzling difference between the two studies may be due to some subtle confounding factors, such as the dose of hormones, study populations, research methods, or other related causes. This example indicates that it is imperative to identify and take into account all confounding factors for the two adequate, well-controlled studies that are required for demonstration of the effectiveness and safety of the study medication under review. In clinical trials, it is not uncommon for some subjects not to follow instructions concerning taking the prescribed dose at the scheduled time as specified in the protocol. If the treatment effect is related to (or confounded with) patients' compliance, any estimates of the treatment effect are biased unless there is a placebo group in which the differences in treatment effects between subjects with good compliance and poor compliance can be estimated. As a result, interpretation and extrapolation of the findings are inappropriate. In practice, it is very difficult to identify compliers and noncompliers


and to quantify the relationship between treatment and compliance. On the other hand, subject withdrawals or dropouts from clinical trials are the ultimate examples of noncompliance. There are several possible reasons for dropouts. For example, a subject with severe disease who does not improve may drop out of the study. This will bias the estimate of the treatment effect in favor of a false positive efficacy if the subjects with mild disease remain and improve. On the other hand, if the subjects whose conditions improve withdraw from a study, and those who do not improve remain until the scheduled end of the study, the estimate of efficacy will be biased and hence indicate a false negative efficacy. Noncompliance and subject dropouts are only two of the many confounding factors that can occur in many aspects of clinical trials. If there is an unequal proportion of subjects who withdraw from the study or comply with the dosing regimen among different treatment groups, it is very important to perform an analysis of these two groups of subjects to determine whether confounding factors exist and the direction of any possible bias. In addition, every effort must be made to continue subsequent evaluation of withdrawals for primary clinical endpoints such as survival or any serious adverse events. For analyses of data with noncompliance or withdrawals, it is suggested that an intention-to-treat analysis be performed. An intention-to-treat analysis includes all available data based on all randomized subjects, with the degree of compliance or reasons for withdrawal as possible covariates.

1.2.1.2 Interaction

The objective of a statistical investigation of interaction is to determine whether the joint contribution of two or more factors is the same as the sum of the contributions from each factor when considered alone. The factors may be different drugs, different doses of two drugs, or some stratification variables such as severity of underlying disease, gender, or other important covariates. To illustrate the concept of statistical interaction, we consider the Second International Study of Infarct Survival (ISIS-2 Group, 1988). This study employed a 2 × 2 factorial design (two factors with two levels each) to study the effect of streptokinase and aspirin in the reduction of vascular mortality in patients with suspected acute myocardial infarction. The two factors are a one-hour intravenous infusion of 1.5 MU of streptokinase and one month of 150 mg per day of enteric-coated aspirin. The two levels of each factor are the active treatment and its respective placebo (infusion or tablets). A total of 17,187 patients were enrolled in this study. The numbers of patients randomized to each arm are given in Table 1.1. The key efficacy endpoint is the cumulative vascular mortality within 35 days after randomization. Table 1.2 provides the cumulative vascular mortality for each of the four arms as well as those for streptokinase and aspirin alone.


TABLE 1.1
Treatment of ISIS-2 with Number of Patients Randomized

                    IV Infusion of Streptokinase
Aspirin          Active      Placebo       Total
Active            4,292        4,295       8,587
Placebo           4,300        4,300       8,600
Total             8,592        8,595      17,187

Source: ISIS-2 Group, Lancet, 13, 349–360, 1988.

TABLE 1.2
Cumulative Vascular Mortality in Days 0–35 of ISIS-2

                    IV Infusion of Streptokinase
Aspirin          Active      Placebo       Total
Active             8.0%        10.7%        9.4%
Placebo           10.4%        13.2%       11.8%
Total              9.2%        12.0%

Source: ISIS-2 Group, Lancet, 13, 349–360, 1988.

From Table 1.2, the mortality of the streptokinase group is about 9.2%, with the corresponding placebo mortality being 12.0%. The improvement in mortality rate attributed to streptokinase is 2.8% (12.0% − 9.2%). This is referred to as the main effect of streptokinase. Similarly, the main effect of aspirin tablets can also be estimated from Table 1.2 as 2.4% (11.8% − 9.4%). From Table 1.2, the joint contribution of both streptokinase and aspirin to the improvement in mortality is 5.2% (13.2% − 8.0%), which is exactly equal to the contribution in mortality by streptokinase (2.8%) plus that by aspirin (2.4%). This is a typical example in which no interaction exists between streptokinase and aspirin, because the reduction in mortality from joint administration of streptokinase and aspirin equals the sum of the reductions in mortality attributed to each antithrombotic agent when administered alone. In other words, the difference between the two levels of one factor does not depend on the level of the other factor. For example, the difference in vascular mortality between streptokinase and placebo for the patients taking aspirin tablets is 2.7% (10.7% − 8.0%). A similar difference of 2.8% is observed between streptokinase (10.4%) and placebo (13.2%) for the patients taking placebo tablets. Therefore, the reduction in mortality attributed to streptokinase is homogeneous across the two levels of aspirin tablets. As a result, there is no interaction between streptokinase infusion and aspirin tablets. The ISIS-2 trial provides an example of an investigation of interaction between two treatments. However, in clinical trials it is common to check for interaction between treatment and other important prognostic and stratification factors. For example, almost all adequate well-controlled studies for
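The arithmetic of this interaction check can be written down directly. The following is a minimal sketch (not from the book) using the cell rates of Table 1.2; the simple averages used for the marginal rates assume the four arms are of nearly equal size, which holds for ISIS-2 up to rounding.

```python
# Minimal sketch: checking for interaction in a 2 x 2 factorial design,
# using the ISIS-2 cumulative vascular mortality rates (%) of Table 1.2.

# rates[(aspirin_level, streptokinase_level)], in percent
rates = {
    ("active", "active"): 8.0,
    ("active", "placebo"): 10.7,
    ("placebo", "active"): 10.4,
    ("placebo", "placebo"): 13.2,
}

# Marginal rates (simple averages; the published totals are sample-size
# weighted, but the four arms are nearly equal in size)
sk_active = (rates[("active", "active")] + rates[("placebo", "active")]) / 2
sk_placebo = (rates[("active", "placebo")] + rates[("placebo", "placebo")]) / 2
asp_active = (rates[("active", "active")] + rates[("active", "placebo")]) / 2
asp_placebo = (rates[("placebo", "active")] + rates[("placebo", "placebo")]) / 2

main_sk = sk_placebo - sk_active     # ~2.8% reduction due to streptokinase
main_asp = asp_placebo - asp_active  # ~2.4% reduction due to aspirin
joint = rates[("placebo", "placebo")] - rates[("active", "active")]  # 5.2%

# No interaction means the joint effect equals the sum of the main effects
interaction = joint - (main_sk + main_asp)
print(f"streptokinase: {main_sk:.1f}%, aspirin: {main_asp:.1f}%, "
      f"joint: {joint:.1f}%, interaction: {interaction:.1f}%")
```

Running the sketch gives an interaction term of essentially zero, which is the numerical content of the statement that the two effects are additive.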


the establishment of effectiveness and safety for approval of pharmaceutical agents are multicenter studies. For multicenter trials, the FDA requires that the treatment-by-center interaction be examined to evaluate whether the treatment effect is consistent across all centers.

1.2.2 Hypotheses Testing and p-values

1.2.2.1 Hypotheses Testing

In clinical trials a hypothesis is a postulation, assumption, or statement made about the population regarding the efficacy, safety, or other pharmacoeconomic outcomes (e.g., quality of life) of a test treatment under study. This statement or hypothesis is usually a scientific question that needs to be investigated. A clinical trial is often designed to address the question by translating it into specific study objective(s). Once the study objective(s) has been carefully selected and defined, a random sample can be drawn through an appropriate study design to evaluate the hypothesis about the drug product. For example, a scientific question regarding a test treatment, say treatment A, could be either (i) "Is the mortality reduced by treatment A?" or (ii) "Is treatment A superior to treatment B in treating hypertension?" For these questions, the null hypotheses are that (i) there is no difference between treatment A and the placebo in the reduction of mortality and (ii) there is no difference between treatment A and treatment B in treating hypertension, respectively. The alternative hypotheses are that (i) treatment A reduces the mortality and (ii) treatment A is superior to treatment B in treating hypertension, respectively. These scientific questions or hypotheses to be tested can then be translated into specific study objectives. Chow and Liu (2000) recommended the following steps be taken to perform hypotheses testing:

Step 1. Choose the null hypothesis that is to be questioned.
Step 2. Choose an alternative hypothesis that is of particular interest to the investigators.
Step 3. Select a test statistic, and define the rejection region (or a rule) for decision making about when to reject the null hypothesis and when not to reject it.
Step 4. Draw a random sample by conducting a clinical trial.
Step 5. Calculate the test statistic and its corresponding p-value.
Step 6. Draw a conclusion according to the predetermined rule specified in Step 3.

When performing hypotheses testing, basically two kinds of errors (i.e., type I error and type II error) can occur. Table 1.3 summarizes the relationship between type I and type II errors when testing hypotheses.


TABLE 1.3
Relationship Between Type I and Type II Errors

                          If H0 is
When                True             False
Fail to reject      No error         Type II error
Reject              Type I error     No error

FIGURE 1.2 Relationship between probabilities of type I and type II errors.

A graph based on the null hypothesis of no difference is presented in Figure 1.2 to illustrate the relationship between α and β (or power) under H0 for various alternatives at α = 5% and 10%. It can be seen that α decreases as β increases and α increases as β decreases. The only way to decrease both α and β is to increase the sample size. In clinical trials a typical approach is to first choose a significance level α and then select a sample size to achieve a desired power. In other words, a sample size is chosen to reduce the type II error such that β is within an acceptable range at a prespecified significance level α. From Table 1.3 and Figure 1.2 it can be seen that α and β depend on the selected null and alternative hypotheses. As indicated earlier, the hypothesis to be questioned is usually chosen as the null hypothesis, and the alternative hypothesis is usually of particular interest to the investigators. In practice, the choice of the null hypothesis and the alternative hypothesis has an impact on the parameter to be tested. Chow and Liu (2000) indicate that the null hypothesis may be selected based on the importance of the type I error. In either case, however, it should be noted that we would never be able to prove that H0 is true even though the data fail to reject it.
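The interplay among α, β, and the sample size can be made concrete with the normal approximation for a two-sample comparison of means. The following is a minimal sketch (not from the book); the standardized effect size of 0.5 and the per-arm sample sizes are illustrative assumptions.

```python
# Minimal sketch: how alpha, beta, and sample size interact for a
# two-sided, two-sample z-test (normal approximation).
from scipy.stats import norm

def type_ii_error(alpha, n_per_arm, effect_size):
    """Approximate beta, ignoring the negligible far tail."""
    z_crit = norm.ppf(1 - alpha / 2)
    noncentrality = effect_size * (n_per_arm / 2) ** 0.5
    return norm.cdf(z_crit - noncentrality)

# For fixed n, lowering alpha raises beta (and vice versa)
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(alpha, n_per_arm=50, effect_size=0.5)
    print(f"alpha={alpha:.2f}: beta={beta:.3f}, power={1 - beta:.3f}")

# Increasing n is the only way to reduce alpha and beta simultaneously
for n in (50, 100, 150):
    beta = type_ii_error(0.05, n_per_arm=n, effect_size=0.5)
    print(f"n per arm={n}: beta={beta:.3f}")
```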


1.2.2.2 p-value

In medical literature, p-values are often used to summarize the results of clinical trials in a probabilistic way. The probability statement indicates that a difference at least as great as that observed would occur in fewer than 1 in 100 trials if a 1% level of significance were chosen, or in fewer than 1 in 20 trials if a 5% level of significance were selected, provided that the null hypothesis of no difference between treatments is true and the assumed statistical model is correct. In practice, the smaller the p-value, the stronger the result. However, the meaning of a p-value may not be well understood. The p-value is a measure of the chance that a difference at least as great as the observed difference would occur if the null hypothesis is true. Therefore, if the p-value is small, the null hypothesis is unlikely to be true, and the observed difference is unlikely to have occurred due to chance alone. The p-value is usually derived from a statistical test that depends on the size and direction of the effect (a null hypothesis and an alternative hypothesis). To show this, consider testing the following hypotheses at the 5% level of significance:

H0: There is no difference vs. Ha: There is a difference.    (1.1)

The statistical test for the above hypotheses is usually referred to as a two-sided test. If the null hypothesis (i.e., H0) of no difference is rejected at the 5% level of significance, then we conclude that there is a significant difference between the drug product and the placebo. In this case we may further evaluate whether the trial size is large enough to effectively detect a clinically important difference (i.e., a difference that will lead the investigators to believe the drug is of clinical benefit and hence effective) when such a difference exists. Typically, the FDA requires at least 80% power for detecting such a difference. In other words, the FDA requires that there be at least an 80% chance of correctly detecting such a difference when the difference indeed exists. Figure 1.3 displays the sampling distribution of a two-sided test under the null hypothesis in (1.1). It can be seen from Figure 1.3 that a two-sided test has an equal chance to show that the drug is either effective (one side) or ineffective (the other side). In Figure 1.3, C and −C are critical values. The area under the probability curve between −C and C constitutes the acceptance region for the null hypothesis; any observed difference in means in this region is supportive of the null hypothesis. The area under the probability curve below −C and beyond C is the rejection region; an observed difference in means in this region casts doubt on the null hypothesis. Based on this concept, we can statistically evaluate whether the null hypothesis is a true statement. Let μD and μP be the population means of the primary efficacy variable of the drug product and the placebo, respectively. Under the null hypothesis of no difference (i.e., μD = μP), a statistical test, say T, can be derived. Suppose that t, the observed difference in means of the drug product and the placebo, is a realization of T. Under the null hypothesis


FIGURE 1.3 Sampling distribution of two-sided test.

we can expect that the majority of the values of t will fall around the center, μD − μP = 0. There is a 2.5% chance that t will fall in each tail; that is, there is a 2.5% chance that t will be either below the critical value −C or beyond the critical value C. If t falls below −C, then the drug is worse than the placebo. On the other hand, if t falls beyond C, then the drug is superior to the placebo. In both cases we would suspect the validity of the statement under the null hypothesis. Therefore, we would reject the null hypothesis of no difference if t > C or t < −C. Furthermore, we may want to evaluate how strong the evidence is. In this case, we calculate the area under the probability curve beyond the point t. This area is known as the observed p-value. Therefore, the p-value is the probability that a result at least as extreme as that observed would occur by chance if the null hypothesis is true. It can be seen from Figure 1.3 that the p-value is less than 0.05 if and only if t > C or t < −C. A smaller p-value indicates that t is further away from the center (i.e., μD − μP = 0) and consequently provides stronger evidence in support of the alternative hypothesis of a difference. In practice, we can construct a confidence interval for μD − μP. If the constructed confidence interval does not contain 0, then we reject the null hypothesis of no difference at the 5% level of significance. It should be noted that the above evaluations of the null hypothesis reach the same conclusion regarding its rejection. However, a typical approach is to present the observed p-value. If the observed p-value is less than the level of significance, then the investigators would reject the null hypothesis in favor of the alternative hypothesis.
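The correspondence between the rejection rule and the observed p-value can be illustrated with a short sketch, assuming the test statistic T is standard normal under H0 (so C is about 1.96 at the 5% level); the observed value t_obs below is hypothetical.

```python
# Minimal sketch: two-sided p-value and rejection rule for a test
# statistic that is standard normal under H0.
from scipy.stats import norm

C = norm.ppf(0.975)  # two-sided critical value at the 5% level (~1.96)
t_obs = 2.3          # hypothetical observed test statistic

# Area under the probability curve beyond |t_obs|, in both tails
p_value = 2 * (1 - norm.cdf(abs(t_obs)))

reject = abs(t_obs) > C
print(f"t={t_obs}, C={C:.2f}, p-value={p_value:.4f}, reject H0: {reject}")
# Note: p-value < 0.05 if and only if |t| > C, as stated in the text.
```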


Although p-values measure the strength of evidence by indicating the probability that a result at least as extreme as that observed would occur due to random variation alone under the null hypothesis, they do not reflect sample size or the direction of the treatment effect. In practice, p-values are a way of reporting the results of statistical analyses, and it may be misleading to equate p-values with decisions. Therefore, in addition to p-values, it is recommended that the investigators also report summary statistics, confidence intervals, and the power of the tests used. Furthermore, the effects of selection or multiplicity should also be reported.

1.2.3 One-Sided versus Two-Sided Hypotheses

For marketing approval of a drug product, current FDA regulations require that substantial evidence of effectiveness and safety of the drug product be provided. Substantial evidence can be obtained through the conduct of two adequate well-controlled clinical trials. The evidence is considered substantial if the results from the two adequate well-controlled studies are consistent in the positive direction; in other words, both trials show that the drug product is significantly different from the placebo in the positive direction. If the primary objective of a clinical trial is to establish that the test drug under investigation is superior to an active control agent, the trial is referred to as a superiority trial (ICH, 1998). However, the hypotheses given in (1.1) do not specify the direction once the null hypothesis is rejected. As an alternative, the following hypotheses are proposed:

H0: There is no difference vs. Ha: The drug is better than placebo.    (1.2)

The statistical test for the above hypotheses is known as a one-sided test. If the null hypothesis of no difference is rejected at the 5% level of significance, then we conclude that the drug product is better than the placebo and hence is effective. Figure 1.4 gives the rejection region of a one-sided test. To further compare one-sided and two-sided tests, let us consider the level of proof required for marketing approval of a drug product at the 5% level of significance. For a given clinical trial, if a two-sided test is employed, the level of proof required is one out of 40. In other words, at the 5% level of significance, there is a 2.5% chance (or one out of 40) that we may reject the null hypothesis of no difference in the positive direction and conclude that the drug is effective on one side. On the other hand, if a one-sided test is used, the level of proof required is one out of 20. It turns out that the one-sided test allows more ineffective drugs to be approved by chance as compared to the two-sided test. As indicated earlier, to demonstrate the effectiveness and safety of a drug product, the FDA requires that two adequate well-controlled clinical trials be conducted. The level of proof required should then be squared regardless of which test is used. Table 1.4 summarizes the levels of proof required for the marketing approval of a drug product. As Table 1.4 indicates, the levels of proof required


FIGURE 1.4 Sampling distribution of one-sided test.

TABLE 1.4
Level of Proof Required for Clinical Investigation

                         Type of Tests
Number of Trials     One-Sided     Two-Sided
One trial            1/20          1/40
Two trials           1/400         1/1600

for one-sided and two-sided tests with two trials are one out of 400 and one out of 1,600, respectively. Fisher (1991) argues that the level of proof of one out of 400 is strong proof and is sufficient to be considered substantial evidence for marketing approval, so the one-sided test is appropriate. However, there is no universal agreement among the regulatory agencies (e.g., FDA), academia, and the pharmaceutical industry as to whether a one-sided test or a two-sided test should be used. The concern raised is based on the following two reasons:

1. Investigators would not run a trial if they thought the drug would be worse than the placebo. They would study the drug only if they believe that it might be of benefit.
2. When testing at the 0.05 significance level with 80% power, the required sample size is increased by 27% for a two-sided test as opposed to a one-sided test (as verified in the sketch below). As a result, there is a substantial impact on cost when a one-sided test is used.

It should be noted that although investigators may believe that a drug is better than the placebo, it is possible that the placebo turns out to be superior to the drug (Fleiss, 1987).
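The 27% figure in reason 2 follows from the normal approximation for sample size, under which the required n is proportional to (z_alpha + z_beta)² for a fixed effect size. The following minimal sketch verifies it.

```python
# Minimal sketch: relative sample size of a two-sided versus a one-sided
# test at alpha = 0.05 and 80% power (normal approximation).
from scipy.stats import norm

alpha, power = 0.05, 0.80
z_beta = norm.ppf(power)

# Required n is proportional to (z_alpha + z_beta)^2 for a given effect size
n_one_sided = (norm.ppf(1 - alpha) + z_beta) ** 2
n_two_sided = (norm.ppf(1 - alpha / 2) + z_beta) ** 2

print(f"relative sample size: {n_two_sided / n_one_sided:.2f}")  # ~1.27
```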


Ellenberg (1990) indicates that the use of a one-sided test is usually a signal that the trial has too small a sample size and that the investigators are attempting to squeeze out a significant result by a statistical maneuver. These observations certainly argue against the use of a one-sided test for the evaluation of effectiveness in clinical trials. Cochran and Cox (1957) suggest that a one-sided test be used when it is known that the drug must be at least as good as the placebo, while a two-sided test is used when it is not known which treatment is better. As indicated by Dubey (1991), the FDA tends to oppose the use of a one-sided test. However, this position has been challenged at administrative hearings by several drug sponsors on behalf of Drug Efficacy Study Implementation (DESI) drugs. As an example, Dubey (1991) points out that several views favoring the use of a one-sided test were discussed in an administrative hearing. Some drug sponsors argued that the one-sided test is appropriate in the following situations: (i) where there is truly concern only with outcomes in one tail and (ii) where it is completely inconceivable that the results could go in the opposite direction. In this hearing the sponsors inferred that the prophylactic value of the combination drug is greater than that posited by the null hypothesis of equal incidence, and therefore the risk of finding an effect when none in fact exists is located only in the upper tail. As a result, a one-sided test is called for. However, the FDA feels that a two-sided test should be applied to account not only for the possibility that the combination drugs are better than the single agent alone at preventing candidiasis but also for the possibility that they are worse at doing so. Dubey's opinion is that one-sided tests may be justified in some situations, such as toxicity studies, safety evaluation, analysis of occurrences of adverse drug reaction data, risk evaluation, and laboratory research data. Fisher (1991) argues that one-sided tests are appropriate for drugs that are tested against placebos at the 0.05 level of significance in two well-controlled trials. If, on the other hand, only one clinical trial rather than two is conducted, a one-sided test should be applied at the 0.025 level of significance. However, Fisher agrees that two-sided tests are more appropriate for active control trials. It is critical to specify the hypotheses to be tested in the protocol; a one-sided or two-sided test can then be justified based on the hypotheses. It should be noted that the FDA is against a post hoc decision to create significance or near significance on any parameter when significance did not previously exist. Such a switch cannot be adequately explained and hence is considered an invalid practice by the FDA. More discussion regarding the use of one-sided versus two-sided tests from the perspectives of the pharmaceutical industry, academia, an FDA Advisory Committee member, and the FDA can be found in Peace (1991), Koch (1991), Fisher (1991), and Dubey (1991), respectively.

1.2.4 Clinical Significance and Clinical Equivalence

As indicated in the hypotheses of (1.1), the objective of most clinical trials is to detect the existence of a predefined clinical difference using a statistical


testing procedure such as the unpaired two-sample t-test. If this predefined difference is clinically meaningful, then it is of clinical significance. If the null hypothesis in (1.1) is rejected at the α level of significance, then we conclude that a statistically significant difference exists between treatments. In other words, an observed difference that is unlikely to occur by chance alone is considered a statistically significant difference. However, a statistically significant difference depends on the sample size of the trial. A trial with a small sample size usually provides little information regarding the efficacy and safety of the test drug under investigation, whereas a trial with a large sample size provides substantial evidence of the efficacy and safety of the test drug product. An observed statistically significant difference that is of little or no clinical significance will not be able to address the scientific/clinical questions that the clinical trial was intended to answer in the first place. The magnitude of a clinically significant difference varies. In practice, no precise definition exists for a clinically significant difference, which depends on the disease, indication, therapeutic area, class of drugs, and the primary efficacy and safety endpoints. For example, for antidepressant agents (e.g., Serzone), a change from baseline of 8 on the Hamilton depression (Ham-D) scale, or a 50% reduction from baseline on the Ham-D scale with a baseline score over 20, may be considered of clinical importance. For antimicrobial agents (e.g., Cefil), a 15% reduction in bacteriologic eradication rate could be considered a significant improvement. Similarly, we could consider a reduction of 10 mm Hg in sitting diastolic blood pressure as clinically significant for ACE inhibitor agents in treating hypertensive patients. The examples of clinical significance for antidepressant or antihypertensive agents are those of individual clinical significance, which can be applied to the evaluation of treatment for individual patients in usual clinical practice. Because individual clinical significance only reflects the clinical change after therapy, it cannot be employed to compare the clinical change of a therapy to that of no therapy or of a different therapy. Temple (1982) pointed out that, in the evaluation of one of the phase II clinical trials of an ACE inhibitor, although the ACE inhibitor at 150 mg t.i.d. produced a mean reduction from baseline in diastolic blood pressure of 16 mm Hg, the corresponding mean reduction from baseline for the placebo was 9 mm Hg. It is easy to see that a sizable proportion of the patients in the placebo group reached the level of individual clinical significance of 10 mm Hg. Therefore, this example illustrates that individual clinical significance alone cannot be used to establish the effectiveness of a new treatment. For assessment of the efficacy/safety of a new treatment modality, it is, within the same trial, compared with either a placebo or another treatment, usually the standard therapy. If the concurrent competitor in the same study is a placebo, the effectiveness of the new modality can then be established, based on some primary endpoints, by providing evidence of an average difference


between the new modality and the placebo that is larger than some prespecified difference of clinical importance to the investigators or to the medical/scientific community. This observed average difference is said to be of comparative clinical significance. The ability of a placebo-controlled clinical trial to provide such an observed difference of both comparative clinical significance and statistical significance is referred to as assay sensitivity. A similar definition of assay sensitivity is also given in the ICH E10 guideline entitled Choice of Control Group in Clinical Trials (ICH, 1999). On the other hand, when the concurrent competitor in the trial is the standard treatment or another active treatment, the efficacy of the new treatment can be established by showing that the test treatment is as good as, or at least no worse than, the standard treatment. However, under this situation, the proof of efficacy for the new treatment is based on a crucial assumption that the standard treatment or active competitor has established its own efficacy by demonstrating a difference of comparative clinical significance with respect to placebo in adequate placebo-controlled studies. This assumption is referred to as sensitivity-to-drug-effects (ICH E10, 1999). Table 1.5 presents the results first reported in Leber (1989), which were used again by Temple (1983) and Temple and Ellenberg (2000) to illustrate the issues and difficulties in evaluating and interpreting active-controlled trials. All six trials compared nomifensine (a test antidepressant) and imipramine (a standard tricyclic antidepressant) concurrently with placebo. The common baseline means and 4-week adjusted group means based on the Hamilton depression scale are given in Table 1.5. Except for trial V311(2), both nomifensine and imipramine showed more than a 50% mean reduction on the Hamilton depression scale. However, the magnitudes of the average reduction on the Hamilton depression scale at 4 weeks for the placebo are almost the same as those for the two active treatments in all five trials. Therefore, these five trials do not have assay sensitivity. It should be noted that trial V311(2) is the smallest trial, with a total sample size of only 22 patients.

TABLE 1.5
Summary of Means of Hamilton Depression Scales of Six Trials Comparing Nomifensine, Imipramine, and Placebo

             Common           Four-Week Adjusted Mean (Number of Subjects)
Study        Baseline Mean    Nomifensine    Imipramine    Placebo
R301         23.9             13.4 (33)      12.8 (33)     14.8 (36)
G305         26.0             13.0 (39)      13.4 (30)     13.9 (36)
C311(1)      28.1             19.4 (11)      20.3 (11)     18.9 (13)
V311(2)      29.6              7.3 (7)        9.5 (8)      23.5 (7)
F313         37.6             21.9 (7)       21.9 (8)      22.0 (8)
K317         26.1             11.2 (37)      10.8 (32)     10.5 (36)

Source: Temple, R. and Ellenberg, S.S., Ann Intern. Med., 133, 455–463, 2000.


However, it was the only trial in Table 1.5 that demonstrated that both nomifensine and imipramine are better than placebo in the sense of both comparative clinical significance and statistical significance.

1.2.5 Reproducibility and Generalizability

As indicated earlier, for marketing approval of a new drug product, the FDA requires that substantial evidence of the effectiveness and safety of the drug product be provided through the conduct of at least two adequate and well-controlled clinical trials. The purpose of requiring at least two pivotal clinical trials is not only to assure reproducibility, but also to provide valuable information regarding generalizability. Shao and Chow (2002) define reproducibility as (i) whether the clinical results in the same target patient population are reproducible from one location (e.g., study site) to another within the same region (e.g., the United States of America, the European Union, or the Asian Pacific region) or (ii) whether the clinical results are reproducible from one region to another region in the same target patient population. Generalizability refers to (i) whether the clinical results can be generalized from the target patient population (e.g., adult) to another similar but slightly different patient population (e.g., elderly) within the same region or (ii) whether the clinical results can be generalized from the target patient population (e.g., white) in one region to a similar but slightly different patient population (e.g., Asian) in another region. In what follows, we introduce the concepts of reproducibility and generalizability for providing substantial evidence in clinical research and development.

1.2.5.1 Reproducibility

In clinical research, two questions are commonly asked. First, what is the chance that we will observe a negative result in a future clinical study under the same study protocol, given that positive results have been observed in two pivotal trials? In practice, two positive results observed from the two pivotal trials, which have fulfilled the regulatory requirement for providing substantial evidence, may not guarantee that the clinical results are reproducible in a future clinical trial with the same study protocol with high probability. This is very likely, especially when the positive results observed from the two pivotal trials are marginal (i.e., their p-values are close to but less than the level of significance). Second, it is often of interest to determine whether a large clinical trial that produced positive clinical results can be used to replace two pivotal trials in providing substantial evidence for regulatory approval. Although the FDA requires that at least two pivotal trials be conducted to provide substantial evidence regarding the effectiveness and safety of the drug product under investigation for regulatory review, under certain circumstances the FDA Modernization Act (FDAMA) of 1997 includes a provision (Section 115 of FDAMA) to allow data from one


adequate and well-controlled clinical trial investigation and confirmatory evidence to establish effectiveness for risk/benefit assessment of drug and biological candidates for approval. To address the above two questions, Shao and Chow (2002) suggested evaluating the probability of observing a positive result in a future clinical study with the same study protocol, given that a positive clinical result has been observed. Let H0 and Ha be the null hypothesis and the alternative hypothesis of (1.1). Thus, the null hypothesis is that there is no difference in mean response between a test drug and a control (e.g., placebo). Suppose that the null hypothesis is rejected if and only if |T| > C, where C is a positive known constant and T is a test statistic, which is usually related to a two-sided alternative hypothesis. In statistical theory, the probability of observing a significant clinical result when Ha is indeed true is referred to as the power of the test procedure. If the statistical model under Ha is a parametric model, then the power can be evaluated at θ, where θ is an unknown parameter or vector of parameters. Suppose now that one clinical trial has been conducted and the result is significant. Then, what is the probability that a second trial will produce a significant result, i.e., that the significant result from the first trial is reproducible? Statistically, if the two trials are independent, the probability of observing a significant result in the second trial when Ha is true is the same as that of the first trial, regardless of whether the result from the first trial is significant. However, it is suggested that information from the first clinical trial be used in the evaluation of the probability of observing a significant result in the second trial. This leads to the concept of reproducibility probability (Shao and Chow, 2002). In general, the reproducibility probability is a person's subjective probability of observing a significant clinical result from a future trial, when he/she observes significant results from one or several previous trials. Goodman (1992) considered the reproducibility probability as the power of the trial (evaluated at θ) obtained by simply replacing θ with its estimate based on the data from previous trials. In other words, the reproducibility probability can be defined as the estimated power of the future trial using the data from previous studies. Shao and Chow (2002) studied how to evaluate the reproducibility probability using this approach under several study designs for comparing means with both equal and unequal variances. When the reproducibility probability is used to provide substantial evidence of the effectiveness of a drug product, the estimated power approach may produce an optimistic result. Alternatively, Shao and Chow (2002) suggested that the reproducibility probability be defined as a lower confidence bound of the power of the second trial. In addition, they also suggested a more sensible definition of reproducibility probability using the Bayesian approach. Under the Bayesian approach, the unknown parameter θ is a random vector with a prior distribution, say π(θ), which is assumed known. Thus, the reproducibility probability can be defined as the conditional probability of |T| > C in the future trial, given the data set.


TABLE 1.6
Reproducibility Probability P̂

             Known σ²              Unknown σ² (n = 30)
|T(x)|     p-value     P̂         p-value     P̂
1.96       0.050       0.500      0.060       0.473
2.05       0.040       0.536      0.050       0.508
2.17       0.030       0.583      0.039       0.554
2.33       0.020       0.644      0.027       0.614
2.58       0.010       0.732      0.015       0.702
2.81       0.005       0.802      0.009       0.774
3.30       0.001       0.910      0.003       0.890

Source: Shao and Chow (2002).
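For the known-σ² case with a normal test statistic, the estimated power approach amounts to evaluating the power of the future trial at the observed |T(x)|. The following minimal sketch, written under that assumption (it is not the book's code), reproduces the known-σ² column of Table 1.6.

```python
# Minimal sketch of the estimated power approach (Shao and Chow, 2002)
# for a normal test statistic with known variance: the reproducibility
# probability is the power of the future trial evaluated at the
# observed statistic |T(x)|.
from scipy.stats import norm

z_crit = norm.ppf(0.975)  # two-sided 5% critical value

def reproducibility_probability(t_obs):
    """P(|T| > C in a future identical trial), with the noncentrality
    parameter estimated by the observed statistic."""
    return (1 - norm.cdf(z_crit - abs(t_obs))
            + norm.cdf(-z_crit - abs(t_obs)))

for t in (1.96, 2.05, 2.17, 2.33, 2.58, 2.81, 3.30):
    print(f"|T(x)| = {t:.2f}: P-hat = {reproducibility_probability(t):.3f}")
# Output matches the known-sigma^2 column of Table 1.6
# (0.500, 0.536, 0.583, 0.644, 0.732, 0.802, 0.910).
```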

In practice, the reproducibility probability is useful when clinical trials are conducted sequentially. It provides important information for regulatory agencies in determining whether it is necessary to require a second clinical trial when the result from the first clinical trial is strongly significant. To illustrate the concept, reproducibility probabilities for various values of |T(x)| with n = 30 are given in Table 1.6. Table 1.6 suggests that it is not necessary to conduct a second trial if the observed p-value of the first trial is less than or equal to 0.001, because the reproducibility probability is about 0.91. On the other hand, even when the observed p-value is well below the 5% level of significance, say less than or equal to 0.01, a second trial is recommended because the reproducibility probability may not reach the level of confidence needed for the regulatory agency to support the substantial evidence of effectiveness of the drug product under investigation. When a second trial is necessary, the reproducibility probability can be used for sample size adjustment of the second trial. More details regarding sample size calculation based on reproducibility can be found in Shao and Chow (2002) and Chow et al. (2003).

1.2.5.2 Generalizability

As discussed in Section 1.2.5.1, the concept of reproducibility involves whether clinical results observed from the same target patient population are reproducible from study site to study site within the same region or from region to region. In clinical development, after the drug product has been shown to be effective and safe with respect to the target patient population, it is often of interest to determine how likely it is that the clinical results can be reproduced in a different but similar patient population with the same disease. We will refer to the reproducibility of clinical results in a different but similar patient population as the generalizability of the clinical results. For example, if the approved drug product is intended for the adult patient


population, it is often of interest to study the effect of the drug product on a different but similar patient population, such as the elderly or pediatric patient population with the same disease. In addition, it is also of interest to determine whether the clinical results can be generalized to patient populations with ethnic differences. Accordingly, Shao and Chow (2002) proposed the so-called generalizability probability, which is the reproducibility probability with the population of the future trial slightly deviated from the target patient population of the previous trials, to determine whether the clinical results can be generalized from the target patient population to a different but similar patient population with the same disease. In practice, the response of a patient to a drug product under investigation is expected to vary from patient to patient, especially between patients from the target patient population and patients from a different but similar patient population, whose responses could differ from those of the target patient population. As an example, consider a clinical trial that compared the efficacy and safety of a test drug with an active control agent for the treatment of schizophrenia patients and patients with schizoaffective disorder. The primary study endpoint is the positive and negative symptom score (PANSS). The treatment duration of the clinical trial was 1 year with a 6-month follow-up. Table 1.7 provides summary statistics of PANSS by race. As can be seen from Table 1.7, the means and standard deviations of PANSS differ across races. Oriental patients tend to have higher PANSS with less variability as compared to white patients, while black patients seem to have lower PANSS with less variability at both baseline and endpoint. Thus, it is of interest to determine whether the observed clinical results can be generalized to a different but similar patient population, such as black or Oriental patients. Chow (2001) indicated that the responses of patients from a different but similar patient population could be described by changes in the mean and variance of the responses of patients from the target patient population. Consider a parallel-group clinical trial comparing two treatments with population means μ1 and μ2 and an equal variance σ². Suppose that in the future trial the population mean difference is changed to μ1 − μ2 + ε and the population variance is changed to C²σ², where C > 0. The signal-to-noise ratio for the population difference in the previous trial is (μ1 − μ2)/σ, whereas the signal-to-noise ratio for the population difference in the future trial is

(μ1 − μ2 + ε)/(Cσ) = Δ(μ1 − μ2)/σ,

where

Δ = [1 + ε/(μ1 − μ2)]/C


TABLE 1.7
Summary Statistics of PANSS

                                Baseline                             Endpoint
                      All                    Active        All                    Active
Race                  Subjects   Test        Control       Subjects   Test        Control
All Subjects  N       364        177         187           359        172         187
              Mean    66.3       65.1        67.5          65.6       61.8        69.1
              S.D.    16.85      16.05       17.54         20.41      19.28       20.83
              Median  65.0       63.0        66.0          64.0       59.0        67.0
              Range   (30–131)   (30–115)    (33–131)      (31–146)   (31–145)    (33–146)
White         N       174        81          93            169        77          92
              Mean    68.6       67.6        69.5          69.0       64.6        72.7
              S.D.    17.98      17.88       18.11         21.31      21.40       20.64
              Median  65.5       64.0        66.0          66.0       61.0        70.5
              Range   (30–131)   (30–115)    (33–131)      (31–146)   (31–145)    (39–146)
Black         N       129        67          62            129        66          63
              Mean    63.8       63.3        64.4          61.7       58.3        65.2
              S.D.    13.97      12.83       15.19         18.43      16.64       19.64
              Median  64.0       63.0        65.5          61.0       56.5        66.0
              Range   (34–109)   (38–95)     (34–109)      (31–129)   (31–98)     (33–129)
Oriental      N       5          2           3             5          2           3
              Mean    71.8       72.5        71.3          73.2       91.5        61.0
              S.D.    4.38       4.95        5.03          24.57      20.51       20.95
              Median  72.0       72.5        72.0          77.0       91.5        66.0
              Range   (66–76)    (69–76)     (66–76)       (38–106)   (77–106)    (38–79)
Hispanic      N       51         24          27            51         24          27
              Mean    64.5       61.4        67.3          64.6       61.9        67.1
              S.D.    18.71      16.78       20.17         20.60      16.71       23.58
              Median  63.0       60.0        68.0          66.0       59.5        67.0
              Range   (33–104)   (35–102)    (33–104)      (33–121)   (33–90)     (33–121)

The quantity Δ is a measure of the change in the signal-to-noise ratio for the population difference. Note that the signal-to-noise ratio of the future trial can be expressed as Δ multiplied by the effect size of the first trial. As a result, Shao and Chow (2002) refer to Δ as the sensitivity index, which is useful when assessing similarity in bridging studies. For most practical problems, ε < μ1 − μ2 and thus Δ > 0.


If the power for the previous trial is p(θ), then the power for the future trial is p(Δθ). Suppose that Δ is known. As discussed earlier, the generalizability probability is given by P̂Δ, which can be obtained by simply replacing T(x) with ΔT(x). Under the Bayesian approach, the generalizability probability can be obtained by replacing p(δ|u) with p(Δδ|u). In practice, the generalizability probability is useful when assessing similarity between clinical trials conducted in different regions (e.g., Europe and the United States of America, or the United States of America and the Asian Pacific region). It provides important information for local regulatory health authorities in determining whether it is necessary to require a bridging clinical study based on the analysis of the sensitivity index for assessment of possible differences in ethnic factors (Chow et al., 2002). When a bridging study is deemed necessary, the assessment of the generalizability probability based on the sensitivity index can be used for sample size adjustment of the bridging clinical study.
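Under the known-σ² formulation used for Table 1.6, replacing T(x) by ΔT(x) gives a simple numerical recipe. The following is a minimal sketch; the values of the mean difference, ε, C, and the observed statistic are hypothetical illustrations, not data from the book.

```python
# Minimal sketch: generalizability probability via the sensitivity index,
# obtained by replacing the observed statistic T(x) with Delta * T(x) in
# the known-variance reproducibility formula.
from scipy.stats import norm

z_crit = norm.ppf(0.975)

def power_at(t):
    """Two-sided power of a normal test evaluated at noncentrality |t|."""
    return 1 - norm.cdf(z_crit - abs(t)) + norm.cdf(-z_crit - abs(t))

mean_diff, epsilon, C = 10.0, -2.0, 1.2   # hypothetical shift and inflation
delta = (1 + epsilon / mean_diff) / C     # sensitivity index

t_obs = 2.58  # hypothetical observed statistic from the previous trial
print(f"Delta = {delta:.3f}")
print(f"reproducibility  = {power_at(t_obs):.3f}")
print(f"generalizability = {power_at(delta * t_obs):.3f}")
```

With these illustrative values, Δ < 1 (a smaller mean difference and larger variance in the new population), so the generalizability probability is noticeably lower than the reproducibility probability.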

1.3 Complex Innovative Designs

1.3.1 Adaptive Trial Design

Over the past several decades, it has been recognized that increased spending on biomedical research has not been matched by an increase in the success rate of pharmaceutical (clinical) development. Woodcock (2005) indicated that the low success rate of pharmaceutical development could be due to: (i) a diminished margin for improvement that escalates the level of difficulty in proving drug benefits; (ii) genomics and other new sciences that have not yet reached their full potential; (iii) mergers and other business arrangements that have decreased candidates; (iv) a focus of effort on easy targets, as chronic diseases are harder to study; (v) failure rates that have not improved; and (vi) rapidly escalating costs and complexity that have decreased the willingness/ability to bring many candidates forward for clinical trials. As a result, the FDA established the Critical Path Initiative to assist sponsors in identifying the scientific challenges underlying the medical product pipeline problems. In 2006, the FDA released a Critical Path Opportunities List that calls for advancing innovative trial designs, especially the use of prior experience or accumulated information in trial design. Many researchers have interpreted the FDA's action as encouragement of the use of innovative adaptive design methods in clinical trials, while some researchers believe it is encouragement of the use of the Bayesian approach. The purpose of adaptive design methods in clinical trials is to give the investigator the flexibility to identify any signals or trends (preferably the best or optimal clinical benefit) of the test treatment under investigation


without undermining the validity and integrity of the intended study. The concept of adaptive design can be traced back to the 1970s, when adaptive randomization and a class of designs for sequential clinical trials were introduced. As a result, most adaptive design methods in clinical research are referred to as adaptive randomization, group sequential designs with the flexibility to stop a trial early due to safety, futility, and/or efficacy, and sample size re-estimation at interim for achieving the desired statistical power. The use of adaptive design methods for modifying the trial and/or statistical procedures of ongoing clinical trials based on accrued data has been practiced for years in clinical research. Adaptive design methods in clinical research are very attractive to clinical scientists for the following reasons. First, they reflect medical practice in the real world. Second, they are ethical with respect to both the efficacy and safety (toxicity) of the test treatment under investigation. Third, they are not only flexible, but also efficient in the early phase of clinical development. However, it is a concern whether the p-value or confidence interval regarding the treatment effect obtained after a modification is reliable or correct. In addition, it is also a concern that the use of adaptive design methods in a clinical trial may lead to a totally different trial that is unable to address the scientific/medical questions the trial was intended to answer. In its recent draft guidance, Adaptive Design Clinical Trials for Drugs and Biologics, the FDA defines an adaptive design clinical study as a study that includes a prospectively planned opportunity for modification of one or more specified aspects of the study design and hypotheses based on analysis of data (usually interim data) from subjects in the study. The FDA emphasizes that one of the major characteristics of an adaptive design is the prospectively planned opportunity; changes should be made based on analysis of data, usually interim data (FDA, 2010b, 2018). Note that the FDA's definition excludes changes made through protocol amendments; thus, it does not reflect real practice in clinical trials. In many cases, an adaptive design is also known as a flexible design (EMEA, 2002, 2006).

1.3.1.1 Adaptations

An adaptation refers to a modification or a change made to trial procedures and/or statistical methods during the conduct of a clinical trial. By definition, adaptations that are commonly employed in clinical trials can be classified into the categories of prospective adaptation, concurrent (or ad hoc) adaptation, and retrospective adaptation. Prospective adaptations include, but are not limited to, adaptive randomization; stopping a trial early due to safety, futility, or efficacy at interim analysis; dropping the losers (or inferior treatment groups); and sample size re-estimation. Thus, prospective adaptations are usually referred to as by-design adaptations, as described in the PhRMA white paper (Gallo et al., 2006). Concurrent adaptations are usually referred to as any ad hoc modifications or changes made as the trial


Concurrent adaptations include, but are not limited to, modifications in inclusion/exclusion criteria, evaluability criteria, dose/regimen and treatment duration, changes in hypotheses and/or study endpoints, etc. Retrospective adaptations are usually referred to as modifications and/or changes made to the statistical analysis plan prior to database lock or unblinding of treatment codes. In practice, prospective, ad hoc, and retrospective adaptations are implemented by study protocol, protocol amendments, and statistical analysis plan with regulatory reviewer's consensus, respectively.

1.3.1.2 Types of Adaptive Design

Based on the adaptations employed, commonly considered adaptive designs in clinical trials include, but are not limited to: (i) an adaptive randomization design; (ii) a group sequential design; (iii) a sample size re-estimation (SSRE) design or an N-adjustable design; (iv) a drop-the-loser (or pick-the-winner) design; (v) an adaptive dose finding design; (vi) a biomarker-adaptive design; (vii) an adaptive treatment-switching design; (viii) an adaptive-hypothesis design; (ix) an adaptive seamless (e.g., a two-stage phase I/II or phase II/III) trial design; and (x) a multiple adaptive design. More detailed information regarding these designs can be found in Chapter 13 (see also Chow and Chang, 2011).

1.3.2 The n-of-1 Trial Design

One of the major dilemmas for rare disease clinical trials is the unavailability of patients with the rare diseases under study. In addition, it is unethical to consider a placebo control in the intended clinical trial. Thus, it is suggested that an n-of-1 crossover design be considered. An n-of-1 trial design applies n treatments (including placebo) to an individual at different dosing periods with sufficient washout in between dosing periods. A complete n-of-1 trial design is a crossover design consisting of all possible combinations of treatment assignment at different dosing periods.

1.3.2.1 Complete n-of-1 Trial Design

Suppose there are p dosing periods and two test treatments, e.g., a test (T) treatment and a reference (R) product, to be compared. A complete n-of-1 trial design for comparing two treatments consists of ∏_{i=1}^{p} 2 = 2^p, p ≥ 2, sequences of p treatments (either T or R at each dosing period). In this case, n = p. If p = 2, then the n-of-1 trial design is a 4 × 2 crossover design, i.e., (RR, RT, TT, TR), which is a typical Balaam's design. When p = 3, the n-of-1 trial design becomes an 8 × 3 crossover design, while the complete n-of-1 trial design with p = 4 is a 16 × 4 crossover design, which is illustrated in Table 1.8. As indicated in a recent FDA draft guidance, a two-sequence dual design, i.e., (RTR, TRT), and a 2 × 4 crossover design, i.e., (RTRT, RRRR), recommended by the FDA are commonly considered switching designs for assessing interchangeability in biosimilar product development (FDA, 2017).


TABLE 1.8
Examples of Complete n-of-1 Designs with p = 4

Group    Period 1    Period 2    Period 3    Period 4
 1       R           R           R           R
 2       R           T           R           R
 3       T           T           R           R
 4       T           R           R           R
 5       R           R           T           R
 6       R           T           T           T
 7       T           R           T           R
 8       T           T           T           T
 9       R           R           R           T
10       R           R           T           T
11       R           T           R           T
12       R           T           T           R
13       T           R           R           T
14       T           R           T           T
15       T           T           R           T
16       T           T           T           R

Note: The first block (a 4 × 2 crossover design) is a complete n-of-1 design with 2 periods, while the second block is a complete n-of-1 design with 3 periods.
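As an illustration, the following minimal Python sketch (an illustration only, not part of any guidance) enumerates the 2^p sequences of a complete n-of-1 design; with p = 4 it generates the 16 sequences of Table 1.8, although in lexicographic rather than the table's blocked order.

# Enumerate all 2^p sequences of a complete n-of-1 design for two
# treatments (T and R) over p dosing periods; with p = 4 this yields
# the 16 sequences of Table 1.8 (in lexicographic order).
from itertools import product

def complete_n_of_1(p=4, treatments=("R", "T")):
    """Return all 2^p treatment sequences for p dosing periods."""
    return ["".join(seq) for seq in product(treatments, repeat=p)]

for i, seq in enumerate(complete_n_of_1(p=4), start=1):
    print(f"Group {i:2d}: {' '.join(seq)}")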

These two switching designs, however, are limited for fully characterizing the relative risk (i.e., reduction in efficacy or increased incidence of adverse events). On the other hand, these two trial designs are special cases of the complete n-of-1 trial design with 3 and 4 dosing periods, respectively. Under the complete n-of-1 crossover design with 4 dosing periods, all possible switches and alternations can be assessed, and the results can be compared within the same group of patients and between different groups of patients.

1.3.2.2 Merits and Limitations

A complete n-of-1 trial design has the following advantages: (i) each subject serves as his/her own control; (ii) it allows a comparison between the test product and the placebo if the intended trial is a placebo-controlled study (which alleviates the ethical issue of using a placebo in patients with critical conditions); (iii) it allows estimates of intra-subject variability; (iv) it provides estimates of the treatment effect in the presence of a possible carry-over effect; and, most importantly, (v) it requires fewer subjects for achieving the study objectives of the intended trial. However, the n-of-1 trial design suffers from the drawbacks of (i) possible dropouts or missing data and (ii) the possibility that patients' disease status may change at each dosing period prior to dosing.


1.3.3 The Concept of Master Protocols

Woodcock and LaVange (2017) introduced the concept of the master protocol for studying multiple therapies, multiple diseases, or both, in order to answer more questions in a more efficient and timely fashion (see also Redman and Allegra, 2015). Master protocols include the following types of trials: umbrella, basket, and platform. An umbrella trial studies multiple targeted therapies in the context of a single disease, while a basket trial studies a single therapy in the context of multiple diseases or disease subtypes. A platform trial studies multiple targeted therapies in the context of a single disease in a perpetual manner, with therapies allowed to enter or leave the platform on the basis of a decision algorithm. As indicated by Woodcock and LaVange (2017), if designed correctly, master protocols offer a number of benefits, including streamlined logistics; improved data quality, collection, and sharing; as well as the potential to use innovative statistical approaches to study design and analysis. Master protocols may be a collection of sub-studies or a complex statistical design or platform for rapid learning and decision-making. In practice, a master protocol is intended to accommodate the addition or removal of drugs, arms, and study hypotheses. Thus, in practice, master protocols may or may not be adaptive, umbrella, or basket studies. Since a master protocol has the ability to combine a variety of logistical, innovative, and correlative elements, it allows learning more from smaller patient populations. Thus, the concept of master protocols in conjunction with the adaptive trial designs described in the previous section may be useful for rare disease clinical investigation, although it has been most frequently implemented in oncology research.

1.3.4 Bayesian Approach

Under the assumption that historical data (e.g., previous studies or experience) are available, Bayesian methods for borrowing information from different data sources may be useful. These data sources could include, but are not limited to, natural history studies and experts' opinions regarding the prior distribution of the relationship between endpoints and clinical outcomes. The impact of borrowing on results can be assessed through the conduct of a sensitivity analysis. One of the key questions of particular interest to the investigator and the regulatory reviewer is how much to borrow in order to (i) achieve the desired statistical assurance for substantial evidence and (ii) maintain the quality, validity, and integrity of the study. Although the Bayesian approach provides a formal framework for borrowing historical information, which is useful in rare disease clinical trials, borrowing can only be done under the assumption that there is a well-established relationship between patient populations (e.g., from previous studies to the current study). In practice, it is suggested not to borrow any data from previous studies whenever possible.


The primary analysis should rely on the data collected from the current study. When borrowing, the associated risk should be carefully evaluated for the scientific/statistical validity of the final conclusion. It should be noted that the Bayesian approach may not be feasible if no prior experience or studies are available. The determination of the prior in a Bayesian analysis is always debatable because the primary assumption behind the selected prior is often difficult, if not impossible, to verify.
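To illustrate how the amount of borrowing drives the analysis, the following sketch implements a power-prior-style posterior for a normal mean with known variance; the discount factor a0, the flat initial prior, and the simulated data are all assumptions made for illustration, and varying a0 mirrors the sensitivity analysis suggested above.

# A minimal sketch of Bayesian borrowing via a power prior for a normal
# mean with known variance sigma2. The discount factor a0 in [0, 1]
# controls how much historical information is borrowed (0 = none,
# 1 = full pooling); it is an assumption to be probed by sensitivity
# analysis, not a value prescribed by any guidance.
import numpy as np

def posterior_mean_var(y_current, y_hist, sigma2, a0):
    n, n0 = len(y_current), len(y_hist)
    prec = (n + a0 * n0) / sigma2                  # posterior precision
    mean = (n * np.mean(y_current) + a0 * n0 * np.mean(y_hist)) / (n + a0 * n0)
    return mean, 1.0 / prec

rng = np.random.default_rng(0)
y_hist = rng.normal(0.4, 1.0, size=200)            # historical study (simulated)
y_cur = rng.normal(0.2, 1.0, size=50)              # current trial (simulated)
for a0 in (0.0, 0.25, 0.5, 1.0):                   # sensitivity analysis over a0
    m, v = posterior_mean_var(y_cur, y_hist, sigma2=1.0, a0=a0)
    print(f"a0 = {a0:4.2f}: posterior mean = {m:6.3f}, sd = {v**0.5:5.3f}")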

1.4 Practical, Challenging, and Controversial Issues

1.4.1 Totality-of-the-Evidence

For regulatory approval of new drugs, Section 314.126 of 21 CFR states that substantial evidence needs to be provided to support the claims of new drugs. For regulatory approval of a proposed biosimilar product, the FDA requires that totality-of-the-evidence be provided to support a demonstration of biosimilarity between the proposed biosimilar product and the US-licensed drug product. In practice, it should be noted that there is no clear distinction between the substantial evidence in new drug development and the totality-of-the-evidence in biosimilar drug product development. For approval of a proposed biosimilar product, the FDA requires that totality-of-the-evidence be provided to support a demonstration that the proposed biosimilar product is highly similar to the US-licensed product, notwithstanding minor differences in clinically inactive components, and that there are no clinically meaningful differences between the proposed biosimilar product and the US-licensed product in terms of the safety, purity, and potency of the product. To assist the sponsor in biosimilar product development, the FDA recommends a stepwise approach for obtaining the totality-of-the-evidence for demonstrating biosimilarity between the proposed biosimilar product and its corresponding innovative drug product in terms of safety, purity, and efficacy (Chow, 2013; FDA, 2015a, 2017a; Endrenyi et al., 2017). The stepwise approach starts with a similarity assessment of critical quality attributes (CQAs) in analytical studies, followed by a similarity assessment of pharmacological activities in pharmacokinetic and pharmacodynamic (PK/PD) studies and a similarity assessment of safety and efficacy in clinical studies. For analytical similarity assessment of CQAs, the FDA further recommends a tiered approach that classifies CQAs into three tiers depending upon their criticality or risk ranking relevant to clinical outcomes. For the determination of criticality or risk ranking, the FDA suggests establishing a predictive (statistical) model based on either mechanism of action (MOA) or PK relevant to clinical outcome. Thus, the following assumptions are made for the stepwise approach for obtaining the totality-of-the-evidence.


1. Analytical similarity is predictive of PK/PD similarity;
2. Analytical similarity is predictive of clinical outcomes;
3. PK/PD similarity is predictive of clinical outcomes.

These assumptions, however, are difficult (if not impossible) to verify in practice. For assumptions (1) and (2), although many in vitro and in vivo correlations (IVIVC) have been studied in the literature, the correlations between specific CQAs and PK/PD parameters or clinical endpoints are not fully studied and understood. In other words, most predictive models are not well established, or are established but not validated. Thus, it is not clear how a (notable) change in a specific CQA can be translated into a change in drug absorption or clinical outcome. For (3), unlike bioequivalence assessment for generic drug products, there does not exist a Fundamental Biosimilarity Assumption indicating that PK/PD similarity implies clinical similarity in terms of safety and efficacy. In other words, PK/PD similarity or dissimilarity may or may not lead to clinical similarity. Note that assumptions (1) and (3) together do not imply (2) automatically. The validity of assumptions (1)–(3) is critical for the success of obtaining the totality-of-the-evidence for assessing biosimilarity between the proposed biosimilar and the innovative biological product. This is because the validity of these assumptions ensures the relationships among the analytical, PK/PD, and clinical similarity assessments and consequently the validity of the overall biosimilarity assessment. Table 1.1 illustrates the relationships among the analytical, PK/PD, and clinical assessments in the stepwise approach for obtaining the totality-of-the-evidence in biosimilar product development.

1.4.2 (1−α) CI for New Drugs versus (1−2α) CI for Generics/Biosimilars

Recall that, for review and approval of regulatory submissions of drug products, a (1 − α) × 100% confidence interval (CI) approach is commonly used for the evaluation of safety and efficacy of new drugs, while a (1 − 2α) × 100% CI is often considered for the assessment of bioequivalence for generic drug products and of biosimilarity for biosimilar products. If α is chosen to be 5%, this leads to a 95% confidence interval approach for the evaluation of new drugs and a 90% confidence interval approach for the assessment of generics and biosimilars. In the past decade, the FDA has been challenged for adopting different standards (i.e., a 5% type I error rate for new drugs and 10% for generics/biosimilars) in the review and approval process of regulatory submissions of drugs and biologics. The issue of using a 95% CI for new drugs and a 90% CI for generics/biosimilars has received much attention lately. The controversy surrounding this issue is probably due to conflating the concepts of hypotheses testing and the confidence interval approach for the evaluation of drug products. The statistical methods for the evaluation of new drugs and for the assessment of generics/biosimilars recommended by the FDA are a two-sided test (TST) for testing point hypotheses of equality and a two one-sided tests (TOST) procedure for testing interval hypotheses of bioequivalence or biosimilarity, respectively.


Under the hypotheses testing framework, test results are interpreted using the confidence interval approach, even though the corresponding statistical inferences may not be (operationally) equivalent. In practice, there are fundamental differences (i) between the concepts of point hypotheses and interval hypotheses, and (ii) between the concepts of hypotheses testing and the confidence interval approach. For (i), a TST is often performed for testing point hypotheses of equality at the 5% level of significance, while a TOST is commonly used for testing interval hypotheses of equivalence or similarity at the 5% level of significance. For (ii), hypotheses testing focuses on power analysis for achieving a desired power (i.e., a type II error), while the confidence interval approach focuses on precision analysis (i.e., a type I error) for controlling the maximum error allowed. Confusion between the use of a 95% CI and the use of a 90% CI for the evaluation of drug products will inevitably occur if we use a confidence interval mindset to interpret results obtained under the hypotheses testing framework.

1.4.3 Endpoint Selection

Utility function for endpoint selection: Chow and Lin (2015) developed valid statistical tests for the analysis of two-stage adaptive designs with different study objectives and study endpoints at different stages, under the assumption that there is a well-established relationship between the different but similar study endpoints at the different stages. To apply the methodology developed by Chow and Lin (2015), we propose the development of a utility function that links all clinical outcomes relevant to NASH for endpoint selection at different stages as follows. Let y = {y_1, y_2, …, y_m} be the clinical outcomes of interest, which could be efficacy or safety/toxicity of a test treatment under investigation. Clinical outcomes could be the NAFLD Activity Score (NAS), including steatosis, lobular inflammation, and ballooning; fibrosis progression at different stages; and abnormalities on liver biopsy. Each of these clinical outcomes, y_i, is a function of criteria, y_i(x), x ∈ X, where X is a space of criteria. We can then define a utility function (e.g., a composite index such as NAS ≥ 4 and/or F2/F3 (fibrosis stage 2 and fibrosis stage 3) or co-primary endpoints (NAS score and fibrosis stage)) for endpoint selection as follows:

U_s = ∑_{j=1}^{m} w_j(y_sj), (1)

where U_s denotes the endpoint derived from the utility at the sth stage of the multiple-stage adaptive design and w_j, j = 1, …, m, are pre-specified weights.
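As a toy numerical illustration (not the construction of Chow and Lin, 2015), the sketch below evaluates such a utility index for hypothetical NASH outcomes; the weights, thresholds, and scoring rule are placeholder assumptions.

# A minimal sketch of the single utility index U_s = sum_j w_j(y_sj),
# where each outcome contributes its weight when it meets its
# pre-specified criterion (e.g., NAS >= 4). All numbers are hypothetical.
def utility_index(outcomes, weights, thresholds):
    return sum(w * (y >= t) for y, w, t in zip(outcomes, weights, thresholds))

# Example at stage s: (NAS score, fibrosis stage) with hypothetical
# weights (0.6, 0.4) and criteria (NAS >= 4, fibrosis stage >= 2)
U_s = utility_index(outcomes=(5, 2), weights=(0.6, 0.4), thresholds=(4, 2))
print(U_s)  # 1.0, i.e., both criteria are met under these assumptions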


The above single utility function, which takes different clinical outcomes with pre-specified criteria into consideration, is based on a single utility index rather than individual clinical outcomes. The single utility index model allows the investigator to accurately and reliably assess the treatment effect in a more efficient way as follows:

p = P{U_s ≥ τ_s, s = 1, 2, …, k}, (2)

where τ_s may be suggested by regulatory agencies such as the FDA. It should be noted that U_s, s = 1, 2, …, k (stages), are similar but different. In practice, if k = 2, we can apply the statistical methods for a two-stage adaptive design with different study endpoints and objectives at different stages (Chow and Lin, 2015).

1.4.4 Criteria for Decision-Making at Interim

For a multiple-stage adaptive design, the sample size is often selected to empower the study to detect a clinically meaningful treatment effect at the end of the last stage. As a result, the chosen sample size may not provide adequate or sufficient power for making critical decisions (e.g., detecting a clinically meaningful difference for dose selection or dropping the inferior treatment groups) at earlier stages of the study. In this case, a precision analysis is recommended to assure that (1) the selected dose has achieved statistical significance (i.e., the observed difference is not by chance alone) and (2) the desired statistical inference for (critical) decision-making can be obtained with the sample size available at interim. Let (L_i, U_i) be the 95% confidence interval for the difference between the ith dose group and the control group, where i = 1, …, k. If L_i > 0, we claim that the observed difference between the ith dose group and the control group has achieved statistical significance. In other words, the observed difference is not by chance alone and hence is reproducible. In this case, the dose level with L* = max{L_i, i = 1, …, k} will be selected to move forward to the next stage. On the other hand, if (L_i, U_i) covers 0, we conclude that there is no difference between the ith dose group and the control group. In this case, the confidence level for achieving statistical significance is defined as the probability that the true mean difference is within (0, U_i). In the case where all (L_i, U_i), i = 1, …, k, cover 0, the following criteria are often considered for dose selection (illustrated in the sketch below):

1. The dose with the highest confidence level for achieving statistical significance will be selected; or
2. The doses with confidence levels for achieving statistical significance less than 75% will be dropped.

Note that as the sample size increases, the corresponding confidence level for achieving statistical significance increases. It should also be noted that precision assessment in terms of the confidence interval approach is operationally equivalent to hypotheses testing for comparing means.
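The following sketch implements the above selection rules under normal theory. Reading the confidence level for achieving statistical significance as Φ(d_i/se_i), the normal-theory probability that the true mean difference exceeds 0, is our operational assumption for illustration; the estimates and standard errors are hypothetical.

# A minimal sketch of CI-based dose selection at interim. diffs[i] and
# ses[i] are the estimated difference vs. control for dose i and its
# standard error; all values below are hypothetical.
from statistics import NormalDist

def select_dose(diffs, ses, z=1.96, drop_below=0.75):
    lowers = [d - z * s for d, s in zip(diffs, ses)]
    if any(l > 0 for l in lowers):                 # some dose achieved significance
        i = max(range(len(diffs)), key=lambda j: lowers[j])
        return f"select dose {i + 1} (L* = {lowers[i]:.3f})"
    # all CIs cover 0: compute confidence levels and drop doses below 75%
    conf = [NormalDist().cdf(d / s) for d, s in zip(diffs, ses)]
    keep = [j + 1 for j, c in enumerate(conf) if c >= drop_below]
    return f"no dose significant; keep doses {keep}, levels {[round(c, 2) for c in conf]}"

print(select_dose(diffs=[0.10, 0.25, 0.40], ses=[0.20, 0.20, 0.20]))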


1.4.5 Non-inferiority or Equivalence/Similarity Margin Selection

In clinical trials, it is unethical to treat patients with critical/severe and/or life-threatening diseases such as cancer with a placebo when approved and effective therapies such as standard of care or active control agents are available. In this case, an active control trial is often conducted for the investigation of a new test treatment. The goal of an active control trial is to demonstrate that the test treatment is non-inferior or equivalent to the active control agent in the sense that the effect of the test treatment is not below some non-inferiority margin or equivalence limit when compared with the efficacy of the active control agent. In practice, there may be a need to develop a new treatment or therapy that is non-inferior (but not necessarily superior) to an established efficacious treatment for the following reasons: (i) the test treatment is less toxic; (ii) the test treatment has a better safety profile; (iii) the test treatment is easier to administer; (iv) the test treatment is less expensive; (v) the test treatment provides better quality of life; (vi) the test treatment provides an alternative treatment with some additional clinical benefits, e.g., generics or biosimilars. Clinical trials of this kind are referred to as non-inferiority trials. A comprehensive overview of design concepts and important issues that are commonly encountered in active control or non-inferiority trials can be found in D'Agostino et al. (2003). For a non-inferiority trial, the idea is to reject the null hypothesis that the test treatment is inferior to a standard therapy or an active control agent and conclude that the difference between the test treatment and the active control agent is less than a clinically meaningful difference (the non-inferiority margin), and hence that the test treatment is at least as effective as (or not worse than) the active control agent. The test treatment can then serve as an alternative to the active control agent. In practice, however, it should be noted that unlike equivalence testing, non-inferiority testing is a one-sided equivalence testing which consists of the concepts of equivalence and superiority. In other words, superiority may be tested after non-inferiority has been established. We conclude equivalence if we fail to reject the null hypothesis of non-superiority. On the other hand, superiority may be concluded if the null hypothesis of non-superiority is rejected. One of the major considerations in a non-inferiority trial is the selection of the non-inferiority margin. A different choice of non-inferiority margin will have an impact on the sample size required for achieving the desired power for establishing non-inferiority. It should be noted that the non-inferiority margin could be selected based on either the absolute change or the relative change of the primary study endpoint, which will affect the method for analysis of the collected clinical data and consequently may alter the conclusion of the clinical study. In practice, despite the existence of some studies (e.g., Tsong et al., 1999; Hung et al., 2003; Laster and Johnson, 2003; Phillips, 2003), there was no established rule or gold standard for the determination of non-inferiority margins in active control trials until the early 2000s. In 2000, the International Conference on Harmonization (ICH) published a guideline to assist sponsors in selecting an appropriate non-inferiority margin (ICH E10, 2000).


The ICH E10 guideline suggests that a non-inferiority margin may be selected based on past experience in placebo control trials with valid designs under conditions similar to those planned for the new trial, and that the determination of a non-inferiority margin should not only reflect uncertainties in the evidence on which the choice is based, but also be suitably conservative. Along this line, in 2010 the FDA also published a draft guidance on non-inferiority clinical trials and recommended several approaches for the selection of the non-inferiority margin (FDA, 2010).

1.4.6 Treatment of Missing Data

Missing values or incomplete data are commonly encountered in clinical trials. One of the primary causes of missing data is dropout. Reasons for dropout include, but are not limited to, refusal to continue in the study (e.g., withdrawal of informed consent), perceived lack of efficacy, relocation, adverse events, unpleasant study procedures, worsening of disease, unrelated disease, non-compliance with the study, need to use prohibited medication, and death (DeSouza et al., 2009). Following the idea of Little and Rubin (1987), DeSouza et al. (2009) provided an overview of three types of missingness mechanisms for dropouts: (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR). Missing completely at random refers to a dropout process that is independent of both the observed data and the missing data. Missing at random indicates that the dropout process is dependent on the observed data but is independent of the missing data. For missing not at random, the dropout process is dependent on the missing data and possibly the observed data. Depending upon the missingness mechanism, appropriate missing data analysis strategies can then be considered based on existing analysis methods in the literature. For example, commonly considered methods under MAR include: (1) discard incomplete cases and analyze complete cases only; (2) impute or fill in missing values and then analyze the filled-in data; (3) analyze the incomplete data by a method that does not require a complete data set, such as a likelihood-based method (e.g., maximum likelihood, restricted maximum likelihood, or the Bayesian approach), a moment-based method (e.g., generalized estimating equations and their variants), or a survival analysis method (e.g., the Cox proportional hazards model). On the other hand, under MNAR, commonly considered methods are derived under pattern mixture models (Little, 1994), which can be divided into two types: parametric (see, e.g., Diggle and Kenward, 1994) and semi-parametric (e.g., Rotnitzky et al., 1998). In practice, the possible causes of missing values in a study can generally be classified into two categories. The first category includes reasons that are not directly related to the study. For example, a patient may be lost to follow-up because he/she moves out of the area.


This category of missing values can be considered missing completely at random. The second category includes reasons that are related to the study. For example, a patient may withdraw from the study due to treatment-emergent adverse events. In clinical research, it is not uncommon to have multiple assessments from each subject. Subjects with all observations missing are called unit non-respondents. Because unit non-respondents do not provide any useful information, these subjects are usually excluded from the analysis. On the other hand, subjects with some, but not all, observations missing are referred to as item non-respondents. In practice, excluding item non-respondents from the analysis is considered against the intent-to-treat (ITT) principle and hence is not acceptable. In clinical research, the primary analysis is usually conducted based on the ITT population, which includes all randomized subjects with at least one post-treatment evaluation. As a result, most item non-respondents may be included in the ITT population. Excluding item non-respondents may seriously decrease the power/efficiency of the study. Statistical methods for missing value imputation have been studied by many authors (see, e.g., Kalton and Kasprzyk, 1986; Little and Rubin, 1987; Schafer, 1997). To account for item non-respondents, two methods are commonly considered. The first method is the so-called likelihood-based method. Under a parametric model, the marginal likelihood function for the observed responses is obtained by integrating out the missing responses. The parameter of interest can then be estimated by the maximum likelihood estimator (MLE). Consequently, a corresponding test (e.g., the likelihood ratio test) can be constructed. The merit of this method is that the resulting statistical procedures are usually efficient. The drawback is that the calculation of the marginal likelihood could be difficult. As a result, special statistical or numerical algorithms are commonly applied for obtaining the MLE. For example, the expectation–maximization (EM) algorithm is one of the most popular methods for obtaining the MLE when there are missing data. The other method for item non-respondents is imputation. Compared with the likelihood-based method, the method of imputation is relatively simple and easy to apply. The idea of imputation is to treat the imputed values as observed values and then apply standard statistical software for obtaining consistent estimators. However, it should be noted that the variability of the estimator obtained by imputation is usually different from that of the estimator obtained from complete data. In this case, the formulas designed to estimate the variance for a complete data set cannot be used to estimate the variance of an estimator produced from imputed data. As an alternative, two methods are considered for the estimation of its variability. One is based on Taylor's expansion and is referred to as the linearization method. The merit of the linearization method is that it requires less computation. However, the drawback is that its formula could be very complicated and/or not tractable. The other approach is based on re-sampling methods (e.g., bootstrap and jackknife). The drawback of the re-sampling method is that it requires intensive computation.


The merit is that it is very easy to apply. With the help of fast computers, the re-sampling method has become much more attractive in practice. Note that imputation is very popular in clinical research. The simple imputation method of last observation carried forward (LOCF) at endpoint is probably the most commonly used imputation method in clinical research. Although LOCF is simple and easy to implement in clinical trials, many researchers have challenged its validity. As a result, the search for alternative valid statistical methods for missing value imputation has received much attention in the past decade. In practice, the imputation methods in clinical research are more diversified due to the complexity of study designs relative to sample surveys. As a result, the statistical properties of many commonly used imputation methods in clinical research are still unknown, while most imputation methods used in sample surveys are well studied. Hence, the imputation methods in clinical research provide a unique challenge and also an opportunity for statisticians in the area of clinical research.

The issue of multiplicity: Lepor et al. (1996) report the results of a double-blind, randomized multicenter clinical trial that evaluated the efficacy and safety of terazosin (10 mg daily), an α1-adrenergic antagonist; finasteride (5 mg daily), a 5α-reductase inhibitor; or both, with a placebo control in equal allocation, in 1,229 men with benign prostatic hyperplasia. The primary efficacy endpoints of this trial were the American Urological Association (AUA) symptom score (Barry et al., 1992) and the maximum uroflow rate. These endpoints were evaluated twice during the four-week placebo run-in period and at 2, 4, 13, 26, 39, and 52 weeks of therapy. The primary comparisons of interest included pairwise comparisons among the active drugs and the combination therapy, while the secondary comparisons consisted of pairwise comparisons of the active drugs and the combination therapy with the placebo. The results for the primary efficacy endpoints presented in Lepor et al. (1996) were obtained by performing analyses of covariance with repeated measurements based on the intention-to-treat population. One of the objectives of the trial was to determine the time when the treatments reach therapeutic effect. Therefore, comparisons among treatment groups were performed at each scheduled post-randomization visit at 2, 4, 13, 26, 39, and 52 weeks. In addition to the original observations of the primary endpoints, the change from baseline can also be employed to characterize the change after treatment for each patient. It may be of interest to see whether the treatment effects are homogeneous across race, age, and baseline disease severity. Therefore, some subgroup analyses can be performed, such as for Caucasian and for non-Caucasian patients, for patients below or at least 65 years of age, for patients with a baseline AUA symptom score below 16 or at least 16, or for patients with a maximum uroflow rate below 10 mL/s. The number of total comparisons for the primary efficacy endpoints can be as large as 1,344.


If there is no difference among the four treatment groups and each of the 1,344 comparisons is performed at the 5% level of significance, we can expect about 67 statistically significant comparisons with reported p-values smaller than 0.05. The probability of observing at least one statistically significant difference among the 1,344 comparisons could be as large as 1 under the assumption that all 1,344 comparisons are statistically independent. This count does not include the p-values from center-specific treatment comparisons or from other types of comparisons such as treatment-by-center interaction. Although the above example is a bit exaggerated, it does point out that multiplicity in multicenter clinical trials is an important issue that has an impact on statistical inference for the overall treatment effect. In practice, however, it is almost impossible to characterize a particular disease by a single efficacy measure due to: (i) the multifaceted nature of the disease, (ii) lack of understanding of the disease, and (iii) lack of consensus on the characterization of the disease. Therefore, multiple efficacy endpoints are often considered to evaluate the effectiveness of test drugs in the treatment of most diseases, such as AIDS, asthma, benign prostatic hyperplasia, arthritis, postmenopausal osteoporosis, and ventricular tachycardia. Some of these endpoints are objective histological or physiological measurements, such as the maximum uroflow rate for benign prostatic hyperplasia or the pulmonary function FEV1 (forced expiratory volume in one second) for asthma. Others may include symptoms or subjective judgments of patient well-being improved by the treatments, such as the AUA symptom score for benign prostatic hyperplasia, the asthma-specific symptom score for asthma, or the Greene climacteric scale for postmenopausal osteoporosis (Greene and Hart, 1987). Hence, one type of multiplicity in statistical inference for clinical trials results from multiple endpoints. On the other hand, a clinical trial may be conducted to compare several drugs of different classes for the same indication. For example, the study by Lepor et al. (1996) compares the two monotherapies of terazosin and finasteride with the combination therapy and a placebo control for the treatment of patients with benign prostatic hyperplasia. Some other trials might be intended for the investigation of a dose-response relationship of the test drug. For example, Gormley et al. (1992) evaluate the efficacy and safety of 1 and 5 mg of finasteride with a placebo control. This type of multiplicity is inherited from the fact that the number of treatment groups evaluated in a clinical trial is greater than two. Other types of multiplicity are caused by subgroup analyses. Examples include a trial reported by the National Institute of Neurological Disorders and Stroke rt-PA Study Group (1995), in which stratified analyses were performed according to the time from the onset of stroke to the start of treatment (0–90 or 91–180 minutes). In addition, the BHAT (1982) and CAST (1989) studies were terminated early, before the scheduled conclusion of the trials, because of overwhelming evidence of either beneficial efficacy or serious safety concerns revealed by the technique of repeated interim analyses. In summary, multiplicity in clinical trials can be classified as repeated interim analyses, multiple comparisons, multiple endpoints, and subgroup analyses.


Since the causes of these multiplicities are different, special attention should be paid to (i) the formulation of statistical hypotheses based on the objectives of the trial, (ii) the proper control of experiment-wise false positive rates in subsequent analyses of the data, and (iii) the interpretation of the results.

1.4.7 Sample Size Requirement

In clinical trials, a pre-study power analysis for sample size calculation (estimation or determination) is often performed based on either (i) information obtained from small-scale pilot studies with a limited number of subjects or (ii) a pure guess based on the best knowledge of the investigator (with or without scientific justification). The observed data and/or the investigator's best guess could be far from the truth. Such deviation may bias the sample size calculation for reaching the desired power for achieving the study objectives at a pre-specified level of significance. Sample size calculation is a key to the success of pharmaceutical/clinical research and development. Thus, how to select the minimum sample size for achieving the desired power at a pre-specified significance level has become an important question for clinical scientists (Chow et al., 2008; Chow and Liu, 1998b). A study without a sufficient number of subjects cannot guarantee the desired power (i.e., the probability of correctly detecting a clinically meaningful difference). On the other hand, an unnecessarily large sample size could be quite a waste of limited resources. Sample size calculation plays an important role in pharmaceutical/clinical research and development. In order to determine the minimum sample size required for achieving a desired power, one needs some information regarding study parameters such as the variability associated with the observations and the difference (e.g., treatment effect) that the study is designed to detect. In practice, it is well recognized that sample size calculation depends upon the assumed variability associated with the observations, which is often unknown. Thus, the classical pre-study power analysis for sample size calculation based on information obtained from a small pilot study (with large variability) could vary widely and hence be unstable, depending upon the sampling variability. As a result, one of the controversial issues regarding sample size calculation is the stability (sensitivity or robustness) of the obtained sample size. To overcome the instability of sample size calculation, Chow (2011) suggested that a bootstrap-median approach be considered to select a stable (required minimum) sample size. Such an improved stable sample size can be derived theoretically by the method of an Edgeworth-type expansion. Chow (2011) showed through an extensive simulation study that the bootstrap-median approach performs quite well for providing a stable sample size in clinical trials.
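For illustration, the sketch below computes the classical one-sample normal sample size, n ≥ ((z_{1−α/2} + z_{1−β})σ/δ)², together with a bootstrap-median version in the spirit of Chow (2011): the required n is recomputed over bootstrap resamples of a pilot study and the median is taken. The pilot data and design values are hypothetical, and the Edgeworth-type derivation is not reproduced here.

# A minimal sketch: plug-in vs. bootstrap-median sample size for testing
# a normal mean (two-sided alpha = 0.05, power = 0.80). All inputs are
# hypothetical.
import numpy as np
from statistics import NormalDist

def n_one_sample(sigma, delta, alpha=0.05, beta=0.20):
    z = NormalDist().inv_cdf
    return int(np.ceil(((z(1 - alpha / 2) + z(1 - beta)) * sigma / delta) ** 2))

def bootstrap_median_n(pilot, delta, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    ns = [n_one_sample(np.std(rng.choice(pilot, size=len(pilot)), ddof=1), delta)
          for _ in range(B)]
    return int(np.median(ns))                      # stabilized sample size

pilot = np.random.default_rng(1).normal(0.0, 1.0, size=20)   # small pilot study
print(n_one_sample(sigma=np.std(pilot, ddof=1), delta=0.5))  # plug-in estimate
print(bootstrap_median_n(pilot, delta=0.5))                  # bootstrap-median estimate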


It should be noted that the procedures used for sample size calculation could be very different from one another depending on the study objectives and hypotheses (e.g., testing for equality, testing for superiority, or testing for non-inferiority/equivalence) and the data type (e.g., continuous, binary, or time-to-event). For example, see Lachin and Foulkes (1986), Lakatos (1986), Wang and Chow (2002a), Wang et al. (2002), and Chow and Liu (2008). For a good introduction and summary, one can refer to Chow et al. (2008). In this chapter, for simplicity, we will focus on the most commonly seen situation, where the primary response is continuous and the hypotheses of interest are about the mean under the normality assumption. Most of our discussion thereafter focuses on the one-sample problem for the purpose of simplicity. However, the extension to the two-sample problem is straightforward.

1.4.8 Consistency Test

In recent years, multi-regional (or multi-national) multicenter clinical trials have become very popular in global pharmaceutical/clinical development. The main purpose of a multi-regional clinical trial is not only to assess the efficacy of the test treatment over all regions in the trial, but also to bridge the overall effect of the test treatment to each of the regions in the trial. Most importantly, a multi-regional clinical trial is intended to shorten the time for drug development and regulatory submission and approval around the world (Tsong and Tsou, 2013). Although multi-regional clinical trials provide the opportunity to fully utilize clinical data from all regions to support regional (local) registration, some critical issues such as regional differences (e.g., culture and medical practice/perception) and possible treatment-by-region interaction may occur, which may have an impact on the validity of the multi-regional trials. In multi-regional trials for global pharmaceutical/clinical development, one of the commonly encountered critical issues is that clinical results observed in some regions (e.g., the Asian Pacific region) are inconsistent with clinical results from other regions (e.g., the European Community) or with the global results (i.e., all regions combined). The inconsistency in clinical results between different regions (e.g., the Asian Pacific region and the European Community and/or the United States) could be due to differences in ethnic factors. In this case, the dose or dose regimen may require adjustment, or a bridging study may be required before the data can be pooled for an overall combined assessment of the treatment effect. As a result, evaluation of consistency between specific regions (sub-populations) and all regions combined (the global population) is necessary before regional registration (e.g., Japan and China). It should be noted that different regions may have different requirements regarding the sample sizes of specific regions in order to comply with the regulatory requirements for registration (see, e.g., MHLW, 2007).
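As a rough illustration of such a regional requirement, the following sketch approximates by simulation the assurance probability that the observed regional effect preserves at least a fraction rho of the observed overall effect, in the spirit of the MHLW (2007) criterion. Treating the regional and overall estimates as independent normals (in reality they are positively correlated) is a simplifying assumption, as are all numerical inputs.

# A minimal sketch of an MHLW-style consistency assessment: the
# probability that the observed regional effect d_reg preserves at
# least rho of the observed overall effect d_all, given a positive
# overall result. Effect size, variability, sample split, and rho are
# hypothetical; the independence of d_reg and d_all is a simplification.
import numpy as np

def assurance(delta=0.3, sigma=1.0, n_per_arm=200, frac_region=0.15,
              rho=0.5, n_sim=100_000, seed=0):
    rng = np.random.default_rng(seed)
    n_reg = int(n_per_arm * frac_region)           # regional per-arm sample size
    d_all = rng.normal(delta, sigma * np.sqrt(2 / n_per_arm), n_sim)
    d_reg = rng.normal(delta, sigma * np.sqrt(2 / n_reg), n_sim)
    return np.mean((d_reg >= rho * d_all) & (d_all > 0))

print(f"assurance probability: {assurance():.3f}")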


In practice, consistency between clinical results observed in a sub-population (a specific region such as Japan or China) and the entire population (all regions combined) is often interpreted as similarity and/or equivalence between the two populations in terms of treatment effect (i.e., safety and/or efficacy). Along this line, several statistical methods have been proposed in the literature, including a test for consistency (Shih, 2001b), assessment of a consistency index (Tse et al., 2006), evaluation of a sensitivity index (Chow, Shao, and Hu, 2002), achieving reproducibility and/or generalizability (Shao and Chow, 2002), Bayesian approaches (Hsiao et al., 2007; Chow and Hsiao, 2010), and the Japanese approach for evaluation of assurance probability (MHLW, 2007) (see also Liu, Chow, and Hsiao, 2013). The purpose of this section is not only to provide an overview of these methods, but also to compare their relative performance through extensive clinical trial simulation.

1.4.9 Extrapolation

For marketing approval of a new drug product, the FDA requires that at least two adequate and well-controlled clinical trials be conducted to provide substantial evidence regarding the effectiveness and safety of the drug product under investigation. The purpose of requiring at least two clinical studies is not only to assure reproducibility, but also to provide valuable information regarding generalizability. Generalizability can have several distinct meanings. First, it can refer to whether the clinical results for the original target patient population under study (e.g., adults) can be generalized to other similar but different patient populations (e.g., pediatrics or the elderly). Second, it can be a measure of whether a newly developed or approved drug product in one region (e.g., the United States or the European Union) can be approved in another region (e.g., countries in the Asian-Pacific region), particularly if there is a concern that differences in ethnic factors could alter the efficacy and safety of the drug product in the new region. Third, for case-control studies, it is often of interest to determine whether a predictive model developed or established based on a database at one medical center can be applied to a different medical center that has a similar but different database for patients with similar diseases. In practice, since it is of interest to determine whether the observed clinical results from the original target patient population can be generalized to a similar but different patient population, we will focus on the first scenario. Statistical methods for the assessment of generalizability of clinical results, however, can also be applied to the other scenarios. Although the ICH E5 guideline establishes the framework for the acceptability of foreign clinical data, it does not clearly define the desired similarity in terms of dose response, safety, and efficacy between the original region and a new region. Shih (2001) interpreted similarity as consistency among study centers by treating the new region as a new center of a multicenter clinical trial.


Under this definition, Shih proposed a method for the assessment of consistency to determine whether the study is capable of bridging the foreign data to the new region. Alternatively, Shao and Chow (2002) proposed the concepts of reproducibility and generalizability probabilities for assessing bridging studies. In addition, Chow et al. (2002) proposed to assess similarity by analysis using a sensitivity index, which is a measure of the population shift between the original region and the new region. For the assessment of generalizability of clinical results from one population to another, Chow (2010) proposed to evaluate a so-called generalizability probability of a positive clinical result observed in the original patient population by studying the impact of a shift in the target patient population through a model that links the population means with some covariates (see also Chow and Shao, 2005; Chow and Chang, 2006). However, in many cases, such covariates may not exist, or may exist but be unobservable. In this case, it is suggested that the degree of shift in the location and scale of the patient population be studied based on a mixture distribution by assuming that the location or scale parameter is a random variable (Shao and Chow, 2002). The purpose of this section is to assess the generalizability of clinical results by evaluating the sensitivity index under different models for the following situations: (1) the shift in the location parameter is random, (2) the shift in the scale parameter is random, and (3) the shifts in both the location and scale parameters are random.

1.4.10 Drug Products with Multiple Components

In recent years, as more and more innovative drug products go off patent protection, the search for new medicines that treat critical and/or life-threatening diseases such as cardiovascular diseases and cancer has become the center of attention of many pharmaceutical companies and research organizations such as the National Institutes of Health (NIH). This has led to the study of the potential use of promising traditional Chinese medicines (TCM), especially for critical and/or life-threatening diseases. Bensoussan et al. (1998) used a randomized clinical trial (RCT) to assess the effect of Chinese herbal medicine in treating irritable bowel syndrome. However, RCTs are not in common use when studying TCM. There are fundamental differences between Western medicines and TCM in terms of diagnostic procedures, therapeutic indices, medical mechanism, and medical theory and practice (Chow, Pong, and Chang, 2006; Chow, 2015). Besides, a TCM often consists of multiple components with flexible doses. Chinese doctors believe that all of the organs within a healthy subject should reach a so-called global dynamic balance and harmony among organs. Once the global balance is broken at certain sites such as the heart, liver, or kidney, some signs and symptoms appear to reflect the imbalance at these sites. The collective signs and symptoms are then used to determine what disease the individual patient has. An experienced Chinese doctor usually assesses the causes of global imbalance before a TCM with flexible doses is prescribed to fix the problem.


This approach is sometimes referred to as a personalized (or individualized) medicine approach. In practice, TCM considers inspection, auscultation and olfaction, interrogation, and pulse taking and palpation as the primary diagnostic procedures. The scientific validity of these subjective and experience-based diagnostic procedures has been criticized due to the lack of reference standards and the anticipated large evaluator-to-evaluator (i.e., Chinese doctor-to-Chinese doctor) variability. For a systematic discussion of the statistical issues of TCM, see Chow (2015). In this book, we attempt to propose a unified approach to developing a composite illness index based on a number of indices collected from a given subject under the concept of global dynamic balance among organs. Dynamic balance among organs can be assessed as follows. Following the concept of testing bioequivalence or biosimilarity, if the 95% confidence upper bound is less than some health limit, we conclude that the treatment achieves dynamic balance among the organs of the subject and hence is considered efficacious. If we fail to reject the null hypothesis, we conclude that the treatment is not efficacious, since there is still a signal of illness (e.g., some of the signs and symptoms are still outside the health limit). In practice, these signals of illness can be grouped to diagnose specific diseases based on some pre-specified reference standards for the disease status of specific diseases, which are developed based on indices related to specific organs (or diseases).

1.4.11 Advisory Committee

The FDA has established advisory committees, each consisting of clinical, pharmacological, and statistical experts and one consumer advocate (not employed by the FDA), in designated drug classes and sub-specialties. The responsibilities of the committees are to review data presented in NDAs and to advise the FDA as to whether there exists substantial evidence of safety and effectiveness based on adequate and well-controlled clinical studies. In addition, the committee may also be asked at times to review certain INDs, protocols, or important issues relating to marketed drugs and biologics. The advisory committees not only supplement the FDA's expertise but also allow an independent peer review during the regulatory process. Note that the FDA usually prepares a set of questions for the advisory committee to address at the meeting. The following is a list of some typical questions:

1. Are there two or more adequate and well-controlled trials?
2. Have the patient populations been well enough characterized?
3. Has the dose-response relationship been sufficiently characterized?


4. Do you recommend the use of the drug for the indication sought by the sponsor for the intended patient population?

The FDA usually follows the recommendations made by an advisory committee for marketing approval, though it is not legally bound to do so.

1.4.12 Recent FDA Critical Clinical Initiatives

In addition to the above-mentioned practical, challenging, and controversial issues, the FDA has also kicked off several critical clinical initiatives to assist sponsors in pharmaceutical product development. These critical clinical initiatives include, but are not limited to: (i) statistical methodology development for generic drugs and biosimilar products (see Chapter 15); (ii) precision/personalized medicine (see Chapter 16); (iii) biomarker-driven clinical trials (see Chapter 16); (iv) big data analytics (see Chapter 17); (v) rare disease drug development (see Chapter 18); (vi) real-world data and real-world evidence; (vii) model-informed drug development (MIDD); and (viii) machine learning for mobile individualized medicine (MIM) and imaging medicine. More details on these critical clinical initiatives can be found in Chapters 13–18. For example, an overview of adaptive trial designs such as the phase II/III seamless adaptive trial design can be found in Chapters 13 and 14. Statistical methods for the assessment of bioequivalence for generic drugs and for the demonstration of biosimilarity for biosimilar products can be found in Chapter 15. The difference between precision medicine and personalized medicine is discussed in Chapter 16. The concept of big data analytics can be found in Chapter 17. Innovative thinking for rare disease drug development, including innovative trial designs, statistical methods for data analysis, and sample size requirements, can be found in Chapter 18. Regarding real-world data and real-world evidence, the following are commonly asked questions:

1. Does the information include clinical data collected from randomized clinical trials (RCTs)?
2. Does real-world evidence constitute substantial evidence for the evaluation of safety and efficacy of a drug product?

To provide a better understanding of the issues associated with real-world data and real-world evidence, Table 1.9 provides a comparison between real-world evidence and substantial evidence obtained from adequate and well-controlled clinical studies. As can be seen from Table 1.9, analysis of real-world data for obtaining real-world evidence could lead to incorrect and unreliable conclusions regarding the safety and efficacy of the test treatment under investigation due to (i) selection bias, (ii) uncontrollable variability, and (iii) the fact that real-world data come from multiple/diverse sources.


TABLE 1.9
A Comparison between Real-World Evidence (RWE) and Substantial Evidence Obtained from Randomized Clinical Trials (RCTs)

Real-World Evidence                                        Randomized Clinical Trials
• General population                                       • Specific population
• Selection bias                                           • Bias is minimized
• Variability is expected and not controllable             • Variability is expected and controllable
• Real-world evidence from multiple/diverse sources        • Substantial evidence from RCTs
• Reflects real clinical practice                          • Reflects controlled clinical practice
• Statistical methods are not fully established            • Statistical methods are well established
• Could generate incorrect and unreliable conclusions      • Accurate and reliable conclusions

Thus, it is suggested that real-world evidence be used for safety assessment but not for efficacy evaluation in the regulatory approval process. Regarding model-informed drug development (MIDD), as indicated in PDUFA VI, MIDD can be classified into six categories: (i) PK/PD; (ii) PK, population PK (popPK), and physiologically based PK (PBPK) modeling; (iii) disease models, including clinical trial models; (iv) systems biology: quantitative systems pharmacology (QSP) and congenital insensitivity to pain with anhidrosis (CIPA); (v) quantitative structure activity relationship (QSAR) and quantitative structure property relationship (QSPR); and (vi) clinical trial simulation (see Figure 1.5). Statistically speaking, MIDD is the study of the response-exposure relationship, which can be performed in the following steps: (i) model building, (ii) model validation, and (iii) model generalizability.

FIGURE 1.5 The scope of model-informed drug development.


Model building may involve the identification of risk factors (predictors), tests for collinearity, and assessment of goodness-of-fit. For model validation, a typical approach is to randomly split the data into two sets: one for model building and one for model validation. Validation at this stage is considered internal validation. External validation is usually referred to as model generalizability, i.e., whether the predictive model can be generalized from one patient population to another or from one medical center to another.

Machine learning is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead. Applications of machine learning in drug research and development include, but are not limited to, mobile individualized medicine (MIM) and imaging medicine (IM). For mobile individualized medicine, machine learning can be applied to (i) safety monitoring, e.g., the detection of deaths due to opioid overdose, and (ii) the capture of real-time data in the conduct of clinical trials. For imaging medicine, machine learning is useful in the analysis and interpretation of imaging data collected from clinical research.
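A minimal sketch of the internal validation step described above: randomly split the data, build an ordinary least squares model on one part, and check predictive performance on the held-out part. The simulated data, the 70/30 split, and the linear model are illustrative assumptions.

# Internal validation by random splitting: fit on the building set,
# assess prediction error on the validation set. All data are simulated.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # candidate predictors
y = X @ np.array([0.8, -0.5, 0.0]) + rng.normal(scale=0.5, size=200)

idx = rng.permutation(len(y))
train, valid = idx[:140], idx[140:]                # model building vs. validation

Xb = np.column_stack([np.ones(len(train)), X[train]])
beta, *_ = np.linalg.lstsq(Xb, y[train], rcond=None)   # model building (OLS)

Xv = np.column_stack([np.ones(len(valid)), X[valid]])
rmse = np.sqrt(np.mean((y[valid] - Xv @ beta) ** 2))
print(f"validation RMSE: {rmse:.3f}")              # internal validation check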

1.5 Aim and Scope of the Book

This is intended to be the first book entirely devoted to statistics in regulatory science relevant to the review and approval process of regulatory submissions for pharmaceutical product development. It covers general principles of GSP and key statistical concepts that are commonly employed in the review and approval process of regulatory submissions. Some practical, challenging, and controversial issues that may occur during the review and approval process of regulatory submissions are discussed. In addition to complex innovative designs, the FDA recently kicked off several critical clinical initiatives regarding precision medicine, big data analytics, rare disease product development, biomarker-driven clinical research, model-informed drug development (MIDD), patient-focused risk assessment in drug development, and real-world data and real-world evidence. These initiatives are meant to assist sponsors in expediting the pharmaceutical research and development process in a more efficient way. The purpose of this book is to outline these challenging and controversial issues and the recent developments regarding the FDA critical clinical initiatives that are commonly encountered in the review and approval process of regulatory submissions. It is my goal to provide a useful desk reference and a state-of-the-art examination of statistics in regulatory science in drug development to those in government regulatory agencies who have to make critical decisions on regulatory submissions, and to biostatisticians who provide statistical support to studies conducted for regulatory submissions of pharmaceutical product development.

44

Innovative Statistics in Regulatory Science

support to studies conducted for regulatory submissions of pharmaceutical product development. More importantly, we would like to provide graduate students in pharmacokinetics, clinical pharmacology, biopharmaceutics, clinical investigation, and biostatistics an advanced textbook in statistics for regulatory science. We hope that this book can serve as a bridge among government regulatory agencies, the pharmaceutical industry, and academia. The scope of this book is restricted to practical issues that are commonly seen in the regulatory science of pharmaceutical research and development. This book consists of 20 chapters concerning topics related to research activities, review of regulatory submissions, policy/guidance development, and FDA critical clinical initiatives in regulatory science. Chapter 1 provides key statistical concepts and innovative design methods in clinical trials that are commonly considered in regulatory science. Also included in this chapter are some practical, challenging, and controversial issues that are commonly seen in the review and approval process of regulatory submissions. Chapter 2 provides an interpretation of the substantial evidence required for demonstration of the safety and effectiveness of drug products under investigation. The related concepts of totality-of-the-evidence and real-world evidence are also described in this chapter. Chapter 3 distinguishes the concepts of hypotheses testing and the confidence interval approach for evaluation of the safety and effectiveness of drug products under investigation. Also included in this chapter is a comparison between the use of a 90% confidence interval approach for the evaluation of generics/biosimilars and the use of a 95% confidence interval approach for the assessment of new drugs. Chapter 4 deals with endpoint selection in clinical research and development. Also included in this chapter is the development of a therapeutic index for endpoint selection in complex innovative designs such as multiple-stage adaptive designs. Chapter 5 focuses on non-inferiority margin selection. Also included in this chapter is a proposed clinical strategy for margin selection based on risk assessment of the false positive rate. Chapters 6 and 7 provide discussions of statistical methods for missing data imputation and multiplicity adjustment for multiple comparisons in clinical trials, respectively. Sample size requirements under various designs are summarized in Chapter 8. Chapter 9 introduces the concept of reproducible research. Chapter 10 discusses the concept of and statistical methods for the assessment of extrapolation across patient populations and/or indications. Chapter 11 compares statistical methods for evaluation of consistency in multi-regional clinical trials. Chapter 12 provides an overview of drug products with multiple components such as botanical drug products and traditional Chinese medicine. Chapter 13 provides an overview of adaptive trial designs that are commonly used in pharmaceutical/clinical research and development. Chapter 14 evaluates several selection criteria that are commonly considered in adaptive dose finding studies. Chapter 15 compares bioequivalence assessment for generic drug products and biosimilarity assessment for biosimilar products. Also included in this chapter is a proposed general
approach for the assessment of bioequivalence for generics and biosimilarity for biosimilars. Chapter 16 discusses the difference between precision medicine and personalized medicine. Chapter 17 introduces the concept of big data analytics. Also included in this chapter are the types of big data analytics and the potential bias of big data analytics. Chapter 18 focuses on rare disease clinical development, including innovative trial designs, statistical methods for data analysis, and some commonly seen challenging issues.

2 Totality-of-the-Evidence

2.1 Introduction

As indicated in Chapter 1, for approval of pharmaceutical products, the United States (US) Food and Drug Administration (FDA) requires that substantial evidence regarding the safety and effectiveness of the test treatment under investigation be provided for review and regulatory approval. Substantial evidence, however, can only be obtained through the conduct of adequate and well-controlled studies (Section 314 of 21 CFR). The FDA requires that reports of adequate and well-controlled investigations provide the primary basis for determining whether there is substantial evidence to support the claims of drugs, biologics, and medical devices. Recently, the FDA proposed a new concept of totality-of-the-evidence for the review and approval of regulatory submissions of biosimilar (follow-on biological) products due to the structural and functional complexity of large-molecule biological drug products (FDA, 2015a). Following a similar concept to substantial evidence, the totality-of-the-evidence can be obtained through an FDA-recommended stepwise approach that starts with analytical studies for functional and structural characterization of critical quality attributes (CQAs) that are identified at various stages of the manufacturing process and considered relevant to clinical outcomes. The stepwise approach continues with animal studies for toxicity assessment, pharmacological studies such as pharmacokinetic (PK) and pharmacodynamic (PD) studies, and clinical studies such as immunogenicity, safety, and efficacy studies. The FDA indicated that the stepwise approach is the approach recommended for obtaining the totality-of-the-evidence for demonstrating high similarity between a proposed biosimilar product and an innovative biological product. Thus, the totality-of-the-evidence comprises analytical similarity evaluation, PK/PD similarity assessment, and clinical similarity demonstration. Note that there is no one-size-fits-all assessment in the stepwise approach. In practice, it is very likely that the proposed biosimilar product does not meet the requirements for all similarity tests (i.e., analytical similarity, PK/PD similarity, and clinical similarity). For example, the proposed biosimilar product may fail analytical similarity evaluation but pass both PK/PD and clinical
similarity assessment. In this case, the sponsor is often asked to provide justification that the observed difference has little impact on the clinical outcomes. On the other hand, if the proposed biosimilar product passes the tests for analytical similarity and PK/PD similarity but fails the clinical similarity test, most likely the proposed biosimilar product will be viewed as not highly similar to the innovative product. Thus, a few questions have been raised. First, are there any links among analytical similarity, PK/PD similarity, and clinical similarity? In other words, is analytical similarity predictive of PK/PD similarity and/or clinical similarity? Second, does analytical similarity evaluation carry the same weight as PK/PD similarity assessment and clinical similarity demonstration in the review and approval process of the proposed biosimilar product? If they carry the same weight (i.e., they are equally important), then one would expect that analytical similarity, PK/PD similarity, and clinical similarity should all be demonstrated for obtaining the totality-of-the-evidence before the proposed biosimilar product can be approved. Third, does the totality-of-the-evidence constitute substantial evidence for approval of the proposed biosimilar product? This chapter will attempt to address these three questions. In the next section, the substantial evidence that is required for the evaluation of pharmaceutical products is briefly described. Section 2.3 will examine the concept of the totality-of-the-evidence for demonstration of biosimilarity between a proposed biosimilar product and an innovative biological product. Also included in this section are examples concerning recent regulatory submissions utilizing the concept of totality-of-the-evidence for regulatory approval. Section 2.4 poses some practical and challenging issues regarding the use of totality-of-the-evidence in the regulatory approval process for biosimilar products. Section 2.5 outlines the development of an index for quantitation of the totality-of-the-evidence. Some concluding remarks are given in Section 2.6.

2.2 Substantial Evidence

For approval of a test treatment under investigation, Section 314.126 of 21 CFR requires that substantial evidence regarding the safety and effectiveness of the test treatment be provided in the review and approval process. It also indicates that such substantial evidence can only be obtained through the conduct of adequate and well-controlled studies, and it provides the definition of an adequate and well-controlled study, which is summarized in Table 2.1.
TABLE 2.1
Characteristics of an Adequate and Well-Controlled Study

Criteria                   Characteristics
Objectives                 Clear statement of investigation's purpose
Methods of analysis        Summary of proposed or actual methods of analysis
Design                     Valid comparison with a control to provide a quantitative assessment of drug effect
Selection of subjects      Adequate assurance of the disease or conditions under study
Assignment of subjects     Minimization of bias and assurance of comparability of groups
Participants of studies    Minimization of bias on the part of subjects, observers, and analysts
Assessment of responses    Well-defined and reliable
Assessment of the effect   Requirements of appropriate statistical methods

Source: Section 314.126 of 21 CFR.

As can be seen from Table 2.1, an adequate and well-controlled study is judged by eight criteria specified in the CFR: (i) study objectives; (ii) methods of analysis; (iii) design of studies; (iv) selection of subjects; (v) assignment of subjects; (vi) participants of studies; (vii) assessment of responses; and (viii) assessment of the effect, all of which are objective and closely related to the statistics used for the assessment of the test treatment under investigation. In summary, an adequate and well-controlled study starts with clearly stated study objectives. Under the clearly stated study objective(s), valid study designs and methods are then applied for collecting quality data from a random (representative) sample drawn from the target population. For this purpose, statistical methods such as randomization and blinding are used to minimize potential biases and variations for an accurate and reliable assessment of the treatment responses and effects of the test treatment under investigation; a minimal sketch of one common randomization scheme is given below.
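The following sketch shows permuted-block randomization, one common way to operationalize the randomization mentioned above; the block size, arm labels, and seed are assumptions chosen only for illustration.

```python
import random

def block_randomize(n_subjects, block_size=4, arms=("T", "R"), seed=2024):
    """Permuted-block randomization: each block contains equal numbers of
    each arm, which keeps group sizes balanced as enrollment proceeds."""
    rng = random.Random(seed)
    per_block = [arm for arm in arms for _ in range(block_size // len(arms))]
    schedule = []
    while len(schedule) < n_subjects:
        block = per_block[:]
        rng.shuffle(block)     # randomize order within each block
        schedule.extend(block)
    return schedule[:n_subjects]

print(block_randomize(10))     # e.g., ['T', 'R', 'R', 'T', 'R', 'T', ...]
```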

2.3 Totality-of-the-Evidence

For approval of a proposed biosimilar product, the FDA requires that totality-of-the-evidence be provided to support a demonstration that the proposed biosimilar product is highly similar to the original US-licensed product, notwithstanding minor differences in clinically inactive components, and that there are no clinically meaningful differences between the proposed biosimilar product and the US-licensed product in terms of the safety, purity, and potency of the product.

2.3.1 Stepwise Approach

As defined in the Biologics Price Competition and Innovation Act of 2009 (BPCI Act), a biosimilar product is a product that is highly similar to the
reference product, notwithstanding minor differences in clinically inactive components, and for which there are no clinically meaningful differences in terms of safety, purity, and potency. Based on the definition in the BPCI Act, biosimilarity requires that there be no clinically meaningful differences in terms of safety, purity, and potency. Safety could include PK/PD, safety and tolerability, and immunogenicity studies. Purity includes all critical quality attributes during the manufacturing process. Potency refers to efficacy studies. As indicated earlier, in the 2015 FDA guidance on scientific considerations, the FDA recommends that a stepwise approach be considered for obtaining the totality-of-the-evidence to demonstrate the biosimilarity of a proposed biosimilar product as compared to a reference product in terms of safety, purity, and efficacy (Chow, 2013; FDA, 2015a, 2017; Endrenyi et al., 2017). The stepwise approach starts with the assessment of similarity in CQAs in analytical studies, followed by the assessment of similarity in pharmacological activities in PK/PD studies and the assessment of similarity in safety and efficacy in clinical studies. The pyramid illustrated in Figure 2.1 briefly summarizes the stepwise approach.

FIGURE 2.1 Stepwise approach for biosimilar product development.


2.3.2 Fundamental Biosimilarity Assumptions

As indicated in Section 2.3.1, the assessment of analytical data constitutes the first step of the FDA-recommended stepwise approach to the totality-of-the-evidence. In practice, the assessment of analytical data is performed to achieve the following primary objectives (see, e.g., BLA 761028 and BLA 761074):

1. to demonstrate that the proposed biosimilar product can be manufactured in a well-controlled and consistent manner that meets appropriate quality standards;
2. to support a demonstration that the proposed biosimilar product and the reference product are highly similar;
3. to serve as a bridge to PK/PD similarity and/or clinical similarity;
4. to provide scientific justification for extrapolation of data to support biosimilarity in each of the additional indications for which the sponsor is seeking licensure.

Note that, regarding objective (4), for analytical similarity assessment in CQAs, the FDA further recommends a tiered approach that classifies CQAs into three tiers depending upon their criticality or risk ranking relevant to clinical outcomes. For the determination of criticality or risk ranking, the FDA suggests establishing a predictive (statistical) model based on either the mechanism of action (MOA) or PK relevant to clinical outcome. To achieve these objectives, the study of the relationships among analytical, PK/PD, and clinical data is essential. For this purpose, the following assumptions are made for the stepwise approach for obtaining the totality-of-the-evidence:

1. analytical similarity is predictive of PK/PD similarity;
2. analytical similarity is predictive of clinical outcomes;
3. PK/PD similarity is predictive of clinical outcomes.

These assumptions, however, are difficult (if not impossible) to verify in practice. For assumptions (1) and (2), although many in vitro and in vivo correlations (IVIVC) have been studied in the literature, the correlations between specific CQAs and PK/PD parameters or clinical endpoints are not fully studied and understood. In other words, most predictive models are either not well established or established but not validated. Thus, it is not clear how a (notable) change in a specific CQA can be translated into a change in drug absorption or clinical outcome. For assumption (3), unlike bioequivalence assessment
for generic drug products, there does not exist a Fundamental Biosimilarity Assumption indicating that PK/PD similarity implies clinical similarity in terms of safety and efficacy. In other words, PK/PD similarity or dissimilarity may or may not lead to clinical similarity. Note that assumptions (1) and (3) being met simultaneously does not automatically imply the validity of assumption (2). The validity of assumptions (1)-(3) is critical for the success of obtaining the totality-of-the-evidence for assessing biosimilarity between the proposed biosimilar and the innovative biological product. This is because the validity of these assumptions ensures the relationships among the analytical, PK/PD, and clinical similarity assessments and consequently the validity of the overall biosimilarity assessment. Figure 2.2 illustrates the relationships among analytical, PK/PD, and clinical assessments in the stepwise approach for obtaining the totality-of-the-evidence in biosimilar product development.

2.3.3 Examples: Recent Biosimilar Regulatory Submissions

For illustration purposes, consider two recent FDA biosimilar regulatory submissions, i.e., an Avastin biosimilar (ABP215, sponsored by Amgen) and a Herceptin biosimilar (MYL-1401O, sponsored by Mylan). These two regulatory submissions were reviewed and discussed at an Oncologic Drugs Advisory Committee (ODAC) meeting held on July 13th, 2017 in Silver Spring, Maryland. Table 2.2 briefly summarizes the results of the review based on the concept of totality-of-the-evidence. As for ABP215, a proposed biosimilar to Genentech's Avastin, although ABP215 passed both the PK/PD similarity and clinical similarity tests, several quality attribute differences were noted. These notable differences include glycosylation content, FcγRIIIa binding, and product-related species (aggregates, fragments, and charge variants). The glycosylation and FcγRIIIa binding differences were addressed by means of in vitro cell-based ADCC and CDC activity, which were not detected for all products (ABP215,

FIGURE 2.2 Relationships among analytical, PK/PD, and clinical assessment.


TABLE 2.2
Examples of Assessment of Totality-of-the-Evidence

Regulatory           Innovative   Proposed     Analytical                                   PK/PD        Clinical
Submission           Product      Biosimilar   Similarity                                   Similarity   Similarity
BLA 761028 (Amgen)   Avastin      ABP215       Notable differences observed in              Pass         Pass
                                               glycosylation content and FcγRIIIa binding
BLA 761074 (Mylan)   Herceptin    MYL-1401O    Subtle shifts in glycosylation (sialic       Pass         Pass
                                               acid, high mannose, and NG-HC)

US-licensed Avastin, and EU-approved Avastin). In considering the totality-of-the-evidence, the ODAC panel considered the data submitted by the sponsor sufficient to support a demonstration that ABP215 is highly similar to US-licensed Avastin, notwithstanding minor differences in clinically inactive components, and to support that there are no clinically meaningful differences between ABP215 and US-licensed Avastin in terms of the safety, purity, and potency of the product. For MYL-1401O, a proposed biosimilar to Genentech's Herceptin, although MYL-1401O passed both the PK/PD similarity and clinical similarity tests, there were subtle shifts in glycosylation (sialic acid, high mannose, and NG-HC). However, the residual uncertainties related to the increase in total mannose forms and sialic acid and the decrease in NG-HC were addressed by ADCC similarity and by PK similarity. Thus, the ODAC panel considered the data submitted by the sponsor sufficient to support a demonstration that MYL-1401O is highly similar to US-licensed Herceptin, notwithstanding minor differences in clinically inactive components, and to support that there are no clinically meaningful differences between MYL-1401O and US-licensed Herceptin in terms of the safety, purity, and potency of the product.

2.3.4 Remarks

As discussed in Section 2.3.3, it is not clear whether the totality-of-the-evidence for high similarity can only be achieved if the proposed biosimilar product has passed all similarity tests across the analytical, PK/PD, and clinical domains. When notable differences in some Tier 1 CQAs are observed, these notable differences may be ignored if the sponsor can provide scientific rationales/justifications to rule out an impact of the observed differences on clinical outcomes. This, however, is somewhat


controversial, because Tier 1 CQAs are classified as those most relevant to clinical outcomes based on their criticality or risk ranking. The criticalities and/or risk rankings may be determined using model (3) described in Section 2.4.1. If a notable difference is considered to have little or no impact on the clinical outcome, then the CQA should not have been classified into Tier 1 in the first place. This controversy could be due to the classification of CQAs based on subjective judgment rather than on objective statistical modeling. In the two examples concerning the biosimilar regulatory submissions of ABP215 (Avastin biosimilar) and MYL-1401O (Herceptin biosimilar), the sponsors also sought approval across different indications. There has been a tremendous amount of discussion regarding whether totality-of-the-evidence observed from one or a couple of indications can be extrapolated to other indications, even when the different indications have similar mechanisms of action. The ODAC panel expressed concern over extrapolation in the absence of any collection of clinical data and encouraged further research on the scientific validity of extrapolation and/or generalizability of the proposed biosimilar product.

2.4 Practical Issues and Challenges

The use of the concept of totality-of-the-evidence for demonstrating that a proposed biosimilar product is highly similar to an innovative biological product has been challenged through the three questions described in Section 2.1. To address the first question, whether analytical similarity is predictive of PK/PD similarity and/or clinical similarity, the relationships among analytical similarity, PK/PD similarity, and clinical similarity must be studied.

2.4.1 Link among Analytical Similarity, PK/PD Similarity, and Clinical Similarity

The relationships among CQAs, PK/PD responses, and clinical outcomes are depicted in Figure 2.2. In practice, for simplicity, CQAs, PK/PD responses, and clinical outcomes are usually assumed to be linearly correlated. For example, let x, y, and z be the test result of a CQA, the PK/PD response, and the clinical outcome, respectively. Under assumptions (1)-(3), we have

$$(1)\ y = a_1 + b_1 x + e_1; \qquad (2)\ z = a_2 + b_2 y + e_2; \qquad (3)\ z = a_3 + b_3 x + e_3,$$


where e1, e2, and e3 follow normal distributions with mean 0 and variances σ1², σ2², and σ3², respectively. In practice, each of the above models is often difficult, if not impossible, to validate due to insufficient data being collected during biosimilar product development. Under each of the above models, we may consider a criterion for the closeness between an observed response and its predicted value to determine whether the respective model is a good predictive model. As an example, under model (1), we may consider the following two measures of closeness, based on the absolute difference and the relative difference, respectively, between an observed value y and its predicted value ŷ:

Criterion I: $p_1 = P\{|y - \hat{y}| < \delta\}$;
Criterion II: $p_2 = P\{|(y - \hat{y})/y| < \delta\}$.

It is desirable to have a high probability that the difference or the relative difference between y and ŷ (given by p1 and p2, respectively) is less than a clinically meaningful difference δ; a minimal simulation sketch of these criteria is given below. Suppose there is a well-established relationship between x (e.g., the test result of a given CQA) and y (e.g., the PK/PD response). Model (1) indicates that a change in the CQA, say Δx, corresponds to a change of $a_1 + b_1\Delta x$ in the PK/PD response. Similarly, model (2) indicates that a change in the PK/PD response, say Δy, corresponds to a change of $a_2 + b_2\Delta y$ in the clinical outcome. Models (2) and (3) allow us to evaluate the impact of a change in the CQA (i.e., x) on PK/PD (i.e., y) and consequently on the clinical outcome (i.e., z). Under models (2) and (3), we have $a_2 + b_2 y + e_2 = a_3 + b_3 x + e_3$. This leads to

$$a_1 = \frac{a_3 - a_2}{b_2}, \qquad b_1 = \frac{b_3}{b_2}, \qquad e_1 = \frac{e_3 - e_2}{b_2},$$

with $b_2^2 \sigma_1^2 = \sigma_2^2 + \sigma_3^2$, or

$$\sigma_1 = \frac{1}{b_2}\sqrt{\sigma_2^2 + \sigma_3^2}.$$
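The following is a minimal Monte Carlo sketch of Criteria I and II under model (1); the model parameters and the margins delta_abs and delta_rel are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical parameters for model (1): y = a1 + b1*x + e1
a1, b1, sigma1 = 2.0, 0.5, 0.3
delta_abs, delta_rel = 0.5, 0.1      # assumed closeness margins
n = 100_000

x = rng.uniform(1.0, 5.0, n)                     # CQA test results
y = a1 + b1 * x + rng.normal(0.0, sigma1, n)     # observed PK/PD responses
y_hat = a1 + b1 * x                              # predicted values

p1 = np.mean(np.abs(y - y_hat) < delta_abs)          # Criterion I
p2 = np.mean(np.abs((y - y_hat) / y) < delta_rel)    # Criterion II
print(f"p1 = {p1:.3f}, p2 = {p2:.3f}")
```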

In practice, the above relationships can be used to verify primary assumptions as described in the previous section provided that models (1)–(3) have


been validated. Suppose models (1)-(3) are well established, validated, and fully understood. A commonly asked question is whether PK/PD studies and/or clinical studies can be waived if analytical similarity and/or PK/PD similarity has been demonstrated. Note that the above relationships hold only under the linearity assumption. When there is a departure from linearity in any one of models (1)-(3), the above relationships are necessarily altered. Considering multiple CQAs and several endpoints in PK/PD and clinical outcomes, models (1)-(3) can easily be extended to general linear models of the form

$$(4)\ Y = B_1 X + E_1; \qquad (5)\ Z = B_2 Y + E_2; \qquad (6)\ Z = B_3 X + E_3,$$

where E1, E2, and E3 follow multivariate normal distributions $N(0, \sigma_1^2 I)$, $N(0, \sigma_2^2 I)$, and $N(0, \sigma_3^2 I)$, respectively. Thus, we have $B_1 = B_2^{-1} B_3$, with

$$\sigma_1^2 I = (\sigma_2^2 + \sigma_3^2)\, B_2^{-1},$$

where

$$B_1 = (X'X)^{-1} X'Y, \qquad B_2 = (Y'Y)^{-1} Y'Z, \qquad B_3 = (X'X)^{-1} X'Z.$$

The existence of a unique solution depends on the ranks of the matrices X and Y. One way to obtain these solutions is to use numerical computation; a brief sketch is given below. In this case, no clinically meaningful difference might be concluded if the minimum of

$$P\{\|Z - X(X'X)^{-1}X'Z\| < \delta\} \quad \text{and} \quad P\{\|Z - Y(Y'Y)^{-1}Y'Z\| < \delta\}$$

is sufficiently large.
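The sketch below computes the least-squares estimates of models (4)-(6) numerically with NumPy; the dimensions and simulated data are hypothetical, and the data are arranged row-wise (one row per lot), which is an assumption about the layout rather than part of the original formulation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical dimensions: n lots, p CQAs (X), q PK/PD endpoints (Y),
# r clinical endpoints (Z); all values are simulated for illustration.
n, p, q, r = 50, 4, 2, 2
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=(p, q)) + 0.1 * rng.normal(size=(n, q))
Z = Y @ rng.normal(size=(q, r)) + 0.1 * rng.normal(size=(n, r))

# Ordinary least-squares estimates of models (4)-(6); (X'X)^{-1}X'Y etc.
# are computed via a linear solve rather than an explicit inverse.
B1 = np.linalg.solve(X.T @ X, X.T @ Y)   # (X'X)^{-1} X'Y
B2 = np.linalg.solve(Y.T @ Y, Y.T @ Z)   # (Y'Y)^{-1} Y'Z
B3 = np.linalg.solve(X.T @ X, X.T @ Z)   # (X'X)^{-1} X'Z

# Under this row-wise convention the chain of models implies B3 ~ B1 B2;
# the discrepancy should be small when the linear models hold.
print(np.abs(B3 - B1 @ B2).max())
```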

2.4.2 Totality-of-the-Evidence versus Substantial Evidence

The second and third questions described in Section 2.1 basically challenge the FDA to explain (i) what constitutes the totality-of-the-evidence and (ii) whether the totality-of-the-evidence is equivalent to the regulatory standard of substantial evidence for approval of drug products. As indicated earlier, the FDA's recommended stepwise approach focuses on three major domains, namely analytical, PK/PD, and clinical similarity, which may be highly correlated under models (1)-(3). Some pharmaceutical scientists interpret the stepwise approach as a scoring system (perhaps with appropriate weights) that includes the domains of analytical, PK/PD, and clinical similarity assessment. In this case, the totality-of-the-evidence can be assessed based on the information regarding biosimilarity obtained from each domain. In practice, for each domain, we may consider either the FDA's recommended binary response (i.e., similar or dissimilar) or resort to the concept of the biosimilarity index (Chow et al., 2011) to assess similarity information and consequently the totality-of-the-evidence across domains. For the FDA's recommended approach, Table 2.3 lists the possible scenarios that may be encountered when performing analytical similarity assessment, PK/PD similarity testing, and clinical similarity assessment. As can be seen from Table 2.3, if the proposed biosimilar product passes the similarity test in all domains, the FDA considers that the sponsor has provided totality-of-the-evidence for demonstration of high similarity between the proposed biosimilar and the innovative biological product. On the other hand, if the proposed

TABLE 2.3
Assessment of Totality-of-the-Evidence

No. of             Analytical Similarity   PK/PD Similarity   Clinical     Overall
Dis-similarities   Assessment              Assessment         Similarity   Assessment
0                  Yes                     Yes                Yes          Yes
1                  Yes                     Yes                No           No
1                  Yes                     No                 Yes          *
1                  No                      Yes                Yes          *
2                  Yes                     No                 No           No
2                  No                      Yes                No           No
2                  No                      No                 Yes          No
3                  No                      No                 No           No

* Scientific rationales must be provided.


biosimilar product fails to pass any of the suggested similarity assessments (i.e., analytical similarity, PK/PD similarity, and clinical similarity), then the regulatory agency will reject the proposed biosimilar product. In practice, it is not uncommon for the proposed biosimilar to fail one of the three suggested similarity assessments, namely the analytical, PK/PD, and clinical similarity assessments. In this case, the regulatory agency may be reluctant to grant approval of the proposed biosimilar product. A typical example of this sort of failure is that notable differences in some CQAs between the proposed biosimilar product and the innovative biological product may be observed in the analytical similarity assessment. In this case, the sponsors often provide scientific rationales/justifications to indicate that the notable differences have little or no impact on clinical outcomes. This move by the sponsors may cause a contentious debate between the FDA and the Advisory Committee during the review/approval process of the proposed biosimilar product, because it is not clearly stated in the FDA guidance whether a proposed biosimilar product is required to pass all similarity tests, regardless of whether they concern Tier 1 CQAs or Tier 2/Tier 3 CQAs, before the regulatory agency can grant approval. In this situation, if the FDA and the ODAC panel accept the sponsors' scientific rationales and justifications that the notable differences have little or no impact on the clinical outcomes, the proposed biosimilar is likely to be granted approval. Such occurrences, however, have raised the interesting question of whether the proposed biosimilar product is required to pass all similarity tests (i.e., analytical similarity, PK/PD similarity, and clinical similarity) for regulatory approval.

2.4.3 Same Regulatory Standards

Recently, the FDA has been challenged by several sponsors who claim that the FDA employs inconsistent standards in the drug review and approval process, because the FDA adopts a 95% confidence interval for the evaluation of new drugs but uses a 90% confidence interval approach for the assessment of generic drugs and biosimilar products. The FDA has tried to clarify the issue by pointing out the difference between a two-sided test (TST) for point hypotheses for the evaluation of new drugs and a two one-sided tests (TOST) procedure for testing interval hypotheses in the case of bioequivalence and biosimilarity for generic drugs and biosimilar products. Both the TST (for point hypotheses) and the TOST (for interval hypotheses) are size-α tests. Thus, the FDA upholds the same regulatory standard for new drugs as for generic drugs and biosimilar products. The confusion about FDA standards arose because of the mixed use of hypotheses testing and confidence interval approaches for the evaluation of drug products. More details can be found in Chapter 3. Analytical similarity evaluation usually involves a large number of CQAs relevant to clinical outcomes with high risk ranking. In practice, it is not clear


whether α-adjustment for multiple comparisons (i.e., analytical similarity, PK/PD similarity, and clinical similarity) should be done for obtaining the totality-of-the-evidence for demonstration of biosimilarity.

2.5 Development of an Index for Totality-of-the-Evidence

Chow (2009) proposed the development of a composite index for assessing the biosimilarity of follow-on biologics based on the facts that (1) the concept of biosimilarity for biologic products (made of living cells) is very different from that of bioequivalence for drug products, and (2) biologic products are very sensitive to small changes in variation during the manufacturing process (i.e., a small change might lead to a drastic change in clinical outcome). Some research on the comparison of moment-based criteria and probability-based criteria for the assessment of (1) average biosimilarity and (2) the variability of biosimilarity for some given study endpoints, obtained by applying the criteria for bioequivalence, is available in the literature (see, e.g., Chow et al., 2010; Hsieh et al., 2010). Yet universally acceptable criteria for biosimilarity are not available in the regulatory guidelines/guidances. Thus, Chow (2009) and Chow et al. (2011) proposed a biosimilarity index based on the concept of the probability of reproducibility, as follows:

Step 1. Assess the average biosimilarity between the test product and the reference product based on a given biosimilarity criterion. For the purpose of illustration, consider a bioequivalence criterion as the biosimilarity criterion. That is, biosimilarity is claimed if the 90% confidence interval of the ratio of the means of a given study endpoint falls within the biosimilarity limits of (80%, 125%) on the original scale or, equivalently, (−0.2231, 0.2231) based on log-transformed data.

Step 2. Once the product passes the test for biosimilarity in Step 1, calculate the reproducibility probability based on the observed ratio (or observed mean difference) and variability. The calculated reproducibility probability thus takes the variability and the sensitivity of heterogeneity in variances into consideration in the assessment of biosimilarity.

Step 3. Claim biosimilarity if the calculated 95% confidence lower bound of the reproducibility probability is larger than a pre-specified number p0, which can be obtained based on an estimate of the reproducibility probability for a study comparing a "reference product" to itself (the "reference product"). We will refer to such a study


as an R-R study. Alternatively, we can claim (local) biosimilarity if the 95% confidence lower bound of the biosimilarity index is larger than p0. In an R-R study, define

P_TR = P{concluding average biosimilarity between the test and reference products in a future trial, given that average biosimilarity based on the average bioequivalence (ABE) criterion has been established in the first trial}. (2.1)

Alternatively, a reproducibility probability for evaluating the biosimilarity of the same two reference products based on the ABE criterion is defined as

P_RR = P{concluding average biosimilarity of the two same reference products in a future trial, given that average biosimilarity based on the ABE criterion has been established in the first trial}. (2.2)

Since the idea of the biosimilarity index is to show that the reproducibility probability is higher in a study comparing "a reference product" with "the reference product" than in a study comparing a follow-on biologic with the innovative (reference) product, the criterion for an acceptable reproducibility probability (i.e., p0) for the assessment of biosimilarity can be obtained from the R-R study. For example, if the R-R study suggests a reproducibility probability of 90%, i.e., P_RR = 90%, the criterion for the reproducibility probability in the biosimilarity study could be chosen as 80% of the 90%, which is p0 = 80% × P_RR = 72%. The biosimilarity index described above has the advantages that (i) it is robust with respect to the selected study endpoint, biosimilarity criteria, and study design, and (ii) the probability of reproducibility reflects the sensitivity of heterogeneity in variance. Note that the proposed biosimilarity index can be applied to different functional areas (domains) of biological products such as pharmacokinetics (PK), biological activities, biomarkers (e.g., pharmacodynamics), immunogenicity, manufacturing process, efficacy, etc. An overall biosimilarity index, or totality biosimilarity index, across domains can be obtained as follows:

Step 1. Obtain $\hat{p}_i$, the probability of reproducibility for the i-th domain, i = 1, ..., K.

Step 2. Define the totality biosimilarity index $\hat{p} = \sum_{i=1}^{K} w_i \hat{p}_i$, where $w_i$ is the weight for the i-th domain.


Step 3. Claim global biosimilarity if we reject the null hypothesis that p ≤ p0, where p0 is a pre-specified acceptable reproducibility probability. Alternatively, we can claim (global) biosimilarity if the 95% confidence lower bound of p is larger than p0.

Let T and R denote the test and reference products, with means µ′T and µ′R for the parameter of interest (e.g., a pharmacokinetic response). The interval hypotheses for testing the ABE of the two products can be expressed as

$$H_0: \frac{\mu'_T}{\mu'_R} \le \theta'_L \ \ \text{or}\ \ \frac{\mu'_T}{\mu'_R} \ge \theta'_U \quad \text{vs.} \quad H_a: \theta'_L < \frac{\mu'_T}{\mu'_R} < \theta'_U.$$

Under a parallel design, the power of the TOST procedure at (δL, δU) can be expressed as

$$P(\delta_L, \delta_U) = P\big(T_L(\bar{Y}_T, \bar{Y}_R, s_T, s_R) > t_{\alpha, df_p} \ \text{and}\ T_U(\bar{Y}_T, \bar{Y}_R, s_T, s_R) < -t_{\alpha, df_p} \mid \delta_L, \delta_U\big), \quad (2.3)$$

where sT, sR, nT, and nR are the sample standard deviations and sample sizes for the test and reference formulations, respectively. The value of dfp can be calculated by

$$df_p = \frac{\left(\dfrac{s_T^2}{n_T} + \dfrac{s_R^2}{n_R}\right)^2}{\dfrac{(s_T^2/n_T)^2}{n_T - 1} + \dfrac{(s_R^2/n_R)^2}{n_R - 1}},$$

and

$$T_L(\bar{Y}_T, \bar{Y}_R, s_T, s_R) = \frac{\bar{Y}_T - \bar{Y}_R - \theta_L}{\sqrt{\dfrac{s_T^2}{n_T} + \dfrac{s_R^2}{n_R}}}, \qquad T_U(\bar{Y}_T, \bar{Y}_R, s_T, s_R) = \frac{\bar{Y}_T - \bar{Y}_R - \theta_U}{\sqrt{\dfrac{s_T^2}{n_T} + \dfrac{s_R^2}{n_R}}},$$

$$\delta_L = \frac{\mu_T - \mu_R - \theta_L}{\sqrt{\dfrac{\sigma_T^2}{n_T} + \dfrac{\sigma_R^2}{n_R}}} \qquad \text{and} \qquad \delta_U = \frac{\mu_T - \mu_R - \theta_U}{\sqrt{\dfrac{\sigma_T^2}{n_T} + \dfrac{\sigma_R^2}{n_R}}}. \quad (2.4)$$

Here σT² and σR² are the variances for the test and reference formulations, respectively. The vector (TL, TU) can be shown to follow a bivariate noncentral t-distribution with n1 + n2 − 2 and dfp degrees of freedom, correlation of 1, and noncentrality parameters δL and δU (Phillips, 1990; Owen, 1965). Owen (1965) showed that the integral of this bivariate noncentral t-distribution can be expressed as the difference of the integrals of two univariate noncentral t-distributions. Therefore, the power function in (2.3) can be obtained as

$$P(\delta_L, \delta_U) = Q_f(t_U, \delta_U; 0, R) - Q_f(t_L, \delta_L; 0, R), \quad (2.5)$$

where

$$Q_f(t, \delta; 0, R) = \frac{\sqrt{2\pi}}{\Gamma(f/2)\, 2^{(f-2)/2}} \int_0^R G\!\left(\frac{t x}{\sqrt{f}} - \delta\right) x^{f-1}\, G'(x)\, dx,$$

$$R = \frac{(\delta_L - \delta_U)\sqrt{f}}{t_L - t_U}, \qquad G'(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad G(x) = \int_{-\infty}^{x} G'(t)\, dt,$$

and tL = tα,dfp, tU = −tα,dfp, and f = dfp for a parallel design. Note that when 0 < θU = −θL, P(δL, δU) = P(−δU, −δL). The reproducibility probability increases as the sample size increases and as the ratio of means approaches 1, while it decreases as the variability increases for the same sample size and ratio of means, which shows the impact of variability on the reproducibility probability. Since the true values of δL and δU are unknown, we proceed by replacing δL and δU in (2.4) with their estimates based on the sample from the first study. The estimated reproducibility probability can be obtained as

$$\hat{P}(\hat{\delta}_L, \hat{\delta}_U) = Q_f(t_U, \hat{\delta}_U; 0, \hat{R}) - Q_f(t_L, \hat{\delta}_L; 0, \hat{R}), \quad (2.6)$$

where

$$\hat{\delta}_L = \frac{\bar{Y}_T - \bar{Y}_R - \theta'_L}{\sqrt{\dfrac{s_T^2}{n_T} + \dfrac{s_R^2}{n_R}}}, \qquad \hat{\delta}_U = \frac{\bar{Y}_T - \bar{Y}_R - \theta'_U}{\sqrt{\dfrac{s_T^2}{n_T} + \dfrac{s_R^2}{n_R}}}, \qquad \hat{R} = \frac{(\hat{\delta}_L - \hat{\delta}_U)\sqrt{f}}{t_L - t_U}.$$

A numerical sketch of these computations is given below.
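The following is a minimal numerical sketch of (2.4)-(2.6) under a parallel design, evaluating Owen's Q-function by direct numerical integration and combining domain-level reproducibility probabilities into a weighted totality index as in Step 2; the means, standard deviations, sample sizes, domain probabilities, and weights are all made-up inputs for illustration.

```python
import numpy as np
from math import gamma, sqrt, pi
from scipy.integrate import quad
from scipy.stats import norm, t as t_dist

def owen_q(t, delta, f, R):
    """Owen's Q-function Q_f(t, delta; 0, R), by numerical integration."""
    const = sqrt(2 * pi) / (gamma(f / 2) * 2 ** ((f - 2) / 2))
    integrand = lambda x: (norm.cdf(t * x / sqrt(f) - delta)
                           * x ** (f - 1) * norm.pdf(x))
    value, _ = quad(integrand, 0, R)
    return const * value

def reproducibility_prob(ybar_t, ybar_r, s_t, s_r, n_t, n_r,
                         theta_l, theta_u, alpha=0.05):
    """Estimated reproducibility probability (2.6), parallel design."""
    se = sqrt(s_t**2 / n_t + s_r**2 / n_r)
    f = se**4 / ((s_t**2 / n_t)**2 / (n_t - 1)
                 + (s_r**2 / n_r)**2 / (n_r - 1))    # df_p
    d_l = (ybar_t - ybar_r - theta_l) / se           # delta_L estimate
    d_u = (ybar_t - ybar_r - theta_u) / se           # delta_U estimate
    t_l = t_dist.ppf(1 - alpha, f)                   # t_{alpha, df_p}
    t_u = -t_l
    R = (d_l - d_u) * sqrt(f) / (t_l - t_u)
    return owen_q(t_u, d_u, f, R) - owen_q(t_l, d_l, f, R)

# Hypothetical log-scale inputs; limits are log(0.8) and log(1.25)
p_pk = reproducibility_prob(0.02, 0.0, 0.25, 0.25, 24, 24,
                            np.log(0.8), np.log(1.25))

# Totality biosimilarity index: weighted average across domains
# (domain probabilities and weights below are made up)
p_hat = {"analytical": 0.95, "pk_pd": p_pk, "clinical": 0.90}
w = {"analytical": 0.4, "pk_pd": 0.3, "clinical": 0.3}
p_total = sum(w[k] * p_hat[k] for k in p_hat)
print(round(p_pk, 3), round(p_total, 3))
```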


2.6 Concluding Remarks

For regulatory approval of new drugs, Section 314.126 of 21 CFR states that substantial evidence must be provided to support the claims of new drugs. For regulatory approval of a proposed biosimilar product as compared to an innovative biological product (usually a US-licensed drug product), the FDA requires that totality-of-the-evidence be provided to support a demonstration of biosimilarity between the proposed biosimilar product and the US-licensed drug product. In practice, it should be noted that there is no clear distinction between the substantial evidence in new drug development and the totality-of-the-evidence in biosimilar product development. In addition, it is not clear whether the totality-of-the-evidence provides the same degree of substantial evidence for the assessment of the safety and effectiveness of the test treatment under investigation. Changes in quality attributes or responses in one domain may not translate to other domains in terms of their criticality or risk ranking relevant to clinical outcomes. It is therefore suggested that the proposed index for totality-of-the-evidence be used for a more accurate and reliable assessment of the test treatment under investigation.

3 Hypotheses Testing versus Confidence Interval

3.1 Introduction

For review and approval of regulatory submissions of drug products, a (1 − α) × 100% confidence interval (CI) approach is commonly used for the evaluation of the safety and efficacy of new drugs, while a (1 − 2α) × 100% CI is often considered for the assessment of bioequivalence for generic drugs and of biosimilarity for biosimilar products. If α is chosen to be 5%, this leads to a 95% CI approach for the evaluation of new drugs and a 90% CI approach for the assessment of generics and biosimilars. In the past decade, the FDA has been challenged for adopting different standards (i.e., a 5% type I error rate with respect to the 95% CI for new drugs and 10% with respect to the 90% CI for generics/biosimilars) in the review and approval process of regulatory submissions of drugs and biologics. The issue of using a 95% CI for new drugs and a 90% CI for generics/biosimilars has received much attention lately. This controversy probably stems from the mixed use of the concepts of hypotheses testing and the confidence interval approach for the evaluation of drugs and biologics. The statistical methods recommended by the FDA are a two-sided test (TST) for testing point hypotheses of equality for new drugs and a two one-sided tests (TOST) procedure for testing interval hypotheses (i.e., testing bioequivalence or biosimilarity) for generics and biosimilars. However, even under the hypotheses testing framework, test results are interpreted using a confidence interval approach, even though the corresponding statistical inferences may not be (operationally) equivalent. As a result, many sponsors and/or reviewers are confused about when to use a 90% CI approach versus a 95% CI approach in pharmaceutical/clinical research and development. In practice, there are fundamental differences (i) between the concepts of point hypotheses and interval hypotheses, and (ii) between the concepts of hypotheses testing and the confidence interval approach. For (i), a TST is often performed for testing point hypotheses of equality at the 5% level of significance, while a TOST is a valid testing procedure for testing interval hypotheses of


equivalence or similarity at the 5% level of significance for each one-sided test. For (ii), hypotheses testing focuses on power analysis for achieving a desired power (i.e., controlling the type II error), while the confidence interval approach focuses on precision analysis (i.e., controlling the type I error through the maximum error allowed); the sketch below illustrates this contrast. Confusion between the use of a 90% CI and a 95% CI for the evaluation of drugs and biologics inevitably occurs if we use the confidence interval approach to interpret results obtained under the hypotheses testing framework. The purpose of this chapter is to clarify the confusion between (i) the use of hypotheses testing (both point hypotheses and interval hypotheses) and the use of the confidence interval approach, and (ii) the use of a 90% CI versus a 95% CI for the evaluation of drugs and biologics. In the next couple of sections, the method of hypotheses testing and the confidence interval approach for the evaluation of the safety and effectiveness of a test treatment under investigation are briefly described, respectively. Section 3.4 explores the relationship between a TOST procedure and its corresponding confidence interval approach for the assessment of the safety and effectiveness of the test treatment under investigation. A comparison between the method of hypotheses testing and the confidence interval approach is given in Section 3.5. Sample size requirements based on hypotheses testing and the confidence interval approach are discussed in Section 3.6. Some concluding remarks are given in Section 3.7.
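The contrast between power analysis and precision analysis can be made concrete with the standard two-sample normal approximations; in the sketch below the standard deviation, the difference delta, and the error rates are made-up inputs for illustration.

```python
from scipy.stats import norm

sigma, alpha, beta = 1.0, 0.05, 0.2   # beta = 0.2 gives 80% power
delta = 0.5                           # clinically meaningful difference

z_a = norm.ppf(1 - alpha / 2)
z_b = norm.ppf(1 - beta)

# Power analysis (hypotheses testing): per-arm n to detect delta with 80% power
n_power = 2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2

# Precision analysis (CI approach): per-arm n so that the CI half-width
# for the treatment difference is at most delta
n_precision = 2 * z_a ** 2 * sigma ** 2 / delta ** 2

print(round(n_power), round(n_precision))   # about 63 vs. 31 per arm
```

As the output suggests, the two analyses can lead to very different sample sizes for the same design inputs, which is the point elaborated in Section 3.6.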

3.2 Hypotheses Testing

In pharmaceutical/clinical development, hypotheses testing and the CI approach are often used interchangeably for the evaluation of the safety and efficacy of a test treatment under investigation. It should be noted, however, that the method of hypotheses testing is not generally equivalent to the confidence interval approach. In the next couple of sections, we will explore the distinction between the two approaches, although in many cases they are operationally equivalent under certain conditions. For the evaluation of the safety and efficacy of a new drug or a test treatment under investigation, a typical approach is to test the hypotheses of equality, i.e., the null hypothesis (H0) of equality versus an alternative hypothesis (Ha) of inequality. We would then reject the null hypothesis of equality (i.e., there is no treatment effect or efficacy) in favor of the alternative hypothesis of inequality (i.e., there is a treatment effect or efficacy). In practice, we often select an appropriate sample size for achieving a desired power (i.e., the probability of correctly concluding efficacy when the test treatment is efficacious) at a pre-specified level of significance (to rule out the possibility that the observed efficacy is purely by chance).


For the assessment of generics or biosimilar products, on the other hand, the goal is to demonstrate that a proposed generic or biosimilar (test) product is bioequivalent or highly similar to an innovative or brand-name (reference) product. In this case, a test for interval hypotheses of equivalence (for generic drugs) or similarity (for biosimilar products), rather than a test for point hypotheses of equality (for new drugs), is often employed. In practice, however, we consider the two drug products equivalent or highly similar if their difference falls within a pre-specified equivalence or similarity range. In other words, we often start with hypotheses testing but draw the conclusion based on the confidence interval approach. In practice, there is a distinction between equality and equivalence (similarity). In what follows, the concepts of testing point hypotheses and interval hypotheses are briefly described, respectively.

3.2.1 Point Hypotheses Testing

Let µT and µR be the population means of the test product and the reference product, respectively. The following point hypotheses are often considered for testing equality between the means of a test (T) product (e.g., a new drug) and a reference (R) product (e.g., a placebo control or an active control):

$$H_0: \mu_T = \mu_R \quad \text{vs.} \quad H_a: \mu_T \ne \mu_R. \quad (3.1)$$

For testing the equality hypotheses, a two-sided test (TST) at the α = 5% level of significance is often performed. Rejection of the null hypothesis of equality leads to the conclusion that there is a statistically significant difference between µT and µR. In the clinical evaluation of a new drug, an appropriate sample size is often selected for achieving a desired power for detecting a clinically meaningful difference (or treatment effect), if such a difference truly exists, at a pre-specified level of significance. It can be verified that the TST at the α level of significance is equivalent to the (1 − α) × 100% CI approach for evaluation of the treatment effect under investigation, as the sketch below illustrates. Thus, in practice, point hypotheses testing for equality is often mixed up with the CI approach for the assessment of the treatment effect under investigation. It should be noted that sample size calculations for point hypotheses testing and the CI approach are different. Sample size calculation for point hypotheses testing is typically performed based on power analysis (which focuses on the type II error), while sample size calculation for the CI approach is performed based on precision analysis (which focuses on the type I error). Thus, the required sample sizes for point hypotheses testing and the CI approach could be very different.
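The following is a minimal numerical check of the TST/CI equivalence for a single sample against a reference value; the data are simulated and the reference value of 0 is an assumption for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.4, 1.0, 30)    # hypothetical responses vs. reference value 0

t_stat, p_two_sided = stats.ttest_1samp(x, popmean=0.0)
lo, hi = stats.t.interval(0.95, df=len(x) - 1,
                          loc=x.mean(), scale=stats.sem(x))

# The TST at alpha = 0.05 rejects exactly when the 95% CI excludes 0
print(p_two_sided < 0.05, not (lo <= 0.0 <= hi))   # the two agree
```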


3.2.2 Interval Hypotheses Testing

On the other hand, the following interval hypotheses are usually used for testing bioequivalence (for generic drugs) or biosimilarity (for biosimilar products) between the test product and the reference product:

H0: bioinequivalence or dissimilarity vs. Ha: bioequivalence or similarity. (3.2)

Thus, we would reject the null hypothesis of bioinequivalence or dissimilarity in favor of the alternative hypothesis of bioequivalence or similarity. Interval hypotheses (3.2) are usually written as follows:

$$H_0: \mu_T - \mu_R \le -\delta \ \ \text{or}\ \ \mu_T - \mu_R \ge \delta \quad \text{vs.} \quad H_a: -\delta < \mu_T - \mu_R < \delta, \quad (3.3)$$

where δ is the so-called bioequivalence limit or similarity margin. Interval hypotheses (3.3) can be re-written as the following two one-sided hypotheses:

$$H_{01}: \mu_T - \mu_R \le -\delta \quad \text{vs.} \quad H_{a1}: -\delta < \mu_T - \mu_R;$$
$$H_{02}: \mu_T - \mu_R \ge \delta \quad \text{vs.} \quad H_{a2}: \mu_T - \mu_R < \delta. \quad (3.4)$$

For testing interval hypotheses (3.4), a two one-sided tests (TOST) procedure is recommended for testing bioequivalence for generic drugs or biosimilarity for biosimilar products (Schuirmann, 1987; FDA, 2003b). For interval hypotheses (3.4), the idea is to test one side to determine whether the test product is not inferior to the reference product at the α level of significance and, once non-inferiority has been established, to test the other side to determine whether the test product is not superior to the reference product at the α level of significance. Note that the FDA recommends that Schuirmann's TOST procedure be used for testing the above interval hypotheses for bioequivalence or biosimilarity (FDA, 1992, 2003b). In its guidance on bioequivalence assessment, the FDA suggested that a log-transformation of the data be performed before analysis and recommended the bioequivalence limits of (80%, 125%) for PK responses (FDA, 2003b). Thus, interval hypotheses (3.3) can be rewritten as

$$H_0: \mu_T/\mu_R \le 0.8 \ \ \text{or}\ \ \mu_T/\mu_R \ge 1.25 \quad \text{vs.} \quad H_a: 0.8 < \mu_T/\mu_R < 1.25, \quad (3.5)$$

where 0.8 (80%) and 1.25 (125%) are the lower and upper equivalence or similarity limits. Under hypotheses (3.5), the TOST procedure tests one side (say, non-inferiority) at the α = 5% level of significance and then tests the other side (say, non-superiority) at the α = 5% level of significance once non-inferiority has been established; a minimal sketch of the procedure is given below.
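The following sketch implements the TOST procedure on log-transformed data using pooled-variance one-sided t-tests; the simulated PK data, the pooled-variance choice, and the seed are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def tost_log(test, ref, alpha=0.05, limits=(0.8, 1.25)):
    """Schuirmann's two one-sided tests on log-transformed data."""
    lt, lr = np.log(test), np.log(ref)
    n1, n2 = len(lt), len(lr)
    diff = lt.mean() - lr.mean()
    sp2 = ((n1 - 1) * lt.var(ddof=1) + (n2 - 1) * lr.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    t_low = (diff - np.log(limits[0])) / se   # H01: ratio <= 0.8
    t_upp = (diff - np.log(limits[1])) / se   # H02: ratio >= 1.25
    p_low = stats.t.sf(t_low, df)             # reject H01 for large t_low
    p_upp = stats.t.cdf(t_upp, df)            # reject H02 for small t_upp
    # Both 5% one-sided tests pass exactly when the 90% CI for the
    # log-ratio lies within (log 0.8, log 1.25)
    t_crit = stats.t.ppf(1 - alpha, df)
    ci = (diff - t_crit * se, diff + t_crit * se)
    return max(p_low, p_upp) < alpha, ci

rng = np.random.default_rng(3)
test = rng.lognormal(mean=0.02, sigma=0.2, size=24)   # hypothetical PK data
ref = rng.lognormal(mean=0.00, sigma=0.2, size=24)
print(tost_log(test, ref))
```

Note that the returned 90% CI makes the operational link to the (1 − 2α) × 100% CI approach explicit: each one-sided test is performed at the 5% level, yet the corresponding interval is a 90% CI.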


Chow and Shao (2002b) indicated that the two one-sided tests procedure is a size-α test. This indicates that the FDA uses the same standard (i.e., the α level of significance) for the evaluation of new drug products based on the TST under hypotheses (3.1) and for the assessment of generic and biosimilar drug products based on the TOST under hypotheses (3.3) or (3.4).

3.2.3 Probability of Inconclusiveness

One thing regarding point hypotheses testing that is worth mentioning is the probability of inconclusiveness. In practice, we usually reject the null hypothesis at α = 5%. Some investigators, however, prefer α = 1% and consider p-values between 1% and 5% as inconclusive. If we let z95% and z99% be the critical values corresponding to α = 5% and α = 1%, respectively, the area under the probability density between z95% and z99% is referred to as the probability of inconclusiveness. The concept of interval hypotheses is very different from that of point hypotheses. Thus, interval hypotheses testing is generally not equivalent to the (1 − α) × 100% CI approach for evaluation of the treatment effect under investigation. Interval hypotheses testing, to some extent, overcomes the issue of inconclusiveness. For example, if we consider one side of interval hypotheses (3.4) for testing non-inferiority, testing this one side is actually testing a point hypothesis of inferiority; rejection of the null hypothesis is in favor of non-inferiority. The concept of non-inferiority incorporates the concepts of equivalence and superiority. Thus, testing for non-inferiority does not imply testing for equivalence or similarity, and for testing equivalence or similarity the TOST is recommended.

3.3 Confidence Interval Approach

3.3.1 Confidence Interval Approach with Single Reference

For the clinical investigation of new drugs, a typical approach is to test the following point hypotheses for equality:

$$H_0: \mu_T/\mu_R = 1 \quad \text{vs.} \quad H_a: \mu_T/\mu_R \ne 1. \quad (3.6)$$

We then reject the null hypothesis of no treatment difference and conclude that there is a statistically significant treatment difference. The statistical difference is then evaluated to determine whether it is also a clinically meaningful difference. In practice, a power calculation is usually performed for the determination of a sample size achieving a desired power (e.g., 80%) for


detecting a clinically meaningful difference (or treatment effect) under the alternative hypothesis that such a difference truly exists. For point hypotheses testing for equality, a two-sided test (TST) at the α level of significance is usually performed. The TST at the α level of significance is equivalent to the (1 − α) × 100% confidence interval. In practice, the (1 − α) × 100% confidence interval approach is often used to assess the treatment effect instead of hypotheses testing for equality.

3.3.2 Confidence Interval Approach with Multiple References

3.3.2.1 Pairwise Comparisons

In comparative clinical trials, it is not uncommon to have multiple controls (or references). For example, for the assessment of biosimilarity between a proposed biosimilar product (test product) and an innovative biological product (reference product), there may be multiple references, e.g., a US-licensed reference product and an EU-approved version of the same reference product. In this case, the method of pairwise comparisons is often applied. When two reference products (e.g., a US-licensed reference and an EU-approved reference) are considered, the method of pairwise comparisons includes three comparisons (i.e., the proposed biosimilar product versus the US-licensed reference product, the proposed biosimilar product versus the EU-approved reference product, and the US-licensed reference product versus the EU-approved reference product). The method of pairwise comparisons sounds reasonable. However, at the Oncologic Drugs Advisory Committee (ODAC) meeting on July 13, 2017 for the review of biosimilar products of Avastin and Herceptin, the method of pairwise comparisons was criticized by the ODAC panel. The first criticism concerned the lack of accuracy and reliability of each pairwise comparison, since each comparison does not fully utilize all data collected from the three groups. In addition, since the equivalence criterion for analytical similarity is based on the variability of the reference product, the pairwise comparisons method uses different equivalence criteria in the three comparisons, which may lead to inconsistent conclusions regarding the assessment of biosimilarity.

3.3.2.2 Simultaneous Confidence Interval

Alternatively, the ODAC suggested the potential use of a simultaneous confidence interval approach, which has the advantages of utilizing all data collected from the study and using a consistent equivalence criterion. As a result, Zheng et al. (2019) proposed three different types (namely, the original version, integrated version, and least favorable version) of simultaneous confidence intervals based on fiducial inference theory under a parallel-group design (see also Chow, 2018).


Zheng et al. (2019) conducted several simulations for evaluation of the performance of the proposed simultaneous confidence interval approach as compared to the method of pairwise comparisons. The simulation results indicated that the current pairwise comparison method lacks accuracy and reliability in each pairwise comparison, since each comparison does not fully utilize all data collected from the three groups, and that it suffers from the inconsistent use of different equivalence criteria in the three comparisons. The simulation results also showed that the methods using the original version and the integrated version of the simultaneous confidence interval have significantly larger power than the pairwise comparisons method while still controlling the type I error rate well. While the method using the least favorable version of the simultaneous confidence interval demonstrated the smallest power among the four methods, it was better able to control the type I error rate; thus it is a conservative approach, preferred for avoiding false positive conclusions. To provide a better understanding and to illustrate the inappropriateness of pairwise comparisons, Zheng et al. (2019) provided two examples: one regarding the case where pairwise comparisons fail to conclude similarity when similarity truly exists (i.e., a false negative), and the other concerning the case where pairwise comparisons wrongly conclude similarity when similarity does not hold (i.e., a false positive). These two examples are briefly described below.

3.3.2.3 Example 1 (False Negative)

Suppose we have two reference products, a US reference and an EU reference, denoted by US and EU, and one test product, denoted by T. Assume that US, EU, and T follow normal distributions and share equal variance. The true means of the three products were set to be 99, 101, and 100, and the true standard deviations, assumed equal across the three products, were set to 6. Three groups of samples of size 10 were randomly generated from the US, EU, and T populations, respectively. Another two groups of samples of size 10 were randomly taken from the US and EU populations to obtain the "true" standard deviations. The type I error allowed was set to be 0.1. Three pairwise comparisons, US versus EU, US versus T, and EU versus T, were analyzed using the FDA-recommended approach, with US, US, and EU as the respective references. The data are displayed in Table 3.1 and the corresponding scatter plot is given in Figure 3.1. Under this setting, an effective test should be able to reject the null hypothesis and conclude similarity. However, from Table 3.2, the pairwise comparisons approach failed to reject one of the null hypotheses that the two reference drugs are not similar enough (EU vs. US, 90% CI: 0.42-5.33, which exceeds the equivalence acceptance criterion (EAC) margin of 5.01). Meanwhile, two out of the three simultaneous confidence interval methods had fiducial probabilities higher than 0.9 (0.92 for both the original version and the integrated version), and the corresponding two versions of confidence intervals lie within

Innovative Statistics in Regulatory Science

72

TABLE 3.1 Random Samples Generated from the Three Population (Example 1) Lot Group

1

2

US EU T US (ref)a EU (ref)a

102.13 102.93 96.70 104.38 101.39

102.07 95.29 101.63 99.39 104.90

a

3

4

92.69 92.09 100.21 105.77 110.70 89.96 102.71 103.81 98.09 98.32

5

6

96.99 100.87 98.25 95.35 101.82

101.83 100.72 105.39 102.41 107.23

7

8

9

10

95.15 102.72 95.02 103.45 98.33 108.55 97.74 102.46 101.13 103.91 90.92 102.99 97.56 101.56 95.10 99.96 83.62 100.30 106.98 98.52

Samples randomly taken from the US and EU population to obtain the “true” standard deviations.

FIGURE 3.1 Scatter plot of the random samples generated for each group (Example 1).

the simultaneous margin, thus could successfully reject all three hypotheses, i.e., conclude similarity among US, EU, and T. However, the least favorable version failed to conclude the similarity (fiducial probability = 0.79). This example illustrates the case that pairwise method failed to conclude similarity when true similarity holds (i.e., false negative towards the hypothesis tests), and compared to this, the new proposed simultaneous interval approach was able to reject the null hypothesis and thus more powerful in this case. 3.3.2.4 Example 2 (False Positive) Suppose the true means of the three products was set to be 95, 105, 100, and the true standard deviations were assumed to be equal among the three products and were set to be 6. Similarly, three groups of samples of size 10 were

a

2.87 1.74 −1.13

Mean Difference

Similarity margin = 1.5*σ R.

EU vs. US T vs. US T vs. EU

Comparison

(0.42, 5.33) (−0.72, 4.20) (−6.08, 3.82)

90% CI 5.01 5.01 10.09

EAC Margina

Pairwise Comparisons Approach

Fail Pass Pass

Equivalence Test Original Integrated Least Favorable

Method 0.92 0.92 0.79

Fiducial Probability

Type II 90% CI

(−4.51, 4.51) (−4.79, 4.79) (−4.51, 4.51) (−4.84, 4.84) NA (−4.15, 4.15)

Type I 90% CI

Pass Pass Fail

Simultaneous Similarity

Simultaneous Confidence Interval Approach

The Results of Pairwise Comparisons Method vs. Simultaneous Confidence Interval Approach (Example 1)

TABLE 3.2

Hypotheses Testing versus Confidence Interval 73

Innovative Statistics in Regulatory Science

74

randomly generated from US, EU, and T population, respectively. Another two groups of samples with size 10 were randomly taken from the US and EU population to obtain the ‘true’ standard deviations. The type I error allowed was set to be 0.1. Three pairwise comparisons, US versus EU, US versus T, EU versus T, were analyzed using the FDA recommended approach, with US, US, and EU as the references, respectively. The data were displayed in Table 3.3 and corresponding scatter plot was showed in Figure 3.2. Under this setting, an effective test should be able to accept the null hypothesis and not conclude the similarity. However, Table 3.4 indicates that follow the pairwise comparisons approach, the data passes all the three equivalence tests and incorrectly concluded the similarity between three drug products (all the three 90% CIs lie within the EAC margins). Considering the simultaneous confidence interval approach, although the original and integrated versions of simultaneous confidence interval approaches also incorrectly TABLE 3.3 Random Samples Generated from the Three Population (Example 2) Lot Group

1

US 96.41 EU 94.13 T 100.46 US (ref)a 89.81 EU (ref)a 94.19 a

2

3

4

5

6

7

8

101.81 119.26 95.53 105.22 100.15

100.58 106.72 107.85 95.79 104.23

90.98 99.86 106.19 93.76 116.74

88.06 101.54 112.90 95.01 103.17

108.19 101.33 95.75 96.70 109.69

95.49 105.83 99.85 104.06 106.76

105.62 112.27 101.27 98.06 118.46

9

10

99.98 100.66 94.25 93.75 97.50 97.32 90.52 88.88 104.73 106.25

Samples randomly taken from the US and EU population to obtain the “true” standard deviations.

FIGURE 3.2 Scatter plot of the random samples generated for each group (Example 2).

a

4.12 2.68 −1.43

Mean Difference

Similarity margin = 1.5*σ R.

EU vs. US T vs. US T vs. EU

Comparison

(0.02, 8.21) (−1.41, 6.78) (−6.74, 3.88)

90% CI 8.36 8.36 10.83

EAC Margina

Pairwise Comparisons Approach

Pass Pass Pass

Equivalence Test Original Integrated Least Favorable

Method 0.94 0.95 0.84

Fiducial Probability

(−6.48, 6.48) (−6.38, 6.38) NA

Type I 90% CI

(−7.63, 7.63) (−7.72, 7.72) (−6.60, 6.60)

Type II 90% CI

Pass Pass Fail

Simultaneous Similarity

Simultaneous Confidence Interval Approach

The Results of Pairwise Comparisons Method vs. Simultaneous Confidence Interval Approach (Example 2)

TABLE 3.4

Hypotheses Testing versus Confidence Interval 75

76

Innovative Statistics in Regulatory Science

concluded the similarity (had fiducial probabilities calculated higher than 0.9:0.94 for original version and 0.95 for integrated version, and the corresponding two versions of confidence intervals lie within the simultaneous margin), the least favorable version successfully detected the difference and did not  conclude the similarity (fiducial probability  =  0.79). This  example illustrates the case that pairwise method incorrectly conclude similarity when significant difference between the three groups truly exists (i.e., false positive towards the hypothesis tests), and compared to this, the new proposed least favorable version of simultaneous interval approach was more conservative and avoided the type I error in this case. Further discussion of the new methods’ performance under different parameter settings can be found in the simulation studies of the following sections.

3.4 Two One-Sided Tests versus Confidence Interval Approach For testing equivalence (either bioequivalence, biosimilarity, or therapeutic equivalence), the two one-sided tests procedure and the confidence interval approach are often considered (Schuirmann, 1987; Chow and Liu, 2003). However, some confusion arises. For example, what is the difference between the two approaches, given the fact that in some cases the two approaches produce the same test? For a confidence interval approach, current practice considers 1 − α for establishing therapeutic equivalence and 1 − 2α for demonstration of bioequivalence. Thus, should we use level 1 − α or 1 − 2α when applying the confidence interval approach for establishing equivalence? When different confidence intervals are available, which confidence interval should be used? These questions have an impact on sample size calculation for establishing equivalence in clinical trials. 3.4.1 Two One-Sided Tests (TOST) Procedure Chow and Shao (2002b) indicated that the approach of using 1  −  α confidence intervals produces level α-tests, but the sizes of these tests may be smaller than α, and that the use of 1 − 2α confidence intervals generally does not ensure that the corresponding test be of level α, although there are exceptional cases. In  this section, we will re-visit these questions. Let µT and µS denote, respectively, the mean responses of a study primary endpoint for a test drug and a standard therapy (or active control agent), and let δ > 0 be the magnitude of difference of clinical importance. If the concern is whether the test drug is non-inferior to the standard therapy (or active control agent), then the following hypotheses are tested: H 0 : µT − µS ≤ −δ versus H a : µT − µS > −δ .

Hypotheses Testing versus Confidence Interval

77

The hypotheses related to whether the test drug is superior to the standard therapy (or active control agent) are H 0 : µT − µS ≤ δ versus H a : µT − µS > δ . (see, for example, Hwang and Morikawa, 1999). If it is of interest to show whether the test drug and the standard therapy (or active control agent) are therapeutically equivalent, then we consider the following hypotheses: H 0 : µT − µS ≥ δ versus H a : µT − µS < δ .

(3.7)

Note that non-inferiority, superiority, or therapeutic equivalence is concluded if the null hypothesis H0 is rejected at a given significance level α. In this section we focus on the two-sided hypotheses given in (3.7), which is also of interest in assessing bioequivalence between two drug products. There are two commonly employed statistical tests for hypotheses in (3.7). One is the TOST procedure and the other is the CI approach. Berger and Hsu (1996) studied statistical properties of the tests based on the TOST and CI approaches. However, the fact that the TOST approach with level α is operationally the same as the CI approach with a particular 1  −  2α confidence interval (e.g., Blair and Cole, 2002; Chow and Liu, 2008) has caused some confusion in the pharmaceutical industry. For example, should we use level 1 − α or 1 − 2α when applying the CI approach? If the TOST and CI approaches are operationally the same, then why are they considered to be different approaches? Furthermore, there are several tests corresponding to different confidence intervals, which one should be recommended? Chow and Shao (2002b) clarified this confusion by comparing the two onesided tests (TOST) approach with the confidence interval (CI) approaches as follows. The TOST approach is based on the fact that the null hypothesis H0 in (3.7) is the union of the following two one-sided hypotheses: H 01: µT − µS ≥ δ and H 02 : µT − µS ≤ − δ .

(3.8)

Hence, we reject the null hypothesis H0 when both H01 and H02 are rejected. For example, when observed responses are normally distributed with a constant variance, the TOST procedure rejects H0 in (3.7) if and only if ( yT − yS + δ )/ se > tα and ( yT − yS − δ )/ se < −tα ,

(3.9)

where yT and yS are the sample means of the test product and the standard therapy (or active control agent), respectively, se is an estimated standard deviation and tα is the upper αth percentile of a central t-distribution with

78

Innovative Statistics in Regulatory Science

appropriate degrees of freedom (e.g., Berger and Hsu, 1996; Blair and Cole, 2002). Note that each of the two statements in (3.9) defines the rejection of a level α test for one of the null hypotheses in (3.8).

3.4.2 Confidence Interval Approach For the confidence interval approach, the following confidence intervals are commonly considered: CIW =  yT − yS − tα1 se , yT − yS + tα2 se  CI L =  −|yT − yS |−tα se , |yT − yS |+ + tα se  CIE = min ( 0, yT − yS − tα se ) , max ( 0, yT − yS + tα se )  Here, CIW is the so-called Westlaske symmetric confidence interval with appropriate choices of α1 > 0, α2 > 0, α1 + α2 = α (Westlake, 1976), CIL is derived from the confidence interval of |µT − µS| and CIE is the expanded confidence interval derived by Hsu (1984) and Bofinger (1985, 1992). If we define CI 2α =  yT − yS − tα se , yT − yS + tα se  , then the result in (3.9) is equivalent to the result that CI2α falls within the interval (−δ, δ). Therefore, the TOST approach is operationally the same as the CI approach with the confidence interval CI2α.

3.4.2.1 Level 1 − α versus Level 1 − 2α Note that CIW , CI L, and CIE are 1 − α confidence intervals, whereas CI2α is a 1 − 2α confidence interval. Thus, there is confusion whether level 1 − α or level 1 − 2α should be used when applying the CI approach. This directly affects statistical analysis, starting from sample size calculation. When applying the CI approach to the problem of assessing average bioequivalence, Berger and Hsu (1996) indicated that the misconception that size-α bioequivalence tests generally correspond to (1 − 2α) × 100% confidence sets will be shown to lead to incorrect statistical practices, and should be abandoned. This is because the use of a 1 − α confidence interval guarantees that the corresponding test is of level α, whereas the use of a 1 − 2α confidence interval can only ensure that the corresponding test is of level 2α. However, a further problem arises. Why does the use of 1 − 2α confidence interval CI 2α produce a level α test? To clarify this, we need to first understand the difference between the significance level and the size of a statistical test.

Hypotheses Testing versus Confidence Interval

79

3.4.2.2 Significance Level versus Size Let T be a test procedure. The size of T is defined to be

αT =

sup P(T rejects H 0 ). P under H 0

On the other hand, any α1 ≥ αT is called a significance level of T. Thus, a test of level α is also of level α2 for any α2 > α1 and it is possible that the test is of level α0, which is smaller than α1. The size of a test is its smallest possible level and any number larger than the size (and smaller than 1) is a significance level of the test. Thus, if α (e.g., 1% or 5%) is a desired level of significance and T is a given test procedure, then we must first ensure that αT ≤ α (i.e., T is of level α) and then try to have αT = α; otherwise T is too conservative. Chow and Shao (2002b) discussed the difference between the TOST approach and the CI approach, although they may be operationally the same. It  also reveals a disadvantage of using the CI approach. That is, the test obtained using a 1  −  α confidence interval may be too conservative in the sense that the size of the test may be much smaller than α. An immediate question is: What are the sizes of the tests obtained by using level 1 − α confidence intervals CIW, CIL, and CIE? 3.4.2.3 Sizes of Tests Related to Different Confidence Intervals If can be verified that CIW ⊃ CIL ⊃ CIE ⊃ CI2α. If the tests obtained by using these confidence intervals are TW, TL, TE, and T2α, respectively, then their sizes satisfy αTw ≤ αTL ≤ αTE ≤ αT2α. In the previous section, we conclude that the size of T2α is α. Berger and Hsu (1996) showed that the size of TE is α, although CIE is always wider than CI2α. The size of TL is also α, although CIL is even wider than CIE. This is because

α TL = sup P(|yT − yS |+ tα se < δ ) |µT − µS|≥δ

 = max  sup P(|yT − yS |+ tα se < δ ,  µT − µS =δ

 sup P(|yT − yS |+ tα se < δ  µS − µT =δ 

= α. However, the size of TW is max(α1,α2), which is usually smaller than α since α1  +  α2  =  α. Thus, TW (or CIW) is not  recommended if the desired size is α. It should be noted that the size of a test is not the only measure of its successfulness. The TOST procedure is of size α yet biased because the probability of rejection of the null hypothesis when it is false (power) may be lower than α. Similarly, tests TE and TL are biased. Berger and Hsu (1996) proposed a nearly unbiased and uniformly more powerful test than the TOST

80

Innovative Statistics in Regulatory Science

procedure. Brown et al. (1997) provided an improved test that is unbiased and uniformly more powerful than the TOST procedure. These improved tests are, however, more complicated than the TOST procedure. To avoid confusion, it is strongly suggested that the approaches of confidence interval and two one-sided tests should be applied separately, although sometimes they are operationally equivalent.

3.4.3 Remarks Chow and Liu (1992a) indicated that Schuirmann’s two one-sided tests procedure is operationally (algebraically) equivalent to confidence interval approach in many cases under certain assumptions. In  other words, we claim bioequivalence or biosimilarity if the constructed ( 1 − 2α ) × 100% confidence interval falls completely within the bioequivalence or biosimilarity limits. As a result, interval hypotheses testing (i.e., two one-sided tests procedure) for bioequivalence or biosimilarity has been mixed up with the use of ( 1 − 2α ) × 100% confidence interval approach since then. To provide a better understanding, Table  3.5 summarizes the fundamental differences between hypotheses testing for equality (new drugs) and hypotheses testing for equivalence (generic drugs and biosimilar products). As it can be seen from Table 3.5 that TST and TOST are official test procedures recommended by the FDA for evaluation of new drugs and generic/ biosimilar drugs, respectively. Both TST and TOST are size-α test procedures. In  other words, the overall type I error rates are well controlled. In  practice, it is suggested that the concepts of hypotheses testing (based on type II error) and confidence interval approach (based on type I error) should not be mixed up to avoid possible confusion in evaluation of new drugs and generic/biosimilar products.

TABLE 3.5 Comparison of Statistical Methods for Assessment of Generic/Biosimilar Drugs and New Drugs Characteristics

Generic/Biosimilar Drugs

Hypotheses testing FDA recommended approach Control of α Confidence interval approach

Interval hypotheses TOST Yes Operationally equivalent (1 − 2α ) × 100%CI

Point hypotheses TST Yes Equivalent (1 − α ) × 100%CI

90% CI if α = 5% Based on TOST

95% CI if α = 5% Based on TST

Sample size requirement

Note: TOST = two one-sided tests; TST = two-sided test.

New Drugs

Hypotheses Testing versus Confidence Interval

81

3.5 A Comparison In  this section, we will compare the method of hypotheses testing (i.e., TOST) and the confidence interval approaches (i.e., 90% CI and 95% CI) for assessment of bioequivalence for generic drugs and biosimilarity for biosimilar drug products. The comparison will be made under the framework of hypotheses testing in terms of the probability of correctly concluding bioequivalence or biosimilarity. 3.5.1 Performance Characteristics Under interval hypotheses (3.7), we would reject the null hypothesis of bioinequivalence (or dis-similarity) and conclude bioequivalence (biosimilarity) at the 5% level of significance. The probability of correctly concluding bioequivalence or biosimilarity, when in fact the test product is bioequivalent or biosimilar to the reference product, is the power of the TOST. In practice, power of TOST can be obtained by calculating the probability of rejecting H 0 or accepting H a under H a (i.e., assuming H a is true, which is given below. power = {accept H a under H a of hyotheses ( 3.7 ) or ( 3.8 )} . Since TOST is operationally equivalent to the 90% confidence interval in many cases, we will consider the 90% CI as an alternative method for assessment of bioequivalence for generic drug products and biosimilarity for biosimilar drug products. We would conclude bioequivalence or biosimilarity if the constructed 90% CI falls within (80%, 125%) entirely. Thus, the probability of concluding bioequivalence for generic drug products or biosimilarity for biosimilar drug products is given below

{

}

p90%CI = P 90% CI ⊂ ( 0.8, 1.25 ) . The performance of the 95% CI approach for assessment of bioequivalence for generic drug products and biosimilarity for biosimilar drug products can be similarly evaluated. For the 95% CI approach, if the constructed 95% CI falls within (80%, 125%) entirely. Thus, the probability of concluding bioequivalence for generic drug products or biosimilarity for biosimilar drug products is given below

{

}

p95% CI = P 95% CI ⊂ ( 0.8, 1.25 ) .  the point estimate of log ( µT /µR ), se an estimate of the standard Denote F  , and tα the upper α th percentile of a central t-distribution with deviation of F

Innovative Statistics in Regulatory Science

82

appropriate degrees of freedom. The  following ( 1 − α ) × 100% confidence intervals are commonly considered in the CI approach:  − tα se , F  + tα se  ; CIW =  F 2 1    −tα se , F  + tα se  ; CI L =  − F  

(

)

(

)

 − tα se , max 0, F  + tα se  . CIE = min 0, F 2 1   Here CIW is the so-called Westlake symmetric confidence interval with appropriate choices of α 1 > 0, α 2 > 0, α 1 + α 2 = α (Westlake, 1976), CI L is derived from the confidence interval of log µµTR (Liu, 1990), and CIE is an expended confidence interval derived by Hsu (1984) and Bofinger (1985, 1992). Denote the classic ( 1 − 2α ) × 100% confidence interval as

( )

 − tα se , F  + tα se  , CI 2α =  F   the results of the TOST approach at the α level of significance is equivalent to the result that CI 2α falls within the interval (−δ , δ ), where δ = log ( 1.25 ) for the data after log-transformation. In addition, it can be derived that CI 2α ⊂ ( −δ , δ ) implies CI L ⊂ (−δ , δ ) and CIE ⊂ (−δ , δ ). Thus, the (1 − α ) × 100% CI approaches (CI L and CIE ), the classical (1 − 2α ) × 100% CI approach (CI 2α ), and the TOST approach at the α level of significance are operationally equivalent for testing hypotheses of (3.8). With CI 2α ⊂ CIW , we have the following power relations between the TOST approach and the four CI approaches with the confidence levels of 90% and 95%: p90%CIL = p90%CIE > p90%CIW > p90%CI2α = pTOST = p95%CIL = p95%CIE > p95%CIW > p95%CI2α

(3.10)

3.5.2 Simulation Studies To compare the power (or type I error) of those approaches, we also conducted simulation studies to evaluate small sample performances. We consider the classic 2 × 2 crossover design without carryover effects (a brief introduction of the statistical model for this design is given in Appendix 1). A variety of scenarios were considered with parameter specifications in Table 3.6. The  power (when the alternative hypothesis holds) and the type I error (when the null hypothesis holds) were compared between the nine approaches: 90%CI L, 90%CIE , 90%CIW , 90%CI 2α, 95%CI L, 95%CIE , 95%CIW ,

Hypotheses Testing versus Confidence Interval

83

TABLE 3.6 Parameter Specification for Simulation Studies

µT µR σT σR 2 BT 2 TT

σ σ ρ n1 n2

Note:

=

σ σ

2 BR 2 TR

80

85

95

100

110

120

125

100

100

100

100

100

100

100

10

20

30

10

20

30

0.75

0.25

0

0.3

12

24

12

24

0.6

σT

and σ R are the total standard deviations of drug T and drug R before log2 2 2 2 transformation, respectively. The meanings of σ BT , σ TT , σ BR , σ TR and ρ can be found in the Appendix 1. n1 and n2 are numbers of subjects with the sequences of RT and TR, respectively. Without loss of generality, the fixed effects of periods and sequences in all bioequivalence trials are set to be 0 for all scenarios. There are 7 combinations of ( µT , µ R ), 2 2 3 combinations of (σ T , σ R ), 2 combinations of σ BT , σ BR , 3 combinations of ρ , and 2

(

2 σ TT

2 σ TR

)

combinations of (n1 , n2 ), resulting in 7 × 3 ×2× 3 ×2 scenarios in total.

95%CI 2α, and the FDA recommended TOST approach. Note that the 90%CI 2α , the TOST approach, 95%CI L, and 95%CIE are operationally equivalent. So are 90%CI L and 90%CIE . For illustration, we only present the results for the scenarios of ( n1 , n2 ) = ( 12, 12 ) and the last four combination of ( µT , µR ) only. The results for other scenarios are similar and we do not present them here for simplicity. The type I error rates are presented in Table 3.7. We see the type I error rates of 90%CI L, 90%CIE , and 90%CIW are highly similar and range from 6% to 9% with all values greater than the 5% level of significance, while the rates for all the other methods are well controlled at the level of 5%. Among them, the rates of 90%CI 2α, 95%CI L, 95%CIE , 95%CIW , and the TOST approach are highly similar and range appropriately from 2.7% to 4.5%, while 95%CI 2α is too conservative with the type one errors no greater than 2.3%. The power for those approaches is presented in Table 3.8. The results are consistent with the relationships in (3.10). The  power of three approaches, 90%CI L, 90%CIE , and 90%CIW , are very similar and higher than the other approaches. For 90%CI 2α, 95%CI L, 95%CIE , 95%CIW , and the TOST approach, their power are very similar and the power of 95%CIW is slightly lower than the other four. The 95%CI 2α approach has the lowest power. Regarding controlling the 5% level of significance and power, the TOST approach, as well as the three CI approaches (90%CI 2α, 95%CI L, and 95%CIE ), perform best among the nine approaches.

Innovative Statistics in Regulatory Science

84

TABLE 3.7 Type I Error Rate When H0 in (3.4) Is True for n1 = n2 = 12 GMR

µT

σR

2 σ BT 2 σ TT

1.25 1.25 1.25 1.25 1.25 1.25 1.26 1.26 1.26 1.26 1.26 1.26 1.27 1.27 1.27 1.27 1.27 1.27

125 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125 125

10 10 10 10 10 10 20 20 20 20 20 20 30 30 30 30 30 30

0.75 0.75 0.75 0.25 0.25 0.25 0.75 0.75 0.75 0.25 0.25 0.25 0.75 0.75 0.75 0.25 0.25 0.25

ρ

90% %CIL

90%CIW

TOST

95%CIW

95%CI 2α

0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6

0.09 0.084 0.09 0.089 0.09 0.088 0.081 0.078 0.075 0.077 0.078 0.076 0.07 0.065 0.06 0.071 0.069 0.069

0.09 0.084 0.09 0.089 0.09 0.088 0.081 0.078 0.075 0.077 0.078 0.076 0.069 0.064 0.06 0.069 0.068 0.068

0.045 0.043 0.045 0.044 0.042 0.043 0.039 0.037 0.034 0.038 0.037 0.036 0.035 0.031 0.027 0.031 0.034 0.033

0.044 0.043 0.045 0.043 0.042 0.043 0.039 0.037 0.034 0.038 0.037 0.036 0.033 0.03 0.027 0.03 0.033 0.032

0.021 0.023 0.023 0.021 0.02 0.022 0.019 0.018 0.016 0.017 0.018 0.014 0.016 0.013 0.013 0.016 0.016 0.015

Note: GMR: geometric mean ratio between T and R. For  the type I error rate, deeper color indicates higher value.

TABLE 3.8 Power When Ha in (3.4) Is True for n1 = n2 = 12 σ BT 2

GMR µT

σR

σ TT 2

ρ

1 1 1 1 1 1 1 1 1 1 1 1 1 1

10 10 10 10 10 10 20 20 20 20 20 20 30 30

0.75 0.75 0.75 0.25 0.25 0.25 0.75 0.75 0.75 0.25 0.25 0.25 0.75 0.75

0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3

100 100 100 100 100 100 100 100 100 100 100 100 100 100

90%CIL 1 1 1 1 1 1 0.989 0.998 1 0.989 0.993 0.996 0.804 0.902

90%CIW 1 1 1 1 1 1 0.989 0.998 1 0.989 0.993 0.996 0.798 0.899

TOST

95%CIW

1 1 1 1 1 1 0.968 0.991 1 0.965 0.978 0.985 0.632 0.789

1 1 1 1 1 1 0.968 0.991 1 0.965 0.978 0.985 0.606 0.78

95%CI2α 1 1 1 1 1 1 0.925 0.977 0.997 0.925 0.946 0.963 0.421 0.627 (Continued)

Hypotheses Testing versus Confidence Interval

85

TABLE 3.8 (Continued) Power When Ha in (3.4) Is True for n1 = n2 = 12 σ BT 2

GMR µT

σR

σ TT 2

ρ

1 1 1 1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.11 1.11 1.11 1.11 1.11 1.11 1.2 1.2 1.2 1.2 1.2 1.2 1.21 1.21 1.21 1.21 1.21 1.21 1.22 1.22 1.22 1.22 1.22 1.22

30 30 30 30 10 10 10 10 10 10 20 20 20 20 20 20 30 30 30 30 30 30 10 10 10 10 10 10 20 20 20 20 20 20 30 30 30 30 30 30

0.75 0.25 0.25 0.25 0.75 0.75 0.75 0.25 0.25 0.25 0.75 0.75 0.75 0.25 0.25 0.25 0.75 0.75 0.75 0.25 0.25 0.25 0.75 0.75 0.75 0.25 0.25 0.25 0.75 0.75 0.75 0.25 0.25 0.25 0.75 0.75 0.75 0.25 0.25 0.25

0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6 0 0.3 0.6

100 100 100 100 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120 120

90%CIL 0.971 0.8 0.837 0.878 0.999 1 1 1 1 1 0.832 0.891 0.961 0.834 0.849 0.874 0.567 0.637 0.759 0.559 0.592 0.617 0.572 0.632 0.736 0.562 0.592 0.615 0.263 0.29 0.349 0.269 0.264 0.277 0.176 0.192 0.209 0.177 0.181 0.183

90%CIW 0.97 0.794 0.832 0.874 0.999 1 1 1 1 1 0.831 0.891 0.961 0.834 0.849 0.874 0.562 0.632 0.759 0.554 0.587 0.612 0.572 0.631 0.736 0.561 0.591 0.615 0.263 0.29 0.349 0.269 0.264 0.277 0.173 0.191 0.209 0.174 0.179 0.182

TOST

95%CIW

0.926 0.625 0.683 0.745 0.998 1 1 0.998 0.999 0.999 0.718 0.807 0.909 0.719 0.737 0.773 0.41 0.486 0.622 0.403 0.431 0.467 0.422 0.483 0.608 0.41 0.44 0.462 0.163 0.174 0.223 0.162 0.158 0.168 0.096 0.107 0.117 0.094 0.099 0.101

0.924 0.6 0.667 0.73 0.998 1 1 0.998 0.999 0.999 0.717 0.807 0.908 0.718 0.736 0.773 0.398 0.48 0.619 0.391 0.423 0.46 0.422 0.483 0.608 0.409 0.439 0.462 0.162 0.174 0.223 0.161 0.157 0.167 0.093 0.105 0.116 0.092 0.096 0.1

95%CI2α 0.851 0.422 0.492 0.562 0.994 0.999 1 0.994 0.996 0.998 0.591 0.694 0.833 0.593 0.61 0.659 0.269 0.353 0.484 0.258 0.302 0.329 0.294 0.351 0.477 0.288 0.318 0.338 0.096 0.103 0.139 0.097 0.091 0.101 0.049 0.06 0.065 0.05 0.051 0.056

Note: GMR: geometric mean ratio between T and R. For  the power, deeper color indicates higher value.

86

Innovative Statistics in Regulatory Science

In  addition, we see from the simulation results that the TOST approach is not  equivalent to some 90% approaches, 90%CI L, 90%CIE , and 90%CIW . The later three approaches have higher type I error rates and higher power than the TOST approach. Therefore, if we start with hypotheses testing but use 90% CI approaches (say, 90%CI L, 90%CIE , and 90%CIW ) for bioequivalence assessment, we may have the risk of undesirable high type I error rate, which can be greater than 5%. 3.5.3 An Example—Binary Responses For  binary responses, the following interval hypotheses are usually used for testing bioequivalence or biosimilarity between the test product and the reference product: H 0 : pT − pR ≤ −0.2 or pT − pR ≥ 0.2 vs. H a : −0.2 < pT − pR < 0.2,

(3.11)

where pT and pR are response rates of the test product and the reference product, respectively. To compare the power (or type I error rates) of the TOST approach and the CI approach, we conducted simulations to evaluate their performances. The TOST approach we use is based on the adjusted Wald test (Agresti and Min, 2005). The CI approach we use is based on Tango’s score confidence interval (Tango, 1998). We consider the classic 2  ×  2 crossover design without carryover effects for total sample sizes of n = 24, 48, 100, and 200, respectively. Unlike the crossover design for the continuous responses, here we do not consider period effect and sequence effect. Thus the samples can be seen as n independently and identically distributed matched pairs. For parameter specifications, we consider a variety of response rate combinations ( p00 , p01 , p10 , p11 ), where p jk , j = 0,1, k = 0,1 is the rate of binary reference response being equal to j and binary test response being equal to k. 10,000 repetitions were implemented for each scenario. The type I error rates are presented in Table 3.9. With sample size increasing, the type I error rates of both approaches converge to the level of 5%. When sample size is small, the type I error rate can depart far away from the level of 5%. In addition, the two approaches have different type I error rates. The power is presented in Table 3.10. We see the two approaches have different power. The power of the TOST approach is higher than the other, especially when sample size is small. With limited sample size (say, n = 24 as in the conventional 2 × 2 crossover design for continuous responses), the power of both approaches is not high (sometimes very low). Simulation results indicate that compared with continuous responses, binary responses require larger sample sizes to achieve satisfactory power and type I error rate. In addition, the TOST approach and the CI approach are not operationally equivalent under our simulation settings.

Hypotheses Testing versus Confidence Interval

87

TABLE 3.9 Type I Error Rate for Binary Responses When H0 in (3.11) Is True p00

p01

p10

p11

n

L1

U1

L2

U2

CR1

CR 2

0.1

0

0.2

0.7

24 48

0.049 0.096

0.322 0.288

0.079 0.123

0.359 0.308

0.118 0.067

0.036 0.067

100

0.130

0.263

0.143

0.273

0.081

0.046

0.1

0.3

0.1

0

0.3

0.2

0.5

0.5

200

0.152

0.245

0.158

0.250

0.062

0.041

24 48

−0.009 0.053

0.376 0.331

−0.006 0.055

0.394 0.340

0.034 0.060

0.012 0.049

100

0.098

0.293

0.100

0.297

0.055

0.050

200

0.129

0.268

0.130

0.269

0.052

0.049

24 48

0.048 0.096

0.322 0.289

0.079 0.123

0.358 0.309

0.116 0.060

0.033 0.060

100

0.130

0.262

0.143

0.272

0.080

0.045

200

0.151

0.245

0.158

0.250

0.063

0.043

Note: L1 and U1 are the means of lower and upper bounds of the 95% Adjusted Wald interval for a difference of proportions with matched pairs (Agresti and Min, 2005). L2 and U2 are the means of lower and upper bounds of the 95% Tango’s score confidence interval for a difference of proportions with matched pairs (Tango, 1998). CR1 and CR2 are the rejection rates of the TOST approach based on Adjusted Wald interval and the CI approach based on Tango’s score confidence interval, respectively. The darker the color, the larger the value.

TABLE 3.10 Power for Binary Responses When Ha in (3.11) Is True p00

p01

p10

p11

n 24 48 100 200 24 48 100 200 24 48 100 200 24 48 100 200 24

0

0.1

0.1

0.8

0.1

0.05

0.05

0.8

0.05

0.05

0.1

0.8

0.1

0

0.1

0.8

0

0.05

0.15

0.8

L1 −0.148 −0.105 −0.074 −0.052 −0.112 −0.076 −0.052 −0.037 −0.084 −0.043 −0.015 0.005 −0.016 0.022 0.047 0.063 −0.054

U1 0.147 0.105 0.073 0.052 0.113 0.079 0.053 0.037 0.177 0.140 0.112 0.094 0.202 0.171 0.148 0.134 0.236

L2 −0.163 −0.111 −0.076 −0.053 −0.133 −0.085 −0.056 −0.038 −0.093 −0.045 −0.015 0.005 −0.011 0.040 0.061 0.070 −0.053

U2

CR1

CR 2

0.162 0.111 0.075 0.053 0.135 0.088 0.056 0.039 0.202 0.152 0.117 0.097 0.241 0.192 0.159 0.140 0.262

0.483 0.853 0.994 1.000 0.820 0.978 1.000 1.000 0.581 0.834 0.983 1.000 0.561 0.651 0.932 0.996 0.334

0.306 0.817 0.992 1.000 0.645 0.967 1.000 1.000 0.412 0.793 0.974 1.000 0.290 0.651 0.879 0.992 0.210

(Continued)

Innovative Statistics in Regulatory Science

88

TABLE 3.10 (Continued) Power for Binary Responses When Ha in (3.11) Is True p00

p01

p10

p11

0.1

0.1

0.1

0.7

0.15

0.05

0.1

0.7

0.05

0.1

0.15

0.7

0.1

0.05

0.15

0.7

0.1

0.2

0.2

0.5

0.3

0.1

0.1

0.5

0.3

0.05

0.15

0.5

0.1

0.15

0.25

0.5

n 48 100 200 24 48 100 200 24 48 100 200 24 48 100 200 24 48 100 200 24 48 100 200 24 48 100 200 24 48 100 200 24 48 100 200

L1 −0.007 0.027 0.049 −0.147 −0.105 −0.073 −0.052 −0.085 −0.043 −0.014 0.005 −0.114 −0.068 −0.032 −0.008 −0.053 −0.008 0.026 0.048 −0.202 −0.144 −0.102 −0.074 −0.148 −0.105 −0.073 −0.052 −0.053 −0.007 0.026 0.049 −0.107 −0.049 −0.004 0.027

U1 0.199 0.170 0.150 0.148 0.106 0.073 0.052 0.175 0.140 0.113 0.095 0.210 0.164 0.130 0.107 0.237 0.198 0.169 0.150 0.201 0.147 0.103 0.073 0.147 0.106 0.073 0.052 0.237 0.199 0.169 0.150 0.290 0.239 0.199 0.171

L2 −0.004 0.030 0.050 −0.162 −0.111 −0.076 −0.053 −0.094 −0.044 −0.013 0.006 −0.122 −0.071 −0.033 −0.008 −0.053 −0.005 0.029 0.050 −0.211 −0.148 −0.104 −0.074 −0.164 −0.111 −0.076 −0.052 −0.053 −0.003 0.029 0.050 −0.112 −0.050 −0.004 0.027

U2

CR1

CR 2

0.212 0.176 0.153 0.163 0.112 0.075 0.053 0.200 0.153 0.118 0.097 0.228 0.173 0.134 0.109 0.263 0.211 0.176 0.153 0.211 0.151 0.104 0.073 0.163 0.112 0.075 0.053 0.264 0.212 0.176 0.153 0.304 0.246 0.201 0.172

0.499 0.743 0.939 0.488 0.854 0.996 1.000 0.588 0.836 0.983 1.000 0.321 0.667 0.922 0.996 0.336 0.501 0.752 0.944 0.079 0.475 0.883 0.996 0.482 0.855 0.995 1.000 0.331 0.499 0.749 0.941 0.065 0.294 0.506 0.744

0.450 0.697 0.927 0.310 0.822 0.993 1.000 0.412 0.793 0.974 1.000 0.177 0.621 0.908 0.994 0.214 0.451 0.710 0.932 0.023 0.420 0.874 0.996 0.310 0.821 0.993 1.000 0.209 0.448 0.704 0.927 0.017 0.256 0.489 0.733

Note: L 1 and U1 are the means of lower and upper bounds of the 95% Adjusted Wald interval for a difference of proportions with matched pairs (Agresti and Min, 2005). L2 and U2 are the means of lower and upper bounds of the 95% Tango’s score confidence interval for a difference of proportions with matched pairs (Tango, 1998). CR1 and CR2 are the rejection rates of the TOST approach based on Adjusted Wald interval and the CI approach based on Tango’s score confidence interval, respectively. The darker the color, the larger the value.

Hypotheses Testing versus Confidence Interval

89

3.6 Sample Size Requirement As the TOST approach at 5% level of significance is operationally equivalent to the approaches of 90%CI 2α, 95%CI L, and 95%CIE , for testing (3), the required sample sizes to achieve a desirable power for those approaches are identical. Similarly, the TOST approach at 2.5% level of significance is operationally equivalent to the approach of 95%CI 2α, and the TOST approach at 10% level of significance is operationally equivalent to the approaches of 90%CI L, and 90%CIE . Thus, the required sample sizes for those CI approaches except CIW can be derived equivalently as those derived for the corresponding TOST approach with appropriate level of significance. Thus, without loss of generality, we only need to focus on the TOST approach and the CIW approach for calculating the required sample size for the TOST approach and the CI approaches. The equations for sample size calculation of the two approaches are given below.  and se. The power Denote θ 0 and σ as the true values that are estimated by F of the TOST approach and the CIW approach can be easily expressed as follows.

{  < δ − t se = P {−δ + t se < F }

 − tα se and F  + tα se < δ pTOST = P −δ < F α

}

α

 − θ 0 δ − tα se θ 0    −δ + tα se θ 0 F − < < − se  = E P  σ σ σ σ σ   

{ = P {−δ + t

 − tα se and F  + tα se < δ pCIW = P −δ < F 1 2 α1 se

}

}

 < δ − tα se 5 lb >5%

262 146 12 19

Response rate based on absolute change greater than 5 lb is 60%. Response rate based on relative change greater than 5% is 30%.

the study endpoints. In practice, the sponsors always choose the clinical endpoints to their best interest. The  regulatory agencies, however, require the primary clinical endpoint be specified in the study protocol. Positive results from other clinical endpoints will not be considered as the primary analysis results for regulatory approval. This, however, does not have any scientific or statistical justification for assessment of the treatment effect of the test drug under investigation. In  this chapter, we attempt to provide some insight to the above issues. In particular, the focus is to evaluate the effect on the power of the test when the sample size of the clinical study is determined by an alternative clinical strategy based on different study endpoint and non-inferiority margin. In  the next section, model and assumptions for studying the relationship among these study endpoints are described. Under the model, translations among different study endpoints are studied. Section 4.4 provides a comparison of different clinical strategies for endpoint sections in terms of sample size and the corresponding power. A numerical study is given in Section 4.5 to provide some insight regarding the effect to the different clinical strategies for endpoint selection. Development of therapeutic index function for endpoint selection is given in Section 4.6. Brief concluding remarks are presented in the last section.

4.2 Clinical Strategy for Endpoint Selection In clinical trials, for a given primary response variable, commonly considered study endpoints include: (i) measurements based on absolute change (e.g., endpoint change from baseline); (ii) measurements based on relative change; (iii) proportion of responders based on absolute change; and (iv) proportion of responders based on relative change. We will refer these study endpoints to as the derived study endpoints because they are derived from the original data collected from the same patient population. In practice, it will be more

Innovative Statistics in Regulatory Science

96

TABLE 4.3 Clinical Strategy for Endpoint Selection in Non-inferiority Trials Non-inferiority Margin Absolute Difference (δ 1 )

Relative Difference (δ 2 )

Absolute change (E1 ) Relative change (E2 )

I = E1δ 1 III = E2δ 1

II = E1δ 2 IV = E2δ 2

Responder based on Absolute change (E3 )

V = E3δ 1 VII = E4δ 1

VI = E3δ 2

Study Endpoint

Responder based on Relative change (E4 )

VII = E4δ 2

complicated if the intended trial is to establish non-inferiority of a test treatment to an active control (reference) treatment. In this case, sample size calculation will also depend on the size of the non-inferiority margin, which may be based on either absolute change or relative change of the derived study endpoint. For example, based on responder’s analysis, we may want to detect a 30% difference in response rate or to detect a 50% relative improvement in response rate. Thus, in addition to the four types of derived study endpoints, there are also two different ways to define a non-inferiority margin. Thus, there are many possible clinical strategies with different combinations of the derived study endpoint and the selection of non-inferiority margin for assessment of the treatment effect. These clinical strategies are summarized in Table 4.3. To ensure the success of an intended clinical trial, the sponsor will usually carefully evaluate all possible clinical strategies for selecting the type of study endpoint, clinically meaningful difference, and non-inferiority margin during the stage of protocol development. In practice, some strategies may lead to the success of the intended clinical trial (i.e., achieve the study objectives with the desired power), while some strategies may not. A  common practice for the sponsor is to choose a strategy to their best interest. However, regulatory agencies such as the FDA may challenge the sponsor as to the inconsistent results. This has raised the following questions. First, which study endpoint is telling the truth regarding the efficacy and safety of the test treatment under study? Second, how to translate the clinical information among different derived study endpoints since they are obtained based on the same data collected from the some patient population? These questions, however, remain unanswered.

4.3 Translations among Clinical Endpoints Suppose that there are two test treatments, namely, a test treatment (T and a reference treatment (R). Denote the corresponding measurements of the ith subject in the jth treatment group before and after the treatment by W1ij and

Endpoint Selection

97

W2 ij, respectively, where j = T or R corresponds to the test and the reference treatment, respectively. Assume that the measurement W1ij is lognormal distributed with parameters µ j and σ 12j, i.e., W1ij ~ lognormal ( µ j , σ 12j ). Let W2 ij = W1ij (1 + ∆ ij ), where ∆ ij denotes the percentage change after receiving the treatment. In addition, assume that ∆ ij is lognormal distributed with parameters µ∆ j and σ ∆2 j , i.e., ∆ ij ~ lognormal (µ∆ j , σ ∆2 j ). Thus, the difference and the relative difference between the measurements before and after the treatment are given by W2 ij − W1ij and (W2 ij − W1ij ) /W1ij , respectively. In particular, W2 ij − W1ij = W1ij ∆ ij ~ lognormal ( µ j + µ∆ j , σ j2 + σ ∆2 j ), and W2 ij − W1ij ~ lognormal ( µ∆ j , σ ∆2 j ). W1ij To simplify the notations, define Xij and Yij as Xij = log(W2 ij − W1ij ), W −W Yij = log ( 2Wij 1ij 1ij ). Then, both Xij and Yij are normally distributed with means µ j + µ∆= 1= , 2,… , nj , j T , R, respectively. j and µ ∆ j , i Thus, possible derived study endpoints based on the responses observed before and after the treatment as described earlier include Xij , the absolute difference between “before treatment” and “after treatment” responses of the subjects, Yij , the relative difference between “before treatment” and “after treatment” responses of the subjects, rAj = #{xij > c1 , i = 1, … , nj }/ nj , the proportion of responders, which is defined as a subject whose absolute difference between “before treatment” and “after treatment” responses is larger than a pre-specified value c1, rRj = #{ yij > c2 , i = 1, … , nj }/ nj ,

Innovative Statistics in Regulatory Science

98

the proportion of responders, which is defined as a subject whose relative difference between “before treatment” and “after treatment” responses is larger than a pre-specified value c2. To define notation, for j = T , R, let pAj = E(rAj ) and pRj = E(rRj ). Given the above possible types of derived study endpoints, we may consider the following hypotheses for testing non-inferiority with non-inferiority margins determined based on either absolute difference or relative difference: 1. The absolute difference of the responses H 0 : ( µR − µ∆R ) − ( µT − µ∆T ) ≥ δ 1 vs. H a : ( µR − µ∆R ) − ( µT − µ∆T ) < δ 1

(4.1)

2. The relative difference of the responses H 0 : ( µ∆R − µ∆T ) ≥ δ 2 vs. H a : ( µ∆R − µ∆T ) < δ 2

(4.2)

3. The difference of responders’ rates based on the absolute difference of the responses H 0 : pAR − pAT ≥ δ 3 vs. H a : pAR − pAT < δ 3

(4.3)

4. The  relative difference of responders’ rates based on the absolute difference of the responses H0 :

pAR − pAT p − pAT ≥ δ 4 vs. H a : AR < δ4 pAR pAR

(4.4)

5. The  absolute difference of responders’ rates based on the relative difference of the responses H 0 : pRR − pRT ≥ δ 5 vs. H a : pRR − pRT < δ 5

(4.5)

6. The relative difference of responders’ rate based on the relative difference of the responses H0 :

pRR − pRT p − pRT ≥ δ 6 vs. H a : RR < δ6 pRR pRR

(4.6)

For  a given clinical study, the above are the possible clinical strategies for assessment of the treatment effect. Practitioners or sponsors of the study often choose the strategy to their best interest. It should be noted that current regulatory position is to require the sponsor to pre-specify which study endpoint will be used for assessment of the treatment effect in the study protocol without any scientific justification.

Endpoint Selection

99

In practice, however, it is of particular to study the effect to power analysis for sample size calculation based on different clinical strategies. As pointed out earlier, the required sample size for achieving a desired power based on the absolute difference of a given primary study endpoint may be quite different from that obtained based on the relative difference of the given primary study endpoint. Thus, it is of interest to clinician or clinical scientist to investigate this issue under various scenarios. In particular, hypotheses (4.1) may be used for sample size determination but hypotheses (4.3) are used for testing treatment effect. However, the comparison of these two clinical strategies would be affected by the value of c1, which is used to determine the proportion of responders. However, in the interest of a simple and easier comparison, the number of parameters is kept as small as possible.

4.4 Comparison of Different Clinical Strategies 4.4.1 Test Statistics, Power and Sample Size Determination Note that Xij denotes the absolute difference between “before treatment” and “after treatment” responses of the ith subjects under the jth treatment, and Yij denotes the relative difference between “before treatment” and “after treatment” responses of the i th subjects under the jth treatment. Let x. j = n1j = ∑ in=j 1 xij and y. j = n1j = ∑ ni =j 1 yij be the sample means of Xij and Yij for the jth treatment group, j = T , R, respectively. Based on normal distribution, the null hypothesis in (4.1) is rejected at a level α of significance if x.R − x.T + δ 1 1   1 2 2 2 2  n + n   σ T + σ ∆T + σ R + σ ∆R  R   T

(

) (

)

> zα .

(4.7)

Thus, the power of the corresponding test is given as   Φ  

( µT + µ ∆ T ) − ( µ R + µ ∆ R ) + δ 1

(n

−1 T

)(

) (

+ nR−1  σ T2 + σ ∆2T + σ R2 + σ ∆2R 

)

  − zα  ,    

(4.8)

where Φ(.) is the cumulative distribution function of the standard normal distribution. Suppose that the sample sizes allocated to the reference and test treatments are in the ratio of r, where r is a known constant. Using these

Innovative Statistics in Regulatory Science

100

results, the required total sample size for the test the hypotheses (4.1) with a power level of (1− β ) is N = nT + nR, with nT =

( zα + zβ )2 (σ 12 + σ 22 )(1 + 1/ρ )

[( µ R + µ ∆

2

R

) − ( µT + µ ∆ T ) − δ 1 ]

,

(4.9)

nR = ρ nT and zu is 1− u quantile of the standard normal distribution. Note that yij s are normally distributed. The  testing statistic based on y. j would be similar to the above case. In particular, the null hypothesis in (4.2) is rejected at a significance level α if yT . − yR. + δ 2 1  2  1 2  n + n  (σ ∆T + σ ∆R ) R   T

> zα .

(4.10)

The power of the corresponding test is given as  Φ   

(n

 − zα  . + σ ∆2R )  

µ ∆T − µ ∆ R + δ 2

−1 T

)

+ nR−1 (σ ∆2T

(4.11)

Suppose that nR = ρ nT , where r is a known constant. Then the required total sample size to test hypotheses (4.2) with a power level of (1− β ) is (1+ ρ )nT, where nT =

( zα + zβ )2 (σ ∆2T + σ ∆2R )(1 + 1 / ρ ) 2

[(µR + µ∆R ) − (µT + µ∆T ) − δ 2 ]

.

(4.12)

For sufficiently large sample size nj, rAj is asymptotically normal with mean pA j ) pAJ and variance pAj (1− , j = T , R. Thus, based on Slutsky Theorem, the null nj hypothesis in (4.3) is rejected at an approximate α level of significance if rAT − rAR + δ 3 > zα . 1 1 rAT (1 − rAT ) + rAR (1 − rAR ) nT nR

(4.13)

The power of the above test can be approximated by   pAT − pAR + δ 3 − zα  . Φ  nT−1pAT (1 − pAT ) + nR−1rAR (1 − pAR )   

(4.14)

Endpoint Selection

101

If nR = ρ nT, where r is a known constant. Then, the required sample size to test hypotheses (4.3) with a power level of (1− β ) is (1+ ρ )nT, where nT =

( zα + zβ )2  pAT (1 − pAT ) + pAR (1 − pAR )/ ρ  (pAR − pAT − δ 3 )2

(4.15)

Note that, by definition,  c − (µ + µ )  j 1 ∆j , pA j = 1 − Φ   σ j2 + σ ∆2 j    where j = T , R. Therefore, following similar arguments, the above results c2 − µ ∆ also apply to test hypotheses (4.5) with pAj replaced by pRj = 1 − Φ σ ∆ j j and δ 3 replaced by δ 5. The hypotheses in (4.4) are equivalent to

(

H 0 : (1 − δ 4 )pAR − pAT ≥ 0 vs. H1 : (1 − δ 4 )pAR − pAT < 0.

)

(4.16)

Therefore, the null hypothesis in (4.4) is rejected at an approximate a level of significance if rAT − (1 − δ 4 )rAR 1 (1 − δ 4 )2 rAT (1 − rAT ) + rAR (1 − rAR ) nT nR

> zα .

(4.17)

Using normal approximation to the test statistic when both nT and nR are sufficiently large, the power of the above test can be approximated by   pAT − (1 − δ 4 )pAR − zα  Φ  nT−1pAT (1 − pAT ) + nR−1(1 − δ 4 )2 pAR (1 − pAR )   

(4.18)

Suppose that nR = ρ nT , where r is a known constant. Then the required total sample size to test hypotheses (4.10), or equivalently (4.16), with a power level of (1− β ) is (1+ ρ )nT , where

nT =

( zα + zβ )2  pAT (1 − pAT ) + (1 − δ 4 )2 pAR (1 − pAR )/ ρ   pAT − (1 − δ 4 )pAR 

2

.

(4.19)

Innovative Statistics in Regulatory Science

102

Similarly, the results derived in (4.17) through (4.19) for the hypotheses (4.4) c2 − µ ∆ also apply to the hypotheses in (4.6) with pAj replaced by pRj = 1 − Φ ( σ ∆ j j ) and δ 4 replaced by δ 6 .

4.4.2 Determination of the Non-inferiority Margin Based on the results derived in the previous section, the non-inferiority margins corresponding to the tests based on the absolute difference and the relative difference can be chosen in such a way that the two tests would have the same power. In particular, hypothesis (4.1) and (4.2) would give the power level if the power function given in (4.8) is the same as that given in (4.11). Consequently, the non-inferiority margins δ 1 and δ 2 would satisfy the following equation  (σ T2 + σ ∆2T ) + (σ R2 + σ ∆2R ) ( µT + µ∆T ) − ( µR + µ∆R ) + δ 1 

2

=

(σ ∆2T + σ ∆2R ) ( µ∆T − µ∆R ) + δ 2 

2

.

(4.20)

Similarly, for hypotheses (4.3) and (4.4), the non-inferiority margins δ 3 and δ 4 would satisfy the following relationship pAT (1 − pAT ) + pAR (1 − pAR )/ ρ pAT (1 − pAT ) + (1 − δ 4 )2 pAR (1 − pAR )/ ρ = 2 (pAR − pAT − δ 3 )2  pAR − (1 − δ 4 )pAT 

(4.21)

For hypotheses (4.5) and (4.6), the non-inferiority margins δ 5 and δ 6 satisfy pRT (1 − pRT ) + pRR (1 − pRR )/ ρ pRT (1 − pRT ) + (1 − δ 6 )2 pRR (1 − pRR )/ ρ = 2 (pRR − pRT − δ 5 )2  pRR − (1 − δ 6 )pRT 

(4.22)

Results given in (4.20) through (4.22) provide a way of translating the noninferiority margins between endpoints based on the difference and the relative difference. In  the next section, we would present a numerical study to provide some insight how the power level of these tests would be affected by the choices of different study endpoints for various combinations of parameters values.

4.4.3 A Numerical Study In  this section, a numerical study was conducted to provide some insight about the effect to the different clinical strategies.

Endpoint Selection

103

4.4.3.1 Absolute Difference versus Relative Difference Table  4.4, provides the required sample sizes for the test of non-inferiority based on the absolute difference (Xij) and relative difference (Yij ). In particular, the nominal power level (1− β ) is chosen to be 0.80 and α is 0.05. The corresponding sample sizes are calculated using the formulae in (4.9) and (4.12). It is difficult to conduct any comparison because the corresponding non-inferiority margins are based on different measurement scales. However, to provide some idea to assess the impact of switching from a clinical endpoint based on absolute difference to that based on relative difference, a numerical study on the power of the test was conducted. In  particular, Table  4.5 presents the power of the test for non-inferiority based on relative difference (Y) with the sample sizes determined by the power based on absolute difference (X). The power was calculated using the result given in (4.11). The results demonstrate that the effect is, in general, very significant. In many cases, the power is much smaller than the nominal level 0.8. 4.4.3.2 Responders’ Rate Based on Absolute Difference Similar computation was conducted for the case when the hypotheses are defined in terms of the responders’ rate based on the absolute difference, i.e., hypotheses defined (4.3) and (4.4). Table  4.6 gives the required sample sizes, with the derived results given in (4.15) and (4.19), for the corresponding hypotheses with non-inferiority margins given both in terms of absolute difference and relative difference of the responders’ rates. Similarly, Table 4.7 presents the power of the test for non-inferiority based on relative difference of the responders’ rate with the sample sizes determined by the power based on absolute difference of the responders’ rate. The power was calculated using the result given in (4.14). Again, the results demonstrate that the effect is, in general, very significant. In many cases, the power is much smaller than the nominal level 0.8. 4.4.3.3 Responders’ Rate Based on Relative Difference Similar to the issues considered in the above paragraph with the exception that the responders’ rate is defined based on the relative difference, the required sample sizes for the corresponding hypotheses with non-inferiority margins given both in terms of absolute difference and relative difference of the responders’ rates defined based on the relative difference, i.e., hypotheses defined in (4.5) and (4.6). The results are showed in Table 4.8. Following the similar steps, Table 4.9 presents the power of the test for non-inferiority based on relative difference of the responders’ rate with the sample sizes determined by the power based on absolute difference of the responders’ rate. The similar pattern emerges and the results demonstrate that the power is usually much smaller than the nominal level 0.8.

δ 1 = .50 δ 1 = .55 δ 1 = .60 δ 1 = .65 δ 1 = .70 Relative Difference δ 2  = .40 δ 2  = .50 δ 2  = .60

Absolute Difference

σ T2 + σ R2 σ ∆2T + σ ∆2R

344 253 194 153 124

464 207 116

310 138 78

1.5

275 202 155 123 99

1.0

1.0

619 275 155

413 303 232 184 149

2.0

310 138 78

413 303 232 184 149

1.0

464 207 116

481 354 271 214 174

1.5

2.0

619 275 155

550 404 310 245 198

2.0

310 138 78

550 404 310 245 198

1.0

+ µ ∆ R ) − ( µT + + µ∆T ) = 0.20 ( µR +

464 207 116

619 455 348 275 223

1.5

3.0

619 275 155

687 505 387 306 248

2.0

1237 310 138

619 396 275 202 155

1.0

1855 464 207

773 495 344 253 194

1.5

1.0

2474 619 275

928 594 413 303 232

2.0

1237 310 138

928 594 413 303 232

1.0

1855 464 207

1082 693 481 354 271

1.5

2.0

2474 619 275

1237 792 550 404 310

2.0

( µR + µ∆ R ) − ( µT + µ∆T ) = 0.30

1237 310 138

1237 792 550 404 310

1.0

Sample Sizes for Non-inferiority Testing Based on Absolute Difference and Relative Difference (α = 0.05, β = 0.20, ρ = 1)

TABLE 4.4

1855 464 207

1392 891 619 455 348

1.5

3.0

2474 619 275

1546 990 687 505 387

2.0

104 Innovative Statistics in Regulatory Science

TABLE 4.5
Power of the Test of Non-inferiority Based on Relative Difference
[Power (%) of the relative-difference test, using the sample sizes of Table 4.4, tabulated by δ₁ = .50-.70, δ₂ = .4-.6, the variance ratios, and (µ_R + µ_ΔR) − (µ_T + µ_ΔT) = 0.20 and 0.30.]

TABLE 4.6
Sample Sizes for Non-inferiority Testing Based on Absolute Difference and Relative Difference of Response Rates Defined by the Absolute Difference (Xij) (α = 0.05, β = 0.20, ρ = 1, c₁ − (µ_T + µ_ΔT) = 0)
[Required sample sizes tabulated by the variance ratios, by c₁ − (µ_R + µ_ΔR) = −0.60 and −0.80, and by margins δ₃ = .25-.45 (absolute difference) and δ₄ = .35-.45 (relative difference).]

TABLE 4.7
Power of the Test of Non-inferiority Based on Relative Difference of Response Rates (α = 0.05, β = 0.20, ρ = 1, c₁ − (µ_T + µ_ΔT) = 0)
[Power (%) tabulated by δ₃ = .25-.45, δ₄ = .35-.45, the variance ratios, and c₁ − (µ_R + µ_ΔR) = −0.60 and −0.80.]

TABLE 4.8
Sample Sizes for Non-inferiority Testing Based on Absolute Difference and Relative Difference of Response Rates Defined by the Relative Difference (Yij) (α = 0.05, β = 0.20, ρ = 1, c₂ − µ_ΔT = 0)
[Required sample sizes tabulated by the variance ratios, by c₂ − µ_ΔR = −0.30 to −0.60, and by margins δ₅ = .25-.45 (absolute difference) and δ₆ = .35-.45 (relative difference).]

TABLE 4.9
Power of the Test of Non-inferiority Based on Relative Difference of Response Rates (α = 0.05, β = 0.20, ρ = 1, c₂ − µ_ΔT = 0)
[Power (%) tabulated by δ₅ = .25-.45, δ₆ = .35-.45, the variance ratios, and c₂ − µ_ΔR = −0.30 to −0.60.]

4.5 Development of Therapeutic Index Function

4.5.1 Introduction

In pharmaceutical/clinical development of a test treatment, clinical trials are often conducted to evaluate the effectiveness of the test treatment under investigation. In clinical trials, there may be multiple endpoints available for measuring disease status and/or the therapeutic effect of the test treatment under study (Williams et al., 2004; Filozof et al., 2017). In practice, it is usually not clear which study endpoint best informs the disease status and can be used to measure the treatment effect. Thus, it is difficult to determine which endpoint should be used as the primary endpoint, especially as these potential primary endpoints may be correlated with unknown correlation structures. Once the primary study endpoint has been selected, the sample size required for achieving a desired power can be determined. It should be noted, however, that different study endpoints might not translate one to another even though they might be highly correlated. In other words, for a given clinical trial, some study endpoints may be met while others are not; in this case, it is of interest to know which study endpoint is telling the truth. It should also be noted that different study endpoints might result in different sample size requirements.

Typical examples of clinical trials with multiple endpoints are cancer clinical trials. In cancer clinical trials, overall survival (OS), response rate (RR), and/or time to disease progression (TTP) are usually considered primary clinical endpoints for evaluation of the effectiveness of the test treatment under investigation in regulatory submissions. Williams et al. (2004) provided a list of oncology drug products approved by the FDA based on a single endpoint, co-primary endpoints, and/or multiple endpoints between 1990 and 2002. As can be seen from Table 4.10, a total of 57 oncology drug submissions were approved by the FDA between 1990 and 2002. Among the 57 applications, 18 were approved based on the survival endpoint alone, while 18 were approved based on RR and/or TTP alone. Nine submissions were approved based on RR plus tumor-related signs and symptoms (co-primary endpoints). More recently, Zhou et al. (2019) provided a list of oncology and hematology drug approvals by the FDA between 2008 and 2016, among which a total of 12 drugs were approved based on multiple endpoints.

Table 4.10 and Figure 4.1 indicate that endpoint selection is key to the success of intended clinical trials. For example, for those 9 submissions in Table 4.10, if the selected endpoints had been survival or TTP, the clinical trials might have met the study endpoints of RR plus tumor-related signs and symptoms but failed to meet the selected study endpoints. As demonstrated in Table 4.10 and Figure 4.1, composite endpoints are commonly used where multiple outcomes are measured and combined to


TABLE 4.10
Endpoints Supporting Regular Approval of Oncology Drug Marketing Applications, January 1, 1990, to November 1, 2002

Endpoint                                                                  Number
Total                                                                     57
Survival                                                                  18
RR and/or TTP alone (predominantly hormone treatment of
  breast cancer or hematologic malignancies)                              18
Tumor-related signs and symptoms                                          13
  RR + tumor-related signs and symptoms                                   (9)
  Tumor-related signs and symptoms alone                                  (4)
Disease-free survival (adjuvant setting)                                  2
Recurrence of malignant pleural effusion                                  2
Decreased incidence of new breast cancer occurrence                       2
Decreased impairment of creatinine clearance                              1
Decreased xerostomia                                                      1

Source: Williams, G. et al., J. Biopharm. Stat., 14, 5-21, 2004.

FIGURE 4.1 Number of applications approved by endpoint. The number of applications approved for a new indication in an oncology product by the US FDA between 2008 and 2016, grouped by the primary endpoint of the trial that supported the application. Endpoints are abbreviated as follows: overall survival (OS), progression-free survival (PFS), objective response rate (ORR), relapse-free survival (RFS), event-free survival (EFS), multiple endpoints other than a co-primary endpoint of overall survival and progression-free survival (Multiple), and other endpoints not included in the previous categories (Other). Types of approvals are abbreviated as follows: regular approval (RA), conversion to regular approval (Conv), and accelerated approval (AA). Data were taken from the package inserts of the approved products and FDA records. (From Zhou, J. et al., J. Natl. Cancer Inst., 111, 449-458, 2019.)


evaluate the treatment effect in clinical trials, especially in oncology drug development. The adoption of composite endpoints may be due to the primary endpoint involving rare events or requiring a long observation time, among other reasons. In oncology drug development, commonly used composite endpoints may be composed of the following four endpoints: (i) overall survival (OS); (ii) response rate (RR); (iii) time to disease progression (TTP); and (iv) tumor-related signs and symptoms (TSS). Note that OS can be further divided into three categories: disease-free survival (DFS), progression-free survival (PFS), and relapse-free survival (RFS); yet for simplicity and without loss of generality, we do not differentiate them here. Assume all four endpoints are used in constructing the composite endpoint(s). Depending on the number of endpoints combined, there can be up to a total of 15 possible combinations (enumerated programmatically in the sketch below):

1. Four options of one endpoint, i.e., {OS, RR, TTP, TSS};
2. Six options of two-endpoint combinations, i.e., {(OS, RR), (OS, TTP), (OS, TSS), (RR, TTP), (RR, TSS), (TTP, TSS)};
3. Four options of three-endpoint combinations, i.e., {(OS, RR, TTP), (OS, RR, TSS), (OS, TTP, TSS), (RR, TTP, TSS)};
4. One option of four endpoints, i.e., {(OS, RR, TTP, TSS)}.

In practice, however, it is usually not clear which study endpoint and/or composite endpoint best informs the disease status and measures the treatment effect. Moreover, different study endpoints and/or composite endpoints may not translate to one another even though they may be highly correlated. In clinical trials, power calculation for sample size is very sensitive to the selected primary endpoint; different endpoints may result in different sample sizes. As an example, consider cancer clinical trials, where commonly considered primary endpoints include OS, RR, TTP, and TSS. Power calculations for sample size based on OS, RR, TTP, and/or TSS could be very different. For illustration purposes, Table 4.11 summarizes sample sizes calculated based on different endpoints and their corresponding margins in oncology drug clinical trials based on historical data available in the literature (Motzer et al., 2019), with the margins selected conventionally. From Table 4.11, it can be seen that different endpoints result in different sample size requirements.

In this section, we intend to develop a therapeutic index based on a utility function to combine all study endpoints. The developed therapeutic index fully utilizes the information collected via the available study endpoints for an overall assessment of the effectiveness of the test treatment under investigation. Statistical properties and performance of the proposed therapeutic index are evaluated both theoretically and via clinical trial simulations.
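As a quick check of the count above, a minimal Python sketch that enumerates all non-empty combinations of the four endpoint labels:

```python
from itertools import combinations

endpoints = ["OS", "RR", "TTP", "TSS"]

# All non-empty subsets of the four endpoints:
# 4 singletons + 6 pairs + 4 triples + 1 quadruple = 15 combinations.
all_composites = [
    combo
    for k in range(1, len(endpoints) + 1)
    for combo in combinations(endpoints, k)
]

for combo in all_composites:
    print(combo)
print(len(all_composites))  # 15
```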

TABLE 4.11
Sample Size Calculation Based on Different Endpoints for One-sided Test of Non-inferiority with Significance Level α = 0.05 and Expected Power 1 − β = 0.9

Endpoint  Ha           Formula                                                           Margin (δ)  Sample Size  Other Parameters
OS        θ > δ        (z_α + z_β)² / [π₁π₂ p_o (ln θ* − ln δ)²]                         0.82        6213         p_o = 0.14; θ* = 1
TTP       θ > δ        (z_α + z_β)² / [π₁π₂ p_t (ln θ* − ln δ)²]                         0.61        351          p_t = 0.4; θ* = 1
RR        p₂ − p₁ > δ  (z_α + z_β)² [p₁(1 − p₁)/κ + p₂(1 − p₂)] / (p₂* − p₁* − δ)²       0.29        45           p₁ = 0.26; p₂ = 0.55; p₂* − p₁* = 0

Note: 1. We assume a balanced design, i.e., κ = 1 and π₁ = π₂ = 1/2. The sample size calculation formulas are obtained in Chow et al. (2018). The non-inferiority margin and other parameters are based on the descriptive statistics for the 560 patients with PD-L1-positive tumors in Motzer et al. (2019), where the margin is selected as the improvement of the clinically meaningful difference. 2. Ha denotes the alternative hypothesis, δ is the non-inferiority margin, θ = h₁/h₂ is the hazard ratio, pᵢ is the response rate for sample i, π₁ and π₂ are the proportions of the sample size allocated to the two groups, p_o is the overall probability of death occurring within the study period, p_t is the overall probability of disease progression occurring within the study period, ln θ is the natural logarithm of the hazard ratio, κ = n₁/n₂ is the sample size ratio, and z_α = Φ⁻¹(1 − α) is the 100(1 − α)% quantile of the standard normal distribution.
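As a cross-check of Table 4.11, a short Python sketch assuming the formulas shown above with α = 0.05 and β = 0.1 reproduces the three sample sizes:

```python
from math import ceil, log
from scipy.stats import norm

alpha, beta = 0.05, 0.10
z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)

def n_survival(p_event, delta, theta_star=1.0, pi1=0.5, pi2=0.5):
    # Hazard-ratio non-inferiority test driven by the event probability.
    return ceil((z_a + z_b) ** 2
                / (pi1 * pi2 * p_event * (log(theta_star) - log(delta)) ** 2))

def n_response(p1, p2, delta, diff_star=0.0, kappa=1.0):
    # Two-proportion non-inferiority test (normal approximation).
    var = p1 * (1 - p1) / kappa + p2 * (1 - p2)
    return ceil((z_a + z_b) ** 2 * var / (diff_star - delta) ** 2)

print(n_survival(p_event=0.14, delta=0.82))      # OS:  6213
print(n_survival(p_event=0.40, delta=0.61))      # TTP: 351
print(n_response(p1=0.26, p2=0.55, delta=0.29))  # RR:  45
```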

4.5.2 Therapeutic Index Function

Let $e = (e_1, e_2, \ldots, e_J)'$ denote the baseline clinical endpoints. The therapeutic index is defined as

$$I_i = f_i(\omega_i, e), \quad i = 1, \ldots, K, \tag{4.23}$$

where $\omega_i = (\omega_{i1}, \omega_{i2}, \ldots, \omega_{iJ})'$ is a vector of weights, with $\omega_{ij}$ the weight for $e_j$ with respect to index $I_i$, and $f_i(\cdot)$ is the therapeutic index function that constructs the therapeutic index $I_i$ from $\omega_i$ and $e$. In general, $e_j$ can be of different data types (e.g., continuous, binary, time-to-event), and $\omega_{ij}$ is pre-specified (or calculated by pre-specified criteria) and may differ across therapeutic indexes $I_i$. Moreover, the therapeutic index function typically generates a vector of indexes $(I_1, I_2, \ldots, I_K)'$; if $K = 1$, it reduces to a single (composite) index. As an example, consider $I_i = \sum_{j=1}^{J}\omega_{ij}e_j$; then $I_i$ is simply a linear combination of the endpoints. Moreover, if $\omega_i = (1/J, 1/J, \ldots, 1/J)'$, then $I_i$ is the average over all the endpoints. Although $e_j$ can be of different data types, we assume they are of the same type at this step. On the one hand, we would like to investigate the predictability of $I_i$ given that $e_j$ can inform the disease (drug) status; on the other hand, we are also interested in the predictability of $e_j$ given that $I_i$ is informative. In particular, we are interested in the following two conditional probabilities:

$$p_{1ij} = \Pr(I_i \mid e_j), \quad i = 1, \ldots, K; \; j = 1, \ldots, J \tag{4.24}$$

and

$$p_{2ij} = \Pr(e_j \mid I_i), \quad i = 1, \ldots, K; \; j = 1, \ldots, J. \tag{4.25}$$

Intuitively, we would expect $p_{1ij}$ to be relatively large given that $e_j$ is informative, since $I_i$ is a function of $e_j$; on the other hand, $p_{2ij}$ could be small even if $I_i$ is predictive, since the information contained in $I_i$ may be attributed to another endpoint $e_{j'}$ rather than $e_j$. To derive Equations (4.24) and (4.25), we need to specify the weights $\omega_i$, the distribution of $e$, and the functions $f_i(\cdot)$, which are described in more detail in the following subsections.

4.5.2.1 Selection of ωi

One of the important concerns is how to select the weights $\omega_i$. There might be various ways of specifying the weights, and a reasonable one is based on


the p-values. Specifically, denote by $\theta_j$, $j = 1, \ldots, J$, the treatment effect assessed by the endpoint $e_j$. Without loss of generality, $\theta_j$ is tested by the following hypotheses:

$$H_{0j}: \theta_j \le \delta_j \quad \text{versus} \quad H_{aj}: \theta_j > \delta_j, \tag{4.26}$$

where $\delta_j$, $j = 1, \ldots, J$, are the pre-specified margins. Under appropriate assumptions, we can calculate the p-value $p_j$ for each $H_{0j}$ based on the sample of $e_j$, and the weights $\omega_i$ can be constructed based on $p = (p_1, p_2, \ldots, p_J)'$, i.e.,

$$\omega_{ij} = \omega_{ij}(p), \tag{4.27}$$

which is reasonable since each p-value indicates the significance of the treatment effect based on its corresponding endpoint; thus, it is possible to use all the information available to construct effective therapeutic indexes. Note that $\omega_{ij}(\cdot)$ should be constructed such that a high value of $\omega_{ij}$ goes with a low value of $p_j$; for example, $\omega_{ij} = p_j^{-1}\big/\sum_{j'=1}^{J}p_{j'}^{-1}$.
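As an illustration of this weighting scheme, a minimal Python sketch using hypothetical p-values (the reciprocal-p weighting is one reasonable choice, as noted above):

```python
import numpy as np

def reciprocal_p_weights(p_values):
    """Weights proportional to 1/p_j, so smaller p-values get larger weights."""
    inv = 1.0 / np.asarray(p_values, dtype=float)
    return inv / inv.sum()

# Hypothetical p-values for J = 4 endpoints (for illustration only).
p = [0.01, 0.20, 0.04, 0.50]
w = reciprocal_p_weights(p)
print(w.round(3))   # largest weight on the smallest p-value
print(w.sum())      # 1.0
```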

4.5.2.2 Determination of fi(·) and the Distribution of e

Another important issue is how to select the therapeutic index functions $f_i(\cdot)$, which could be linear, nonlinear, or of an even more complicated form. We consider $f_i(\cdot)$ to be linear here. Thus, (4.23) reduces to

$$I_i = \sum_{j=1}^{J}\omega_{ij}e_j = \sum_{j=1}^{J}\omega_{ij}(p)\,e_j, \quad i = 1, \ldots, K. \tag{4.28}$$

Moreover, we need to specify the distribution of $e$. To simplify, assume $e$ follows the multi-dimensional normal distribution $N(\theta, \Sigma)$, where $\theta = (\theta_1, \ldots, \theta_J)'$ and $\Sigma = (\sigma_{jj'}^2)_{J \times J}$ with

$$\sigma_{jj'}^2 = \sigma_j^2 \text{ for } j' = j \qquad \text{and} \qquad \sigma_{jj'}^2 = \rho_{jj'}\sigma_j\sigma_{j'} \text{ for } j' \neq j.$$

4.5.2.3 Derivation of Pr(Ii|ej) and Pr(ej|Ii)

Suppose $n$ subjects are independently and randomly selected from the population for the clinical trial. For each baseline endpoint $e_j$ and hypothesis $H_{0j}$,


a test statistic $\hat e_j$ is constructed based on the observations of the $n$ subjects, and the corresponding p-value $p_j$ is calculated. "$e_j$ is informative" is equivalent to $\hat e_j > c_j$ for some critical value $c_j$ pre-specified based on $\delta_j$, the significance level $\alpha$, and the variance of $\hat e_j$. The estimate of the therapeutic index $I_i$ in (4.28) can accordingly be constructed as

$$\hat I_i = \omega_i'\hat e = \sum_{j=1}^{J}\omega_{ij}\hat e_j, \quad i = 1, \ldots, K, \tag{4.29}$$

where $\omega_i = (\omega_{i1}, \omega_{i2}, \ldots, \omega_{iJ})'$, $\omega_{ij} = \omega_{ij}(p)$ is calculated based on the p-values $p = (p_1, p_2, \ldots, p_J)'$, and $\hat e = (\hat e_1, \hat e_2, \ldots, \hat e_J)'$. $\hat I_i$ is informative if $\hat I_i > d_i$ for some pre-specified threshold $d_i$. Thus, (4.24) and (4.25) become

$$p_{1ij} = \Pr\left(\hat I_i > d_i \mid \hat e_j > c_j\right), \quad i = 1, \ldots, K; \; j = 1, \ldots, J \tag{4.30}$$

and

$$p_{2ij} = \Pr\left(\hat e_j > c_j \mid \hat I_i > d_i\right), \quad i = 1, \ldots, K; \; j = 1, \ldots, J. \tag{4.31}$$

Without loss of generality, suppose $\hat e$ is the vector of sample means; then $\hat e$ follows the multi-dimensional normal distribution $N(\theta, \Sigma/n)$ under the normality assumption of $e$. Moreover, $\hat I_i$ follows the normal distribution $N(\phi_i, \eta_i^2/n)$, where

$$\phi_i = \omega_i'\theta = \sum_{j=1}^{J}\omega_{ij}\theta_j \qquad \text{and} \qquad \eta_i^2 = \omega_i'\Sigma\omega_i.$$

Further, $(\hat e_j, \hat I_i)'$ jointly follows a bivariate normal distribution $N(\mu, \Gamma/n)$, where

$$\mu = (\theta_j, \phi_i)' \qquad \text{and} \qquad \Gamma = \begin{pmatrix} \sigma_j^2 & 1_j'\Sigma\omega_i \\ 1_j'\Sigma\omega_i & \eta_i^2 \end{pmatrix} = \begin{pmatrix} \sigma_j^2 & \rho_{ji}^*\sigma_j\eta_i \\ \rho_{ji}^*\sigma_j\eta_i & \eta_i^2 \end{pmatrix},$$

where $1_j$ is a $J$-dimensional vector of zeros except for the $j$th element, which equals 1, and thus

$$\rho_{ji}^* = 1_j'\Sigma\omega_i/(\sigma_j\eta_i) = \sum_{j'=1}^{J}\omega_{ij'}\sigma_{jj'}^2/(\sigma_j\eta_i) = \sum_{j'=1}^{J}\omega_{ij'}\rho_{jj'}\sigma_j\sigma_{j'}/(\sigma_j\eta_i) = \frac{1}{\eta_i}\sum_{j'=1}^{J}\omega_{ij'}\rho_{jj'}\sigma_{j'}.$$


Thus, the conditional probabilities (4.30) and (4.31) become

$$p_{1ij} = \frac{\Pr\left(\hat I_i > d_i,\, \hat e_j > c_j\right)}{\Pr\left(\hat e_j > c_j\right)} = \frac{1 - \Phi\left(\frac{\sqrt{n}(c_j - \theta_j)}{\sigma_j}\right) - \Phi\left(\frac{\sqrt{n}(d_i - \phi_i)}{\eta_i}\right) + \Psi\left(\frac{\sqrt{n}(c_j - \theta_j)}{\sigma_j}, \frac{\sqrt{n}(d_i - \phi_i)}{\eta_i}, \rho_{ji}^*\right)}{1 - \Phi\left(\frac{\sqrt{n}(c_j - \theta_j)}{\sigma_j}\right)} \tag{4.32}$$

and

$$p_{2ij} = \frac{\Pr\left(\hat I_i > d_i,\, \hat e_j > c_j\right)}{\Pr\left(\hat I_i > d_i\right)} = \frac{1 - \Phi\left(\frac{\sqrt{n}(c_j - \theta_j)}{\sigma_j}\right) - \Phi\left(\frac{\sqrt{n}(d_i - \phi_i)}{\eta_i}\right) + \Psi\left(\frac{\sqrt{n}(c_j - \theta_j)}{\sigma_j}, \frac{\sqrt{n}(d_i - \phi_i)}{\eta_i}, \rho_{ji}^*\right)}{1 - \Phi\left(\frac{\sqrt{n}(d_i - \phi_i)}{\eta_i}\right)}. \tag{4.33}$$

Moreover,

$$\frac{p_{2ij}}{p_{1ij}} = \frac{\Pr\left(\hat e_j > c_j\right)}{\Pr\left(\hat I_i > d_i\right)} = \frac{1 - \Phi\left(\frac{\sqrt{n}(c_j - \theta_j)}{\sigma_j}\right)}{1 - \Phi\left(\frac{\sqrt{n}(d_i - \phi_i)}{\eta_i}\right)}, \tag{4.34}$$

where $\Phi(x)$ and $\Psi(x, y, \rho)$ denote the cumulative distribution functions of the standard univariate normal and bivariate normal distributions, respectively. Note that both conditional probabilities (4.32) and (4.33) depend on the parameters $\theta$ and $\Sigma$; the sample size $n$; the number of baseline endpoints $J$; the pre-specified weights $\omega_i$; and the pre-specified thresholds $c_j, d_i$, which further depend on the hypothesis testing margins $\delta_j$ and the pre-specified type I error rate(s), among others. Intuitively, there are no simple closed forms of (4.32) and (4.33) that can be derived directly. Although methods such as Taylor expansion may be employed to approximate (4.32) and (4.33), this is still nontrivial and could be quite complicated. However, since $\Phi(x)$ is monotonically increasing, based on (4.34) we have

$$p_{2ij} \gtrless p_{1ij} \iff \frac{c_j - \theta_j}{\sigma_j} \lessgtr \frac{d_i - \phi_i}{\eta_i}. \tag{4.35}$$
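A numerical sketch of (4.32) and (4.33), assuming illustrative parameter values (not from the text) and using scipy's bivariate normal CDF for Ψ:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def conditional_probs(n, theta_j, sigma_j, phi_i, eta_i, rho_star, c_j, d_i):
    """p1ij = Pr(I_hat > d_i | e_hat_j > c_j), p2ij = Pr(e_hat_j > c_j | I_hat > d_i)."""
    a = np.sqrt(n) * (c_j - theta_j) / sigma_j
    b = np.sqrt(n) * (d_i - phi_i) / eta_i
    # Psi(a, b, rho): CDF of the standard bivariate normal at (a, b).
    psi = multivariate_normal(mean=[0, 0],
                              cov=[[1, rho_star], [rho_star, 1]]).cdf([a, b])
    joint = 1 - norm.cdf(a) - norm.cdf(b) + psi   # Pr(e_hat > c, I_hat > d)
    return joint / (1 - norm.cdf(a)), joint / (1 - norm.cdf(b))

p1, p2 = conditional_probs(n=100, theta_j=0.3, sigma_j=1.0,
                           phi_i=0.25, eta_i=0.8, rho_star=0.6,
                           c_j=0.16, d_i=0.13)
print(round(p1, 3), round(p2, 3))
```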


Moreover, we conventionally assume $c_j = \delta_j + z_\alpha\sigma_j/\sqrt{n}$, and $d_i$ is a linear combination of the $c_j$'s, i.e.,

$$d_i = \sum_{j=1}^{J}\omega_{ij}c_j = \sum_{j=1}^{J}\omega_{ij}\delta_j + \frac{z_\alpha}{\sqrt{n}}\sum_{j=1}^{J}\omega_{ij}\sigma_j, \quad i = 1, \ldots, K. \tag{4.36}$$

Then, (4.35) can be further expressed as

$$p_{2ij} \gtrless p_{1ij} \iff \left(1 - \omega_{ij}\frac{\sigma_j}{\eta_i}\right)\left(\Delta\theta_j - \frac{z_\alpha}{\sqrt{n}}\sigma_j\right) \lessgtr \frac{\sigma_j}{\eta_i}\left(\Delta\theta_i^{(-j)} - \frac{z_\alpha}{\sqrt{n}}\sigma_i^{(-j)}\right), \tag{4.37}$$

where $\Delta\theta_j = \theta_j - \delta_j$, $\Delta\theta = (\Delta\theta_1, \Delta\theta_2, \ldots, \Delta\theta_J)'$,

$$\Delta\theta_i^{(-j)} = \omega_i^{(-j)\prime}\Delta\theta = \sum_{j' \neq j}\omega_{ij'}\Delta\theta_{j'}, \qquad \sigma_i^{(-j)} = \sum_{j' \neq j}\omega_{ij'}\sigma_{j'},$$

and $\omega_i^{(-j)}$ equals $\omega_i$ except that the $j$th element equals 0. To obtain more insight into (4.37), we assume $J = 2$ and $K = 1$ and focus on $j = 1$ without loss of generality. Then the inequality in (4.37) can be simplified as

$$\left(1 - \omega_1\frac{\sigma_1}{\eta}\right)\left(\Delta\theta_1 - \frac{z_\alpha}{\sqrt{n}}\sigma_1\right) < \frac{\sigma_1}{\eta}\,\omega_2\left(\Delta\theta_2 - \frac{z_\alpha}{\sqrt{n}}\sigma_2\right), \tag{4.38}$$

where $\omega_1$ and $\omega_2$ are the weights for the two endpoints, with $\omega_1 + \omega_2 = 1$, $\eta = \sqrt{\omega_1^2\sigma_1^2 + 2\rho\omega_1\omega_2\sigma_1\sigma_2 + \omega_2^2\sigma_2^2}$, and $\rho$ the correlation coefficient of the two endpoints. Obviously, (4.38) depends on the variabilities of the endpoints and their correlation, the underlying effect sizes of both endpoints, the weights, and the sample size. We illustrate several special scenarios of (4.38) in Table 4.12. From Table 4.12, we can see a remarkable situation: when $\rho = 1$ and $\sigma_1 = \tau_1\sigma_2$, whether $p_{1ij}$ is greater than $p_{2ij}$ depends on whether the underlying effect size

TABLE 4.12
Illustration of Inequation (4.38) with Respect to Different Parameter Settings
[Special cases of (4.38) under parameter settings such as ω₁ = ω₂ = 1/2, σ₁ = σ₂, ρ = 0, and ρ = 1.]
> δ, where M ≥ δ.

5.3.3 Retention of Treatment Effect in the Absence of Placebo

According to Figure 5.3b, Hung et al. (2003) proposed the concept of the retention ratio, denoted by r, of the effect of the test treatment (i.e., T − P) relative to the effect of the active control agent (i.e., C − P) as compared to a placebo control, regardless of the presence of the placebo in the study. That is,

$$r = \frac{T - P}{C - P},$$

where r is a fixed constant between 0 and 1. Chow and Shao (2006) introduced the parameter δ, which is the superiority margin as compared to the placebo. The relationship among P, T, C, δ, and M is illustrated in Figure 5.3c. In the worst possible scenario, we may select M = δ = T − P. In this case, the retention rate becomes

$$r = \frac{T - P}{C - P} = \frac{\delta}{C - P} = \frac{M}{C - P},$$

which leads to M = r(C − P). Jones et al. (1996) suggested that r = 0.5 be chosen, while r = 0.2 is probably the most commonly employed for selection of the non-inferiority margin without any clinical judgment or statistical reasoning. Thus, the selection of the non-inferiority margin depends upon the estimation of the retention rate of the effect of the test treatment relative to the effect of the active control agent.
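For instance, a two-line Python sketch of this relationship, using a hypothetical control effect C − P = 0.40 and the two conventional retention rates:

```python
# M = r * (C - P): the margin as a retained fraction of the active
# control effect. The control effect value here is hypothetical.
c_minus_p = 0.40
for r in (0.5, 0.2):
    print(f"r = {r}: M = {r * c_minus_p:.2f}")
```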


5.4 Methods for Selection of Non-inferiority Margin

5.4.1 Classical Method

In clinical trials, equivalence limits for therapeutic equivalence generally depend on the nature of the drug, the targeted patient population, and the clinical endpoints (efficacy and safety parameters) used for the assessment of the therapeutic effect. For example, for some drugs, such as topical or vaginal antifungals, which may not be absorbed into the blood, the FDA proposed equivalence limits for certain clinical endpoints such as binary responses (Huque and Dubey, 1990). As an example, for the study endpoint of cure rate, if the cure rate for the reference drug is greater than 95%, then a difference in cure rate within 5% is not considered a clinically important difference (see Table 5.2).

TABLE 5.2
Equivalence Limits for Binary Responses

Response Rate for the Reference Drug (%)    Equivalence Limits (%)
50-80                                        ±20
80-90                                        ±15
90-95                                        ±10
>95                                          ±5

5.4.2 FDA's Recommendations

The 2010 FDA draft guidance recommends that two non-inferiority margins, namely M1 and M2, be considered. The guidance indicates that M1 is based on (i) the treatment effect estimated from the historical experience with the active control drug, (ii) assessment of the likelihood that the current effect of the active control is similar to the past effect (the constancy assumption), and (iii) assessment of the quality of the non-inferiority trial, particularly looking for defects that could reduce a difference between the active control and the new drug. Thus, M1 is defined as the entire effect of the active control assumed to be present in the non-inferiority study:

$$M_1 = C - P. \tag{5.1}$$

On the other hand, the FDA indicates that M2 is selected based on clinical judgment and should never be greater than M1, even for active control drugs with small effects. It should be noted that clinical judgment might argue that a larger difference is not clinically important; ruling out a difference between the active control and the test treatment that is larger than M1 is a critical finding that supports the conclusion of effectiveness. Thus, M2 can be obtained as

$$M_2 = (1 - \delta_0)M_1 = (1 - \delta_0)(C - P), \tag{5.2}$$

where

$$\delta_0 = 1 - r = 1 - \frac{T - P}{C - P} = \frac{C - T}{C - P}$$

is referred to as the ratio of the effect of the active control agent as compared to the test treatment to the effect of the active control agent as compared to the placebo. Thus, δ₀ becomes smaller as the difference between C and T decreases, i.e., as T gets close to C (the retention rate of T is close to 1). In this case, the FDA suggests a wider margin for the non-inferiority testing.

5.4.3 Chow and Shao's Method

By the 2010 FDA draft guidance, there are essentially two different approaches to the analysis of a non-inferiority study: one is the fixed margin method (or the two confidence interval method) and the other is the synthesis method. In the fixed margin method, the margin M1 is based on estimates of the effect of the active comparator in previously conducted studies, making any needed adjustment for changes in trial circumstances. The non-inferiority margin is then pre-specified, and it is usually chosen as a margin smaller than M1 (i.e., M2). The synthesis method combines (or synthesizes) the estimate of treatment effect relative to the control from the non-inferiority trial with the estimate of the control effect from a meta-analysis of historical trials. This method treats both sources of data as if they came from the same randomized trial to project what the placebo effect would have been had the placebo been present in the non-inferiority trial. Following the idea of ICH E10 (2000) that the selected margin should not be greater than the smallest effect size that the active control has, Chow and Shao (2006) introduced another parameter δ, which is a superiority margin over the placebo (δ > 0), and assumed that the non-inferiority margin M is proportional to δ, i.e., M = λδ. Then, under the worst scenario, i.e., when T − C achieves its lower bound −M, the largest possible M is given by M = C − P − δ, which leads to

$$M = \frac{\lambda(C - P)}{1 + \lambda}, \qquad \text{where} \qquad \lambda = \frac{r}{1 - r}.$$


It can be seen that if 0 < r ≤ 1/2, then 0 < λ ≤ 1. To account for the variability of C − P, Chow and Shao suggested that the non-inferiority margins M1 and M2 be modified as follows, respectively:

$$M_3 = M_1 - (z_{1-\alpha} + z_\beta)SE_{C-T} = C - P - (z_{1-\alpha} + z_\beta)SE_{C-T}, \tag{5.3}$$

where $SE_{C-T}$ is the standard error of $\hat C - \hat T$ and $z_a = \Phi^{-1}(a)$, assuming that $SE_{C-P} \approx SE_{T-P} \approx SE_{C-T}$. Similarly, M2 can be modified as follows:

$$M_4 = rM_3 = r\left\{C - P - (z_{1-\alpha} + z_\beta)SE_{C-T}\right\} = \frac{\lambda}{1+\lambda}\left\{C - P - (z_{1-\alpha} + z_\beta)SE_{C-T}\right\} = \left(1 - \frac{1}{1+\lambda}\right)M_3, \tag{5.4}$$

where δ₀ is chosen to be 1/(1 + λ), as suggested by Chow and Shao (2006).

5.4.4 Alternative Methods

Let CL and CU be the minimum and maximum effects of C when compared with P. If the effect of the test treatment falls within the range (CL, CU), we consider T equivalent to C and superior to P. Consider the worst possible scenario, in which the effect of the active control falls on CU while the effect of the test treatment T falls on CL. In this case, we may consider the difference between CL and CU as the non-inferiority margin. That is,

$$M_5 = \hat C_U - \hat C_L. \tag{5.5}$$

In addition, since the selection of M depends upon the choice of δ₀, in practice δ₀ is often chosen as either δ₀ = 0.5 (r = 0.5) or δ₀ = 0.8 (r = 0.2); the non-inferiority margin becomes narrower as δ₀ approaches 1. Based on the above argument, in the worst possible scenario δ₀ can be estimated by

$$\hat\delta_0 = 1 - \frac{\hat T - \hat P}{\hat C - \hat P} = 1 - \frac{\hat C_L}{\hat C_U}.$$

Thus,

$$M_6 = rM_1 = \left(1 - \frac{\hat C_L}{\hat C_U}\right)\left(\hat C - \hat P\right). \tag{5.6}$$


5.4.5 An Example

A pharmaceutical company is interested in conducting a non-inferiority trial to evaluate the safety and efficacy (in terms of cure rate) of a test treatment intended for treating patients with a certain disease as compared to a standard-of-care treatment (an active control agent). At the planning stage of the non-inferiority trial, the question of the sample size required to achieve the desired power for establishing non-inferiority of the test treatment as compared to the active control agent is raised. Sample size calculation, however, depends upon the clinically meaningful difference (margin); a narrower margin requires a much larger sample size to achieve the desired power for establishing non-inferiority. For selection of the non-inferiority margin, both the ICH guideline and the FDA guidance suggest that historical data comparing the active control agent with placebo, if available, be used to determine the non-inferiority margin.

Historical data comparing the active control agent with a placebo are summarized in Table 5.3. Since the response rate for the active control is C = 61.4%, the classical method suggests that a non-inferiority margin of 20% be considered. Also, from Table 5.3, the placebo effect is P = 14.3%. Thus,

M1 = C − P = 61.4% − 14.3% = 47.1%.

The range of C − P is given by (39.7%, 56.7%). If we assume δ₀ = 0.7 (i.e., r = 1 − δ₀ = 0.3), then

M2 = (1 − δ₀)M1 = 0.3 × 47.1% = 14.1%.

TABLE 5.3
Summary Statistics of Historical Data (Active Control Agents C1-C4)

Year of Submission    N     Active Control (C)   Placebo Cure Rate (P)   Difference in Cure Rate
1984                  279   63.1%                 7.3%                   55.8%
1985                  209   60.2%                 4.0%                   56.2%
1986                  101   60.0%                14.0%                   46.0%
1986                  100   70.0%                13.3%                   56.7%
1986                  108   55.1%                13.6%                   41.5%
1986                   90   66.0%                18.6%                   47.4%
1988                  137   58.7%                17.6%                   41.1%
1982                  203   60.2%                20.5%                   39.7%
1986                   88   60.0%                16.7%                   43.3%
1988                   97   60.9%                17.6%                   43.3%
Mean                        61.4%                14.3%                   47.1%
SD                           4.1%                 5.2%                    6.7%
Minimum                     55.1%                 4.0%                   39.7%
Maximum                     70.0%                20.5%                   56.7%


Assuming that $SE_{C-P} \approx SE_{T-P} \approx SE_{C-T}$, we have $SE_{C-T} = 6.7\%$. This leads to

$$M_3 = M_1 - (z_{1-\alpha} + z_\beta)SE_{C-T} = 47.1\% - (1.96 + 0.84) \times 6.7\% = 27.7\%.$$

Consequently,

$$M_4 = \left(1 - \frac{1}{1+\lambda}\right)M_3 = 0.76 \times 27.7\% = 21\%.$$

For the proposed margin M5, since the minimum and maximum effects of C − P are given by $\hat C_L = 39.7\%$ and $\hat C_U = 56.7\%$, we have

$$M_5 = \hat C_U - \hat C_L = 56.7\% - 39.7\% = 17\%.$$

Also, with $\hat\delta_0$ estimated from $\hat C_L/\hat C_U = 39.7\%/56.7\%$ as $\hat\delta_0 = 0.68$, we have $r = 1 - \hat\delta_0 = 0.32$. This leads to

$$M_6 = rM_1 = 0.32 \times 47.1\% = 15.1\%.$$
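The arithmetic of this example can be reproduced with a short Python sketch. Inputs are taken from Table 5.3, with z-values as quoted in the text; small differences from the text's rounded results (e.g., M3 prints as about 28.3% rather than 27.7%) presumably reflect rounding of the inputs:

```python
# Margin calculations for the example (rates in decimal form).
C, P = 0.614, 0.143      # active control and placebo cure rates (Table 5.3)
se_ct = 0.067            # SE of C - T, approximated by the SD of C - P
cl, cu = 0.397, 0.567    # min and max of C - P (Table 5.3)

m1 = C - P                          # 0.471
m2 = 0.3 * m1                       # r = 0.3 -> 0.141
m3 = m1 - (1.96 + 0.84) * se_ct     # ~0.283 (text reports 27.7%)
m4 = 0.76 * m3                      # (1 - 1/(1 + lambda)) * M3 -> ~0.21
m5 = cu - cl                        # 0.17
m6 = (1 - 0.68) * m1                # 0.151
for name, m in [("M1", m1), ("M2", m2), ("M3", m3),
                ("M4", m4), ("M5", m5), ("M6", m6)]:
    print(name, round(m, 3))
```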

To provide a better understanding, these margins are summarized in Table 5.4. As can be seen from Table 5.4, the margin ranges from 14.1% to 47.1% (the entire effect of the active control agent) with a median of 21%, which is close to the classical method. It should be noted that, prior to the publication of the 2010 draft guidance, the FDA recommended a non-inferiority margin of 15%, while the sponsor requested a non-inferiority margin of 20%.

TABLE 5.4
Non-inferiority Margins Suggested by Various Methods

Method                                       Suggested Non-inferiority Margin
Classical method                             20.0%
Hung et al.'s suggestion with r = 0.5        23.1%
FDA's M1 approach                            47.1%
FDA's M2 approach with r = 0.3ᵃ              14.1%
Chow and Shao's M3 margin                    27.7%
Chow and Shao's M4 margin                    21.0%
Proposed M5 margin                           17.0%
Proposed M6 margin                           15.1%

ᵃ Retention rate of 70%.


Note that, considering M = 0.5(C − P), a conservative estimate of the C effect is obtained using the lower 95% confidence limit of 53.4%. Assuming a 14% therapeutic cure rate for placebo, a 34% therapeutic cure rate for T will maintain the retention ratio (T − P)/(C − P) at 50%.

5.4.6 Remarks

It should be noted that the above methods (except for the classical method) for determination of the non-inferiority margin M are based on data observed from previous superiority studies comparing the active control agent and a placebo, and on data collected from superiority studies comparing the test treatment and the placebo, if available. Thus, the selected margin is in fact an estimate rather than a fixed margin. In other words, the selected margin is a random variable whose statistical properties are unknown. In addition, since the selected non-inferiority margin has a significant impact on the power calculation for sample size, it is suggested that a sensitivity analysis be performed to carefully evaluate the potential impact of the selected margin on non-inferiority testing.

As indicated by the ICH guideline, the selection of a non-inferiority margin should take both clinical judgment and statistical reasoning into consideration. The 2010 FDA draft guidance, however, emphasizes statistical reasoning based on historical data from previous superiority studies comparing the active control agent and the placebo. In practice, there is always discrepancy between the margin suggested by the investigator and the margin recommended by the FDA. In this case, it is suggested that medical/statistical reviewers be consulted who, one hopes, will be able to reach an agreement on the selection of the non-inferiority margin following the general principles described in the FDA draft guidance.

5.5 Strategy for Margin Selection

For the assessment of non-inferiority and/or equivalence/similarity, the selection of the non-inferiority margin and/or equivalence limit (similarity margin) is critical. Too narrow a margin will require a much larger sample size for achieving the study objectives, while too wide a margin may increase the probability of wrongly accepting bad products. Moreover, the selected margin will have an impact on the sample size required for achieving the study objectives. In practice, the sponsor tends to propose a wider margin, which often deviates from the margin recommended by the regulatory agency. This disagreement in margin selection can generate tremendous argument and discussion between the sponsor and the regulatory agency. To close the gap between the sponsor's proposal


and the regulatory agency's recommendation, Nei et al. (2019) proposed a strategy for selection of the similarity margin in comparative clinical trials. Their proposed strategy is based on a risk assessment of the sponsor's proposal under the assumption that the regulatory agency's recommended margin is the true margin, and it can be summarized in the following steps:

Step 1: The sponsor identifies available historical studies that are accepted by the FDA for determining the similarity margin.
Step 2: Based on a meta-analysis combining these identified historical studies, the similarity margin is determined, and hence the corresponding sample size required for testing similarity. The power calculation for the required sample size is based on the sponsor's proposed margin.
Step 3: At the same time, the FDA proposes a similarity margin, taking clinical judgment, statistical rationale, and regulatory feasibility into consideration.
Step 4: A risk assessment is then conducted to evaluate the sponsor's proposed margin assuming that the FDA's proposal is true.
Step 5: The risk assessment is reviewed by the FDA review team and communicated to the sponsor in order to reach agreement on the final similarity margin.

5.5.1 Criteria for Risk Assessment

In this section, we focus on Step 4 of Nei et al. (2019)'s proposed strategy, which quantifies the risk of different margins based on several criteria. This will assist sponsors in adjusting their margins according to the maximum risk allowed by the FDA. Four criteria are considered, addressing different aspects of the similarity test; numerical derivations are given in the next section based on continuous endpoints (e.g., under a normality assumption). Let ε be the true difference between the proposed biosimilar product and its reference product, i.e., ε = µ_B − µ_R, where µ_B and µ_R are the treatment effects of the biosimilar product and the reference product, respectively. We also assume that a positive value of ε means that the biosimilar product is more efficacious than the reference product on the selected efficacy endpoint. Let δ_Sponsor and δ_FDA be the sponsor's proposed margin and the FDA's recommended margin, respectively. Here, we assume 0 < ε < δ_FDA < δ_Sponsor.

Criterion 1: Sample Size Ratio (SSR)—When fixing the power of the similarity test, the sample size is a decreasing function of the similarity margin: the smaller the similarity margin, the bigger the required sample size. In clinical trials, a large sample size corresponds to higher costs for sponsors. The goal is to move the sponsor's proposed margin toward the FDA-recommended margin with a moderate increase in sample size while maintaining


the power at the desired level. Let n_FDA be the sample size required to maintain 1 − β power under δ_FDA, and similarly for n_Sponsor. Define the sample size ratio SSR = n_FDA/n_Sponsor. Then the sample size difference (SSD), SSD = n_FDA − n_Sponsor = (SSR − 1)·n_Sponsor, can be viewed as the amount of information lost by using a wider margin (i.e., δ_Sponsor) assuming that δ_FDA is the true margin. By plotting the sample size ratio curve, we can choose a threshold SSR_M that serves as a guideline for margin determination, say 105%, 110%, 115%, or 120%, corresponding to a 5%, 10%, 15%, or 20% loss based on n_Sponsor.

Criterion 2: Relative Difference in Power (RED)—When fixing the sample size for the similarity test, power is an increasing function of the similarity margin: the larger the margin, the bigger the power. Let Power_Sponsor and Power_FDA be the power under δ_Sponsor and δ_FDA, respectively. Since δ_FDA < δ_Sponsor, Power_FDA < Power_Sponsor. This is due to the wider region of the alternative hypothesis under the wider margin δ_Sponsor, and because wider margins have smaller type II errors. Although we gain some power (a smaller type II error rate) by using a wider margin, we also weaken the result (or, say, its accuracy) when rejecting the null hypothesis. Define RED = Power_Sponsor − Power_FDA. The quantity RED is the gain in power obtained by sacrificing accuracy, i.e., by using a wider margin. In order to close the gap between δ_FDA and δ_Sponsor, we need to minimize RED. So we can set a threshold distance RED_M, say 0.05, 0.10, 0.15, or 0.20, between Power_FDA and Power_Sponsor as the maximum power gain allowed from using a wider margin.

Criterion 3: Relative Ratio in Power/Relative Risk (RR)—The power described above is the probability of concluding biosimilarity. Given the same sample size under both the FDA's and the sponsor's margins, Power_FDA < Power_Sponsor; that is, the probability of concluding biosimilarity is bigger under δ_Sponsor than under δ_FDA. This means that, among all the potential biosimilar products that would be considered biosimilar under the sponsor's proposed margin, only a portion of them will be biosimilar under the FDA-recommended margin; the rest are wrongly claimed as biosimilar by sponsors according to the FDA's margin. This is regarded as a risk factor for using wider margins. Define RR as the probability that a product is not concluded to be biosimilar under δ_FDA given that it is concluded to be biosimilar under δ_Sponsor:

$$RR = 1 - \frac{\text{Power}_{FDA}}{\text{Power}_{Sponsor}} = \frac{\text{Power}_{Sponsor} - \text{Power}_{FDA}}{\text{Power}_{Sponsor}}.$$

Under the FDA-recommended margin, RR is the risk of wrongly concluding biosimilarity of a biosimilar drug using the sponsor's margin. Furthermore,


among all biosimilar drugs concluded using the sponsor's margin, 100·RR percent of them would have failed under the FDA-recommended margin. Thus, RR is the risk of using the sponsor's proposed margin. Wider margins lead to larger risks. Thus, we may choose an appropriate margin by assuring that the risk is smaller than a maximum risk RR_M considered acceptable by the FDA (say 0.15). Let δ_M be the margin that corresponds to RR_M. We will derive an asymptotically analytical form of δ_M in the next section based on continuous endpoints.

Criterion 4: Inflation of Type I Error—Type I error inflation is the probability, assuming the smaller margin is the true difference, of rejecting a null hypothesis based on the wider margin in a study powered to rule out the wider margin (i.e., the type I error rate of the test under the "FDA" null). This will be greater than 5%, and the degree of its inflation is probably relevant information. The inflation is also an increasing function of the similarity margin: bigger margins lead to larger inflations, i.e., larger type I error rates. We can set a threshold value of type I error inflation for choosing the largest margin that is allowed.

5.5.2 Risk Assessment with Continuous Endpoints

In this section, both analytic and asymptotic forms of the four criteria proposed in the last section are derived. Without loss of generality, we only consider biosimilar products that have continuous endpoints; all calculations below can be derived in a similar fashion for biosimilar products with categorical endpoints. Let δ > 0 be the similarity margin; the null hypothesis of the similarity test is H₀: |ε| ≥ δ. Rejection of the null hypothesis implies similarity between the biosimilar product and the reference product. For simplicity, we assume samples from both the biosimilar group and the reference group follow normal distributions with means µ_B and µ_R, respectively, and the same unknown variance σ², which means the within-subject variances of the biosimilar and reference products are the same. That is,

$$x_1^B, x_2^B, \ldots, x_{n_B}^B \sim N(\mu_B, \sigma^2), \qquad x_1^R, x_2^R, \ldots, x_{n_R}^R \sim N(\mu_R, \sigma^2),$$

where n_B and n_R are the sample sizes for the biosimilar group and the reference group. Let µ̂_BR = µ̂_B − µ̂_R be the estimated treatment effect of the biosimilar product relative to the reference product, with standard error

$$\hat\sigma_{BR} = \hat\sigma\sqrt{1/n_B + 1/n_R},$$

where

$$\hat\mu_B = \frac{1}{n_B}\sum_{i=1}^{n_B}x_i^B, \qquad \hat\mu_R = \frac{1}{n_R}\sum_{i=1}^{n_R}x_i^R,$$

and

$$\hat\sigma^2 = \frac{1}{n_B + n_R - 2}\left\{\sum_{i=1}^{n_B}\left(x_i^B - \hat\mu_B\right)^2 + \sum_{i=1}^{n_R}\left(x_i^R - \hat\mu_R\right)^2\right\}.$$

Note that σ̂_BR depends on the sample sizes (and on the treatment effects in some scenarios). The rejection region for testing H₀ with statistical significance level α is

$$R = \left\{\frac{\hat\mu_{BR} + \delta}{\hat\sigma_{BR}} > t_{\alpha, n_B+n_R-2}\right\} \cap \left\{\frac{\hat\mu_{BR} - \delta}{\hat\sigma_{BR}} < -t_{\alpha, n_B+n_R-2}\right\}.$$

Thus, the power of the study is

$$\text{Power} = P\left(\frac{\hat\mu_{BR} + \delta}{\hat\sigma_{BR}} > t_{\alpha, n_B+n_R-2} \text{ and } \frac{\hat\mu_{BR} - \delta}{\hat\sigma_{BR}} < -t_{\alpha, n_B+n_R-2}\right)$$
$$\approx 1 - T_{n_B+n_R-2}\left(t_{\alpha, n_B+n_R-2}\,\middle|\,\frac{\delta - \epsilon}{\sigma\sqrt{1/n_B + 1/n_R}}\right) - T_{n_B+n_R-2}\left(t_{\alpha, n_B+n_R-2}\,\middle|\,\frac{\delta + \epsilon}{\sigma\sqrt{1/n_B + 1/n_R}}\right),$$

where $T_k(\cdot\,|\,\theta)$ is the cumulative distribution function of a non-central t-distribution with k degrees of freedom and non-centrality parameter θ, and −δ < ε < δ under H_a.

Sample size ratio (SSR)—Assume that n_B = κn_R; the sample size n_R needed to achieve power 1 − β can be obtained by setting the power to 1 − β. Since the power is larger than

$$1 - 2T_{n_B+n_R-2}\left(t_{\alpha, n_B+n_R-2}\,\middle|\,\frac{\delta - \epsilon}{\sigma\sqrt{1/n_B + 1/n_R}}\right),$$

a conservative approximation to the sample size n_R can be obtained by solving

$$\frac{\beta}{2} = T_{(1+\kappa)n_R-2}\left(t_{\alpha,(1+\kappa)n_R-2}\,\middle|\,\frac{\sqrt{n_R}(\delta - \epsilon)}{\sigma\sqrt{1 + 1/\kappa}}\right).$$

When the sample size n_R is sufficiently large, $t_{\alpha,(1+\kappa)n_R-2} \approx z_\alpha$ and $T_{(1+\kappa)n_R-2}(t\,|\,\theta) \approx \Phi(t - \theta)$; then

$$\frac{\beta}{2} = T_{(1+\kappa)n_R-2}\left(t_{\alpha,(1+\kappa)n_R-2}\,\middle|\,\frac{\sqrt{n_R}(\delta - \epsilon)}{\sigma\sqrt{1 + 1/\kappa}}\right) \approx \Phi\left(z_\alpha - \frac{\sqrt{n_R}(\delta - \epsilon)}{\sigma\sqrt{1 + 1/\kappa}}\right).$$

As a result, the sample size needed to achieve power 1 − β can be obtained by solving the following equation:

$$z_\alpha - \frac{\sqrt{n_R}(\delta - \epsilon)}{\sigma\sqrt{1 + 1/\kappa}} = z_{1-\beta/2} = -z_{\beta/2}.$$

This leads to

$$n_R = \frac{(z_\alpha + z_{\beta/2})^2\sigma^2(1 + 1/\kappa)}{(\delta - \epsilon)^2}.$$

Thus,

$$n_R^{FDA} = \frac{(z_\alpha + z_{\beta/2})^2\sigma^2(1 + 1/\kappa)}{(\delta_{FDA} - \epsilon)^2}, \qquad n_R^{Sponsor} = \frac{(z_\alpha + z_{\beta/2})^2\sigma^2(1 + 1/\kappa)}{(\delta_{Sponsor} - \epsilon)^2},$$

and with δ_Sponsor = λδ_FDA, we have

$$SSR = \frac{n_{FDA}}{n_{Sponsor}} = \frac{(1+\kappa)n_R^{FDA}}{(1+\kappa)n_R^{Sponsor}} = \left(\frac{\lambda\delta_{FDA} - \epsilon}{\delta_{FDA} - \epsilon}\right)^2$$

and

$$\lambda_M = \sqrt{SSR_M} + \left(1 - \sqrt{SSR_M}\right)\frac{\epsilon}{\delta_{FDA}}.$$

Relative difference in power (RED)—Let

$$B(\epsilon, \delta, n_R, \sigma, \kappa) = T_{(1+\kappa)n_R-2}\left(t_{\alpha,(1+\kappa)n_R-2}\,\middle|\,\frac{\sqrt{n_R}(\delta + \epsilon)}{\sigma\sqrt{1 + 1/\kappa}}\right);$$

based on the calculation above, we immediately have

$$RED = \text{Power}_{Sponsor} - \text{Power}_{FDA} \approx B(\epsilon, \delta_{FDA}, n_R, \sigma, \kappa) - B(\epsilon, \lambda\delta_{FDA}, n_R, \sigma, \kappa) + B(-\epsilon, \delta_{FDA}, n_R, \sigma, \kappa) - B(-\epsilon, \lambda\delta_{FDA}, n_R, \sigma, \kappa).$$

When n_R is sufficiently large, let

$$\tilde\Phi(\epsilon, \delta, n_R, \sigma, \kappa) = \Phi\left(z_\alpha - \frac{\sqrt{n_R}(\delta + \epsilon)}{\sigma\sqrt{1 + 1/\kappa}}\right);$$

by using the same approximation as in the last section,

$$RED \approx \tilde\Phi(\epsilon, \delta_{FDA}, n_R, \sigma, \kappa) - \tilde\Phi(\epsilon, \lambda\delta_{FDA}, n_R, \sigma, \kappa) + \tilde\Phi(-\epsilon, \delta_{FDA}, n_R, \sigma, \kappa) - \tilde\Phi(-\epsilon, \lambda\delta_{FDA}, n_R, \sigma, \kappa).$$

When plugging in the sample size $n_R^{Sponsor}$, which retains 1 − β power under δ_Sponsor,

$$RED_\beta \approx \Phi\left(z_\alpha - \frac{\delta_{FDA} + \epsilon}{\lambda\delta_{FDA} - \epsilon}(z_\alpha + z_{\beta/2})\right) + \Phi\left(z_\alpha - \frac{\delta_{FDA} - \epsilon}{\lambda\delta_{FDA} - \epsilon}(z_\alpha + z_{\beta/2})\right) - \beta.$$

i. When ε > 0,

$$RED_\beta < 2\Phi\left(z_\alpha - \frac{\delta_{FDA} - \epsilon}{\lambda\delta_{FDA} - \epsilon}(z_\alpha + z_{\beta/2})\right) - \beta;$$

ii. When ε < 0,

$$RED_\beta < 2\Phi\left(z_\alpha - \frac{\delta_{FDA} + \epsilon}{\lambda\delta_{FDA} - \epsilon}(z_\alpha + z_{\beta/2})\right) - \beta.$$

Thus, combining the above two cases, we have

$$RED_\beta < 2\Phi\left(z_\alpha - \frac{\delta_{FDA} - |\epsilon|}{\lambda\delta_{FDA} - \epsilon}(z_\alpha + z_{\beta/2})\right) - \beta := RED_\beta^+.$$

For simplicity, we will use $RED_\beta^+$ in the following discussion and write it simply as $RED_\beta$.


Relative ratio in power/relative risk (RR)—Let S_FDA = {reject H₀ when δ = δ_FDA} and S_Sponsor = {reject H₀ when δ = δ_Sponsor}. Since δ_FDA < δ_Sponsor, |ε| ≤ δ_FDA implies |ε| ≤ δ_Sponsor. Therefore, rejecting H₀ under δ_FDA leads to the rejection of H₀ under δ_Sponsor, which means S_FDA ⊆ S_Sponsor and S_FDA ∩ S_Sponsor = S_FDA. Define p_s as the probability of concluding biosimilarity under δ_FDA given concluding biosimilarity under δ_Sponsor. Then, based on the relationship between S_FDA and S_Sponsor, we have

$$p_s = \Pr(\text{conclude similarity under }\delta_{FDA} \mid \text{conclude similarity under }\delta_{Sponsor}) = \frac{\Pr(\text{reject } H_0 \text{ when } \delta = \delta_{FDA})}{\Pr(\text{reject } H_0 \text{ when } \delta = \delta_{Sponsor})} = \frac{\text{Power}_{FDA}}{\text{Power}_{Sponsor}}$$
$$\approx \frac{1 - B(\epsilon, \delta_{FDA}, n_R, \sigma, \kappa) - B(-\epsilon, \delta_{FDA}, n_R, \sigma, \kappa)}{1 - B(\epsilon, \delta_{Sponsor}, n_R, \sigma, \kappa) - B(-\epsilon, \delta_{Sponsor}, n_R, \sigma, \kappa)}.$$

Thus, based on the definition of RR in Criterion 3, we have

$$RR = 1 - p_s \approx \frac{RED}{1 - B(\epsilon, \lambda\delta_{FDA}, n_R, \sigma, \kappa) - B(-\epsilon, \lambda\delta_{FDA}, n_R, \sigma, \kappa)}.$$

For large n_R, we have

$$RR \approx \frac{RED}{1 - \tilde\Phi(\epsilon, \lambda\delta_{FDA}, n_R, \sigma, \kappa) - \tilde\Phi(-\epsilon, \lambda\delta_{FDA}, n_R, \sigma, \kappa)}.$$
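To make the criteria concrete, a Python sketch that evaluates the large-sample forms of SSR, RED, and RR for illustrative parameters (ε = −0.5, δ_FDA = 1.0, σ = 1, κ = 1, in line with the numerical studies below; λ = 1.2 is an arbitrary choice):

```python
import numpy as np
from scipy.stats import norm

alpha, beta = 0.05, 0.20
z_a, z_b2 = norm.ppf(1 - alpha), norm.ppf(1 - beta / 2)

def n_sponsor(eps, delta, sigma, kappa):
    # Large-sample reference-arm size maintaining 1 - beta power at margin delta.
    return (z_a + z_b2) ** 2 * sigma ** 2 * (1 + 1 / kappa) / (delta - eps) ** 2

def power(eps, delta, n_r, sigma, kappa):
    # Large-sample approximation to the TOST power.
    se = sigma * np.sqrt((1 + 1 / kappa) / n_r)
    return 1 - norm.cdf(z_a - (delta - eps) / se) - norm.cdf(z_a - (delta + eps) / se)

eps, d_fda, sigma, kappa, lam = -0.5, 1.0, 1.0, 1.0, 1.2
d_sp = lam * d_fda
n_r = n_sponsor(eps, d_sp, sigma, kappa)          # sized under the sponsor margin

ssr = n_sponsor(eps, d_fda, sigma, kappa) / n_r   # = ((lam*d_fda - eps)/(d_fda - eps))**2
red = power(eps, d_sp, n_r, sigma, kappa) - power(eps, d_fda, n_r, sigma, kappa)
rr = 1 - power(eps, d_fda, n_r, sigma, kappa) / power(eps, d_sp, n_r, sigma, kappa)
print(round(ssr, 3), round(red, 3), round(rr, 3))
```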

Type I error inflation (TERI)—Assuming the smaller margin is the true difference, i.e., ε = ±δ_FDA with δ_FDA < δ_Sponsor, the type I error is calculated as follows:

$$\text{Type I Error}\mid_{\epsilon = \pm\delta_{FDA}} = P\left(\frac{\hat\mu_{BR} + \delta_{Sponsor}}{\hat\sigma_{BR}} > t_{\alpha, n_B+n_R-2} \text{ and } \frac{\hat\mu_{BR} - \delta_{Sponsor}}{\hat\sigma_{BR}} < -t_{\alpha, n_B+n_R-2}\,\middle|\,\epsilon = \pm\delta_{FDA}\right)$$
$$= 1 - T_{n_B+n_R-2}\left(t_{\alpha, n_B+n_R-2}\,\middle|\,\frac{\delta_{Sponsor} + \delta_{FDA}}{\sigma\sqrt{1/n_B + 1/n_R}}\right) - T_{n_B+n_R-2}\left(t_{\alpha, n_B+n_R-2}\,\middle|\,\frac{\delta_{Sponsor} - \delta_{FDA}}{\sigma\sqrt{1/n_B + 1/n_R}}\right)$$
$$= 1 - B(\delta_{FDA}, \lambda\delta_{FDA}, n_R, \sigma, \kappa) - B(-\delta_{FDA}, \lambda\delta_{FDA}, n_R, \sigma, \kappa).$$


For large samples, we have

$$\text{Inflation} \approx 1 - \alpha - \Phi\left(z_\alpha - \frac{\sqrt{n_R}(\lambda + 1)}{\sigma\sqrt{1 + 1/\kappa}}\delta_{FDA}\right) - \Phi\left(z_\alpha - \frac{\sqrt{n_R}(\lambda - 1)}{\sigma\sqrt{1 + 1/\kappa}}\delta_{FDA}\right).$$

5.5.3 Numerical Studies

In this section, numerical studies for all four criteria are conducted and risk curves are plotted. Based on the results, suggestions on choosing a reasonable threshold are discussed for different scenarios, and the validity of the large-sample approximation is investigated for small sample sizes. Throughout this section, the type I and type II error rates are fixed at 0.05 and 0.2.

Sample size ratio (SSR)—It can be verified that

$$\sqrt{SSR} = \frac{\delta_{FDA}}{\delta_{FDA} - \epsilon}\lambda - \frac{\epsilon}{\delta_{FDA} - \epsilon}.$$

To further investigate the relationship between SSR and λ, we consider √SSR instead of SSR, since √SSR is a linear function of λ with δ_FDA/(δ_FDA − ε) as slope and −ε/(δ_FDA − ε) as intercept. The quantity √SSR increases at a rate of δ_FDA/(δ_FDA − ε): for example, if δ_Sponsor is 10% wider than δ_FDA, then √SSR increases by 0.1·δ_FDA/(δ_FDA − ε). So smaller values of δ_FDA − ε lead to steeper lines; in other words, if δ_FDA is set closer to ε, the sponsor's sample size increases more rapidly as δ_Sponsor moves toward δ_FDA. We can observe this from the plot (Figure 5.4). Let SSR_cur be the sample size ratio corresponding to the current δ_Sponsor. For a safe choice, we propose to use the δ_new that corresponds to SSR_cur − ∆, where ∆ can range from 0.2 to 0.3. This will make the gap between δ_FDA and δ_Sponsor smaller. But this is not a universal choice; different thresholds should be chosen on a case-by-case basis. The plot of the sample size difference (SSD) curve is given in Figure 5.5. As can be seen from Figure 5.5, the SSD curve follows the same pattern as the SSR curve.

FIGURE 5.4 Plot of sample size ratio (SSR) curve.


FIGURE 5.5 Plot of sample size difference (SSD) curve.

Relative difference in power (RED)—Since the large-sample approximation is used in deriving the asymptotic form of RED, we first investigate the validity of this approximation when the sample size is small. As can be seen from the four plots in Figure 5.6, when the sample size of a single arm is 15, the approximation is still close to the original; for sample size 30, the two curves look identical. The usual condition for the normal approximation of the t-distribution is that the degrees of freedom exceed 30; in this case, n_B + n_R − 2 > 30, which is not difficult to satisfy in practice. For simplicity, we will use the asymptotic form instead of the original one in the following discussion. The remaining parameters used in the comparison plots are set as follows: ε = −0.5, δ_FDA = 1.0, σ = 1, κ = 1. Since RED is symmetric about ε, we only plot the case ε < 0.

FIGURE 5.6 Plots of RED against λ, and plots of asymptotic RED against RED with n_R = n_B = 15 and 30, respectively.


Next, to eliminate some of the parameters in RED, we rewrite it in terms of the effect sizes (ES) ∆ = −ε/σ and ∆_FDA = δ_FDA/σ, and let N = n_R/(1 + 1/κ) be the sample size factor; then

$$RED \approx \Phi\left[z_\alpha - \sqrt{N}(\Delta_{FDA} + \Delta)\right] - \Phi\left[z_\alpha - \sqrt{N}(\lambda\Delta_{FDA} + \Delta)\right] + \Phi\left[z_\alpha - \sqrt{N}(\Delta_{FDA} - \Delta)\right] - \Phi\left[z_\alpha - \sqrt{N}(\lambda\Delta_{FDA} - \Delta)\right].$$

Figure 5.7 plots RED curves for 6 different ES_FDA values, corresponding to 0%, 5%, 10%, 15%, 20%, and 25% increases from ES = 0.5. A large ES_FDA leads to a steeper curve, i.e., a more drastic increase in RED for smaller values of λ. So for large ES_FDA, narrowing the same portion of δ_FDA yields a greater decrease in the RED value. Therefore, under the current parameter setting, for large ES_FDA we recommend choosing the margin (value of λ) such that RED is in the range (0.20, 0.40); for small ES_FDA, less than 0.20 is preferred. RED converges to Φ[z_α − √N(∆_FDA + ∆)] + Φ[z_α − √N(∆_FDA − ∆)] as λ → ∞; a large sample size factor results in quicker convergence. The plot of RED curves for 6 different N values with ES = 0.5 is given in Figure 5.8.

FIGURE 5.7 Plot of RED curves for 6 different ESFDA values with ES = 0.5 .

FIGURE 5.8 Plot of RED curves for 6 different N values with ES = 0.5 .


FIGURE 5.9 Plots of various δ FDA values as λ increases.

Normally, the sample size used in the sponsor's trial is required to maintain a certain amount of power, such as 0.8; RED_β is used in this scenario. Figure 5.9 plots 5 different values of δ_FDA, gradually increasing from ε. The sample sizes used for each value of λ here maintain 1 − β power for the sponsor's test. RED_β behaves differently from RED: a larger value of δ_FDA leads to slower growth of the difference in power. For a large value of δ_FDA, a λ that leads to RED_β in the range (0.1, 0.2) is recommended; for a small value, less than 0.3 is preferred.

Relative ratio in power/relative risk (RR)—The definition of RR in the last section is also based on multiple steps of large-sample approximation of the noncentral t-distribution and its quantile. We first check the validity of the large-sample approximation when the sample size is small. As can be seen from the four plots in Figure 5.6, even when the sample size is as small as 15 (single arm), the original RR and the asymptotic one look identical. Therefore, we will only use the asymptotic expression in the following decision-making. The remaining parameters used in the comparison plots are set as follows: ε = −0.5, δ_FDA = 1.0, σ = 1, κ = 1.

$$RR \approx \frac{RED}{1 - \Phi\left[z_\alpha - \sqrt{N}(\lambda\Delta_{FDA} + \Delta)\right] - \Phi\left[z_\alpha - \sqrt{N}(\lambda\Delta_{FDA} - \Delta)\right]}$$

Based on the expression of RR, we can see RR is the regularized version of RED.But different from RED, RR has a clear definition in terms of risk, which is the probability of wrongly concluding biosimilarity of a biosimilar drug using sponsor’s margin. So smaller value of RR is preferred. We rewrite RR based on the expression of RED and plot 6 curves according to 6 different values of ESFDA , which are the same as in RED plot. From Figure 5.10, we see

Non-inferiority/Equivalence Margin

147

FIGURE 5.10 Plots RED curves with six different sample sizes nR .

that a larger ESFDA leads to smaller risk. RR converges to RED when λ → ∞; large sample size factor will result in quicker convergence. Figure 5.10 plots RED curves with six different sample sizes nR. As we can see larger sample size leads to lower risk. When plugging in the sample size that retains power at 1− β , the shape of the following five are curves is almost identical to those in the RED β plot (see Figure  5.11). This  is because RR β is approximately proportional β to RED . 1− β The threshold value of RR β for selecting λ can be set as the threshold value of RED β divided by 1− β (see Figure 5.12). Risk and sample size—Based on the expression of RR β , relative risk is an increasing function of λ , hence, δ Sponsor. Furthermore, nRSponsor is a decreasing function of δ Sponsor. Minimizing risk is based on the increase of sample size, which leads to the increase of the cost of clinical trial for sponsor. So when

FIGURE 5.11 Plot of RR against λ .

148

Innovative Statistics in Regulatory Science

FIGURE 5.12 Plot of RR β against λ .

FIGURE 5.13 Plot of risk against sample size for the case where ε = −0.5, δ = 0.7 , σ = 1, κ = 1.

minimizing risk, sample size should be considered at the same time. A compromise should be made between the risk and sample size when shrinking the gap between δ FDA and δ Sponsor. So we put risk curve and sample size curve together on the sample plot (Figure 5.13). The  values of parameters are  = −0.5 , δ FDA = 0.7 , σ = 1, κ = 1, α = 0.05, β = 0.2. As we can see, under the condition that sample size should be chosen while maintaining 0.8 as power, risk is increasing as δ Sponsor moves away from δ FDA (can be seen as λ increases) and at the meantime, sample size needed to maintain 0.8 as power decreases. It is reasonable that large sample size is needed to keep the risk low. For this case, we can choose λ = 1.075 . It leads to about 30% risk, but only requires half the sample size as required for the FDA  recommended margin. Figure  5.14 plots sample size against risk. The relationship is almost linear with a negative slope. Type I error inflation—Figure  5.15 plots the type I error inflation when δ Sponsor moves away from δ FDA. The values of parameters are  = −0.5, δ FDA = 0.7 , σ = 1, κ = 1, α = 0.05, β = 0.2. The  sample size used in the plot is nRSponsor , i.e.,

Non-inferiority/Equivalence Margin

149

FIGURE 5.14 Plot of Type I error inflation against risk.

FIGURE 5.15 Plot of Type I error inflation against λ .

maintains 0.8 power. Only the asymptotic expression of type I error inflation is used here. Type I error rate can also be seen as a risk factor. Minimizing the inflation caused by wider margin is the goal here. In this case, about 50% of the inflation is acceptable, since type I error is 0.05 here. Thus, we can choose λ = 1.15 and δ Sponsor = 0.805. 5.5.4 An Example In this section, we present a synthetic example to demonstrate the strategy proposed in this paper and four criteria in margin selection. Assume after the clinical trial, we observe the following settings µˆ B = 2.55 , µˆ R = 2.75, σˆ = 1.35, nB = 140, nR = 200, δ FDA = 0.25, δ Sponsor = 0.35. Based on the FDA recommended margin, the sample size required to maintain 0.8 power is around 274 for reference group (assuming true difference is zero in sample size calculation),

150

Innovative Statistics in Regulatory Science

which is more than the sample size used in the clinical trial. This adjustment may cost too much for the sponsor to adopt. Based on the sponsor-proposed margin, 200 samples for reference group is more than enough to retain the 0.8 power. Obviously, some compromises are needed here to benefit both parties. Sample size ratio (SSR)—Figure  5.16. Plot of SSR against λ . As it can be seen from Figure  5.16, sample size ratio here is 9, which is too big for sponsor to accommodate. Based on the SSR plot, we can choose RSS to be between 3 and 4. Thus δ Sponsor = 1.15 * 0.25 = 0.2875. Relative difference in power (RED)—Based on the RED plot (Figure 5.17), we see that power difference is not big in this case even for large values of λ . So δ Sponsor does not need to move too much toward δ FDA. Therefore, any value of λ between 1.15 and 1.20 is acceptable. Relative ratio in power/relative risk (RR)—After regularization, the risk is more understandable than the previous two criteria. Consider that it has a clear meaning in terms of the probability of wrongly concluding

FIGURE 5.16 Plot of SSR against λ.

FIGURE 5.17 Plot of RED against λ.

Non-inferiority/Equivalence Margin

151

FIGURE 5.18 Plot of RR against λ .

FIGURE 5.19 Plot of Type I error inflation against λ .

similarity. Figure 5.18 plots RR against λ . As it can be seen, anything larger than 40% may be too risky. So 40% can be our maximum risk, and λ = 1.125, δ Sponsor = 0.281. Type I error inflation—Figure 5.19 plots Type I error inflation against λ . Since the significance level here is 0.05, after the inflation we want the significance level not to go beyond 0.10.1. Thus, the maximum inflation allowed is 0.05 and λ = 1.15, δ Sponsor = 0.2875.

5.6 Concluding Remarks In  this chapter, following similar ideas described in the 2010 FDA  draft guidance on non-inferiority clinical trials, several alternative methods for selection of an appropriate non-inferiority margin are discussed. These methods were derived by taking into consideration: (i) the variability of the observed mean difference between the active control agent (C) and the placebo (P), the test treatment (T) and the active control agent, and the test

152

Innovative Statistics in Regulatory Science

treatment and the placebo (if available) and (ii) the retention rate between the effect of test treatment as compared to the placebo (T − P) and the effect of the active control agent as compared to the placebo (C − P). The proposed methods utilize median of estimates of the retention rates based on the historical data observed from superiority studies of the active control agent as compared to the placebo. Since the width of a non-inferiority margin has an impact on power calculation for sample size, the selection of non-inferiority margin is critical in non-inferiority (active-control) trials. The  FDA  suggests that the 2010 FDA draft guidance on non-inferiority clinical trials should be consulted for selection of an appropriate non-inferiority margin. In addition, communications with medical and/or statistical reviewers are encouraged when there is disagreement on the selected margin. It, however, should be noted that in some cases, power calculation for sample size based on binary response (e.g., incidence rate of adverse events and cure rate) might not be feasible for clinical studies with extremely low incidence rates. These methods described in this article show the non-inferiority in efficacy of the test treatment to the active control agent, but do not  have the evidence of the superiority of the test treatment to the active control agent in safety. Tsou et al. (2007) proposed a non-inferiority test statistic for testing the mixed hypothesis based on treatment difference and relative risk for active control trial. One benefit of the mixed test is that we do not need to choose between difference test and ratio test in advance. In particular, this mixed null hypothesis consists of a margin based on treatment difference and a margin based on relative risk. Tsou et al.’s proposed mixed non-inferiority test not only preserves the type I error rate at desired level but also gives the similar power as that from the difference test or as that from the ratio test. Based on risk assessment using the four proposed criteria, the proposed strategy can not only close up the gap between the sponsor proposed margin and the FDA recommended margin, but also select an appropriate margin by taking clinical judgment, statistical rationale, and regulatory feasibility into consideration. In this article, for simplicity, we focus on continuous endpoint. The proposed strategy with the four criteria can be applied to other data types such as discrete endpoints (e.g., binary response) and time-toevent data. In addition to evaluation of risk of sponsor’s proposed margin, we can also assess the risk of FDA recommended margin assuming that the margin proposed by the sponsor is the true margin.

6 Missing Data

6.1 Introduction Missing values or incomplete data are commonly encountered in clinical trials. One of the primary causes of missing data is the dropout. Reasons for dropout include, but are not limited to: refusal to continue in the study (e.g., withdrawal of informed consent); perceived lack of efficacy; relocation; adverse events; unpleasant study procedures; worsening of disease; unrelated disease; non-compliance with the study; need to use prohibited medication; and death (DeSouza et  al., 2009). Following the idea of Little and Rubin (1987), DeSouza et al. (2009) provided an overview of three types of missingness mechanisms for dropouts. These three types of missingness mechanisms include (i) missing completely at random (MCAR), (ii) missing at random (MAR), and (iii) missing not at random (MNAR). Missing completely at random is referred to the dropout process that is independent of the observed data and the missing data. Missing at random indicates that the dropout process is dependent on the observed data but is independent of the missing data. For missing not at random, the dropout process is dependent on the missing data and possibly the observed data. Depending upon the missingness mechanisms, appropriate missing data analysis strategies can then be considered based on existing analysis methods in the literature. For example, commonly considered methods under MAR include (i) discard incomplete cases and analyze complete-case only, (ii) impute or fill-in missing values and then analyze the filled-in data, (iii) analyze the incomplete data by a method such as likelihood-based method (e.g., maximum likelihood, restricted maximum likelihood, and Bayesian approach), momentbased method (e.g., generalized estimating equations  and their variants), and survival analysis method (e.g., Cox proportional hazards model) that does not require a complete data set. On the other hand, under MNAR, commonly considered methods are derived under pattern mixture models (Little, 1994), which can be divided into two types, parametric (see, e.g., Diggle and Kenward, 1994) and semi-parametric (e.g., Rotnitzky et al., 1998). In practice, the possible causes of missing values in a study can generally be classified into two categories. The  first category includes the reasons 153

154

Innovative Statistics in Regulatory Science

that are not directly related to the study. For example, a patient may be lost to follow-up because he/she moves out of the area. This category of missing values can be considered as missing completely at random. The  second category includes the reasons that are related to the study. For example, a patient may withdraw from the study due to treatment-emergent adverse events. In clinical research, it is not uncommon to have multiple assessments from each subject. Subjects with all observations missing are called unit non-respondents. Because unit non-respondents do not provide any useful information, these subjects are usually excluded from the analysis. On the other hand, the subjects with some, but not all, observations missing are referred to as item non-respondents. In  practice, excluding item nonrespondents from the analysis is considered against the intent-to-treat (ITT) principle and, hence is not  acceptable. In  clinical research, the primary analysis is usually conducted based on ITT population, which includes all randomized subjects with at least post-treatment evaluation. As a result, most item non-respondents may be included in the ITT population. Excluding item non-respondents may seriously decrease power/efficiency of the study. Statistical methods for missing values imputation have been studied by many authors (see, e.g., Kalton and Kasprzyk, 1986; Little and Rubin, 1987; Schafer, 1997). To account for item non-respondents, two methods are commonly considered. The first method is the so-called likelihood-based method. Under a parametric model, the marginal likelihood function for the observed responses is obtained by integrating out the missing responses. The parameter of interest can then be estimated by the maximum likelihood estimator (MLE). Consequently, a corresponding test (e.g., likelihood ratio test) can be constructed. The  merit of this method is that the resulting statistical procedures are usually efficient. The drawback is that the calculation of the marginal likelihood could be difficult. As a result, some special statistical or numerical algorithms are commonly applied for obtaining the MLE. For  example, the expectation–maximization (EM) algorithm is one of the most popular methods for obtaining the MLE when there are missing data. The other method for item non-respondents is imputation. Compared with the likelihood-based method, the method of imputation is relatively simple and easy to apply. The idea of imputation is to treat the imputed values as the observed values and then apply the standard statistical software for obtaining consistent estimators. However, it should be noted that the variability of the estimator obtained by imputation is usually different from the estimator obtained from the complete data. In this case, the formulas designed to estimate the variance of the complete data set cannot be used to estimate the variance of estimator produced by the imputed data. As an alternative, two methods are considered for estimation of its variability. One is based on Taylor’s expansion. This method is referred to as the linearization method. The  merit of the linearization method is that it requires less computation. However, the drawback is that its formula could be very complicated

Missing Data

155

and/or not trackable. The other approach is based on re-sampling method (e.g., bootstrap and jackknife). The drawback of the re-sampling method is that it requires intensive computation. The  merit is that it is very easy to apply. With the help of a fast-speed computer, the re-sampling method has become much more attractive in practice. Note that imputation is very popular in clinical research. The  simple imputation method of last observation carried forward (LOCF) at endpoint is probably the most commonly used imputation method in clinical research. Although the LOCF is simple and easy for implementation in clinical trials, its validity has been challenged by many researchers. As a result, the search for alternative valid statistical methods for missing values imputation has received much attention in the past decade. In practice, the imputation methods in clinical research are more diversified due to the complexity of the study design relative to sample survey. As a result, statistical properties of many commonly used imputation methods in clinical research are still unknown, while most imputation methods used in sample survey are well studied. Hence, the imputation methods in clinical research provide a unique challenge and also an opportunity for the statisticians in the area of clinical research. In Section 6.2, statistical properties and the validity of the commonly used LOCF method is studied. Some commonly considered statistical methods for missing values imputation are described in the subsequent sections of this chapter. Some recent development and a brief concluding remark are given in Sections 6.4 and 6.5.

6.2 Missing Data Imputation 6.2.1 Last Observation Carried Forward Last observation carried forward (LOCF) analysis at endpoint is probably the most commonly used imputation method in clinical research. For illustration purposes, one example is described below. Consider a randomized, parallel-group clinical trial comparing r treatments. Each patient is randomly assigned to one of the treatments. According to the protocol, each patient should undergo s consecutive visits. Let yijk be the observation from the kth subject in the ith treatment group at visit j. The following statistical model is usually considered. yijk = µij + ε ijk , where ε ijk ~ N (0, σ 2 ),

(6.1)

where µij represents the fixed effect of the ith treatment at visit j. If there are no missing values, the primary comparison between treatments will be based on the observations from the last visit ( j = s ) because this reflects

Innovative Statistics in Regulatory Science

156

the treatment difference at the end of the treatment period. However, it is not necessary that every subject complete the study. Suppose that the last evaluable visit is j * < m for the kth subject in the ith treatment group. Then the value of yij* k can be used to impute tisk . After imputation, the data at endpoint are analyzed by the usual ANOVA model. We will refer to the procedure described above as LOCF. Note that the method of LOCF is usually applied according to the ITT principle. The ITT population includes all randomized subjects. The  LOCF is commonly employed in clinical research, although it lacks statistical justification. In what follows, its statistical properties and justification are studied. 6.2.1.1 Bias-variance Trade-off The objective of a clinical study is usually to assess the safety and efficacy of a test treatment under investigation. Statistical inferences on the efficacy parameters are usually obtained. In practice, a sufficiently large sample size is required to obtain a reliable estimate and to achieve a desired power for establishment of the efficacy of the treatment. The reliability of an estimator can be evaluated by bias and by variability. A reliable estimator should have a small or zero bias with small variability. Hence, the estimator based on LOCF and the estimator based on completers are compared in terms of their bias and variability. For  illustration purposes, we focus on only one treatment group with two visits. Assume that there are a total of n = n1 + n2 randomized subjects, where n1 subjects complete the trial, while the remaining n2 subjects only have observations at visit 1. Let yik be the response from the kth subject at the ith visit and µi = E( yik ). The parameter of interest is µ2. The estimator based on completers is given by yc =

1 n1

n1

∑y

i 2k

.

k =1

On the other hand, the estimator based on LOCF can be obtained as yLOCF =

1  n 

n1

∑ i =1

 yi1k  .  i = n1 + 1  n

yi 2 k +



It can be verified that the bias of yc is 0 with variance σ 2 = n1, while of the bias of yLOCF is n2 ( µ1 − µ2 )/n with variance σ 2 /(n1 + n2 ). As noted, although LOCF may introduce some bias, it decreases the variability. In a clinical trial with multiple visits, usually, µ j ≈ µs if j ≈ s. This implies that the LOCF is recommended if the patients withdraw from the study at the end of the study. However, if a patient drops out of the study at the very beginning, the bias of the LOCF could be substantial. As a result, it is recommended that the results from an analysis based on LOCF be interpreted with caution.

Missing Data

157

6.2.1.2 Hypothesis Testing In practice, the LOCF is viewed as a pure imputation method for testing the null hypothesis of H 0 : µ1s = ⋯ = µrs , where µij are as defined in (6.1). Shao and Zhong (2003) provided another look at statistical properties of the LOCF under the above null hypothesis. More specifically, they partitioned the total patient population into s subpopulations according to the time when patients dropped out from the study. Note that in their definition, the patients who complete the study are considered a special case of “drop out” at the end of the study. Then µij represents the population mean of the jth subpopulation under treatment i. Assume that the jth subpopulation under the ith treatment accounts for pi × 100% of the overall population under the ith treatment. They argued that the objective of the intend-to-treat analysis is to test the following hypothesis test H 0 : µ1 = ⋯ = µr ,

(6.2)

where s

µi =

∑p µ . ij

ij

j =1

Based on the above hypothesis, Shao and Zhong (2003) indicated that the LOCF bears the following properties: 1. In the special case of r = 2, the asymptotic ( ni → ∞ ) size of the LOCF under H 0 is ≤ α if and only if  nτ2 n τ2   n τ2 nτ2  lim  2 1 + 1 2  ≤ lim  1 1 + 2 2  , n  n   n  n where s

τ i2 =

∑ p (µ ij

ij

− µ i )2 .

j =1

The  LOCF is robust in the sense that its asymptotic size is α if lim(n1 / n) = n2 / n or τ 12 = τ 22 . Note that, in reality, τ 12 = τ 22 is impractical unless µij = µi for all j. However, n1 = n2 (as a result lim(n1/n) = n2 /n) is very typical, in practice. The above observation indicates in such a situation n1 = n2 that LOCF is still valid.

Innovative Statistics in Regulatory Science

158

2. When r  =  2, τ 12 ≠ τ 22 , and n1 ≠ n2, the LOCF has an asymptotic size smaller than α if (n2 − n1 )τ 12 < (n2 − n1 )τ 22

(6.3)

or larger than α if the inequality sign in (6.3) is reversed. 3. When r ≥ 3, the asymptotic size of the LOCF is generally not α except for some special case (e.g., τ 12 = τ 22 = ⋯ = τ r2 = 0). Because the LOCF usually does not produce a test with asymptotic significance level α when r ≥ 3, Shao and Zhong (2003) proposed the following testing procedure based on the idea of post-stratification. The null hypothesis H 0 should be rejected if T > χ12−α , r −1 , where χ12−α ,r −1 is a chi-square random variable with r − 1 degrees of freedom and

r

T=

∑ i =1

Vˆi =

  1 yi⋅⋅ − Vˆi   

1 ni (ni − 1)

s

2

 yi⋅⋅ / Vˆi   i =1 r  , ˆ 1 / Vi   i =1  r





nij

∑ ∑ (y

ijk

− yi⋅⋅ )2 .

j =1 k =1

Under model (6.1) and the null hypothesis of (6.3), this procedure has the exact type I error α . 6.2.2 Mean/Median Imputation Missing ordinal responses are also commonly encountered in clinical research. For those types of missing data, mean or median imputation is commonly considered. Let xi be the ordinal response from the ith subject, where i = 1,..., n. The  parameter of interest is µ = E( xi ). Assume that xi for i = 1,..., n1 < n are observed and the rest are missing. Median imputation will impute the missing response by the median of the observed response (i.e., xi , i = 1,..., n1). The merit of median imputation is that it can keep the imputed response within the sample space as the original response by appropriately defining the median. The sample mean of the imputed data set will be used as an estimator for the population mean. However, as the parameter of interest is population mean, median imputation may lead to biased estimates. As an alternative, mean imputation will impute the missing value by the sample mean of the observed units, i.e., (1 / n1 )∑ ni =1 1 xi . The  disadvantage of

Missing Data

159

the mean imputation is that the imputed value may be out of the original response sample space. However, it can be shown that the sample mean of the imputed data set is a consistent estimator of population mean. Its variability can be assessed by the jackknife method proposed by Rao and Shao (1987). In  practice, usually, each subject will provide more than one ordinal response. The summation of those ordinal responses (total score) is usually considered as the primary efficacy parameter. The parameter of interest is the population mean of the total score. In  such a situation, mean/median imputation can be carried out for each ordinal response within each treatment group. 6.2.3 Regression Imputation The method of regression imputation is usually considered when covariates are available. Regression imputation assumes a linear model between the response and the covariates. The method of regression imputation has been studied by various authors (see, e.g., Srivastava and Carter, 1986; Shao and Wang, 2002). Let yijk be the response from the kth subject in the ith treatment group at the jth visit. The following regression model is considered: yijk = µi + β i xij + ε ijk ,

(6.4)

where xij is the covariate of the kth subject in the ith treatment group. In practice, the covariates xij could be demographic variables (e.g., age, sex, and race) or patient’s baseline characteristics (e.g., medical history or disease severity). Model (6.4) suggests a regression imputation method. Let µˆ i and βˆi denote the estimators of µi and β i based on complete data set, respectively. If yijk is miss* ing, its predicted mean value yijk = µɵ i + βɵi xij is used to impute. The imputed values are treated as true responses and the usual ANOVA is used to perform the analysis.

6.3 Marginal/Conditional Imputation for Contingency In  an observational study, two-way contingency tables can be used to summarize two-dimensional categorical data. Each cell (category) in a two-way contingency table is defined by a two-dimensional categorical variable (A, B), where A and B take values in {1,..., a} and {1,..., b}, respectively. Sample cell frequencies can be computed based on the observed responses of (A, B) from a sample of units (subjects). Statistical interest

Innovative Statistics in Regulatory Science

160

includes the estimation of cell probabilities and testing hypotheses of goodness of fit or the independence of the two components A and B. In an observational study, there can be more than one stratum. It  is assumed that within a stratum, sampled units independently have the same probability π A to have missing B and observed A, π B to have missing A and observed B, π C to have observed A and B. (The probabilities π A , π B , and π C may be different in different imputation classes.) As units with both A and B missing are considered as unit non-respondent, they are excluded in the analysis. As a result, without loss of generality, it is assumed that π A + π B + π C = 1. For a two-way contingency table, it is very important for an appropriate imputation method to keep imputed values in the appropriate sample space. Whether in calculating the cell probability or in testing hypotheses (e.g., testing independence or goodness of fit), the corresponding statistical procedures are all based on the frequency counts of a contingency table. If the imputed value is out of the sample space, additional categories will be produced which is of no practical meaning. As a result, two hot deck imputation methods are thoroughly studied by Shao and Wang (2002). 6.3.1 Simple Random Sampling Consider a sampled unit with observed A = i and missing B. Two imputation methods were studied by Shao and Wang (2002). The marginal (or unconditional) random hot deck imputation method imputes B by the value of B of a unit randomly selected from all units with observed B. The conditional hot deck imputation method imputes B by the value of B of a unit randomly selected from all units with observed B and A = i . All non-respondents are imputed independently. After imputation, the cell probabilities pij can be estimated using the standard formulas in the analysis of data from a two-way contingency table by treating imputed values as observed data. Denote these estimators by pˆ ijI , where i = 1,..., a and j = 1,..., b . Let

(

)

I I pˆ I = pˆ11 ,..., pˆ1Ib ,..., pˆ aI1 ,..., pˆ ab ′,

and p = (p11 ,..., p1b ,..., pa1 ,..., pab )′, where= pij P= ( A i , B = j ). Intuitively, marginal random hot deck imputation leads to consistent estimators of pi⋅ = P( A = i ) and p⋅ j = P(B = j ), but not pij . Shao and Wang (2002) showed that pˆ I under conditional hot deck imputation are consistent, asymptotically unbiased, and asymptotically normal.

Missing Data

161

Theorem 6.1: Assume that π C  > 0. Under conditional hot deck imputation, n (pɵ I − p) →d N (0, MPM ’+ (1 − π C )P), where P = diag{p} − pp′ and M=

1 ( I axb − π A diag{pB|A } I a ⊗ U b − π Bdiag{pA|B }U a ⊗ I b , πC

pA|B = (p11 / p⋅1 ,..., p1b / p⋅b ,..., pa1 / p⋅1 ,..., pab / p⋅b )′, pB|A = (p11 / p1⋅ ,..., p1b / p1⋅ ,..., pa1 / pa⋅ ,..., pab / pa⋅ )′, where I a denotes an a-dimensional identity matrix, U b denotes a bdimensional square matrix with all components being 1, and ⊗ is the Kronecker product. 6.3.2 Goodness-of-Fit Test A direct application of Theorem 6.1 is to obtain a Wald-type test for goodness of fit. Consider the null hypothesis of the form H 0 : p = p0 , where p0 is a known vector. Under H 0 , XW2 = n(pɵ * −p0* )’Σɵ

* −1

(pɵ * −p0* ) →d χ ab2 −1 ,

where χ v2 denotes a random variable having the chi-square distribution with v degrees of freedom, pˆ * (p0*) is obtained by dropping the last component of pˆ I (p0 ), and Σˆ * is the estimated asymptotic covariance matrix of pˆ * , which ˆ the estimated can be obtained by dropping the last row and column of Σ, I asymptotic covariance matrix of pˆ . Note that the computation of Σˆ * −1 is complicated, Shao and Wang (2002) proposed a simple correction of the standard Pearson chi-square statistic by matching the first-order moment, an approach developed by Rao and Scott (1987). Let b

a

j =1

i =1

∑∑

XG2 = n

(pɵ Iij −pij )2 . pij

It is noted that under conditional imputation the asymptotic expectation of XG2 is given by D=

1 ( ab + π A2 a + π B2b − 2π A a − 2π Bb + 2π Aπ B + 2π Aπ Bδ ) − π C ab + ( ab − 1). πC

Innovative Statistics in Regulatory Science

162

Let λ = D /( ab − 1). Then the asymptotic expectation of XG2 / λ is ab − 1, which is the first-order moment of a standard chi-square variable with ab − 1 degrees of freedom. Thus, XG2 / λ can be used just like a normal chi-square statistic to test the goodness of fit. However, it should be noted that this is just an approximated test procedure that is not asymptotically correct. According to a Shao and Wang’s simulation study, this test performs reasonably well with moderate sample sizes.

6.4 Test for Independence Testing for the independence between A and B can be performed by the following chi-square statistic when there is no missing data b

a

j =1

i =1

∑∑

X2 = n

(pɵ Iij −pɵ i⋅ pɵ ⋅ j )2 →d χ(2a−1)( b−1) . pɵ i⋅ pɵ ⋅ j

It is of interest to know what the asymptotic behavior of the above chi-square statistic is under both marginal and conditional imputation. It is found that, under the null hypothesis that A and B are independent, conditional hot deck imputation yields X 2 →d (π C−1 + 1 − π C )χ(2a −1)( b −1) and marginal hot deck imputation yields 2 X MI →d χ(2a −1)( b −1) .

6.4.1 Results Under Stratified Simple Random Sampling When number of strata is small Stratified samplings are also commonly used in medical study. For example, a large epidemiology study is usually conducted by several large centers. Those centers are usually considered as strata. For those types of study, the number of strata is not very large; however, the sample size within each stratum is very large. As a result, imputation is usually carried out within each stratum. Within the hth stratum, we assume that a simple random sample of size nh is obtained and samples across strata are obtained independently. The total sample size is n = ∑ hH=1 nh , where H is the number of strata and nh is the sample size in stratum h. The parameter of interest is the overall cell probability vector p = ∑ hH=1 wh ph , where wh is the hth stratum weight. The estimator of p based on conditional imputation is given

Missing Data

163

by pɵ I = ∑ hH=1 wh pɵ Ih . Assume that nh = n → p as n → ∞, h = 1,..., H . Then, a direct application of Theorem 6.1 leads to n (pɵ I − p) →d N (0, Σ), where H

Σ=

∑ h =1

wh2 Σh ph

and Σ h is the Σ in Theorem 6.1 but restricted to the hth stratum. 6.4.2 When Number of Strata Is Large In a medical survey, it is also possible to have the number of strata (H) very large, while the sample size within each stratum is small. A typical example is that of a medical survey is conducted by family. Then each family can be considered as a stratum and all the members within the family become the samples from within this stratum. In  such a situation, the method of imputation within stratum is impractical because it is possible that within a stratum, there are no completers. As an alternative, Shao and Wang (2002) proposed the method of imputation across strata under the assumption that (π h , A , π h ,B , π h ,C ), where h = 1,..., H , is constant. More specifically, let nhC,ij denote the number of completers in the hth stratum such that A = i and B = j . For a sampled unit in the kth imputation class with observed B = j and missing A, the missing value is imputed by i according to the conditional probability

∑w n p |B, k = ∑w n ij

C h h ,ij

/ nh

C h h ,⋅ j

/ nh

h

.

h

Similarly, the missing value of a sampled unit in the kth imputation class with observed A = i and missing B can be imputed by j according to the conditional probability

∑w n p |A, k = ∑w n ij

C h h ,ij

/ nh

C h h , i⋅

/ nh

h

.

h

Note that pˆ I can be computed by ignoring imputation classes and treating imputed values as observed data. The  following result establishes the

Innovative Statistics in Regulatory Science

164

asymptotic normality of pˆ I based on the method of conditional hot deck imputation across strata. Theorem 6.2: Let (π h , A , π h , B , π h ,C ) = (π A , π B , π C ) for all h. Assume further that H → ∞ and that there are constants c j , for j = 1,..., 4, such that nh ≤ c1 , c2 ≤ Hwh ≤ c3 , and ph , ij ≥ c4 for all h. Then n (pɵ I −p) →d N (0, Σ), where Σ is the limit of  n  

H

∑ h =1

 wh2 Σh + Σ A + ΣB  .  nh 

6.5 Recent Development 6.5.1 Other Methods for Missing Data As indicated earlier, depending upon the mechanisms of missing data, different approaches may be selected in order to address the medical questions asked. In addition to the methods described in the previous sections of this chapter, the methods that are commonly considered included the mixed effects model for repeated measures (MMRM), weighted and unweighted generalized estimating equations  (GEE), multipleimputation-base GEE (MI-GEE), and Complete-case (CC) analysis of covariance (ANCOVA). For  recent development of missing data imputation, the Journal of Biopharmaceutical Statistics (JBS) has published a special issue on Missing Data  – Prevention and Analysis (JBS, 19, No.  6, 2009, Ed. G. Soon). These recent development on missing data imputation are briefly summarized below. For a time-saturated treatment effect model and an informative dropout scheme that depends on the unobserved outcomes only through the random coefficients, Kong et  al. (2009) proposed a grouping method to correct the biases in the estimation of treatment effect. Their proposed method could improve the current methods (e.g., the LOCF and the MMRM) and give more stable results in the treatment efficacy inferences. Zhang and Paik (2009) proposed a class of unbiased estimating equations using a pair-wise conditional technique to deal with the generalized linear mixed model under benign non-ignorable missingness where specification of the missing model is not needed. The proposed estimator was shown to be consistent and asymptotically normal under certain conditions.

Missing Data

165

Moore and van der Laan (2009) applied targeted maximum likelihood methodology to provide a test that makes use of the covariate data that are commonly collected in randomized trials. The proposed methodology does not require assumptions beyond those of the log-rank test when censoring is uninformative. Two approaches based on this methodology are provided: (i) a substitution-based approach that targets treatment and time-specific survival from which the log-rank parameter is estimated, and (ii) directly targeting the log-rank parameter. Shardell and El-Kamary (2009), on the other hand, used the framework of coarsened data to motivate performing sensitivity analysis in the presence of incomplete data. The proposed method (under pattern-mixture models) allows departures from the assumption of coarsening at random, a generalization of missing at random, and independent censoring. Alosh (2009) studied the missing data problem for count data by investigating the impact of missing data on a transition model, i.e., the generalized autoregressive model of order 1 for longitudinal count data. Rothmann et al. (2009) evaluated the loss to follow-up with respect to the intent-to-treat principle on the most important efficacy endpoints for clinical trials of anticancer biologic products submitted to the United States Food and Drug Administration from August 2005–October 2008 and provided recommendations in light of the results. DeSouza et al. (2009) studied the relative performances of these methods for the analysis of clinical trial data with dropouts via an extensive MonteCarlo study. The results indicate that the MMRM analysis method provides the best solution for minimizing the bias arising from missing longitudinal normal continuous data for small to moderate sample sizes under MAR dropout. For the non-normal data, the MI-GEE may be a good candidate as it outperforms the weighted GEE method. Yan et al. (2009) discussed methods used to handle missing data in medical device clinical trials, focusing on tipping-point analysis as a general approach for the assessment of missing data impact. Wang et al. (2009) studied the performance of a biomarker predicting clinical outcome in large prospective study under the framework of outcome- and auxiliary dependent sub-sampling and proposed a semi-parametric empirical likelihood method to estimate the association between biomarker and clinical outcome. Nie et al. (2009) dealt with censored laboratory data due to assay limits by comparing a marginal approach and variance-component mixed effects model approach. 6.5.2 The Use of Estimand in Missing Data An estimand is a parameter that is to be estimated in a statistical analysis. The term is used to more clearly distinguish the target of inference from the function to obtain this parameter, i.e., the estimator and the specific value obtained from a given data set, i.e., the estimate (Mosteller and Tukey, 1987).

166

Innovative Statistics in Regulatory Science

To distinguish the terms of estimator and estimand, consider the following example. Let X be a normally distributed random variable with mean µ and variance σ 2 . The variance is often estimated by sample variance s2, which is an estimator of σ 2 and σ 2 is called the estimand. An estimand reflects what is to be estimated to address the scientific question of interest posed by a trial. In practice, the choice of an estimand involves population of interest, endpoint of interest, and measure of intervention effect. Measure of intervention effect may take into account the impact of post-randomization events such as dropouts, non-compliance, discontinuation of study, discontinuation of intervention, treatment switching, rescue medication, death and so on. As indicated in NRC (2010), an estimand is closely linked to the purpose or objective of an analysis. It describes what is to be estimated based on the question of interest. In clinical trials, since an estimand is often free of the specific assumptions regarding missing data, it is reasonable to conduct sensitivity analyses using different estimators for the same estimand in order to test the robustness of inference to different assumptions of missing data mechanisms (ICH, 2017). The ICH-E9-R1 addendum on estimands and sensitivity analyses states that estimands are defined by the (i) target population; (ii) the outcome of interest; (iii) the specification of how post-randomization events (e.g., dropout, treatment withdrawal, noncompliance, and rescue medication) reflected in the research question; and (iv) the summary measure for the endpoint. All sensitivity analyses should address the same primary estimand for missing data. Also, all model assumptions that are varied in the sensitivity analysis should be in line with the estimand of interest. Note that sensitivity analyses can also be planned for secondary/exploratory estimands and aligned accordingly.

6.5.3 Statistical Methods Under Incomplete Data Structure 6.5.3.1 Introduction In clinical trials, statistical inference is derived based on probability structure that relies on randomization. In case of missing data, statistical inference should be derived based on valid statistical methods developed under the structure of incomplete data rather than based on methods of missing data imputation. As an example, for illustration purpose, consider statistical methods for two-sequence, three-period crossover designs with incomplete data (Chow and Shao, 1997), which is described below. A crossover design is a modified randomized block subject) design in which each subject receives more than one treatment at different periods. A crossover design allows a within-subject comparison between treatments (that is, each subject serves as his/her own control). The use of cross-over designs for clinical trials has had extensive discussion in the literature (for example, Brown, Jones, and Kenward). In particular, the standard two-sequence twoperiod cross-over design is viewed favorably by the FDA for assessment of

Missing Data

167

bioequivalence between drug products. To assess the difference between two treatments, A and B, we describe the standard two-sequence, two-period crossover design as follows. Subjects are randomly assigned to each of two sequences of treatments. Subjects in sequence 1 receive treatment A at the first dosing period and then cross over to receive treatment B at the second dosing period, while subjects in sequence 2 receive treatments in the order of B and A at two dosing periods. Let ykij denote the response of the ith subject in the kth sequence at the jth period. Then, we can describe ykij by the following model: ykij = µ + p j + qk + tg ( k , j ) + ch( k , j ) + rki + ekij ,

(6.5)

where µ is the overall mean; p j is the fixed effect of the jth period, j = 1, 2, and p1 + p2 = 0 ; qk is the fixed effect of the kth sequence, k = 1, 2, and q1 + q2 = 0, tg ( k , j ) is the fixed treatment effect; g ( k , j ) = A if k = j , g ( k , j ) = B if k ≠ j, and tA + tB = 0 ch( k , j ) is the fixed carry-over effect of treatment A or B; ch(1,1) = ch( 2 ,1) = 0; h ( 1, 2 ) = A, h ( 2, 2 ) = B , and cA + cB = 0; rik is the random effect of the ith subject in the kth sequence; i = 1, … , nk ; and ekij is a random measurement error. The carry-over effect ch in (6.5) is the effect of a drug product that persists after the end of the dosing period. It  differs from the treatment effect tg , which is a direct treatment effect during the period in which the treatment is administered. We can see from (6.5) that there are five independent fixed effect parameters in the model: µ , p1 , q1, tA , and cA . In general, it is not possible to obtain unbiased estimators of these five parameters. If there is a sufficient washout between dosing periods, then we can ignore the carry-over effect, that is, c= c= 0, and we can estimate µ , p1 , q1, and tA using linear combinations of A B the four observed sample means y kj =

1 nk



nk i =1

ykij ,

(6.6)

where j = 1, 2, and k = 1, 2. Since the standard two-sequence, two-period crossover design does not provide unbiased estimates of treatment and carry-over effects when carry-over effects are present, it is recommended that one use a replicate crossover design. The simplest and the most commonly used replicate crossover design is the two-sequence, three-period crossover design (Chow and Liu, 1992a, 2013), which one can obtain by simply adding an additional period to the standard two-sequence two-period crossover design such that subjects in sequence 1 receive three treatments in the order of A, B and B, and subjects in sequence 2 receive three treatments in the order of B, A and A. The data from a two-sequence, three-period crossover design can still be described by model (6.5) except that j = 1, 2, 3 and p1 + p2 + p3 = 0, g ( k , j ) = A when k = j or ( k , j ) = ( 2, 3 ) , g ( k , j ) = B otherwise, ch( k ,1) = 0, h ( 1, 2 ) = h ( 2, 3 ) = A and h ( 2, 2 ) = h ( 1, 3 ) = B. There are six independent fixed effect parameters,

168

Innovative Statistics in Regulatory Science

which be estimated unbiasedly using linear combinations of the observed sample means y kj defined in (6.6), j = 1, 2, 3, k = 1, 2 (Jones and Kenward, 1989; Chow and Liu, 1992a, 2008, 2013). In  clinical trials the data set is often incomplete for various reasons (protocol violations, failure of assay methods, lost to follow-up, etc.). Since there are three periods in a two-sequence three-period cross-over design, subjects are likely to drop out at the third period since they are required to return for tests more often. Also, due to cost or other administrative reasons, sometimes not all of the subjects receive treatments for the third period. One cannot directly apply standard statistical methods for a crossover design to an incomplete or unbalanced data set. A simple and naive way to analyze an incomplete data set from a two-sequence, three-period crossover design is to exclude the data from subjects who do not  receive all three treatments so that one can treat the data set as if it is from a twosequence, three-period crossover design with smaller sample sizes. This, however, may result in a substantial loss in efficiency when the dropout rate is appreciable. 6.5.3.2 Statistical Methods for 2 × 3 Crossover Designs with Incomplete Data The  purpose of this section is to describe a statistical method proposed by Chow and Shao (1997) for analysis of incomplete or unbalanced data from a two-sequence, three-period crossover design. Chow and Shao’s proposed method fully utilizes the data from subjects who completed at least two periods of study to obtain more efficient estimators as compared to the method of excluding data. Chow and Shao assumed that model (6.5) holds, {rki } and {ekij } are mutually independent, and that the ekij s are identically distributed as N (0,σ e2 ). Also, it is assumed that the random effects rki are identically distributed as N (0,σ a2 ). The  normality assumption on the random effects is more restrictive than that on the random measurement errors. Let us first consider the case where there are no missing data, under the normality assumptions on rki and ekij, the maximum likelihood estimator of

β = ( µ , p1 , p2 , q1 , tA , cA ) ′ exists but its exact distribution may not have a known form and, therefore, an exact confidence interval for a component of β based on the maximum likelihood estimator may not exist. The ordinary least squares (LS) estimator of β as

βɵ LS = A y

(6.7)

Missing Data

169

where 1 1  1  3 3 3   2 1 1 − −  3 3  3  1 2 1 − − 3 3 1 3 A=  2 1 1 1  2 4 4   1 1 1  2 −4 −4   1 1 −  0 2 2 

1 3

1 3

2 3



1 3

1 3

2 3

1 − 2

1 − 4

1 2

1 4





0



1 2

1 3  1 −  3 1 −  3  1 −  4  1 4  1  2

(6.8)

and  y1  1 y =   , yk = y  nk  2

nk

∑y , y ki

ki

=  yki1 yki 2 yki 3  ′.

i =1

That is, components of the LS estimator are linear combinations of the sample means y kj. For  example, the LS estimator of the treatment effect δ = tA − tB = 2tA is 1 1 1 1 1 1 δɵ LS = y11 − y12 − y13 − y 21 + y 22 + y 23 . 2 4 4 4 4 4 Under the normality assumptions on rki and ekij, we can obtain an exact confidence interval for any given component of β using (6.7) and (6.8) because any component of βˆ LS can be written in the form c′ y1 ± c′ y 2 for an appropriate three-dimensional vector c. We now consider the case where there are missing data. Without loss of generality, we assume that in the kth sequence, the first mk1 subjects have data for all three periods, while the next mk 2 − mk 1 subjects have data for periods 1 and 2 and the next mk 3 − mk 2 subjects have data for periods 2 and 3, and the last nk − mk 3 subjects have data for periods 1 and 3, where 0 ≤ mk 1 ≤ mk 2 ≤ mk 3 ≤ nk (subjects who have data for only one period are excluded). The sample sizes mk1 may be random and whether or not  ykij is missing may depend on the value of ykij . Thus, it is difficult to make an inference on β since the joint

170

Innovative Statistics in Regulatory Science

distribution of ykij ′s (or the conditional joint distribution of ykij ′s , given mkl ′s ) is unknown. It is very likely, however, that whether or not ykij is missing is independent of the measurement error ekij (that is, mkl ′s are related to the random subject effects rki only). If this is true, then we can make inferences on some components of b based on a transformed data set that is unrelated to the random subject effects (see, for example, Mathew and Sinha, 1992; Weerakkody and Johnson, 1992). More precisely, we may write model (6.5) as y = X β + Zr + e and consider the linear transformation Hy with HZ = 0, where X, Z and H are some suitably defined matrices, y, r and e are the vectors of ykij ′s , rki ′s , and ekij ′s , respectively. Since Hy = HX β + He

(6.9)

the conditional distribution of Hy, given mkl ′s, is still normal if e is normal and independent of mkl ′s. Under model (6.9), we usually cannot estimate all components of β . For  many problems in clinical trials, bioavailability and bioequivalence studies, the primary parameters of interest are often the treatment effect δ = tA − tB and the carry-over effect γ = cA − cB. In the following we consider a special transformation H which is equivalent to taking within-subject differences (between two periods) and produces unbiased estimators of p1, p2, δ and γ . Consider the within-subject differences (obtained by taking differences between the two data points for subjects with one missing value, or taking differences between the data from the first two periods and the last two periods for subjects without missing data): d1i1 = y1i1 − y1i 2 = p1 − p2 + tA − tB − cA + e1i1 − e1i 2 , 1 ≤ i ≤ m12 d1i 2 = y1i 2 − y1i 3 = p2 − p3 + cA − cB + e1i 2 − e1i 3 , 1 ≤ i ≤ m11 or m12 < i ≤ m13 d1i 3 = y1i1 − y1i 3 = p1 = p3 + tA − tB − cB + e1i1 − e1i 3 , m13 < i ≤ n1 d2 i1 = y2 i1 − y2 i 2 = p1 − p2 − tA + tB − cB + e2 i1 − e2 i 2 , 1 ≤ i ≤ m22 d2 i 2 = y2 i 2 − y2 i 3 = p2 − p3 − cA + cB + e2 i 2 − e2 i 3 , 1 ≤ i ≤ m21 or m22 < i ≤ m23 d2 i 3 = y2 i1 − y2 i 3 = p1 − p3 − tA + tB − cA + e2 i1 − − e2 i 3 , m23 < i ≤ n2 . Let d be the vector of these differences (arranged according to the order of the subjects). Then d is independent of r and Hy for some H satisfying HZ = 0 (we can obtain explicitly the matrix H, but it is unnecessary). Assuming that mkl ′s are independent of e, we obtain that

Missing Data

171

(

d ∼ N Wθ , σ e2G

)

(6.10)

(conditional on mkl ′s), where

θ = ( p1 − p2 , p2 − p3 ,δ ,γ ) ′   1    1 0 1 − 2   1m11 ⊗     0 1 0 1     1 1m12 − m11 ⊗ 1 0 1 −     2     1m13 − m12 ⊗ 0 1 0 1         1n1 − m13 ⊗ 1 1 1 1   2    W =    1    1 0 − 1 2   1m21 ⊗     0 1 0 − 1      1m − m ⊗ 1 0 − 1 1   21 22   2     1m − m ⊗ 0 1 0 − 1  23 22       1n2 − m23 ⊗ 1 1 − 1 − 1    2   

and   2 − 1  I m11 ⊗     −1 2    0 G=    0    0 

0 2 I n1 − m11 0 0

0 0  2 − 1 I m21 ⊗    −1 2  0

 0    0     0    2 I n2 − m21 

Innovative Statistics in Regulatory Science

172

where ⊗ is the Kronecker product, 1v is the v-vector of ones, Iv is the identity matrix of order v, and 0 is the matrix of 0’s of an appropriate order. Under model (6.10), the maximum likelihood estimator of θ is the weighted least squares estimator

θɵ H = (W ′G −1W )−1 W ′G −1d. the theory of least squares we immediately obtain the following estimator of the covariance matrix of θˆH

σɵ 2e (W ′G −1W )−1 where 2 d’ G −1 − G −1W (W ’G −1W )−1W ’G −1  d σ e =  n1 + m11 + n2 + m21 − 4

(the sum of squared residuals divided by the degrees of freedom). We can then construct an exact confidence interval for l′θ with a fixed vector l using the fact that l’θɵ H − l’θ

σ e

{l (W G ’



}

−1

W )−1 l

has a t-distribution with n1 + m11 + n2 + m21 − 4 degrees of freedom. 6.5.3.3 A Special Case As we discussed in the previous subsection, missing data are often a more serious problem in the third period of study. We now obtain simplified formulae for θɵ H, σɵ 2e, and (W ′G −1W )−1 in the important special case where there is no missing data in the first two periods: 1 ≤ mk 1 ≤ mk 2 = mk 3 = nk . We assume mk1 ≥ 1; otherwise the design becomes a two-sequence two-period crossover. Let mk = mk 1

dkj =

1 nk

nk

∑ i =1

1 dkij and dɶ kj = mk

mk

∑d

kij

i =1

, j = 1, 2, k = 1, 2

Missing Data

173

Then, it can be verified that   d11 + d 21    1  1 1  − d11 + 21 dɶ 11 + dɶ 12 − d 21 + dɶ 21 + dɶ 22  2 2 2  1 θɵ H =   2  3 d11 + 1 dɶ 11 + 1 dɶ 12 − 3 d 21 − 1 dɶ 21 − 1 dɶ 22  4 2 4 4 2 4     − 1 d11 + 1 dɶ 11 + dɶ 12 + 1 d 21 − 1 dɶ 21 − dɶ 22   2  2 2 2 2

2 σ e =

2   3

mk

∑ ∑( k =1

nk

)

dki2 1 + dki2 2 + dki1dki 2 +

i =1

∑d

1 2 i=m

2 ki 1



k +1

n1 + m1 + n2 + m2 − 4

(6.11)

2 nk 2 mk ɶ dk 1 + 2dɶ k 2  dk1 − 2 6  ,

(

)

(6.12) and the upper-triangle part of the symmetric matrix (W ′G −1W )−1 is given by  21       

(

1 n1

+ n12

)

− 41 1 8

(

1 n1

(

1 n1

+ n12

)

+ n12 + m31 + m32

3 8

)

− 163 3 32

(

(

1 n1

3 n1

(

1 n1

− n12

)

− 41

) )

− n12 − m11 + m12

+ n32 + m11 + m12

1 8

(

− 163 1 8

(

1 n1

(

1 n1

− n12

)

− n12 + m31 − m32

1 n1

1 n1

(

)

+ n12 − m11 − m12

+

1 n2

+

3 m1

+

3 m2

)

)

       

It is interesting to consider the case where mk = nk (that is, no missing datum in all three periods). From (6.11) and the fact that dɶkj = dɵ kj when mk = nk   d11 + d 21     d12 + d 22   1 θɵ H =  . 2  d11 + 1 d12 − d 21 − 1 d 22  2 2     d12 − d 22   Comparing (6.7) and (6.8) with (6.13), we have

θɵ H = θɵ LS = B βɵ LS ,

(6.13)

Innovative Statistics in Regulatory Science

174

where 0  0 B= 0  0 

1 − 1 0 0 0  1 2 0 0 0 . 0 0 0 2 0   0 0 0 0 2 

This raises two interesting points: (i) our method reduces to the least squares method when there is no missing datum; (ii) when no datum is missing, the least squares estimator θɵ LS does not depend on the random subject effects rki. Therefore, its properties do not rely on the normality assumption on rki. 6.5.3.4 An Example A two-sequence, three-period crossover experiment was conducted to compare two treatments of a drug product in women who have a diagnosis of late luteal phase dysphoric disorder often referred to as marked premenstrual syndrome. After recording their daily symptoms for one to three months, patients received placebo for a full menstrual cycle, with dosing initiated around the times of menses. Before active treatments, each patient received a one-month washout period to remove possible responders to placebo from the study. Following the washout period, patients received double-blind treatment for two full menstrual cycles with either treatment A or treatment B. After two cycles, the patients were crossed over in a double-blind fashion to the alternative treatment for two full menstrual cycles. Then, the patients received a third double blind treatment using the second treatment medication, for a final two cycles. The analysis of efficacy was based on the depression score, which is the sum of the responses to 13  symptoms in each patient’s symptom checklist completed at each treatment period. The  depression scores appear in Table 6.1. In this example, there is no missing data in the fiacy two periods; = n1 32 = , n2 36, m1 = 24, and m2 = 18 . The dropout rates are between 75% and 50%, respectively, for the two sequences. We assume model (6.5) and focus on the treatment effect δ = tA − tB and the carry-over effect γ = cA − cB. Using (6.11), we have

δˆH = −3.65 and γˆH = −1.94. An estimate of σ e2 calculated according to (6.12) is σˆ e2 = 60.62. Using the formulae given in the previous subsection, the following 95% confidence intervals can be obtained:

δ : ( −6.13, −1.18 ) and γ : ( −5.17 ,1.29 ) . The p-value for the two-sided t-test of δ = 0 is 0.004, while the p-value for the two-sided t-test of γ = 0 is 0.234. Thus, we have strong reason to believe that in


TABLE 6.1
Depression Scores y_kij

Patient  Sequence  Period 1  Period 2  Period 3    Patient  Sequence  Period 1  Period 2  Period 3
1        1         20        22        26          35       2         26        18        15
2        1         18        38        22          36       2         21        23        23
3        1         49        49        53          37       2         35        26        38
4        1         26        41        35          38       2         13        18        15
5        1         30        23        22          39       2         13        13        26
6        1         14        18        15          40       2         24        16        13
7        1         38        20        50          41       2         23        30        18
8        1         30        33        31          42       2         25        36        29
9        1         20        13        16          43       2         18        29        17
10       1         13        15        16          44       2         33        34        24
11       1         21        25        32          45       2         45        21        35
12       1         27        34        28          46       2         36        16        15
13       1         13        24        17          47       2         36        36        26
14       1         20        20        16          48       2         21        39        34
15       1         34        37        36          49       2         33        25        29
16       1         25        32        27          50       2         28        13        21
17       1         42        37        40          51       2         47        24        *
18       1         18        22        18          52       2         17        16        *
19       1         15        45        31          53       2         42        50        *
20       1         22        40        47          54       2         19        31        *
21       1         37        22        28          55       2         25        26        *
22       1         22        32        52          56       2         24        21        *
23       1         10        23        25          57       2         19        34        *
24       1         32        35        46          58       2         47        35        *
25       1         16        21        *           59       2         40        26        *
26       1         36        54        *           60       2         43        33        *
27       1         39        43        *           61       2         28        47        *
28       1         40        46        *           62       2         34        14        *
29       1         29        41        *           63       2         22        16        *
30       1         17        16        *           64       2         21        23        *
31       1         46        28        *           65       2         29        17        *
32       1         52        27        *           66       2         42        28        *
33       2         26        29        21          67       2         44        59        *
34       2         38        21        27          68       2         15        34        *

* Missing value.

this example, the treatment effect is significant whereas the carry-over effect is not. Since the carry-over effect is not significant, one might wonder what would have occurred if one had used a standard two-period crossover design. For  comparison and illustration, we drop the third period data from the patients who received all three treatments (that is, we treat the data as if they


were from a standard two-period crossover design), and re-perform the analysis using standard methods for a two-sequence, two-period crossover design. The resulting estimate of δ is −2.38; the 95% confidence interval for δ is (−5.30, 0.53), and the p-value for the two-sided t-test of δ = 0 is 0.108. Note that the confidence interval based on the two-period data is about 18% longer than that based on the three-period (incomplete) data. More importantly, based on the two-period data, we can neither reject nor accept δ = 0 with strong evidence, and a statistician might conclude that more experiments are needed to detect whether or not there is a treatment effect. On the other hand, with the additional third-period (incomplete) data, we have concluded that the treatment effect is significant. Also, we emphasize that one cannot assess the carry-over effect using a two-sequence, two-period crossover design, and that γ = 0 is a necessary assumption for the use of such a design.

6.6 Concluding Remarks

One of the most controversial issues in missing data imputation is the potential reduction of power. In practice, the most worrisome impact of missing values on inference in clinical trials is often considered to be biased estimation of the treatment effect; as a result, little attention has been given to the possible loss of power. In clinical trials, however, it is recognized that missing data imputation may inflate variability and consequently decrease power. If there is a significant decrease in power, the intended clinical trial will not be able to achieve the study objectives as planned, which would be a major concern during the regulatory review and approval process. In addition to the issue of the reduction of power, the following is a summary of controversial issues that present challenges to clinical scientists when applying missing data imputation in clinical trials:

1. When the data are missing, the data are missing. How can we make up data for the missing data?
2. The validity of the method of LOCF for missing data imputation in clinical trials.
3. When there is a high percentage of missing values, missing data imputation could be biased and misleading.

Regarding the first question, from a clinical scientist's point of view, if the data are missing, they are missing. Missing data imputation has been criticized as using a legal procedure (i.e., a statistical model or procedure) to illegally make up (i.e., impute) data, because (i) missing is missing and (ii) one cannot draw statistical inference based on imputed (i.e., predicted but not observed) data. Thus, one should


not impute (or make up) data in any way whenever possible; it is always difficult, if not impossible, to verify the assumptions behind the method/model for missing data imputation. However, from a statistician's point of view, we may be able to estimate the missing data based on information surrounding the missing data under certain statistical assumptions/models, and dropping subjects with incomplete data may not be a GSP. For the second question, the method of LOCF for missing values has been widely used in clinical trials for years, although its validity has been challenged by many researchers and regulatory agencies such as the FDA. It is suggested that the method of LOCF not be considered as the primary analysis for missing data imputation. As for the third question, in practice, if the percentage of missing values exceeds a pre-specified number, it is suggested that missing data imputation not be applied. This raises a controversial issue for the selection of the cut-off value for the percentage of missing values that will preserve good statistical properties of the statistical inference derived from the incomplete data set and imputed data.

In summary, missing values or incomplete data are commonly encountered in clinical research, and how to handle incomplete data is always a challenge to statisticians in practice. Imputation, as one of the most popular methodologies to compensate for missing data, is widely used in biopharmaceutical research. Compared to its popularity, however, its theoretical properties are far from well understood. Thus, as indicated by Soon (2009), addressing missing data in clinical trials should focus on both missing data prevention and missing data analysis. Missing data prevention is usually done through the enforcement of GCP during protocol development and the training of clinical operations personnel for data collection. This will lead to reduced bias, increased efficiency, less reliance on modeling assumptions, and less need for sensitivity analysis. However, in practice, missing data cannot be totally avoided, since missing data often occur due to factors beyond the control of patients, investigators, and the clinical project team.
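For reference, the mechanics of LOCF are simple; the controversy concerns its validity rather than its computation. The following is a minimal sketch in Python using pandas (the data frame and the column names are illustrative, not taken from any trial discussed in this chapter):

```python
import pandas as pd

# Long-format visit data with post-dropout missing scores (illustrative).
df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2],
    "visit":   [1, 2, 3, 1, 2, 3],
    "score":   [20.0, 22.0, None, 30.0, None, None],
})

# LOCF: within each subject (rows assumed sorted by visit),
# carry the last observed value forward.
df["score_locf"] = df.groupby("subject")["score"].ffill()
print(df)
```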

7 Multiplicity

7.1 General Concepts

In clinical trials, one of the ultimate goals is to demonstrate that the observed difference for a given study endpoint (e.g., the primary efficacy endpoint) is not only of clinical importance (i.e., a clinically meaningful difference) but also of statistical significance. Statistical significance means that the observed difference is not by chance alone and is reproducible if we were to conduct a similar study under similar experimental conditions. In practice, an observed clinically meaningful difference that has achieved statistical significance is also known as a statistical difference; thus, a statistical difference means that the difference is not by chance alone and is reproducible. In drug research and evaluation, it is of interest to control the chance of a false positive (i.e., making a type I error) and to minimize the chance of a false negative (i.e., making a type II error). As a result, based on a given study endpoint, controlling the overall type I error rate at a pre-specified level of significance while achieving a desired power (i.e., the probability of correctly detecting a clinically meaningful difference if such a difference truly exists) has been a common practice for sample size determination. In practice, the investigator may consider more than one endpoint (say, two study endpoints) as the primary study endpoints. In this case, the goal is to demonstrate that the observed differences for the two study endpoints are clinically meaningful differences with statistical meaning; in other words, the observed differences are not by chance alone and they are reproducible. The level of significance must then be adjusted to control the overall type I error rate at a pre-specified level of significance for the multiple endpoints. This has raised the critical issue of multiplicity in clinical research and development.

In clinical trials, multiplicity usually refers to multiple inferences made in a simultaneous context (Westfall and Bretz, 2010). As a result, alpha adjustment for multiple comparisons is intended to make sure that the simultaneously observed differences are not by chance alone. In clinical trials, commonly seen multiplicity includes comparison of (i) multiple treatments (dose groups), (ii) multiple endpoints, (iii) multiple time


points, (iv) interim analyses, (v) multiple tests of the same hypothesis, (vi) variable/model selection, and (vii) subgroup analyses. In general, if there are k treatments, there are k(k − 1)/2 possible pairwise comparisons. In practice, two types of error rates are commonly considered (Lakshminarayanan, 2010). The first is the comparison-wise error rate (CWE), which is the type I error rate for each comparison, that is, the probability of erroneously rejecting the null hypothesis for the treatments involved in that comparison. The other is the experiment-wise error rate (EWE), or family-wise error rate (FWER), which is the error rate associated with one or more type I errors across all comparisons included in the experiment. Thus, for k comparisons, CWE = α and FWER = 1 − (1 − α)^k. As a result, the FWER could be much larger than the significance level associated with each test if multiple statistical tests are performed using the same data set, and it is therefore of interest to control the FWER. In the past several decades, several procedures for controlling the FWER have been suggested in the literature. These procedures can be classified into either single-step procedures or step-wise (e.g., step-up and step-down) procedures. Note that an alternative approach to multiplicity control is to consider the false discovery rate (FDR) (see, e.g., Benjamini and Hochberg, 1995).

In the next section, regulatory perspectives regarding multiplicity adjustment are discussed, together with some commonly seen controversial issues of multiplicity in clinical trials. Section 7.3 provides a summary of commonly considered statistical methods for multiplicity adjustment for controlling the overall type I error rate. An example concerning a dose-finding study is given in Section 7.4. A brief concluding remark is given in the last section of this chapter.
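Before turning to regulatory perspectives, note that the inflation described by FWER = 1 − (1 − α)^k is easy to tabulate. A minimal sketch in Python (assuming independent tests, as the formula does):

```python
# FWER inflation for k independent tests at per-comparison level alpha (CWE),
# and the Bonferroni-adjusted level that restores FWER <= alpha.
alpha = 0.05
for k in (1, 2, 5, 10):
    fwer = 1 - (1 - alpha) ** k          # FWER = 1 - (1 - alpha)^k
    print(f"k={k:2d}  FWER={fwer:.3f}  Bonferroni alpha/k={alpha / k:.4f}")
# k = 2 gives FWER ~ 0.098; k = 10 gives FWER ~ 0.401
```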

7.2 Regulatory Perspective and Controversial Issues

7.2.1 Regulatory Perspectives

The regulatory position regarding adjustment for multiplicity is not entirely clear. In 1998, the ICH published the E9 guideline, Statistical Principles for Clinical Trials, which contains several comments reflecting concern over the multiplicity problem. The ICH E9 guideline recommends that the analysis of clinical trial data may necessitate an adjustment to the type I error. In addition, ICH E9 suggests that details of any adjustment procedure, or an explanation of why adjustment is not thought to be necessary, should be set out in the analysis plan. The European Agency for the Evaluation of Medicinal Products (EMEA), on the other hand, in its Committee for Proprietary Medicinal Products (CPMP) draft guidance Points to Consider on Multiplicity Issues in Clinical Trials, indicates that multiplicity can have a


substantial influence on the rate of false positive conclusions whenever there is an opportunity to choose the most favorable results from two or more analyses. The EMEA guidance also echoes the ICH recommendation to state the details of the multiple comparisons procedure in the analysis plan.

In 2017, the FDA published a draft guidance, Multiple Endpoints in Clinical Trials: Guidance for Industry, which provides sponsors and review staff with the Agency's thinking about the problems posed by multiple endpoints in the analysis and interpretation of study results, and about how these problems can be managed in clinical trials for human drugs and biological products. The purpose of this guidance is to describe various strategies for grouping and ordering endpoints for analysis, and for applying some well-recognized statistical methods for managing multiplicity within a study, in order to control the chance of making erroneous conclusions about a drug's effects. Basing a conclusion on an analysis in which the risk of false conclusions has not been appropriately controlled can lead to false or misleading representations regarding a drug's effects.

As indicated by Snapinn (2017), despite recent advances in methods for handling multiple endpoints in clinical trials, some challenges remain, including the following: (i) potential confusion surrounding the terminology used to describe the multiple endpoints; (ii) appropriate methods for assessing a treatment's effect on the components of a composite endpoint; (iii) advantages and disadvantages of fixed-sequence versus alpha-splitting methods; (iv) the need to report adjusted p-values; and (v) situations where a single trial may be entitled to multiple sets of alpha.

7.2.2 Controversial Issues

When conducting clinical trials involving multiple comparisons, the following questions are always raised:

1. Why do we need to adjust for multiplicity?
2. When do we need to adjust for multiplicity?
3. How do we adjust for multiplicity?
4. Is the FWER well controlled?

To address the first question, it is suggested that the null/alternative hypotheses be clarified, since the type I error rate and the corresponding power are evaluated under the null hypothesis and the alternative hypothesis, respectively. Regarding the second question, it should be noted that adjustment for multiplicity is intended to ensure that the simultaneously observed differences are not by chance alone. For example, for evaluation of a test treatment under investigation, if regulatory approval is based on a single endpoint, then no


alpha adjustment is necessary. However, if regulatory approval is based on multiple endpoints, then α adjustment is a must in order to make sure that the simultaneously observed differences are not by chance alone and that they are reproducible. Conceptually, it is not correct that alpha needs to be adjusted whenever more than one statistical test (e.g., of a primary hypothesis and a secondary hypothesis) is to be performed. Whether the α should be adjusted depends upon the null hypothesis to be tested (e.g., a single hypothesis with one primary endpoint or a composite hypothesis with multiple endpoints); the interpretations of the test results for a single null hypothesis and a composite null hypothesis are different. For questions (3) and (4), several useful methods for multiplicity adjustment are available in the literature (see, e.g., Hsu, 1996; Chow and Liu, 1998b; Westfall et al., 1999). These methods are single-step methods (e.g., Bonferroni's method), step-down methods (e.g., Holm's method), or step-up methods (e.g., Hochberg's method). In the next section, some commonly employed methods for multiplicity adjustment are briefly described.

As pointed out by Westfall and Bretz (2010), the commonly encountered difficulties surrounding multiplicity in clinical trials include (i) penalizing for doing more or for doing a good job (i.e., performing additional tests), (ii) adjusting α for all possible tests conducted in the trial, and (iii) problems with determining the family of hypotheses to be tested. Penalizing for doing a good job refers to adjustment for multiplicity in dose-finding trials that include more dose groups than needed. Adjusting α for all possible tests conducted in the trial, although it controls the α at the pre-specified level, is overkill, because it is not in the investigator's best interest to show that all of the observed differences simultaneously are not by chance alone. In practice, it can be very tricky to select the appropriate family of hypotheses (e.g., primary and secondary endpoints for efficacy, safety, or both) for multiplicity adjustment in the clinical evaluation of the test treatment under investigation. It should be added that the most worrisome impact of multiplicity on inference in clinical trials is not only the control of the FWER but also the preservation of the power for correctly detecting a clinically meaningful treatment effect. One of the most frustrating issues in multiplicity is having adequate control of the FWER while failing to achieve the desired power due to multiplicity.

7.3 Statistical Methods for Multiplicity Adjustment

As indicated earlier, commonly considered procedures or methods for controlling the FWER at some pre-specified level of significance can be classified into two categories: (i) single-step methods (e.g., Bonferroni correction) and


(ii) step-wise procedures, which include step-down methods (e.g., Holm's method) and step-up methods (e.g., Hochberg's method). In practice, commonly used procedures for controlling the FWER in clinical trials are the classic multiple comparison procedures (MCP), which include the Bonferroni, Tukey, and Dunnett procedures. These procedures, among others, are briefly described below.

7.3.1 Bonferroni Method

Among the procedures mentioned above, the method of Bonferroni is probably the most commonly considered procedure for addressing multiplicity in clinical trials, though it is somewhat conservative. Suppose there are k treatments and we are interested in testing the following hypothesis:

H0: µ1 = µ2 = ⋯ = µk,

where µi, i = 1, …, k, is the mean for the ith treatment. Let y_ij, j = 1, …, ni, i = 1, …, k, be the jth observation obtained under the ith treatment. Also, let ȳi and

s² = Σ_{i=1}^{k} Σ_{j=1}^{ni} (y_ij − ȳi)² / Σ_{i=1}^{k} (ni − 1)

be the least squares mean for the ith treatment and an estimate of the variance obtained from an analysis of variance (ANOVA), respectively, where ni is the sample size of the ith treatment. We then reject the null hypothesis in favor of the alternative hypothesis that the treatment means µi and µj are different, for every i ≠ j, if

|ȳi − ȳj| > t_{α/2}(v) [s²(ni⁻¹ + nj⁻¹)]^{1/2},    (7.1)
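As a quick illustration of the Bonferroni rule for pairwise comparisons, the following sketch runs all k(k − 1)/2 two-sample t-tests on simulated data and applies the α/k threshold (the data, group labels, and effect size are illustrative assumptions, not from the text):

```python
# Bonferroni-adjusted pairwise comparisons among k groups (simulated data).
from itertools import combinations
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
groups = {"A": rng.normal(0.0, 1, 30),
          "B": rng.normal(0.0, 1, 30),
          "C": rng.normal(0.8, 1, 30)}

pairs = list(combinations(groups, 2))    # k(k - 1)/2 = 3 comparisons
alpha_adj = 0.05 / len(pairs)            # Bonferroni: alpha / k
for g1, g2 in pairs:
    t, p = ttest_ind(groups[g1], groups[g2])
    print(f"{g1} vs {g2}: p={p:.4f}  reject={p < alpha_adj}")
```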


7.3.2 Tukey's Multiple Range Testing Procedure

Similar to (7.1), we can declare that the treatment means µi and µj are different, for every i ≠ j, if

|ȳi − ȳj| > q(α, k, v) [s²(ni⁻¹ + nj⁻¹)/2]^{1/2},    (7.2)

where q(α, k, v) is the studentized range statistic. This method is known as Tukey's multiple range test procedure. It should be noted that simultaneous confidence intervals on all pairs of mean differences µi − µj can be obtained based on the following:

P( µi − µj ∈ ȳi − ȳj ± q [s²(ni⁻¹ + nj⁻¹)/2]^{1/2} for all i ≠ j ) = 1 − α.    (7.3)

Note that tables of critical values for the studentized range statistic are widely available. As an alternative to Tukey's multiple range testing procedure, the following Duncan's multiple-range testing procedure is often considered. Duncan's multiple testing procedure leads us to conclude that the largest and smallest of the treatment means are significantly different if

|ȳi − ȳj| > q(αp, p, v) (MSE/n)^{1/2},    (7.4)

where p is the number of averages and q(αp, p, v) is the critical value from the studentized range statistic with an FWER of αp.

7.3.3 Dunnett's Test

When comparing several treatments with a control, Dunnett's test is probably the most popular method. Suppose there are k − 1 treatments and one control. Denote by µi, i = 1, …, k − 1, and µk the mean of the ith treatment and the control, respectively. Further, suppose that the treatment groups can be described by the following balanced one-way analysis of variance model:

y_ij = µi + ε_ij, i = 1, …, k; j = 1, …, n.

It is assumed that the ε_ij are normally distributed with mean 0 and unknown variance σ². Under this assumption, µi and σ² can be estimated. Consequently, one-sided and two-sided simultaneous confidence intervals for µi − µk can be obtained. For the one-sided simultaneous confidence interval of µi − µk, i = 1, …, k − 1, the lower bound is given by


µ̂i − µ̂k − T σ̂ (2/n)^{1/2}, for i = 1, …, k − 1,    (7.5)

where T = T_{k−1, v, {ρij}}(α) satisfies

∫₀^∞ ∫_{−∞}^{∞} [Φ(z + √2 T u)]^{k−1} dΦ(z) γ(u) du = 1 − α,

where Φ is the distribution function of the standard normal. It should be noted that T = T_{k−1, v, {ρij}}(α) is the critical value of the distribution of max Ti, where T1, T2, …, T_{k−1} follow a multivariate t distribution with v degrees of freedom and correlation matrix {ρij}. For the two-sided simultaneous confidence interval for µi − µk, i = 1, …, k − 1, the limits are given by

µ̂i − µ̂k ± h σ̂ (2/n)^{1/2}, for i = 1, …, k − 1,    (7.6)

where h satisfies

∫₀^∞ ∫_{−∞}^{∞} [Φ(z + √2 |h| t) − Φ(z − √2 |h| t)]^{k−1} dΦ(z) γ(t) dt = 1 − α.

Similarly, h is the critical value of the distribution of max |Ti|, where T1, T2, …, T_{k−1} follow a multivariate t distribution with v degrees of freedom and correlation matrix {ρij}.

7.3.4 Closed Testing Procedure

In clinical trials involving multiple comparisons, as an alternative, the use of the closed testing procedure has become very popular since it was introduced by Marcus et al. (1976). The closed testing procedure can be described as follows. First, form all intersections of the elementary hypotheses Hi; then test all intersections using non-multiplicity-adjusted tests. An elementary hypothesis Hi is declared significant if all intersections that include it as a component are significant. More specifically, suppose there is a family of hypotheses, denoted by {Hi, 1 ≤ i ≤ k}. For P ⊆ {1, 2, …, k}, let H_P = ∩_{j∈P} H_j. Assuming that an α-level test for each hypothesis H_P is available, H_P is rejected if and only if every H_Q with P ⊆ Q is rejected by its α-level test. Marcus et al. (1976) showed that this testing procedure controls the FWER. In practice, the closed testing procedure is commonly employed in dose-finding studies with several doses of a test treatment under investigation. As an example, consider the following family of hypotheses:

{Hi: µi − µk ≤ 0, 1 ≤ i ≤ k − 1}


against one-sided alternatives, where the kth treatment group is the placebo group. Assume that the sample sizes in the treatment groups are equal (say, n) and that the sample size for the placebo group is nk. Let

ρ = n/(n + nk).

Then, the closed testing procedure can be carried out by the following steps:

Step 1: Calculate Ti, the t statistics, for 1 ≤ i ≤ k − 1. Let the ordered t statistics be T(1) ≤ T(2) ≤ ⋯ ≤ T(k−1), with their corresponding hypotheses denoted by H(1), H(2), …, H(k−1).

Step 2: Reject H(j) if T(i) > T_{i, v, ρ}(α) for i = k − 1, k − 2, …, j. If we fail to reject H(j), then conclude that H(j−1), …, H(1) are also to be retained.

Closed testing procedures have been shown to be more powerful than the classic multiple comparison procedures, such as the classic Bonferroni, Tukey, and Dunnett procedures. Note that the above step-down testing procedure is more powerful than Dunnett's testing procedure given in (7.5). There is considerable flexibility in the choice of tests for the intersection hypotheses, leading to the wide variety of procedures that fall within the closed testing umbrella. In practice, a closed testing procedure generally starts with the global null hypothesis and proceeds sequentially towards intersection hypotheses involving fewer endpoints; however, it can also begin with the individual hypotheses and work towards the global null hypothesis.

7.3.5 Other Tests

In addition to the testing procedures described above, there are several tests (p-value based stepwise test procedures) that are also commonly considered in clinical trials involving multiple comparisons. These methods include, but are not limited to, Simes' method (see, e.g., Hochberg and Tamhane, 1987; Hsu, 1996; Sarkar and Chang, 1997), Holm's method (Holm, 1979), Hochberg's method (Hochberg, 1988), Hommel's method (Hommel, 1988), and Rom's method (Rom, 1990), which are briefly summarized below.

Simes' method is designed to reject the global null hypothesis if p(i) ≤ iα/m for at least one i = 1, …, m. The adjusted p-value for the global hypothesis is given by p = m·min{p(1)/1, …, p(m)/m}. Note that Simes' method improves Bonferroni's method in controlling the global type I error rate under independence (Sarkar and Chang, 1997). One of the limitations of Simes' method is that it cannot be used to draw inferences on individual hypotheses, since it only tests the global hypothesis.

Holm's method is a sequentially rejective procedure, which sequentially contrasts ordered unadjusted p-values with a set of critical values and rejects a null hypothesis if its p-value and each of the smaller p-values are less than their corresponding critical values. Holm's method not only improves the sensitivity of Bonferroni's method to detect real differences, but also increases power while providing strong control of the FWER.

Hochberg's method applies exactly the same set of critical values as Holm's method but performs the test procedure in a step-up fashion. Hochberg's method is able to identify more significant endpoints and hence is more powerful than Holm's method. In practice, Hochberg's method is somewhat conservative when the individual p-values are independent; in the case where the endpoints are negatively correlated, FWER control is not guaranteed for all types of dependence among the p-values (i.e., the size could potentially exceed α).

Following the principle of the closed testing procedure and Simes' test, Hommel's method is a powerful sequentially rejective method that allows for inferences on individual endpoints. It is shown to be marginally more powerful than Hochberg's method. However, the Hommel procedure also suffers from the disadvantage of not preserving the FWER in general; it does protect the FWER when the individual tests are independent or positively dependent (Sarkar and Chang, 1997).

Rom's method is a step-up procedure that is slightly more powerful than Hochberg's method. Rom's procedure controls the FWER at the α level under independence of the p-values. More details can be found in Rom (1990).
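Holm's and Hochberg's procedures are easy to implement directly. Below is a minimal sketch in Python (library routines such as statsmodels' multipletests provide the same adjustments; the p-values here are illustrative):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Holm step-down: compare the ith smallest p-value with alpha/(m - i)."""
    order = np.argsort(pvals)
    m = len(pvals)
    reject = np.zeros(m, dtype=bool)
    for step, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break                          # stop at the first non-rejection
    return reject

def hochberg(pvals, alpha=0.05):
    """Hochberg step-up: same critical values, tested from the largest p."""
    order = np.argsort(pvals)[::-1]        # largest p first
    m = len(pvals)
    reject = np.zeros(m, dtype=bool)
    for step, idx in enumerate(order):
        if pvals[idx] <= alpha / (step + 1):
            # reject this hypothesis and all hypotheses with smaller p-values
            reject[np.asarray(pvals) <= pvals[idx]] = True
            break
    return reject

p = [0.010, 0.040, 0.029, 0.045]
print(holm(p))      # rejects only the smallest p-value
print(hochberg(p))  # rejects all four, since the largest p <= alpha
```

The example shows the step-up procedure rejecting strictly more hypotheses than the step-down one on the same p-values, which is the sense in which Hochberg's method is more powerful than Holm's.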

7.4 Gate-Keeping Procedures

7.4.1 Multiple Endpoints

Consider a dose-response study comparing m doses of a test drug to a placebo or an active control agent. Suppose that the efficacy of the test drug will be assessed using a primary endpoint and s − 1 ordered secondary endpoints, and that the sponsor is interested in testing the null hypotheses of no treatment effect with respect to each endpoint against one-sided alternatives. Thus, there are a total of ms null hypotheses, which can be grouped into s families to reflect the ordering of the endpoints. Now, let y_ijk denote the measurement of the ith endpoint collected in the jth dose group from the kth patient, where k = 1, …, n, i = 1, …, s, and j = 0 (control), 1, …, m. The mean of y_ijk is denoted by µ_ij. Also, let t_ij be the t-statistic for comparing the jth dose group to the control with respect to the ith endpoint. It is assumed that the t-statistics follow a multivariate t distribution and, furthermore, that the y_ijk are normally distributed. Denote by ℑi the family of null hypotheses for the ith endpoint, i = 1, …, s, i.e., ℑi = {H_i1: µ_i0 = µ_i1, …, H_im: µ_i0 = µ_im}. The s families of null hypotheses are tested in a sequential manner.


Family ℑ1 (the primary endpoint) is examined first, and testing continues to Family ℑ2 (the most important secondary endpoint) only if at least one null hypothesis has been rejected in the first family. This approach is consistent with the regulatory view that findings with respect to secondary outcome variables are meaningful only when the primary analysis is significant. The same principle can be applied to the analysis of the ordered secondary endpoints. Dmitrienko et al. (2006) suggest focusing on testing procedures that meet the following condition:

Condition A: Null hypotheses in ℑ_{i+1} can be tested only after at least one null hypothesis was rejected in ℑi, i = 1, …, s − 1.

Secondly, it is important to ensure that the outcome of the multiple tests early in the sequence does not depend on the subsequent analyses:

Condition B: Rejection or acceptance of null hypotheses in ℑi does not depend on the test statistics associated with ℑ_{i+1}, …, ℑs, i = 1, …, s − 1.

Finally, one ought to account for the hierarchical structure of this multiple testing problem and examine secondary dose-control contrasts only if the corresponding primary dose-control contrast was found significant:

Condition C: The null hypothesis H_ij, i ≥ 2, can be rejected only if H_1j was rejected, j = 1, …, m.

It is important to point out that the logical restrictions for the secondary analyses in Condition C are driven only by the primary endpoint. This requirement helps clinical researchers streamline drug labeling and improves the power of the secondary tests at the doses for which the primary endpoint was significant. Within each of the s families, multiple comparisons can be carried out using Dunnett's test as follows: reject H_ij if the corresponding t-statistic t_ij is greater than a critical value c for which the null probability of max(t_i1, …, t_im) > c is α. Note that Dunnett's test protects the type I error rate only within each family. Dmitrienko et al. (2006) extended Dunnett's test to control the family-wise error rate for all ms null hypotheses.
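The Dunnett-based extension is beyond a short sketch, but the serial gate-keeping logic of Condition A is easy to illustrate. The sketch below uses a Bonferroni test within each family, which is a simplification of the procedures discussed here (the p-values are illustrative, and Condition C is not enforced):

```python
# Serial gate-keeping across ordered families F1, F2, ...: family i+1 is
# tested only if at least one hypothesis in family i was rejected
# (Condition A). Bonferroni is used within each family for illustration.
def serial_gatekeeper(families, alpha=0.05):
    results = []
    for pvals in families:
        rejected = [p <= alpha / len(pvals) for p in pvals]
        results.append(rejected)
        if not any(rejected):
            # gate is closed: the remaining families are not tested
            results.extend([[False] * len(f) for f in families[len(results):]])
            break
    return results

primary = [0.004, 0.030]     # two dose-placebo tests, primary endpoint
secondary = [0.015, 0.060]   # same contrasts, secondary endpoint
print(serial_gatekeeper([primary, secondary]))
# [[True, False], [True, False]]: the secondary family is tested because
# the primary family produced at least one rejection
```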

7.4.2 Gate-Keeping Testing Procedures

Dmitrienko et al. (2006) consider the following example to illustrate the process of constructing a gate-keeping testing procedure for dose-response studies. For simplicity, they focus on the case where m = 2 and s = 2, and it is assumed that the treatment groups are balanced with n patients per group. The four (i.e., ms = 4) null hypotheses are grouped into two (s = 2) families, i.e., ℑ1 = {H11, H12} and ℑ2 = {H21, H22}. Note that ℑ1 consists of the hypotheses for comparing the low and high doses to placebo with respect to the primary endpoint, while ℑ2 contains the hypotheses for comparing the low and high doses to placebo with respect to the secondary endpoint.


Now let t11, t12, t21, and t22 denote the t-statistics for testing H11, H12, H21, and H22. We can then apply the principle of closed testing to construct gate-keeping procedures. According to this principle, one first considers all possible non-empty intersections of the four null hypotheses (this family of 15 intersection hypotheses is known as the closed family) and then sets up a test for each intersection hypothesis. Each of these tests controls the type I error rate at the individual hypothesis level, and the tests are chosen to meet Conditions A, B, and C described above. To define tests for each of the 15 intersection hypotheses in the closed family, let H denote an arbitrary intersection hypothesis and consider the following rules:

1. If H includes both primary hypotheses, the decision rule for H should not include t21 or t22. This is done to ensure that a secondary hypothesis cannot be rejected unless at least one primary hypothesis was rejected (Condition A).

2. The same critical value should be used for testing the two primary hypotheses. This way, the rejection of the primary hypotheses is not affected by the secondary test statistics (Condition B).

3. If H includes a primary hypothesis and a matching secondary hypothesis (e.g., H = H11 ∩ H21), the decision rule for H should not depend on the test statistic for the secondary hypothesis. This guarantees that H21 cannot be rejected unless H11 was rejected (Condition C).

Note that similar rules used in gate-keeping procedures based on the Bonferroni test can be found in Dmitrienko et al. (2003) and Chen et al. (2005). To implement these rules, it is convenient to utilize the decision matrix approach (Dmitrienko et al., 2003). For the sake of compact notation, we will adopt the following binary representation of the intersection hypotheses: if an intersection hypothesis equals H11, it will be denoted by H*_1000; similarly, H*_1100 = H11 ∩ H12, H*_1010 = H11 ∩ H21, etc. Table 7.1 (reproduced from Table I of Dmitrienko et al., 2006) displays the resulting decision matrix, which specifies a rejection rule for each intersection hypothesis in the closed family. The three constants (c1, c2, and c3) in Table 7.2 (reproduced from Table II of Dmitrienko et al., 2006) represent critical values for the intersection hypothesis tests. The values are chosen in such a way that, under the global null hypothesis of no treatment effect, the probability of rejecting each individual intersection hypothesis is α. Note that the constants are computed in a sequential manner (c1 is computed first, followed by c2, etc.), and thus c1 is the one-sided 100 × (1 − α)th percentile of the Dunnett distribution with 2 and 3(n − 1) degrees of freedom. The other two critical values (c2 and c3) depend on the correlation between the primary and secondary endpoints, which is estimated from the data. The calculation of these critical values is illustrated later.


TABLE 7.1
Decision Matrix for a Clinical Trial with Two Dose-Placebo Comparisons and Two Endpoints (m = 2, s = 2)

Intersection Hypothesis    Rejection Rule
H*_1111                    t11 > c1 or t12 > c1
H*_1110                    t11 > c1 or t12 > c1
H*_1101                    t11 > c1 or t12 > c1
H*_1100                    t11 > c1 or t12 > c1
H*_1011                    t11 > c1 or t22 > c2
H*_1010                    t11 > c1
H*_1001                    t11 > c1 or t22 > c2
H*_1000                    t11 > c1
H*_0111                    t12 > c1 or t21 > c2
H*_0110                    t12 > c1 or t21 > c2
H*_0101                    t12 > c1
H*_0100                    t12 > c1
H*_0011                    t21 > c1 or t22 > c1
H*_0010                    t21 > c3
H*_0001                    t22 > c3

Note: The test associated with this matrix rejects a null hypothesis if all intersection hypotheses containing it are rejected. For example, the test rejects H11 if H*_1111, H*_1110, H*_1101, H*_1100, H*_1011, H*_1010, H*_1001, and H*_1000 are rejected.
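The rejection logic described in the note to Table 7.1 (reject an elementary hypothesis only when every intersection containing it is rejected) can be expressed generically. Below is a minimal sketch in Python using itertools, with a Bonferroni test for each intersection in place of the Dunnett-based rules of Table 7.1; with Bonferroni intersection tests, closed testing reduces to Holm's procedure:

```python
from itertools import combinations

def closed_test(pvals, alpha=0.05):
    """Closed testing principle: reject elementary H_i iff every
    intersection hypothesis containing H_i is rejected. Each intersection
    is tested with Bonferroni here (Table 7.1 uses Dunnett-based rules)."""
    m = len(pvals)
    reject = []
    for i in range(m):
        ok = True
        for r in range(1, m + 1):
            for subset in combinations(range(m), r):
                if i in subset:
                    # Bonferroni test of the intersection hypothesis
                    if min(pvals[j] for j in subset) > alpha / len(subset):
                        ok = False
        reject.append(ok)
    return reject

print(closed_test([0.005, 0.020, 0.400, 0.030]))
# [True, False, False, False], matching Holm on the same p-values
```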

TABLE 7.2
Critical Values for Individual Intersection Hypotheses in a Clinical Trial with Two Dose-Placebo Comparisons and Two Endpoints (m = 2, s = 2)

Correlation Between the Endpoints (ρ)    c1       c2       c3
0.01                                     2.249    2.309    1.988
0.1                                      2.249    2.307    1.988
0.5                                      2.249    2.291    1.988
0.9                                      2.249    2.260    1.988
0.99                                     2.249    2.250    1.988

Source: Dmitrienko, A. et al., Pharm. Stat., 5, 19–28, 2006.
Note: The correlation between the two endpoints (ρ) ranges between 0.01 and 0.99, the overall one-sided type I error probability is 0.025, and the sample size per treatment group is 30 patients.


The decision matrix in Table 7.1 defines a multiple testing procedure that rejects a null hypothesis if all intersection hypotheses containing the selected null hypothesis were rejected. For example, H12 will be rejected if H*_1111, H*_1110, H*_1101, H*_1100, H*_0111, H*_0110, H*_0101, and H*_0100 were all rejected. By the closed testing principle, the resulting procedure protects the family-wise error rate in the strong sense at the α level. It is easy to verify that the proposed procedure possesses the following properties and thus meets the criteria that define a gate-keeping strategy based on the Dunnett test:

1. The secondary hypotheses H21 and H22 cannot be rejected when the primary test statistics t11 and t12 are non-significant (Condition A).

2. The outcome of the primary analyses (based on H11 and H12) does not depend on the significance of the secondary dose-placebo comparisons (Condition B). In fact, the procedure rejects H11 if and only if t11 > c1; likewise, H12 is rejected if and only if t12 > c1. Since c1 is a critical value of the Dunnett test, the primary dose-placebo comparisons are carried out using the regular Dunnett test.

3. The null hypothesis H21 cannot be rejected unless H11 was rejected, and thus the procedure compares the low dose to placebo for the secondary endpoint only if the corresponding primary comparison was significant. The same is true for the other secondary dose-placebo comparison (Condition C).

Under the global null hypothesis, the four statistics follow a central multivariate t-distribution. The three critical values in Table 7.2 can be found using the algorithm for computing multivariate t-probabilities proposed by Genz and Bretz (2002). Table 7.2 shows the values of c1, c2, and c3 for selected values of ρ (the correlation between the two endpoints). It is assumed in Table 7.2 that the overall one-sided type I error rate is 0.025 and the sample size per group is 30 patients.

The information presented in Tables 7.1 and 7.2 helps evaluate the effect of the described gate-keeping approach on the secondary tests. Suppose, for example, that the two dose-placebo comparisons for the primary endpoint are significant after Dunnett's adjustment for multiplicity (t11 > 2.249 and t12 > 2.249). A close examination of the decision matrix in Table 7.1 reveals that the null hypotheses in the second family will be rejected if their t-statistics are greater than 2.249; in other words, the resulting multiplicity adjustment ignores the multiple tests in the primary family. However, if the low dose does not separate from placebo for the primary endpoint (t11 ≤ 2.249 and t12 > 2.249), it will be more difficult to find significant outcomes in the secondary analyses. First of all, the low dose versus placebo comparison is automatically declared non-significant. Secondly, the high dose will be significantly different from placebo for the secondary endpoint only if t22 > c2. Note that c2, which lies between 2.250 and 2.309 when 0.01 ≤ ρ ≤ 0.99, is greater than Dunnett's critical value c1 = 2.249 (in general, c2 > c1 > c3). The larger critical value is the price of sequential testing. Note, however, that the penalty becomes smaller with increasing correlation.
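The value c1 = 2.249 in Table 7.2 can also be checked by simulation rather than by the algorithm of Genz and Bretz. Below is a Monte Carlo sketch (approximate by construction; it assumes two dose-placebo contrasts with a shared placebo arm, giving correlation 1/2, and 3(n − 1) = 87 degrees of freedom for n = 30):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, reps = 30, 2, 200_000
df = 3 * (n - 1)                      # error degrees of freedom

# The two dose-placebo t-statistics share the placebo arm: correlation 1/2.
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
z = rng.multivariate_normal(np.zeros(k), cov, size=reps)
s = np.sqrt(rng.chisquare(df, size=reps) / df)   # common variance estimate
t_max = (z / s[:, None]).max(axis=1)             # max of correlated t's

c1 = np.quantile(t_max, 0.975)        # one-sided FWER of 0.025
print(round(c1, 3))                   # approximately 2.249, as in Table 7.2
```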

7.5 Concluding Remarks

When conducting a clinical trial involving one or more doses (e.g., a dose-finding study) or one or more study endpoints (e.g., efficacy versus safety endpoints), the first dilemma at the planning stage is the establishment of a family of hypotheses a priori in the study protocol for achieving the study objective of the intended clinical trial. Based on the study design and the various underlying hypotheses, clinical strategies are usually explored for testing various hypotheses for achieving the study objectives. One such set of hypotheses (e.g., drug versus placebo, positive control agent versus placebo, primary endpoint versus secondary endpoint) would help to conclude whether both the drug and the positive control agent are superior to placebo, or whether the drug is efficacious in terms of the primary endpoint, the secondary endpoint, or both. Under the family of hypotheses, valid multiple comparison procedures for controlling the overall type I error rate should be proposed in the study protocol.

The other dilemma at the planning stage of the clinical trial is sample size calculation. A typical procedure is to obtain the required sample size under either an analysis of variance (ANOVA) method or an analysis of covariance (ANCOVA) model based on an overall F-test. This approach may not be appropriate if the primary objective involves multiple comparisons. In practice, when multiple comparisons are involved, the method of Bonferroni is usually applied to adjust the type I error rate. Again, Bonferroni's method is conservative and may require more patients than are actually needed. Alternatively, Hsu (1996) suggested a confidence interval approach as follows: given a confidence level of 1 − α, perform sample size calculations so that, with a pre-specified power 1 − β (< 1 − α), the confidence intervals will cover the true parameter value and be sufficiently narrow (Hsu, 1996).

As indicated, multiple comparisons are commonly encountered in clinical trials. Multiple comparisons may involve comparisons of multiple treatments (dose groups), multiple endpoints, multiple time points, interim analyses, multiple tests of the same hypothesis, variable/model selection, and subgroup analyses in a study. In this case, statistical methods for controlling error rates such as the CWE, FWER, or FDR are necessary for multiple comparisons. The closed testing procedure is useful for addressing the multiplicity issue


in dose-finding studies. In the case where a large number of tests are involved, such as tests for safety data, it is suggested that the FDR-based method for controlling the overall type I error rate be considered. From a statistical reviewer's point of view, Fritsch (2012) indicated that when dealing with the issue of multiplicity, one should carefully select the most appropriate hypotheses, i.e., choose the "need to have" endpoints rather than piling on "nice to have" endpoints, put the endpoints in the right families, carefully consider which hypotheses represent distinct claims, and ensure that all claims are covered under the multiplicity control structure. In addition, one should ensure a good match between the study objectives and the multiplicity control methods by utilizing natural hierarchies (while avoiding arbitrary ones) and by taking the time to understand complex structures to ensure overall control of multiplicity.

8 Sample Size

8.1 Introduction

In clinical research and development, clinical trials are often conducted to scientifically evaluate the safety and efficacy of a test treatment under investigation. For approval of a new test treatment or drug therapy, the United States Food and Drug Administration (FDA) requires that at least two adequate and well-controlled clinical studies be conducted in humans to demonstrate substantial evidence of the effectiveness and safety of the drug product under investigation. Clinical data collected from adequate and well-controlled clinical trials are considered substantial evidence for evaluation of the safety and efficacy of the test treatment under investigation. Substantial evidence has the characteristics that (1) it is representative of the target patient population, (2) it provides an accurate and reliable assessment of the test treatment, and (3) there is a sufficient sample size for achieving a desired statistical power at a pre-specified level of significance. A sufficient sample size is often interpreted as the minimum sample size required for achieving the desired statistical power.

As indicated by Chow et al. (2017), sample size calculation can be performed based on (i) precision analysis, which is to control the type I error rate and the maximum error allowed for an estimate; (ii) power analysis, which is to control the type II error rate by achieving a desired probability of correctly detecting a clinically meaningful difference if such a difference truly exists; (iii) reproducibility analysis, which is to control both the treatment effect (maintaining the treatment effect) and variability; or (iv) a probability statement, which is to ensure that the probability of observing certain events is less than some pre-specified value. Among these approaches, a pre-study power analysis is probably the most commonly employed method for sample size calculation in clinical research. In practice, sample size calculations in clinical trials are often classified into the categories of (i) sample size estimation or determination, which is to ensure that the estimated sample size imparts certain benefits (e.g., a desired power); (ii) sample size re-estimation, which is usually performed at an interim analysis; (iii) sample size justification, which is to provide the level of assurance associated with a selected sample size; and (iv) sample size adjustment, which is to ensure that the current sample size will achieve the study objectives with a desired power.


In practice, a typical process for sample size calculation is to estimate/determine the minimum sample size required for achieving the study objectives (e.g., correctly detecting a clinically meaningful difference or treatment effect for a given primary study endpoint if such a difference truly exists) with a desired power at a pre-specified level of significance under a valid study design. Sample size calculation is usually performed based on an appropriate statistical test, which is derived under the null hypothesis (reflecting the study objectives) and the study design. Thus, the information required for performing sample size calculation includes (i) the study objectives (e.g., test for equality, non-inferiority/equivalence, or superiority); (ii) the study design (e.g., a parallel design, a crossover design, a group sequential design, or other designs such as adaptive designs); (iii) the properties of the primary study endpoint(s) (e.g., continuous, discrete, or time-to-event data); (iv) the clinically meaningful difference being sought (e.g., non-inferiority margin or equivalence/similarity limit); (v) the significance level (e.g., 1% or 5%); (vi) the desired power (e.g., 80% or 90%); and (vii) other information such as stratification, a 1:1 or 2:1 allocation ratio, or log-transformation. Thus, the procedures used for sample size calculation can be very different from one another depending upon the study objectives/hypotheses and the data types under different study designs (see, e.g., Lachin and Foulkes, 1986; Lakatos, 1986; Wang and Chow, 2002a,b; Wang et al., 2002; Chow and Liu, 2008). For a good introduction and comprehensive summary, one can refer to Chow et al. (2008).

In the next section, the classical pre-study power analysis for sample size calculation is briefly reviewed. Also included in that section is a summary table of formulas for sample size calculation for various data types of study endpoints under hypothesis testing for equality, non-inferiority/equivalence, and superiority. Section 8.3 reviews and studies the relationship among several clinical strategies for the selection of derived study endpoints. The relationship between the two one-sided tests procedure and the confidence interval approach is examined in Section 8.4. Sample size calculation/allocation for multiple-stage adaptive designs is given in Section 8.5. Section 8.6 discusses sample size adjustment with protocol amendments. Sample size calculation for multi-regional or global clinical trials is given in Section 8.7. Some concluding remarks are given in Section 8.8.

8.2 Traditional Sample Size Calculation

In clinical trials, the sample size is often determined for achieving a desired power at a pre-specified level of significance by evaluating the test statistic (derived under the null hypothesis) under the alternative hypothesis (Chow et al., 2008). In clinical trials, the hypotheses that are of particular interest


to the investigators include hypotheses for testing equality, non-inferiority/equivalence, and superiority, which are briefly described below. Hypothesis testing for equality is a commonly employed approach for demonstrating the efficacy and safety of a test drug product. The purpose is first to show that there is a difference between the test drug and the control (e.g., a placebo control), and second to demonstrate that there is a desired power (say, at least 80%) for correctly detecting a clinically meaningful difference if such a difference truly exists. Hypothesis testing for non-inferiority is to show that the test drug is not inferior to, or is as effective as, a standard therapy or an active agent; it is often applied in situations where (i) the test drug is less toxic, (ii) the test drug is easier to administer, or (iii) the test drug is less expensive. On the other hand, hypothesis testing for superiority is to show that the test drug is superior to a standard therapy or an active agent. It should be noted that the usual test for superiority refers to a test for statistical superiority rather than clinical superiority. In practice, testing for superiority is not preferred by the regulatory agencies unless there is some prior knowledge regarding the test drug. Hypothesis testing for equivalence is to show that the test drug can reach the same therapeutic effect as that of a standard therapy (or an active agent), i.e., that they are therapeutically equivalent. It should be noted that hypothesis testing for equivalence includes testing for bioequivalence and testing for therapeutic equivalence; more details regarding the difference between the two are provided in a later section. To provide a better understanding, Figure 8.1 displays the relationship among the hypotheses for non-inferiority, superiority, and equivalence, where µT and µS are the mean responses for the test product and the standard therapy and δ is the clinically meaningful difference (e.g., non-inferiority margin or equivalence limit).

FIGURE 8.1 Relationship among non-inferiority, superiority, and equivalence.

In this section, for simplicity, we focus on the one-sample case, where the primary response is continuous and the hypotheses of interest test whether there is a difference ε = µ − µ0 between the mean responses, where µ0 is a pre-specified constant:

H0: ε = 0 versus Ha: ε ≠ 0.    (8.1)


When σ² is known, we reject the null hypothesis at the α level of significance if

|x̄ − µ0| / (σ/√n) > z_{α/2},

where x̄ is the sample mean and z_a is the upper ath quantile of a standard normal distribution. Under the alternative hypothesis that ε ≠ 0, the power of the above test is given by

Φ(√n ε/σ − z_{α/2}) + Φ(−√n ε/σ − z_{α/2}),    (8.2)

where Φ is the cumulative standard normal distribution function. By ignoring a small value ≤ α/2, the power is approximately Φ(√n ε/σ − z_{α/2}). As a result, the sample size needed for achieving power 1 − β can be obtained by solving the following equation:

√n ε/σ − z_{α/2} = z_β.

This leads to

n = (z_{α/2} + z_β)² σ² / ε² = ((z_{α/2} + z_β)/θ)²,    (8.3)

where θ = ε/σ is the effect size adjusted for standard deviation. When σ² is unknown, it can be replaced by the sample variance s², which results in the usual one-sample t-test.
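Formula (8.3) is straightforward to script. A minimal sketch in Python using scipy for the normal quantiles (the function name and default values are illustrative, not from the text):

```python
from math import ceil
from scipy.stats import norm

def one_sample_n(epsilon, sigma, alpha=0.05, beta=0.20):
    """Sample size per (8.3): n = ((z_{alpha/2} + z_beta) / theta)^2,
    where theta = epsilon / sigma is the standardized effect size."""
    theta = epsilon / sigma
    z_half_alpha = norm.ppf(1 - alpha / 2)   # upper alpha/2 quantile
    z_beta = norm.ppf(1 - beta)              # upper beta quantile
    return ceil(((z_half_alpha + z_beta) / theta) ** 2)

# Example: detect epsilon = 0.5 with sigma = 1 at the 5% (two-sided)
# level with 80% power: (1.96 + 0.8416)^2 / 0.25 = 31.4, so n = 32.
print(one_sample_n(0.5, 1.0))
```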

Following this process, the sample size required for achieving the desired power for testing non-inferiority, superiority, and equivalence under various study designs for comparative trials can be obtained similarly. Table 8.1 provides a summary of formulas for sample size calculation for different data types (continuous, discrete, and time-to-event data) under the hypotheses for testing equality, non-inferiority/equivalence, and superiority for comparing two independent treatment groups.

TABLE 8.1
Formulas for Sample Size Calculation

Equality (H0: ε = 0 vs. Ha: ε ≠ 0)
  Continuous:    n1 = kn2,  n2 = (z_{α/2} + z_β)² σ² (1 + 1/k) / (µ2 − µ1)²
  Binary:        n1 = kn2,  n2 = (z_{α/2} + z_β)² [p1(1 − p1)/k + p2(1 − p2)] / (p2 − p1)²
  Time-to-event: n1 = kn2,  n2 = (z_{α/2} + z_β)² [σ²(λ1)/k + σ²(λ2)] / (λ2 − λ1)²

Non-inferiority (H0: ε ≤ −δ vs. Ha: ε > −δ)
  Continuous:    n1 = kn2,  n2 = (z_α + z_β)² σ² (1 + 1/k) / (µ2 − µ1 + δ)²
  Binary:        n1 = kn2,  n2 = (z_α + z_β)² [p1(1 − p1)/k + p2(1 − p2)] / (p2 − p1 + δ)²
  Time-to-event: n1 = kn2,  n2 = (z_α + z_β)² [σ²(λ1)/k + σ²(λ2)] / (λ2 − λ1 + δ)²

Superiority (H0: ε ≤ δ vs. Ha: ε > δ)
  Continuous:    n1 = kn2,  n2 = (z_α + z_β)² σ² (1 + 1/k) / (µ2 − µ1 − δ)²
  Binary:        n1 = kn2,  n2 = (z_α + z_β)² [p1(1 − p1)/k + p2(1 − p2)] / (p2 − p1 − δ)²
  Time-to-event: n1 = kn2,  n2 = (z_α + z_β)² [σ²(λ1)/k + σ²(λ2)] / (λ2 − λ1 − δ)²

Equivalence (H0: |ε| ≥ δ vs. Ha: |ε| < δ)
  Continuous:    n1 = kn2,  n2 = (z_α + z_{β/2})² σ² (1 + 1/k) / (δ − |µ2 − µ1|)²
  Binary:        n1 = kn2,  n2 = (z_α + z_{β/2})² [p1(1 − p1)/k + p2(1 − p2)] / (δ − |p2 − p1|)²
  Time-to-event: n1 = kn2,  n2 = (z_α + z_{β/2})² [σ²(λ1)/k + σ²(λ2)] / (δ − |λ2 − λ1|)²

Note: ε is the difference between the true means, response rates, or hazard rates of a test drug and a control for continuous, binary, and time-to-event data, respectively.


Note that, in practice, the following questions are often of interest to the investigators or sponsors when conducting clinical trials. First, what is the impact on the required sample size when switching from a two-sided test to a one-sided test? Second, will a different treatment allocation ratio (say, a 2:1 ratio) reduce the required sample size and increase the probability of success? Third, can we determine/justify sample size based on the effect size that one wishes to detect when there is little or no information regarding the test treatment? Partial answers (if not all) to the above questions can be obtained by carefully examining the formulas for sample size calculation given in Table 8.1.
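The continuous-endpoint column of Table 8.1 can be wrapped in a single helper, which also makes the effect of the allocation ratio k and of one-sided versus two-sided testing easy to explore. A sketch under the same notation (the function name and default values are illustrative):

```python
from math import ceil
from scipy.stats import norm

def n2_continuous(mu1, mu2, sigma, delta=0.0, k=1.0,
                  alpha=0.05, beta=0.20, test="equality"):
    """Sample size n2 for the smaller group (n1 = k * n2) under the
    continuous-endpoint formulas of Table 8.1."""
    if test == "equality":                       # two-sided, uses z_{alpha/2}
        num = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)
        denom = mu2 - mu1
    elif test == "non-inferiority":
        num = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
        denom = mu2 - mu1 + delta
    elif test == "superiority":
        num = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
        denom = mu2 - mu1 - delta
    elif test == "equivalence":                  # uses z_{beta/2}
        num = norm.ppf(1 - alpha) + norm.ppf(1 - beta / 2)
        denom = delta - abs(mu2 - mu1)
    else:
        raise ValueError(test)
    return ceil((num ** 2) * (sigma ** 2) * (1 + 1 / k) / denom ** 2)

# Equality test, effect 0.5 * sigma, 1:1 allocation, 80% power:
print(n2_continuous(0.0, 0.5, 1.0))  # 63 per group
```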

8.3 Selection of Study Endpoints

When conducting clinical trials, appropriate clinical endpoints are chosen in order to address the scientific and/or medical questions of interest (the study objectives/hypotheses). For a given clinical study, the required sample size may be determined based on the expected absolute change from baseline of a primary study endpoint, while the collected data are analyzed based on the relative change from baseline (e.g., percent change from baseline) of the primary study endpoint, or based on the percentage of patients who show some improvement (i.e., a responder analysis). The definition of a responder could be based on either the absolute change from baseline or the relative change from baseline of the primary study endpoint. In practice, it is not uncommon to observe a significant result on one study endpoint (e.g., absolute change from baseline, relative change from baseline, or the responder analysis) but not on the others. Thus, it is of interest to explore how an observed significant difference on one study endpoint can be translated to the other study endpoints. The sample size required for achieving a desired power based on the absolute change could be very different from that obtained based on the percent change, or based on the percentage of patients who show an improvement in terms of absolute or relative change, at the α level of significance. Thus, the selection of an appropriate study endpoint has an immediate impact on the assessment of the treatment effect. In practice, one of the most controversial issues regarding clinical endpoint selection is which clinical endpoint is telling the truth; the other is how to translate clinical results among the study endpoints. In what follows, we make an attempt to answer these questions.

8.3.1 Translations among Clinical Endpoints

Suppose that there are two treatments, namely, a test treatment (T) and a reference treatment (R). Denote the measurements of the ith


subject in the jth treatment group before and after treatment by W_1ij and W_2ij, where j = T or R corresponds to the test and the reference treatment, respectively. Assume that W_1ij is lognormally distributed, i.e., W_1ij ~ lognormal(µ_j, σ_j²). Let W_2ij = W_1ij(1 + Δ_ij), where Δ_ij denotes the percentage change after receiving the treatment. In addition, assume that Δ_ij is lognormally distributed, i.e., Δ_ij ~ lognormal(µ_Δj, σ_Δj²). Thus, the difference and the relative difference between the measurements before and after treatment are given by W_2ij − W_1ij and (W_2ij − W_1ij)/W_1ij, respectively. In particular,

W_2ij − W_1ij = W_1ij Δ_ij ~ lognormal(µ_j + µ_Δj, σ_j² + σ_Δj²), and

(W_2ij − W_1ij)/W_1ij ~ lognormal(µ_Δj, σ_Δj²).

To simplify the notation, define X_ij = log(W_2ij − W_1ij) and Y_ij = log[(W_2ij − W_1ij)/W_1ij]. Then both X_ij and Y_ij are normally distributed, with means µ_j + µ_Δj and µ_Δj, respectively, i = 1, 2, …, n_j, j = T, R.

Thus, possible derived study endpoints based on the responses observed before and after treatment include:

X_ij, the absolute difference between the "before treatment" and "after treatment" responses;

Y_ij, the relative difference between the "before treatment" and "after treatment" responses;

r_Aj = #{x_ij > c1, i = 1, …, n_j}/n_j, the proportion of responders, where a responder is a subject whose absolute difference between "before treatment" and "after treatment" responses is larger than a pre-specified value c1;

r_Rj = #{y_ij > c2, i = 1, …, n_j}/n_j, the proportion of responders, where a responder is a subject whose relative difference between "before treatment" and "after treatment" responses is larger than a pre-specified value c2.

To define notation, for j = T, R, let p_Aj = E(r_Aj) and p_Rj = E(r_Rj). Given the above possible types of derived study endpoints, we may consider the following hypotheses for testing non-inferiority, with non-inferiority margins determined based on either the absolute difference or the relative difference:

Case 1: Absolute difference

H0: (µR + µ_ΔR) − (µT + µ_ΔT) ≥ δ1 vs. Ha: (µR + µ_ΔR) − (µT + µ_ΔT) < δ1    (8.4)

Case 2: Relative difference

H0: µ_ΔR − µ_ΔT ≥ δ2 vs. Ha: µ_ΔR − µ_ΔT < δ2    (8.5)

Case 3: Absolute change in response rate (defined based on absolute difference)

H0: p_AR − p_AT ≥ δ3 vs. Ha: p_AR − p_AT < δ3    (8.6)

Case 4: Relative change in response rate (defined based on absolute difference)

H0: (p_AR − p_AT)/p_AR ≥ δ4 vs. Ha: (p_AR − p_AT)/p_AR < δ4    (8.7)

Case 5: Absolute change in response rate (defined based on relative difference)

H0: p_RR − p_RT ≥ δ5 vs. Ha: p_RR − p_RT < δ5    (8.8)

Case 6: Relative change in response rate (defined based on relative difference)

H0: (p_RR − p_RT)/p_RR ≥ δ6 vs. Ha: (p_RR − p_RT)/p_RR < δ6    (8.9)

8.3.2 Comparison of Different Clinical Strategies Denote by Xij the absolute difference between “before treatment” and “after treatment” responses of the ith subjects under the i th treatment, and by Yij the relative difference between “before treatment” and “after treatment” responses of the ith subjects under the jth treatment. Let x. j = n1j = ∑ ni=j 1 xij and y. j = n1j = ∑ in=j 1 yij be the sample means of Xij and Yij for the j th treatment group, j = T , R , respectively. Based on normal distribution, the null hypothesis in (8.4) is rejected at a level α of significance if x.R − x.T + δ 1

> zα .

(8.10)

  − zα  ,    

(8.11)

1   1 2 2 2 2  n + n   σ T + σ ∆T + σ R + σ ∆R  R   T

(

) (

)

Thus, the power of the corresponding test is given as   Φ  

( µT + µ ∆ ) − ( µ R + µ ∆ ) + δ 1 T

(n

−1 T

)(

R

) (

+ n2−1  σ T2 + σ ∆2R + σ R2 + σ ∆2R 

)

Sample Size

203

where Φ(.) is the cumulative distribution function of the standard normal distribution. Suppose that the sample sizes allocated to the reference and test treatments are in the ratio of r, where r is a known constant. Using these results, the required total sample size for the test the hypotheses (8.4) with a power level of (1− β ) is N = nT + nR, with nT =

2 ( zα + zβ ) (σ 12 + σ 22 ) (1 + 1/ ρ ) 2 , ( µR + µ∆ ) − ( µT + µ∆ ) − δ 1  R

(8.12)

T

nR = ρ nT and zu is 1− u quantile of the standard normal distribution. Note that yij′ s are normally distributed. The testing statistic based on y.j would be similar to the above case. In particular, the null hypothesis in (8.5) is rejected at a significance level α if y T . − y R. + δ 2 1  2  1 2  n + n  σ ∆T + σ ∆ R R   T

(

> zα .

)

(8.13)

The power of the corresponding test is given as  Φ   

(n

µ ∆T − µ ∆ R + δ 2

−1 T

)(

+ nR−1 σ ∆2T + σ ∆2R

)

 − zα  .  

(8.14)

Suppose that nR = ρ nT , where r is a known constant. Then the required total sample size to test hypotheses (5) with a power level of (1− β ) is (1+ ρ )nT , where 2 ( zα + zβ ) (σ ∆2T + σ ∆2R ) (1 + 1/ ρ ) nT = 2 . ( µR + µ∆R ) − ( µT + µ∆T ) − δ 2 

(8.15)

For sufficiently large sample size nj, rAj is asymptotically normal with mean pAj and variance pAj (1−pAj ) , j = T , R. Thus, based on Slutsky Theorem, the null nj hypothesis in (8.6) is rejected at an approximate α level of significance if rAT − rAR + δ 3 > zα . 1 1 rAT ( 1 − rAT ) + rAR ( 1 − rAR ) nT nR

(8.16)

Innovative Statistics in Regulatory Science

204

The power of the above test can be approximated by   pAT − pAR + δ 3 Φ − zα  .  nT−1pAT ( 1 − pAT ) + nR−1rAR ( 1 − pAR )   

(8.17)

If nR = ρ nT, where r is a known constant, the required sample size to test hypotheses (8.6) with a power level of (1− β ) is (1+ ρ )nT , where

nT =

( zα + zβ )

2

 pAT ( 1 − pAT ) + pAR ( 1 − pAR ) / ρ  . 2 ( pAR − pAT − δ 3 )

(8.18)

 c1 −( µ j + µ∆ j )  Note that, by definition, pAj = 1 − Φ  σ 2 + σ 2  , where j = T , R. Therefore,  j ∆j  following similar arguments, the above results also apply to test c2 − µ ∆ hypotheses (8.8) with pAj replaced by pRj = 1 − Φ σ ∆ j j and δ 3 replaced by δ 5 . Thus, we have

(

)

H 0 : ( 1 − δ 4 ) pAR − pAT ≥ 0 vs. H1 : ( 1 − δ 4 ) pAR − pAT < 0.

(8.19)

Therefore, the null hypothesis in (8.7) is rejected at an approximate a level of significance if rAT − ( 1 − δ 4 ) rAR

(1 − δ 4 ) r 1 − r 1 rAT ( 1 − rAT ) + AR ( AR ) nT nR 2

> zα .

(8.20)

Using normal approximation to the test statistic when both nT and nR are sufficiently large, the power of the above test can be approximated by   pAT − ( 1 − δ 4 ) pAR  Φ − Zα  2  −1  nT pAT ( 1 − pAT ) + nR−1 ( 1 − δ 4 ) pAR ( 1 − pAR )  

(8.21)

Suppose that nR = ρ nT, where r is a known constant. Then the required total sample size to test hypotheses (8.13), or equivalently (8.19), with a power level of (1− β ) is (1+ ρ )nT , where

nT =

( Zα + Zβ )

2

 pAT ( 1 − pAT ) + ( 1 − δ 4 )2 pAR ( 1 − pAR ) / ρ   . 2  pAT − ( 1 − δ 4 ) pAR 

(8.22)

Sample Size

205

Similarly, the results derived in (8.18) through (8.20) for the hypotheses (8.7) c2 − µ ∆ also apply to the hypotheses in (8.9) with pAj replaced by pRj = 1 − Φ σ ∆ j j and δ 4 replaced by δ 6. It should be noted that when conducting clinical trials, the sponsors always choose the clinical endpoints to their best interest. The regulatory agencies, however, require the primary clinical endpoint be specified in the study protocol. Positive results from other clinical endpoints will not  be considered as the primary analysis results for regulatory approval. This, however, does not have any scientific or statistical justification for assessment of the treatment effect of the test drug under investigation.

(

)

8.4 Multiple-stage Adaptive Designs Consider a clinical trial with K interim analyses. The final analysis is treated as the Kth interim analysis. Suppose that at each interim analysis, a hypothesis test is performed followed by some actions that are dependent of the analysis results. Such actions could be an early stopping due to futility/efficacy or safety, sample size re-estimation, modification of randomization, or other adaptations. In this setting, the objective of the trial can be formulated using a global hypothesis test, which is an intersection of the individual hypothesis tests from the interim analyses H 0 : H 01 ∩ ... ∩ H 0 K , where H 0 i , i = 1,..., K is the null hypothesis to be tested at the ith interim analysis. Note that there are some restrictions on H 0 i , that is, rejection of any H 0 i , i = 1,..., K will lead to the same clinical implication (e.g., drug is efficacious); hence all H 0 i , i = 1,..., K are constructed for testing the same endpoint within a trial. Otherwise the global hypothesis cannot be interpreted. In practice, H 0 i is tested based on a sub-sample from each stage, and without loss of generality, assume H 0 i is a test for the efficacy of a test treatment under investigation, which can be written as H 0 i : ηi1 ≥ ηi 2 versus Hα i : ηi1 < ηi 2 , where ηi1 and ηi2 are the responses of the two treatment groups at the ith stage. It is often the case that when ηi1 = ηi 2 , the p-value pi for the sub-sample at the ith stage is uniformly distributed on [0, 1] under H 0 (Bauer and Kohne, 1994). This  desirable property can be used to construct a test statistic for multiple-stage seamless adaptive designs. As an example, Bauer and Kohne

Innovative Statistics in Regulatory Science

206

(1994) used Fisher’s combination of the p-values. Similarly, Chang (2007) considered a linear combination of the p-values as follows K

Tk =

∑ w p , i = 1,… ,K , ki i

(8.23)

i =1

where wki > 0 and K is the number of analyses planned in the trial. For simplicity, consider the case where wki = 1. This leads to K

Tk =

∑ p , i = 1,… , K. i

(8.24)

i =1

The  test statistic Tk can be viewed as cumulative evidence against H 0 . The smaller the Tk is, the stronger the evidence is. Equivalently, we can define the test statistic as Tk = ∑ iK=1 pi / K , which can be viewed as an average of the evidence against H 0 . The stopping rules are given by Stop for efficacy if Tk ≤ α k  Stop for futility if Tk ≥ β k ,  otherwise Continue

(8.25)

where Tk , α k , and β k are monotonic increasing functions of k , α k < β k , k = 1,..., K − 1, and α K = β K . Note that α k , and β k are referred to as the efficacy and futility boundaries, respectively. To reach the kth stage, a trial has to pass 1 to (k − 1)th stages. Therefore, a so-called proceeding probability can be defined as the following unconditional probability:

ψ k ( t ) = P ( Tk < t , α 1 < T1 < β1 ,..., α k −1 < Tk −1 < β k −1 ) =

β1

βk −1

∫ ∫ ∫ α1

...

α k −1

t

−∞

fT1 ...Tk ( t1 ,..., tk ) dtk dtk −1...dt1 ,

(8.26)

where t ≥ 0, ti , i = 1,..., k is the test statistic at the ith stage, and fT1 ...Tk is the joint probability density function. The error rate at the kth stage is given by

π k = ψ k (α k ) .

(8.27)

Sample Size

207

When efficacy is claimed at a certain stage, the trial is stopped. Therefore, the type I error rates at different stages are mutually exclusive. Hence, the experiment-wise type I error rate can be written as follows: K

α=

∑π . k

(8.28)

k =1

Note that (8.26) through (8.28) are the keys to determine the stopping boundaries, which will be illustrated in the next sub-section with two-stage seamless adaptive designs. The  adjusted p-value calculation is the same as the one in a classic group sequential design (see, e.g., Jennison and Turnbull, 2000). The key idea is that when the test statistic at the kth stage Tk = t = α k (i.e., just on the efficacy stopping boundary), the p-value is equal to α spent k ∑ i =1 π i . This is true regardless of which error spending function is used and consistent with the p-value definition of the classic design. The  adjusted p-value corresponding to an observed test statistic Tk = t at the kth stage can be defined as p (t; k ) =

k −1

∑ π +ψ i

k

( t ) , k = 1,.., K.

(8.29)

i =1

This adjusted p-value indicates weak evidence against H 0 , if the H 0 is rejected at a late stage because one has spent some α at previous stages. On the other hand, if the H 0 was rejected at an early stage, it indicates strong evidence against H 0 because there is a large portion of overall α that has not  been spent yet. Note that pi in (8.23) is the stage-wise naive (unadjusted) p-value from a sub-sample at the ith stage, while p ( t ; k ) are adjusted p-values calculated from the test statistic, which are based on the cumulative sample up to the kth stage where the trial stops, equations  (8.28) and (8.29) are valid regardless how the pi values are calculated. An Example Suppose that a clinical trial utilizing an adaptive group sequential design with one planned interim analysis is to be conducted to evaluate the safety and efficacy of a test treatment in treatment patient with certain disease. The sponsor would like to have the option for stopping the trial early due to either efficacy or futility and is able to control the overall type I error rate at the α level of significance. The study can be viewed as a two-stage adaptive design. Sample size calculation was performed based on the primary study endpoint of failure rate at 12  weeks post randomization. This study is powered to detect a clinically meaningful difference of 25% in failure rate between the test treatment and a placebo at the 5% level of significance assuming that the true placebo failure rate

208

Innovative Statistics in Regulatory Science

is 50%. Since there is an intention to stop the trial due to efficacy/futility at the end of Stage 1, sample size calculation was performed based on the method of individual p-values for a two-stage adaptive design proposed by Chang (2007). At  the end of the first stage, the following stopping rules based on individual p-values are considered: Stop for efficacy if T1 ≤ α 1 ; Stop for futility if T1 > β1 ; Continue with adaptation if α 1 < T1 ≤ β1 , where α 1 and β1 (α 1 < β1 ) are efficacy and futility boundaries, respectively, and T1 is the test statistic (based on individual p-value) to be used at the first stage. Note that after the review of the data by an independent data monitoring committee (DMC), additional adaptations such as dose modification and sample size re-estimation may be applied as suggested by the independent DMC if a decision to proceed is reached. As indicated by Chow and Chang (2006), for a two-stage adaptive design based on individual p-values, we have (see also, Chang, 2007)

α = α 1 + α 2 ( β1 − α 1 ) . Thus, for the proposed two-stage seamless adaptive design, we choose the efficacy and futility boundaries as follows

α 1 = 0.005, β1 = 0.40, α 2 = 0.0506 for controlling the overall type I error rate at the 5% (α = 0.05) level of significance. Sample size calculation can then be performed accordingly.

8.5 Sample Size Adjustment with Protocol Amendments In practice, for a given clinical trial, it is not uncommon to have 3–5 protocol amendments after the initiation of the clinical trial. One of the major impacts of many protocol amendments is that the target patient population may have been shifted during the process, which may have resulted in a totally different target patient population at the end of the trial. A typical example is the case when significant adaptation (modification) is applied to inclusion/ exclusion criteria of the study. Denote by ( µ , σ ) the target patient population. After a given protocol amendment, the resultant (actual) patient population may have been shifted to ( µ1 , σ 1 ), where µ1 + µ + ε is the population mean of

Sample Size

209

the primary study endpoint and σ 1 = Cσ (C > 0) is the population standard deviation of the primary study endpoint. The shift in target patient population can be characterized by E1 =

µ1 µ +ε µ = =∆ = ∆ E, σ1 Cσ σ

where ∆ = ( 1 + ε / µ ) /C, and E and E1 are the effect size before and after population shift, respectively. Chow et  al. (2002) and Chow and Chang (2006) refer to Δ as a sensitivity index measuring the change in effect size between the actual patient population and the original target patient population. As indicated in Chow and Chang (2006), the impact of protocol amendments on statistical inference due to shift in target patient population (moving target patient population) can be studied through a model that link the moving population means with some covariates (Chow and Shao, 2005). However, in many cases, such covariates may not  exist or exist but not observable. In this case, it is suggested that inference on Δ be considered to measure the degree of shift in location and scale of patient population based on a mixture distribution by assuming the location or scale parameter is random (Chow et al., 2005). In clinical trials, for a given target patient population, sample size calculation is usually performed based on a test statistic (which is derived under the null hypothesis) evaluated under an alternative hypothesis. After protocol amendments, the target patient population may have been shifted to an actual patient population. In this case, the original sample size may have to be adjusted in order for achieving the desired power for assessment of the treatment effect for the original patient population. As an example, consider sample size adjustment with protocol amendments based on covariateadjusted model for testing non-inferiority hypothesis that H 0 : p10 − p20 ≤ −δ versus H1 : p10 − p20 > −δ , where p10 and p20 are the response rate for a test treatment and an active control or placebo, respectively. Let nClassic and nActual be the sample size based on the original patient population and the actual patient population as the result of protocol amendments. Also, let nActual = RnClassic, where R is the adjustment factor. Following the procedures described in Chow et al. (2008), sample sizes for both nClassic and nActual can be obtained. Let Ytij and Xtij be the response and the corresponding relevant covariate for the jth subject after the ith amendment under the tth treatment ( t = 1, 2, i = 0, 1,..., k , j = 1, 2,..., nti) . For  each amendment, patients selected by the same criteria are randomly allocated to either the test treatment D1 = 1 or control treatment D2 = 0 groups. In this particular case, the true mean values of the covariate for the two treatment groups are the same under each amendment. Therefore, the

Innovative Statistics in Regulatory Science

210

relationships between the binary response and the covariate for both treatment groups can be described by a single model, pti =

exp ( β1 + β 2Dt + β 3 vi + β 4Dt vi ) , t = 1,, 2, i = 0, 1,..., k. 1 + exp ( β1 + β 2Dt + β 3 vi + β 4Dt vi )

Hence, the response rates for the test treatment and the control treatment are P1i =

exp ( β1 , β 2 + ( β 3 + β 4 ) vi )

1 + exp ( β1 + β 2 + ( β 3 + β 4 ) vi )

and p2 i =

exp ( β1 + β 3 vi ) 1 + exp ( β1 + β 3 vi )

respectively. Thus, the joint likelihood function of β = ( β1 ,..., β 4 ) is given by T

2

k

nti

t =1

i =0

j =1

∏∏∏

(

)

 T ( ti )  exp β z  T ( ti )  1 + exp β z 

(

)

   

ytij

 1   1 + exp β T z(ti ) 

(

)

   

1− ytij

× fX.i

  x .i  ,  

( )

where fX.i ( x.i ) is the probability density function of X .i = ∑ t2=1 ∑ nj =ti1 Xtij and ti z( ) = (1, Dt , x.i , Dt x.i )T . The log likelihood function is then given by

l (β ) =

k

2

nti

∑∑∑ t =1 i = 0

j =1

(

)

  exp β T z(ti )  y ln   tij  1 + exp β T z(ti )     + ln fX.i x.i 

(

)

  1  + ( 1 − y ) ln  tij   1 + exp β T z(ti )  

(

( )

)

     .   

 ,..., β  )T , we Given the resulting the maximum likelihood estimate βɵ = ( β 1 4 obtain the estimate of p10 and p20 as follows pɵ 10 =

(

) ) , pɵ  +β  X + (β ) ) (

 +β  + β  +β  X .0 exp β 1 2 3 4

(

 +β  1 + exp β 1 2

3

4

.0

20

=

(

 +β  X .0 exp β 1 3

(

Thus, we have N Classic =

( Zα + Zγ )

( p10 − p20 + δ )

2

 p10 ( 1 − p10 ) p20 ( 1 − p20 )  . +  1 − w  w 

( Zα + Zγ ) V d , 2 ( p10 − p20 + δ ) 2

N Actual =

2

)

 +β  X .0 1 + exp β 1 3

)

Sample Size

211

where w is the proportion of patients for the first treatment, T Vɶd =  g′ ( β )   w 



k i =0

ρ1i I(1i ) + ( 1 − w )



ρ 2 i I( 2 i )  i =0  k

−1

g′ ( β ) , w = n1. / N , ρti = nti / nt. ,

and  p10 ( 1 − p10 ) − p20 ( 1 − p20 )   p10 ( 1 − p10 )  g′ ( β ) =   v0 p10 ( 1 − p10 ) − p20 ( 1 − p20 )   v0 p10 ( 1 − p10 ) 

(

(

)

)

    .    

Note that more details regarding formulas for sample size adjustment based on covariate-adjusted model for binary response endpoint and sample size adjustments based on random location shift and random scale shift can be found in Chow (2011).

8.6 Multi-regional Clinical Trials As indicated by Uesaka (2009), the primary objective of a multi-regional bridging trial is to show the efficacy of a drug in all participating regions while also evaluating the possibility of applying the overall trial results to each region. To apply the overall results to a specific region, the results in that region should be consistent with either the overall results or the results from other regions. A typical approach is to show consistency among regions by demonstrating that there exists no treatment-by-region interaction. Recently, the Ministry of Health, Labor and Welfare (MHLW) of Japan published a guidance on Basic Principles on Global Clinical Trials that outlines the basic concepts for planning and implementation the multi-regional trials in a Q&A format. In this guidance, special consideration was placed on the determination of the number of Japanese subjects required in a multi-regional trial. As indicated, the selected sample size should be able to establish the consistency of treatment effects between the Japanese group and the entire group.

Innovative Statistics in Regulatory Science

212

To establish the consistency of the treatment effects between the Japanese group and the entire group, it is suggested that the selected size should satisfy  D  P J > ρ  ≥ 1− γ , D  All 

(8.30)

where DJ and DAll are the treatment effects for the Japanese group and the entire group, respectively. Along this line, Quan et al. (2010) derived closed form formulas for the sample size calculation/allocation for normal, binary and survival endpoints. As an example, the formula for continuous endpoint assuming that = DJ D= DAll = D, where DNJ is the treatment effect for the NJ non-Japanese subjects, is given below. NJ ≥

z12−γ N

( z1−α /2 + z1−β ) (1 − ρ )2 + z12−γ ( 2 ρ − ρ 2 ) 2

,

(8.31)

where N and N J are the sample size for the entire group and the Japanese group. Note that the MHLW recommends that ρ should be chosen to be either 0.5 or greater and γ should be chosen to be either 0.8 or greater in (8.30). As an example, if we choose ρ = 0.5, γ = 0.8, α = 0.05, and β = 0.9, then N J / N = 0.224. In other words, the sample size for the Japanese group has to be at least 22.4% of the overall sample size for the multi-regional trial. In practice, 1− ρ is often considered a non-inferiority margin. If ρ is chosen to be greater than 0.5, the Japanese sample size will increase substantially. It should be noted that the sample size formulas given in Quan et al. (2010) are derived under the assumption that there is no difference in treatment effects for the Japanese group and non-Japanese group. In practice, it is expected that there is a difference in treatment effect due to ethnic difference. Thus, the formulas for sample size calculation/allocation derived by Quan et al. (2010) are necessarily modified in order to take into consideration of the effect due to ethnic difference. As an alternative, Kawai et al. (2008) proposed an approach to rationalize partitioning the total sample size among the regions so that a high probability of observing a consistent trend under the assumed treatment effect across regions can be derived, if the treatment effect is positive and uniform across regions in a multi-regional trial. Uesaka (2009) proposed new statistical criteria for testing consistency between regional and overall results, which do not require impractical sample sizes and discussed several methods of sample size allocation to regions. Basically, three rules of sample size allocation in multi-regional clinical trials are discussed. These rules include (1) allocating equal size to all regions, (2) minimizing total sample size, and (3) minimizing the sample size of a specific region. It should be noted that

Sample Size

213

the sample size of a multi-regional trial may become very large when one wishes to ensure consistent results between region of interest and the other regions or between the regional results and the overall results regardless which rules of sample size allocation is used. When planning a multi-regional trial, it is suggested that the study objectives should be clearly stated in the study protocol. Once the study objectives are confirmed, a valid study design can be chosen and the primary clinical endpoints can be determined accordingly. Based on the primary clinical endpoint, sample size required for achieving a desired power can then be calculated. Recent approaches for sample size determination in multi-regional trials developed by Kawai et al. (2008), Quan et al. (2010), and Ko et al. (2010) are all based on the assumption that the effect size is uniform across regions. For example, assume that we focus on the multi-regional trial for comparing a test product and a placebo control based on a continuous efficacy endpoint. Let X and Y be some efficacy responses for patients receiving the test product and the placebo control respectively. For convention, both X and Y are nor2 2 mally distributed with variance σ . We assume that σ is known, although it can generally be estimated. Let µT and µ P be the population means of the test and placebo, respectively, and let ∆ = µT − µ P . Assume that effect size (Δ/σ) is uniform across regions. The hypothesis of testing for the overall treatment effect is given as H 0 : ∆ ≤ 0 versus Hα : ∆ > 0. Let N denote the total sample size for each group planned for detecting an expected treatment difference ∆ = δ at the desired significance level α and with power 1− β . Thus, N = 2σ 2 {( z1−α + z1− β ) / δ } , 2

where z1−α is the ( 1− α ) th percentile of the standard normal distribution. Once N is determined, special consideration should be placed on the determination of the number of subjects from the Asian region in the multi-regional trial. The selected sample size should be able to establish the consistency of treatment effects between the Asian region and the regions overall. To establish the consistency of treatment effects between the Asian region and the entire group, it is suggested that the selected sample size should satisfy that the assurance probability of the consistency criterion, given that ∆ = δ and the overall result is significant at α level, is maintained at a desired level, say 80%. That is, Pδ ( DAsia ≥ ρ DAll |Z > z1−α ) > 1 − γ

(8.32)

for some pre-specified 0 < γ ≤ 0.2 . Here Z represents the overall test statistic.

214

Innovative Statistics in Regulatory Science

Ko et  al. (2010) calculated the sample size required for the Asian region based on (8.32). For  β  =  0.1, α   =  0.025, and ρ  =  0.5, the sample size for the Asian region has to be around 30% of the overall sample size to maintain the assurance probability of (8.32) at 80% level. On the other hand, by considering a two-sided test, Quan et al. (2010) derived closed form formulas for the sample size calculation for normal, binary and survival endpoints based on the consistency criterion. For examples, if we choose ρ = 0.5, γ  = 0.2, α  = 0.025, and β = 0.9, then the Asian sample size has to be at least 22.4% of the overall sample size for the multi-regional trial. It  should be noted that the sample size determination given in Kawai et al. (2008), Quan et al. (2010), and Ko et al. (2010) are all derived under the assumption that the effect size is uniform across regions. In practice, it might be expected that there is a difference in treatment effect due to ethnic difference. Thus, the sample size calculation derived by Kawai et al. (2008), Quan et al. (2010), and Ko et al. (2010) may not be of practical use. More specifically, some other assumptions addressing the ethnic difference should be explored. For example, we may consider the following assumptions: 1. Δ is the same but σ 2 is different across regions; 2. Δ is different but σ 2 is the same across regions; 3. Δ and σ 2 are both different across regions. Statistical methods for the sample size determination in multi-regional trials should be developed based on the above assumptions.

8.7 Current Issues 8.7.1 Is Power Calculation the Only Way? In clinical trials comparing a test treatment (T) and a reference product (R) (e.g., a standard of care treatment or an active control agent), it can be verified that sample size formulae is a function of type I error rate (α ), type II error rate ( β ) or power (1− β ), population mean difference (ε ), clinically meaningful difference (δ ), variability associated with the response (σ ), and treatment allocation ratio (k) (see, e.g., Chow et al., 2017). Assuming that n= n= n and T R σ T = σ R = σ , where nT , σ T , and nR , σ R are sample sizes and standard deviations for the test treatment and the reference product, respectively, sample size formulae can generally be expressed as follows: n = f (α , β ,ε ,δ ,σ , and k ),

(8.33)

Sample Size

215

in which ε = µT − µR, where µT and µR are the population means of the test treatment and the reference product, respectively. As indicated in Section  8.2, a typical approach (power calculation or power analysis for sample size calculation) for sample size determination is to fix α (control of type I error rate), ε (under the null hypothesis), δ (detecting the difference of clinical importance), σ (assuming that the variability associated with the response is known), and k (a fixed ratio of treatment allocation) and then select n for achieving a desired power of 1− β (Chow et al., 2017). In practice, it is not possible to select a sample size n and at the same time control all parameters given in (8.33) although many sponsors and most regulatory agencies intend to. Under (8.33), sample size calculation can be divided into two categories: controlling one parameter and controlling two (multiple) parameters. For example, in the interest of maintaining clinically meaningful difference or margin, sample size can be selected by fixing all parameters except for δ . Sample size can also be determined by fixing all parameters except for σ and select an appropriate n for controlling the variability associated with the reference product. We will refer to these sample size calculations as sample size calculation controlling single parameter as described in (8.33). In some cases, the investigators may be interested in controlling two parameters such as δ and σ simultaneously for sample size calculation. Sample size calculation controlling δ and σ at the same time is also known as reproducibility analysis for sample size calculation. It should be noted that the required sample size will increase if one attempts to control more parameters at the same time for sample size calculation, which, however, may not be feasible in most clinical trials due to huge sample size required. Thus, in clinical trials, power calculation is not the only way for sample size calculation. Based on (8.33), different approaches controlling single parameter or multiple parameters may be applied for sample size calculation. In addition, one may consider sample size calculation based on probability statement such as probability monitoring procedure for clinical studies with extremely low incidence rates or rare diseases drug development. Sample size determination based on probability monitoring procedure will be discussed in Chapter 18. 8.7.2 Instability of Sample Size As discussed in Section 8.2, in clinical research, power analysis for sample size calculation is probably the most popular method for sample size determination. However, Chow (2011) indicated that sample size calculation based on estimate of σ2/δ2 might not be stable as expected. It can be verified that the 2 asymptotic bias of E(θɵ = s2 / δɵ ) is given by

()

(

)

E θɵ − θ = N −1 3θ 2 − θ = 3N −1θ 2 {1 + o ( 1)} .

216

Innovative Statistics in Regulatory Science

2 2 Alternatively, it is suggested that the median of s2/δɵ , i.e., P(s2 / δɵ ≤ η0.5 ) = 0.5 be considered. It  can be shown that the asymptotic bias of the median of 2 s2 / δɵ is given by

η0.5 − θ = −1.5N −1θ {1 + o ( 1)} , whose leading term is linear in θ . As it can be seen that bias of the median approach can be substantially smaller than the mean approach for a small sample size and/or small effect size. However, in practice, we usually do 2 not  know the exact value of the median of s2/δɵ . In  this case, a bootstrap approach in conjunction with a Bayesian approach may be useful. 8.7.3 Sample Size Adjustment for Protocol Amendment In practice, it is not uncommon that there is a shift in target patient population due to protocol amendments. In this case, sample size is necessarily adjusted for achieving a desired power for correctly detecting a clinically meaningful difference with respect to the original target patient population. One of the most commonly employed approaches is to consider adjusting the sample size based on the change in effect size. α     E N1 = min N max , max  N min , sign ( E0E1 ) 0 N 0   ,   E1    

where N 0 and N1 are the required original sample size before population shift and the adjusted sample size after population shift, respectively, N max and N min are the maximum and minimum sample sizes, α is a constant that is usually selected so that the sensitivity index ∆ = EE01 is within an acceptable range, and sign(x) = 1 for x > 0; otherwise sign(x) = −1. The  sensitivity index ∆ regarding random target patient population due to protocol amendments and extrapolation is further studied in Chapter 10. 8.7.4 Sample Size Based on Confidence Interval Approach As discussed in Chapter 3, the concepts of interval hypotheses testing and confidence interval approach for bioequivalence assessment for generics and similarity evaluation for biosimilars are different. The  two one-sided tests (TOST) procedure for interval hypotheses testing is the official method recommended by the FDA. However, the 90% confidence interval approach, which is operationally equivalent to TOST under certain conditions, is often mix used. In this case, it is of interest to evaluate sample size requirement based on the 90% confidence interval. Base on the 90% confidence interval

Sample Size

217

approach, sample size should be determined for achieving a desired probability that the constructed 90% confidence interval is totally within the equivalence limit or similarity margin. That is, an appropriate sample size should be selected for achieving a desired probability p based on the following probability statement: p = { 90%CI  [δ L ,δU ]} , where [δ L ,δU ] is the bioequivalence limit or similarity margin. Noted that it can be verified that the above probability statement is not the same as that of the power function of the TOST for testing interval hypotheses based on [δ L ,δU ].

8.8 Concluding Remarks In  summary, sample size calculation plays an important role in clinical research. The purpose of sample size calculation is not only to ensure that there are sufficient or minimum required subjects enrolled in the study for providing substantial evidence of safety and efficacy of the test treatment under investigation, but also to identify any signals or trends—with some assurance—of other clinical benefits to the patient population under study. Since clinical trial is a lengthy and complicated process, standard procedures that fit all types of clinical trials conducted under different trial designs have been developed. In  practice, under a valid study design, sample size is usually selected for achieving study objectives (such as detecting a clinically meaningful difference or treatment effect of a given primary study endpoint) at a pre-specified level of significance. Different types of clinical trials (e.g., non-inferiority/equivalence trials, superiority trials, doseresponse trials, combinational trials, bridging studies, and vaccine clinical trials) are often conducted for different purposes of clinical investigations under different study designs (e.g., parallel-group design, crossover design, cluster randomized design, titration design, enrichment design, group sequential design, blinded reader design, designs for cancer research, and adaptive clinical trial designs). Thus, different procedures may be employed for achieving study objectives with certain desirable statistical inferences. Sample size calculation should be performed under an appropriate statistical test (derived under the null hypothesis) and evaluate under the alternative hypothesis for an accurate and reliable assessment of the clinically meaningful difference or treatment effect with a desired statistical power at a pre-specified level of significance GSP and GCP.

218

Innovative Statistics in Regulatory Science

For  clinical trials utilizing complex innovative designs such as multiple adaptive designs, statistical methods may not be well established. Thus, there may exist no formulae or procedures for sample size calculation/allocation at the planning stage of protocol development. In this case, it is suggested that a clinical trial simulation be conducted for obtaining the required sample size for achieving the study objectives with a desired power or statistical inference/assurance. Clinical trial simulation is a process that uses computers to mimic the conduct of a clinical trial by creating virtual patients and calculating (or predicting) clinical outcomes for each virtual patient based on pre-specified models. It should be noted that although clinical trial simulation does provide “a” solution (not “the” solution) to sample size calculation under complicated study designs, it is useful only when based on a wellestablished predictive model under certain assumptions, which are often difficult, if it is not impossible, to be verified. In addition, “How to validate the assumed predictive model for clinical trial simulation?” is often a major challenge to both investigators and biostatisticians.

9 Reproducible Research

9.1 Introduction In  clinical research and development, it is always a concern to the principal investigator that (i) the research finding does not reach statistical significance, i.e., it is purely by chance alone, and (ii) the significant research finding is not reproducible under the same experimental conditions with the same experimental units. Typical examples include (i) results from genomic studies for screening of relevant genes as predictors of clinical outcomes for building of a medical predictive model for critical and/or life-threatening diseases are often not  reproducible and (ii) clinical results from two pivotal trials for demonstration of safety and efficacy of a test treatment under investigation are not consistent. In practice, it is then of particular interest to assess the validity/reliability and reproducibility of the research findings obtained from studies conducted in pharmaceutical and/or clinical research and development. For genomic studies, thousands of genes are usually screened for selecting a handful of genes that are most relevant to clinical outcomes of a test treatment for treating some critical and/or life threatening diseases such as cancer. These identified genes, which are considered risk factors or predictors, will then be used for building a medical predictive model for the critical and/or life threatening diseases. A validated medical predictive model can definitely benefit patients with the diseases under study. In practice, it is not  uncommon that different statistical methods may lead to different conclusions based on the same data set, i.e., different methods may select different group of genes that are predictive of clinical outcomes. The investigator often struggles with the situation that (i) which set of genes should be reported, and (ii) why the results are not reproducible. Some researchers attribute this to (i) the method is not validated and (ii) there is considerable variability (fluctuation) in data. Thus, it is suggested that necessary actions be taken to identify possible causes of variabilities and eliminate/control the identified variabilities whenever possible. In  addition, it is suggested the method should be validated before it is applied to the clean and quality database. 219

220

Innovative Statistics in Regulatory Science

For  approval of a test drug product, the FDA  requires two pivotal studies be conducted (with the same patient population under the same study protocol) in order to provide substantial evidence of safety and efficacy of the test drug product under investigation. The purpose for two pivotal trials is to assure that the positive results (e.g., p-value is less than the nominal level of 5%) are reproducible with the same patient population under study. Statistically, there is higher probability of observing positive results of future study provided that positive results were observed in two independent trials as compare to that of observing positive results provided that positive results were observed in one single trial. In  practice, however, it is a concern whether two positive pivotal trials can guarantee whether the positive results of future studies are reproducible if the study shall be repeatedly conducted with the same patient population. In clinical research, it is often suggested that testing and/or statistical procedure be validated for reducing possible deviation/fluctuation in research findings to increase the creditability of the research findings in terms of accuracy and reliability. This, however, does not address the question that whether the current observed research findings are reproducible if the study were conducted repeatedly under same or similar experimental conditions with the same patient population. In this chapter, we recommend the use of a Bayesian approach for assessment of the reproducibility of clinical research. In other words, the variability (or degree of fluctuation) in research findings is first evaluated followed by the assessment of reproducibility probability based on the observed variability (Shao and Chow, 2002). The  suggested method provides certain assurance regarding the degree of reproducibility of the observed research findings if the study shall be conducted under the same experimental conditions and target patient population. In the next section, the concept of reproducibility probability is briefly outlined. Section 9.3 introduces the estimated power approach for assessment of reproducibility probability. Alternative methods for evaluation of reproducibility probability are discussed in Section 9.4. Some applications are given in Section 9.5. Section 9.6 provides future perspectives regarding reproducible research in clinical development.

9.2 The Concept of Reproducibility Probability In practice, reliability, repeatability, and reproducibility of research findings are related to various sources of variability such as intra-subject (experimental unit) variability, inter-subject variability, and variability due to subject-by-treatment interaction and so on during the pharmaceutical and/or clinical development process. To achieve the desired reliability, repeatability, and reproducibility of research findings, we will need to identify, eliminate,

Reproducible Research

221

and control possible sources of variability. Chow and Liu (2013) classified possible sources of variability into four categories: (i) expected and controllable (e.g., a new equipment or technician); (ii) expected but uncontrollable (e.g., a new dose or treatment duration); (iii) unexpected but controllable (e.g., compliance); and (iv) unexpected and uncontrollable (e.g., pure random error). In pharmaceutical/clinical research and development, these sources of variability are often monitored through some variability (control) charts for statistical quality assurance and control (QA/QC) (see, e.g., Barrentine, 1991; JMP, 2012). The  selection of acceptance limits, however, is critical to the success of these control charts. Following the idea of Shao and Chow (2002), Salah et al. (2017) proposed the concept of reproducibility based on an empirical power for evaluation of the degree of reliability, repeatability, and reproducibility which may be useful for determining the acceptance limits for monitoring reliability, repeatability, and reproducibility in variability control charts. As mentioned in Section 9.1, for marketing approval of a new drug product, the FDA requires that at least two adequate and well-controlled clinical trials be conducted to provide substantial evidence regarding the safety and effectiveness of the drug product under investigation. The purpose of conducting the second trial is to study whether the observed clinical result from the first trial is reproducible on the same target patient population. Let H 0 be the null hypothesis that the mean response of the drug product is the same as the mean response of a control (for example, placebo) and H a be the alternative hypothesis. An observed result from a clinical trial is said to be significant if it leads to the rejection of H 0 . It is often of interest to determine whether clinical trials that produced significant clinical results provide substantial evidence to assure that the results will be reproducible in a future clinical trial with the same study protocol. Under certain circumstances, the FDA Modernization Act (FDAMA) of 1997 includes a provision (Section 115 of FDAMA) to allow data from one adequate and well-controlled clinical trial investigation and confirmatory evidence to establish effectiveness for risk/ benefit assessment of drug and biological candidates for approval. Suppose that the null hypothesis H 0 is rejected if and only if T > c , where T is a test statistic and c is a positive critical value. In statistical theory, the probability of observing a significant clinical result when H a is indeed true is referred to as the power of the test procedure. If the statistical model under H a is a parametric model, then the power is P ( reject H 0|H a ) = P( T > c|H a ) = P( T > c|θ ),

(9.1)

where θ is an unknown parameter or a vector of parameters under H a . Suppose that one clinical trial has been conducted and the result is significant. What is the probability that the second trial will produce a significant result, that is, the significant result from the first trial is reproducible? Statistically, if the two trials are independent, the probability of observing

Innovative Statistics in Regulatory Science

222

a significant result from the second trial when H a is true is still given by Equation  (9.1), regardless of whether the result from the first trial is significant or not. However, information from the first clinical trial should be useful in the evaluation of the probability of observing a significant result in the second trial. This leads to the concept of reproducibility probability, which is different from the power defined by (9.1). In general, the reproducibility probability is a person’s subjective probability of observing a significant clinical result from a future trial, when he/she observes significant results from one or several previous trials. Goodman (1992) considered the reproducibility probability as the probability in (9.1) with θ replaced by its estimate based on the data from the previous trial. In  other words, the reproducibility probability can be defined as an estimated power of the future trial using the data from the previous trial. When the reproducibility probability is used to provide an evidence of the effectiveness of a drug product, the estimated power approach may produce a rather optimistic result. A  more conservative approach is to define the reproducibility probability as a lower confidence bound of the power of the second trial. Perhaps a more sensible definition of reproducibility probability can be obtained by using the Bayesian approach. Under the Bayesian approach, the unknown parameter θ is a random vector with a prior distribution π (θ ) assumed to be known. Thus, the reproducibility probability can be defined as the conditional probability of T > c in the future trial, given the data set x observed from the previous trial(s), that is, P ( T  > c|x ) = P ( T  > c|θ ) π (θ|x ) dθ ,



(9.2)

where T = T ( y ) is based on the data set y from the future trial and π (θ|x ) is the posterior density of θ , given x.

9.3 The Estimated Power Approach To study the reproducibility probability, we need to specify the test procedure, that is, the form of the test statistic T. In what follows, we consider several different study designs. 9.3.1 Two Samples with Equal Variances Suppose that a total of n = n1 + n2 patients are randomly assigned to two groups, a treatment group and a control group. In  the treatment group, n1 patients receive the treatment (or a test drug) and produce responses x11 ,…, x1n1 . In the control group, n2 patients receive the placebo (or a reference

Reproducible Research

223

drug) and produce responses x21 ,…, x2 n2 . This design is a typical two-group parallel design in clinical trials. Assume that xij ’s are independent and normally distributed with means µi , i = 1, 2, and a common variance σ 2. Suppose that the hypotheses of interest are H 0 : µ1 − µ2 = 0 versus H a : µ1 − µ2 ≠ 0.

(9.3)

Similar discussion applies for the case of a one-sided H a . Consider the commonly used two-sample t-test that rejects H 0 if and only if T  > t0.975 ;n−2 , where t0.975 ;n−2 is the 97.5th percentile of the t-distribution with n − 2 degrees of freedom x1 − x2

T=

( n1 − 1) s + ( n2 − 1) s22 2 1

n−2

,

(9.4)

1 1 + n1 n2

and xi and si2 are the sample mean and variance, respectively, based on the data from the ith treatment group. The power of T for the second trial is

(

p (θ ) = P T ( y )  > t0.975 ; n−2

)

= 1 −  n−2 ( t0.975 ;n−2|θ ) +  n−2 ( −t0.975 ;n−2|θ ) ,

(9.5)

where

θ=

µ1 − µ2 , 1 1 σ + n1 n2

(9.6)

and  n−2 (| ⋅ θ ) denotes the distribution function of the non-central t-distribution with n − 2 degrees of freedom and the non-centrality parameter θ . Note that p (θ ) = p ( θ ) . Values of p (θ ) as a function of θ is provided in Table 9.1. Replacing θ by its estimate T ( x ) , where T is defined by (9.4), reproducibility probability can be obtained as follows Pˆ = 1 −  n−2 ( t0.975 ;n−2|T ( x ) ) +  n−2 ( −t0.975 ;n−2|T ( x ) ) ,

(9.7)

which is a function of T ( x ) . When T ( x ) > t0.975 ;n−2 ,  1 −  n − 2 ( t0.975 ; n − 2|T ( x ) ) Pˆ ≈    n − 2 ( −t0.975 ; n − 2|T ( x ) )

if T ( x ) > 0 if T ( x ) < 0

(9.8)

Innovative Statistics in Regulatory Science

224

TABLE 9.1 Values of the Power Function p (θ ) in (9.5) Total Sample Size

θ 1.96 2.02 2.08 2.14 2.20 2.26 2.32 2.38 2.44 2.50 2.56 2.62 2.68 2.74 2.80 2.86 2.92 2.98 3.04 3.10 3.16 3.22 3.28 3.34 3.40 3.46 3.52 3.58 3.64 3.70 3.76 3.82 3.88 3.94

10

20

30

40

50

60

100



0.407 0.429 0.448 0.469 0.490 0.511 0.532 0.552 0.573 0.593 0.613 0.632 0.652 0.671 0.690 0.708 0.725 0.742 0.759 0.775 0.790 0.805 0.819 0.832 0.844 0.856 0.868 0.879 0.889 0.898 0.907 0.915 0.923 0.930

0.458 0.481 0.503 0.526 0.549 0.571 0.593 0.615 0.636 0.657 0.678 0.698 0.717 0.736 0.754 0.772 0.789 0.805 0.820 0.834 0.848 0.861 0.873 0.884 0.895 0.905 0.914 0.923 0.931 0.938 0.944 0.950 0.956 0.961

0.473 0.496 0.519 0.542 0.565 0.588 0.610 0.632 0.654 0.675 0.695 0.715 0.735 0.753 0.771 0.788 0.805 0.820 0.835 0.849 0.862 0.874 0.886 0.897 0.907 0.916 0.925 0.933 0.940 0.946 0.952 0.958 0.963 0.967

0.480 0.504 0.527 0.550 0.573 0.596 0.618 0.640 0.662 0.683 0.704 0.724 0.743 0.761 0.779 0.796 0.812 0.827 0.842 0.856 0.868 0.881 0.892 0.902 0.912 0.921 0.929 0.937 0.944 0.950 0.956 0.961 0.966 0.970

0.484 0.508 0.531 0.555 0.578 0.601 0.623 0.645 0.667 0.688 0.708 0.728 0.747 0.766 0.783 0.800 0.816 0.831 0.846 0.859 0.872 0.884 0.895 0.905 0.915 0.924 0.932 0.939 0.946 0.952 0.958 0.963 0.967 0.971

0.487 0.511 0.534 0.557 0.581 0.604 0.626 0.648 0.670 0.691 0.711 0.731 0.750 0.769 0.786 0.803 0.819 0.834 0.848 0.862 0.874 0.886 0.897 0.907 0.917 0.925 0.933 0.941 0.947 0.953 0.959 0.964 0.968 0.972

0.492 0.516 0.540 0.563 0.586 0.609 0.632 0.654 0.676 0.697 0.717 0.737 0.756 0.774 0.792 0.808 0.824 0.839 0.853 0.866 0.879 0.890 0.901 0.911 0.920 0.929 0.936 0.943 0.950 0.956 0.961 0.966 0.970 0.974

0.500 0.524 0.548 0.571 0.594 0.618 0.640 0.662 0.684 0.705 0.725 0.745 0.764 0.782 0.799 0.815 0.830 0.845 0.860 0.872 0.884 0.895 0.906 0.916 0.925 0.932 0.940 0.947 0.953 0.959 0.965 0.969 0.973 0.977

Source: Shao, J. and Chow, S.C., Stat. Med., 21, 1727–1742, 2002.

Reproducible Research

225

If  n−2 is replaced by the normal distribution and t0.975 ;n−2 is replaced by the normal percentile, then (9.8) is the same as that in Goodman (1992) who studied the case where the variance σ 2 is known. Note that Table 9.1 can be used to find the reproducibility probability Pˆ in (9.7) with a fixed sample size n. As an example, if T ( x ) = 2.9 was observed in a clinical trial with n = n1 + n2 = 40, then the reproducibility probability is 0.807. If T ( x ) = 2.9 was observed in a clinical trial with n = 36, then an extrapolation of the results in Table 9.1 (for n = 30 and 40) leads to a reproducibility probability of 0.803. 9.3.2 Two Samples with Unequal Variances Consider the problem of testing hypotheses (9.3) under the two-group parallel design without the assumption of equal variances. That is, xij ’s are independently distributed as N ( µi , σ i2 ), i = 1, 2. When σ 12 ≠ σ 22 , there exists no exact testing procedure for the hypotheses in (9.3). When both n1 and n2 are large, an approximate 5% level test rejects H 0 when T > z0.975 , where T=

x1 − x 2 s12 s22 + n1 n2

(9.9)

.

Since T is approximately distributed as N (θ ,1) with

θ=

µ1 − µ2

(9.10)

σ 12 σ 22 + n1 n2

the reproducibility probability obtained by using the estimated power approach is given by Pˆ = Φ ( T ( x ) − z0.975 ) + Φ ( −T ( x ) − z0.975 ) .

(9.11)

When the variances under different treatments are different and the sample sizes are not large, a different study design, such as a matched-pair parallel design or a 2 × 2 crossover design is recommended. A matched-pair parallel design involves m pairs of matched patients. One patient in each pair is assigned to the treatment group and the other is assigned to the control group. Let xij be the observation from the jth pair and the ith group. It  is assumed that the differences x1 j − x2 j , j = 1,…, m, are independent and identically distributed as N ( µ1 − µ2 ,σ D2 ). Then, the null hypothesis H 0 is rejected at the 5% level of significance if T > t0.975 ;m−1, where

T=

(

m x1 − x 2

σˆ D

)

(9.12)

Innovative Statistics in Regulatory Science

226

and σ D2 is the sample variance based on the differences x1 j − x2 j , j = 1,…, m. Note that T has the non-central t-distribution with m − 1 degrees of freedom and the non-centrality parameter m ( µ1 − µ2 ) . σD

θ=

(9.13)

Consequently, the reproducibility probability obtained by using the estimated power approach is given by (9.7) with T defined in (9.12) and n − 2 replaced by m − 1. Suppose that the study design is a 2  ×  2 cross-over design in which n1 patients receive the treatment at the first period and the placebo at the second period and n2 patients receive the placebo at the first period and the treatment at the second period. Let xlij be the normally distributed observation from the jth patient at the ith period and lth sequence. Then the treatment effect µD can be unbiasedly estimated by

µˆ D =

x11 − x12 − x21 + x22 2

 σ 2  1 1  ~ N  µD , D  +   , 4  n1 n2   

where x ij is the sample mean based on xlij , j = 1,…, nl and σ D2 = var ( xl1 j − xl 2 j ) . An unbiased estimator of σ D2 is

σˆ D2 =

1 n1 + n2 − 2

2

nl

∑∑(x

l1 j

− xl 2 j − x l1 + x l 2 )2 ,

l =1 j =1

which is independent of µˆ D and distributed as σ D2 / ( n1 + n2 − 2 ) times the chi-square distribution with n1 + n2 − 2 degrees of freedom. Thus, the null hypothesis H 0 : µD = 0 is rejected at the 5% level of significance if T > t0.975 ;n−2 , where n = n1 + n2 and T=

σˆ D 2

µˆ D . 1 1 + n1 n2

(9.14)

Note that T has the non-central t-distribution with n − 2 degrees of freedom and the non-centrality parameter

θ=

σD 2

µD . 1 1 + n1 n2

(9.15)

Reproducible Research

227

Consequently, the reproducibility probability obtained by using the estimated power approach is given by (9.7) with T defined by (9.14). 9.3.3 Parallel-Group Designs Parallel-group designs are often adopted in clinical trials to compare more than one treatment with a placebo control or to compare one treatment, one placebo control and one active control. Let a ≥ 3 be the number of groups and xij be the observation from the jth patient in the ith group, j = 1,…, ni , i = 1,…, a. Assume that xij ’s are independently distributed as N µi ,σ 2 . The  null hypothesis H 0 is then

(

)

H 0 : µ1 = µ2 = … = µ a which is rejected at the 5% level of significance if T > F0.95 ; a−1,n− a , where F0.95 ; a−1,n− a is the 95th percentile of the F-distribution with a  −  1 and n  −  a degrees of freedom, n = n1 +…+ na T=

SST / ( a − 1) , SSE / ( n − a )

(9.16)

a

SST =

∑n (x − x) , i

i

2

i =1

a

SSE =

ni

∑∑(x

ij

− x i )2 ,

i =1 j =1

x i is the sample mean based on the data in the ith group, and x is the overall sample mean. Note that T has the non-central F-distribution with a − 1 and n − a degrees of freedom and the non-centrality parameter

θ=



ni ( µi − µ )2 , i =1 σ2 a

(9.17)

where µ = ∑ ia=1 ni µi/n. Let a−1,n− a (⋅|θ ) be the distribution of T. Then, the power of the second clinical trial is

(

)

P T ( y ) > F0.95 ; a−1,n− a = 1 − a−1,n− a ( F0.95 ; a−1,n− a |θ ). Thus, the reproducibility probability obtained by using the estimated power approach is Pˆ = 1 − a−1,n− a ( F0.95 ; a−1,n− a|T ( x ) ) , where T ( x ) is the observed T based on the data x from the first clinical trial.

Innovative Statistics in Regulatory Science

228

9.4 Alternative Methods for Evaluation of Reproducibility Probability Since Pˆ in (9.7) or (9.11) is an estimated power, it provides a rather optimistic result. Alternatively, we may consider a more conservative approach, which considers a 95% lower confidence bound of the power as the reproducibility probability. In addition, we may also consider Bayesian approach for evaluation of reproducibility probability. 9.4.1 The Confidence Bound Approach Consider first the case of the two-group parallel design with a common unknown variance σ 2. Note that T ( x ) defined by (9.4) has the noncentral t-distribution with n − 2 degrees of freedom and the non-centrality parameter θ given by (9.6). Let  n−2 (⋅|θ ) be the distribution function of T ( x ) for any given θ . It can be shown that  n−2 (t|θ ) is a strictly decreasing function of θ for any fixed t. Consequently, a 95% confidence interval for θ is θˆ− ,θˆ+ , where θˆ− is the unique solution of  n−2 ( T ( x )|θ ) = 0.975 and θˆ+ is the unique solution of  n−2 ( T ( x )|θ ) = 0.025. Then a 95% lower confidence bound for θ is

(

)

 θˆ−  θˆ = −θˆ+  0 

if θˆ− > 0 if θˆ+ < 0 if θˆ− ≤ 0 ≤ θˆ+

(9.18)

and a 95% lower confidence bound for the power p (θ ) in (9.5) is

(

)

Pˆ− = 1 −  n−2 t0.975 ;n−2|θ − +  n−2 −t0.975 ;n−2|θ −

(

)

(9.19)

If θ − > 0 and Pˆ− = 0 if θ − = 0. The lower confidence bound in (9.19) is useful when the clinical result from the first trial is highly significant. To provide a better understanding, values of the lower confidence bound θ − corresponding to |T(x)| values ranging from 4.5 to 6.5 are summarized in Table 9.2. If 4.5 ≤ T ( x ) ≤ 6.5 and the value of θɵ − is found from Table 9.2, the reproducibility probability Pˆ− (9.19) can be obtained from Table 9.1. For example, suppose that T ( x ) = 5 was observed from a clinical trial with n = 30. From Table  9.2, θˆ − = 2.6. Then, by Table  9.1, Pˆ− = 0.709. Consider the two-group parallel design with unequal variances σ 12 and σ 22. When both n1 and n2 are large, T given by (9.9) is approximately distributed as N (θ ,1) with θ given by (9.10). Hence, the reproducibility probability obtained by using the lower confidence bound approach is given by

Reproducible Research

229

TABLE 9.2 95% Lower Confidence Bound θˆ



Total Sample Size T ( x) 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5

10

20

30

40

50

60

100



1.51 1.57 1.64 1.70 1.76 1.83 1.89 1.95 2.02 2.08 2.14 2.20 2.26 2.32 2.39 2.45 2.51 2.57 2.63 2.69 2.75

2.01 2.09 2.17 2.25 2.33 2.41 2.48 2.56 2.64 2.72 2.80 2.88 2.95 3.03 3.11 3.19 3.26 3.34 3.42 3.49 3.57

2.18 2.26 2.35 2.43 2.52 2.60 2.69 2.77 2.86 2.95 3.03 3.11 3.20 3.28 3.37 3.45 3.53 3.62 3.70 3.78 3.86

2.26 2.35 2.44 2.53 2.62 2.71 2.80 2.88 2.97 3.06 3.15 3.21 3.32 3.41 3.50 3.59 3.67 3.76 3.85 3.93 4.02

2.32 2.41 2.50 2.59 2.68 2.77 2.86 2.95 3.04 3.13 3.22 3.31 3.40 3.49 3.58 3.67 3.76 3.85 3.94 4.03 4.12

2.35 2.44 2.54 2.63 2.72 2.81 2.91 3.00 3.09 3.18 3.27 3.36 3.45 3.55 3.64 3.73 3.82 3.91 4.00 4.09 4.18

2.42 2.52 2.61 2.71 2.80 2.90 2.99 3.09 3.18 3.28 3.37 3.47 3.56 3.66 3.75 3.85 3.94 4.03 4.13 4.22 4.32

2.54 2.64 2.74 2.84 2.94 3.04 3.14 3.24 3.34 3.44 3.54 3.64 3.74 3.84 3.94 4.04 4.14 4.24 4.34 4.44 4.54

Source: Shao, J. and Chow, S.C., Stat. Med., 21, 1727–1742, 2002.

(

Pˆ− = Φ T ( x ) − 2 z0.975

)

with T defined by (9.9). For  the matched-pair parallel design, T given by (9.12) has the noncentral t-distribution with m − 1 degrees of freedom and the non- centrality parameter θ given by (9.13). Hence, the reproducibility probability obtained by using the lower confidence bound approach is given by (9.19) with T defined by (9.12) and n  −  2 replaced by m  −  1. Suppose now  that the study design is the 2 × 2 cross-over design. Since T defined by (9.14) has the non-central t-distribution with n − 2 degrees of freedom and the non-centrality parameter θ given by (9.15), the reproducibility probability obtained by using the lower confidence bound approach is given by (9.19) with T defined by (9.14). Finally, consider the parallel-group design, since T in (9.16) has the noncentral F-distribution with a − 1 and n − a degrees of freedom and the noncentrality parameter θ given by (9.17) and a−1,n− a (t|θ ) is a strictly decreasing

Innovative Statistics in Regulatory Science

230

function of θ , the reproducibility probability obtained by using the lower confidence bound approach is Pˆ− = 1 − a−1,n− a ( F0.95 ; a−1,n− a |θˆ− ), where θˆ− is the solution of a−1,n− a ( T ( x )|θ ) = 0.95. 9.4.2 The Bayesian Approach Shao and Chow (2002) studied how to evaluate the reproducibility probability using Equation  (9.1) under several study designs. When the reproducibility probability is used to provide an evidence of the effectiveness of a drug product, the estimated power approach may produce a rather optimistic result. A more conservative approach is to define the reproducibility probability as a lower confidence bound of the power of the second trial. Alternatively, a more sensible definition of reproducibility probability can be obtained by using the Bayesian approach. Under the Bayesian approach, the unknown parameter θ is a random vector with a prior distribution π (θ ) assumed to be known. Thus, the reproducibility probability can be defined as the conditional probability of|T|> c in the future trial, given the data set x observed from the previous trial, that is,



P( T > c|x) = P( T > c|θ )π (θ |x)dθ , where T = T ( y) is based on the data set y from the future trial and π (θ |x) is the posterior density of θ , given x. In practice, the reproducibility probability is useful when the clinical trials are conducted sequentially. It provides important information for regulatory agencies in deciding whether it is necessary to require the second clinical trial when the result from the first clinical trial is strongly significant. Note that power calculation for required sample size for achieving a desired reproducibility probability at a pre-specified level of significance can be performed with appropriate selection of prior. As discussed in Section 9.3, the reproducibility probability can be viewed as the posterior mean of the power function p (θ ) = P( T > c|θ ) for the future trial. Thus, under the Bayesian approach, it is essential to construct the posterior density π(θ|x) in formula (9.2), given the data set x observed from the previous trial(s). Consider first the two-group parallel design with equal variances, that is, xij ’s are independent and normally distributed with means µ1 and µ2 and a common variance σ 2. If σ 2 is known, then the power for testing hypotheses in (9.3) is Φ (θ − z0.975 ) + Φ ( −θ − z0.975 )

Reproducible Research

231

with θ defined by (9.6). A  commonly used prior for ( µ1 , µ2 ) is the noninformative prior π ( µ1 , µ2 ) ≡ 1. Consequently, the posterior density for θ is N ( T ( x ) ,1) , where T=

x1 − x 2 1 1 σ + n1 n2

and the posterior mean given by (9.2) is



 T ( x ) − z0.975  Φ (θ − z0.975 ) + Φ ( −θ − z0.975 )  π (θ|x ) dθ = Φ   2    −T ( x ) − z0.975  + Φ . 2  

When T ( x ) > z0.975 , this probability is nearly the same as  T ( x ) − z0.975  , Φ   2   which is exactly the same as that in formula (9.1) in Goodman (1992). When σ 2 is unknown, a commonly used non-informative prior for 2 σ is the Lebesgue (improper) density π (σ 2 ) = σ −2 . Assume that the pri2 ors for µ1 , µ2, and σ 2 are independent. The  posterior density for (δ , u ) is π (δ |u2 , x)π (u2 |x), where

µ1 − µ2

δ=

( n1 − 1) s + ( n2 − 1) s22 2 1

n−2

u2 =

, 1 1 + n1 n2

( n − 2 )σ 2 , ( n1 − 1) s12 + ( n2 − 1) s22

1  δ − T (x)  π δ|u2 , x = φ  , u u  

(

)

in which φ is the density function of the standard normal distribution, T is given by (9.4), and π u2|x = f ( u ) with

(

)

Innovative Statistics in Regulatory Science

232

−1

  n − 2   n − 2  f ( u ) = Γ       2   2 

( n− 2 )/2

− n− 2 /2 u2 u− n e ( ) .

Since θ in (9.6) is equal to δ /u, the posterior mean of p ( θ ) in (9.5) is Pˆ =



  

 δ   δ − T (x)   p  φ   dδ  2 f ( u ) du, u −∞  u    

∫ ∫ 0



(9.20)

which is the reproducibility probability under the Bayesian approach. It is clear that Pˆ depends on the data x through the function T ( x ) . The  probability Pˆ in (9.20) can be evaluated numerically. A  Monte Carlo method can be applied as follows. First, generate a random variate γ j from the gamma distribution with the shape parameter ( n − 2 ) / 2 and the scale parameter 2 / ( n − 2 ), and generate a random variate δ j from N T ( x ) , uj2 , where u2j = γ j−1. Repeat this process independently N times to obtain (δ j , u2j ), j = 1,…, N . Then Pˆ in (9.20) can be approximated by

(

)

1 P˘N = 1 − N



   δ  δ   n − 2  t0.975 ; n − 2 j  −  n − 2  −t0.975 ; n − 2 j  .   j =1  uj  uj      

N

(9.21)

Values of PˆN for N = 10, 000 and some selected values of T ( x ) and n are given in Table 9.3. As it can be seen from Table 9.3, in assessing reproducibility, the Bayesian approach is more conservative than the estimated power approach, but less conservative than the confidence bound approach. Consider the two-group parallel design with unequal variance and large nj ’ s . The approximate power for the second trial is p (θ ) = Φ (θ − z0.975 ) + Φ ( −θ − z0.975 ) , where

µ1 − µ2

θ=

σ 12 σ 22 + n1 n2

.

Suppose that we use the non-inferiority prior density

(

)

π µ1 , µ2 , σ 12 , σ 22 = σ 1−2σ 2−2 , σ 12 > 0,σ 22 > 0.

(

) ( −1

)

−1

Let τ i2 = σ i−2 , i = 1, 2 and ξ 2 = n1τ 12 + n2τ 22 . Then, the posterior density π ( µ1 − µ2 |τ 12 ,τ 22 , x) is the normal density with mean x1 − x2 and variance ξ 2 and the posterior density π (τ 12 ,τ 22 |x) = π (τ 12 |x)π (τ 22 |x), where π (τ i2 |x) is the

Reproducible Research

233

TABLE 9.3 Reproducibility Probability Under the Bayesian Approach Approximated by Monte Carlo Simulation Total Sample Size T ( x)

10

20

30

40

50

60

100



2.02 2.08 2.14 2.20 2.26 2.32 2.38 2.44 2.50 2.56 2.62 2.68 2.74 2.80 2.86 2.92 2.98 3.04 3.10 3.16 3.22 3.28 3.34 3.40 3.46 3.52 3.58 3.64 3.70 3.76 3.82 3.88 3.94

0.435 0.447 0.466 0.478 0.487 0.505 0.519 0.530 0.546 0.556 0.575 0.591 0.600 0.608 0.629 0.636 0.649 0.663 0.679 0.690 0.701 0.708 0.715 0.729 0.736 0.745 0.755 0.771 0.778 0.785 0.795 0.800 0.806

0.482 0.496 0.509 0.529 0.547 0.558 0.576 0.585 0.609 0.618 0.632 0.647 0.660 0.675 0.691 0.702 0.716 0.726 0.738 0.754 0.762 0.773 0.784 0.793 0.806 0.816 0.828 0.833 0.839 0.847 0.857 0.869 0.873

0.495 0.512 0.528 0.540 0.560 0.577 0.590 0.610 0.624 0.638 0.654 0.665 0.679 0.690 0.706 0.718 0.735 0.745 0.754 0.767 0.777 0.793 0.803 0.815 0.826 0.834 0.841 0.854 0.861 0.867 0.878 0.881 0.890

0.501 0.515 0.530 0.547 0.564 0.580 0.597 0.611 0.631 0.647 0.655 0.674 0.685 0.702 0.716 0.730 0.742 0.753 0.766 0.776 0.790 0.804 0.809 0.819 0.832 0.843 0.849 0.859 0.867 0.874 0.883 0.891 0.897

0.504 0.519 0.535 0.553 0.567 0.581 0.603 0.613 0.634 0.648 0.657 0.675 0.686 0.705 0.722 0.733 0.744 0.756 0.771 0.781 0.792 0.806 0.812 0.829 0.837 0.845 0.857 0.863 0.870 0.882 0.889 0.896 0.904

0.508 0.523 0.543 0.556 0.571 0.587 0.604 0.617 0.636 0.650 0.664 0.677 0.694 0.712 0.723 0.738 0.748 0.759 0.776 0.786 0.794 0.809 0.818 0.830 0.839 0.846 0.859 0.865 0.874 0.883 0.891 0.899 0.907

0.517 0.532 0.549 0.565 0.577 0.590 0.610 0.627 0.640 0.658 0.675 0.687 0.703 0.714 0.729 0.742 0.756 0.765 0.779 0.792 0.804 0.820 0.828 0.838 0.847 0.855 0.867 0.872 0.884 0.890 0.898 0.904 0.910

0.519 0.536 0.553 0.569 0.585 0.602 0.618 0.634 0.650 0.665 0.680 0.695 0.710 0.724 0.738 0.752 0.765 0.778 0.790 0.802 0.814 0.825 0.836 0.846 0.856 0.865 0.874 0.883 0.891 0.898 0.906 0.913 0.919

Source: Shao, J. and Chow, S.C., Stat. Med., 21, 1727–1742, 2002. Note: Prior for µ1 , µ2 , σ −2 = σ −2 with respect to the Lebesgue measure.

(

)

Innovative Statistics in Regulatory Science

234

gamma density with the shape parameter ( ni − 1) / 2 and the scale parameter 2 ( ni −1)s2  , i = 1, 2. Consequently, the reproducibility probability is the posterior 

j



mean of p (θ ) given by

  x1 − x 2 z0.975   x1 − x 2 z0.975   − − Pˆ = Φ   + Φ−   π (ξ|x ) dξ , 2  2ξ 2     2ξ 



(

)

where π (ξ |x) is the posterior density of ξ constructed using π τ i2|x , i = 1, 2. The  Monte Carlo method previously discussed can be applied to approximate Pˆ . Note that reproducibility probability under the Bayesian approach can be similarly obtained for the matched-pairs parallel design and the 2 × 2 crossover design described in the previous section. Finally, consider the a −group parallel design, where the power is given by p (θ ) = 1 − a−1,n− a ( F0.95 ; a−1,n− a |θ ). with θ given by (9.17). Under the non-informative prior

(

)

π µ1 ,…, µ a ,σ 2 = σ −2 ,σ 2 > 0

(

)

The  posterior density π θ|τ 2 , x , where τ 2 =

SSE ( n− a )σ 2   

, is the density of the

non-central chi-square distribution with a − 1 degrees of freedom and the  non-centrality parameter τ 2 ( a − 1) T ( x ). The  posterior density π (σ 2 |x) is the gamma distribution with the shape parameter ( n − a ) / 2 and the scale parameter n2− a . Consequently, the reproducibility probability under the Bayesian approach is Pˆ =



  

∫ ∫ 0



0

 p (θ ) π θ|τ 2 , x dθ  π τ 2|x dτ 2 . 

(

)

(

)

The reproducibility probability based on the Bayesian approach depends on the choice of the prior distributions. The  non-informative prior we choose produces a more conservative reproducibility probability than that obtained using the estimated power approach, but is less conservative than that under the confidence bound approach. If a different prior such as an informative prior is used, a sensitivity analysis may be performed to evaluate the effects of different priors on the reproducibility probability.

Reproducible Research

235

9.5 Applications 9.5.1 Substantial Evidence with a Single Trial An important application of the concept of reproducibility discussed in the previous sections is to address the following question: Is it necessary to conduct a second clinical trial when the first trial produces a relatively strong significant clinical result (for example, a relatively small p-value is observed), assuming that other factors (such as consistent results between centers, discrepancies related to gender, race and other factors, and safety issues) have been satisfactory addressed?

As mentioned in Section 9.1, the FDA Modernization Act of 1997 includes a provision (Section 115 of FDAMA) to allow data from one adequate and well-controlled clinical trial investigation and confirmatory evidence to establish effectiveness for risk/benefit assessment of drug and biological candidates for approval. This provision essentially codified an FDA policy that had existed for several years but whose application had been limited to some biological products approved by the Center for Biologic Evaluation and Research (CBER) of the FDA  and a few pharmaceuticals, especially orphan drugs such as zidovudine and lamotrigine. A relatively strong significant result observed from a single clinical trial (say, p-value is less than 0.001) would have about 90% chance of reproducing the result in future clinical trials. Consequently, a single clinical trial is sufficient to provide substantial evidence for demonstration of efficacy and safety of the medication under study. In  1998, the FDA  published a guidance which shed the light on this approach despite the fact that the FDA  has recognized that advances in sciences and practice of drug development may permit an expanded role for the single controlled trial in contemporary clinical development (FDA, 1988). Suppose it is agreed that the second trial is not needed if the probability for reproducing a significant clinical result in the second trial is equal to or higher than 90%. If a significant clinical result is observed in the first trial and the confidence bound Pˆ is equal to or higher than 90%, then we have 95% statistical assurance that, with a probability of at least 90%, the significant result will be reproduced in the second trial. As an example, under the two-group parallel design with a common unknown variance and n = 40, the 95% lower confidence bound Pˆ given in (9.19) is equal to or higher than 90% if and only if T ( x ) ≥ 5.7 , that is, the clinical result in the first trial is highly significant. Alternatively, if the Bayesian approach is applied to the same situation, the reproducibility probability in (9.20) is equal to or higher than 90% if and only if T ( x ) ≥ 3.96 .

Innovative Statistics in Regulatory Science

236

9.5.2 Sample Size When the reproducibility probability based on the result from the first trial is not higher than a desired level, the second trial must be conducted in order for obtaining substantial evidence of safety and effectiveness of the test drug under investigation. The results on the reproducibility probability discussed in the previous sections can be used to adjust the sample size for the second trial. If the sample size for the first trial was determined based on a power analysis with some initial guessing values of the unknown parameters, then it is reasonable to make a sample size adjustment for the second trial based on the results from the first trial. If the reproducibility probability is lower than a desired power level of the second trial, then the sample size should be increased. On the other hand, if the reproducibility probability is higher than the desired power level of the second trial, then the sample size may be decreased to reduce costs. In the following we illustrate the idea using the two-group parallel design with a common unknown variance. Suppose that Pˆ in (9.7) is used as the reproducibility probability when T(x) 2

given in (9.4) is observed from the first trial. Let σ =

( n1 −1)s12 +( n2 −1)s22  n− 2 *

. For sim-

plicity, consider the case where the same sample size n /2 is used for two treatment groups in the second trial, where n* is the total sample size in the second trial. With fixed x i and σˆ 2 but a new sample size n* , the T-statistic becomes T* =

(

n* x 1 − x 2

)



and the reproducibility probability is Pˆ with T replaced by T *. By letting T * be the value to achieve a desired power, the new sample size n* should be 2

 T*   1 1  n* =   /  . + T 4 n 4 n2     1

(9.22)

For example, if the desired reproducibility probability is 80%, then T * needs to be 2.91 (Table  9.1). If T = 2.58 is observed in the first trial with n = 30 (n= n= 15), then n* ≈ 1.27n ≈ 38 according to (9.22), that is, the sample size 1 2 should be increased by about 27%. On the other hand, if T = 3.30 is observed in the first trial with n = 30 (n= n= 15), then n* ≈ 0.78n ≈ 24, that is, the sam1 2 ple size can be reduced by about 22%. 9.5.3 Generalizability between Patient Populations In  clinical development, after the investigational drug product has been shown to be effective and safe with respect to a target patient population (for example, adults), it is often of interest to study a similar but different patient

Reproducible Research

237

population (for example, elderly patients with the same disease under study or a patient population with different ethnic factors) to see how likely the clinical result is reproducible in the different population. This information is useful in regulatory submission for supplement new drug application (SNDA) (for example, when generalizing the clinical results from adults to elderly patients) and regulatory evaluation for bridging studies (for example, when generalizing clinical results from Gaussian to Asian patient population). For this purpose, we propose to consider the generalizability probability, which is the reproducibility probability with the population of a future trial slightly deviated from the population of the previous trial(s). Consider a parallel-group design for two treatments with population means µ1 and µ2 and an equal variance σ 2. Other designs can be similarly treated. Suppose that in the future trial, the population mean difference is changed to µ1 − µ2 + ε and the population variance is changed to C 2σ 2 , where C > 0. The signal-to-noise ratio for the population difference in the previous trial is µ1 − µ2 / σ , whereas the signal-to-noise ratio for the population difference in the future trial is

µ1 − µ2 + ε ∆( µ1 − µ2 ) = , σ Cσ where ∆=

1 + ε / ( µ1 − µ2 ) C

(9.23)

is a measure of change in the signal-to-noise ratio for the population difference. For most practical problems, ε < µ1 − µ2 and, thus, ∆ > 0. Table 9.4 gives an example on the effects of changes of ε and C on ∆. If the power for the previous trial is p (θ ), then the power for the future trial is p ( ∆θ ). Suppose that ∆ is known. Under the frequentist approach, the generalizability probability is Pˆ∆ , which is Pˆ given by (9.7) with T(x) replaced by ∆T ( x ), or Pˆ∆− , which is Pˆ− given in (9.19) with θɵ − replaced by ∆ θɵ − . Under the Bayesian approach, the generalizability probability is Pˆ∆ , which is Pˆ given by (9.20) with p(δ /u) replaced by p(∆δ /u). When the value of ∆ is unknown, we may consider a set of ∆-values to carry out a sensitivity analysis. An example is given as follows. A double-blind randomized trial was conducted in patients with schizophrenia for comparing the efficacy of a test drug with a standard therapy. A  total of 104 chronic schizophrenic patients participated in this study. Patients were randomly assigned to receive the treatment of the test drug or the standard therapy for at least one year, where the test drug group has 56 patients and the standard therapy group has 48 patients. The primary clinical endpoint of this trial was the total score of Positive and Negative Symptom Scales (PANSS). No significant differences in demographics and baseline

Innovative Statistics in Regulatory Science

238

TABLE 9.4 Effects of Changes in Mean and Standard Deviation εε / ( µ1 − µ 2 ) 1 − δ ). Similar procedures for testing the consistency can be derived to suit this case. For simplicity, we do not give the details here.

11.3.3 Evaluation of Sensitivity Index In a MRCT, denote the entire population (all regions combined) by ( µ0 , σ 0 ) , where µ0 and σ 0 are the population mean and population standard deviation, respectively. Similarly, denote the sub-population (a or some specific regions) by ( µ1 , σ 1 ) . Since the two populations are similar but slightly different, it is reasonable to assume that µ1 = µ0 + ε and σ 1 = Cσ 0 (C > 0), where ε is referred to as the shift in location parameter (population mean) and C is the inflation factor of the scale parameter (population standard deviation). Thus, the (treatment) effect size adjusted for standard deviation of the subpopulation ( µ1 , σ 1 ) can be expressed as follows:

δ1 =

µ1 µ +ε µ = ∆ 0 = ∆ δ0 , = 0 σ1 Cσ 0 σ0

where ∆ = 1+εC/µ0 and δ 0 is the effect size adjusted for standard deviation of the entire population ( µ0 , σ 0 ) . Δ is referred to as a sensitivity index measuring the change in effect size between the sub-population and the entire population (Chow et al., 2002). As it can be seen, if ε = 0 and C = 1, then δ 1 = δ 0 . That is, the effect sizes of the two populations are identical. Applying the concept of bioequivalence assessment, we can claim that the effect sizes of the two patient populations are consistent if the confidence interval of |Δ| is within (80%, 120%). In practice, the shift in location parameter (ε ) and/or the change in scale parameter (C) could be random. If both ε and C are fixed, the sensitivity index can be assessed based on the sample means and sample variances obtained from the entire population and the sub-population. As indicated

Innovative Statistics in Regulatory Science

270

by Chow et al. (2002), ε and C can be estimated by εˆ = µˆ1 − µˆ 0 and Cɵ = σɵ 1 / σɵ 0 , respectively, where ( µˆ 0 , σˆ 0 ) and ( µˆ1 , σˆ1 ) are some estimates of ( µ0 , σ 0 ) and ( µ1 , σ 1 ) based on the entire population and the sub-population, respectively. ˆ ˆ Consequently, the sensitivity index ∆ can be estimated by ∆ˆ = 1+εCˆ/µ0 and the corresponding confidence interval can be obtained based on normal approximation (Chow et al., 2002; Lu et al., 2017). In real world problems, however, ε and C could be either fixed or random variables. In other words, there are three possible scenarios: (i) the case where ε is random and C is fixed, (ii) the case where ε is fixed and C is random, and (iii) the case where both ε and C are random. Statistical inference of ∆ under each of these possible scenarios has been studied by Lu et al. (2017). Besides, we also give a simplified version: if ∆ˆ is within (80%, 120%), the consistency is claimed. For superiority test, the criterion can be ∆ˆ > 80%.

11.3.4 Achieving Reproducibility and/or Generalizability Alternatively, Shao and Chow (2002) proposed the concept of reproducibility and generalizability probabilities for assessing consistency between specific sub-populations and the global population. If the influence of the ethnic factors is negligible, then we may consider the reproducibility probability to determine whether the clinical results observed in a patient population is reproducible in a different patient population. If there is a notable ethnic difference, the concept of generalizability probability can be used to determine whether the clinical results observed in a patient population can be generalized to a similar but slightly different patient population with notable differences in ethnic factors. We proceed from the original definitions of reproducibility and generalizability probabilities, and proposed three modified methods for normal-based test procedure and power calculation. Versions for t-distribution based test procedure and power calculation can similarly be derived. 11.3.4.1 Specificity Reproducibility Probability for Inequality Test Suppose we are concerned with the null hypothesis H 0 : µ1 = µ2 and alternative hypothesis H a : µ1 ≠ µ2, where µ1 and µ2 are means of a region and the global area. Denote the variances of the region and the global area by σ 12 and σ 22, respectively. Consider the case without the assumption of equal variance. When both sample sizes n1 and n2 are large, an approximate α level test rejects H0 when T > z1−α /2, where T=

x1 − x2  s12 s12   +   n1 n2 

Consistency Evaluation

271

and z1−α /2 is the 1 − α / 2 quantile of the standard normal distribution. Since T is approximately distributed as N (θ ,1) with

θ=

µ1 − µ2  σ 12 σ 22  +    n1 n1 

,

with a pre-specified level of β (say, 0.2), we get the two-sided 1− β confidence interval for θ as (θɵ − − θɵ + ) = ( T − z1− β , T + z1− β ). Then we get 2 2 two  1types of one-sided 1− β confidence upper bound for θ , denoted 2 as θɵ + and θɵ +: the first is max { θɵ − , θɵ + }; the second is the non-negative value v satisfying P { θ ≤ v|T} = 1 − β , which is equivalent to the equation Φ ( v − T ) − Φ ( −v − T ) = 1 − β , with Φ(⋅) being the distribution function of the standard normal distribution. We define specificity reproducibility probability as j   P  T ≤ z α|θɵ  , j = 1, 2, 1− + 2   j j which is equal to Φ ( z1− α2 − θɵ + ) − Φ ( − z1− α − θɵ + ). Compare the specificity 2 reproducibility probability with a pre-specified constant (say, 0.8). If the specificity reproducibility probability is larger, we conclude that the consistency holds.

11.3.4.2 Superiority Reproducibility Probability With respect to the superiority test with the case of the larger the better, consider null hypothesis H 0 : µ1 ≤ q * µ2 and alternative hypothesis H a : µ1 > q * µ2, where q is a pre-specified constant (say 0.5). The notations here have the same meanings as those in the last paragraph. We have T=

x1 − qx2  s12 q2 s12   +  n2   n1

and

θ=

µ1 − qµ2  σ 12 q2σ 22  +   n1   n1

.

Innovative Statistics in Regulatory Science

272

The rejection region is {T > z1−α }. The one-sided 1− β confidence lower bound for θ , denoted as θɵ − , is T − z1− β . The superiority reproducibility probability is defined as

{

} (

)

P T > z1−α|θɵ − = Φ θɵ − − z1−α = Φ ( T − z1− β − z1−α ) . Compare the superiority reproducibility probability with a pre-specified constant (say, 0.8). If the superiority reproducibility probability is larger, we conclude that the consistency holds.

11.3.4.3 Reproducibility Probability Ratio for Inequality Test Suppose we have the null hypothesis H 0 : µ1 = µ2 and alternative hypothesis H a : µ1 ≠ µ2, where µ1 and µ2 are means of the treatment and the reference for a specific region (or the global area). Denote by σ 12 and σ 22 the variances of the treatment and the reference for a specific region (or the global area). Consider the case without the assumption of equal variance. When both sample sizes n1 and n2 are large, an approximate α level test rejects H0 when T > z1−α /2, where T=

x1 − x2  s12 s12   +   n1 n2 

and z1−α /2 is the 1 − α/2 quantile of the standard normal distribution. Since T is approximately distributed as N (θ ,1) with

θ=

µ1 − µ2  σ 12 σ 22  +    n1 n1 

,

the reproducibility probability by using the estimated power approach is given by  = Φ ( T − z1−α /2 ) = Φ ( −T − z1−α /2 ) . P  S and P  G as the reproducibility probabilities for a region and Denote P S / P  G ≥ q where q is a pre-specified conthe global area, respectively. If P stant, then the consistency of the region’s results and the global results is concluded.

Consistency Evaluation

273

In  addition, the reproducibility probability using the confidence bound approach can be employed, as suggested in the literature by Shao and Chow (2002), with the same criterion of consistency test as seen above being applicable. 11.3.4.4 Reproducibility Probability Ratio for Superiority Test Using techniques similar to the reproducibility probability ratio for inequality test, we can also consider the ratio for superiority test (with the case of the larger the better): H 0 : µ1 ≤ µ2 versus H a : µ1 > µ2. After obtaining the ratios for the region and the global area, the same criterion for testing consistency applies. 11.3.5 Bayesian Approach For testing consistency or similarity between clinical results observed from a sub-population and those obtained from the global population, Chow and Hsiao (2010) proposed the following Bayesian approach via comparing the following hypotheses H 0 : ∆ ≤ 0 vs. H a : ∆ > 0, where ∆ = µS − µG , in which µS and µG are the population means of the subpopulation and the global population, respectively. Let Xi and Yj , i = 1, 2,…, n and j = 1, 2,…, N , be some efficacy responses observed in the sub-population and the global population, respectively. For  simplicity, assume that both Xi s and Yj s are normally distributed with known variance σ 2 . When σ 2 is unknown, it can generally be estimated by the sample variance. Thus, ∆ can be estimated by  = x − y, ∆ where x = ∑ ni =1 xi / n and y = ∑ Nj =1 y j / N . Under the following mixed prior information for ∆

π = γπ 1 + ( 1 − γ ) π 2 , which is a weighted average of two priors, where π 1 ≡ c is a non-informative prior and π 2 is a normal prior with mean θ 0 and variance σ 02, and γ is defined as the weight with 0 ≤ γ ≤ 1. For the consistency test between a specific region and  from all the global area, the choice of (θ 0 , σ 02 ) can be derived from the values of ∆  regions except the specific region. Thus, the marginal density of ∆ is given by

( )

 = γ + (1 − γ ) m ∆

(

1

2π σ 02 + σ

2

)

 − θ 0 )2   (∆ exp − 2 , 2  2(σ 0 + σ 

Innovative Statistics in Regulatory Science

274

where σ = σ 2 ( n1 + N1 ) . Given the clinical data and prior distribution, the posterior distribution of ∆ is 2

( )

 = m ∆|∆

1  m ∆

( )

 γ 

+ (1 − γ )

 )2   (∆ − ∆ 1 exp  −  2 2π σ 2σ  

 )2    ( ∆ − θ 0 )2 ( ∆ − ∆ 1 exp − −  .  2 2 2σ 0 2πσ 0 σ 2σ   

Thus, given the data and prior information, similarity (consistency) on efficacy in terms of a positive treatment effect for the sub-population can be concluded if the posterior probability of consistency (similarity) is ∞

∫( )

 d∆ > 1 − α . pc = P ( µS − µG > 0|clinical data and prior ) = π ∆|∆ 0

For some pre-specified 0 < α < 0.5 . In practice, α is determined such that the sub-population is generally smaller than 0.2 to ensure that posterior probability of consistency is at least 80%. Based on discussion given in Chow and Hsiao (2010), the marginal density  can be re-expressed as of ∆  − θ 0 )2   (∆ exp − . 2 2 2π σ 02 + 2σ 2 / N  2(σ 0 + 2σ / N ) 

( )

1

 = γ + (1 − γ ) m ∆

(

)

As a result, the posterior distribution of ∆ is therefore given by

( )

 = m ∆|∆

1  m ∆

( )

 γ 

+ (1 − γ )

 )2   (∆ − ∆ exp  − 2  4πσ / N  4σ / N  1

2

 )2    ( ∆ − θ 0 )2 ( ∆ − ∆ exp  − −  2σ 02 4σ 2 / N   8πσ 02σ 2 / N  1

Consequently, we have ∞

pc =

∫ π ( ∆|∆ˆ ) d∆ > 1 − α . 0

Consistency Evaluation

275

11.3.6 Japanese Approach In  a multi-regional trial, to establish the consistency of treatment effects between the Japanese population and the entire population under study, the Japanese MHLW suggested evaluating the relative treatment effect (i.e., treatment effect observed in Japanese population as compared to that of the entire population) to determine whether the Japanese population has achieved a desired assurance probability for consistency. Let δ = µT − µ P be the treatment effect of the test treatment under investigation. Denote by δ J and δ All the treatment effect of the Japanese population and treatment effect of the entire population (all regions), respectively. Let δɵ J and δɵ All be the corresponding estimators. For a multi-regional clinical trial, given δ , sample size ratio (i.e., the size of Japanese population as compared to the entire population) and the overall result is significant at the α level of significance, the assurance probability, pa , of consistency is defined as

(

)

pa = Pδ δɵ J ≥ ρδɵ All Z z1−α > 1 − γ , where Z represents the overall test statistic, ρ is minimum requirement for claiming consistency, and 0 < γ ≤ 0.2 is a pre-specified desired level of assurance. The Japanese MHLW suggests that ρ be 0.5 or greater. The determination of ρ , however, should be different from product to product. In  practice, above required sample size may either not  be easy to be achieved or not  pre-planned. Instead, consistency maybe evaluated using above idea if the effect of a subgroup achieves a specified proportion (usually ≥50%) of the observed overall effect. This can be seen a simplified version of the Japanese approach.

11.3.7 The Applicability of Those Approaches Considering the data type and which term the consistency test is based on, the approaches we have described above exhibit different levels of applicability. With respect to the data type, the normal data and binary data are the two widely used types. As for the term “consistency,” the following two meanings can be considered: comparing the two effects directly, e.g., comparing µT 1 − µR1 and µT 2 − µR 2 directly, where ( µTi , µRi ) are the treatment mean and the reference mean for the ith region, i = 1, 2; testing the consistency in terms of the original aim, e.g., testing the consistency between {µT 1 − µR1 > 0} and {µT 2 − µR 2 > 0} when the original hypothesis is H0: µT − µR ≤ 0 versus Ha: µT − µR > 0. Table 11.1 displays the applicability of the approaches. An example of direct comparison and original aim: assume a global clinical trial was carried out to compare the effect of a new drug and the effect of a reference drug globally and regionally. Denote µGT , µGR, µST and µST are the means of the new drug globally, the reference drug globally, the new drug

Innovative Statistics in Regulatory Science

276

TABLE 11.1 Displays the Applicability of the Approaches Data Type Approach

Aim

Binary

Normal

Compare Directly

Original Aim

✔ ✖ ✖ ✖

✔ ✔ ✔ ✔

✔ ✔ ✔ ✔

✖ ✖ ✖ ✖

















✖ ✔

✔ ✔

✔ ✔

✔ ✖

11.3.1 Shih 11.3.2 Tse 11.3.3 Sensitivity Index 11.3.4 Specificity reproducibility probability 11.3.4 Superiority reproducibility probability 11.3.4 Reproducibility probability ratio 11.3.5 Bayesian approach 11.3.6 Japanese approach

for a specific region and the reference drug for the specific region. We were concerned with the hypothesis testing H 0 : µGT − µGR ≥ 0

vs. H a : µGT − µGR < 0

as well as H 0 : µST − µSR ≥ 0 vs. H a : µST − µSR < 0 To test the consistency between the global effect µGT − µGR and the specific region effect µST − µSR, we may consider a variety of hypothesis testing. One is just to test H 0 : µGT − µGR = µST − µSR versus H a : µGT − µGR ≠ µST − µSR, which can be seen as direct comparison and is not directly related to whether  µST − µSR is smaller than 0 or not. Another one is directly related to H 0 : µST − µSR ≥ 0 versus H a : µST − µSR < 0, such as the Bayesian method introduced previously that is used to estimate the posterior probability of {µST − µSR < 0} using both the specific regional data and global data.

11.4 Simulation Study 11.4.1 The Case of the Matched-Pair Parallel Design with Normal Data and Superiority Test First, we perform a simulation study for the matched-pair parallel design with normal data and superiority test. Assume the clinical trial with the aim of comparing the treatment effect mean µT and reference effect mean µR

Consistency Evaluation

277

(the  smaller the better), consists of 80 patients from 10 regions with equal number of patients recruited in each region. Consider the superiority hypothesis test: H 0 : µT − µR ≥ 0 vs. H a : µT − µR < 0. Presume µT − µR is equal to −1 with the standard deviation being 3 globally for the power analysis. Given the pre-specified type I error level of α = 0.05, 80 samples can achieve a power of 90%. For the simulation parameter specification, let the standard deviations of the difference between treatment effect and reference effect be equal to 3 within each region. The values of µT − µR for six regions is equal to −1, while the other four regions have the values of −2, −1.5, −0.5 and 0, respectively. The true standard deviations of the difference between treatment effect and reference effect for the global population are 3.04. For  each simulation, we simulated data for each region, combined them as the global data, and tested the superiority hypothesis both regionally and globally with the standard t-statistic method. Denoting X S and X G as the sample means for the region and global area respectively, the results for the subgroup analyses could be divided into the following 4 categories (see Table 11.2). Seventeen approaches based on the methods discussed in Section  11.3 were selected to test the consistency. As the applicability of those methods are different, for each method, the version for the original aim would be used if the method applies to the original aim, otherwise the version for direct comparison would be adopted. Those methods are listed below with their abbreviations (Table 11.3). We did altogether 100,000 repetitions of the simulation, with the following results recorded: the rejection rate of the null hypothesis H 0 : µT − µR ≥ 0 regionally and globally (see Table  11.4), the frequency of the categories regionally and globally (see Table 11.5), consistency rates of each approach under different categories and in different regions (see Table 11.6). As the six regions have identical population parameters, we combined the results for them. TABLE 11.2 Categories of Subgroup Analysis Category 1 2 3 4

Superiority Trial X S < X G (met superiority) X G ≤ X S < 0 & upper bound of 1 − α CI for the region using global CI width < 0 X G ≤ X S < 0 but upper bound of 1 − α CI for the region using global CI width ≥ 0 XG < 0 ≤ XS

Innovative Statistics in Regulatory Science

278

TABLE 11.3 List of Abbreviations Approach

Choice of Parameters

Abbreviation

11.3.1 Shih

Shih Superiority

11.3.2 Tse

p0 = 0.8 , δ = 0.2 , α = 0.1

11.3.2 Tse

R–R choice for p0 , δ = 0.2 , α = 0.1 p0 = 0.8 , δ = 0.2 , α = 0.1 the version for superiority

11.3.2 Tse 11.3.2 Tse 11.3.3 Sensitivity index 11.3.4 Specificity reproducibility probability 11.3.4 Superiority reproducibility probability 11.3.4 Reproducibility probability ratio 11.3.4 Reproducibility probability ratio 11.3.4 Reproducibility probability ratio 11.3.4 Reproducibility probability ratio 11.3.5 Bayesian approach 11.3.5 Bayesian approach 11.3.5 Bayesian approach 11.3.5 Bayesian approach 11.3.6 Japanese approach

Tse Tse Reference Tse Sup

R–R choice for p0 , δ = 0.2 , α = 0.1, the version for superiority p0 = 0.8, simple version

Tse Sup Reference

R–R choice for p0 , α = 0.05, β = 0.2

RPspeP Reference RPsupP Reference GenP

SenInd Simple

R–R choice for p0 , q = 0.5, α = 0.05, β = 0.2 α = 0.05, β = 0.2 , p0 = 0.8 , method of power estimation for inequality test α = 0.05, β = 0.2 , p0 = 0.8 , method of confidence bound for inequality test α = 0.05, β = 0.2 , p0 = 0.8 , method of power estimation for superiority test α = 0.05, β = 0.2 , p0 = 0.8 , method of confidence bound for superiority test

GenCB GenP Sup GenCB Sup

p0 = 0.8, select min {Pc ( γ ) , γ = 0.1,… , 1} R–R choice for p0 , select min {Pc (γ ) , γ = 0.1,… , 1} p0 = 0.8, select mean {Pc (γ ) , γ = 0.1,… , 1} R–R choice for p0 , select mean {Pc (γ ) , γ = 0.1,… , 1} p0 = 0.8, simple version

Bayesian Min Constant Bayesian Min Reference Bayesian Mean Constant Bayesian Mean Reference Japan Simple

TABLE 11.4 Passing Rate by Global and Regions Passing Rate of Superiority Test

Global

Region 1–6

Region 7

Region 8

Region 9

Region 10

0.902

0.214

0.521

0.357

0.114

0.050

The  simulation results show that, for this scenario, the methods of the simple version for sensitivity index, superiority reproducibility probability, reproducibility probability ratio for superiority test, Bayesian approach and simple version for Japanese approach performed better than others, with overall consistency rate close to 65% (the rate of category 1 and 2).

Consistency Evaluation

279

TABLE 11.5 Frequencies by Global and Regions Category Frequency Category All 1 2 3 4

Region 1–6

Region 7

Region 8

Region 9

Region 10

All

600000 300121 99332 96893 103247

100000 84111 7583 5223 3055

100000 68970 12746 10409 7828

100000 31077 17283 20144 31393

100000 15930 14400 19744 49818

1000000 500209 151344 152413 195341

TABLE 11.6 Consistency Rates of Each Approach under Different Categories and in Different Regions Consistency Rate Approach Shih superiority

Tse

Tse reference

Tse sup

Tse sup reference

Category

Region 1–6

Region 7

Region 8

Region 9

Region 10

All

All 1 2 3 4 All 1 2 3 4 All 1 2 3 4 All 1 2 3 4 All 1 2 3 4

0.922 1 0.993 0.95 0.602 0 0 0 0 0 0.515 0.515 0.584 0.557 0.408 0 0 0 0 0 0.429 0.829 0.07 0.014 0

0.988 1 0.994 0.965 0.686 0 0 0 0 0 0.427 0.405 0.569 0.554 0.454 0.004 0.004 0 0 0 0.782 0.921 0.078 0.022 0.002

0.969 1 0.994 0.961 0.661 0 0 0 0 0 0.493 0.474 0.577 0.551 0.449 0.001 0.001 0 0 0 0.617 0.877 0.078 0.017 0

0.834 1 0.99 0.936 0.519 0 0 0 0 0 0.494 0.54 0.579 0.551 0.365 0 0 0 0 0 0.257 0.784 0.062 0.011 0

0.676 1 0.984 0.905 0.393 0 0 0 0 0 0.429 0.553 0.573 0.534 0.307 0 0 0 0 0 0.126 0.731 0.054 0.009 0

0.9 1 0.992 0.944 0.539 0 0 0 0 0 0.493 0.494 0.581 0.553 0.378 0.001 0.001 0 0 0 0.435 0.845 0.068 0.013 0

(Continued)

Innovative Statistics in Regulatory Science

280

TABLE 11.6 (Continued) Consistency Rates of Each Approach under Different Categories and in Different Regions Consistency Rate Approach SenInd simple

RPspeP reference

RPsupP reference

GenP

GenCB

GenP sup

GenCB sup

Category

Region 1–6

Region 7

Region 8

Region 9

All 1 2 3 4 All 1 2 3 4 All 1 2 3 4 All 1 2 3 4 All 1 2 3 4 All 1 2 3 4 All 1 2 3 4

0.589 0.988 0.494 0.075 0 0.623 0.624 0.959 0.74 0.189 0.545 0.994 0.258 0.033 0 0.668 0.99 0.586 0.061 0.377 NA NA 0.363 NA NA 0.622 0.995 0.681 0.075 0 0.575 0.981 0.465 0.049 0

0.882 0.994 0.53 0.109 0 0.433 0.365 0.976 0.826 0.297 0.864 0.997 0.299 0.052 0.001 0.898 0.996 0.615 0.084 0.296 NA NA 0.393 NA NA 0.898 0.998 0.708 0.104 0 0.875 0.992 0.492 0.069 0

0.759 0.992 0.516 0.088 0 0.57 0.502 0.968 0.782 0.247 0.727 0.996 0.281 0.044 0 0.795 0.993 0.597 0.074 0.331 NA NA 0.382 NA NA 0.784 0.996 0.689 0.087 0 0.749 0.988 0.483 0.059 0

0.399 0.986 0.458 0.062 0 0.573 0.724 0.948 0.695 0.138 0.352 0.992 0.226 0.024 0 0.554 0.987 0.555 0.05 0.446 NA NA 0.336 NA NA 0.435 0.993 0.656 0.063 0 0.385 0.974 0.433 0.039 0

Region 10

All

0.227 0.58 0.979 0.989 0.425 0.487 0.045 0.071 0 0 0.436 0.575 0.803 0.575 0.933 0.957 0.642 0.727 0.092 0.16 0.189 0.541 0.989 0.995 0.197 0.253 0.016 0.031 0 0 0.503 0.676 0.981 0.991 0.537 0.581 0.037 0.058 0.524 0.423 NA NA NA NA 0.313 0.358 NA NA NA NA 0.259 0.611 0.991 0.995 0.634 0.676 0.048 0.072 0 0 0.219 0.568 0.965 0.983 0.412 0.459 0.029 0.047 0 0 (Continued)

Consistency Evaluation

281

TABLE 11.6 (Continued) Consistency Rates of Each Approach under Different Categories and in Different Regions Consistency Rate Approach

Category

Region 1–6

Region 7

Region 8

Region 9

Region 10

All

Bayesian min constant

All 1 2 3 4 All 1 2 3 4 All 1 2 3 4 All 1 2 3 4 All 1 2 3 4

0.549 0.917 0.532 0.015 0 0.719 1 0.945 0.382 0.004 0.59 0.937 0.7 0.031 0 0.749 1 0.99 0.518 0.007 0.579 0.999 0.442 0.032 0

0.849 0.959 0.546 0.015 0 0.94 1 0.962 0.485 0.009 0.873 0.971 0.721 0.033 0 0.95 1 0.993 0.629 0.017 0.882 0.999 0.516 0.051 0

0.718 0.94 0.539 0.014 0 0.857 1 0.953 0.431 0.009 0.751 0.955 0.698 0.032 0 0.877 1 0.993 0.57 0.014 0.755 0.999 0.483 0.042 0

0.37 0.895 0.514 0.012 0 0.539 1 0.933 0.33 0.002 0.412 0.922 0.689 0.031 0 0.576 1 0.99 0.463 0.004 0.385 0.999 0.395 0.023 0

0.214 0.878 0.5 0.012 0 0.346 1 0.915 0.277 0.001 0.251 0.909 0.694 0.03 0 0.384 1 0.985 0.415 0.002 0.214 0.998 0.355 0.017 0

0.545 0.925 0.528 0.014 0 0.7 1 0.942 0.368 0.003 0.583 0.944 0.699 0.031 0 0.728 1 0.99 0.505 0.005 0.571 0.999 0.436 0.03 0

Bayesian min reference

Bayesian mean constant

Bayesian mean reference

Japan simple

11.4.2 The Case of the Two-Group Parallel Design with Normal Data and Superiority Test We also performed a simulation study for two-group parallel design with normal data and superiority test. Assume the clinical trial with the aim of comparing the treatment effect mean µT and reference effect mean µR (the smaller the better), consists of 140 patients from 10 regions with equal number of patients recruited in each region and equal number of patients in each arm. Consider the superiority hypothesis test: H 0 : µT − µR ≥ 0 vs. H a : µT − µR < 0.

Innovative Statistics in Regulatory Science

282

Presume µT − µR is equal to −1 with the standard deviation of each arm being 2 globally for the power analysis. Given the pre-specified type I error level of α = 0.05, 140 samples can achieve a power of 90%. For the simulation parameter specification, let both standard deviations of the treatment arm and the reference arm be equal to 2 within each region. Let µR of each region is equal to 0. The values of µT − µR for six regions is equal to −1, while the other four regions have the values of −2, −1.5, −0.5 and 0, respectively. The true standard deviations of the treatment arm and the reference arm for the global population are 2.06 and 2. For each simulation, we simulated data for each region, combined them as the global data, and tested the superiority hypothesis both regionally and globally with the standard t-statistic method. Denoting X S and X G as the sample means of the difference between the treatment arm and reference arm for the region and global area, respectively, the results for the subgroup analyses could be divided into the following 4 categories (see Table 11.7). Like the simulation study for the matched-pair parallel design, seventeen approaches based on the methods discussed in Section 11.3 were selected to test the consistency. We did 100,000 repetitions of the simulation, with the following results recorded: the rejection rate of the null hypothesis H 0 : µT − µR ≥ 0 regionally and globally (see Table 11.8), the frequency of the categories regionally and globally (see Table 11.9), consistency rates of each approach under different categories and in different regions (see Table 11.10). As the six regions have identical population parameters, we combined the results for them. The  simulation results show that, for this scenario, the methods of the simple version for sensitivity index, superiority reproducibility probability, reproducibility probability ratio for superiority test, Bayesian approach and

TABLE 11.7 Test Results by Category (Global and Regions) Category 1 2 3 4

Superiority Trial X S < X G (met superiority) X G ≤ X S < 0  & upper bound of 1 − α CI for the region using global CI width g}. Under the assumption of the statistical independence between Xij and Yij, we have  X′ − X′ − d   Yi′ − Y0′ − g  , pis = Φ  i p 0  Φ   p Sɶ i Q    i  where p Sɶ i =

p = Q i

( N i − 1) Si′2 /N i + ( N 0 − 1) S0′2 /N 0 , Ni + N0 − 2

( N i − 1) Qi′ 2 /N i + ( N 0 − 1) Q0′ 2 /N 0 , Ni + N0 − 2

and d, g are pre-specified values. Probability of being the best dose—Select the dose with the largest pib . That is pib = pr max i ( di ) = di , max i ( gi ) = gi

{

=



(

 1  z − Xi′ − X0′ φ Si′  Si′ 

)   



}

(

 z − X′g − X0′ Φ g =1,i ≠ g  S′g 

k

)  dz  

Innovative Statistics in Regulatory Science

376



(

 1  w − Yi′ − Y0′ φ ′  ′ Q Q i i 

)   



(

 w − Yh′ − Y0′ Φ h =1,i ≠ h ′  Q h 

k

)  dw,  

under the assumption that Xij and Yij are statistically independent.

14.3.3 A Numeric Example A clinical study was intended to be designed as a two-stage phase II/III seamless, adaptive, randomized, double-blind, parallel group, placebo-controlled, multicenter study to evaluate the efficacy, safety, and dose-response of a compound in treating postmenopausal women and adjuvant breast cancer patients with vasomotor symptoms (VMS). Hot flashes, flushes, and night sweats, sometimes accompanied by shivering and a sense of cold, are wellknown VMS. The primary study endpoints would be frequency and severity of VMS evaluated at Weeks 4 and 12 as co-primary endpoints. Since the real data for this study is unavailable, we provide this example with numerical simulated data to illustrate the use of the dose-selection methods, based on such studies. Under this two-stage adaptive trial design, the first stage is a phase II trial for dose selection. The  second stage is a phase III trial for confirming the efficacy and safety of the compound. An interim analysis will be conducted to evaluate the dose effect and to review the efficacy and safety effects when about 50% subjects at Stage 1 complete Week 4 of the Doubleblind Treatment period. One or two optimal dose(s) will be selected and the sample size will be re-estimated. The second stage of the study starts after the interim analysis. Newly enrolled subjects will be randomized to receive the selected optimal dose(s) or placebo for 12 weeks. The second stage will serve as a pivotal study. To establish the measure frequency of VMS, at week 4, each study subject reported the number of VMS each day and those numbers were averaged for each subject. For  severity, each subject was asked to rate how much they were bothered by VMS at the end of week 4 using scores from 1 to 10 with higher value indicating more severity. Three doses and one placebo were compared. 800 subjects were recruited into the study with 200 in each group. At interim, each group had 100 subjects having completed week 4 of the treatment period. The logarithm-transformation of frequency and the original scale of severity were treated as following normal distributions and were used for interim analysis. Using the approach of probability of being the best dose, which was shown to be the best one in the simulation studies, we selected dose 3 which had the largest overall probability of being the best dose. The  results were shown in Table  14.1 with means and standard errors of the two co-primary endpoints.

Criteria for Dose Selection

377

TABLE 14.1 Results of the Example

Endpoint 1

Sample mean Sample standard error Probability of being the best Endpoint 2 Sample mean Sample standard error Probability of being the best Overall probability of being the best

Placebo

Dose 1

Dose 2

Dose 3

3.37 1.61

3.42 2.10 0.0% 6.17 1.76 7.0% 0.0%

3.07 1.62 1.1% 5.94 1.89 45.0% 0.5%

2.56 1.53 98.9% 5.93 1.87 47.9% 47.4%

6.74 1.74

14.4 Clinical Trial Simulation In  this section, we conducted simulation studies to investigate the performances of the four methods described in Section 14.2. 14.4.1 Single Primary Endpoint Assume the mean effects of four doses (dose 1 being a placebo) are to be compared, and all effects follow normal distributions. Different parameter settings were given to investigate the performances of the methods. Dose 4 was always set to be the most effective dose among the four doses compared. We consider the following total sample sizes: 100, 150, and 200. The proportion of interim sample size was set to be 20%, resulting in interim sample sizes of 20, 30, and 40, respectively. For each case, 5,000 repetitions were implemented. We summarized the simulation results as the frequency of being selected as the most effective dose in Tables 14.2 through 14.4, according to difference sample sizes. From the simulation results, we observed that the last method probability of being the best dose performed better in all cases than other methods, with higher probability of choosing the right best dose and lower probability of choosing the wrong dose. 14.4.2 Co-primary Endpoints Assume the mean effects of the co-primary endpoints of four doses (dose 1 being a placebo) are to be compared, and all effects follow normal distributions. Different parameter settings were given to investigate the performances of the methods. In the first scenario, dose 4 was set to be the most effective dose among the four doses compared, with both co-primary endpoints being the most effective. In the second scenario, the first endpoint of

TABLE 14.2
Simulation Results for Single Primary Endpoint When the Total Sample Size Is 100

TABLE 14.3
Simulation Results for Single Primary Endpoint When the Total Sample Size Is 150

TABLE 14.4
Simulation Results for Single Primary Endpoint When the Total Sample Size Is 200

[Tables 14.2 through 14.4 report the probability that each dose is selected under criteria C1 through C4 for dose-mean configurations (0.5, 0.51, 0.52, 0.53), (0.5, 0.51, 0.525, 0.53), and (0.5, 0.505, 0.51, 0.53), with standard deviations 0.005, 0.01, 0.02, 0.03, 0.05, 0.1, and 0.2. Note: the darker the value, the higher the probability. C1: conditional power; C2: precision analysis based on confidence interval; C3: predictive probability of success; C4: probability of being the best dose. The cell-level values could not be reliably recovered from the source.]

14.4.2 Co-primary Endpoints

Assume the mean effects of the co-primary endpoints of four doses (dose 1 being a placebo) are to be compared, and all effects follow normal distributions. Different parameter settings were used to investigate the performance of the methods. In the first scenario, dose 4 was set to be the most effective dose among the four doses compared, being the most effective on both co-primary endpoints. In the second scenario, the first endpoint of dose 4 is most effective, while the second endpoint of dose 3 is most effective. We consider total sample sizes of 100 and 200. The proportion of interim sample size was set to 20%, resulting in interim sample sizes of 20 and 40, respectively. For each case, 5,000 repetitions were implemented. The simulation results, summarized as the frequency of being selected as the most effective dose, are given in Tables 14.5 and 14.6 according to the different sample sizes.

TABLE 14.5
Simulation Results for Co-primary Endpoints: Scenario 1

[For N = 100 and N = 200, the table reports the probability that each of doses 1 through 4 is selected, under dose means (0.5, 0.51, 0.52, 0.53) on both endpoints and standard-deviation pairs (SD1, SD2) ranging over 0.03, 0.05, and 0.1. Note: the darker the value, the higher the probability. Mean1/Mean2: means of the 1st/2nd endpoints; SD1/SD2: standard deviations of the 1st/2nd endpoints. The cell-level values could not be reliably recovered from the source.]


As with the single-endpoint case, the last method, probability of being the best dose, performed better in all cases than the other methods, with a higher probability of choosing the right best dose and a lower probability of choosing a wrong dose.

TABLE 14.6
Simulation Results for Co-Primary Endpoints: Scenario 2

[For N = 100 and N = 200, the table reports the probability that each of doses 1 through 4 is selected when the first endpoint favors dose 4 and the second endpoint favors dose 3 (means 0.5, 0.51, 0.52, 0.53, with the dose 3 and dose 4 means interchanged on the second endpoint), with standard-deviation pairs (SD1, SD2) over 0.03, 0.05, and 0.1. Note: the darker the value, the higher the probability. Mean1/Mean2: means of the 1st/2nd endpoints; SD1/SD2: standard deviations of the 1st/2nd endpoints. The cell-level values could not be reliably recovered from the source.]


14.5 Concluding Remarks

In this chapter, we discussed the problem of dose selection in two-stage adaptive clinical trials, with the aim of selecting the most effective dose for further investigation in the next stage. Four criteria for dose selection were introduced and compared through extensive simulation studies. In the simulations, the method of probability of being the best dose performed better than the other three methods; we therefore recommend this method in practical applications as appropriate. In addition, a numerical example concerning the dose-response of a compound for treating postmenopausal women and adjuvant breast cancer patients with vasomotor symptoms was given to illustrate the use of this method.

15 Generics and Biosimilars

15.1 Introduction

In the United States, for traditional chemical (small molecule) drug products, when an innovative (brand-name) drug product is going off patent, pharmaceutical and/or generic companies may file an abbreviated new drug application (ANDA) for approval of generic copies of the brand-name drug product. In 1984, the FDA was authorized to approve generic drug products under the Drug Price Competition and Patent Term Restoration Act, also known as the Hatch-Waxman Act. For approval of small molecule generic drug products, the FDA requires that evidence of average bioavailability, in terms of the rate and extent of drug absorption, be provided. The assessment of bioequivalence as a surrogate endpoint for quantitative evaluation of drug safety and efficacy rests on the Fundamental Bioequivalence Assumption: if two drug products are shown to be bioequivalent in average bioavailability, it is assumed that they will reach the same therapeutic effect, i.e., they are therapeutically equivalent and hence can be used interchangeably. Under the Fundamental Bioequivalence Assumption, regulatory requirements, study design, criteria, and statistical methods for assessment of bioequivalence have been well established (see, e.g., Schuirmann, 1987; EMEA, 2001; FDA 2001, 2003a, 2003b; WHO, 2005; Chow and Liu, 2008).

Unlike small molecule drug products, the generic versions of biologic products are viewed as similar biological drug products (SBDP). The SBDP are not generic drug products, which have identical active ingredient(s) to the innovative drug product. Thus, the concept for development of SBDP, which are made of living cells, is very different from that of generic small molecule drug products. The SBDP are usually referred to as biosimilars by the European Medicines Agency (EMA) of the European Union, follow-on biologics (FOB) by the FDA, and subsequent entered biologics (SEB) by the Public Health Agency (PHA) of Canada. As the patents on a number of biologic products are due to expire in the next few years, the subsequent production of follow-on products has aroused interest within the pharmaceutical/biotechnology industry, as biosimilar manufacturers strive to obtain part of an already large and rapidly growing market.


The potential opportunity for price reductions relative to the original biologic products remains to be determined, as the advantage of a slightly cheaper price may be outweighed by the hypothetical increased risk of side effects from biosimilar molecules that are not exact copies of their originators.

In this chapter, the focus will be placed not only on the fundamental differences between small molecule drug products and biologic products, but also on issues surrounding the quantitative evaluation of bioequivalence (for small molecule drug products) and biosimilarity (for biosimilars or follow-on biologics). In Section 15.2, fundamental differences between small molecule drug products and biologic drug products are briefly described. Sections 15.3 and 15.4 provide brief descriptions of the current processes for quantitative evaluation of bioequivalence and biosimilarity, respectively. A general approach using a biosimilarity index for assessment of bioequivalence and biosimilarity, derived from the concept of reproducibility probability, is proposed and discussed in Section 15.5. Section 15.6 summarizes some current scientific factors and practical issues regarding the assessment of biosimilarity. Brief concluding remarks are given in Section 15.7.

15.2 Fundamental Differences

Biosimilars or follow-on biologics are fundamentally different from traditional chemical generic drugs. Unlike traditional chemical generic drug products, which contain identical active ingredient(s), the generic versions of biologic products are made of living cells. Unlike classical generics, biosimilars are not identical to their originator products and therefore should not be brought to market using the same procedure applied to generics. This is partly a reflection of the complexities of manufacturing and of the safety and efficacy controls of biosimilars when compared to their small molecule generic counterparts (see, e.g., Chirino and Mire-Sluis, 2004; Schellekens, 2004; Crommelin et al., 2005; Roger and Mikhail, 2007). Some of the fundamental differences between biosimilars and generic chemical drugs are summarized in Table 15.1. For example, biosimilars are known to be variable and very sensitive to environmental conditions such as light and temperature; a small variation may translate into a drastic change in clinical outcomes (e.g., safety and efficacy). In addition to differences in the size and complexity of the active substance, important differences also include the nature of the manufacturing process. Since biologic products are often recombinant protein molecules manufactured in living cells, manufacturing processes for biologic products are highly complex and require hundreds of specific isolation and purification steps. Thus, in practice, it is impossible to produce an identical copy of a biologic product, as changes to the structure of the molecule can occur with changes in the manufacturing process.


TABLE 15.1
Fundamental Differences

Chemical Drugs                               Biologic Drugs
Made by chemical synthesis                   Made by living cells
Defined structure                            Heterogeneous structure; mixtures of related molecules
Easy to characterize                         Difficult to characterize
Relatively stable                            Variable; sensitive to environmental conditions such as light and temperature
No issue of immunogenicity                   Issue of immunogenicity
Usually taken orally                         Usually injected
Often prescribed by a general practitioner   Usually prescribed by specialists

Since a protein can be modified during the process (e.g., a side chain may be added, or the structure may change due to protein misfolding, and so on), different manufacturing processes may lead to structural differences in the final product, which result in differences in efficacy and safety and may have a negative impact on patients' immune responses. It should be noted that these issues also arise during post-approval changes to the innovator's biological products. Thus, SBDP are not generic products, and the standard generic approach is neither applicable nor acceptable given the complexity of biological/biotechnology-derived products. Instead, a similar biological approach relying on state-of-the-art analytical procedures should be applied.

15.3 Quantitative Evaluation of Generic Drugs For  approval of small molecule generic drug products, the FDA  requires that evidence of average bioequivalence in drug absorption in terms of some pharmacokinetic (PK) parameters such as the area under the blood and/or plasma concentration-time curve (AUC) and peak concentration (Cmax) be provided through the conduct of bioequivalence studies. In practice, we may claim that a test drug product is bioequivalent to an innovative (reference) drug product if the 90% confidence interval for the ratio of geometric means of the primary PK parameter is completely within the bioequivalence limits of (80%, 125%). The confidence interval for the ratio of geometric means of the primary PK parameter is obtained based on log-transformed data. In what follows, study designs and statistical methods that are commonly considered in bioequivalence studies are briefly described.


15.3.1 Study Design

As indicated in the Federal Register [Vol. 42, No. 5, Sec. 320.26(b) and Sec. 320.27(b), 1977], a bioavailability study (single-dose or multi-dose) should be crossover in design, unless a parallel or other design is more appropriate for valid scientific reasons. Thus, in practice, a standard two-sequence, two-period (or 2 x 2) crossover design is often considered for a bioavailability or bioequivalence study. Denote by T and R the test product and the reference product, respectively. A 2 x 2 crossover design can then be expressed as (TR, RT), where TR is the first sequence of treatments and RT the second. Under the (TR, RT) design, qualified subjects randomly assigned to sequence 1 (TR) receive the test product (T) first and are then crossed over to receive the reference product (R) after a sufficient wash-out period. Similarly, subjects randomly assigned to sequence 2 (RT) receive the reference product (R) first and then the test product (T) after a sufficient wash-out period.

One limitation of the standard 2 x 2 crossover design is that it does not provide independent estimates of intra-subject variabilities, since each subject receives each treatment only once. In the interest of assessing intra-subject variabilities, the following alternative higher-order crossover designs for comparing two drug products are often considered: (i) Balaam's design, i.e., (TT, RR, RT, TR); (ii) the two-sequence, three-period dual design, e.g., (TRR, RTT); and (iii) the four-sequence, four-period design, e.g., (TTRR, RRTT, TRTR, RTTR). For comparing more than two drug products, a Williams' design is often considered. For example, for comparing three drug products, a six-sequence, three-period (6 x 3) Williams' design is usually considered, while a 4 x 4 Williams' design is employed for comparing four drug products. A Williams' design is a variance-stabilizing design; more information regarding the construction and good design characteristics of Williams' designs can be found in Chow and Liu (2008).

In addition to the assessment of average bioequivalence (ABE), there are other types of bioequivalence assessment, such as population bioequivalence (PBE), which is intended to address drug prescribability, and individual bioequivalence (IBE), which is intended to address drug switchability. For assessment of IBE/PBE, the FDA recommends that a replicated design be considered for obtaining independent estimates of intra-subject and inter-subject variabilities and of the variability due to subject-by-drug-product interaction. A commonly considered replicated crossover design, the replicate of the 2 x 2 crossover design, is (TRTR, RTRT). In some cases, an incomplete block design or an extra-reference design such as (TRR, RTR) may be considered, depending upon the study objectives of the bioavailability/bioequivalence studies (Chow et al., 2002b).
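To make the construction concrete, the following sketch (our illustration, not from the text) generates Williams designs using the classical interleaving construction; for an odd number of treatments the mirror-image square is appended, which yields the 6 x 3 design for three treatments and the 4 x 4 design for four.

```python
from string import ascii_uppercase

def williams_design(t):
    """Williams design for t treatments, balanced for first-order carryover.
    The first row interleaves the lowest and highest remaining labels;
    the other rows are cyclic shifts. For odd t, the reversed square is
    appended, giving 2t sequences."""
    first, lo, hi = [], 0, t - 1
    for i in range(t):
        if i % 2 == 0:
            first.append(lo)
            lo += 1
        else:
            first.append(hi)
            hi -= 1
    rows = [[(x + k) % t for x in first] for k in range(t)]
    if t % 2 == 1:
        rows += [list(reversed(r)) for r in rows]
    return ["".join(ascii_uppercase[x] for x in r) for r in rows]

print(williams_design(3))  # 6 sequences x 3 periods
print(williams_design(4))  # 4 sequences x 4 periods
```

In the t = 4 output, each treatment appears once per period and every ordered pair of treatments occurs exactly once as a (predecessor, successor) pair, which is the carryover-balance property of a Williams square.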


15.3.2 Statistical Methods

As indicated in Chapter 3, the FDA recommends using a two one-sided tests (TOST) procedure for testing the interval hypotheses for average bioequivalence (ABE) assessment of generic drugs. In practice, however, ABE is evaluated by a confidence interval approach: ABE is claimed if the ratio of average bioavailabilities between the test and reference products is within the bioequivalence limits of (80%, 125%) with 90% assurance, based on log-transformed data. In many cases, the TOST at the 5% level of significance on each side is operationally equivalent to the 90% confidence interval approach, although the two are conceptually very different. Along this line, the statistical methods commonly employed for assessing the bioequivalence of generic drugs are the TOST for interval hypotheses and the confidence interval approach.

For the confidence interval approach, a 90% confidence interval for the ratio of means of the primary pharmacokinetic response, such as AUC or Cmax, is obtained under an analysis of variance model. We claim bioequivalence if the obtained 90% confidence interval falls entirely within the bioequivalence limits of (80%, 125%), inclusively. For the method of interval hypotheses testing, the interval hypotheses are

$$H_0: \text{bioinequivalence} \quad vs. \quad H_a: \text{bioequivalence}. \qquad (15.1)$$

Note that the above hypotheses are usually decomposed into two sets of one-sided hypotheses: the first set verifies that the average bioavailability of the test product is not too low, whereas the second set verifies that it is not too high. For these two one-sided hypotheses, Schuirmann's two one-sided tests procedure is commonly employed for testing ABE (Schuirmann, 1987); note that Schuirmann's procedure is a size-α test (Chow and Shao, 2002b).

In practice, other statistical methods are sometimes considered, such as Westlake's symmetric confidence interval approach, the confidence interval based on Fieller's theorem, Chow and Shao's joint confidence region approach, Bayesian methods, and non-parametric methods such as the Wilcoxon-Mann-Whitney two one-sided tests procedure, the distribution-free confidence interval based on the Hodges-Lehmann estimator, and the bootstrap confidence interval (Chow and Liu, 2008). It should be noted, however, that the concept of interval hypotheses testing and the concept of the confidence interval approach for bioequivalence assessment are very different, although they are operationally equivalent under certain conditions. In the case of binary responses, the TOST is not operationally equivalent to the confidence interval approach. Thus, bioequivalence evaluation using statistical methods other than the official TOST should proceed with caution to avoid inflated false positive and/or false negative rates.
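As a hedged illustration of how the TOST and the 90% confidence interval operate on the same quantities under a standard (TR, RT) crossover, consider the following sketch; the simple half-period-difference analysis stands in for the full ANOVA model, and the data layout is hypothetical.

```python
import numpy as np
from scipy import stats

def tost_2x2(seq_tr, seq_rt, alpha=0.05, limits=(0.80, 1.25)):
    """Schuirmann's two one-sided tests and the 90% CI for log-transformed
    PK data from a (TR, RT) crossover. seq_tr / seq_rt: (n, 2) arrays of
    log(AUC) or log(Cmax), columns = periods 1 and 2."""
    d1 = (seq_tr[:, 0] - seq_tr[:, 1]) / 2.0  # half period differences, seq TR
    d2 = (seq_rt[:, 0] - seq_rt[:, 1]) / 2.0  # half period differences, seq RT
    est = d1.mean() - d2.mean()               # estimates mu_T - mu_R
    n1, n2 = len(d1), len(d2)
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * d1.var(ddof=1) + (n2 - 1) * d2.var(ddof=1)) / df
    se = np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    lo, hi = np.log(limits)
    p_lower = 1 - stats.t.cdf((est - lo) / se, df)  # H0: mu_T - mu_R <= log 0.80
    p_upper = 1 - stats.t.cdf((hi - est) / se, df)  # H0: mu_T - mu_R >= log 1.25
    tcrit = stats.t.ppf(1 - alpha, df)              # two 5% tests <-> one 90% CI
    ci = np.exp([est - tcrit * se, est + tcrit * se])
    return {"GMR": float(np.exp(est)), "CI90": ci, "p_TOST": max(p_lower, p_upper)}

# Example with simulated log-AUC data (hypothetical):
rng = np.random.default_rng(3)
seq_tr = rng.normal([4.60, 4.55], 0.25, size=(12, 2))
seq_rt = rng.normal([4.55, 4.60], 0.25, size=(12, 2))
print(tost_2x2(seq_tr, seq_rt))
```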


15.3.3 Other Criteria for Bioequivalence Assessment

Although the assessment of ABE for generic approval has been in practice for years, it has the following limitations: (i) it focuses only on the population average; (ii) it ignores the distribution of the metric; and (iii) it does not provide independent estimates of intra-subject variabilities and ignores the subject-by-formulation interaction. In addition, the use of a one-fits-all criterion for assessment of ABE has been criticized in the past decade. It has been suggested that the one-fits-all criterion be made flexible by adjusting for the intra-subject variability of the reference product and the therapeutic window whenever possible. Many authors criticize that (i) the assessment of ABE does not address the question of drug interchangeability, and (ii) it may penalize drug products with lower variability.

15.3.3.1 Population Bioequivalence and Individual Bioequivalence (PBE/IBE)

Between the early 1990s and early 2000s, as more and more generic drug products became available, it became a point of concern whether generic drugs are safe and efficacious when used interchangeably. To address this issue, aggregated criteria were proposed that take into consideration the inter- and intra-subject variabilities of the test and reference products and the variability due to subject-by-drug interaction. These criteria were proposed to address interchangeability (in terms of prescribability and switchability) through the assessment of population bioequivalence (PBE) for prescribability and individual bioequivalence (IBE) for switchability. The proposed PBE/IBE criteria have some undesirable properties (see, e.g., Chow, 1999) due to possible masking and cancellation effects within the complicated aggregated criteria. As a result, as indicated in the 2003 FDA guidance, PBE/IBE are not required for BE assessment for generic approval.

15.3.3.2 Scaled Average Bioequivalence (SABE)

To address the issue that the current ABE criterion may penalize drug products with lower variability, Haidar et al. (2008) proposed a scaled average bioequivalence (SABE) criterion for assessment of bioequivalence for highly variable drug products. The SABE criterion is useful not only for highly variable drug products, but also for drug products with a narrow therapeutic index (NTI). It should be noted, however, that the SABE criterion is in fact an ABE criterion adjusted for the standard deviation of the reference product. Thus, it is a special case of the following criterion for IBE:

$$\frac{(\mu_T - \mu_R)^2 + \sigma_D^2 + (\sigma_{WT}^2 - \sigma_{WR}^2)}{\max(\sigma_{WR}^2, \sigma_{W0}^2)} \le \theta_I, \qquad (15.2)$$


where $\sigma_{WT}^2$ and $\sigma_{WR}^2$ are the within-subject variances of the test drug product and the reference drug product, respectively, $\sigma_D^2$ is the variance component due to subject-by-drug interaction, $\sigma_{W0}^2$ is a constant that can be adjusted to control the probability of passing IBE, and $\theta_I$ is the bioequivalence limit for IBE.
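A point evaluation of the aggregated criterion in (15.2) is straightforward, as sketched below. The regulatory constants used ($\sigma_{W0} = 0.2$ and $\theta_I \approx 2.4948$, from the 2001 FDA guidance) are our assumption, not stated in the text, and a full analysis would use a 95% upper confidence bound on the linearized criterion rather than a plug-in value.

```python
def ibe_criterion_met(mu_t, mu_r, sd_wt, sd_wr, sd_d, sd_w0=0.2, theta_i=2.4948):
    """Plug-in check of the aggregated IBE criterion (15.2).
    mu_t, mu_r: log-scale means; sd_wt, sd_wr: within-subject SDs;
    sd_d: subject-by-drug interaction SD. Constants are assumed values."""
    aggregate = (mu_t - mu_r) ** 2 + sd_d ** 2 + (sd_wt ** 2 - sd_wr ** 2)
    aggregate /= max(sd_wr, sd_w0) ** 2   # reference- or constant-scaled
    return aggregate, aggregate <= theta_i

print(ibe_criterion_met(0.05, 0.0, sd_wt=0.25, sd_wr=0.25, sd_d=0.10))
```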

15.3.3.3 Scaled Criterion for Drug Interchangeability (SCDI)

Chow et al. (2015) proposed a criterion based on the first two components of the criterion for IBE in (15.2), which consists of (i) the criterion for ABE adjusted for the intra-subject variability of the reference product (i.e., SABE), and (ii) a correction for the variability due to subject-by-product interaction (i.e., $\sigma_D^2$). The proposed criterion for assessing interchangeability is briefly derived below.

Step 1: Unscaled ABE criterion. Let BEL be the BE limit, which equals 1.25. Bioequivalence requires that

$$\frac{1}{BEL} \le GMR \le BEL.$$

This implies $-\log(BEL) \le \log(GMR) \le \log(BEL)$, or

$$-\log(BEL) \le \mu_T - \mu_R \le \log(BEL),$$

where $\mu_T$ and $\mu_R$ are logarithmic means.

Step 2: Scaled ABE (SABE) criterion. The difference in logarithmic means is adjusted for intra-subject variability as follows:

$$-\log(BEL_S) \le \frac{\mu_T - \mu_R}{\sigma_W} \le \log(BEL_S),$$

or $-\log(BEL_S)\,\sigma_W \le \mu_T - \mu_R \le \log(BEL_S)\,\sigma_W$, where $\sigma_W^2$ is a within-subject variance. In practice, $\sigma_{WR}^2$, the within-subject variance of the reference product, is often used.

Step 3: Development of SCDI. Considering the first two components of the IBE criterion in (15.2), we have the relationship

$$\frac{(\mu_T - \mu_R)^2 + \sigma_D^2}{\sigma_W^2} = \frac{(\delta + \sigma_D)^2 - 2\delta\sigma_D}{\sigma_W^2},$$

where $\delta = \mu_T - \mu_R$. When $\delta = 0$ and $\sigma_D$ approaches 0, we have

$$\frac{(\delta + \sigma_D)^2}{\sigma_W^2} \approx \frac{2\delta\sigma_D}{\sigma_W^2}.$$

Thus, Chow et al. (2015) proposed that the SCDI can be summarized as

$$-\log(BEL_S) \le \left(\frac{\mu_T - \mu_R}{\sigma_W}\right)\left(\frac{2\sigma_D}{\sigma_W}\right) \le \log(BEL_S).$$

Now let $f = \sigma_W/(2\sigma_D)$, the correction factor for drug interchangeability. The proposed SCDI criterion is then

$$-\log(BEL_S)\, f \sigma_W \le \mu_T - \mu_R \le \log(BEL_S)\, f \sigma_W.$$

Note that its statistical properties and finite-sample performance need further research.

15.3.3.4 Remarks

Following the concept of the criterion for IBE and the idea of SABE, Chow et al.'s (2015) SCDI criterion is developed from the one-size-fits-all criterion adjusted for both the intra-subject variability of the reference product and the variability due to subject-by-product interaction. As compared to SABE, SCDI may result in a wider or narrower limit, depending upon the correction factor f, which measures the relative magnitude of $\sigma_{WR}$ and $\sigma_D$.

The SCDI criterion depends upon the regulatory constants specified for $\sigma_{WR}$ and $\sigma_D$ (i.e., $\sigma_{W0}$ and $\sigma_{D0}$). In practice, the observed variabilities may deviate far from the regulatory constants. Thus, it is suggested that the following hypotheses be tested before the use of the SCDI criterion:

$$H_{01}: \sigma_{WR} \le \sigma_{W0} \quad vs. \quad H_{a1}: \sigma_{WR} > \sigma_{W0},$$

and

$$H_{02}: \sigma_{D} \le \sigma_{D0} \quad vs. \quad H_{a2}: \sigma_{D} > \sigma_{D0}.$$

If we fail to reject $H_{01}$ or $H_{02}$, we stick with the respective regulatory-suggested constants; otherwise, estimates of $\sigma_{WR}$ and/or $\sigma_D$ should be used in the SCDI criterion. It should, however, be noted that the statistical properties and finite-sample performance of SCDI with estimates of $\sigma_{WR}$ and/or $\sigma_D$ are not well established. Further research is needed.
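A numerical sketch of the SABE and SCDI limits may help. Here the scaled limit $\log(BEL_S)$ is taken as $\log(1.25)/0.25$ (the scaling constant commonly used for highly variable drugs), and $\sigma_W = \sigma_{WR}$ follows the text; both constants are our assumptions.

```python
import numpy as np

def sabe_limits(sd_wr, bel=1.25, sd_w0=0.25):
    """SABE limits on the log scale: +/- (log(BEL)/sd_w0) * sd_wr."""
    half = np.log(bel) / sd_w0 * sd_wr
    return -half, half

def scdi_limits(sd_wr, sd_d, bel=1.25, sd_w0=0.25):
    """SCDI limits: SABE limits multiplied by f = sd_w / (2 sd_d)."""
    f = sd_wr / (2.0 * sd_d)
    lo, hi = sabe_limits(sd_wr, bel, sd_w0)
    return f * lo, f * hi

# f > 1 widens and f < 1 narrows the limit relative to SABE:
print(sabe_limits(0.30))             # about (-0.268, 0.268) on the log scale
print(scdi_limits(0.30, sd_d=0.12))  # f = 1.25 -> wider
print(scdi_limits(0.30, sd_d=0.20))  # f = 0.75 -> narrower
```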

15.4 Quantitative Evaluation of Biosimilars

As indicated earlier, the assessment of bioequivalence is possible under the Fundamental Bioequivalence Assumption. Due to the fundamental differences between small molecule drug products and biological products, the Fundamental Bioequivalence Assumption and the well-established standard methods may not apply directly to the assessment of biosimilarity.

15.4.1 Regulatory Requirement

On March 23, 2010, the Biologics Price Competition and Innovation (BPCI) Act was written into law, giving the FDA the authority to approve similar biological drug products. Following the passage of the BPCI Act, in order to obtain input on specific issues and challenges associated with its implementation, the FDA conducted a two-day public hearing on November 2-3, 2010, in Silver Spring, Maryland, on the approval pathway for biosimilar and interchangeable biological products. Several scientific factors were suggested and discussed at this public hearing, including criteria for assessing biosimilarity, study design and analysis methods for assessment of biosimilarity, and tests for comparability in quality attributes of the manufacturing process and/or immunogenicity (see, e.g., Chow et al., 2010). These issues primarily concern the assessment of biosimilarity; the issue of interchangeability, in terms of the concepts of alternating and switching, was also discussed. These discussions led to the development of regulatory guidance.

On February 9, 2012, the FDA circulated three draft guidances on the demonstration of biosimilarity for comments: (i) Scientific Considerations in Demonstrating Biosimilarity to a Reference Product, (ii) Quality Considerations in Demonstrating Biosimilarity to a Reference Protein Product, and (iii) Biosimilars: Questions and Answers Regarding Implementation of the Biologics Price Competition and Innovation (BPCI) Act of 2009 (FDA, 2012a, 2012b, 2012c). Another FDA public hearing devoted to these draft guidances was held at the FDA on May 11, 2012, and the three guidances were finalized in 2015. As indicated in the guidance on scientific considerations, the FDA recommends a stepwise approach for obtaining the totality-of-the-evidence for demonstrating biosimilarity between a proposed biosimilar (test) product and an innovative biological (reference) product. The stepwise approach


starts with analytical studies for structural and functional characterization of critical quality attributes, followed by the assessment of pharmacokinetic/pharmacodynamic (PK/PD) similarity and the demonstration of clinical similarity, including immunogenicity and safety/efficacy evaluation. Based on the BPCI Act (part of the Affordable Care Act), quantitative evaluation of biosimilarity involves the concepts of biosimilarity and drug interchangeability, which are briefly described below.

15.4.2 Biosimilarity

In the BPCI Act, a biosimilar product is defined as a product that is highly similar to the reference product, notwithstanding minor differences in clinically inactive components, provided there are no clinically meaningful differences in terms of safety, purity, and potency. Based on this definition, a biological medicine is considered biosimilar to a reference biological medicine if it is highly similar to the reference in safety, purity (quality), and efficacy. However, little or no discussion regarding "How similar is considered highly similar?" is given in the BPCI Act.

15.4.2.1 Basic Principles

The BPCI Act seems to suggest that a biosimilar product should be highly similar to the reference drug product across the whole spectrum of good drug characteristics, such as identity, strength, quality, purity, safety, and stability. In practice, however, it is almost impossible to demonstrate that a biosimilar product is highly similar to the reference product in all aspects of good drug characteristics in a single study. Thus, to ensure that a biosimilar product is highly similar to the reference product in terms of these good drug characteristics, different biosimilar studies may be required. For example, if safety and efficacy are a concern, a clinical trial must be conducted to demonstrate that there are no clinically meaningful differences in safety and efficacy. On the other hand, to ensure high similarity in quality, assay development/validation, process control/validation, and product specifications of the reference product must be established; in addition, tests for comparability in the manufacturing process between the biosimilar and the reference must be performed. In some cases, if a surrogate endpoint such as a pharmacokinetic (PK), pharmacodynamic (PD), or genomic marker is predictive of the primary efficacy/safety clinical endpoint, a PK/PD or genomic study may be used to assess biosimilarity between the biosimilar and the reference product.

It should be noted that current regulatory requirements are applied on a case-by-case basis following the basic principles that requirements reflect: (i) the extent of the physicochemical and biological characterization of the product; (ii) the nature of possible changes in the quality and structure of the biological product due to changes in the manufacturing process (and their


unexpected outcomes); (iii) clinical/regulatory experience with the particular class of product in question; and (iv) several factors that need to be considered for biocomparability.

15.4.2.2 Criteria for Biosimilarity

For the comparison between drug products, some criteria for the assessment of bioequivalence, similarity (e.g., the comparison of dissolution profiles), and consistency (e.g., comparisons between manufacturing processes) are available in regulatory guidelines/guidances or in the literature. These criteria can be classified as (i) absolute change versus relative change, (ii) aggregated versus disaggregated, or (iii) moment-based versus probability-based. In practice, we may assess bioequivalence or biosimilarity by comparing average and variability separately or simultaneously, which leads to the so-called disaggregated and aggregated criteria, respectively. A disaggregated criterion provides different levels of biosimilarity: for example, a study that passes both the average and variability criteria provides stronger evidence of biosimilarity than a study that passes only the average criterion. On the other hand, it is not clear whether an aggregated criterion provides stronger evidence of biosimilarity, due to the potential offsetting (or masking) effect between the average and variability components within the aggregated criterion. Further research establishing appropriate statistical testing procedures based on the aggregated criterion, and comparing its performance with the disaggregated criterion, may be needed.

Chow et al. (2010) compared the moment-based criterion with the probability-based criterion for assessment of bioequivalence or biosimilarity under a parallel-group design. The results indicate that the probability-based criterion is not only much more stringent but also sensitive to any small change in variability. This justifies the use of the probability-based criterion for assessment of biosimilarity between follow-on biologics when a certain level of precision and reliability of biosimilarity is desired.

15.4.2.3 Study Design

As indicated earlier, a crossover design is often employed for bioequivalence assessment. In a crossover study, each drug product is administered to each subject, so an (approximate) estimate of within-subject variance can be obtained to address switchability and interchangeability. In a parallel-group study, each drug product is administered to a different group of subjects, so we can only estimate the total variance (between- plus within-subject variance), not the individual variance components. For follow-on biologics with long half-lives, a crossover study would be ineffective and unethical; in this case, a parallel-group study must be undertaken. However, a parallel-group study does not provide an estimate of within-subject variation (since there is no R vs. R comparison).


15.4.2.4 Statistical Methods

Similar to the assessment of average bioequivalence, the FDA recommends that Schuirmann's two one-sided tests (TOST) procedure be used for the assessment of biosimilarity, although this method has often been conflated with the confidence interval approach. On the other hand, if criteria similar to those for population/individual bioequivalence are considered, the 95% upper confidence bound can be used for assessing biosimilarity based on the linearized criteria of population/individual bioequivalence. Note that a clarification of when to use a 90% CI and when to use a 95% CI can be found in Chapter 3.

15.4.3 Interchangeability

As indicated in Subsection (b)(3) amended to the Public Health Act Subsection 351(k)(3), the term interchangeable or interchangeability, in reference to a biological product that is shown to meet the standards described in subsection (k)(4), means that the biological product may be substituted for the reference product without the intervention of the health care provider who prescribed the reference product. Along this line, definitions and basic concepts of interchangeability (in terms of switching and alternating) are given below.

15.4.3.1 Definition and Basic Concepts

As indicated in Subsection (a)(2), which amends the Public Health Act Subsection 351(k)(3), a biological product is considered interchangeable with the reference product if (i) the biological product is biosimilar to the reference product, and (ii) it can be expected to produce the same clinical result in any given patient. In addition, for a biological product that is administered more than once to an individual, the risk, in terms of safety or diminished efficacy, of alternating or switching between use of the biological product and the reference product must not be greater than the risk of using the reference product without such alternation or switch. Thus, there is a clear distinction between biosimilarity and interchangeability: biosimilarity does not imply interchangeability, which is much more stringent. Intuitively, if a test product is judged to be interchangeable with the reference product, then it may be substituted, even alternated, without intervention, or even notification, of the health care provider. However, interchangeability implies that a test product is expected to produce the same clinical result in any given patient, which can be interpreted as meaning that the same clinical result can be expected in every single patient. Conceivably, lawsuits could be filed if adverse effects are recorded in a patient after switching from one product to another, interchangeable product.

It should be noted that when the FDA declares two drug products biosimilar, it may not be assumed that they are interchangeable. Therefore, labels ought to state whether, for a follow-on biologic which is biosimilar


to a reference product, interchangeability has or has not been established. However, payers and physicians may, in some cases, switch products even if interchangeability has not been established.

15.4.3.2 Switching and Alternating

Unlike drug interchangeability for small molecule drugs (in terms of prescribability and switchability; Chow and Liu, 2008), the FDA has a slightly different perception of interchangeability for biosimilars. From the FDA's perspective, interchangeability includes the concepts of switching and alternating between an innovative biologic product (R) and its follow-on biologic (T). The concept of switching refers to a single switch, including not only the switch from "R to T" or "T to R" (switchability in the narrow sense), but also "T to T" and "R to R" (switchability in the broader sense). As a result, in order to assess switching, biosimilarity for "R to T," "T to R," "T to T," and "R to R" needs to be assessed based on some biosimilarity criteria under a valid switching design. On the other hand, the concept of alternating refers to multiple switches, including either the switch from T to R and then back to T (i.e., "T to R to T") or the switch from R to T and then back to R (i.e., "R to T to R"). Thus, for addressing the concept of alternating, the difference between the initial switch (e.g., "T to R") and the switch back (e.g., "R to T") needs to be assessed.

15.4.3.3 Study Design

For assessment of bioequivalence for chemical drug products, a standard two-sequence, two-period (2 x 2) crossover design is often considered, except for drug products with relatively long half-lives. Since most biosimilar products have relatively long half-lives, a parallel-group design is suggested. However, a parallel-group design does not provide independent estimates of variance components such as inter- and intra-subject variabilities and the variability due to subject-by-product interaction, which is a major challenge in assessing biosimilars under parallel-group designs. In order to assess biosimilarity for "R to T," "T to R," "T to T," and "R to R," Balaam's 4 x 2 crossover design, i.e., (TT, RR, TR, RT), may be useful. For addressing the concept of alternating, a two-sequence, three-period dual design, i.e., (TRT, RTR), may be useful. For addressing both switching and alternating, a modified Balaam's crossover design, i.e., (TT, RR, TRT, RTR), is recommended. For switching designs, the FDA recommends that (RT, RR) (single switch) and (RTR, RRR) and (RTRT, RRRR) (multiple switches) be used. However, Chow and Lee (2009) suggest that a complete n-of-1 design be considered, because the FDA-recommended switching designs are partial designs of complete n-of-1 trial designs.


15.4.4 Remarks

With small molecule drug products, bioequivalence generally reflects therapeutic equivalence, and drug prescribability, switching, and alternating are generally considered reasonable. With biologic products, however, variations are often larger (factors other than pharmacokinetic ones may be sensitive to small changes in conditions), so often only parallel-group designs, rather than crossover kinetic studies, can be utilized. It should be noted that, with follow-on biologics, biosimilarity very often does not reflect therapeutic comparability. Therefore, switching and alternating should be pursued only with substantial caution.

15.5 General Approach for Assessment of Bioequivalence/Biosimilarity

As indicated earlier, the concepts of biosimilarity and interchangeability for follow-on biologics are very different from those of bioequivalence and drug interchangeability for small molecule drug products. It is debatable whether the standard methods for assessment of bioequivalence and drug interchangeability can be applied to assessing the biosimilarity and interchangeability of follow-on biologics, given the fundamental differences described in Section 15.2. While appropriate criteria or standards for assessment of biosimilarity and interchangeability are still under discussion within the regulatory agencies and among the pharmaceutical industry and academia, we propose a general approach for assessing biosimilarity and interchangeability by comparing the relative difference between "a test product vs. a reference product" and "the reference vs. the reference," based on the concept of the reproducibility probability of claiming biosimilarity between a test product and a reference product in a future biosimilarity study, given that biosimilarity has been established in the current study.

15.5.1 Development of Bioequivalence/Biosimilarity Index

Shao and Chow (2002) proposed the reproducibility probability as an index for determining whether a second trial is necessary when the result of the first clinical trial is strongly significant. Suppose that the null hypothesis H0 is rejected if and only if |T| > c, where c is a positive known constant and T is a test statistic. The reproducibility probability of observing a significant clinical result when Ha is indeed true is given by

$$p = P(|T| > c \mid H_a) = P(|T| > c \mid \hat{\theta}), \qquad (15.3)$$


where $\hat{\theta}$ is an estimate of θ, an unknown parameter or vector of parameters. Following a similar idea, a reproducibility probability can also be used to evaluate biosimilarity and interchangeability between a test product and a reference product based on any pre-specified criteria for biosimilarity and interchangeability. As an example, the biosimilarity index proposed by Chow et al. (2011), based on the well-established bioequivalence criterion, is obtained by the following steps:

Step 1. Assess the average bioequivalence/biosimilarity between the test product and the reference product based on a given bioequivalence/biosimilarity criterion. For illustration, consider the bioequivalence criterion: bioequivalence/biosimilarity is claimed if the 90% confidence interval of the ratio of means of a given study endpoint falls within the bioequivalence/biosimilarity limits of (80%, 125%), inclusive, based on log-transformed data.

Step 2. Once the product passes the test in Step 1, calculate the reproducibility probability based on the observed ratio (or observed mean difference) and variability. We refer to this calculated reproducibility probability as the bioequivalence/biosimilarity index.

Step 3. We then claim bioequivalence/biosimilarity if the following null hypothesis is rejected:

$$H_0: P \le p_0 \quad vs. \quad H_a: P > p_0. \qquad (15.4)$$

A confidence interval approach can be applied similarly: we claim bioequivalence/biosimilarity if the lower 95% confidence bound of the reproducibility probability is larger than a pre-specified number p0. In practice, p0 can be obtained from an estimate of the reproducibility probability for a study comparing the reference product to itself; we refer to such a study as an R-R study. In an R-R study, define

$$P_{TR} = P\{\text{concluding average biosimilarity between the test and reference products in a future trial, given that average biosimilarity based on the ABE criterion has been established in the first trial}\}. \qquad (15.5)$$


Alternatively, the reproducibility probability for evaluating the bioequivalence/biosimilarity of two samples of the same reference product based on the ABE criterion is defined as

$$P_{RR} = P\{\text{concluding average biosimilarity of the two same-reference products in a future trial, given that average biosimilarity based on the ABE criterion has been established in the first trial}\}. \qquad (15.6)$$

The idea of the bioequivalence/biosimilarity index is to show that the reproducibility probability of a study comparing the generic/biosimilar with the innovative (reference) product is sufficiently high relative to that of a study comparing the reference product with itself. The criterion of an acceptable reproducibility probability ($p_0$) for assessment of bioequivalence/biosimilarity can therefore be obtained from the R-R study. For example, if the R-R study suggests a reproducibility probability of 90%, i.e., $P_{RR} = 90\%$, the criterion for the bioequivalence/biosimilarity study could be chosen as 80% of the 90%, i.e., $p_0 = 80\% \times P_{RR} = 72\%$.

The bioequivalence/biosimilarity index described above has the following advantages: (i) it is robust with respect to the selected study endpoint, bioequivalence/biosimilarity criterion, and study design; (ii) it takes variability into consideration (one of the major criticisms of the assessment of average bioequivalence); (iii) it allows the definition and assessment of the degree of similarity (in other words, it provides a partial answer to the question "How similar is considered similar?"); and (iv) it reflects sensitivity to heterogeneity in variance. Most importantly, the biosimilarity index proposed by Chow et al. (2011) can be applied to different functional areas (domains) of biological products, such as good drug characteristics; safety (e.g., immunogenicity), purity, and potency (as described in the BPCI Act); pharmacokinetics (PK) and pharmacodynamics (PD); biological activities; biomarkers (e.g., genomic markers); and the manufacturing process, for an assessment of global biosimilarity. An overall biosimilarity index across domains can be obtained by the following steps:

Step 1. Obtain $P_i$, the reproducibility probability for the ith domain, $i = 1, \ldots, K$.

Step 2. Define the global biosimilarity index $P = \sum_{i=1}^{K} w_i P_i$, where $w_i$ is the weight for the ith domain.

Step 3. Claim global biosimilarity if the lower 95% confidence bound of the reproducibility probability P is larger than a pre-specified acceptable reproducibility probability $p_0$.
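A Monte Carlo sketch of the biosimilarity index under the estimated-power approach could look as follows; the 2 x 2 crossover setup, the plug-in of the observed log-GMR and variability, and the numbers used are all illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

def reproducibility(log_gmr, sigma_d, n_per_seq, alpha=0.05, n_sims=20_000):
    """P(a future 2x2 crossover of the same size passes ABE), plugging in
    the observed log-GMR and the SD of within-subject period differences."""
    df = 2 * n_per_seq - 2
    se = sigma_d * np.sqrt(2.0 / n_per_seq)
    lo, hi = np.log(0.80), np.log(1.25)
    tcrit = stats.t.ppf(1 - alpha, df)
    est = rng.normal(log_gmr, se, n_sims)             # future point estimates
    s = se * np.sqrt(rng.chisquare(df, n_sims) / df)  # future standard errors
    return float(np.mean((est - tcrit * s > lo) & (est + tcrit * s < hi)))

p_tr = reproducibility(np.log(1.05), sigma_d=0.25, n_per_seq=30)  # T vs. R
p_rr = reproducibility(0.0, sigma_d=0.25, n_per_seq=30)           # R vs. R
p0 = 0.8 * p_rr   # e.g., 80% of the R-R benchmark, as in the text
print(p_tr, p_rr, p_tr > p0)
```

A global index across K domains would then weight such probabilities, P = sum of w_i P_i, as in Step 2 above.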


15.5.2 Remarks

Hsieh et al. (2010) studied the performance of the biosimilarity index under an R-R study to establish a baseline for the assessment of biosimilarity based on the current criterion for average bioequivalence. The results indicate that the biosimilarity index is sensitive to the variability associated with the reference product: the index decreases as the variability increases. As an example, Figure 15.1 gives reproducibility probability curves under a 2 x 2 crossover design with sample sizes $n_1 = n_2 = 10, 20, 30, 40, 50$, and 60 at the 0.05 level of significance and $(\theta_L, \theta_U) = (80\%, 125\%)$, for $\sigma_d = 0.2$ and 0.3, where $\sigma_d$ is the standard deviation of the period difference within each subject.

In practice, alternative approaches for assessment of the proposed biosimilarity index are available (see, e.g., Hsieh et al., 2010; Yang et al., 2010), including the maximum likelihood approach and the Bayesian approach. For the Bayesian approach, let $p(\theta)$ be the power function, where θ is an unknown parameter or vector of parameters; under this approach, θ is random with a prior distribution assumed known. The reproducibility probability can then be viewed as the posterior mean of the power function for the future trial,

$$\int p(\theta)\,\pi(\theta \mid x)\,d\theta, \qquad (15.7)$$

where $\pi(\theta \mid x)$ is the posterior density of θ given the data set x observed in the previous trial(s). However, there may exist no explicit form for the estimation of the biosimilarity index; as a result, the statistical properties of the derived biosimilarity index may not be known, and its finite-sample performance may only be evaluated by clinical trial simulations.
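Under a normal approximation with a vague prior, the posterior mean in (15.7) can be approximated by simulation, as sketched below; using the observed standard error for the future trial as well is a simplifying assumption of ours.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def bayesian_reproducibility(log_gmr, se, df, alpha=0.05, n_draws=50_000):
    """Approximate posterior mean of the TOST power function (15.7):
    draw theta from its approximate posterior N(log_gmr, se^2), evaluate
    the future trial's power at each draw, and average."""
    lo, hi = np.log(0.80), np.log(1.25)
    tcrit = stats.t.ppf(1 - alpha, df)
    theta = rng.normal(log_gmr, se, n_draws)   # posterior draws of the truth
    power = (stats.norm.cdf((hi - theta) / se - tcrit)
             - stats.norm.cdf((lo - theta) / se + tcrit))
    return float(np.clip(power, 0.0, 1.0).mean())

print(bayesian_reproducibility(log_gmr=np.log(1.05), se=0.06, df=58))
```

Relative to the plug-in (estimated-power) approach, averaging over the posterior typically shrinks optimistic reproducibility estimates when the data are noisy.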

FIGURE 15.1 Impact of variability on reproducibility probability. (The reproducibility probability decreases as $\mu_1/\mu_2$ (original scale) moves away from 1 and as $\sigma_d$ (log scale) increases.)


As an alternative measure for assessment of global biosimilarity across domains, we may consider $r_d = \sum_{i=1}^{K} w_i r_{d_i}$, where $r_{d_i} = P_{TR_i}/P_{RR_i}$ is the relative measure of biosimilarity between T and R compared to that between R and R for the ith domain. Based on $r_{d_i}$, $i = 1, \ldots, K$, we may conduct a profile analysis as described in the 2003 FDA guidance on Bioavailability and Bioequivalence Studies for Nasal Aerosols and Nasal Sprays for Local Action (FDA, 2003b). However, the statistical properties of the profile analysis based on $r_{d_i}$, $i = 1, \ldots, K$, have not been fully studied, and further research is required.

15.6 Scientific Factors and Practical Issues for Biosimilars

Following the passage of the BPCI Act, in order to obtain input on specific issues and challenges associated with its implementation, the FDA conducted a two-day public hearing on the Approval Pathway for Biosimilar and Interchangeable Biological Products on November 2-3, 2010, at the FDA in Silver Spring, Maryland, USA. In what follows, some of the scientific factors and practical issues are briefly described.

For the chemical generic products, the well-defined product characteristics are the exposure measures for early, peak, and total portions of the concentrationtime curve. The Fundamental Bioequivalence Assumption allows us to assume that equivalence in the exposure measures implies therapeutically equivalent. However, due to the complexity of the biosimilar drug products, one has to verify that some validated product characteristics are indeed reliable predictors of the safety and efficacy. It follows that the design and analysis for evaluation of equivalence between the biosimilar drug product and innovator products are substantially different from those of the chemical generic products.


15.6.2 Endpoint Selection

For the assessment of biosimilarity of follow-on biologics, the following questions are commonly asked. First, what endpoints should be used for the assessment of biosimilarity? Second, should a clinical trial always be conducted? To address these two questions, we may revisit the definition of biosimilarity as described in the BPCI Act. A biological product that is demonstrated to be highly similar to an FDA-licensed biological product may rely on certain existing scientific knowledge about the safety, purity (quality), and potency (efficacy) of the reference product. Thus, if one would like to show that the safety and efficacy of a biosimilar product are highly similar to those of the reference product, then a clinical trial may be required. In some cases, clinical trials for the assessment of biosimilarity may be waived if there exists substantial evidence that surrogate endpoints or biomarkers are predictive of the clinical outcomes. On the other hand, clinical trials are required for the assessment of drug interchangeability in order to show that the safety and efficacy of a biosimilar product and a reference product are similar in any given patient of the patient population under study.

15.6.3 How Similar Is Similar?

Current criteria for the assessment of bioequivalence/biosimilarity are useful for determining whether a biosimilar product is similar to a reference product. However, they do not provide additional information regarding the degree of similarity. As indicated in the BPCI Act, a biosimilar product is defined as a product that is highly similar to the reference product. However, little or no discussion regarding the degree of similarity implied by highly similar was provided. Besides, it is also of concern to the sponsor "what if a biosimilar product turns out to be superior to the reference product?" A simple answer to this concern is that superiority is not biosimilarity.

15.6.4 Guidance on Analytical Similarity Assessment

On September 27, 2017, the FDA circulated a draft guidance on analytical similarity assessment for comments (FDA, 2017b). In the draft guidance, the FDA recommended that equivalence tests be used for analytical similarity assessment between a proposed biosimilar (test) product and an innovative biological (reference) product for critical quality attributes (CQAs) with high risk ranking relevant to clinical outcomes. For CQAs with mild to moderate risk ranking relevant to clinical outcomes, the FDA suggested that the quality range (QR) approach be considered. The equivalence test for analytical similarity evaluation has been criticized by many authors due to its inflexibility in the similarity margin selection of 1.5σR, where σR is the standard deviation of the reference product (Chow et al., 2016). A sketch of such an equivalence test is given below.
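The following is a minimal sketch of an equivalence test of this kind, assuming the test is operationalized by checking whether the 90% confidence interval for the mean difference falls within ±1.5σ̂R; the lot values, function names, and the pooled degrees-of-freedom choice are illustrative assumptions, not prescriptions from the guidance.

```python
import numpy as np
from scipy import stats

def analytical_similarity_tost(test_lots, ref_lots, alpha=0.05, k=1.5):
    """Equivalence test (TOST) for one quality attribute with
    margin = k * sigma_R, sigma_R estimated from the reference lots."""
    test = np.asarray(test_lots, float)
    ref = np.asarray(ref_lots, float)
    margin = k * ref.std(ddof=1)                   # similarity margin 1.5 * sigma_R
    diff = test.mean() - ref.mean()
    se = np.sqrt(test.var(ddof=1) / len(test) + ref.var(ddof=1) / len(ref))
    df = len(test) + len(ref) - 2                  # simple choice; Satterthwaite also possible
    t_crit = stats.t.ppf(1 - alpha, df)
    # Similarity is declared if the 90% CI for the mean difference is within +/- margin
    lower, upper = diff - t_crit * se, diff + t_crit * se
    return (-margin < lower) and (upper < margin), (lower, upper), margin

ref = [100.2, 99.5, 101.1, 100.8, 99.9, 100.4, 99.7, 100.9]   # hypothetical lot values
test = [100.6, 99.8, 101.3, 100.1, 100.7, 99.6]
print(analytical_similarity_tost(test, ref))
```

Note how the inflexibility criticized above is visible in the code: the margin is tied mechanically to the observed σ̂R, so a highly variable reference product yields a wide margin regardless of clinical relevance.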

The QR method, on the other hand, is considered inadequate because it relies on the primary assumption that the proposed

biosimilar product and the reference product have similar mean and standard deviation, which is never true in practice. The draft guidance was subsequently withdrawn in June 2018, although the equivalence test (for CQAs with high risk ranking relevant to clinical outcomes) and the QR method (for CQAs with mild to moderate risk ranking relevant to clinical outcomes) are still used for analytical similarity assessment in some biosimilar regulatory submissions. Recently, the FDA circulated a new draft guidance for comparative analytical assessment (FDA, 2019). In the draft guidance, the FDA suggested the use of the QR method for comparative analytical assessment. As indicated in the draft guidance, the objective of the QR method is to verify the assumption that the proposed biosimilar product and the reference product have similar means and similar standard deviations. The guidance also indicated that analytical similarity for a quality attribute would generally be supported when a sufficient percentage of biosimilar lot values (e.g., 90%) fall within the quality range defined for that attribute. This statement, however, may have been misinterpreted by non-statistical reviewers to mean that the concept of comparative analytical assessment is eventually the same as that of analytical similarity evaluation. As indicated by Chow et al. (2016), the QR method is designed for the purpose of quality control/assurance, in the sense that we would expect about 95% (99%) of the test results of the test lots to fall within the quality range developed based on 2 (3) standard deviations below and above the mean test results of the reference lots. The QR method is only valid under the assumption that the test product and the reference product have similar population means and similar population standard deviations (i.e., they are highly similar and can be viewed as coming from the same population).
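A minimal sketch of the QR approach as just described might look as follows, assuming the range is set at k standard deviations around the reference mean and that similarity is supported when, say, 90% of biosimilar lot values fall within it; the lot values and names are hypothetical.

```python
import numpy as np

def quality_range_check(test_lots, ref_lots, k=3.0, pass_fraction=0.90):
    """Quality range (QR) approach: QR = mean_R +/- k * sd_R; similarity is
    supported when a sufficient fraction of biosimilar lot values
    (e.g., 90%) fall within the range."""
    test = np.asarray(test_lots, float)
    ref = np.asarray(ref_lots, float)
    lo = ref.mean() - k * ref.std(ddof=1)
    hi = ref.mean() + k * ref.std(ddof=1)
    frac_within = float(np.mean((test >= lo) & (test <= hi)))
    return frac_within >= pass_fraction, frac_within, (lo, hi)

ref = [100.2, 99.5, 101.1, 100.8, 99.9, 100.4, 99.7, 100.9]   # hypothetical lot values
test = [100.6, 99.8, 101.3, 100.1, 100.7, 99.6]
print(quality_range_check(test, ref, k=2.0))
```

The sketch also makes the limitation noted above apparent: the check says nothing about whether the test lots actually share the reference mean and standard deviation, which is the assumption on which the method's validity rests.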

15.6.5 Practical Issues

Since there are many critical (quality) attributes of a potential patient's response in follow-on biologics, for a given critical attribute, valid statistical methods need to be developed under a valid study design and a given set of criteria for similarity, as described in the previous section. Several areas can be identified for developing appropriate statistical methodologies for the assessment of biosimilarity of follow-on biologics. These areas include, but are not limited to, the following.

15.6.5.1 Criteria for Biosimilarity (in Terms of Average, Variability, or Distribution)

To address the question "How similar is similar?", we suggest establishing disaggregated criteria for biosimilarity in terms of average, variability, and/or distribution. In other words, we can

establish general similarity by demonstrating similarity in average first. Then, we can establish high similarity by demonstrating similarity in variability or distribution.

15.6.5.2 Criteria for Interchangeability

In practice, it is recognized that drug interchangeability is related to the variability due to subject-by-drug interaction. However, it is not clear whether the criterion for interchangeability should be based on the variability due to subject-by-drug interaction or on the variability due to subject-by-drug interaction adjusted for the intra-subject variability of the reference drug.

15.6.5.3 Reference Product Changes

In practice, it is not uncommon to observe a shift in mean response over time for the reference product. This reference product change over time may be due to (i) minor changes in the manufacturing process, (ii) use of new and/or advanced technology, and/or (iii) some unknown factors. During the review process of a biosimilar regulatory submission, it is of great concern when there are reference product shifts, because products before and after the shift may not be similar. In this case, which lots (e.g., lots before the shift, lots after the shift, or all lots combined) should be used for biosimilarity assessment is a major challenge to the reviewers. On second thought, should a significant reference product change be considered a major violation (e.g., a 483 observation), with appropriate action taken for the purpose of quality control/assurance? Since a possible reference product change over time will have an impact on the biosimilarity assessment, "How to detect a potential reference product change?" has become an important issue in the review and approval process of biosimilar regulatory submissions. The FDA is currently working on a specific guidance for reference product change, similar to the SUPAC (scale-up and post-approval changes) guidance for generic drugs.

15.6.5.4 Extrapolation

For a given indication and a CQA, the validity of extrapolation depends upon there being a well-established relationship (linear or non-linear) between the CQA and PK/clinical outcomes. Without such a well-established relationship, a notable difference in a given CQA (e.g., CQAs in Tier 1, which are considered most relevant to PK/clinical outcomes) may or may not translate to a clinically meaningful difference in clinical outcomes. In practice, a notable difference in Tier 1 CQAs may vary from one indication to another even when the indications have a similar PK profile or mechanism of action (MOA). Thus, the validity of extrapolation across indications without collecting any clinical data is a great concern. In this case, the statistical method for the evaluation of the sensitivity index proposed by Lu et al. (2017) may be helpful.


15.6.5.5 Non-medical Switch

A non-medical switch refers to a switch from the reference product (more expensive) to an approved biosimilar product (less expensive) based on factors unrelated to clinical/medical considerations. Typical approaches for the assessment of non-medical switching include (i) observational studies and (ii) limited clinical studies. However, there are concerns regarding (i) the validity, quality, and integrity of the data collected and (ii) the scientific validity of the design and analysis of studies conducted for the assessment of the safety and efficacy of non-medical switching (see also Chow, 2018). In recent years, several observational studies and a national clinical study (NOR-SWITCH) were conducted to evaluate the risk of a non-medical switch from a reference product to an approved biosimilar product (Løvik Goll, 2016). The conclusions from these studies, however, are biased and hence may be somewhat misleading due to some scientific and/or statistical deficiencies in the design and analysis of the data collected. Chow (2018) recommended some valid study designs and appropriate statistical methods for a more accurate and reliable assessment of the potential risk of a medical/non-medical switch between a proposed biosimilar product and a reference product. The results can be easily extended for the evaluation of the potential risk of a medical/non-medical switch among multiple biosimilar products and a reference product.

15.6.5.6 Bridging Studies for Assessing Biosimilarity

As most biosimilar studies are conducted using a parallel design rather than a replicated crossover design, independent estimates of variance components, such as the intra-subject variability and the variability due to subject-by-drug interaction, are not possible. In this case, bridging studies may be considered. Other practical issues include (i) the use of a percentile method for the assessment of variability; (ii) comparability in biologic activities; (iii) assessment of immunogenicity; (iv) consistency in manufacturing processes (see, e.g., ICH, 1996b, 1999, 2005b); (v) stability testing for multiple lots and/or multiple labs (see, e.g., ICH, 1996c); (vi) the potential use of sequential testing procedures and multiple testing procedures; and (vii) assessing biosimilarity using a surrogate endpoint or biomarker such as genomic data (see, e.g., Chow et al., 2004). Further research is needed in order to address the above-mentioned scientific factors and practical issues recognized at the FDA public hearings, FDA public meetings, and in the FDA biosimilars review process.

15.7 Concluding Remarks

As indicated earlier, we claim that a test drug product is bioequivalent to a reference (innovative) drug product if the 90% confidence interval for the ratio of means of the primary PK parameter is totally within the bioequivalence


limits of (80%, 125%). This one-size-fits-all criterion focuses only on average bioavailability and ignores heterogeneity of variability. Thus, it is not scientifically/statistically justifiable for the assessment of biosimilarity of follow-on biologics. In practice, it is then suggested that appropriate criteria that can take the heterogeneity of variability into consideration be developed, since biosimilars are known to be variable and sensitive to small variations in environmental conditions (Chow and Liu, 2010; Chow et al., 2010; Hsieh et al., 2010). At the FDA public hearing, questions that were commonly asked were "How similar is considered similar?" and "How should the degree of similarity be measured and translated to clinical outcomes (e.g., safety and efficacy)?" These questions are closely related to the drug interchangeability of biosimilars or follow-on biologics which have been shown to be biosimilar to the innovative product (Roger, 2006; Roger and Mikhail, 2007). For the assessment of bioequivalence for chemical drug products, a crossover design is often considered, except for drug products with relatively long half-lives. Since most biosimilar products have relatively long half-lives, it is suggested that a parallel-group design be considered. However, a parallel-group design does not provide independent estimates of variance components such as inter- and intra-subject variabilities and the variability due to subject-by-product interaction. Thus, it is a major challenge to assess biosimilars under parallel-group designs. Although the EMA has published several product-specific guidances based on concept papers (e.g., EMEA 2003a, 2003b, 2005a, 2005b, 2005c, 2005d, 2005e, 2005f, 2005g), it has been criticized that there are no objective standards for the assessment of biosimilars because the assessment depends upon the nature of the products. Product-specific standards seem to suggest that a flexible biosimilarity criterion should be considered and that the flexible criterion should be adjusted for the variability and/or the therapeutic index of the innovative (or reference) product. As described above, there are many uncertainties in the assessment of the biosimilarity and interchangeability of biosimilars. As a result, it is a major challenge to both clinical scientists and biostatisticians to develop valid and robust clinical/statistical methodologies for the assessment of biosimilarity and interchangeability under these uncertainties. In addition, how to address the issues of quality and comparability in the manufacturing process is another challenge to both pharmaceutical scientists and biostatisticians. The proposed general approach using the bioequivalence/biosimilarity index (derived based on the concept of reproducibility probability) may be useful. However, further research on the statistical properties of the proposed bioequivalence/biosimilarity index is required.

16 Precision Medicine

16.1 Introduction

In clinical trials, a typical approach for the evaluation of the safety and efficacy of a test treatment under investigation is to first test the null hypothesis of no treatment difference in efficacy based on clinical data collected from adequate and well-controlled clinical studies. If the result is significant, the investigator rejects the null hypothesis of no treatment difference and concludes, in favor of the alternative hypothesis, that there is a difference in favor of the test treatment. If there is sufficient power for correctly detecting a clinically meaningful difference (treatment effect) when such a difference truly exists, we claim that the test treatment is efficacious. The test treatment will then be reviewed and approved by the regulatory agency, such as the FDA, if the test treatment is well tolerated and there appear to be no safety concerns. We will refer to medicine developed under this typical approach as traditional medicine.

In his State of the Union address on January 20, 2015, President Barack Obama announced that he was launching the Precision Medicine Initiative, a bold new research effort to revolutionize how we improve health and treat disease. As President Obama indicated, precision medicine is an innovative approach that takes into account individual differences in people's genes, environments, and lifestyles. Under the auspices of the traditional approach (traditional medicine), most medical treatments have been designed for the average patient. Under this one-size-fits-all approach, treatments can be very successful for some patients but not for others. Precision medicine, on the other hand, gives medical professionals the resources they need to target the specific treatments of the illnesses we encounter (News Release, 2015). In response to President Obama's Precision Medicine Initiative, the NIH subsequently initiated cohort grants for precision medicine to develop treatments tailored to an individual based on their genetics and other personal characteristics (McCarthy, 2015). The search for precision medicine has since become a center of clinical research in pharmaceutical development.


The purpose of this chapter is to provide a comprehensive summary of the concept, design, and analysis of precision medicine in pharmaceutical research and development. In the next section, the concept of precision medicine is described. The design and analysis of precision medicine are reviewed and discussed in Section 16.3. Section 16.4 provides alternative enrichment designs for precision medicine. Some concluding remarks are given in Section 16.5.

16.2 The Concept of Precision Medicine

16.2.1 Definition of Precision Medicine

Unlike traditional medicine, precision medicine (PM) is a medical model that proposes the customization of healthcare, with medical decisions, practices, and/or products being tailored to the individual patient (NRC, 2011). In this model, diagnostic testing is often employed for selecting appropriate and optimal therapies in the context of a patient's genetics or other molecular or cellular analysis. Tools employed in PM could include molecular diagnostics, imaging, and analytics/software. This has led to biomarker development in genomics studies for targeted clinical trials. A validated biomarker (diagnostic tool) is then used to identify patients who are most likely to respond to the test treatment under investigation in the enrichment process of the targeted clinical trials (FDA, 2005; FDA, 2007a, 2007b; Liu et al., 2009). As a result, precision medicine will benefit the subgroup of patients who are biomarker positive. In practice, however, there may exist no perfect diagnostic tool for determining whether a given patient is with or without the molecular target in the enrichment process of the targeted clinical trials. Possible misclassification, which could cause significant bias in the assessment of the treatment effect in targeted clinical trials, is probably the most challenging issue in precision medicine.

16.2.2 Biomarker-Driven Clinical Trials

With the surge in advanced technology, especially in the "-omics" space (e.g., genomics, proteomics, etc.), clinical trial designs that incorporate biomarker information for interim decisions have attracted much attention lately. The biomarker, which usually is a short-term endpoint indicative of the behavior of the primary endpoint, has the potential to provide substantial added value to interim study population selection (e.g., biomarker-enrichment designs) and interim treatment selection (e.g., biomarker-informed adaptive designs). As an example, for biomarker-enrichment designs, it is always of particular interest to clinicians to identify patients


with the disease targets under study, who are most likely to respond to the treatment under study. In practice, an enrichment process is often employed to identify such a target patient population. Clinical trials utilizing an enrichment design are referred to as targeted clinical trials. After the completion of the Human Genome Project, disease targets at a certain molecular level can be identified and should be utilized for the treatment of diseases (Maitournam and Simon, 2005; Casciano and Woodcock, 2006). As a result, diagnostic devices for the detection of diseases using biotechnology such as microarray, polymerase chain reaction (PCR), mRNA transcript profiling, and others have become possible in practice (FDA, 2005, 2007). Treatments specific for particular molecular targets could then be developed for those patients who are most likely to benefit. Consequently, personalized medicine could become a reality. The clinical development of Herceptin (trastuzumab), which is targeted at patients suffering from metastatic breast cancer with an over-expression of the HER2 (human epidermal growth factor receptor) protein, is a typical example (see Table 16.1). As can be seen from Table 16.1, Herceptin plus chemotherapy provides statistically significant additional clinical benefit in terms of overall survival over chemotherapy alone for patients with a staining score of 3+, while Herceptin plus chemotherapy fails to provide additional survival benefit for patients with a FISH (fluorescence in situ hybridization) or CTA (clinical trial assay) score of 2+. Note that the CTA is an investigational immunohistochemical (IHC) assay consisting of a four-point ordinal score system (0, 1+, 2+, 3+). However, as indicated in the Decision Summary of HercepTest (a commercial IHC assay for over-expression of the HER2 protein), about 10% of samples have discrepant results between 2+ and 3+ staining intensity. In other words, some patients tested with a score of 3+ may actually have a score of 2+ and vice versa.

TABLE 16.1
Treatment Effects as a Function of HER2 Over-Expression or Amplification

HER2 Assay Result    Number of Patients    Relative Risk for Mortality (95% CI)
CTA 2+ or 3+                469             0.80 (0.64, 1.00)
  FISH (+)                  325             0.70 (0.53, 0.91)
  FISH (−)                  126             1.06 (0.70, 1.63)
CTA 2+                      120             1.26 (0.82, 1.94)
  FISH (+)                   32             1.31 (0.53, 3.27)
  FISH (−)                   83             1.11 (0.68, 1.82)
CTA 3+                      349             0.70 (0.51, 0.89)
  FISH (+)                  293             0.67 (0.51, 0.89)
  FISH (−)                   43             0.88 (0.39, 1.98)

Source: U.S. FDA Annotated Redlined Draft Package Insert for Herceptin, Rockville, Maryland, 2006.


We will refer to these treatments as targeted treatments or drugs. The development of targeted treatments involves translation from the accuracy and precision of diagnostic devices for the molecular targets to the effectiveness and safety of the treatment modality for the patient population with the targets. Therefore, the evaluation of targeted treatments is much more complicated than that of traditional drugs. To address the issues of development of targeted drugs, in April 2005, the FDA published The Drug-Diagnostic Co-development Concept Paper. In clinical trials, subjects with and without disease targets may respond to the treatment differently, with different effect sizes. In other words, patients with disease targets may show a much larger effect size, while patients without disease targets may exhibit a relatively small effect size. In practice, fewer subjects are required for detecting a bigger effect size. Thus, a traditional clinical trial may conclude that the test treatment is ineffective based on the detection of a combined effect size, while the test treatment is in fact effective for those patients with positive disease targets. Consequently, personalized medicine is possible if we can identify those subjects with positive disease targets.

16.2.3 Precision Medicine versus Personalized Medicine

The term precision medicine is often confused with the term personalized medicine. To distinguish between precision medicine and personalized (or individualized) medicine, the National Research Council (NRC) indicates that precision medicine refers to the tailoring of medical treatment to the individual characteristics of each patient. It does not literally mean the creation of drugs or medical devices that are unique to a patient, but rather the ability to classify individuals into subpopulations that differ in their susceptibility to a particular disease, in the biology and/or prognosis of the diseases they may develop, or in their response to a specific treatment. In summary, precision medicine is intended to benefit a subgroup of patients with the diseases under study, while personalized medicine is intended to benefit individual subjects with the diseases under investigation.

Statistically, the term precision usually refers to the degree of closeness of the observed data to the truth. A high degree of closeness is an indication of high precision. Thus, precision is related to the variability associated with the observed data. In practice, the variability associated with observed data includes (i) intra-subject variability, (ii) inter-subject variability, and (iii) variability due to subject-by-treatment interaction. As a result, precision medicine can be viewed as the identification of a subgroup population with a larger effect size (i.e., smaller variability), assuming that the difference in mean response is fixed. Consequently, precision medicine focuses on minimizing inter-subject variability, while personalized medicine focuses on minimizing intra-subject variability. Table 16.2 provides a comparison between precision medicine and personalized medicine.


TABLE 16.2
Precision Medicine versus Personalized Medicine

Characteristic            Traditional Medicine    Precision Medicine           Personalized Medicine (a)
Active ingredient         Single                  Single                       Multiple
Target population         Population              Population                   Individuals
Primary focus             Mean                    Inter-subject variability    Intra-subject variability
Dose/regimen              Fixed                   Fixed                        Flexible
Beneficial                Average patient         Subgroup of patients         Individual patients
Statistical method        Hypotheses testing;     Hypotheses testing;          Hypotheses testing;
                          confidence interval     confidence interval          confidence interval
Use of biomarker          No                      Yes                          Yes
Blinding                  Yes                     Yes                          May be difficult
Objective                 Accuracy                Accuracy; precision          Accuracy; precision;
                                                                               reproducibility
Study design              Parallel/crossover      Parallel/crossover;          Parallel/crossover;
                                                  adaptive design              adaptive design
Probability of success    Low                     Mild-to-moderate             High

(a) Personalized medicine = individualized medicine.

16.3 Design and Analysis of Precision Medicine

16.3.1 Study Designs

As indicated in the FDA's Drug-Diagnostic Co-development Concept Paper, one of the useful designs for the evaluation of targeted treatments is the enrichment design (see also Chow and Liu, 2003). Under the enrichment design, targeted clinical trials consist of two phases. The first phase is the enrichment phase, in which each patient is tested by a diagnostic device for detection of the pre-defined molecular targets. Then, patients with a positive result by the diagnostic device are randomized to receive either the targeted treatment or a concurrent control. In practice, however, no diagnostic test is perfect with a 100% positive predicted value (PPV). As a result, some of the patients enrolled in targeted clinical trials under the enrichment design might not have the specific targets, and hence the treatment effects of the drug for the molecular targets could be under-estimated due to misclassification (Liu and Chow, 2008). Under the enrichment design, following the idea described in Liu and Chow (2008), Liu et al. (2009) proposed using the EM algorithm (Dempster et al., 1977; McLachlan and Krishnan, 1997) in conjunction with the bootstrap technique (Efron and


Tibshirani, 1993) for obtaining inference on the treatment effects. Their method, however, depends upon the accuracy and reliability of the diagnostic device. A poor (i.e., less accurate and reliable) diagnostic device may result in a large proportion of misclassification, which has an impact on the assessment of the true treatment effect. To overcome (correct) the problem of an inaccurate diagnostic device, we propose using a Bayesian approach in conjunction with the EM algorithm and the bootstrap technique for obtaining a more accurate and reliable estimate of the treatment effect under the various study designs recommended by the FDA. Under an enrichment design, one of the objectives of targeted clinical trials is to evaluate the treatment effects of the molecularly targeted test treatment in the patient population with the molecular target. The diagrams in the FDA concept paper (FDA, 2005) demonstrating these designs are reproduced in Figure 16.1 (Design A) and Figure 16.2 (Design B), respectively. Let Yij be the response of the jth subject in the ith group, where j = 1, ..., ni; i = T, C. The Yij are assumed approximately normally distributed with homogeneous variances between the test and control treatments. Also, let µT+, µC+ (µT−, µC−) be the means of the test and control groups for the patients with (without) the molecular target. Table 16.3 summarizes the population means by treatment and diagnosis. Under the enrichment design (Design A), Liu et al. (2009) considered a two-group parallel design in which patients with a positive result by the diagnostic device are randomized in a 1:1 ratio to receive the molecularly targeted test treatment (T) or a control treatment (C) (see Figure 16.2). In other words, only patients with positive diagnosed results are included in

FIGURE 16.1 Design A—Targeted clinical trials under an enrichment design.

FIGURE 16.2 Design B—Enrichment design for patients with positive results.


TABLE 16.3
Population Means by Treatment and Diagnosis (Positive Diagnosis +)

True Target Condition    Indicator of Diagnostic    Test Group    Control Group    Difference
+                        γ                          µT+           µC+              µT+ − µC+
−                        1 − γ                      µT−           µC−              µT− − µC−

Note: γ is the positive predicted value.

the study. For simplicity, Liu et al. (2009) assumed that the primary efficacy endpoint is a continuous variable. The results can be easily extended to other data types, such as binary responses and time-to-event data.

16.3.2 Statistical Methods

Under Design B (Figure 16.2) for targeted clinical trials, it is of interest to estimate the treatment effect for the patients truly having the molecular target, i.e., θ = µT+ − µC+. However, this effect may be contaminated due to misclassification, i.e., by those subjects who do not have the molecular target but have positive diagnosed results and those subjects who have the molecular target but have negative diagnosed results. The following hypotheses for detecting a clinically meaningful treatment difference in the patient population truly with the molecular target are of interest:

$$H_0:\ \mu_{T+}-\mu_{C+}=0 \quad \text{vs.} \quad H_a:\ \mu_{T+}-\mu_{C+}\ne 0. \qquad (16.1)$$

Let ȳT and ȳC be the sample means of the test and control treatments, respectively. Since no diagnostic test is perfect for the diagnosis of the molecular target of interest without error, some patients with a positive diagnostic result may in fact not have the molecular target. It follows that

$$E(\bar{y}_T-\bar{y}_C) = \gamma(\mu_{T+}-\mu_{C+}) + (1-\gamma)(\mu_{T-}-\mu_{C-}), \qquad (16.2)$$

where γ is the positive predicted value (PPV), which is often unknown. Thus, an accurate and reliable estimate of γ is the key to the success of targeted clinical trials (Liu et al., 2009) and hence of precision medicine. Liu and Chow (2008) indicated that the expected value of the difference in sample means consists of two parts. The first part is the treatment effect of the molecularly targeted drug in patients with a positive diagnosis who truly have the molecular target of interest. The second part is the treatment effect in the patients with a positive diagnosis who in fact do not have the molecular target. The reason for developing the targeted treatment is based on the assumption that the efficacy of the targeted treatment is greater in the patients truly with the molecular target than in those


without the target. In addition, the targeted treatment is also expected to be more efficacious than the untargeted control in the patient population truly with the molecular targets. It follows that µT+ − µC+ > µT− − µC−. As a result, the difference in sample means obtained under the enrichment design for targeted clinical trials actually under-estimates the true treatment effect of the molecularly targeted test drug in the patient population truly with the molecular target of interest. As can be seen from (16.2), the bias of the difference in sample means decreases as the positive predicted value increases. On the other hand, the positive predicted value of a diagnostic test increases as the prevalence of the disease increases (Fleiss et al., 2003). For a disease with a prevalence of, say, 10%, even with a high diagnostic accuracy of 95% sensitivity and specificity for the diagnostic device, the positive predicted value is only about 67.86%. It follows that the downward bias of the traditional difference in sample means could be substantial for the estimation of the treatment effect of the molecularly targeted drug in patients who do have the target of interest. The traditional unpaired two-sample t-test approach is to reject the null hypothesis in (16.1) at the α level of significance if

$$t = \frac{\bar{y}_T-\bar{y}_C}{\sqrt{s_p^2\,(1/n_T+1/n_C)}} \ \ge\ t_{\alpha/2,\,n_T+n_C-2},$$

where s_p² is the pooled sample variance and t_{α/2, nT+nC−2} is the upper (α/2)th percentile of a central t distribution with nT + nC − 2 degrees of freedom. Since ȳT − ȳC under-estimates µT+ − µC+, the planned sample size may not be sufficient for achieving the desired power for detecting the true treatment effect in the patients truly with the molecular target of interest. Based on the above t-statistic, the corresponding (1 − α) × 100% confidence interval can be obtained as

$$(\bar{y}_T-\bar{y}_C) \pm t_{\alpha/2,\,n_T+n_C-2}\,\sqrt{s_p^2\left(\frac{1}{n_T}+\frac{1}{n_C}\right)}.$$
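To make the two facts just used concrete, the short sketch below computes the PPV from sensitivity, specificity, and prevalence via Bayes' theorem and then evaluates the attenuated expectation in (16.2); the effect sizes supplied are hypothetical.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predicted value via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

def expected_naive_difference(gamma, effect_pos, effect_neg):
    """Expected value of the naive difference in sample means, per (16.2)."""
    return gamma * effect_pos + (1.0 - gamma) * effect_neg

g = ppv(0.95, 0.95, 0.10)
print(round(g, 4))                                      # ~0.6786, as quoted above
# A true effect of 10 in target-positive patients (and 0 otherwise) is attenuated:
print(expected_naive_difference(g, effect_pos=10.0, effect_neg=0.0))   # ~6.79
```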

Although all patients randomized under the enrichment design have a positive diagnosis, the true status of the molecular target for the individual patients in the targeted clinical trials is in fact unknown. It follows that, under the assumption of homogeneity of variance, the Yij are independently distributed as a mixture of two normal distributions with means µi+ and µi−, respectively, and common variance σ² (McLachlan and Peel, 2000):

$$f(y_{ij}) = \gamma\,\phi(y_{ij}\,|\,\mu_{i+},\sigma^2) + (1-\gamma)\,\phi(y_{ij}\,|\,\mu_{i-},\sigma^2), \quad i = T, C;\ j = 1,\ldots,n_i, \qquad (16.3)$$

where φ(·|·) denotes the density of a normal variable.
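For illustration, data from model (16.3) can be simulated by first drawing the latent target status; the parameter values are hypothetical and the sketch covers one treatment group.

```python
import numpy as np

def simulate_enrichment_data(n, gamma, mu_pos, mu_neg, sigma, seed=None):
    """Simulate responses from model (16.3): each enrolled (diagnosis-positive)
    subject truly has the target with probability gamma (the PPV)."""
    rng = np.random.default_rng(seed)
    x = rng.binomial(1, gamma, size=n)                 # latent true target status
    y = rng.normal(np.where(x == 1, mu_pos, mu_neg), sigma)
    return y, x

# Hypothetical test-group data: gamma = 0.7, effect only in target-positive subjects
y_T, x_T = simulate_enrichment_data(100, 0.7, mu_pos=110.0, mu_neg=100.0,
                                    sigma=20.0, seed=42)
print(y_T.mean())   # attenuated toward 100 because ~30% of subjects lack the target
```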


However, γ is an unknown positive predicted value, which is usually estimated from the data. Therefore, the data obtained from the targeted clinical trials are incomplete, because the true status of the molecular target of the patients is missing. The EM algorithm is one of the methods for obtaining the maximum likelihood estimators of the parameters of an underlying distribution from a given data set when the data are incomplete or have missing values. On the other hand, the diagnostic device for the detection of molecular targets has been validated in diagnostic effectiveness trials for its diagnostic accuracy. Therefore, estimates of the positive predictive value of the diagnostic device can be obtained from the previously conducted diagnostic effectiveness trials. As a result, we can apply the EM algorithm to estimate the treatment effect for the patients truly with the molecular target by incorporating the estimate of the positive predictive value of the device obtained from the diagnostic effectiveness trials as the initial value. For each patient, we have a pair of variables (Yij, Xij), where Yij is the observed primary efficacy endpoint of patient j in treatment i and Xij is the latent variable indicating the true status of the molecular target of patient j in treatment i; i = T, C; j = 1, ..., ni. In other words, Xij is an indicator variable with a value of 1 for the patients truly with the molecular target and a value of 0 for the patients truly without the target. In addition, the Xij are assumed to be i.i.d. Bernoulli random variables with probability γ of the molecular target. Let Ψ = (γ, µT+, µT−, µC+, µC−, σ²)′ be the vector containing all unknown parameters and y_obs = (yT1, ..., yTnT, yC1, ..., yCnC)′ be the vector of the observed primary efficacy endpoints from the targeted clinical trials. It follows that the complete-data log-likelihood function is given by

$$\begin{aligned}
\log L_c(\Psi) = &\sum_{j=1}^{n_T} x_{Tj}\left[\log\gamma + \log\phi(y_{Tj}\,|\,\mu_{T+},\sigma^2)\right] \\
+ &\sum_{j=1}^{n_T} (1-x_{Tj})\left[\log(1-\gamma) + \log\phi(y_{Tj}\,|\,\mu_{T-},\sigma^2)\right] \\
+ &\sum_{j=1}^{n_C} x_{Cj}\left[\log\gamma + \log\phi(y_{Cj}\,|\,\mu_{C+},\sigma^2)\right] \\
+ &\sum_{j=1}^{n_C} (1-x_{Cj})\left[\log(1-\gamma) + \log\phi(y_{Cj}\,|\,\mu_{C-},\sigma^2)\right]. \qquad (16.4)
\end{aligned}$$

Furthermore, from the previous diagnostic effectiveness trials, an estimate of the positive predictive value of the device is known. Therefore, at the initial step of the EM algorithm for estimating the treatment effects in the patients with the molecular target, the latent variables Xij are generated as i.i.d. Bernoulli random variables with the positive predicted


value γ estimated by that obtained from the diagnostic effectiveness trial. The procedures for the implementation of the EM algorithm in conjunction with the bootstrap procedure for inference on θ in the patient population truly with the molecular target are briefly described below.

At the (k + 1)st iteration, the E-step requires the calculation of the conditional expectation of the complete-data log-likelihood L_c(Ψ), given the observed data y_obs, using the current fit Ψ̂^(k) for Ψ:

$$Q(\Psi;\hat\Psi^{(k)}) = E_{\hat\Psi^{(k)}}\{\log L_c(\Psi)\,|\,y_{obs}\}.$$

Since log L_c(Ψ) is a linear function of the unobservable component-label variables x_ij, the E-step is carried out by replacing x_ij by its conditional expectation given y_ij, using Ψ̂^(k) for Ψ. That is, x_ij is replaced by

$$\hat{x}_{ij}^{(k)} = E_{\hat\Psi^{(k)}}\{x_{ij}\,|\,y_{ij}\} = \frac{\hat\gamma_i^{(k)}\,\phi\big(y_{ij}\,|\,\hat\mu_{i+}^{(k)},(\hat\sigma_i^2)^{(k)}\big)}{\hat\gamma_i^{(k)}\,\phi\big(y_{ij}\,|\,\hat\mu_{i+}^{(k)},(\hat\sigma_i^2)^{(k)}\big)+\big(1-\hat\gamma_i^{(k)}\big)\,\phi\big(y_{ij}\,|\,\hat\mu_{i-}^{(k)},(\hat\sigma_i^2)^{(k)}\big)}, \quad i = T, C,$$

which is the estimate of the posterior probability that observation y_ij comes from a subject with the molecular target after the kth iteration. The M-step requires the computation of γ̂_i^(k+1), µ̂_{i+}^(k+1), µ̂_{i−}^(k+1), and (σ̂_i²)^(k+1), i = T, C, by maximizing log L_c(Ψ). This is equivalent to computing the sample proportion and the weighted sample means and variances with weights x̂_ij. Since log L_c(Ψ) is linear in the x_ij, it follows that the x_ij are replaced by their conditional expectations x̂_ij^(k). On the (k + 1)st iteration, the intent is to choose the value of Ψ, say Ψ̂^(k+1), that maximizes Q(Ψ; Ψ̂^(k)). It follows that on the M-step of the (k + 1)st iteration, the current fit for the positive predicted value in the test drug group and the control group is given by

γ i( k +1

ni

 (ijk ) X

j =1

ni

,

i = T , C.

Under the assumption that n_T = n_C, it follows that the overall positive predicted value is estimated by

$$\hat\gamma^{(k+1)} = \big(\hat\gamma_T^{(k+1)}+\hat\gamma_C^{(k+1)}\big)\big/2.$$


The means of the molecularly targeted test drug and the control can then be estimated, respectively, as

$$\hat\mu_{T+}^{(k+1)} = \frac{\sum_{j=1}^{n_T}\hat{x}_{Tj}^{(k)}\,y_{Tj}}{\sum_{j=1}^{n_T}\hat{x}_{Tj}^{(k)}}, \qquad \hat\mu_{T-}^{(k+1)} = \frac{\sum_{j=1}^{n_T}\big(1-\hat{x}_{Tj}^{(k)}\big)\,y_{Tj}}{\sum_{j=1}^{n_T}\big(1-\hat{x}_{Tj}^{(k)}\big)},$$

$$\hat\mu_{C+}^{(k+1)} = \frac{\sum_{j=1}^{n_C}\hat{x}_{Cj}^{(k)}\,y_{Cj}}{\sum_{j=1}^{n_C}\hat{x}_{Cj}^{(k)}}, \qquad \hat\mu_{C-}^{(k+1)} = \frac{\sum_{j=1}^{n_C}\big(1-\hat{x}_{Cj}^{(k)}\big)\,y_{Cj}}{\sum_{j=1}^{n_C}\big(1-\hat{x}_{Cj}^{(k)}\big)},$$

with unbiased estimators for the variances of the molecularly targeted drug and the control given, respectively, by

$$(\hat\sigma_T^2)^{(k+1)} = \frac{\sum_{j=1}^{n_T}\hat{x}_{Tj}^{(k)}\big(y_{Tj}-\hat\mu_{T+}^{(k+1)}\big)^2 + \sum_{j=1}^{n_T}\big(1-\hat{x}_{Tj}^{(k)}\big)\big(y_{Tj}-\hat\mu_{T-}^{(k+1)}\big)^2}{n_T-2}$$

and

$$(\hat\sigma_C^2)^{(k+1)} = \frac{\sum_{j=1}^{n_C}\hat{x}_{Cj}^{(k)}\big(y_{Cj}-\hat\mu_{C+}^{(k+1)}\big)^2 + \sum_{j=1}^{n_C}\big(1-\hat{x}_{Cj}^{(k)}\big)\big(y_{Cj}-\hat\mu_{C-}^{(k+1)}\big)^2}{n_C-2}.$$

It follows that an unbiased estimator of the pooled variance is given by

$$(\hat\sigma^2)^{(k+1)} = \frac{(n_T-2)\,(\hat\sigma_T^2)^{(k+1)} + (n_C-2)\,(\hat\sigma_C^2)^{(k+1)}}{n_T+n_C-4}.$$

Therefore, the estimator of the treatment effect in the patients with the molecular target, θ, obtained from the EM algorithm is given as θ̂ = µ̂T+ − µ̂C+. Liu et al. (2009) proposed applying the parametric bootstrap method to estimate the standard error of θ̂:

Step 1: Choose a large bootstrap sample size, say B = 1,000. For 1 ≤ b ≤ B, generate the bootstrap sample y_obs^b according to the probability model in (16.3). The parameters in (16.3) for generating the bootstrap samples y_obs^b are substituted by the estimators obtained from the EM algorithm based on the original observations of the primary efficacy endpoints from the targeted clinical trials.

Step 2: The EM algorithm is applied to the bootstrap sample y_obs^b to obtain estimates θ̂_b*, b = 1, ..., B.


Step 3: An estimator of the variance of θ̂ by the parametric bootstrap procedure is given as

$$S_B^2 = \sum_{b=1}^{B}\big(\hat\theta_b^*-\bar\theta^*\big)^2\big/(B-1), \quad \text{where } \bar\theta^* = \sum_{b=1}^{B}\hat\theta_b^*\big/B.$$
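The following is a simplified Python sketch of the E-step, M-step, and parametric bootstrap just described, assuming for brevity a single mixture fit per group with a common within-group variance (rather than the group-averaged γ and pooled-variance bookkeeping above); all names and inputs are hypothetical.

```python
import numpy as np
from scipy import stats

def em_mixture(y, gamma0, n_iter=100):
    """Simplified EM fit of the two-component normal mixture in one group.
    gamma0, the PPV estimate from the diagnostic effectiveness trial, is
    used as the starting value; returns (gamma, mu_plus, mu_minus, sigma2)."""
    y = np.asarray(y, float)
    g, mp, mm, s2 = gamma0, np.quantile(y, 0.75), np.quantile(y, 0.25), y.var()
    for _ in range(n_iter):
        # E-step: posterior probability that each subject truly has the target
        num = g * stats.norm.pdf(y, mp, np.sqrt(s2))
        den = num + (1 - g) * stats.norm.pdf(y, mm, np.sqrt(s2))
        x = num / den
        # M-step: weighted proportion, means, and variance
        g = x.mean()
        mp = np.sum(x * y) / np.sum(x)
        mm = np.sum((1 - x) * y) / np.sum(1 - x)
        s2 = np.sum(x * (y - mp) ** 2 + (1 - x) * (y - mm) ** 2) / (len(y) - 2)
    return g, mp, mm, s2

def em_bootstrap_test(y_T, y_C, gamma0, B=500, seed=0):
    """theta_hat = mu_T+ - mu_C+ with a parametric bootstrap SE (Steps 1-3);
    compare |theta_hat / se| with z_{alpha/2}."""
    rng = np.random.default_rng(seed)
    gT, mTp, mTm, s2T = em_mixture(y_T, gamma0)
    gC, mCp, mCm, s2C = em_mixture(y_C, gamma0)
    theta, boot = mTp - mCp, []
    for _ in range(B):
        # Step 1: generate a bootstrap sample from the fitted model (16.3)
        xT = rng.binomial(1, gT, len(y_T))
        xC = rng.binomial(1, gC, len(y_C))
        yTb = rng.normal(np.where(xT == 1, mTp, mTm), np.sqrt(s2T))
        yCb = rng.normal(np.where(xC == 1, mCp, mCm), np.sqrt(s2C))
        # Step 2: refit by EM on the bootstrap sample
        boot.append(em_mixture(yTb, gamma0)[1] - em_mixture(yCb, gamma0)[1])
    se = np.std(boot, ddof=1)                # Step 3: bootstrap standard error
    return theta, se, theta / se

# Hypothetical data: true effect 10 only in target-positive subjects, PPV = 0.7
rng = np.random.default_rng(42)
x = rng.binomial(1, 0.7, 100)
y_T = rng.normal(np.where(x == 1, 110.0, 100.0), 20.0)
y_C = rng.normal(100.0, 20.0, 100)
print(em_bootstrap_test(y_T, y_C, gamma0=0.7))
```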

Let θ̂ be the estimator of the treatment effect in the patients truly with the molecular target obtained from the EM algorithm. Nityasuddhi and Böhning (2003) showed that the estimator obtained under the EM algorithm is asymptotically unbiased. Let S_B² denote the estimator of the variance of θ̂ obtained by the bootstrap procedure. It follows that the null hypothesis is rejected, and the efficacy of the molecularly targeted test drug is concluded to be different from that of the control in the patient population truly with the molecular target, at the α level if

$$|t| = \big|\hat\theta\big|\big/\sqrt{S_B^2}\ \ge\ z_{\alpha/2}, \qquad (16.5)$$

where z_{α/2} is the upper (α/2)th percentile of the standard normal distribution. Thus, the corresponding (1 − α) × 100% asymptotic confidence interval for θ = µT+ − µC+ can be constructed as

$$\hat\theta \pm z_{\alpha/2}\,\sqrt{S_B^2}$$

(see, e.g., Basford et al., 1997). It should be noted that although the assumption that

µT+ − µC+ > µT− − µC− is one of the reasons for developing the targeted treatment, this assumption is not used in the EM algorithm for the estimation of θ. Hence, the inference for θ by the proposed procedure is not biased in favor of the targeted treatment.

16.3.3 Simulation Results

Liu et al. (2009) conducted a simulation study to evaluate the finite-sample performance of the proposed EM-algorithm method. In the simulation, μT−, μC+, and μC− are assumed equal and set to a generic value of 100. To investigate the impact of the positive predictive value, sample size, difference in means, and variability, Liu et al. (2009) considered the following specifications of parameters: (1) the positive predicted value is set to 0.5, 0.7, 0.8, and 0.9, which reflects a range of low, median, and high positive predicted values, and (2) the standard deviation σ is set to 20, 40, or 60. To investigate the finite-sample properties, the sample sizes are set to 50, 100, and 200 per group. The mean differences are chosen as fractions of the standard deviation, from 10% to 60% by 10%, and 75% and 100%. In addition, the size of the


proposed testing procedure was investigated at μT+ = 100. For each of the 288 combinations, 5,000 random samples were generated, and the number of bootstrap samples was set to 1,000. The simulation results indicate that the absolute relative bias of the estimator of θ by the current method ranges from 10% to more than 50% and increases as the positive predictive value decreases. On the other hand, most of the absolute relative bias measurements of the estimator of θ obtained by the EM algorithm are smaller than 0.05%, although the bias can be as high as 10% for a few combinations when the difference in means is 2. The variability has little impact on the bias of both methods. However, for the EM procedure, the relative bias tends to decrease as the sample size increases. The bias of the current method with consideration of the true status of the molecular target can be as high as 50% when the positive predictive value is low. Consequently, the empirical coverage probability of the corresponding 95% confidence interval can be as low as 0.28 when the positive predictive value is 50%, the mean difference is 20, the standard deviation is 20, and n is 200. The coverage probability of the 95% confidence interval by the current method is an increasing function of the positive predictive value. On the other hand, only 36 of the 288 coverage probabilities (12.5%) of the 95% confidence intervals by the current method exceed 0.9449, and 24 of them occur when the positive predictive value is 0.9. On the contrary, only 14.6% of the 288 coverage probabilities of the 95% confidence intervals by the EM method are below 0.9449. Moreover, 277 of the 288 coverage probabilities of the 95% confidence intervals constructed by the EM algorithm are above 0.94, and no coverage probability of the EM method is below 0.91. Therefore, the proposed procedure for the estimation of the treatment effect in the patient population with the molecular target by the EM algorithm is not only unbiased but also provides sufficient coverage probability.

16.4 Alternative Enrichment Designs

16.4.1 Alternative Designs with/without Molecular Targets

As indicated above, Liu et al. (2009) proposed statistical methods for the assessment of the treatment effect for patients with positive diagnosed results under the enrichment design described in Figure 16.2. Their methods suffer from the lack of information regarding the proportion of subjects who truly have the molecular target in the patient population and the unknown positive predicted value. Consequently, the conclusion drawn from the collected data may be biased and misleading. In addition to the study designs given in Figures 16.1 and 16.2, the 2005 FDA concept paper also recommended the following two study designs for different study objectives (see Figure 16.3 for Design C and Figure 16.4 for Design D).


FIGURE 16.3 Design C—Enrichment design for patients with and without molecular targets.

FIGURE 16.4 Design D—Alternative enrichment design for targeted clinical trials. (All subjects are randomized to test or control; a subset of subjects is additionally diagnosed, with randomization within the diagnosis-positive and diagnosis-negative subgroups.)

This study design allows the evaluation of the treatment effect within subpopulations, i.e., the subpopulations of patients with positive or negative results. Similar to Table 16.3 for the study design given in Figure 16.1, the expected values of Yij by treatment and diagnostic result of the molecular targets are summarized in Table 16.4. As a result, it may be of interest to estimate the following treatment effects:

$$\begin{aligned}
\theta_1 &= \gamma_1(\mu_{T++}-\mu_{C++}) + (1-\gamma_1)(\mu_{T+-}-\mu_{C+-});\\
\theta_2 &= \gamma_2(\mu_{T-+}-\mu_{C-+}) + (1-\gamma_2)(\mu_{T--}-\mu_{C--});\\
\theta_3 &= \delta\gamma_1(\mu_{T++}-\mu_{C++}) + (1-\delta)\gamma_2(\mu_{T-+}-\mu_{C-+});\\
\theta_4 &= \delta(1-\gamma_1)(\mu_{T+-}-\mu_{C+-}) + (1-\delta)(1-\gamma_2)(\mu_{T--}-\mu_{C--});\\
\theta_5 &= \delta\left[\gamma_1(\mu_{T++}-\mu_{C++}) + (1-\gamma_1)(\mu_{T+-}-\mu_{C+-})\right]\\
&\quad + (1-\delta)\left[\gamma_2(\mu_{T-+}-\mu_{C-+}) + (1-\gamma_2)(\mu_{T--}-\mu_{C--})\right],
\end{aligned}$$

where δ is the proportion of subjects with positive molecular targets.
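Given estimates of the cell means together with values of γ1, γ2, and δ, these estimands can be evaluated directly; the short sketch below does so with hypothetical inputs, following the weighting pattern of the expressions above.

```python
def theta_estimands(mu, gamma1, gamma2, delta):
    """Evaluate theta_1..theta_5 from cell means keyed by
    (group, diagnosis, true target status), e.g., mu[('T', '+', '+')]."""
    d = {(j, k): mu[('T', j, k)] - mu[('C', j, k)] for j in '+-' for k in '+-'}
    t1 = gamma1 * d['+', '+'] + (1 - gamma1) * d['+', '-']
    t2 = gamma2 * d['-', '+'] + (1 - gamma2) * d['-', '-']
    t3 = delta * gamma1 * d['+', '+'] + (1 - delta) * gamma2 * d['-', '+']
    t4 = delta * (1 - gamma1) * d['+', '-'] + (1 - delta) * (1 - gamma2) * d['-', '-']
    t5 = delta * t1 + (1 - delta) * t2
    return t1, t2, t3, t4, t5

# Hypothetical cell means: a 12-point benefit only in truly target-positive subjects
mu = {(i, j, k): 100.0 for i in 'TC' for j in '+-' for k in '+-'}
mu[('T', '+', '+')] = mu[('T', '-', '+')] = 112.0
print(theta_estimands(mu, gamma1=0.8, gamma2=0.2, delta=0.5))
```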


TABLE 16.4
Population Means by Treatment and Diagnosis

Diagnosis    True Target Condition    Indicator of Diagnostic    Test Group    Control Group    Difference
+            +                        γ1                         µT++          µC++             µT++ − µC++
+            −                        1 − γ1                     µT+−          µC+−             µT+− − µC+−
−            +                        γ2                         µT−+          µC−+             µT−+ − µC−+
−            −                        1 − γ2                     µT−−          µC−−             µT−− − µC−−

Notes: γi is the positive predicted value, i = 1 (positive diagnosis) and i = 2 (negative diagnosis). µijk is the mean for subjects in the ith group with kth true target status but with jth diagnosed result.

Following similar ideas as described in the previous section, estimates of θ1 through θ5 can be obtained. In other words, estimates of θ1 and θ2 can be obtained based on data collected from the subpopulations of subjects with and without positive diagnoses who truly have the molecular target of interest. Similarly, the combined treatment effect θ5 can be assessed. These estimates, however, depend upon both γi, i = 1, 2, and δ. To obtain some information regarding γi, i = 1, 2, and δ, the FDA recommends the alternative enrichment design (Design D) that includes a group of subjects without any diagnoses and a subset of subjects who will be diagnosed at the screening stage.

16.4.2 Statistical Methods

As indicated earlier, the method proposed by Liu et al. (2009) suffers from the lack of information regarding the uncertainty in the accuracy of the diagnostic device. As an alternative, we propose considering a Bayesian approach to incorporate the uncertainty in the accuracy and reliability of the diagnostic device for the molecular target into the inference of the treatment effects of the targeted drug. For each patient, we have a pair of variables (yij, xij), where yij is the observed primary efficacy endpoint of patient j in treatment i and xij is the latent variable indicating the true status of the molecular target of patient j in treatment i, j = 1, ..., ni, i = T, C. In other words, xij is an indicator variable with a value of 1 for patients with the molecular target and a value of 0 for patients without the target. The xij are assumed to be i.i.d. Bernoulli random variables with probability γ of the molecular target. Thus,

$$x_{ij}=1 \ \text{if } y_{ij}\sim N(\mu_{i+},\sigma^2) \quad \text{and} \quad x_{ij}=0 \ \text{if } y_{ij}\sim N(\mu_{i-},\sigma^2), \quad i = T, C;\ j = 1,\ldots,n_i.$$


The likelihood function is given by

$$\begin{aligned}
L(\Psi\,|\,Y_{obs},x_{ij}) = &\prod_{j:\,x_{Tj}=1}\gamma\,\phi(y_{Tj}\,|\,\mu_{T+},\sigma^2)\times\prod_{j:\,x_{Tj}=0}(1-\gamma)\,\phi(y_{Tj}\,|\,\mu_{T-},\sigma^2)\\
\times &\prod_{j:\,x_{Cj}=1}\gamma\,\phi(y_{Cj}\,|\,\mu_{C+},\sigma^2)\times\prod_{j:\,x_{Cj}=0}(1-\gamma)\,\phi(y_{Cj}\,|\,\mu_{C-},\sigma^2),
\end{aligned}$$

where i = T, C; j = 1, ..., ni, and φ(·|·) denotes the density of a normal variable. For the Bayesian approach, a beta distribution can be employed as the prior distribution for γ, while normal prior distributions can be used for µi+ and µi−. In addition, a gamma distribution can be used as a prior for σ⁻². Under these prior distributions, the conditional posterior distributions of γ, µi+, µi−, and σ⁻² can be derived. In other words, assume that

$$f(\gamma)\sim\text{Beta}(\alpha_r,\beta_r),\quad f(\mu_{i+})\sim N(\lambda_{i+},\sigma_0^2),\quad f(\mu_{i-})\sim N(\lambda_{i-},\sigma_0^2),\quad f(\sigma^{-2})\sim\text{Gamma}(\alpha_g,\beta_g),$$

where µi+, µi−, and γ are assumed to be independent and αr, βr, αg, βg, λi+, λi−, and σ0² are assumed to be known. Thus, the conditional posterior distribution of xij is given by

$$f(x_{ij}\,|\,\gamma,\mu_{i+},\mu_{i-},\sigma^2,Y_{obs})\sim\text{Bernoulli}\!\left(\frac{\gamma\,\phi(y_{ij}\,|\,\mu_{i+},\sigma^2)}{\gamma\,\phi(y_{ij}\,|\,\mu_{i+},\sigma^2)+(1-\gamma)\,\phi(y_{ij}\,|\,\mu_{i-},\sigma^2)}\right),$$

whose mean coincides with

$$E_\Psi[x_{ij}\,|\,\gamma,\mu_{i+},\mu_{i-},Y_{obs}]=\frac{\gamma\,\phi(y_{ij}\,|\,\mu_{i+},\sigma^2)}{\gamma\,\phi(y_{ij}\,|\,\mu_{i+},\sigma^2)+(1-\gamma)\,\phi(y_{ij}\,|\,\mu_{i-},\sigma^2)},\quad i = T, C;\ j = 1,\ldots,n_i,$$

in the EM algorithm. The joint distribution of γ, µi+, µi−, and σ², given Y_obs and x_ij, is given by


$$\begin{aligned}
f(\gamma,\mu_{i+},\mu_{i-},\sigma^2\,|\,Y_{obs},x_{ij}) \propto\ &\prod_{j:\,x_{Tj}=1}\phi(y_{Tj}\,|\,\mu_{T+},\sigma^2)\times\prod_{j:\,x_{Tj}=0}\phi(y_{Tj}\,|\,\mu_{T-},\sigma^2)\\
\times &\prod_{j:\,x_{Cj}=1}\phi(y_{Cj}\,|\,\mu_{C+},\sigma^2)\times\prod_{j:\,x_{Cj}=0}\phi(y_{Cj}\,|\,\mu_{C-},\sigma^2)\\
\times &\ \phi(\mu_{T+}\,|\,\lambda_{T+},\sigma_0^2)\,\phi(\mu_{T-}\,|\,\lambda_{T-},\sigma_0^2)\,\phi(\mu_{C+}\,|\,\lambda_{C+},\sigma_0^2)\,\phi(\mu_{C-}\,|\,\lambda_{C-},\sigma_0^2)\\
\times &\ \frac{\Gamma(\alpha_r+\beta_r)}{\Gamma(\alpha_r)\,\Gamma(\beta_r)}\,\gamma^{\sum_{j=1}^{n_T}x_{Tj}+\sum_{j=1}^{n_C}x_{Cj}+\alpha_r-1}\,(1-\gamma)^{\sum_{j=1}^{n_T}(1-x_{Tj})+\sum_{j=1}^{n_C}(1-x_{Cj})+\beta_r-1}.
\end{aligned}$$

Thus, the conditional posterior distributions of γ, µi+, µi−, and σ⁻² can be obtained as follows:

$$f(\gamma\,|\,\mu_{i+},\mu_{i-},\sigma^{-2},Y_{obs},x_{ij})\sim\text{Beta}\!\left(\sum_{j=1}^{n_T}x_{Tj}+\sum_{j=1}^{n_C}x_{Cj}+\alpha_r,\ \sum_{j=1}^{n_T}(1-x_{Tj})+\sum_{j=1}^{n_C}(1-x_{Cj})+\beta_r\right),$$

$$f(\mu_{i+}\,|\,\gamma,\mu_{i-},\sigma^{-2},Y_{obs},x_{ij})\sim N\!\left(\frac{\sigma^{-2}\sum_{j=1}^{n_i}x_{ij}y_{ij}+\sigma_0^{-2}\lambda_{i+}}{\sigma^{-2}\sum_{j=1}^{n_i}x_{ij}+\sigma_0^{-2}},\ \frac{1}{\sigma^{-2}\sum_{j=1}^{n_i}x_{ij}+\sigma_0^{-2}}\right),$$

$$f(\mu_{i-}\,|\,\gamma,\mu_{i+},\sigma^{-2},Y_{obs},x_{ij})\sim N\!\left(\frac{\sigma^{-2}\sum_{j=1}^{n_i}(1-x_{ij})y_{ij}+\sigma_0^{-2}\lambda_{i-}}{\sigma^{-2}\sum_{j=1}^{n_i}(1-x_{ij})+\sigma_0^{-2}},\ \frac{1}{\sigma^{-2}\sum_{j=1}^{n_i}(1-x_{ij})+\sigma_0^{-2}}\right),$$

$$f(\sigma^{-2}\,|\,\gamma,\mu_{i+},\mu_{i-},Y_{obs},x_{ij})\sim\text{Gamma}\!\left(\frac{n_T+n_C}{2}+\alpha_g,\ \frac{1}{2}\sum_{i=T,C}\sum_{j=1}^{n_i}\left[x_{ij}(y_{ij}-\mu_{i+})^2+(1-x_{ij})(y_{ij}-\mu_{i-})^2\right]+\beta_g\right),$$

respectively. Consequently, the conditional posterior distribution of θ = µT+ − µC+ can be obtained as follows:

$$\begin{aligned}
f(\hat\theta\,|\,\gamma,\mu_{i+},\mu_{i-},\sigma^2,Y_{obs},x_{ij})\sim N\Bigg(&\frac{\sigma^{-2}\sum_{j=1}^{n_T}x_{Tj}y_{Tj}+\sigma_0^{-2}\lambda_{T+}}{\sigma^{-2}\sum_{j=1}^{n_T}x_{Tj}+\sigma_0^{-2}}-\frac{\sigma^{-2}\sum_{j=1}^{n_C}x_{Cj}y_{Cj}+\sigma_0^{-2}\lambda_{C+}}{\sigma^{-2}\sum_{j=1}^{n_C}x_{Cj}+\sigma_0^{-2}},\\
&\frac{1}{\sigma^{-2}\sum_{j=1}^{n_T}x_{Tj}+\sigma_0^{-2}}+\frac{1}{\sigma^{-2}\sum_{j=1}^{n_C}x_{Cj}+\sigma_0^{-2}}\Bigg).
\end{aligned}$$

As a result, statistical inference for θ = µT+ − µC+ can be obtained. Following similar ideas, statistical inferences for the other treatment effects can be derived. Note that different priors for γ, µi+, µi−, and σ⁻² may be applied depending upon the disease targets across different therapeutic areas. However, different prior assumptions will result in different statistical inferences for the assessment of the treatment effect under study.
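A Gibbs sampler that cycles through the conditional posteriors above can be sketched as follows; the prior hyperparameters, starting values, and data are hypothetical, and practical use would require convergence diagnostics and burn-in.

```python
import numpy as np

def norm_pdf(y, m, s):
    """Normal density, used for the Bernoulli full conditional of x_ij."""
    return np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def gibbs_enrichment(y_T, y_C, n_iter=5000, a_r=7.0, b_r=3.0,
                     lam=100.0, sig0_2=100.0, a_g=0.01, b_g=0.01, seed=0):
    """Gibbs sampler for the Bayesian enrichment model sketched above.
    Priors: gamma ~ Beta(a_r, b_r); each component mean ~ N(lam, sig0_2);
    sigma^{-2} ~ Gamma(a_g, b_g). Returns draws of theta = mu_T+ - mu_C+."""
    rng = np.random.default_rng(seed)
    ys = {'T': np.asarray(y_T, float), 'C': np.asarray(y_C, float)}
    mu = {('T', '+'): 110.0, ('T', '-'): 100.0,      # hypothetical starting values
          ('C', '+'): 100.0, ('C', '-'): 100.0}
    g, prec = 0.7, 1.0 / np.var(np.concatenate(list(ys.values())))
    theta = np.empty(n_iter)
    for it in range(n_iter):
        x, ss = {}, 0.0
        for i in ('T', 'C'):
            sd = 1.0 / np.sqrt(prec)
            # Bernoulli full conditional for the latent target status
            f_pos = g * norm_pdf(ys[i], mu[(i, '+')], sd)
            f_neg = (1.0 - g) * norm_pdf(ys[i], mu[(i, '-')], sd)
            x[i] = rng.binomial(1, f_pos / (f_pos + f_neg))
            for s, w in (('+', x[i]), ('-', 1 - x[i])):
                # Conjugate normal update for each component mean
                v = 1.0 / (prec * w.sum() + 1.0 / sig0_2)
                m = v * (prec * np.sum(w * ys[i]) + lam / sig0_2)
                mu[(i, s)] = rng.normal(m, np.sqrt(v))
                ss += np.sum(w * (ys[i] - mu[(i, s)]) ** 2)
        n_all = len(ys['T']) + len(ys['C'])
        # Gamma full conditional for the precision sigma^{-2}
        prec = rng.gamma(n_all / 2.0 + a_g, 1.0 / (ss / 2.0 + b_g))
        # Beta full conditional for gamma
        g = rng.beta(x['T'].sum() + x['C'].sum() + a_r,
                     (1 - x['T']).sum() + (1 - x['C']).sum() + b_r)
        theta[it] = mu[('T', '+')] - mu[('C', '+')]
    return theta

# Hypothetical data; discard a burn-in before summarizing the posterior of theta
rng = np.random.default_rng(1)
y_T = rng.normal(np.where(rng.random(100) < 0.7, 110.0, 100.0), 15.0)
y_C = rng.normal(100.0, 15.0, 100)
draws = gibbs_enrichment(y_T, y_C, n_iter=3000)[500:]
print(draws.mean(), np.quantile(draws, [0.025, 0.975]))
```

An informative Beta prior on γ (here Beta(7, 3), centered near the PPV from the diagnostic effectiveness trial) is one way to encode the uncertainty in the device's accuracy; a flatter prior would let the trial data dominate.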

16.4.3 Remarks

In addition to the study designs proposed by the FDA, Freidlin and Simon (2005) proposed an adaptive signature design for randomized clinical trials of targeted agents in settings where an assay or signature that identified

sensitive patients was not available at the outset of the study. The design combined prospective development of a gene expression-based classifier to select sensitive patients with a properly powered test for overall effect. Jiang et al. (2007) proposed a biomarker-adaptive threshold design for settings in which a putative biomarker to identify patients who were sensitive to the new agent was measured on a continuous or graded scale. The design combined a test for overall treatment effect in all randomly assigned patients with the establishment and validation of a cut point for a prespecified biomarker of the sensitive subpopulation. Freidlin et al. (2010) proposed a cross-validation extension of the adaptive signature design that optimizes the efficiency of both the classifier development and the validation components of the design. Zhou et al. (2008) and Lee et al. (2010) proposed Bayesian adaptive randomization enrichment designs for targeted agent development. On the other hand, Todd and Stallard (2005) presented a group sequential design, which incorporates interim treatment selection based upon a biomarker, followed by a comparison of the selected treatment with the control in terms of the primary endpoint; a statistical approach that controls the type I error rate for the design was proposed. Stallard (2010) later proposed a method for group sequential trials that uses both the available biomarker and primary endpoint information for treatment selection. The proposed method controls the type I error rate in the strong sense. Shun et al. (2008) studied a biomarker-informed two-stage winner design with normal endpoints. Di Scala and Glimm (2011) studied the case of correlated time-to-event biomarker and primary endpoints, where Bayesian predictive power combining evidence from both endpoints is used for interim selection; they investigated the precise conditions under which type I error control is attained. Friede et al. (2011) considered an adaptive seamless phase II/III design with treatment selection based on early outcome data (a "biomarker-informed drop-the-losers design"). By bringing together combination tests for adaptive designs and the closure principle for multiple testing, control of the familywise type I error rate in the strong sense was achieved.

Focused clinical trials using a biomarker strategy have been shown to have the potential to result in shorter trial durations; allow smaller study sizes; provide a higher probability of trial success; enhance the benefit-risk relationship; and potentially mitigate ever-escalating development costs. In the planning of a study that uses biomarker-informed adaptive procedures, it is desirable to perform statistical simulations in order to understand the operating characteristics of the design, including the sample size required for a target power. It is therefore necessary to specify a model for the simulation of trial data. Friede et al. (2011) proposed a simulation model based on standardized test statistics that allows the generation of virtual trials for a variety of outcomes; the test statistics of the trial were simulated directly instead of the trial data. To simulate individual patient data for the trial, on the other hand, a model that describes the relationship between the biomarker and the primary endpoint needs to be specified.


If both endpoints follow a normal distribution, Shun et al. (2008) used a bivariate normal distribution for modeling the two endpoints. Wang et al. (2014) showed that a bivariate normal model that considers only the individual-level correlation between the biomarker and the primary endpoint is inappropriate when little is known about how the means of the two endpoints are related. Wang et al. (2014) further proposed a two-level correlation (individual-level correlation and mean-level correlation) model to describe the relationship between the biomarker and the primary endpoint. The two-level correlation model incorporates a new variable that describes the mean-level correlation between the two endpoints. The new variable, together with its distribution, reflects the uncertainty about the mean-level relationship between the two endpoints due to the small sample size of historical data. It was shown that the two-level correlation model is a better choice for modeling the two endpoints.
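For illustration, a simulation in the spirit of such a two-level model might draw the mean-level relationship from a distribution before generating correlated individual-level data; the sketch below is illustrative only and is not the exact model of Wang et al. (2014).

```python
import numpy as np

def simulate_two_level(n, mu_biomarker, slope_mean, slope_sd,
                       sd_b=1.0, sd_y=1.0, rho=0.5, seed=None):
    """Illustrative two-level simulation: the slope linking the biomarker
    mean to the primary-endpoint mean is itself random, reflecting
    uncertainty from small historical data; rho is the individual-level
    correlation between the two endpoints."""
    rng = np.random.default_rng(seed)
    slope = rng.normal(slope_mean, slope_sd)        # mean-level correlation component
    mu_y = slope * mu_biomarker                     # induced primary-endpoint mean
    cov = [[sd_b ** 2, rho * sd_b * sd_y],
           [rho * sd_b * sd_y, sd_y ** 2]]
    data = rng.multivariate_normal([mu_biomarker, mu_y], cov, size=n)
    return data[:, 0], data[:, 1]                   # biomarker, primary endpoint

b, y = simulate_two_level(200, mu_biomarker=0.5, slope_mean=1.0, slope_sd=0.3, seed=7)
print(np.corrcoef(b, y)[0, 1])
```

Repeating such simulations over many draws of the slope propagates the mean-level uncertainty into the operating characteristics of the design, which is the practical point of the two-level formulation.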

16.5 Concluding Remarks

As discussed in the previous sections, traditional medicine can only benefit the average patient with the disease under study, while precision medicine can further benefit a specific group (subgroup) of patients who share certain characteristics (e.g., a specific genotype or molecular target). President Obama's Precision Medicine Initiative is an important step in moving away from traditional medicine and toward precision medicine so that such specific groups of patients can benefit. A typical example is the development of Herceptin for the treatment of female patients with breast cancer. As indicated in Section 16.2.2, Herceptin plus chemotherapy provides statistically significant additional clinical benefit in terms of overall survival over chemotherapy alone for patients with a staining score of 3+ (see also Table 16.1). Under a valid design (e.g., Designs B, C, and D) for assessment of precision medicine, the accuracy and reliability of the analysis results depend upon an accurate and reliable estimate of the positive predictive value (PPV). The EM algorithm in conjunction with a Bayesian approach may be useful to resolve this issue. Alternative methods for estimation of treatment effects as described in Section 16.4.1, especially under the FDA-recommended Design C and Design D, however, still need to be developed. Beyond precision medicine, the ultimate goal of personalized (individualized) medicine is to seek a cure for individual patients whenever possible. President Obama's Precision Medicine Initiative is an important step toward personalized (or individualized) medicine so that individual patients can benefit from treatments for the specific diseases under investigation. For this purpose, traditional Chinese


medicine, which often consists of multiple components and focuses on global dynamic harmony (or balance) among specific organs within individual patients, is expected to be the center of attention in the next century (Chow, 2015). As personalized (individualized) medicine development (e.g., traditional Chinese medicine) moves into the next century, however, regulatory requirements and quantitative/statistical methods for assessment of treatment effects for drug products with multiple components need to be developed. More details regarding traditional Chinese medicine (TCM) development are given in Chapter 12.

17 Big Data Analytics

17.1 Introduction

In healthcare-related biomedical research, big data analytics refers to the analysis of large data sets that contain a variety of data (of similar or different types) from various structured, semi-structured, or unstructured sources such as registries; randomized or non-randomized studies; published or unpublished studies; and healthcare databases. The purpose of big data analytics is to detect hidden signals, patterns, and/or trends of the test treatments under study as they pertain to safety and efficacy. In addition, big data analytics may uncover unknown associations and/or correlations between potential risk factors and clinical outcomes, and other useful biomedical information such as the benefit/risk ratio of certain clinical endpoints/outcomes. The findings of big data analytics could lead to more efficient assessment of the treatments under study and/or identification of new intervention opportunities; better disease management; other clinical benefits; and improved operational efficiency in planning future biomedical studies. As indicated in a Request for Proposals (RFP) at the website of the United States National Institutes of Health (NIH), biomedical research is rapidly becoming data-intensive as investigators generate and use increasingly large, complex, multidimensional, and diverse data sets. However, the ability to release data; to locate, integrate, and analyze data generated by others; and to utilize the data is often limited by the lack of tools, accessibility, and training. Thus, the NIH has developed the Big Data to Knowledge (BD2K) initiative to solicit development of software tools and statistical methods for data analysis in the four topic areas of data compression and reduction, data visualization, data provenance, and data wrangling as part of the overall BD2K initiative.


Big data analytics is promising and provides opportunities for uncovering hidden medical information; determining possible associations or correlations between potential risk factors and clinical outcomes; predictive model building, validation, and generalization; data mining for biomarker development; and critical information for planning future clinical trials (see, e.g., Bollier, 2010; Ohlhorst, 2012; Raghupathi and Raghupathi, 2014). Although big data analytics is promising and has great potential in biomedical research, there are some limitations due to possible selection bias and heterogeneity across data sets from different data sources that may affect the representativeness and validity of big data analytics. In biomedical research, the most commonly considered big data analytics is probably a meta-analysis combining several independent studies. In meta-analysis, the most commonly employed method is probably the application of a random effects model or a mixed effects model (see, e.g., DerSimonian and Laird, 1986; Chow and Liu, 1997). Chow and Kong (2015) pointed out that the findings from big data analytics may be distorted by the selection bias that is commonly encountered in big data analytics, because of the diverse sources of data sets that may be accepted into the big data center. In practice, it is likely that mainly published and/or positive studies will be included in the big data center. Published or positive studies tend to report much larger treatment effect sizes, which over-estimate the true treatment effect in the target patient population. As a result, the findings obtained from big data analytics could be biased and misleading. The bias could be substantial, especially when a large portion of unpublished and/or negative studies are not included in the big data analytics. To overcome these problems, in this chapter, we propose a method for estimation of the true treatment effect of the target patient population that takes possible selection bias into consideration.

The remainder of this chapter is organized as follows. In Section 17.2, in addition to the challenges outlined by Raghupathi and Raghupathi (2014), we focus on some basic considerations for assuring the quality, integrity, and validity of big data analytics in biomedical research. These basic considerations include, but are not limited to: representativeness; quality and integrity of big data; validity of big data analytics; FDA Part 11 compliance for electronic records; and statistical methodology and software development. Section 17.3 briefly outlines types of big data analytics in clinical research. Selection bias of big data analytics is explored in Section 17.4. Section 17.5 discusses statistical methods for estimation of bias, bias adjustment, and treatment effect in big data analytics, along with issues of bias commonly encountered in big data analytics. Section 17.6 presents a simulation study that was conducted to evaluate the performance of the proposed statistical methods for estimation of bias, bias adjustment, and treatment effect under various scenarios. Some concluding remarks are given in Section 17.7.


17.2 Basic Considerations

17.2.1 Representativeness of Big Data

In biomedical research, a big data set often contains a variety of data sets (with different data types) from various data sources, including registries; randomized or non-randomized clinical studies; published or unpublished data; and healthcare databases. As a result, it is a concern whether the big data are truly representative of the target patient population with the diseases under study, because selection bias may have occurred when individual data sets were accepted into the big data. In addition, heterogeneity is expected within and across individual data sets (studies). The issues of selection bias, heterogeneity, and consequently reproducibility and generalizability are briefly discussed below.

17.2.2 Selection Bias

In practice, it is likely that mostly data sets with positive results will enter the big data, in which case selection bias has occurred. Let µ and µB be the true means of the target patient population and the big data, respectively; let µP and µN be the true means of data sets with positive and negative results, respectively; and suppose that r is the true proportion of data with positive results. In this case,

µ = rµP + (1 − r)µN,

where r is often unknown. Thus, selection bias in accepting individual data sets could have a significant impact on the findings of big data analytics. In other words, the assessment of µ through big data analytics (i.e., through the estimate µ̂B) could be biased. If the big data only contain data sets with positive results, then µB = µP. Consequently, the bias could be substantial if µP is far away from µN. As a result, the findings of big data analytics could be biased and hence misleading due to selection bias.

17.2.3 Heterogeneity

In addition to the representativeness and selection bias of data sets, heterogeneity within and between individual data sets from different sources is also a great concern. In practice, although individual data sets may come from clinical studies conducted in the same patient population, data from these studies may be collected under similar but different study protocols, with similar but different doses or dose regimens, at different study sites with local laboratories. These differences cause heterogeneity within and between individual data sets. In other words, these data sets may follow similar distributions with different means and different variances. This heterogeneity decreases the reliability of the assessment of the treatment effect.


17.2.4 Reproducibility and Generalizability

As indicated above, the heterogeneity within and across individual data sets (studies) in the big data center could have an impact on the reliability of the assessment of the treatment effect. In addition, as the big data continue growing, it is a concern whether the findings from big data analytics are reproducible and generalizable from one big data center (database) to another big data center (database) of a similar patient population with the same diseases or conditions under study. For evaluation of reproducibility and generalizability, the concept of a sensitivity index proposed by Shao and Chow (2002) is useful. Let (µ0, σ0) and (µ1, σ1) denote the populations of the original database (big data center) and another database (another big data center), respectively. Since the two databases are for similar patient populations with the same diseases and/or conditions, it is reasonable to assume that µ1 = µ0 + ε and σ1 = Cσ0, where ε and C are shift parameters in location and scale, respectively. After some algebra, it can be verified that

µ1/σ1 = (µ0 + ε)/(Cσ0) = ∆(µ0/σ0),

where ∆ = (1 + ε/µ0)/C is the sensitivity index for generalizability. In other words, if |1 − ∆| ≤ δ, where δ is a pre-specified small number, we then claim that the results from the original big data center are generalizable to another big data center with data obtained from a similar patient population with the same diseases and/or conditions. In practice, since ε and C are random, statistical methodology for the assessment of ∆ needs to be developed.

17.2.5 Data Quality, Integrity, and Validity

In biomedical research, data management is the process that ensures the quality, integrity, and validity of the data collected from trial subjects into a database system. Proper data management delivers a clean and high-quality database for statistical analysis and consequently enables clinical scientists to draw conclusions regarding the effectiveness, safety, and clinical benefit/risk of the test treatment under investigation. An invalid and/or poor-quality database may result in wrong and/or misleading conclusions regarding the drug product under investigation. Thus, the objective of the data management process in clinical trials is not only to capture the information that the intended clinical trials are designed to capture, but also to ensure the quality, integrity, and validity of the collected data. These data sets are then aggregated into big data through a database system. Since the big data center contains electronic data records from a variety of sources, some regulatory requirements must be met to assure the data quality, integrity, and validity of the electronic data in the big data center.


17.2.6 FDA Part 11 Compliance

FDA Part 11 compliance refers to the requirements or criteria described in 21 Code of Federal Regulations (CFR) Part 11 under which the FDA considers electronic records and signatures to be generally equivalent to paper records and handwritten signatures. It applies to any records required by the FDA or submitted to the FDA under agency regulations. To reinforce Part 11 compliance, the FDA has published a compliance policy guide, CPG 7153.17, Enforcement Policy: 21 CFR Part 11 Electronic Records, Electronic Signatures. In addition, the FDA has also published numerous draft guidance documents to assist sponsors with Part 11 compliance. FDA Part 11 compliance has a significant impact on the process of clinical data management and consequently on big data management, which has recently become the focus of Good Data Management Practice (GDMP) in compliance with Good Statistics Practice (GSP) and Good Clinical Practice (GCP) for data quality, integrity, and validity. For example, 21 CFR Part 11 requires that procedures regarding creation, modification, maintenance, and transmission of records be in place to ensure the authenticity and integrity of the records. In addition, the adopted systems must ensure that electronic records are accurately and reliably retained. 21 CFR Part 11 has specific requirements for audit trail systems to discern invalid or altered records. Electronic signatures must be linked to their respective electronic records to ensure that signatures cannot be transferred to falsify an electronic record. The FDA requires that systems have the ability to generate documentation suitable for FDA inspection to verify that the requirements set forth by 21 CFR Part 11 are met. In practice, data management of big data is the top priority in the plan for 21 CFR Part 11 compliance for assurance of data quality, integrity, and validity. A typical plan for Part 11 compliance for the data management process usually includes (1) gap assessment; (2) user requirements specification; (3) a validation master plan; and (4) a tactical implementation plan. The task is implemented through a team consisting of senior experienced personnel from multiple disciplinary areas such as information technology (IT), programming, and data management.

17.2.7 Missing Data

Missing values or incomplete data are commonly encountered in biomedical research and hence have become a major issue for big data analytics. One of the primary causes of missing data is dropout. Reasons for dropout include, but are not limited to: refusal to continue in the study (e.g., withdrawal of informed consent); perceived lack of efficacy; relocation; adverse events; unpleasant study procedures; worsening of disease; unrelated disease; non-compliance with the study; need to use prohibited medication; and death. How to handle incomplete data is always a challenge to statisticians in practice. Imputation is a very popular methodology to compensate for


the missing data and is widely used in biomedical research. Despite its popularity, however, its theoretical properties are far from well understood. Addressing missing data in clinical trials also involves missing data prevention and missing data analysis. Missing data prevention is usually achieved through the enforcement of GCP during protocol development and clinical operations, and through personnel training for data collection. This leads to reduced biases, increased efficiency, less reliance on modeling assumptions, and less need for sensitivity analysis. In practice, however, missing data cannot be totally avoided. Missing data often occur due to factors beyond the control of the patients, the investigators, and the clinical project team.

17.3 Types of Big Data Analytics

17.3.1 Case-Control Studies

In clinical research, studies utilizing big data commonly include, but are not limited to: retrospective cohort and/or case-control studies; meta-analyses combining several independent studies; and data mining in genomic studies for biomarker development. For illustration purposes, in this chapter, we will focus on case-control studies only. The ideas explored here can be applied similarly to meta-analysis and data mining. The primary objective of case-control studies in clinical research is not only to study possible risk factors and to develop a medical predictive model based on the identified risk factors, but also to examine the generalizability of the established predictive model (e.g., from one patient population, such as adults, to another similar but different patient population, such as pediatrics, or from one medical center to another). For this purpose, the multivariate (logistic) regression process is probably the most commonly used method. The use of logistic regression analysis for identifying potential risk factors (or predictors) in order to build a predictive model has become very popular in clinical research (Hosmer and Lemeshow, 2000). Typically, the logistic process for model building in case-control studies starts with propensity score matching between the case and control groups, followed by the steps of (i) descriptive analysis to gain a better understanding of the data; (ii) univariate analysis to test associations between the variables and the outcomes; (iii) collinearity analysis to test associations/correlations between explanatory variables; (iv) multivariate analysis to test the association of a variable after adjusting for other variables or confounders; and (v) model diagnostics/validation to assess whether the final model fulfils the assumptions it was based on. The first three steps can be achieved by the use of univariate logistic regression and the remaining two by the use of multiple logistic regression analysis, both of which are described below.


17.3.1.1 Propensity Score Matching

In clinical research, one of the major concerns in a case-control study is selection bias, which is often caused by significant differences or imbalance between the case and control groups, especially in large observational studies (Rosenbaum and Rubin, 1983, 1984; Austin, 2011). In this case, the target patient population in the control group may not be comparable to that of the case group. This selection bias could alter the conclusion about the treatment effect due to possible confounding effects. Consequently, the conclusion may be biased and hence misleading. To overcome this problem, Rosenbaum and Rubin (1983) proposed the concept of the propensity score as a method to reduce selection bias in observational studies. The propensity score is the conditional probability (or score) of a subject being in a particular group given chosen characteristics. That is, consider the case group as those who received a certain treatment (T = 1) and the control group as those who did not receive this treatment (T = 0). Let X be a vector of baseline demographics and/or patient characteristics that are important (e.g., possible confounding factors) for matching the case and control populations to reduce the selection bias. The propensity score is then given by

p(X) = Pr[T = 1 | X] = E(T | X), where 0 < p(X) < 1.

Propensity score matching is a powerful tool for reducing the bias due to possible confounding effects. Sometimes, propensity score matching is also regarded as a post-study randomization, by analogy to a randomized clinical trial, for reducing bias. For propensity score matching, it should be noted that the data available for matching will decrease as the number of matching factors (potential confounding factors) increases.

17.3.1.2 Model Building

Let Y be the outcome variable, which could be a discrete response variable, e.g., a binary response variable where Y = 1 for a success and Y = 0 for a failure, or a categorical variable where Y = 1 if there is an improvement, Y = 0 if there is no change, and Y = −1 if the symptom has worsened. Also, let X be any type of covariate (e.g., continuous or dichotomous). For the study of six-month survival of cirrhotic patients, X could be TCR functional status, TCR marked muscle wasting, and serum creatinine, or demographic characteristics that could be potential risk factors (predictors) such as gender, age, or weight. Consider univariate logistic regression; the general model with one covariate is given by

logit(π) = log[π/(1 − π)] = α + βx,


where π is the probability of success at covariate level x. The logistic regression model can be rewritten as:

π/(1 − π) = e^(α+βx) = e^α (e^β)^x,

where e^β represents the change in the odds of the outcome from increasing x by one unit. In other words, every one-unit increase in x increases the odds by a factor of e^β. More specifically, if β = 0 (i.e., e^β = 1), the probability of success is the same at each level of x. When β > 0 (i.e., e^β > 1), the probability of success increases as x increases. Similarly, when β < 0 (i.e., e^β < 1), the probability of success decreases as x increases.

To examine the generalizability of an established predictive model from one patient population (µ0, σ0) to a similar but different patient population (µ1, σ1), assume that µ1 = µ0 + ε and σ1 = Cσ0 (C > 0), where ε is referred to as the shift in location parameter (population mean) and C is the inflation factor of the scale parameter (population standard deviation). Thus, the (treatment) effect size adjusted for the standard deviation of population (µ1, σ1) can be expressed as follows:

E1 = µ1/σ1 = (µ0 + ε)/(Cσ0) = ∆(µ0/σ0) = ∆E0,

where ∆ = (1 + ε/µ0)/C, and E0 and E1 are the effect sizes (of clinically meaningful importance) of the original target patient population and the similar but different patient population, respectively. Chow et al. (2002) and Chow and Chang (2006) refer to ∆ as a sensitivity index measuring the change in effect size between patient populations. As can be seen from


the above, if ε = 0 and C = 1 (i.e., there is no shift in the target patient population and the two patient populations are identical), then E1 = E0. That is, the effect sizes of the two populations are identical. In this case, we claim that the results observed from the original target patient population (e.g., adults) can be generalized to the similar but different patient population. Thus, if we can show that the sensitivity index ∆ is within an acceptable range, say 80%–120%, we may claim that the results observed at the original medical center can be generalized to another medical center with a similar but different patient population. As indicated in Chow et al. (2005), the effect sizes of the two populations could be linked by baseline demographics or patient characteristics if there is a relationship between the effect sizes and the baseline demographics and/or patient characteristics (a covariate vector). However, such covariates may not exist or may not be observable in practice. In this case, Chow et al. (2005) suggested assessing the sensitivity index by simply replacing ε and C with their corresponding estimates. Intuitively, ε and C can be estimated by

ε̂ = µ̂1 − µ̂0 and Ĉ = σ̂1/σ̂0,

where (µ̂0, σ̂0) and (µ̂1, σ̂1) are estimates of (µ0, σ0) and (µ1, σ1), respectively. Thus, the sensitivity index can be estimated by

∆̂ = (1 + ε̂/µ̂0)/Ĉ.

Note that in practice, the shift in location parameter (i.e., ε) and/or the change in scale parameter (i.e., C) could be random. If both ε and C are fixed, the sensitivity index can be assessed based on the sample means and sample variances obtained from the two populations. In real-world problems, however, ε and C could each be either fixed or random. In other words, there are three possible scenarios: (1) ε is random and C is fixed; (2) ε is fixed and C is random; and (3) both ε and C are random. These possible scenarios have been studied by Lu et al. (2017).
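As an illustration of the plug-in estimator above, the following minimal Python sketch computes ∆̂ from two samples, treating ε and C as fixed; the simulated data and the acceptance range are hypothetical.

```python
import numpy as np

def sensitivity_index(x0, x1):
    """Estimate the sensitivity index Delta = (1 + eps/mu0) / C,
    with eps = mu1 - mu0 and C = sigma1/sigma0, by plugging in
    sample means and standard deviations."""
    mu0, sd0 = np.mean(x0), np.std(x0, ddof=1)
    mu1, sd1 = np.mean(x1), np.std(x1, ddof=1)
    eps_hat = mu1 - mu0
    c_hat = sd1 / sd0
    return (1.0 + eps_hat / mu0) / c_hat

rng = np.random.default_rng(1)
x0 = rng.normal(1.0, 0.5, size=200)   # original population/database
x1 = rng.normal(1.1, 0.6, size=200)   # similar but different population
delta_hat = sensitivity_index(x0, x1)
# Claim generalizability when Delta falls within, say, (0.8, 1.2)
print(delta_hat, 0.8 <= delta_hat <= 1.2)
```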

17.3.2 Meta-analysis

L'Abbe et al. (1987) defined a meta-analysis as a systematic reviewing strategy for addressing research questions that is especially useful when (1) results from individual studies disagree with regard to the direction of the effect, (2) sample sizes are individually too small to detect an effect, or (3) a large trial is too costly and time-consuming to perform. The primary objectives of a meta-analysis are not only to reduce bias (e.g., to obtain a much narrower confidence interval for the true treatment effect) but also to increase statistical power (e.g., to increase


the probability of correctly detecting the true treatment effect if it does exist). In addition, it provides a more accurate and reliable statistical inference on the true treatment effect of a given therapy.

17.3.2.1 Issues in Meta-analysis

There are, however, several critical issues such as representativeness, selection bias, heterogeneity (similarities and dissimilarities), and poolability that may affect the validity of a meta-analysis. These issues are inevitably encountered because a meta-analysis often combines a number of studies with similar but different protocols, target patient populations, doses or dose regimens, sample sizes, study endpoints, laboratories (local or central), equipment/analysts/times, and so on. In practice, it is often a concern that only positive clinical studies are included in the meta-analysis, which may introduce selection bias and preclude a fair and unbiased comparison. To avoid possible selection bias, the criteria for selection of the studies to be included in the meta-analysis need to be clearly stated in the study protocol. Also, the time period (e.g., the past 5 years or the most recent 10 studies) should be specified. If there is a potential trend over time, a statement needs to be provided giving scientific justification for inclusion of the studies selected for the meta-analysis. Before the studies are combined for a meta-analysis, similarity and/or dissimilarity among the studies needs to be assessed in order to reduce variability for a more accurate and reliable assessment of the test treatment under investigation. This is critical because different studies may be conducted with similar but different: (1) study protocols; (2) drug products and doses; (3) patient populations; (4) sample sizes; and (5) evaluability criteria. Thus, the FDA suggests that before the data sets obtained from different studies are combined for a meta-analysis, a statistical test for poolability be performed to determine whether there is a significant treatment-by-study interaction. If a significant qualitative treatment-by-study interaction is observed, the studies should not be combined for a meta-analysis, while if only a significant quantitative treatment-by-study interaction is detected, the studies could still be combined for a meta-analysis. In practice, several methods are commonly employed for meta-analysis. These methods include, but are not limited to: (1) simple lumping or collapsing (which may be misleading); (2) graphical presentation (which provides a visual impression of inter-study consistency but does not provide any statistical inference or estimation of the treatment effect); (3) averaging p-values (which cannot show effects in different directions and does not reflect the sample size of each study); (4) averaging test statistics (which provides adjustment by sample size but cannot test for interaction and heterogeneity); (5) blocking the data (which allows tests for interaction and heterogeneity and provides estimates of inter-study and intra-study variations); and (6) a random effects model for continuous variables and a log-linear model for categorical data


(see, e.g., DerSimonian and Laird, 1986). More recently, it has been suggested that a mixed effects model (treating study as a random effect) be considered. The mixed effects model can utilize all the data to (1) test for poolability, (2) compare the test treatment with a given control, and (3) handle missing data in conjunction with the GEE method.
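For concreteness, the following Python sketch implements the DerSimonian-Laird random effects estimator on hypothetical study-level summaries; it is a bare-bones illustration, not a full meta-analysis workflow.

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """DerSimonian-Laird random-effects meta-analysis: estimate the
    between-study variance tau^2 and the pooled treatment effect."""
    y = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                               # fixed-effect weights
    theta_fe = np.sum(w * y) / np.sum(w)      # fixed-effect estimate
    q = np.sum(w * (y - theta_fe) ** 2)       # Cochran's Q (heterogeneity)
    k = len(y)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)        # method-of-moments tau^2
    w_re = 1.0 / (v + tau2)                   # random-effects weights
    theta_re = np.sum(w_re * y) / np.sum(w_re)
    se_re = np.sqrt(1.0 / np.sum(w_re))
    return theta_re, se_re, tau2

# Hypothetical study-level effects and within-study variances
theta, se, tau2 = dersimonian_laird([0.30, 0.45, 0.18, 0.52],
                                    [0.010, 0.020, 0.015, 0.030])
print(theta, theta - 1.96 * se, theta + 1.96 * se, tau2)
```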

17.4 Bias of Big Data Analytics

As indicated, big data usually consist of data sets from randomized and/or non-randomized (published or unpublished) studies. Thus, imbalances between the treatment group and the control group are likely to occur. In addition, studies with positive results are the most likely to be published and accepted into the big data. In this case, even though the use of propensity score matching can help reduce some of the selection bias, the bias due to the fact that the majority of data sets accepted into the big data are likely to be from studies with positive results could be substantial and hence cannot be ignored. As a result, there is a bias in big data analytics regardless of whether the analytics take the form of case-control studies, meta-analyses, or data mining in genomics studies. In this section, we attempt to assess the bias due to the selection bias of accepting mostly positive data sets into the big data. Let µ and µB be the true mean of the target patient population with the disease under study and the true mean of the big data, respectively. Let ε = µB − µ, which depends upon the unknown percentage of data sets with positive results in the big data. Now, let µP and µN be the true means of data sets of positive studies and non-positive studies conducted in the target patient population, respectively. Also, let r be the proportion of positive studies conducted in the target patient population, which is usually unknown. For illustration and simplicity purposes, we assume that there is no treatment-by-center interaction for multicenter studies and no treatment-by-study interaction. In this case, we have

µ = rµP + (1 − r)µN,   (17.1)

where µP > δ > µN, in which δ is the effect of clinical importance. In other words, a study with an estimated effect greater than δ is considered a positive study. In the extreme case where the big data only contain data sets from positive studies, we have

µB = µP,

so that, by (17.1), ε = µB − µ = (1 − r)(µP − µN). In other words, in this extreme case, the big data do not contain any studies with non-positive results. In practice, we would expect 1/2 < r ≤ 1.


For given big data, r can be estimated by the observed proportion of positive studies in the big data (i.e., r̂). In practice, r̂ usually over-estimates the true r, because the big data tend to over-represent data sets from published or positive studies. Thus, we have

E(r̂) = r + ∆

for some ∆ ≥ 0. Now, for simplicity, assume that all positive studies are of the same size nP and all non-positive studies are of the same size nN. Let xij be the response of the ith subject in the jth positive study, i = 1,…,nP and j = 1,…,rn, where n is the total number of studies in the big data. Also, let yij be the response of the ith subject in the jth non-positive study, i = 1,…,nN and j = 1,…,(1 − r)n. Thus, the bias of µ̂B is given by

Bias(µ̂B) = E(µ̂B) − µ = E[r̂µ̂P + (1 − r̂)µ̂N] − µ ≈ (r + ∆)µP + (1 − r − ∆)µN − µ = ∆(µP − µN),   (17.2)

where µ̂P = x̄ = (1/(rn·nP)) Σ_{j=1}^{rn} Σ_{i=1}^{nP} xij and µ̂N = ȳ = (1/((1 − r)n·nN)) Σ_{j=1}^{(1−r)n} Σ_{i=1}^{nN} yij. Thus, we have

ε = ∆(µP − µN), where µP > δ > µN. As an example, suppose that positive and non-positive studies each make up 50% of the studies conducted in the target patient population (i.e., r = 0.5), but that 90% of the studies included in the big data are positive (i.e., r̂ = 0.9). In this case, ∆ = 0.9 − 0.5 = 0.4. If we further assume that µP = 0.45 and µN = 0.2, then the bias of the big data analytics µ̂B could be as high as ε = ∆(µP − µN) = (0.4)(0.25) = 0.1, or 10%. To provide a better understanding, Table 17.1 summarizes the potential biases that could occur, due to the selection bias of accepting more positive studies into the big data, when assessing the true treatment effect.

Regarding the power, the variance of µ̂B is given by

Var(µ̂B) = σB² = r²·σP²/(rn·nP) + (1 − r)²·σN²/((1 − r)n·nN) = (1/n)[rσP²/nP + (1 − r)σN²/nN] ≥ 0,

where nP and nN are the sizes of the positive and non-positive studies, respectively, and n is the total number of studies accepted into the big data. In addition, taking the derivative of the above with respect to r leads to


TABLE 17.1
Potential Biases of Big Data Analytics

∆ (%)    µP − µN (%)    Bias (ε) (%)
10       20              2
10       30              3
10       40              4
20       20              4
20       30              6
20       40              8
30       20              6
30       30              9
30       40             12
40       20              8
40       30             12
40       40             16
50       20             10
50       30             15
50       40             20

∂Var(µ̂B)/∂r = (1/n)(σP²/nP − σN²/nN).

Thus, if σP²/nP > σN²/nN, then σB² is an increasing function of r. In this case, it is expected that the power of the big data analytics may decrease as r increases. The above discussion suggests that the power of the big data analytics can be studied through the evaluation of the following probability

P{σP²/nP > σN²/nN | µP, µN, σP², σN², and r}

based on the data available in the big data.
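The selection bias formula ε = ∆(µP − µN) is straightforward to compute; the short sketch below reproduces the worked example above and, by varying its inputs, the entries of Table 17.1.

```python
def selection_bias(r_true, r_big, mu_p, mu_n):
    """Selection bias of the big-data mean per (17.2):
    epsilon = Delta * (mu_P - mu_N), with Delta = r_big - r_true."""
    delta = r_big - r_true
    return delta * (mu_p - mu_n)

# Example from the text: r = 0.5, r_hat = 0.9, mu_P = 0.45, mu_N = 0.2
print(selection_bias(0.5, 0.9, 0.45, 0.20))  # 0.1, i.e., a 10% bias
```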

17.5 Statistical Methods for Estimation of ∆ and µP − µN

17.5.1 Estimation of ∆

As indicated in the previous section, r̂ (the proportion of positive studies in the big data) is always an over-estimate of the true r (the proportion of positive studies conducted in the target patient population), which is often unknown. However, following the concept of


empirical power, or reproducibility probability (Shao and Chow, 2002), we can estimate r by the reproducibility probability of observing a positive result in a future study given the observed mean response and the corresponding sample variance, as follows

p = P{future study is positive | µ ≡ µ̂B and σ ≡ σ̂B}.   (17.3)

The above expression can be interpreted as follows: given the observed mean response (µ̂B) and the corresponding sample standard deviation (σ̂B), we would expect to see p × 100 studies with positive results if we were to conduct the clinical trial under similar experimental conditions 100 times. Thus, intuitively, p is a reasonable estimate of the unknown r. For simplicity and illustration purposes, suppose the investigator is interested in detecting a clinically meaningful difference (or an effect size that is of clinical importance). A typical approach is to test the following hypotheses of equality

H0: µ1 = µ0 versus Ha: µ1 ≠ µ0.

Under the null hypothesis, the test statistic is given by

T = √(n1n0/(n1 + n0)) · (µ̂1 − µ̂0)/σ̂.

We reject the null hypothesis if |T| > t(α/2, n1+n0−2), where t(α/2, n1+n0−2) is the upper (α/2)th quantile of the t distribution with n1 + n0 − 2 degrees of freedom. A typical approach is then to evaluate the power under the alternative hypothesis that µ1 − µ0 = δ. If there is at least 80% power for detecting the clinically meaningful difference δ at the pre-specified α (say α = 5%) level of significance, we claim that the result is positive when the null hypothesis is rejected. In big data analytics, we have

µ̂B = µ̂1 − µ̂0 and σ̂B² = (n1σ̂1² + n0σ̂0²)/(n1 + n0).

Thus, (17.3) becomes

p = P{|T| > t(α/2, n1+n0−2) | µ ≡ µ̂B and σ ≡ σ̂B}.

We then propose that r be estimated by p, i.e., r̂ = p.

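One way to compute the reproducibility probability in (17.3) is through the power of the future test evaluated at the observed effect, using the noncentral t-distribution; the following sketch takes this approach, and all input values are illustrative.

```python
import numpy as np
from scipy import stats

def reproducibility_probability(mu_b, sigma_b, n1, n0, alpha=0.05):
    """Estimate r by the empirical power (reproducibility probability):
    the probability that a future two-sample t-test is significant given
    the observed effect mu_b and pooled standard deviation sigma_b."""
    df = n1 + n0 - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Noncentrality parameter implied by the observed effect size
    ncp = np.sqrt(n1 * n0 / (n1 + n0)) * mu_b / sigma_b
    # Power of the two-sided test under the noncentral t distribution
    p = (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)
    return p

print(reproducibility_probability(mu_b=0.3, sigma_b=1.0, n1=100, n0=100))
```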

17.5.2 Estimation of µP − µN

Let (LP, UP) and (LN, UN) denote the (1 − α) × 100% confidence intervals for µP and µN, respectively. Under the normality assumption and the assumption that σP = σN = σB, we have

(LP, UP) = µ̂P ± z(1−α/2)·σP/√nP and (LN, UN) = µ̂N ± z(1−α/2)·σN/√nN,

where nP = rn, nN = (1 − r)n, and n is the sample size used to estimate µB. Since µP > δ > µN, the confidence intervals of µP and µN, i.e., (LP, UP) and (LN, UN), should not overlap. In the extreme case, UN is close to LP. Thus, we have

µ̂P − z(1−α/2)·σP/√nP ≈ µ̂N + z(1−α/2)·σN/√nN.

This leads to

µ̂P − µ̂N ≈ z(1−α/2)·σP/√nP + z(1−α/2)·σN/√nN = z(1−α/2)·(σP/√nP + σN/√nN).   (17.4)

In some extreme cases, only data from positive studies are available. In that case, the study with the smallest effect size would be used to estimate LN and UN.
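The theoretical distance in (17.4) can be computed directly, as in the following sketch; the inputs shown are hypothetical.

```python
import numpy as np
from scipy import stats

def theoretical_distance(sigma_p, sigma_n, n_p, n_n, alpha=0.05):
    """Theoretical distance between positive- and non-positive-study means
    implied by non-overlapping confidence intervals, per (17.4)."""
    z = stats.norm.ppf(1 - alpha / 2)
    return z * (sigma_p / np.sqrt(n_p) + sigma_n / np.sqrt(n_n))

# Hypothetical inputs: common SD 1.0, 45 positive and 5 non-positive studies
print(theoretical_distance(1.0, 1.0, n_p=45, n_n=5))
```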

17.5.3 Assumptions and Application

The two parameters, ∆ and (µP − µN), correspond to two assumptions: (i) positive studies are more likely to be published, and (ii) positive studies and negative studies have different distributions; these are reasonable characterizations of selection bias in big data analytics. Based on the assumptions of the proposed approach, we suggest a two-step procedure to determine whether it is proper to apply the approach.

Step 1: Calculate the proportion of positive studies in the big data center, r̂, and compare it with the designed power of each study included in the historical data set. If the proportion of positive studies is larger


than the power of most studies (in practice, usually larger than that of all the studies), then conduct Step 2.

Step 2: Calculate the mean difference between positive and non-positive studies, µ̂P − µ̂N, and compare it with the theoretical distance given in (17.4). The adjustment could be conducted when µ̂P − µ̂N is no smaller than this theoretical distance.

Bioequivalence is concluded if

|D̂| / √(σ̂e²/(11n)) > t(α/2, 16n − 5).

The corresponding confidence interval at the α significance level is

D̂ ± t(α/2, 16n − 5) · √(σ̂e²/(11n)).


TABLE 18.4
Coefficients for Estimates of Drug Effect in Complete n-of-1 Design
(132 β's to estimate D = DT − DR)

Sequence     I     II    III    IV
1            3     −1     −1    −1
2           −3     −7     −7    17
3           −5     −9     15    −1
4          −11    −15      9    17
5           −5     15     −1    −9
6          −11      9     −7     9
7          −13      7     15    −9
8          −19      1      9     9
9           19     −1     −9    −9
10          13     −7    −15     9
11          11     −9      7    −9
12           5    −15      1     9
13          11     15     −9   −17
14           5      9    −15     1
15           3      7      7   −17
16          −3      1      1     1

Note: Adjusting for carryover effect.

Similarly, the carryover effect coefficient estimates are derived; the corresponding coefficients are shown in Table 18.5. Based on Table 18.5, the unbiased estimator Ĉ can be constructed, and we have

E(Ĉ) = CT − CR, Var(Ĉ) = 4σe²/(33n).

Under the model without the first-order carryover effect, we can derive a different unbiased estimator D̂ such that

E(D̂) = DT − DR, Var(D̂) = σe²/(12n).

Thus, the corresponding confidence interval at the α significance level can be similarly constructed.

18.4.1.3 Sample Size Requirement

The sample size determination under a fixed power and significance level is derived based on the following hypothesis testing.


TABLE 18.5
Coefficients for Estimates of Carryover Effect in Complete n-of-1 Design
(132 β's to estimate C = CT − CR)

Sequence     I     II    III    IV
1           12     −4     −4    −4
2           10     −6     −6     2
3            2    −14     −6    18
4            0    −16     −8    24
5            2     −6     18   −14
6            0     −8     16    −8
7           −8    −16     16     8
8          −10    −18     14    14
9           10     18    −14   −14
10           8     16    −16    −8
11           0      8    −16     8
12          −2      6    −18    14
13           0     16      8   −24
14          −2     14      6   −18
15         −10      6      6    −2
16         −12      4      4     4

H0: |DT − DR| > θ versus H1: |DT − DR| ≤ θ

According to the ±20% rule, bioequivalence is concluded if the average bioavailability of the test drug is within ±20% of that of the reference drug with a certain assurance. Therefore, θ is usually represented by ∇µR, where ∇ = 20%, and the hypothesis testing can be rewritten as follows:

H0: µT − µR < −∇µR or µT − µR > ∇µR versus Ha: −∇µR ≤ µT − µR ≤ ∇µR.

The power function can be written as

P(θ) = Fv[(∇ − R)/(CV·√(b/n)) − t(α, υ)] − Fv[t(α, υ) − (∇ + R)/(CV·√(b/n))],

where R = (µT − µR)/µR is the relative change; CV = S/µR; µT and µR are the average bioavailabilities of the test and reference formulations, respectively; S is the square root of the mean square error from the analysis of variance table for the crossover design; [−∇µR, ∇µR] is the bioequivalence limit interval; t(α, υ) is the upper αth quantile of a t-distribution with υ degrees of freedom; Fv is the cumulative distribution function of the t-distribution; and b is the constant in the variance of the drug effect.


Accordingly, the exact sample size formula when R = 0 is given by

n ≥ b[t(α, υ) + t(β/2, υ)]² [CV/∇]²;

the approximate sample size formula when R > 0 can be obtained as

n ≥ b[t(α, υ) + t(β, υ)]² [CV/(∇ − R)]².

Another way to determine the sample size is based on testing the ratio of the drug effect between the biosimilar product and the reference product. Let δ = µT/µR, with (0.8, 1.25) the bioequivalence range of µT/µR; the hypotheses become

H0: µT/µR < 0.8 or µT/µR > 1.25 versus Ha: 0.8 ≤ µT/µR ≤ 1.25.

In the case of a skewed distribution, the hypotheses can be transformed to the logarithmic scale:

H0: log µT − log µR < log(0.8) or log µT − log µR > log(1.25) versus Ha: log(0.8) ≤ log µT − log µR ≤ log(1.25).

Then, the sample size formulas for different δ are given below (see details in the Appendix):

n ≥ b[t(α, υ) + t(β/2, υ)]² [CV/ln 1.25]²   if δ = 1;
n ≥ b[t(α, υ) + t(β, υ)]² [CV/(ln 1.25 − ln δ)]²   if 1 < δ < 1.25;
n ≥ b[t(α, υ) + t(β, υ)]² [CV/(ln 0.8 − ln δ)]²   if 0.8 < δ < 1.
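The following sketch evaluates these log-scale formulas. Because the degrees of freedom υ depend on the final design, a fixed value is assumed here purely for illustration, as are the design constant b and the other inputs.

```python
import math
from scipy import stats

def n_of_1_sample_size(cv, delta, alpha=0.05, beta=0.20, b=1.0, df=30):
    """Sample size per the log-scale bioequivalence formulas above.
    b is the design constant in Var(drug effect); df approximates the
    degrees of freedom (both are design-specific assumptions here)."""
    t_a = stats.t.ppf(1 - alpha, df)
    if delta == 1.0:
        t_b = stats.t.ppf(1 - beta / 2, df)
        denom = math.log(1.25)
    elif 1.0 < delta < 1.25:
        t_b = stats.t.ppf(1 - beta, df)
        denom = math.log(1.25) - math.log(delta)
    elif 0.8 < delta < 1.0:
        t_b = stats.t.ppf(1 - beta, df)
        denom = math.log(0.8) - math.log(delta)  # negative, squared below
    else:
        raise ValueError("delta must lie in (0.8, 1.25)")
    return math.ceil(b * (t_a + t_b) ** 2 * (cv / denom) ** 2)

print(n_of_1_sample_size(cv=0.30, delta=1.05))
```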

18.4.2 Analysis under an Adaptive Trial Design

Statistical analysis for Category 1 (SS) two-stage seamless designs is similar to that of a group sequential design with one interim analysis. Thus, standard statistical methods for a group sequential design can be applied. For other kinds of two-stage seamless trial designs, standard statistical methods for a group sequential design are not appropriate and hence should not be applied directly. In this section, statistical methods for other types of


two-stage adaptive seamless designs are described. Without loss of generality, in this section, we will discuss the n-stage adaptive design based on individual p-values from each stage (Chow and Chang, 2011). Consider a clinical trial with K interim analyses. The final analysis is treated as the Kth interim analysis. Suppose that at each interim analysis, a hypothesis test is performed, followed by some actions that depend on the analysis results. Such actions could be an early stopping due to futility/efficacy or safety, sample size re-estimation, modification of randomization, or other adaptations. In this setting, the objective of the trial can be formulated using a global hypothesis test, which is the intersection of the individual hypothesis tests from the interim analyses:

H0: H01 ∩ ⋯ ∩ H0K,

where H0i, i = 1,…,K, is the null hypothesis to be tested at the ith interim analysis. Note that there are some restrictions on the H0i; that is, rejection of any H0i, i = 1,…,K, must lead to the same clinical implication (e.g., the drug is efficacious); hence all H0i, i = 1,…,K, are constructed for testing the same endpoint within a trial. Otherwise the global hypothesis cannot be interpreted. In practice, H0i is tested based on a sub-sample from each stage and, without loss of generality, assume H0i is a test for the efficacy of the test treatment under investigation, which can be written as

H0i: ηi1 ≥ ηi2 versus Hai: ηi1 < ηi2,

where ηi1 and ηi2 are the responses of the two treatment groups at the ith stage. When ηi1 = ηi2, the p-value pi for the sub-sample at the ith stage is uniformly distributed on [0, 1] under H0 (Bauer and Kohne, 1994). This desirable property can be used to construct a test statistic for multiple-stage seamless adaptive designs. As an example, Bauer and Kohne (1994) used Fisher's combination of the p-values. Similarly, Chang (2007) considered a linear combination of the p-values as follows:

Tk = Σ_{i=1}^{k} wki·pi, k = 1,…,K,   (18.1)

where wki > 0 and K is the number of analyses planned in the trial. For simplicity, consider the case where wki = 1. This leads to

Tk = Σ_{i=1}^{k} pi, k = 1,…,K.   (18.2)

The  test statistic Tk can be viewed as cumulative evidence against H 0 . The smaller the Tk is, the stronger the evidence is. Equivalently, we can define


the test statistic as Tk = (Σ_{i=1}^{k} pi)/k, which can be viewed as an average of the evidence against H0. The stopping rules are given by

Stop for efficacy if Tk ≤ αk;
Stop for futility if Tk ≥ βk;
Continue otherwise,   (18.3)

where Tk, αk, and βk are monotonically increasing functions of k, αk < βk, k = 1,…,K − 1, and αK = βK. Note that αk and βk are referred to as the efficacy and futility boundaries, respectively. To reach the kth stage, a trial has to pass the 1st through (k − 1)th stages. Therefore, a so-called proceeding probability can be defined as the following unconditional probability:

ψk(t) = P(Tk < t, α1 < T1 < β1, …, αk−1 < Tk−1 < βk−1)
      = ∫_{α1}^{β1} ⋯ ∫_{αk−1}^{βk−1} ∫_{−∞}^{t} f_{T1⋯Tk}(t1, …, tk) dtk dtk−1 ⋯ dt1,   (18.4)

π k = ψ k (α k ).

(18.5)

When efficacy is claimed at a certain stage, the trial is stopped. Therefore, the type I error rates at different stages are mutually exclusive. Hence, the experiment-wise type I error rate can be written as follows: K

α=

∑π . k

(18.6)

k =1

Note that (18.4–18.6) are the keys to determine the stopping boundaries, which will be illustrated in the next sub-section, with two-stage seamless adaptive designs. The adjusted p-value calculation is the same as the one in a classic group sequential design. The key idea is that when the test statistic at the kth stage Tk = t = α k (i.e., just on the efficacy stopping boundary), the p-value is equal to alpha spent ∑ ki =1π i . This is true regardless of which error spending function is used and consistent with the p-value definition of the classic design. The adjusted p-value corresponding to an observed test statistic Tk = t at the kth stage can be defined as k −1

p(t ; k ) =

∑π +ψ (t), k = 1,…, K. i

i =1

k

(18.7)

Rare Diseases Drug Development

473

This adjusted p-value indicates weak evidence against H 0 , if the H 0 is rejected at a late stage because one has spent some alpha at previous stages. On the other hand, if the H 0 was rejected at an early stage, it indicates strong evidence against H 0 because there is a large portion of overall alpha that has not been spent yet. Note that pi in (18.1) is the stage-wise naive (unadjusted) p-value from a sub-sample at the ith stage, while p(t ; k ) are adjusted p-values calculated from the test statistic, which are based on the cumulative sample up to the kth stage where the trial stops; Equations (18.6) and (18.7) are valid regardless how the pi values are calculated. 18.4.2.1 Two-Stage Adaptive Design In  this sub-section, we will apply the general framework to the two-stage designs. Chang (2007) derived the stopping boundaries and p-value formula for three different types of adaptive designs that allow (i) early efficacy stopping, (ii) early stopping for both efficacy and futility, and (iii) early futility stopping. The  formulation can be applied to both superiority and non-inferiority trials with or without sample size adjustment. 18.4.2.1.1 Early Efficacy Stopping For a two-stage design ( K = 2 ) allowing for early efficacy stopping ( β1 = 1), the type I error rates to spend at Stages 1 and 2 are

π 1 = ψ 1(α 1 ) =



α1

dt1 = α 1 ,

(18.8)

0

and

π 2 = ψ 2 (α 2 ) =

α2

∫ ∫ α1

α1

t

dt2 dt1 =

1 (α 2 − α 1 )2 , 2

(18.9)

respectively. Using (18.8) and (18.9), (18.6) becomes 1 α = α 1 + (α 2 − α 1 )2 . 2

(18.10)

α 2 = 2(α − α 1 ) + α 1.

(18.11)

Solving for α 2, we obtain

Note that when the test statistic t1 = p1 > α 2 , it is certain that t2 = p1 + p2 > α 2 . Therefore, the trial should stop when p1 > α 2 for futility. The  clarity of the method in this respect is unique, and the futility stopping boundary is often hidden in other methods. Furthermore, α 1 is the stopping probability (error spent) at the first stage under the null hypothesis condition, and α − α 1 is the

Innovative Statistics in Regulatory Science

474

TABLE 18.6 Stopping Boundaries for Two-Stage Efficacy Designs One-sided α

α1

0.005

0.010

0.015

0.020

0.025

0.030

0.025

α2

0.2050

0.1832

0.1564

0.1200

0.0250



0.05

α2

0.3050

0.2928

0.2796

0.2649

0.2486

0.2300

Source: Chang, M., Stat. Med., 26, 2772–2784.

error spent at the second stage. Table  18.6 provides some examples of the stopping boundaries from (18.11). The adjusted p-value is given by if k = 1 t  p(t ; k ) =  , 1 2 α 1 + 2 (t − α 1 ) if k = 2

(18.12)

where t = p1 if the trial stops at Stage 1, and t = p1 + p2 if the trial stops at Stage 2. 18.4.2.1.2 Early Efficacy or Futility Stopping It  is obvious that if β1 ≥ α 2 , the stopping boundary is the same as it is for the design with early efficacy stopping. However, futility boundary β1 when β1 ≥ α 2 is expected to affect the power of the hypothesis testing. Therefore,

π1 =



α1

dt1 = α 1 ,

(18.13)

0

and   π2 =   

β1

∫ ∫ ∫ ∫ α1

α2

α1

α2

dt2 dt1 for β1 ≤ α 2

t1

α2

(18.14) dt2 dt1 for β1 > α 2

t1

Carrying out the integrations in (18.13) and substituting the results into (18.6), we have 1 2  2  α 1 + α 2 ( β1 − α 1 ) − 2 ( β1 − α 1 ) for β1 < α 2 α = α 1 + 1 (α 2 − α 1 )2 for β1 ≥ α 2  2

(18.15)

Various stopping boundaries can be chosen from (18.15). See Table 18.7 for examples of the stopping boundaries.

Rare Diseases Drug Development

475

TABLE 18.7 Stopping Boundaries for Two-Stage Efficacy and Futility Designs β 1 = 0.15

One-sided α 0.025

α1

0.005

0.010

0.015

0.020

0.025

α2

0.2154

0.1871

0.1566

0.1200

0.0250

α1

0.005

0.010

0.015

0.020

0.025

α2

0.3333

0.3155

0.2967

0.2767

0.2554

β 1 = 0.2 0.05

Source: Chang, M., Stat. Med., 26, 2772–2784.

The adjusted p-value is given by  t if k = 1   1 2  p(t ; k ) = α 1 + t( β1 − α 1 ) − ( β1 − α 12 ) if k = 2 and β1 < α 2 2  1  if k = 2 β1 ≥ α 2 α 1 + (t − α 1 )2  2

(18.16)

where t = p1 if the trial stops at Stage 1 and t = p1 + p2 if the trial stops at Stage 2. 18.4.2.1.3 Early Futility Stopping A  trial featuring early futility stopping is a special case of the previous design, where α 1 = 0 in (18.15). Hence, we have 1 2  α 2 β1 − 2 β1 for β1 < α 2 α =  1 α 22 for β1 ≥ α 2  2

(18.17)

Solving for α 2, it can be obtained that α 1  + β1 for β1 < 2α α 2 =  β1 2  2α for β1 ≥ α 2 

(18.18)

Examples of the stopping boundaries generated using (18.18) are presented in Table 18.8.

Innovative Statistics in Regulatory Science

476

TABLE 18.8 Stopping Boundaries for Two-Stage Futility Design One-sided α

β1

0.1

0.2

0.3

≥ 0.4

0.025

α2

0.3000

0.2250

0.2236

0.2236

0.05

α2

0.5500

0.3500

0.3167

0.3162

Source: Chang, M., Stat. Med., 26, 2772–2784.

The adjusted p-value can be obtained from (18.16), where α 1 = 0, that is,  t  1  p(t ; k ) = α 1 + tβ1 − β12 2  1 2   α 1 + 2 t

if k = 1 if k = 2 and β1 < α 2 if k = 2 β1 ≥ α 2

18.4.2.2 Remarks In  practice, one of the questions that are commonly asked when applying a two-stage adaptive seamless design in clinical trials is about sample size calculation/allocation. For the first kind of two-stage seamless designs, the methods based on individual p-values as described in Chow and Chang (2006) can be applied. However, these methods are not  appropriate for Category IV (DD) trial designs with different study objectives and endpoints at different stages. For Category IV (DD) trial designs, the following issues are challenging to the investigator and the biostatistician. First, how do we control the overall type I error rate at a pre-specified level of significance? Second, is the typical O’Brien-Fleming type of boundaries feasible? Third, how to perform a valid final analysis that combines data collected from different stages?

18.5 Evaluation of Rare Disease Clinical Trials Due the small sample size in rare disease clinical trials, the conclusion drawn may not achieve a desired level of statistical inference (e.g., power or confidence interval). In this case, it is suggested that the following methods can be considered for evaluation of the rare disease clinical trial to determine whether substantial evidence of safety and efficacy has been achieved. Let n1 , n2 , and N be the sample size of the intended trial at interim, sample size of the data borrowed from previous studies, and sample size required for achieving a desired power (say 80%), respectively.

Rare Diseases Drug Development

477

18.5.1 Predictive Confidence Interval (PCI) Let T i and Ri be the sample mean of the ith sample for the test product and the reference product, respectively. Also, let σ 1, σ 2 , and σ ∗ be the pooled sample standard deviation of difference in sample mean between the test product and the reference product based on the first sample ( n1 ), the second sample ( n1 + n2 ), and the third sample ( N ). Under a parallel design, the usual confidence interval of the treatment effect can be obtained based on the ith sample and jth sample as follows: CI i = T i − Ri ± z1−α σ i , where i = 1, 2, and N. In  practice, for rare disease clinical trials, we can compare these confidence intervals in terms of their relative efficiency for a complete clinical picture. Relative efficiency of CI i as compared to CI j is defined as Rij = σ i / σ j , where i and j represent the ith sample and jth sample, respectively. 18.5.2 Probability of Reproducibility Although there will not always be sufficient power due to the small sample size available in rare diseases clinical trials, alternatively, we may consider empirical power based on the observed treatment effect and the variability associated with the observed difference adjusted for the sample size required for achieving the desired power. The empirical power is also known as reproducibility probability of the clinical results for future studies if the studies shall be conducted under similar experimental conditions. Shao and Chow (2002) studied how to evaluate the reproducibility probability using this approach under several study designs for comparing means with both equal and unequal variances. When the reproducibility probability is used to provide substantial evidence of the effectiveness of a drug product, the estimated power approach may produce an optimistic result. Alternatively, Shao and Chow (2002) suggested that the reproducibility probability be defined as a lower confidence bound of the power of the second trial. The reproducibility probability can be used to determine whether the observed clinical results from previous studies is reproducible in future studies for evaluation of safety and efficacy of the test treatment under investigation. In addition, Shao and Chow also suggested a more sensible definition of reproducibility probability using the Bayesian approach. Under the Bayesian approach, the unknown parameter θ is a random vector with a prior

Innovative Statistics in Regulatory Science

478

distribution, say, π(θ), which is assumed known. Thus, the reproducibility probability can be defined as the conditional probability of |T| > C in the future trial, given the data set x observed from the previous trial(s), that is, P { T > C | x } = P ( T > C | θ ) π (θ |x)dθ



where T = T ( y ) is based on the data set y from the future trial, and π (θ |x) is the posterior density of θ , given x. Moreover, similar idea can be applied to assess generalizability from one patient population (adults) to another (e.g., pediatrics or elderly). The generalizability can be derived as follows.

18.6 Some Proposals for Regulatory Consideration For rare disease drug development, the Orphan Drug Act provides incentives associated with orphan drug designation to make the development of rare disease drug products financially viable with small number of patients (FDA, 2019). However, the FDA does not intend to create a statutory standard for approval of orphan drugs that is different from the standard for approval of more typical drugs. Thus, the level of substantial evidence and sample size required for approval of rare diseases drug products are probably the most challenging issues in rare disease drug product development. Given these facts, we suggest the following proposals for regulatory considerations. 18.6.1 Demonstrating Effectiveness or Demonstrating Not Ineffectiveness For approval of a new drug product, the sponsor is required to provide substantial evidence regarding safety and efficacy of the drug product under investigation. In practice, a typical approach is to conduct adequate and wellcontrolled clinical studies and test the following point hypotheses: H 0 : Ineffectiveness versus H a : Effectiveness.

(18.19)

The rejection of the null hypothesis of ineffectiveness is in favor of the alternative hypothesis of effectiveness. Most researchers interpret the rejection of the null hypothesis as a demonstration of the alternative hypothesis of effectiveness. It should be noted, however, that being “in favor of effectiveness” does not imply “the demonstration of effectiveness.” In practice, hypotheses (18.19) should instead be

H0: Ineffectiveness versus Ha: Not ineffectiveness. (18.20)

In other words, the rejection of H0 leads to the conclusion of “not H0,” which is Ha given in (18.20). As can be seen from Ha in (18.19) and (18.20), the concept of effectiveness in (18.19) and the concept of not ineffectiveness in (18.20) are not the same. Not ineffectiveness does not, in general, imply effectiveness. Thus, the traditional approach for clinical evaluation of the drug product under investigation can only demonstrate “not ineffectiveness,” not “effectiveness.” The relationship between demonstrating “effectiveness” (18.19) and demonstrating “not ineffectiveness” (18.20) is illustrated in Figure 18.1. As can be seen from Figure 18.1, “not ineffectiveness” consists of two parts, namely the portion of “inconclusiveness” and the portion of “effectiveness.” For a placebo-controlled clinical trial comparing a test treatment (T) and a placebo control (P), let θ = µT − µP be the treatment effect of the test treatment as compared to the placebo, where µT and µP are the mean responses of the test treatment and the placebo, respectively. For a given sample, e.g., test results from a previous or pilot study, let (θL, θU) be a (1 − α) × 100% confidence interval of θ. In this case, hypotheses (18.19) become

H0: θ ≤ θL versus Ha: θ > θU, (18.21)

while hypotheses (18.20) are given by

H0: θ ≤ θL versus Ha: θ > θL. (18.22)
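To make the distinction concrete, the decision regions implied by (18.21) and (18.22) can be expressed with a one-sided lower confidence bound for θ. This is a sketch only; the helper name and the normal-approximation bound are assumptions, not prescriptions from the text.

```python
from scipy.stats import norm

def classify_effect(theta_hat, se, theta_L, theta_U, alpha=0.05):
    """Classify an observed treatment effect against hypotheses (18.21)
    and (18.22) using a one-sided lower (1 - alpha) confidence bound."""
    lower = theta_hat - norm.ppf(1 - alpha) * se
    if lower <= theta_L:
        return "ineffectiveness not rejected"
    if lower <= theta_U:
        return "not ineffective, but inconclusive"  # theta_L < theta <= theta_U
    return "effective"

# Example: estimated effect 0.25 (standard error 0.10), margins (0.0, 0.2)
print(classify_effect(0.25, 0.10, theta_L=0.0, theta_U=0.2))
```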

Hypotheses (18.21) are similar to those considered in Simon’s two-stage optimal design for cancer research. At the first stage, Simon suggested testing whether the response rate exceeds a pre-specified undesirable response rate; if it does, one proceeds to test whether the response rate achieves a pre-specified desirable response rate. Note that Simon’s hypothesis testing is actually interval hypothesis testing. On the other hand, hypotheses (18.22) constitute a typical one-sided test for non-inferiority of the test treatment as compared to the placebo. Thus, the rejection of inferiority leads to the conclusion of non-inferiority, which consists of equivalence (the area of inconclusiveness, i.e., θL < θ < θU) and superiority (i.e., effectiveness).

FIGURE 18.1 Demonstrating “effectiveness” or “not ineffectiveness.”

For a given sample size, the traditional approach for clinical evaluation of the drug product under investigation can only demonstrate that the drug product is not ineffective when the null hypothesis is rejected. To demonstrate that the drug product is truly effective, we need to perform another test to rule out the possibility of inconclusiveness (i.e., to reduce the probability of inconclusiveness). In practice, however, we typically test point hypotheses of equality at the α = 5% level of significance. The rejection of the null hypothesis leads to the conclusion that there is a treatment effect. An adequate sample size is then selected to provide a desired power (say 80%) to determine whether the observed treatment effect is clinically meaningful, and hence to claim that effectiveness has been demonstrated. For testing point hypotheses of no treatment effect, many researchers prefer testing the null hypothesis at the α = 1% rather than the α = 5% level in order to account for the possibility of inconclusiveness. In other words, if the observed p-value falls between 1% and 5%, we claim that the test result is inconclusive. It should be noted that the concept of point hypotheses testing for no treatment effect is very different from the interval hypotheses testing in (18.19) and the one-sided hypotheses testing for non-inferiority in (18.20). In practice, however, point hypotheses testing, interval hypotheses testing, and one-sided hypotheses testing for non-inferiority have often been mixed up in pharmaceutical research and development.

18.6.2 Two-Stage Adaptive Trial Design for Rare Disease Product Development

As discussed in the previous sections, the unavailability of patients for clinical trials due to the small population size of a rare disease, together with the need to meet the same standard for regulatory review and approval, are probably the greatest obstacles and challenges in rare disease product development. In this section, to address these dilemmas, we propose a two-stage adaptive trial design that demonstrates “not ineffectiveness” at the first stage and then demonstrates “effectiveness” at the second stage of rare disease drug product development. The proposed two-stage adaptive trial design is briefly outlined below (see also Chow and Huang, 2019); a simulation sketch is given after the outline.

Stage 1. Construct a (1 − α) × 100% confidence interval for θ based on previous/pilot studies or a literature review. Then, based on the n1 subjects available at Stage 1, test hypotheses (18.20) for non-inferiority (i.e., test for not ineffectiveness at the α1 level, a pre-specified level of significance). If the null hypothesis of ineffectiveness is not rejected, stop the trial due to futility; otherwise, proceed to the next stage.


Stage 2. Recruit an additional n2 subjects at the second stage. At this stage, sample size re-estimation may be performed to achieve the desired statistical assurance (say 80%) for establishing the effectiveness of the test treatment under investigation. A statistical test is then performed to assure that the probability of the area of inconclusiveness is within an acceptable range at the α2 level, a pre-specified level of significance.

Under the proposed two-stage adaptive trial design, it can be shown that the overall type I error rate is a function of α1 and α2 (see also Chow and Huang, 2019). Thus, with an appropriate choice of α1, we may reduce the sample size required for demonstrating “not ineffectiveness.” It is suggested, however, that the selection of α1 and α2 be pre-specified in the study protocol; post-study adjustment is not encouraged. For review and approval of rare disease drug products, we thus propose to first demonstrate not ineffectiveness with the limited information available at a pre-specified level of significance, and then to collect additional information to rule out inconclusiveness for demonstration of effectiveness at a pre-specified level of significance under the proposed two-stage adaptive trial design.
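The following Monte Carlo sketch illustrates how the overall probability of erroneously claiming effectiveness under such a two-stage procedure can be examined as a function of α1 and α2. It assumes normally distributed endpoints and a simplified, independent cumulative estimate at Stage 2; the actual test statistics and the analytical error-rate result of Chow and Huang (2019) may differ.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2019)

def two_stage_trial(theta, theta_L, theta_U, n1, n2, alpha1, alpha2, sigma=1.0):
    """Simulate one trial under the proposed two-stage design (a sketch).
    Stage 1: test not ineffectiveness, H0: theta <= theta_L, at level alpha1.
    Stage 2: with n1 + n2 subjects per arm, test effectiveness,
    H0: theta <= theta_U, at level alpha2."""
    se1 = sigma * np.sqrt(2 / n1)
    est1 = rng.normal(theta, se1)                 # Stage 1 estimate of theta
    if (est1 - theta_L) / se1 <= norm.ppf(1 - alpha1):
        return "futile"                           # ineffectiveness not rejected
    se2 = sigma * np.sqrt(2 / (n1 + n2))
    est2 = rng.normal(theta, se2)                 # simplified cumulative estimate
    if (est2 - theta_U) / se2 > norm.ppf(1 - alpha2):
        return "effective"
    return "inconclusive"

# Empirical probability of claiming effectiveness when theta sits on the
# ineffectiveness boundary (theta = theta_L):
runs = [two_stage_trial(0.0, 0.0, 0.2, 20, 20, 0.20, 0.05) for _ in range(100_000)]
print(runs.count("effective") / len(runs))
```

Re-running such a simulation over a grid of α1 values shows the trade-off between the Stage 1 futility hurdle and the evidence required at Stage 2.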

18.6.3 Probability Monitoring Procedure for Sample Size

In rare disease clinical trials, power calculation for the required sample size may not be feasible due to the limited number of subjects available, especially when the anticipated treatment effect is relatively small. In this case, alternative methods such as precision analysis (or the confidence interval approach), reproducibility analysis, the probability monitoring approach, or the Bayesian approach may be considered for providing substantial evidence with certain statistical assurance (Chow et al., 2017). It should be noted, however, that the resultant sample sizes from these different analyses could be very different, with different levels of statistical assurance achieved. Thus, for rare disease clinical trials, it is suggested that an appropriate sample size be selected to achieve certain statistical assurance under a valid trial design. As an example, an appropriate sample size may be selected based on a probability monitoring approach such that the probability of crossing a safety boundary is controlled at a pre-specified level of significance. Suppose an investigator plans to monitor the safety of a rare disease clinical trial sequentially at several times, tk, k = 1, …, K. Let nk and Pk be the sample size and the probability of observing an event at time tk. Then an appropriate sample size can be selected such that the probability of crossing the safety stopping boundary satisfies

pk = P{crossing the safety stopping boundary | nk, Pk} < α, k = 1, …, K. (18.23)
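As a minimal sketch of how (18.23) can drive sample size selection, suppose the event count at each look is binomial and the safety boundary is crossed when at least c events are observed; both the binomial model and the fixed event-count boundary are illustrative assumptions, not part of the text.

```python
from scipy.stats import binom

def monitored_sample_sizes(event_probs, c, alpha):
    """For each safety look with event probability P_k, return the largest
    sample size n_k such that the probability of crossing the boundary
    (observing >= c events) stays below alpha, as in (18.23)."""
    sizes = []
    for p in event_probs:
        n = 1
        # binom.sf(c - 1, n, p) = P(X >= c) for n subjects at this look
        while binom.sf(c - 1, n + 1, p) < alpha:
            n += 1
        sizes.append(n)
    return sizes

# Example: three safety looks with anticipated event probabilities
print(monitored_sample_sizes([0.05, 0.08, 0.10], c=4, alpha=0.05))
```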

In practice, it is suggested that the statistical methods used for data analysis be consistent with those used for sample size estimation, to ensure the scientific validity of the intended clinical trial. The concepts of power analysis, precision analysis, reproducibility analysis, the probability monitoring approach, and the Bayesian approach should not be mixed up, and the statistical methods for data analysis should reflect the desired statistical assurance under the trial design.

18.7 Concluding Remarks

As discussed, for rare disease drug development, power analysis for sample size calculation may not be feasible because of the small patient population. The FDA draft guidance emphasizes that the same standards for regulatory approval will be applied to rare disease drug development despite the small patient population, so there is often insufficient power for a rare disease clinical investigation. In this case, it is suggested that sample size calculation or justification be performed based on precision analysis, reproducibility analysis, or the probability monitoring approach for achieving certain statistical assurance. In practice, having to meet the same standard with fewer subjects is the dilemma of rare disease drug development. It is therefore suggested that innovative designs and statistical methods be considered and implemented for obtaining substantial evidence regarding effectiveness and safety in support of regulatory approval of rare disease drug products.

In this chapter, several innovative trial designs, such as the complete n-of-1 trial design, the adaptive seamless trial design, trial designs utilizing the concept of master protocols, and the Bayesian trial design, were introduced, and the corresponding statistical methods and sample size requirements under the respective study designs were derived. These study designs are useful in speeding up the rare disease development process and in identifying any signal, pattern, or trend, and/or the optimal clinical benefits of the rare disease drug products under investigation.

Due to the small patient population in rare disease clinical development, the concept of generalizability probability can be used to determine whether the clinical results can be generalized from the targeted patient population (e.g., adults) to a different but similar patient population (e.g., pediatrics or elderly) with the same rare disease. In practice, the generalizability probability can be evaluated through the assessment of the sensitivity index between the targeted patient population and the different patient population (Lu et al., 2017). The degree of generalizability probability can then be used to judge whether the intended trial has provided substantial evidence regarding effectiveness and safety for the different patient population (e.g., pediatrics or elderly).

In practice, although an innovative and yet complex trial design may be useful in rare disease drug development, it may introduce operational bias to the trial and consequently increase the probability of making errors. It is therefore suggested that the quality, validity, and integrity of the intended trial utilizing an innovative trial design be maintained.

Bibliography

Afonja, B. (1972). The moments of the maximum of correlated normal and t-variates. Journal of the Royal Statistical Society, Series B, 34, 251–262.
Agin, M.A., Aronstein, W.S., Ferber, G., Geraldes, M.C., Locke, C., and Sager, P. (2008). QT/QTc prolongation in placebo-treated subjects: A PhRMA collaborative data analysis. Journal of Biopharmaceutical Statistics, 18, 408–426.
Agresti, A. and Min, Y. (2005). Simple improved confidence intervals for comparing matched proportions. Statistics in Medicine, 24(5), 729–740.
Akaike, H. (1974). A new look at statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Alosh, M. (2009). The impact of missing data in a generalized integer-valued autoregression model for count data. Journal of Biopharmaceutical Statistics, 19, 1039–1054.
Anderson, S. and Hauck, W.W. (1990). Consideration of individual bioequivalence. Journal of Pharmacokinetics and Biopharmaceutics, 18, 259–273.
Atkinson, A.C. and Donev, A.N. (1992). Optimum Experimental Designs, Oxford University Press, New York.
Austin, P.C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46, 399–424.
Babb, J., Rogatko, A., and Zacks, S. (1998). Cancer phase I clinical trials: Efficient dose escalation with overdose control. Statistics in Medicine, 17, 1103–1120.
Babb, J.S. and Rogatko, A. (2004). Bayesian methods for cancer phase I clinical trials. In Advances in Clinical Trial Biostatistics, Geller, N.L. (Ed.), Marcel Dekker, New York.
Bailar, J.C. (1992). Some use of statistical thinking. In Medical Use of Statistics, Bailar, J.C. and Mosteller, F. (Eds.), New England Journal of Medicine Books, Boston, MA, pp. 5–26.
Barnes, P.J., Pocock, S.J., Magnussen, H., Iqbal, A., Kramer, B., Higgins, M., and Lawrence, D. (2010). Integrating indacaterol dose selection in a clinical study in COPD using an adaptive seamless design. Pulmonary Pharmacology & Therapeutics, 23(3), 165–171.
Barrentine, L.B. (1991). Concepts for R&R Studies, ASQC Quality Press, Milwaukee, WI.
Barry, M.J., Fowler, F.J. Jr., O'Leary, M.P., Bruskewitz, R.C., Holtgrewe, H.L., Mebust, W.K., and Cockett, A.T. (1992). The American Urological Association Symptom Index for benign prostatic hyperplasia. Journal of Urology, 148, 1549–1557.
Basford, K.E., Greenway, D.R., McLachlan, G.J., and Peel, D. (1997). Standard errors of fitted component means of normal mixtures. Computational Statistics, 12, 1–17.
Bauer, P. and Kieser, M. (1999). Combining different phases in the development of medical treatments within a single trial. Statistics in Medicine, 18, 1833–1848.
Bauer, P. and Kohne, K. (1994). Evaluation of experiments with adaptive interim analysis. Biometrics, 50, 1029–1041.
Bauer, P. and Rohmel, J. (1995). An adaptive method for establishing a dose-response relationship. Statistics in Medicine, 14, 1595–1607.


Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289–300.
Bensoussan, A., Talley, N.J., Hing, M., Menzies, R., Guo, A., and Ngu, M. (1998). Treatment of irritable bowel syndrome with Chinese herbal medicine. Journal of the American Medical Association, 280, 1585–1589.
Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed., Springer-Verlag, New York.
Bergner, M., Bobbitt, R.A., Carter, W.B., and Gilson, B.S. (1981). The sickness impact profile: Development and final revision of a health status measure. Medical Care, 19, 787–805.
Berger, R.L. and Hsu, J.C. (1996). Bioequivalence trials, intersection-union tests and equivalence confidence sets (with discussion). Statistical Science, 11, 283–319.
Bergum, J.S. (1988). Constructing acceptance limits for multiple stage USP tests. Proceedings of the Biopharmaceutical Section of the American Statistical Association, pp. 197–201.
BHAT (1983). Beta-Blocker Heart Attack Trial Research Group. A randomized trial of propranolol in patients with acute myocardial infarction: Morbidity. JAMA, 250, 2814–2819.
Bhatt, D.L. and Mehta, C. (2016). Adaptive designs for clinical trials. New England Journal of Medicine, 375(1), 65–74.
Blackwell, D. and Hodges, J.L. Jr. (1957). Design for the control of selection bias. The Annals of Mathematical Statistics, 28, 449–460.
Blair, R.C. and Cole, S.R. (2002). Two-sided equivalence testing of the difference between two means. Journal of Modern Applied Statistical Methods, 1, 139–142.
Bofinger, E. (1985). Expanded confidence intervals. Communications in Statistics, Theory and Methods, 14, 1849–1864.
Bofinger, E. (1992). Expanded confidence intervals, one-sided tests and equivalence testing. Journal of Biopharmaceutical Statistics, 2, 181–188.
Bollier, D. (2010). The Promise and Peril of Big Data, The Aspen Institute, Washington, DC.
Brannath, W., Koening, F., and Bauer, P. (2003). Improved repeated confidence bounds in trials with a maximal goal. Biometrical Journal, 45, 311–324.
Branson, M. and Whitehead, W. (2002). Estimating a treatment effect in survival studies in which patients switch treatment. Statistics in Medicine, 21, 2449–2463.
Breunig, R. (2001). An almost unbiased estimator of the coefficient of variation. Economics Letters, 70, 15–19.
Brookmeyer, R. and Crowley, J. (1982). A confidence interval for the median survival time. Biometrics, 38, 29–41.
Brown, B.W. (1980). The crossover experiment for clinical trials. Biometrics, 36, 69–79.
Brown, L.D., Hwang, J.T.G., and Munk, A. (1997). An unbiased test for the bioequivalence problem. The Annals of Statistics, 25, 2345–2367.
Brownell, K.D. and Stunkard, A.J. (1982). The double-blind in danger: Untoward consequences of informed consent. The American Journal of Psychiatry, 139, 1487–1489.
Canales, R.D., Luo, Y., Willey, J.C., Austermiller, B., Barbacioru, C.C., Boysen, C., Hunkapiller, K. et al. (2006). Evaluation of DNA microarray results with quantitative gene expression platforms. Nature Biotechnology, 24, 1115–1122.
Caraco, Y. (2004). Genes and the response to drugs. The New England Journal of Medicine, 351, 2867–2869.


Cardoso, F., Piccart-Gebhart, M., Van't Veer, L., and Rutgers, E., on behalf of the TRANSBIG consortium (2007). The MINDACT trial: The first prospective clinical validation of a genomic tool. Molecular Oncology, 1, 246–251.
Carrasco, J.L. and Jover, L. (2003). Assessing individual bioequivalence using structural equation model. Statistics in Medicine, 22, 901–912.
Casciano, D.A. and Woodcock, J. (2006). Empowering microarrays in the regulatory setting. Nature Biotechnology, 24, 1103.
Casella, G. and Berger, R.L. (2002). Statistical Inference, 2nd ed., Duxbury Advanced Series, Duxbury, Pacific Grove, CA.
CAST (1989). Cardiac Arrhythmia Suppression Trial. Preliminary report: Effect of encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. The New England Journal of Medicine, 321, 406–412.
CBER/FDA (1999). CBER/FDA Memorandum. Summary of CBER considerations on selected aspects of active controlled trial design and analysis for the evaluation of thrombolytics in acute MI, June 1999.
Chang, M. (2005a). Bayesian adaptive design with biomarkers. Invited presentation at the IBC Second Annual Conference: Implementing Adaptive Designs for Drug Development, November 7–8, 2005, Princeton, NJ.
Chang, M. (2005b). Adaptive clinical trial design. Presented at the International Conference for Stochastic Process and Data Analysis, Brest, France, May 2005.
Chang, M. (2007). Adaptive design method based on sum of p-values. Statistics in Medicine, 26, 2772–2784.
Chang, M. (2008). Adaptive Design Theory and Implementation Using SAS and R, Chapman and Hall/CRC Press, Taylor & Francis Group, New York.
Chang, M. and Chow, S.C. (2005). A hybrid Bayesian adaptive design for dose response trials. Journal of Biopharmaceutical Statistics, 15, 677–691.
Chang, M. and Chow, S.C. (2006). Power and sample size for dose response studies. In Dose Finding in Drug Development, Ting, N. (Ed.), Springer, New York.
Chakravarty, A. (2005). Regulatory aspects in using surrogate markers in clinical trials. In The Evaluation of Surrogate Endpoints, Burzykowski, T., Molenberghs, G., and Buyse, M. (Eds.), Springer, New York.
Chen, J. and Chen, C. (2003). Microarray gene expression. In Encyclopedia of Biopharmaceutical Statistics, Chow, S.C. (Ed.), Marcel Dekker, New York, pp. 599–613.
Chen, M.L., Patnaik, R., Hauck, W.W., Schuirmann, D.F., Hyslop, T., and Williams, R. (2000). An individual bioequivalence criterion: Regulatory considerations. Statistics in Medicine, 19, 2821–2842.
Chen, X., Luo, X., and Capizzi, T. (2005). The application of enhanced parallel gatekeeping strategies. Statistics in Medicine, 24, 1385–1397.
Chen, Y.J., Gesser, R., and Luxembourg, A. (2015). A seamless phase IIB/III adaptive outcome trial: Design rationale and implementation challenges. Clinical Trials, 12(1), 84–90.
Cheng, B., Chow, S.C., Burt, D., and Cosmatos, D. (2008). Statistical assessment of QT/QTc prolongation based on maximum of correlated normal random variables. Journal of Biopharmaceutical Statistics, 18, 494–501.
Cheng, B. and Shao, J. (2007). Exact tests for negligible interaction in two-way linear models. Statistica Sinica, 17, 1441–1455.


Cheng, B., Zhang, B., and Chow, S.C. (2019). Unified approaches to assessing treatment effect of traditional Chinese medicine based on health profiles. Journal of Biopharmaceutical Statistics, to appear.
Chirino, A.J. and Mire-Sluis, A. (2004). Characterizing biological products and assessing comparability following manufacturing changes. Nature Biotechnology, 22, 1383–1391.
Chung, W.H., Hung, S.I., Hong, H.S., Hsih, M.S., Yang, L.C., Ho, H.C., Wu, J.Y., and Chen, Y.T. (2004). Medical genetics: A marker for Stevens-Johnson syndrome. Nature, 428(6982), 486.
Chow, S.C. (1997). Good statistics practice in the drug development and regulatory approval process. Drug Information Journal, 31, 1157–1166.
Chow, S.C. (1999). Individual bioequivalence—a review of FDA draft guidance. Drug Information Journal, 33, 435–444.
Chow, S.C. (2007). Statistics in translational medicine. Presented at Current Advances in Evaluation of Research & Development of Translational Medicine, National Health Research Institutes, Taipei, Taiwan, October 19, 2007.
Chow, S.C. (2010). Generalizability probability of clinical results. In Encyclopedia of Biopharmaceutical Statistics, Chow, S.C. (Ed.), Informa Healthcare, Taylor & Francis Group, London, UK, pp. 534–536.
Chow, S.C. (2011). Controversial Issues in Clinical Trials, Chapman and Hall/CRC Press, Taylor & Francis Group, New York.
Chow, S.C. (2013). Biosimilars: Design and Analysis of Follow-on Biologics, Chapman and Hall/CRC Press, Taylor & Francis Group, New York.
Chow, S.C. (2015). Quantitative Methods for Traditional Chinese Medicine Development, Chapman and Hall/CRC Press, Taylor & Francis Group, New York.
Chow, S.C. (2018). Non-medical switch in biosimilar product development. Enliven: Biosimilars Bioavailability, 2(1), e001.
Chow, S.C. and Chang, M. (2005). Statistical consideration of adaptive methods in clinical development. Journal of Biopharmaceutical Statistics, 15, 575–591.
Chow, S.C. and Chang, M. (2006). Adaptive Design Methods in Clinical Trials, Chapman and Hall/CRC Press, Taylor & Francis Group, New York.
Chow, S.C. and Chang, M. (2008). Adaptive design methods in clinical trials—a review. The Orphanet Journal of Rare Diseases, 3, 1–13.
Chow, S.C. and Chang, M. (2011). Adaptive Design Methods in Clinical Trials, 2nd ed., Chapman and Hall/CRC Press, Taylor & Francis Group, New York.
Chow, S.C., Chang, M., and Pong, A. (2005). Statistical consideration of adaptive methods in clinical development. Journal of Biopharmaceutical Statistics, 15, 575–591.
Chow, S.C., Cheng, B., and Cosmatos, D. (2008). On power and sample size calculation for QT studies with recording replicates at given time point. Journal of Biopharmaceutical Statistics, 18, 483–493.
Chow, S.C. and Corey, R. (2011). Benefits, challenges and obstacles of adaptive designs in clinical trials. The Orphanet Journal of Rare Diseases, 6, 79. doi:10.1186/1750-1172-6-79.
Chow, S.C., Corey, R., and Lin, M. (2012). On independence of data monitoring committee in adaptive clinical trial. Journal of Biopharmaceutical Statistics, 22, 853–867.
Chow, S.C., Endrenyi, L., Lachenbruch, P.A., Yang, L.Y., and Chi, E. (2011). Scientific factors for assessing biosimilarity and drug interchangeability of follow-on biologics. Biosimilars, 1, 13–26.


Chow, S.C. and Hsiao, C.F. (2010). Bridging diversity: Extrapolating foreign data to a new region. Pharmaceutical Medicine, 24, 349–362.
Chow, S.C., Hsieh, T.C., Chi, E., and Yang, J. (2010). A comparison of moment-based and probability-based criteria for assessment of follow-on biologics. Journal of Biopharmaceutical Statistics, 20, 31–45.
Chow, S.C. and Huang, Z.P. (2019). Demonstrating effectiveness or demonstrating not ineffectiveness—a potential solution for rare disease drug product development? Journal of Biopharmaceutical Statistics, in press.
Chow, S.C. and Ki, F. (1994). On statistical characteristics of quality of life assessment. Journal of Biopharmaceutical Statistics, 4, 1–17.
Chow, S.C. and Ki, F. (1996). Statistical issues in quality of life assessment. Journal of Biopharmaceutical Statistics, 6, 37–48.
Chow, S.C. and Kong, Y.Y. (2015). On big data analytics in biomedical research. Journal of Biometrics and Biostatistics, 6, 236. doi:10.4172/2155-6180.1000236.
Chow, S.C. and Lin, M. (2015). Analysis of two-stage adaptive seamless trial design. Pharmaceutica Analytica Acta, 6, 3. doi:10.4172/2153-2435.1000341.
Chow, S.C. and Liu, J.P. (1992a). Design and Analysis of Bioavailability and Bioequivalence Studies, Marcel Dekker, New York.
Chow, S.C. and Liu, J.P. (1992b). On assessment of bioequivalence under a higher-order crossover design. Journal of Biopharmaceutical Statistics, 2, 239–256.
Chow, S.C. and Liu, J.P. (1995). Statistical Design and Analysis in Pharmaceutical Science: Validation, Process Control, and Stability, Marcel Dekker, New York.
Chow, S.C. and Liu, J.P. (1997). Meta-analysis for bioequivalence review. Journal of Biopharmaceutical Statistics, 7, 97–111.
Chow, S.C. and Liu, J.P. (1998a). Design and Analysis of Animal Studies in Pharmaceutical Development, Marcel Dekker, New York.
Chow, S.C. and Liu, J.P. (1998b). Design and Analysis of Clinical Trials, John Wiley & Sons, New York.
Chow, S.C. and Liu, J.P. (2000). Design and Analysis of Bioavailability and Bioequivalence Studies, revised and expanded, 2nd ed., Marcel Dekker, New York.
Chow, S.C. and Liu, J.P. (2003). Design and Analysis of Clinical Trials, 2nd ed., John Wiley & Sons, New York.
Chow, S.C. and Liu, J.P. (2008). Design and Analysis of Bioavailability and Bioequivalence Studies, 3rd ed., Chapman and Hall/CRC Press, Taylor & Francis Group, New York.
Chow, S.C. and Liu, J.P. (2010). Statistical assessment of biosimilar products. Journal of Biopharmaceutical Statistics, 20, 10–30.
Chow, S.C. and Liu, J.P. (2013). Design and Analysis of Clinical Trials, revised and expanded, 3rd ed., John Wiley & Sons, New York.
Chow, S.C., Lu, Q., and Tse, S.K. (2007). Statistical analysis for two-stage adaptive design with different study endpoints. Journal of Biopharmaceutical Statistics, 17, 1163–1176.
Chow, S.C., Pong, A., and Chang, Y.W. (2006). On traditional Chinese medicine clinical trials. Drug Information Journal, 40, 395–406.
Chow, S.C. and Shao, J. (1997). Statistical methods for two-sequence dual crossover designs with incomplete data. Statistics in Medicine, 16, 1031–1039.
Chow, S.C. and Shao, J. (2002a). Statistics in Drug Research, Marcel Dekker, New York.
Chow, S.C. and Shao, J. (2002b). A note on statistical methods for assessing therapeutic equivalence. Controlled Clinical Trials, 23, 515–520.


Chow, S.C. and Shao, J. (2004). Analysis of clinical data with breached blindness. Statistics in Medicine, 23, 1185–1193.
Chow, S.C. and Shao, J. (2005). Inference for clinical trials with some protocol amendments. Journal of Biopharmaceutical Statistics, 15, 659–666.
Chow, S.C. and Shao, J. (2006). On non-inferiority margin and statistical tests in active control trials. Statistics in Medicine, 25, 1101–1113.
Chow, S.C. and Shao, J. (2007). Stability analysis for drugs with multiple ingredients. Statistics in Medicine, 26, 1512–1517.
Chow, S.C., Shao, J., and Hu, Y.P. (2002). Assessing sensitivity and similarity in bridging studies. Journal of Biopharmaceutical Statistics, 12, 385–400.
Chow, S.C., Shao, J., and Li, L. (2004). Assessing bioequivalence using genomic data. Journal of Biopharmaceutical Statistics, 14, 869–880.
Chow, S.C., Shao, J., and Wang, H. (2002a). A note on sample size calculation for mean comparisons based on non-central t-statistics. Journal of Biopharmaceutical Statistics, 12, 441–456.
Chow, S.C., Shao, J., and Wang, H. (2002b). Individual bioequivalence testing under 2 × 3 crossover designs. Statistics in Medicine, 21, 629–648.
Chow, S.C., Shao, J., and Wang, H. (2003). Statistical tests for population bioequivalence. Statistica Sinica, 13, 539–554.
Chow, S.C., Shao, J., and Wang, H. (2008). Sample Size Calculation in Clinical Research, Chapman and Hall/CRC Press, Taylor & Francis Group, New York.
Chow, S.C., Shao, J., Wang, H., and Lokhnygina, Y. (2017). Sample Size Calculations in Clinical Research, 3rd ed., Taylor & Francis Group, New York.
Chow, S.C. and Tse, S.K. (1991). On the estimation of total variability in assay validation. Statistics in Medicine, 10, 1543–1553.
Chow, S.C. and Tu, Y.H. (2009). On two-stage seamless adaptive design in clinical trials. Journal of Formosan Medical Association, 107(12), S51–S59.
Chow, S.C. and Wang, H. (2001). On sample size calculation in bioequivalence trials. Journal of Pharmacokinetics and Pharmacodynamics, 28, 155–169.
Chow, S.C., Song, F.Y., and Bai, H. (2016). Analytical similarity assessment in biosimilar studies. AAPS Journal, 18(3), 670–677.
Chow, S.C., Xu, H., Endrenyi, L., and Song, F.Y. (2015). A new scaled criterion for drug interchangeability. Chinese Journal of Pharmaceutical Analysis, 35(5), 844–848.
Christensen, R. (1996). Exact tests for variance components. Biometrics, 52, 309–314.
Chuang, C. (1987). The analysis of a titration study. Statistics in Medicine, 6, 583–590.
Chuang-Stein, C. (1996). Summarizing laboratory data with different reference ranges in multi-center clinical trials. Drug Information Journal, 26, 77–84.
Chuang-Stein, C., Anderson, K., Gallo, P., and Collins, S. (2006). Sample size re-estimation: A review and recommendations. Drug Information Journal, 40, 475–484.
Church, J.D. and Harris, B. (1970). The estimation of reliability from stress-strength relationships. Technometrics, 12, 49–54.
Cochran, W.G. (1977). Sampling Techniques, 3rd ed., Wiley, New York.
Cochran, W.G. and Cox, G.M. (1957). Experimental Designs, 2nd ed., Wiley, New York.
Coors, M., Bauer, L., Edwards, K., Erickson, K., Goldenberg, A., Goodale, J., Goodman, K. et al. (2017). Ethical issues related to clinical research and rare diseases. Translational Science of Rare Diseases, 2, 175–194.
Cosmatos, D. and Chow, S.C. (2008). Translational Medicine, Chapman and Hall/CRC Press, Taylor & Francis Group, New York.


CPMP (1990). The Committee for Proprietary Medicinal Products Working Party on Efficacy of Medicinal Products. Note for guidance: Good clinical practice for trials on medicinal products in the European Community. Commission of European Communities, Brussels, Belgium, 111/396/88-EN Final.
CPMP (1997). Points to consider: The assessment of the potential for QT interval prolongation by non-cardiovascular products. Available at: www.coresearch.biz/regulations/cpmp.pdf.
Crommelin, D., Bermejo, T., Bissig, M., Damianns, J., Kramer, I., Rambourg, P., Scroccaro, G., Strukelj, B., Tredree, R., and Ronco, C. (2005). Biosimilars, generic versions of the first generation of therapeutic proteins: Do they exist? Cardiovascular Disorders in Hemodialysis, 149, 287–294.
Crowley, J. (2001). Handbook of Statistics in Clinical Oncology, Marcel Dekker, New York.
CTriSoft Intl. (2002). Clinical Trial Design with ExpDesign Studio, www.ctrisoft.net, CTriSoft International, Lexington, MA.
Cui, C. and Chow, S.C. (2018). Clinical trial: n-of-1 design analysis. In Encyclopedia of Biopharmaceutical Statistics, 4th ed., Chow, S.C. (Ed.), CRC Press, Taylor & Francis Group, New York, pp. 564–571.
Cui, L., Hung, H.M.J., and Wang, S.J. (1999). Modification of sample size in group sequential trials. Biometrics, 55, 853–857.
Dalton, W.S. and Friend, S.H. (2006). Cancer biomarkers—an invitation to the table. Science, 312, 1165–1168.
D'Agostino, R.B., Massaro, J.M., and Sullivan, L.M. (2003). Non-inferiority trials: Design concepts and issues—the encounters of academic consultants in statistics. Statistics in Medicine, 22, 169–186.
Davison, A.C. (2003). Statistical Models, Cambridge University Press, New York, pp. 33–35.
DerSimonian, R. and Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7, 177–188.
DeMets, D.L., Furberg, C.D., and Friedman, L.M. (2006). Data Monitoring in Clinical Trials: A Case Studies Approach, Springer, New York.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Dent, S.F. and Eisenhauer, E.A. (1996). Phase I trial design: Are new methodologies being put into practice? Annals of Oncology, 7, 561–566.
DeSouza, C.M., Legedza, T.R., and Sankoh, A.J. (2009). An overview of practical approaches for handling missing data in clinical trials. Journal of Biopharmaceutical Statistics, 19, 1055–1073.
Deuflhard, P. (2004). Newton Methods for Nonlinear Problems: Affine Invariance and Adaptive Algorithms, Springer Series in Computational Mathematics, Vol. 35, Springer, Berlin, Germany.
Di Scala, L. and Glimm, E. (2011). Time-to-event analysis with treatment arm selection at interim. Statistics in Medicine, 30, 3067–3081.
Diggle, P. and Kenward, M.G. (1994). Informative dropout in longitudinal data analysis (with discussion). Applied Statistics, 43, 49–94.
Dixon, D.O., Freedman, R.S., Herson, J., Hughes, M., Kim, K., Silerman, M.H., and Tangen, C.M. (2006). Guidelines for data and safety monitoring for clinical trials not requiring traditional data monitoring committees. Clinical Trials, 3, 314–319.


Dmitrienko, A., Molenberghs, G., Chuang-Stein, C., and Offen, W. (2005). Analysis of Clinical Trials Using SAS: A Practical Guide, SAS Press, Cary, NC.
Dmitrienko, A., Offen, W., Wang, O., and Xiao, D. (2006). Gatekeeping procedures in dose-response clinical trials based on the Dunnett test. Pharmaceutical Statistics, 5, 19–28.
Dmitrienko, A., Offen, W., and Westfall, P.H. (2003). Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Statistics in Medicine, 22, 2387–2400.
Dobbin, K.K., Beer, D.G., Meyerson, M., Yeatman, T.J., Gerald, W.L., Jacobson, J.W., Conley, B. et al. (2005). Interlaboratory comparability study of cancer gene expression analysis using oligonucleotide microarrays. Clinical Cancer Research, 11, 565–573.
DOH (2004a). Draft Guidance for IND of Traditional Chinese Medicine. The Department of Health, Taipei, Taiwan.
DOH (2004b). Draft Guidance for NDA of Traditional Chinese Medicine. The Department of Health, Taipei, Taiwan.
Dubey, S.D. (1991). Some thoughts on the one-sided and two-sided tests. Journal of Biopharmaceutical Statistics, 1, 139–150.
Dudoit, S., Yang, Y.H., Callow, M.J., and Speed, T.P. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12, 111–139.
Dunnett, C.W. (1955). Multivariate normal probability integrals with product correlation structure, Algorithm AS251. Journal of the American Statistical Association, 50, 1096–1121.
Eaton, M.L., Muirhead, R.J., Mancuso, J.Y., and Kolluri, S. (2006). A confidence interval for the maximal mean QT interval change caused by drug effect. Drug Information Journal, 40, 267–271.
Efron, B. (1971). Forcing a sequential experiment to be balanced. Biometrika, 58, 403–417.
Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association, 78, 316–331.
Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81, 461–470.
Efron, B. and Tibshirani, R.J. (1993). An Introduction to the Bootstrap, Chapman and Hall, New York.
Eisenhauer, E.A., O'Dwyer, P.J., Christian, M., and Humphrey, J.S. (2000). Phase I clinical trial design in cancer drug development. Journal of Clinical Oncology, 18, 684–692.
Ellenberg, J.H. (1990). Biostatistical collaboration in medical research. Biometrics, 46, 1–32.
Ellenberg, S.S., Fleming, T.R., and DeMets, D.L. (2002). Data Monitoring Committees in Clinical Trials: A Practical Perspective, John Wiley & Sons, New York.
EMEA (2001). Note for guidance on the investigation of bioavailability and bioequivalence. The European Medicines Agency Evaluation of Medicines for Human Use. EMEA/EWP/QWP/1401/98, London, UK.


EMEA (2002). Points to consider on methodological issues in confirmatory clinical trials with flexible design and analysis plan. The European Agency for the Evaluation of Medicinal Products Evaluation of Medicines for Human Use. CPMP/EWP/2459/02, London, UK.
EMEA (2003a). Note for guidance on comparability of medicinal products containing biotechnology-derived proteins as drug substance: Non-clinical and clinical issues. The European Medicines Agency Evaluation of Medicines for Human Use. EMEA/CHMP/3097/02, London, UK.
EMEA (2003b). Rev. 1 Guideline on comparability of medicinal products containing biotechnology-derived proteins as drug substance: Quality issues. The European Medicines Agency Evaluation of Medicines for Human Use. EMEA/CHMP/BWP/3207/00/Rev 1, London, UK.
EMEA (2005a). Guideline on similar biological medicinal products. The European Medicines Agency Evaluation of Medicines for Human Use. EMEA/CHMP/437/04, London, UK.
EMEA (2005b). Draft guideline on similar biological medicinal products containing biotechnology-derived proteins as drug substance: Quality issues. The European Medicines Agency Evaluation of Medicines for Human Use. EMEA/CHMP/49348/05, London, UK.
EMEA (2005c). Draft annex guideline on similar biological medicinal products containing biotechnology-derived proteins as drug substance: Non-clinical and clinical issues—Guidance on biosimilar medicinal products containing recombinant erythropoietins. The European Medicines Agency Evaluation of Medicines for Human Use. EMEA/CHMP/94526/05, London, UK.
EMEA (2005d). Draft annex guideline on similar biological medicinal products containing biotechnology-derived proteins as drug substance: Non-clinical and clinical issues—Guidance on biosimilar medicinal products containing recombinant granulocyte-colony stimulating factor. The European Medicines Agency Evaluation of Medicines for Human Use. EMEA/CHMP/31329/05, London, UK.
EMEA (2005e). Draft annex guideline on similar biological medicinal products containing biotechnology-derived proteins as drug substance: Non-clinical and clinical issues—Guidance on biosimilar medicinal products containing somatropin. The European Medicines Agency Evaluation of Medicines for Human Use. EMEA/CHMP/94528/05, London, UK.
EMEA (2005f). Draft annex guideline on similar biological medicinal products containing biotechnology-derived proteins as drug substance: Non-clinical and clinical issues—Guidance on biosimilar medicinal products containing recombinant human insulin. The European Medicines Agency Evaluation of Medicines for Human Use. EMEA/CHMP/32775/05, London, UK.
EMEA (2005g). Guideline on the clinical investigation of the pharmacokinetics of therapeutic proteins. The European Medicines Agency Evaluation of Medicines for Human Use. EMEA/CHMP/89249/04, London, UK.
EMEA (2006). Reflection paper on methodological issues in confirmatory clinical trials with flexible design and analysis plan. The European Agency for the Evaluation of Medicinal Products Evaluation of Medicines for Human Use. CPMP/EWP/2459/02, London, UK.


EMEA (2007). Reflection paper on methodological issues in confirmatory clinical trials planned with an adaptive design. EMEA Doc. Ref. CHMP/EWP/2459/02, October 20. Available at http://www.emea.europa.eu/pdfs/human/ewp/245902enadopted.pdf.
Emerson, J.D. (1982). Nonparametric confidence intervals for the median in the presence of right censoring. Biometrics, 38, 17–27.
Endrenyi, L., Declerck, P., and Chow, S.C. (2017). Biosimilar Drug Product Development, CRC Press, Taylor & Francis Group, New York.
Enis, P. and Geisser, S. (1971). Estimation of the probability that Y